- Updated Call for Participation (2020/4/27)
SHINRA is a resource creation project, started in 2017, that aims to structure the knowledge in Wikipedia.
SHINRA2020-ML is the first text categorization shared task in the SHINRA project. It tackles the problem of classifying Wikipedia entities in 30 languages into fine-grained categories. The task is conducted as one of the NTCIR-15 tasks.
[Video] (approx.11 min):
Introduction of SHINRA2020-ML task (categorization of 30-language Wikipedia into ENE)
The task is to classify Wikipedia entities in 30 languages into the 219 categories defined in Extended Named Entity (ENE) (ver.8.0), a four-layer ontology for names, time, and numbers, using categorized Japanese Wikipedia pages and the interlanguage links to the corresponding pages in the target languages.
Participants are expected to select one or more target languages. For each language, they use the Wikipedia pages linked from the categorized Japanese pages as training data and run their system to classify the remaining pages, which are not linked from the Japanese pages.
We will provide the training data for the 30 languages, created from the 920K categorized Japanese Wikipedia pages and the Wikipedia language links for the 30 languages. For example, out of 2,263K German Wikipedia pages, 275K pages have a language link from Japanese Wikipedia; these can serve as slightly noisy training data for German. The task is then “to classify the remaining 1,988K pages into 219 categories, based on the 275K categorized pages.” The same holds for the other 29 languages, as shown in Data statistics.
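The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organizers' tooling: the function name `split_by_language_links`, the in-memory dictionaries, the page ids, and the labels `ENE_A` / `ENE_B` are all hypothetical placeholders, and each page is assumed to carry a single label.

```python
from typing import Dict, Set, Tuple


def split_by_language_links(
    target_page_ids: Set[str],
    ja_categories: Dict[str, str],
    language_links: Dict[str, str],
) -> Tuple[Dict[str, str], Set[str]]:
    """Split one target-language Wikipedia into training and unlabeled pages.

    All three inputs are hypothetical in-memory structures:
      target_page_ids: ids of every page in the target-language dump
      ja_categories:   Japanese page id -> ENE category label
      language_links:  Japanese page id -> linked target-language page id
    """
    training: Dict[str, str] = {}
    for ja_id, target_id in language_links.items():
        # A target page inherits its label from the categorized Japanese page.
        if ja_id in ja_categories and target_id in target_page_ids:
            training[target_id] = ja_categories[ja_id]
    # Pages without a link from a categorized Japanese page must be classified.
    to_classify = target_page_ids - set(training)
    return training, to_classify


# Toy data; the ids and the labels "ENE_A" / "ENE_B" are placeholders.
target_pages = {"de:1", "de:2", "de:3", "de:4"}
ja_labels = {"ja:10": "ENE_A", "ja:11": "ENE_B"}
links = {"ja:10": "de:1", "ja:11": "de:3", "ja:12": "de:4"}  # ja:12 is uncategorized

training, to_classify = split_by_language_links(target_pages, ja_labels, links)
print(training)
print(sorted(to_classify))

# The German figures from the text, in thousands of pages.
remaining_german_k = 2263 - 275
print(remaining_german_k)
```

In the toy data, `de:4` is linked from a Japanese page that has no category, so it ends up among the pages to classify, just like the unlinked `de:2`.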
- The target data for each language is provided as a Wikipedia dump. The participants are requested to submit the outputs for the entire target data.
- The submitted data will be made open to the public so that people can try ensemble learning methods to create a resource of categories for Wikipedia pages in 30 languages.
The 30 target languages are: English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian.
If you are interested in participating:
Wikipedia consists of a large volume of entities (a.k.a. articles), which makes it a great resource of knowledge for many NLP tasks. To maximize the use of such knowledge, resources created from Wikipedia need to be structured for inference, reasoning, and other purposes in NLP applications. The current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, among others, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the knowledge base. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create cleaner and more valuable knowledge bases. Instead of the existing, cumbersome Wikipedia categories, we should rely on well-defined and fine-grained categories. Among the few definitions of fine-grained named entities, Extended Named Entity (ENE) is a well-defined name ontology, which has about 200 hierarchical categories, with a set of attributes defined for each category.
The final goal of the SHINRA project is to structure the knowledge in Wikipedia, including the attributes. As a first step, we need to classify each Wikipedia entry into one of the ENE categories before extracting attribute values. The task of SHINRA2020-ML is to classify Wikipedia pages in 30 languages into ENE (ver.8.0) categories. We have already classified the major Japanese Wikipedia pages (920K pages) into ENE (ver.8.0) categories. We can use language links to create the training data in the 30 languages (e.g., about 275K pages in German). The task is therefore to categorize the remaining pages in the 30 languages using the training data.
The goal of this project is not only to compare the participating systems and see which one performs best, but also to create a knowledge base from the outputs of the participating systems. We can use state-of-the-art “ensemble learning technologies” to combine the strengths of the systems and make the KB as accurate as possible.
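As one simple illustration of how outputs from several systems might be combined, the sketch below applies plain majority voting. The text names only “ensemble learning technologies” in general, so majority voting is a swapped-in example and not the project's chosen method. The function name `majority_vote`, the page ids, and the labels are hypothetical, and each system is assumed to output a single label per page.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(system_outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-page labels from several systems by majority vote.

    Each element of system_outputs maps a page id to one predicted label.
    On a tie, the label that was seen first (from the earliest system in
    the list) wins, because Counter.most_common keeps insertion order
    among equal counts.
    """
    votes: Dict[str, Counter] = {}
    for output in system_outputs:
        for page_id, label in output.items():
            votes.setdefault(page_id, Counter())[label] += 1
    return {
        page_id: counter.most_common(1)[0][0]
        for page_id, counter in votes.items()
    }


# Toy outputs from three hypothetical systems; labels are placeholders.
system_a = {"p1": "ENE_A", "p2": "ENE_B"}
system_b = {"p1": "ENE_A", "p2": "ENE_C"}
system_c = {"p1": "ENE_B", "p2": "ENE_C"}

combined = majority_vote([system_a, system_b, system_c])
print(combined)
```

A weighted vote, for example one that weights each system by its accuracy on held-out data, would be a natural next step, but it needs evaluation scores that this sketch does not assume.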
Statistics of Wikipedias in 31 languages
| Language | Num. of pages | Links from ja Wikipedia with category | Percentage |
Slides of Task Description