SHINRA2020-ML is a shared task on categorizing Wikipedia entities in 30 languages; at the same time, it is a resource creation project.
The task is to categorize Wikipedia entities in 30 languages into the 219 categories of a named entity ontology called Extended Named Entity (ENE). We will provide training data for the 30 languages, created from 920K hand-categorized Japanese Wikipedia pages and the Wikipedia language links to each of the 30 languages. For example, of the 2,263K German Wikipedia pages, 275K have a language link from Japanese Wikipedia, and these pages serve as the potentially slightly noisy training data for German. The task for German is therefore: based on the 275K categorized pages, categorize the remaining 1,988K pages into the 219 categories. Training data is available in the same way for the other 29 languages, as shown on the Data statistics page.

The test data is hidden from the participants, who are required to submit outputs for all remaining data. Participants are not required to take part in all 30 languages. The submitted outputs will be made public so that people can apply ensemble learning to them and create a resource of categories for Wikipedia pages in 30 languages.

The 30 target languages are: English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian.
If you are interested in participating:
Wikipedia consists of a large volume of entities (a.k.a. articles) and is a great resource of knowledge for many NLP tasks. To maximize the use of this knowledge, resources created from Wikipedia need to be structured for inference, reasoning, and other purposes in NLP applications. Current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the knowledge base. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create cleaner and more valuable knowledge bases. Instead of the existing, cumbersome Wikipedia categories, we should rely on well-defined and fine-grained categories. Among the few definitions of fine-grained named entities, Extended Named Entity (ENE) is a well-defined name ontology with about 200 hierarchical categories and a set of attributes defined for each category.
The final goal is to structure the knowledge in Wikipedia, including the attributes, but as a first step we need to categorize each Wikipedia entry into one of the ENE categories before extracting attribute values. The task of SHINRA2020-ML is to categorize Wikipedia pages in 30 languages into ENE categories. We have already categorized the major Japanese Wikipedia pages, 920K in total, into ENE categories. We can use language links to create training data in the 30 languages (e.g. 320K in German; see the Data page). The task is then to categorize the remaining pages in the 30 languages using this training data.
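The use of language links described above can be sketched as follows. This is a minimal illustration, not the organizers' actual pipeline: the function name `project_labels`, the dictionary-based data layout, and the toy titles and category names are all assumptions made for this example.

```python
from typing import Dict, Set, Tuple


def project_labels(
    ja_labels: Dict[str, str],
    lang_links: Dict[str, str],
    target_pages: Set[str],
) -> Tuple[Dict[str, str], Set[str]]:
    """Project ENE categories from Japanese pages onto a target language.

    ja_labels:    Japanese page title -> ENE category (hand-categorized)
    lang_links:   Japanese page title -> linked page title in the target language
    target_pages: all page titles in the target-language Wikipedia

    Returns (training data, pages that remain to be categorized).
    """
    train: Dict[str, str] = {}
    for ja_title, category in ja_labels.items():
        target_title = lang_links.get(ja_title)
        # Keep only links that point at a page that actually exists.
        if target_title is not None and target_title in target_pages:
            # If several Japanese pages link to the same target page, the
            # last one wins; this is one possible source of label noise.
            train[target_title] = category
    unlabeled = target_pages - set(train)
    return train, unlabeled


# Toy data (hypothetical titles and category names, for illustration only).
ja_labels = {"東京": "City", "富士山": "Mountain", "夏目漱石": "Person"}
lang_links = {"東京": "Tokio", "富士山": "Fuji (Vulkan)"}
target_pages = {"Tokio", "Fuji (Vulkan)", "Berlin", "Rhein"}

train, unlabeled = project_labels(ja_labels, lang_links, target_pages)
print(len(train), "training pages;", len(unlabeled), "pages left to categorize")
```

In this toy setting, two pages receive a projected category and two remain for the participating systems to categorize.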
The goal of this project is not only to compare the participating systems and see which performs best, but also to create a knowledge base from their outputs. We can use state-of-the-art ensemble learning techniques to combine the results of the systems and make the knowledge base as accurate as possible.
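As a rough illustration of how system outputs could be combined, the sketch below uses simple majority voting. The text does not specify which ensemble method will be used, so majority voting is an assumption chosen for its simplicity; the function name `majority_vote`, the page titles, and the category names are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(system_outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-system predictions (page -> category) by majority vote."""
    votes: Dict[str, Counter] = {}
    for output in system_outputs:
        for page, category in output.items():
            votes.setdefault(page, Counter())[category] += 1
    # most_common(1) returns the category with the highest vote count;
    # on a tie, the category that was counted first is returned.
    return {page: counter.most_common(1)[0][0] for page, counter in votes.items()}


# Three hypothetical systems labelling two pages.
system_a = {"Berlin": "City", "Rhein": "River"}
system_b = {"Berlin": "City", "Rhein": "Lake"}
system_c = {"Berlin": "Country", "Rhein": "River"}

combined = majority_vote([system_a, system_b, system_c])
print(combined)
```

A weighted vote, for example one that trusts each system in proportion to its accuracy on held-out data, would be a natural refinement of the same idea.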
Statistics of Wikipedias in 31 languages
|Language||Num. of pages||Links from ja Wikipedia with category||Percentage|