SHINRA2020-ML: Classification Task

What’s New

Task Description

SHINRA is a resource creation project, started in 2017, that aims to structure the knowledge in Wikipedia.

SHINRA2020-ML is the first shared task on text categorization in the SHINRA project. It tackles the problem of classifying Wikipedia entities in 30 languages into fine-grained categories. The task is conducted as one of the NTCIR-15 tasks.

  [Video] (approx.11 min):
  Introduction of SHINRA2020-ML task (categorization of 30-language Wikipedia into ENE)

The task is to classify Wikipedia entities in 30 languages into the 219 categories defined in Extended Named Entity (ENE) (ver.8.0), a four-layer ontology for names, time, and numbers, using categorized Japanese Wikipedia pages and the interlanguage links to the corresponding pages in the target languages.

Participants are expected to select one or more target languages. For each language, they use the Wikipedia pages linked from the categorized Japanese pages as training data, and run their system to classify the remaining pages, which are not linked from the Japanese pages.

We will provide training data for the 30 languages, created from the 920K categorized Japanese Wikipedia pages and the Wikipedia language links to each of the 30 languages. For example, out of 2,263K German Wikipedia pages, 275K have a language link from Japanese Wikipedia; these pages can serve as slightly noisy training data for German. The task for German is therefore to classify the remaining 1,988K pages into 219 categories, based on the 275K categorized pages. The same holds for the other 29 languages, as shown in Data statistics.

  • The target data for each language is provided as a Wikipedia dump. The participants are requested to submit the outputs for the entire target data.
  • The submitted data will be open to the public so that people can try ensemble learning methods to create a resource of Wikipedia page categories in 30 languages.
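The way training data is derived from language links can be sketched as follows. This is a minimal illustration, not the organizers' tooling: the function name `split_by_language_links`, the page ID format (`"ja:10"`, `"de:1"`), the single-label-per-page simplification, and the toy labels are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def split_by_language_links(
    target_page_ids: Set[str],
    ja_categories: Dict[str, str],
    ja_to_target: Dict[str, str],
) -> Tuple[Dict[str, str], Set[str]]:
    """Split target-language pages into labeled training pages and pages to classify.

    target_page_ids: all page IDs in the target-language Wikipedia dump.
    ja_categories:   categorized Japanese page ID -> ENE category (hypothetical format).
    ja_to_target:    Japanese page ID -> linked target-language page ID.
    """
    training: Dict[str, str] = {}
    for ja_id, ene_label in ja_categories.items():
        target_id = ja_to_target.get(ja_id)
        # Keep the label only if the Japanese page links to a page in the target dump.
        if target_id is not None and target_id in target_page_ids:
            training[target_id] = ene_label
    # Everything without a link from a categorized Japanese page must be classified.
    remaining = target_page_ids - set(training)
    return training, remaining


# Toy example: four German pages, two of them linked from categorized Japanese pages.
target_pages = {"de:1", "de:2", "de:3", "de:4"}
ja_categories = {"ja:10": "Person", "ja:11": "City", "ja:12": "Person"}
ja_to_target = {"ja:10": "de:1", "ja:11": "de:3"}  # ja:12 has no German link

training, remaining = split_by_language_links(target_pages, ja_categories, ja_to_target)
print(training)
print(sorted(remaining))
```

In this sketch the labels are inherited directly from the Japanese page, which is why the resulting training data can be noisy: a language link does not guarantee that both pages describe exactly the same entity.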

The 30 target languages are: English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian.

If you are interested in participating:

Background

Wikipedia consists of a large volume of entities (a.k.a. articles) and is a great resource of knowledge for many NLP tasks. To maximize the use of this knowledge, resources created from Wikipedia need to be structured for inference, reasoning, and other purposes in NLP applications. Current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, among others, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the knowledge base. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create cleaner and more valuable knowledge bases. Instead of the existing, cumbersome Wikipedia categories, we should rely on well-defined, fine-grained categories. Among the few definitions of fine-grained named entities, Extended Named Entity (ENE) is a well-defined name ontology that has about 200 hierarchical categories, with a set of attributes defined for each category.

The final goal of the SHINRA project is to structure the knowledge in Wikipedia, including the attributes. As a first step, however, we need to classify each Wikipedia entry into one of the ENE categories before extracting attribute values. The task of SHINRA2020-ML is to classify Wikipedia pages in 30 languages into ENE (ver.8.0) categories. We have already classified the major Japanese Wikipedia pages (920K pages) into ENE (ver.8.0) categories. Language links can then be used to create training data in the 30 languages (e.g., 275K pages in German). The task is thus to categorize the remaining pages in the 30 languages using this training data.

The goal of this project is not only to compare the participating systems and see which performs best, but also to create a knowledge base from their outputs. We can apply state-of-the-art ensemble learning techniques to combine the systems' outputs and make the KB as accurate as possible.
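As one simple illustration of combining system outputs, the sketch below uses plain majority voting. This is an assumed baseline chosen for the example, not the ensemble method used by the project; the function name `majority_vote`, the page IDs, and the labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(system_outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-system predictions (page ID -> category) by majority vote."""
    votes: Dict[str, Counter] = {}
    for output in system_outputs:
        for page_id, label in output.items():
            votes.setdefault(page_id, Counter())[label] += 1
    # most_common(1) returns the label with the highest count;
    # ties go to the label that was seen first.
    return {page_id: counter.most_common(1)[0][0] for page_id, counter in votes.items()}


# Toy example with three hypothetical systems.
system_a = {"p1": "Person", "p2": "City"}
system_b = {"p1": "Person", "p2": "Country"}
system_c = {"p1": "City", "p2": "Country"}

combined = majority_vote([system_a, system_b, system_c])
print(combined)
```

Majority voting treats every system equally. A weighted vote or a trained meta-classifier would be a natural refinement when held-out accuracy estimates for each system are available.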

Statistics of Wikipedias in 31 languages

Language          Num. of pages   Links from categorized ja pages   Percentage (%)
English (en)      5,790,377       439,354                           7.6
Spanish (es)      1,500,013       257,835                           17.2
French (fr)       2,074,648       318,828                           15.4
German (de)       2,262,582       274,732                           12.1
Chinese (zh)      1,041,039       267,107                           25.7
Russian (ru)      1,523,013       253,012                           16.6
Portuguese (pt)   1,014,832       217,896                           21.5
Italian (it)      1,496,975       270,295                           18.1
Arabic (ar)       661,205         73,054                            11.0
Japanese (ja)     1,136,222       -                                 -
Indonesian (id)   451,336         115,643                           25.6
Turkish (tr)      321,937         111,592                           34.7
Dutch (nl)        1,955,483       199,983                           10.2
Polish (pl)       1,316,130       225,552                           17.1
Persian (fa)      660,487         169,053                           25.6
Swedish (sv)      3,759,167       180,948                           4.8
Vietnamese (vi)   1,200,157       116,280                           9.7
Korean (ko)       439,577         190,807                           43.7
Hebrew (he)       236,984         103,137                           43.5
Romanian (ro)     391,231         92,002                            23.5
Norwegian (no)    501,475         135,935                           27.1
Czech (cs)        420,195         135,935                           25.1
Ukrainian (uk)    881,572         181,122                           20.5
Hindi (hi)        129,141         30,547                            23.6
Finnish (fi)      450,537         144,750                           32.1
Hungarian (hu)    443,060         120,295                           27.2
Danish (da)       242,523         91,811                            35.6
Thai (th)         129,294         59,791                            46.2
Catalan (ca)      601,473         139,032                           23.1
Greek (el)        157,566         60,513                            38.4
Bulgarian (bg)    248,913         89,017                            35.7
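The percentage column is the share of a language's pages that have a link from a categorized Japanese page, and the number of pages left to classify is the difference between the two counts. A short worked example for the German row, with the helper names `link_coverage` and `remaining_pages` chosen only for illustration:

```python
def link_coverage(num_pages: int, num_linked: int) -> float:
    """Percentage of pages linked from categorized Japanese pages, to one decimal."""
    return round(100.0 * num_linked / num_pages, 1)


def remaining_pages(num_pages: int, num_linked: int) -> int:
    """Number of pages without a link, i.e. the pages to be classified."""
    return num_pages - num_linked


# German (de) row from the statistics table.
german_pages, german_linked = 2_262_582, 274_732

print(link_coverage(german_pages, german_linked))    # 12.1
print(remaining_pages(german_pages, german_linked))  # 1987850, i.e. about 1,988K
```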

Data License

Wikipedia content is licensed for reuse under CC BY-SA 3.0. For details, please check Wikipedia: Reusing Wikipedia content.

Organizers

Slides of Task Description

20190930_NTCIR15KickOff_20190919 (1)-圧縮済み