SHINRA2020-ML: Classification Task

What’s New

SHINRA2020-ML evaluation results and run results are available from NTCIR-15: SHINRA2020-ML System Data Download. (2021/1/8)
SHINRA2020-ML leaderboard has been restored. (2020/10/19)
Notice: SHINRA2020-ML Leaderboard downtime (Oct 16 night – Oct 17). (2020/10/16)
SHINRA2020-ML leaderboard has been restored. We will keep the site running after the official results submission deadline. (2020/8/31)
Notice: SHINRA2020-ML Leaderboard downtime (Aug 28 – Aug 31 morning).
SHINRA2020-ML FAQs are available here. (2020/8/6)
You can submit the run results from the SHINRA2020-ML: Results Submission page. (2020/7/10)
SHINRA2020-ML leaderboard has been released. (2020/7/6)
The registration & result submission deadline has been extended to August 31, 2020 (2020/6/30). We updated the Call for Participation. (2020/7/1)
Updated Call for Participation. (2020/4/27)

Task Description

SHINRA is a resource creation project, started in 2017, that aims to structure the knowledge in Wikipedia.

SHINRA2020-ML is the first shared task on text categorization in the SHINRA project. It tackles the problem of classifying Wikipedia entities in 30 languages into fine-grained categories. The task is conducted as one of the NTCIR-15 tasks.

[Video] (approx. 11 min): Introduction of the SHINRA2020-ML task (categorization of 30-language Wikipedia into ENE)

The task is to classify Wikipedia entities in 30 languages into the 219 categories defined in Extended Named Entity (ENE) (ver.8.0), a four-layer ontology for names, time, and numbers, using categorized Japanese Wikipedia pages and the interlanguage links to the corresponding pages in the target languages.

[SHINRA2020-ML Overview]

Participants are expected to select one or more target languages. For each language, they use the Wikipedia pages linked from the categorized Japanese pages as training data, and run their system to classify the remaining pages, which are not linked from the Japanese pages.

We will provide the training data for 30 languages, created from the 920K categorized Japanese Wikipedia pages and the Wikipedia language links for the 30 languages. For example, out of 2,263K German Wikipedia pages, 275K pages have a language link from Japanese Wikipedia and can serve as slightly noisy training data for German. So the task is “to classify the remaining 1,988K pages into 219 categories, based on the 275K categorized pages.” The same holds for the other 29 languages, as shown in Data statistics.
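The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the organizers' data pipeline: the function name `split_by_language_links`, the dictionary layout, the page IDs, and the category labels are all hypothetical, and each page carries a single category for simplicity.

```python
from typing import Dict, List, Tuple


def split_by_language_links(
    ja_categories: Dict[str, str],
    ja_to_target: Dict[str, str],
    target_pages: List[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Split target-language pages into labeled training pages and pages left to classify."""
    # A target page inherits the category of the categorized Japanese page linking to it.
    training = {
        target_id: ja_categories[ja_id]
        for ja_id, target_id in ja_to_target.items()
        if ja_id in ja_categories
    }
    # Every other page in the target-language dump still has to be classified.
    remaining = [page for page in target_pages if page not in training]
    return training, remaining


# Hypothetical toy data: page IDs and category labels are placeholders.
ja_categories = {"ja:1": "Person", "ja:2": "City"}
ja_to_target = {"ja:1": "de:10", "ja:2": "de:20", "ja:3": "de:30"}  # ja:3 is uncategorized
target_pages = ["de:10", "de:20", "de:30", "de:40"]

training, remaining = split_by_language_links(ja_categories, ja_to_target, target_pages)
print(training)
print(remaining)
```

Pages such as `de:30`, which are linked only from an uncategorized Japanese page, fall into the remaining set together with pages that have no Japanese counterpart at all.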

The target data for each language is provided as a Wikipedia dump. The participants are requested to submit the outputs for the entire target data.
The submitted data will be open to the public so that people can try ensemble learning methods to create a resource of categories for Wikipedia pages in 30 languages.

The 30 target languages are: English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian.

If you are interested in participating:

Background

Wikipedia consists of a large volume of entities (a.k.a. articles), which makes it a great knowledge resource for many NLP tasks. To maximize the use of such knowledge, resources created from Wikipedia need to be structured for inference, reasoning, and other purposes in NLP applications. The current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, among others, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the knowledge base. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create cleaner and more valuable knowledge bases. Instead of the existing, cumbersome Wikipedia categories, we should rely on well-defined and fine-grained categories. Among the few definitions of fine-grained named entities, Extended Named Entity (ENE) is a well-defined name ontology, which has about 220 hierarchical categories, with a set of attributes defined for each category.

The final goal of the SHINRA project is to structure the knowledge in Wikipedia, including the attributes, but as a first step, we need to classify each Wikipedia entry into one of the ENE categories before extracting attribute values. The task of SHINRA2020-ML is to classify Wikipedia pages in 30 languages into ENE (ver.8.0) categories. We have already classified the major Japanese Wikipedia pages (920K pages) into ENE (ver.8.0) categories. We can use language links to create the training data in 30 languages (e.g., 275K pages in German). So the task is to categorize the remaining pages in 30 languages using the training data.

The goal of this project is not only to compare the participating systems and see which one performs best, but also to create a knowledge base from the outputs of those systems. We can use state-of-the-art ensemble learning techniques to combine the outputs of the systems and make the KB as accurate as possible.
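As a rough illustration of how submitted outputs could be combined, the sketch below uses plain majority voting, one of the simplest ensemble methods. It is not the method used by the project. The function name `majority_vote`, the page IDs, and the category labels are hypothetical, and each system is assumed to output one category per page.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(system_outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick, for each page, the category predicted by the most systems."""
    votes: Dict[str, Counter] = {}
    for output in system_outputs:
        for page_id, category in output.items():
            votes.setdefault(page_id, Counter())[category] += 1
    # On a tie, Counter.most_common keeps the category that was counted first.
    return {page_id: counter.most_common(1)[0][0] for page_id, counter in votes.items()}


# Hypothetical outputs from three systems; labels are placeholders.
system_a = {"de:30": "Person", "de:40": "City"}
system_b = {"de:30": "Person", "de:40": "Company"}
system_c = {"de:30": "City", "de:40": "Company"}

ensemble = majority_vote([system_a, system_b, system_c])
print(ensemble)
```

More elaborate ensembles could weight each system by its accuracy on held-out data, but that requires evaluation results that this sketch does not assume.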

Statistics of Wikipedias in 31 languages

Language	Number of pages	Pages linked from categorized ja Wikipedia	Percentage (%)
English (en)	5,790,377	439,354	7.6
Spanish (es)	1,500,013	257,835	17.2
French (fr)	2,074,648	318,828	15.4
German (de)	2,262,582	274,732	12.1
Chinese (zh)	1,041,039	267,107	25.7
Russian (ru)	1,523,013	253,012	16.6
Portuguese (pt)	1,014,832	217,896	21.5
Italian (it)	1,496,975	270,295	18.1
Arabic (ar)	661,205	73,054	11.0
Japanese (ja)	1,136,222	–	–
Indonesian (id)	451,336	115,643	25.6
Turkish (tr)	321,937	111,592	34.7
Dutch (nl)	1,955,483	199,983	10.2
Polish (pl)	1,316,130	225,552	17.1
Persian (fa)	660,487	169,053	25.6
Swedish (sv)	3,759,167	180,948	4.8
Vietnamese (vi)	1,200,157	116,280	9.7
Korean (ko)	439,577	190,807	43.4
Hebrew (he)	236,984	96,434	40.7
Romanian (ro)	391,231	92,002	23.5
Norwegian (no)	501,475	135,935	27.1
Czech (cs)	420,195	125,959	30.0
Ukrainian (uk)	881,572	167,237	19.0
Hindi (hi)	129,141	30,547	23.7
Finnish (fi)	450,537	144,750	32.1
Hungarian (hu)	443,060	120,295	27.2
Danish (da)	242,523	86,238	35.6
Thai (th)	129,294	59,791	46.2
Catalan (ca)	601,473	139,032	23.1
Greek (el)	157,566	60,513	38.4
Bulgarian (bg)	248,913	89,017	35.8
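The Percentage column is the number of linked pages divided by the total number of pages, and the difference between the two counts is the number of pages left to classify. The following sketch reproduces this arithmetic for two rows of the table. The helper name `link_percentage` is hypothetical, and rounding to one decimal place is an assumption inferred from the values shown.

```python
def link_percentage(num_pages: int, num_linked: int) -> float:
    """Share of pages with a link from a categorized Japanese page, in percent."""
    return round(100 * num_linked / num_pages, 1)


# (number of pages, linked pages) copied from the table above.
rows = {
    "German (de)": (2_262_582, 274_732),
    "Korean (ko)": (439_577, 190_807),
}

for language, (num_pages, num_linked) in rows.items():
    remaining = num_pages - num_linked  # pages left to classify
    print(language, link_percentage(num_pages, num_linked), remaining)
```

For German, the remaining count is 1,987,850 pages, which matches the figure of about 1,988K quoted in the task description.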

Data License

Reuse of Wikipedia content is licensed under CC BY-SA 3.0. For details, please check Wikipedia: Reusing Wikipedia content.

Organizers

Organizer Page