Project SHINRA

What’s New

  • You can watch the Project SHINRA introduction video on YouTube. Please take a look. (2021/6/30)
  • We appreciate your participation in the SHINRA2021-ML kick-off meeting. The video, presentation file, and statistical data are now available from the SHINRA2021-ML homepage. (2021/6/1)
  • The SHINRA2021-ML and SHINRA2021-LinkJP sites are now open. (2021/3/12)
    - The SHINRA2021-ML task is to categorize Wikipedia pages in 30 languages into 200 ENE categories.
    - The SHINRA2021-LinkJP task is to link an entity string to the corresponding Wikipedia page.

SHINRA Project 2021 Introduction video

“Project SHINRA” aims to build a structured knowledge base (KB) that combines Wikipedia with the Extended Named Entity (ENE) definitions, using a scheme called “Resource by Collaborative Contribution” (RbCC).

Structuring Wikipedia and RbCC

Wikipedia, which is created by crowds, has a very large number of entries with up-to-date information, which makes it a great knowledge base. However, most of that information is written for people to read, not for machines to manipulate. Project SHINRA aims to structure the information in Wikipedia so that machines can manipulate it.

We host shared tasks to build a structured KB by categorizing Wikipedia entities according to the Extended Named Entity (ENE) definitions, which include 200+ categories, and by extracting the attributes defined for each category. The outputs of the participating systems will be unified, and the final results will be distributed as a structured KB. We call this scheme “Resource by Collaborative Contribution (RbCC)”, and we are asking many collaborators to participate in the tasks.
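The text above does not say how the system outputs are unified. As a minimal sketch, assuming plain majority voting over per-page category predictions (one simple possibility, not necessarily the method the project uses), the unification step could look like this. The function name, the page ids, and the labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def unify_by_majority(system_outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-system category predictions by majority vote.

    Each element of system_outputs maps a page id to one category label.
    Assumption: every system counts equally; ties go to the label that
    was seen first for that page.
    """
    votes: Dict[str, Counter] = {}
    for output in system_outputs:
        for page_id, category in output.items():
            votes.setdefault(page_id, Counter())[category] += 1
    # most_common(1) returns the single label with the highest count
    return {page_id: counter.most_common(1)[0][0]
            for page_id, counter in votes.items()}


# Toy predictions from three hypothetical systems
outputs = [
    {"p1": "Person", "p2": "City"},
    {"p1": "Person", "p2": "Company"},
    {"p1": "Company", "p2": "City"},
]
unified = unify_by_majority(outputs)
print(unified)
```

A real unification would likely weight systems by their measured accuracy; equal weighting is used here only to keep the sketch short.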

Three kinds of tasks

The final goal is to build a structured Wikipedia. To achieve it, we have to conduct three tasks: categorization, attribute extraction, and attribute value linking.

Categorization: Categorize Wikipedia entities into one or more ENE categories. This was done for Japanese in 2017 using an ML method, with human experts checking the results. It was later updated to the 2019 version of Wikipedia by similar methods. For 30 languages, we have been working on this through the RbCC scheme in the ML tasks, namely SHINRA2021-ML, which succeeds SHINRA2020-ML.

Attribute Extraction: From each Wikipedia entity, extract the values of the attributes pre-defined for its category. We are working on this in the Japanese tasks: 5 categories at SHINRA2018-JP, an additional 30 categories at SHINRA2019-JP, and an additional 47 categories at SHINRA2020-JP.

Attribute Value Linking: Map the attribute values (texts) extracted from Wikipedia articles to the corresponding Wikipedia entities (specified by the pageids of the articles). We will start this for 7 categories at SHINRA2021-LinkJP.
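To make the relation between the three tasks concrete, here is a minimal sketch of a record that each task fills in turn. The class names, field names, the "location" attribute, and the pageids are hypothetical and are not the SHINRA data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AttributeValue:
    text: str                            # value string extracted from the article
    linked_pageid: Optional[int] = None  # set by attribute value linking


@dataclass
class StructuredEntity:
    pageid: int
    title: str
    # Filled by categorization (an entity may have more than one category)
    ene_categories: List[str] = field(default_factory=list)
    # Filled by attribute extraction: attribute name -> extracted values
    attributes: Dict[str, List[AttributeValue]] = field(default_factory=dict)


# Hypothetical entity and pageids, for illustration only
entity = StructuredEntity(pageid=1001, title="Example Airport")

# 1. Categorization
entity.ene_categories.append("Airport")

# 2. Attribute extraction
entity.attributes["location"] = [AttributeValue(text="Example City")]

# 3. Attribute value linking: point the text at the page it refers to
entity.attributes["location"][0].linked_pageid = 2002

print(entity)
```

Keeping the extracted text and the linked pageid side by side means a value can stay unlinked (`None`) when no corresponding page exists.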

Shared-tasks

SHINRA2021-ML: The task is to classify Wikipedia pages in 30 languages into about 220 fine-grained Named Entity categories, using a large training set (more than 100K pages). It succeeds SHINRA2020-ML, and participants have the option of using the system results of SHINRA2020-ML.

SHINRA2021-LinkJP: A new task to map the attribute values (texts) extracted from Wikipedia articles to the corresponding Wikipedia entities (specified by the pageids of the articles).

SHINRA2020-ML: The task is to categorize Wikipedia entities in 30 languages. The training data is derived from hand-categorized Japanese Wikipedia entities and the language links from them to each target-language Wikipedia. For example, 316K entities in German Wikipedia are linked from the 920K hand-categorized Japanese Wikipedia entities. Participants are expected to categorize the remaining 1.7M German entities. This was one of the NTCIR-15 shared tasks.

The 30 languages are English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian. (These Wikipedias have the largest number of users.)
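The way training data is obtained for the 30-language task (hand-assigned categories on Japanese pages carried over to other languages through language links) can be sketched as follows. This is an illustration under assumed input shapes, not the project's actual tooling; the function names, pageids, and labels are hypothetical.

```python
from typing import Dict, Iterable, List


def project_categories(
    ja_categories: Dict[int, List[str]],
    language_links: Dict[int, int],
) -> Dict[int, List[str]]:
    """Carry hand-assigned categories over to a target-language Wikipedia.

    ja_categories: Japanese pageid -> hand-assigned ENE categories.
    language_links: Japanese pageid -> pageid of the linked target page.
    Returns target pageid -> categories (the training data).
    """
    training: Dict[int, List[str]] = {}
    for ja_id, target_id in language_links.items():
        categories = ja_categories.get(ja_id)
        if categories:  # skip Japanese pages that have no label
            training[target_id] = list(categories)
    return training


def pages_to_categorize(
    all_target_ids: Iterable[int],
    training: Dict[int, List[str]],
) -> List[int]:
    """Target pages with no projected label: what participants must classify."""
    return [pid for pid in all_target_ids if pid not in training]


# Toy data with hypothetical pageids
ja_categories = {10: ["City"], 11: ["Person"], 12: ["Company"]}
language_links = {10: 500, 11: 501, 99: 502}  # page 99 has no label
training = project_categories(ja_categories, language_links)
remaining = pages_to_categorize([500, 501, 502, 503], training)
print(training, remaining)
```

In the German example from the text, the 316K linked entities would play the role of `training`, and the 1.7M remaining entities the role of `remaining`.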

SHINRA2020-JP: The task is to structure Japanese Wikipedia entities. The categories include those used at SHINRA2019, JP-5 and JP-30, as well as 47 new categories under facilities and events, JP-47.

SHINRA2019-JP: The task is to structure Japanese Wikipedia entities. The categories include those used at SHINRA2018, JP-5, as well as 30 new categories under location and organization, JP-30 (2 categories are not used because they have too few entities). There were 11 participants in this task.

SHINRA2018-JP: The task is to structure Japanese Wikipedia entities in 5 categories: person, city, company, airport, and chemical compound. There were 8 participants in this task.

Links

SHINRA Data Download

All SHINRA data can be downloaded. In order to do this, you need to create a free account.

SHINRA Document Download

All SHINRA documents (slides) can be downloaded. In order to do this, you need to create a free account.

SHINRA Publications/ Related Works

Research publications related to project SHINRA and related works are available on the page.

Extended Named Entity Definition

About 220 categories are defined for Name, Numerical value, and Time. Examples include person, company, city, airport, and chemical compound. Each category has its own attribute definitions, with 10 to 30 attributes per category.

Contact

Please send inquiries and comments regarding the SHINRA2021-ML task to shinra2021ml-info (at) googlegroups.com, and those regarding the SHINRA2021-LinkJP task to shinra2021linkjp-info (at) googlegroups.com.