“Project SHINRA” aims to build a structured knowledge base that combines Wikipedia with the Extended Named Entity (ENE) definition, using the “Resource by Collaborative Contribution” (RbCC) scheme.
Structuring Wikipedia and RbCC
Wikipedia, which is written by a large community of contributors, covers a vast number of entries with up-to-date information, which makes it a great knowledge base (KB). However, most of that information is written for people to read, not for machines to process. The SHINRA project aims to structure the information in Wikipedia so that machines can process it.
We host shared tasks to build a structured KB by categorizing Wikipedia entities according to the Extended Named Entity (ENE) definition, which includes 200+ categories, and by extracting the attributes defined for each category. The outputs of the participating systems will be unified, and the final results will be distributed as a structured KB. We call this scheme “Resource by Collaborative Contribution (RbCC)”, and we invite many collaborators to participate in the tasks.
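To make the idea of unifying system outputs concrete, here is a minimal sketch. The page does not say how the outputs are actually unified, so simple majority voting is used here purely as an assumption, and the function name `unify_by_majority` and the toy data are hypothetical.

```python
from collections import Counter


def unify_by_majority(system_outputs):
    """Merge per-system predictions into one label per entity.

    system_outputs: list of dicts mapping entity title -> category name.
    ASSUMPTION: majority voting is only one possible unification rule;
    it is not described on this page as the project's method.
    """
    votes = {}
    for output in system_outputs:
        for entity, category in output.items():
            votes.setdefault(entity, Counter())[category] += 1
    # On a tie, Counter.most_common keeps the category that was seen first.
    return {entity: counts.most_common(1)[0][0] for entity, counts in votes.items()}


# Toy predictions from three hypothetical participating systems.
outputs = [
    {"Berlin": "City", "Lufthansa": "Company"},
    {"Berlin": "City", "Lufthansa": "Airport"},
    {"Berlin": "City", "Lufthansa": "Company"},
]
unified = unify_by_majority(outputs)
```

Entities that only some systems cover still receive a label, because votes are counted per entity rather than per system.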
Two kinds of tasks
The final goal is to build a structured Wikipedia. Achieving it requires two tasks: categorization and structuring.
Categorization: Categorize Wikipedia entities into one or more ENE categories. This was done for Japanese Wikipedia in 2017 using a machine learning method, with human experts checking the results. It has since been updated to the 2019 version of Wikipedia using similar methods. For 30 other languages, we will do it under the RbCC scheme in the SHINRA2020-ML task.
Structuring: From each Wikipedia entity, extract the values of the attributes pre-defined for its category. We are working on this in the Japanese tasks: 5 categories at SHINRA2018-JP, an additional 30 categories at SHINRA2019-JP, and an additional 47 categories at SHINRA2020-JP.
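The two tasks can be pictured as filling in two parts of one record per entity. The sketch below is only an illustration: the class name `StructuredEntity`, its field names, and the attribute name "IATA code" are hypothetical and are not taken from the SHINRA data format or the ENE attribute definitions.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredEntity:
    """One Wikipedia entity after both tasks (hypothetical record layout)."""

    page_title: str
    # Filled by the categorization task: one or more ENE category names.
    ene_categories: list = field(default_factory=list)
    # Filled by the structuring task: attribute name -> extracted values.
    attributes: dict = field(default_factory=dict)


# Step 1: categorization assigns the entity to an ENE category.
entity = StructuredEntity(page_title="Narita International Airport")
entity.ene_categories.append("Airport")

# Step 2: structuring extracts values for that category's attributes.
# "IATA code" is an illustrative attribute name, not an official one.
entity.attributes.setdefault("IATA code", []).append("NRT")
```

Keeping the categories as a list reflects that an entity may belong to more than one ENE category, and keeping attribute values as lists allows an attribute to have several extracted values.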
The SHINRA2020-ML task is to categorize Wikipedia entities in 30 languages. The training data consists of hand-categorized Japanese Wikipedia entities together with the language links from them to each target-language Wikipedia. For example, 316K entities in German Wikipedia are linked from the 920K hand-categorized Japanese Wikipedia entities. Participants are expected to categorize the remaining 1.7M German entities. This task will be run as one of the NTCIR-15 shared tasks, and you need to register with the NTCIR-15 project.
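The way training data arises from language links can be sketched as follows. This is a minimal illustration under assumed data layouts: the function names `project_labels` and `remaining_entities`, the dictionary formats, and the toy titles are all hypothetical and do not describe the distributed data files.

```python
def project_labels(ja_labels, language_links):
    """Carry Japanese category labels over to a target-language Wikipedia.

    ja_labels: dict mapping Japanese title -> list of ENE category names.
    language_links: dict mapping Japanese title -> target-language title
        (Japanese pages without a link are simply absent).
    Returns a dict mapping target-language title -> set of category names.
    """
    training = {}
    for ja_title, categories in ja_labels.items():
        target_title = language_links.get(ja_title)
        if target_title is not None:
            training.setdefault(target_title, set()).update(categories)
    return training


def remaining_entities(all_target_titles, training):
    """Target-language entities with no projected label, left for participants."""
    return [title for title in all_target_titles if title not in training]


# Toy data: two labelled Japanese pages, only one linked to German Wikipedia.
ja_labels = {"東京": ["City"], "成田国際空港": ["Airport"]}
links_to_german = {"東京": "Tokio"}
german_titles = ["Tokio", "Heidelberg", "Lufthansa"]

training = project_labels(ja_labels, links_to_german)
to_categorize = remaining_entities(german_titles, training)
```

In the German example above, the projected set would correspond to the 316K linked entities and the remainder to the 1.7M entities that participants categorize.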
The 30 languages are English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian. (These Wikipedias have the largest number of users.)
The SHINRA2020-JP task is to structure Japanese Wikipedia entities. The categories include those used at SHINRA2019 (JP-5 and JP-30), as well as 47 new categories under facilities and events (JP-47).
The SHINRA2019-JP task was to structure Japanese Wikipedia entities. The categories include those used at SHINRA2018 (JP-5), as well as 30 new categories under location and organization (JP-30); 2 categories are not used due to the small number of entities. There were 11 participants in this task.
The SHINRA2018-JP task was to structure Japanese Wikipedia entities in 5 categories: person, city, company, airport and chemical compound. There were 8 participants in this task.
All SHINRA data can be downloaded. In order to do this, you need to create a free account.
All SHINRA documents (slides) can be downloaded. In order to do this, you need to create a free account.
The Extended Named Entity definition has 219 categories under Name, Numerical value and Time. Examples include person, company, city, airport and chemical compound. Each category has its own attribute definitions, with 10 to 30 attributes per category.
Please send inquiries and comments regarding the SHINRA2020-ML task to shinra2020ml-info _(@)_ googlegroups.com, and regarding the SHINRA2020-JP task to shinra2020jp-info _(@)_ googlegroups.com.