Overview

Wikipedia contains a large volume of entities (a.k.a. articles), making it a rich source of knowledge for many NLP tasks. To make the most of this knowledge, resources built from Wikipedia need to be structured so that they can support inference, reasoning, and other purposes in NLP applications. Current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, among others, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the knowledge base. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create cleaner and more valuable knowledge bases. Instead of the existing, cumbersome Wikipedia categories, we should rely on well-defined, fine-grained categories. Among the few definitions of fine-grained named entities, Extended Named Entity (ENE) is a well-defined name ontology with about 200 hierarchical categories and a set of attributes defined for each category.

The final goal is to structure the knowledge in Wikipedia, including the attributes, but as a first step, each Wikipedia entry must be categorized into one of the ENE categories before attribute values can be extracted. The task of SHINRA2020-ML is to categorize Wikipedia pages in 30 languages into ENE categories. We have already categorized the major Japanese Wikipedia pages, 920K in total, into ENE categories. Using language links, we can create training data in the 30 languages (e.g., 320K pages in German; see the Data page). The task is therefore to categorize the remaining pages in the 30 languages using this training data.
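
As an illustration of how language links can turn the Japanese annotations into training data for another language, here is a minimal sketch. It is not the organizers' actual pipeline: the function name `project_labels`, the page identifiers, and the dictionary layout are all hypothetical, and the category names are used only as examples.

```python
from typing import Dict


def project_labels(
    ja_labels: Dict[str, str],
    lang_links: Dict[str, Dict[str, str]],
    target_lang: str,
) -> Dict[str, str]:
    """Copy ENE labels from Japanese pages to linked pages in target_lang.

    ja_labels:  hypothetical mapping, Japanese page id -> ENE category.
    lang_links: hypothetical mapping, Japanese page id -> {language: page id}.
    Pages without a link in target_lang are skipped; they stay unlabeled
    and are the ones a participating system would have to categorize.
    """
    projected: Dict[str, str] = {}
    for ja_page_id, category in ja_labels.items():
        target_page_id = lang_links.get(ja_page_id, {}).get(target_lang)
        if target_page_id is not None:
            projected[target_page_id] = category
    return projected


if __name__ == "__main__":
    # Toy data for illustration only (assumed ids and labels).
    ja_labels = {"ja:1": "Person", "ja:2": "City", "ja:3": "Company"}
    lang_links = {
        "ja:1": {"de": "de:10", "fr": "fr:20"},
        "ja:2": {"fr": "fr:21"},
    }
    print(project_labels(ja_labels, lang_links, "de"))
```

Only Japanese pages that have a link to the target language contribute a training example, which is why the amount of training data differs from language to language.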

The goal of this project is not only to compare the participating systems and see which performs best, but also to create a knowledge base from the outputs of those systems. We can use state-of-the-art ensemble learning techniques to combine the systems' outputs and make the KB as accurate as possible.
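
The text does not say which ensemble method will be used. As a rough illustration of the idea of combining system outputs, the sketch below uses plain majority voting, which is one of the simplest ensemble techniques and is chosen here only as an assumption. The function name `majority_vote` and the data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(system_outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-system predictions by picking the most frequent category.

    system_outputs: one dict per system, mapping page id -> predicted
    ENE category (hypothetical layout). A page is included if at least
    one system predicted it. Ties are resolved in favor of the category
    that was seen first.
    """
    votes: Dict[str, Counter] = {}
    for output in system_outputs:
        for page_id, category in output.items():
            votes.setdefault(page_id, Counter())[category] += 1
    return {
        page_id: counter.most_common(1)[0][0]
        for page_id, counter in votes.items()
    }


if __name__ == "__main__":
    # Toy predictions from three imaginary systems.
    outputs = [
        {"p1": "Person", "p2": "City"},
        {"p1": "Person", "p2": "Country"},
        {"p1": "Company", "p2": "City"},
    ]
    print(majority_vote(outputs))
```

A real ensemble would likely weight systems by their measured accuracy or use confidence scores. Unweighted voting is shown only because it needs no additional information about the systems.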