SHINRA2020-ML Data Download

Participants in SHINRA2020-ML need to download two kinds of data: (1) training data and (2) target data. Please select the option below that fits your needs and situation!


  • I just want to take a peek!

You can download only the training data for all 30 languages (the Trial Datasets). This lets you get a feel for what the data look like and understand the task. Given the training data of a specific language, the task is to categorize all the remaining Wikipedia entities of that language.

  • I will participate with only the necessary data

If you are going to participate in some or all languages using only the necessary data, please download the Minimum Datasets for your languages.

  • I will participate with all related data

If you want to dig deep into this task and get all the related data (which may or may not help you achieve the maximum performance), download all the Minimum Datasets and Extended Datasets. The Extended Datasets include “categorized Japanese Wikipedia”, “Wikipedia language links”, “Cirrus search dump – general”, and “Wikipedia dump” data for 31 languages. Our local expert said you may NOT get much benefit from them, though.

  • My internet connection is too slow; I need to get the data by traditional mail.

Please send an email to info _(at)_ shinra-project.info. We will ship all the data (including both the Minimum Datasets and the Extended Datasets) on a USB memory stick. It will cost $100.


Trial Datasets

These datasets are a subset of the Minimum Datasets and contain only the training data.

These are intended for the following people:

  • Those who want to see the data before deciding to participate.
  • Those who want to try training a model first.


Minimum Datasets

These datasets contain, in addition to the training data, the unlabeled target data for which labels must be predicted and submitted.

These are intended for the following people:

  • Those who have decided to participate.


Extended Datasets

The Extended Datasets include four types of data. Please download the ones you need.

(1) Japanese Wikipedia articles categorized into Extended Named Entity (ENE) categories

(2) Language link information between Wikipedia editions in different languages

We provide both the SQL dump data from MediaWiki and cleaned JSON-format data. Note that the SQL data does not include the page IDs used in the training and target data we provide; that information has been added to the JSON data (i.e., the JSON data is much easier to use).

Attention! Because the Hindi (hi) language-link dump is missing, the Hindi language links were reconstructed from the other languages’ language-link dumps that contain links to Hindi articles.

(3) A script to create the training data from (1) and (2):

shinra2020_ml_train_maker
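
For reference, the rough idea behind this step is to transfer the ENE labels of the categorized Japanese pages to the corresponding pages of other languages via the language links. The Python sketch below illustrates that idea only; it is not the shinra2020_ml_train_maker script itself, and the file names and JSON field names (page_id, ENEs, ja_page_id, etc.) are assumptions made for illustration. Check the downloaded JSON for the real schema.

    # Conceptual sketch only; NOT the actual shinra2020_ml_train_maker script.
    # Field names such as "page_id", "ENEs", and "ja_page_id" are assumptions.
    import json

    def load_japanese_labels(path):
        """Load ENE labels of Japanese Wikipedia pages (one JSON object per line assumed)."""
        labels = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                labels[record["page_id"]] = record["ENEs"]  # hypothetical keys
        return labels

    def project_labels(ja_labels, language_link_path, target_lang="en"):
        """Transfer Japanese ENE labels to the linked pages of a target language."""
        projected = {}
        with open(language_link_path, encoding="utf-8") as f:
            for line in f:
                link = json.loads(line)  # hypothetical schema
                ja_id = link["ja_page_id"]
                target_id = link.get(f"{target_lang}_page_id")
                if target_id is not None and ja_id in ja_labels:
                    projected[target_id] = ja_labels[ja_id]
        return projected

    if __name__ == "__main__":
        # Hypothetical file names, for illustration only.
        ja_labels = load_japanese_labels("ja_ene_categories.json")
        en_train = project_labels(ja_labels, "language_links.json", "en")
        print(f"{len(en_train)} English pages received projected ENE labels")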

(4) Wikipedia dump data in 31 languages.

There are three types of files, as follows. The task is to categorize all the articles in this target data into ENE categories.

      • Wikipedia Dump
        • XXwiki-yymmdd-pages-articles.xml.bz2: XML-format dump data of Wikipedia articles
      • Cirrus Dump: Wikipedia dump data for Elasticsearch. It includes not only the articles but also information for search purposes. In general, the XML tags are removed, which makes it easier to handle for NLP purposes (a minimal reading sketch follows below).
        • XXwiki-yymmdd-cirrussearch-content.json: This file includes only article pages.
        • XXwiki-yymmdd-cirrussearch-general.json: This file includes all other pages, such as talk pages, templates, etc. (the English (en), Arabic (ar), and Vietnamese (vi) “general” dumps also include all the article pages from “content”).

Attention! A Cirrus dump of the Greek (el) Wikipedia does not exist. Please use the Greek Wikipedia dump instead of the Cirrus dump.

See more details about the Cirrus dump format:

https://meta.wikimedia.org/wiki/Data_dumps/Misc_dumps_format
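
As a rough guide to working with the Cirrus data, the sketch below streams an uncompressed Cirrus dump assuming the usual Elasticsearch bulk-import layout, in which lines alternate between a metadata line (carrying the page ID) and a content line (carrying fields such as "title" and "text"). The file name is a made-up example, and the exact field names should be verified against the downloaded files.

    # Minimal sketch for streaming a Cirrus search dump, assuming the usual
    # Elasticsearch bulk-import layout: a metadata line followed by a content line
    # for each page. Verify the field names against the downloaded file.
    import json

    def iter_cirrus_pages(path):
        """Yield (page_id, title, text) tuples from an uncompressed Cirrus dump."""
        with open(path, encoding="utf-8") as f:
            while True:
                meta_line = f.readline()
                if not meta_line:
                    break
                content_line = f.readline()
                meta = json.loads(meta_line)
                content = json.loads(content_line)
                page_id = meta["index"]["_id"]  # assumed location of the page ID
                yield page_id, content.get("title", ""), content.get("text", "")

    if __name__ == "__main__":
        # Hypothetical file name following the naming pattern above (XX = language code).
        for page_id, title, text in iter_cirrus_pages("enwiki-20200101-cirrussearch-content.json"):
            print(page_id, title, text[:80])
            break  # just show the first page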

All Languages

These files contain the data for all 31 languages (each of the three files contains one of the three types of data listed above). As a result, they are very big files.

Each Language

These files contain all the data for a single language. They are easier to download if you are planning to participate in only some of the 30 languages.