SHINRA2020-ML Data Download

Participants in SHINRA2020-ML need to download two kinds of data: (1) training data and (2) target data. Please choose the option below that best fits your needs and situation!


  • I just want to take a peek!

You can download only the training data for all 30 languages (Trial Datasets). This lets you get a feel for what the data look like and understand the task. Given the training data of a specific language, your task is to categorize all the remaining Wikipedia entities of that language.

  • I will participate with only the necessary data

If you are going to participate in some or all of the languages using only the necessary data, please download the Minimum Datasets for your languages.

  • I will participate with all related data

If you want to dive deep into this task and get all the related data (which may or may not help you achieve the maximum performance), download all the Minimum Datasets and the Extended Datasets. The Extended Datasets include the “categorized Japanese Wikipedia”, “Wikipedia language link”, “Cirrus search dump – general”, and “Wikipedia dump” data for 31 languages. Our local expert said you may NOT get much benefit, though.

  • My internet connection is too slow, and I need to get the data by traditional mail.

Please send an email to info _(at)_ shinra-project.info. We will ship all the data (including both the Minimum Datasets and the Extended Datasets) on a USB memory stick. It will cost $100.


Trial Datasets

These datasets are a subset of the Minimum Datasets and contain only the training data.

These are intended for the following people:

  • Those who want to see the data before deciding whether to participate.
  • Those who want to try training a model for now.


Minimum Datasets

These datasets contain, in addition to the training data, the unlabeled data for which labels must be predicted and submitted.

These are intended for the following people:

  • Those who have decided to participate.


Extended Datasets

The Extended Datasets include four types of data. Please download the ones you need.

(1) Japanese Wikipedia articles categorized into Extended Named Entity

(2) Language Link information between Wikipedia of different languages

We provide both the SQL dump data from MediaWiki and cleaned JSON-format data. Note that the SQL data does not include the page IDs used in the training and target data we provide. That information has been restored in the JSON data (i.e., the JSON data is much easier to use).

Attention! Because the Hindi (hi) language-link dump is missing, the Hindi language links were created from the language-link dumps of the other languages, which contain links to Hindi articles.

(3) Script to create the training data from (1) and (2):

shinra2020_ml_train_maker

(4) Wikipedia dump data in 31 languages.

There are three types of files, as follows. The task is to categorize all the articles in this target data into ENE categories.

      • Wikipedia Dump
        • XXwiki-yymmdd-pages-articles.xml.bz2: XML-format dump data of Wikipedia articles
      • Cirrus Dump: Wikipedia dump data for Elasticsearch. It includes not only the articles but also information for search purposes. In general, XML tags are removed, so it is easier to handle for NLP purposes.
        • XXwiki-yymmdd-cirrussearch-content.json: This file includes only article pages.
        • XXwiki-yymmdd-cirrussearch-general.json: This file includes all other pages, such as talk pages, templates, etc. (for English (en), Arabic (ar), and Vietnamese (vi), the “general” dump also includes all the article pages from “content”).

Attention! The Cirrus Dump of the Greek (el) Wikipedia does not exist. Please use the Greek Wikipedia Dump instead of the Cirrus Dump.

See the following page for more details about the Cirrus Dump:

https://meta.wikimedia.org/wiki/Data_dumps/Misc_dumps_format
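
For reference, the pages-articles dump can be streamed without fully decompressing it, for example with Python's bz2 and xml.etree modules. The sketch below is only an illustration (the file name and the main-namespace filter are our assumptions, not part of the provided scripts); it prints the page ID and title of every article page.

import bz2
import xml.etree.ElementTree as ET

# Illustrative file name; replace it with the dump you actually downloaded.
DUMP = "enwiki-yymmdd-pages-articles.xml.bz2"

with bz2.open(DUMP, "rb") as f:
    # iterparse streams the XML, so the whole dump never has to fit in memory.
    for _, elem in ET.iterparse(f):
        if elem.tag.endswith("}page"):
            ns = elem.tag[: -len("page")]          # MediaWiki export namespace prefix
            if elem.findtext(ns + "ns") == "0":    # keep only main-namespace articles
                print(elem.findtext(ns + "id"), elem.findtext(ns + "title"))
            elem.clear()                           # free the processed <page> element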

All Languages

These files contain the data for all 31 languages (each of the three files contains one of the three types of data). They are therefore very large files.

Each Language

These files contain all the data for a single language. They are easier to download if you plan to participate in only some of the 30 languages.


Provided Data and Scripts

(1) Training Data

Based on the following two sets of data, (1.1) and (1.2), you can create the training data for all 30 languages using the tool (1.3). We also provide the ready-made training data (which can be produced with (1.3)) as (1.4).

(1.1) ENE categorized Japanese Wikipedia Article Index

(1.2) Language Links

(1.3) Scripts

(1.4) Ready-made training data for 30 languages

(2) Target Data

This is the data to be categorized by the participants. It consists of Wikipedia dumps in two formats (2.1). The organizers also provide a tool for converting the data into an easy-to-process format (2.2).

(2.1) Download Dump Data

(2.2) Scripts


1. Training Data

Based on the following two sets of data, (1.1) and (1.2), you can create the training data for all 30 languages using the tool (1.3). We also provide the ready-made training data (which can be produced with (1.3)) as (1.4).

1.1. Japanese Wikipedia Articles categorized into ENE

Japanese Wikipedia articles categorized into Extended Named Entity.

/* Sample */
{"pageid": 72942, "title": "バックス (ローマ神話)", "ENEs": {"AUTO.TOHOKU.201906": [{"prob": 0.9981889128684998, "ENE_id": "1.2"}]}}
{"pageid": 401755, "title": "覚信尼", "ENEs": {"AUTO.TOHOKU.201906": [{"prob": 0.9999418258666992, "ENE_id": "1.1"}]}}
{"pageid": 96942, "title": "水谷正村", "ENEs": {"AUTO.TOHOKU.201906": [{"prob": 0.9999151825904846, "ENE_id": "1.1"}], "HAND.AIP.201910": [{"ENE_id": "1.1", "prob": 1.0}]}}

The data is in JSON Lines format: one JSON object per line. Each line consists of the following fields.

      • pageid: Page ID of the page (provided by Wikipedia)
      • title: Title of the page
      • ENEs: (one or more) ENE category object(s) assigned to the page
        • Assignment procedure: indicates how the ENE was assigned, e.g. ‘AUTO.TOHOKU.201906’ (generally it encodes how (AUTO or HAND), by whom (AIP, TOHOKU, LC, etc.), and when (201906, etc.)).
          • ENE_id: The assigned ENE category, indicated by its ENE ID number. The definition is described at the following page.
          • prob: Degree of likelihood in case the category was assigned by the AUTO method. The range is between 0 and 1.
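
Since every line is an independent JSON object, the file can be processed with any JSON Lines reader. The following sketch (the file name is an assumed placeholder) collects the ENE IDs assigned to each page and prints the ten most frequent categories.

import json
from collections import Counter

counts = Counter()
# Illustrative file name; use the Japanese categorization file you downloaded.
with open("ja_ene_categorized.json", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        for procedure, enes in page["ENEs"].items():   # e.g. "AUTO.TOHOKU.201906"
            for ene in enes:
                counts[ene["ENE_id"]] += 1             # "prob" is also available for AUTO labels

for ene_id, n in counts.most_common(10):
    print(ene_id, n)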

1.2. Language Links

Language link information between Wikipedias of different languages. We provide both the SQL dump data from MediaWiki and cleaned JSON-format data. Note that the SQL data does not include the page IDs used in the training and target data we provide. That information has been restored in the JSON data.

Attention! Because the Hindi (hi) language-link dump is missing, the Hindi language links were created from the language-link dumps of the other languages, which contain links to Hindi articles.

The details of this dump are as follows. The dump includes only links between normal articles (it does not include maintenance or other Wikipedia meta-pages).

      • source
        • pageid: Page ID of the source Wikipedia article
        • title: Title of the source Wikipedia article
        • lang: Language of the source Wikipedia (abbreviation, e.g. ja, en)
      • destination
        • pageid: Page ID of the destination Wikipedia article
        • title: Title of the destination Wikipedia article
        • lang: Language of the destination Wikipedia (abbreviation, e.g. ja, en)

Example: links from the article “New York” in the English Wikipedia to other language Wikipedias:

/* Example */
{ 
"source" : { "lang" : "en", "pageid" : 673381, "title" : "New York" }, 
"destination" : { "lang" : "de", "pageid" : 109581, "title" : "New York" } 
}
{ 
"source" : { "lang" : "en", "pageid" : 673381, "title" : "New York" }, 
"destination" : { "lang" : "es", "pageid" : 7762118, "title" : "New York (desambiguación)" } 
}
{ 
"source" : { "lang" : "en", "pageid" : 673381, "title" : "New York" }, 
"destination" : { "lang" : "fa", "pageid" : 158288, "title" : "نیویورک (ابهام‌زدایی)" } 
}
{ 
"source" : { "lang" : "en", "pageid" : 673381, "title" : "New York" }, 
"destination" : { "lang" : "ja", "pageid" : 5038, "title" : "ニューヨーク (曖昧さ回避)" } 
}
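
The script in (1.3) does this work for you, so the following is only a minimal sketch of the idea, assuming placeholder file names and that the cleaned language-link file stores one link object per line, as the training data does (the example above is spread over several lines only for readability): each Japanese page with an ENE label passes its label to the linked article in the target language.

import json

TARGET_LANG = "en"            # target language to build training data for
ene_by_ja_pageid = {}         # Japanese page ID -> assigned ENE objects

# Illustrative file names; use the files you actually downloaded.
with open("ja_ene_categorized.json", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        ene_by_ja_pageid[page["pageid"]] = page["ENEs"]

with open("language_links.json", encoding="utf-8") as links, \
     open(TARGET_LANG + "_train.json", "w", encoding="utf-8") as out:
    for line in links:
        link = json.loads(line)
        src, dst = link["source"], link["destination"]
        # Project the Japanese ENE labels onto the linked target-language article.
        if src["lang"] == "ja" and dst["lang"] == TARGET_LANG and src["pageid"] in ene_by_ja_pageid:
            record = {"pageid": dst["pageid"], "title": dst["title"],
                      "ENEs": ene_by_ja_pageid[src["pageid"]]}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")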

1.3. Script

You can create the training data from (1.1) and (1.2) with the following script.

shinra2020_ml_train_maker

However, you do not have to use this script, because the ready-made training data is already available in (1.4).

1.4. Ready-made Training Data

We provide the training data created from (1.1) and (1.2) using the script (1.3). The files are exactly the output of the script described above, so you can use them without downloading the Japanese Wikipedia categorization file and the language-link file and running the script yourself.

2. Target Data

This is the data to be categorized by the participants. It consists of Wikipedia dumps in two formats (2.1). The organizers also provide a tool for converting the data into an easy-to-process format (2.2).

2.1. Target Wikipedia Data

This data consists of the Wikipedia dump data for 31 languages. There are three types of files, as follows. The task is to categorize all the articles in this target data into ENE categories.

Data in these Dumps

      • Wikipedia Dump
        • XXwiki-yymmdd-pages-articles.xml.bz2: XML-format dump data of Wikipedia articles
      • Cirrus Dump: Wikipedia dump data for Elasticsearch. It includes not only the articles but also information for search purposes. In general, XML tags are removed, so it is easier to handle for NLP purposes.
        • XXwiki-yymmdd-cirrussearch-content.json: This file includes only article pages.
        • XXwiki-yymmdd-cirrussearch-general.json: This file includes all other pages, such as talk pages, templates, etc. (for English (en), Arabic (ar), and Vietnamese (vi), the “general” dump also includes all the article pages from “content”).

Attention! The Cirrus Dump of the Greek (el) Wikipedia does not exist. Please use the Greek Wikipedia Dump instead of the Cirrus Dump.

See the following page for more details about the Cirrus Dump:

https://meta.wikimedia.org/wiki/Data_dumps/Misc_dumps_format
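
The Cirrus files are Elasticsearch bulk-import dumps, so each page is normally represented by two consecutive JSON lines: a short metadata line containing the page ID, followed by a content line with fields such as title, text, and namespace. The sketch below is a rough illustration under that assumption (the file name is also a placeholder); it prints the ID and title of every main-namespace article.

import json

# Illustrative file name; the dump pairs a metadata line with a content line.
with open("enwiki-yymmdd-cirrussearch-content.json", encoding="utf-8") as f:
    for meta_line, page_line in zip(f, f):     # consume the file two lines at a time
        meta = json.loads(meta_line)           # e.g. {"index": {"_id": "12", ...}}
        page = json.loads(page_line)           # fields such as "title", "text", "namespace"
        if page.get("namespace") == 0:         # keep only main-namespace articles
            print(meta["index"]["_id"], page["title"])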

All Languages

These files contain the data for all 31 languages (each of the three files contains one of the three types of data). They are therefore very large files.

Each Language

These files contain all the data for a single language. They are easier to download if you plan to participate in only some of the 30 languages.