SHINRA2020-ML: Data Download

Depending on your situation, select the option that fits you best.

  • Just want to take a peek!

If you want to see what the data is like, you can download the (1) Trial Datasets, which are a subset of the (2) Minimum Datasets.

  • Participate in it with minimum effort: you need only the minimum data

If you are going to participate in the task with only the required data, please download the (2) Minimum Datasets for your target language.

  • Participate in it with full effort: you can get all the data

If you are strongly committed to the task, download any or all of the (3) Additional Datasets in addition to the (2) Minimum Datasets.

Note that you need an account to get any of the data provided for the task. Please create your SHINRA account on the SHINRA: Sign in page.

We have divided the data into several datasets, since the full data may be too large to download at once.
For details on the datasets and their corresponding data formats, please check the linked pages listed below.

Datasets and their corresponding data formats

(1) Trial Datasets
    Training data for each language; Wikipedia Dump: Cirrus Dump (JSON)
(2) Minimum Datasets
    Training data for each language; Wikipedia Dump: Cirrus Dump (JSON)
(3-1) Japanese Wikipedia articles classified into Extended Named Entity categories
    Japanese Wikipedia categorized into Extended Named Entity
(3-2) Language Link information between Wikipedia of different languages
    Language links for 31 language Wikipedias
(3-3) Script to build the training data using (3-1) and (3-2)
(3-4) Wikipedia dump data in 31 languages
    Wikipedia Dump: Wiki Dump (XML); Wikipedia Dump: Cirrus Dump (JSON)
(3-5) Extended Named Entity Definition
    Extended Named Entity Definition

If it is difficult for you to download large files over the internet, we can deliver all the data to you by traditional mail. Please send an email to:

shinra2020ml-info _(at)_ googlegroups.com.

We will ship all the data (including both the (2) Minimum Datasets and the (3) Additional Datasets) on a USB memory stick. It costs $100.


(1) Trial Datasets

These datasets are a subset of the (2) Minimum Datasets, provided as examples of the training data classified into categories, together with the corresponding portion of the target data (Wikipedia Cirrus Dump).

These datasets are intended for those who:

  • want to see the data before deciding to participate, or
  • want to run a trial training for the time being.

 

(2) Minimum Datasets

The minimum datasets contain the training data classified into ENE (ver.8.0) categories and the entire target data (Wikipedia Cirrus dump).

These datasets are intended for those who:

  • have decided to participate.

 

(3) Additional Datasets

You can download any of the following additional data. If you would like to build the training data by yourself, use (3-1), (3-2) and (3-3).

Note that these datasets may or may not help you achieve the maximum performance.

 

(3-1) Japanese Wikipedia articles classified into Extended Named Entity categories

(3-2) Language Link information between Wikipedia of different languages

Language Link information between Wikipedias of different languages is available in JSON format. We cleaned the SQL dumps and added the page IDs necessary for the task.

  • Please note that the packages are available in JSON format only. We have corrected the misleading names and descriptions of the above packages.
  • Note that the Language Link from Hindi is created from the Language Link to Hindi in JSON, since there is no Language Link from Hindi in the SQL dumps.

 

(3-3) Script to build the training data using (3-1) and (3-2)

If you would like to build the training data using (3-1) and (3-2) by yourself, the script is available at the following page:

 

(3-4) Wikipedia dump data in 31 languages

Wikipedia dumps are available in the following formats. 

1) Wiki Dump
XML format dump data of Wikipedia articles.
2) Cirrus Dump
Wikipedia dump data for Elasticsearch. It includes not only the articles but also information used for search purposes. In general, the XML markup has been removed, making it easier to handle for NLP purposes.

There are three types of files.

(a) wikidump
XML format dump data of Wikipedia articles.
(file name) XXwiki-yymmdd-pages-articles.xml.bz2
(b) cirrusdump-content
Encyclopedia article pages in Wikipedia Main name space.
(file name) XXwiki-yymmdd-cirrussearch-content.json
(c) cirrusdump-general
Wikipedia pages in any name space, including talk pages, templates, etc.
(file name) XXwiki-yymmdd-cirrussearch-general.json
Note that (b) cirrusdump-content for Greek is not available.
For English (en), Arabic (ar) and Vietnamese (vi), (c) cirrusdump-general includes all the article dumps of (b) cirrusdump-content.
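
As an illustration, the following minimal Python sketch iterates over a cirrusdump-content file, assuming the usual Elasticsearch bulk layout in which an index-metadata line (carrying the page ID) alternates with a document line (carrying fields such as "title" and "text"). The file name and field names below are only examples; please check them against the actual dump you download.

import json

def iter_cirrus_pages(path):
    # Assumes the Elasticsearch bulk format: an index-metadata line
    # is followed by the document line for the same page.
    with open(path, encoding="utf-8") as f:
        while True:
            meta_line = f.readline()
            doc_line = f.readline()
            if not doc_line:
                break
            meta = json.loads(meta_line)
            doc = json.loads(doc_line)
            yield meta.get("index", {}).get("_id"), doc

# Example: print the page ID, title and the first 80 characters of each article.
for page_id, doc in iter_cirrus_pages("enwiki-190120-cirrussearch-content.json"):
    print(page_id, doc.get("title"), doc.get("text", "")[:80])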

 

You can download either of the following:

(3-4-1) All languages

These files contain the data for all 31 languages, so they are very large.

If the Wikipedia Dump (2020ML_31ArticleDump) is too large for you to download, please use the split files instead.

(3-4-1-1) All languages (split files)

2020ML_31ArticleDump (Wikidump, 52 GB) is split across 4 files. Please concatenate all 4 parts into one file:

$ cat 30wikidump_articles.tar.bz2-* > 30wikidump_articles.tar.bz2
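
If the cat command is not available (for example on Windows), the same concatenation can be done with a short Python script. This is just a convenience sketch, assuming the part suffixes sort in the correct order.

import glob, shutil

# Equivalent to: cat 30wikidump_articles.tar.bz2-* > 30wikidump_articles.tar.bz2
parts = sorted(glob.glob("30wikidump_articles.tar.bz2-*"))
with open("30wikidump_articles.tar.bz2", "wb") as out:
    for part in parts:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out)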

(3-4-2) Each Language

These files contain all the data for each individual language. They are easier to download if you plan to participate in only some of the 30 languages.

If the English Dumps (EnglishDumps190120) are too large for you to download, please use the split files instead.

 

(3-4-2-1) Each language (split files)

EnglishDumps190120 (Wikipedia dumps (Wikidump and Cirrusdump), 86 GB) is split across 5 files. Please concatenate all 5 parts into one zip file:

$ cat English.zip-* > English.zip

 

(3-5) Extended Named Entity Definition

Extended Named Entity Definition v8.0 is available at:
ENE Definition v8.0.
Please check the license.

You can also get an overview of the Extended Named Entity hierarchy and browse its versions on the Extended Named Entity page.


Data License

These contents are licensed under CC BY-SA 3.0. For details, please check Wikipedia: Reusing Wikipedia content.