SHINRA2020-ML Data Download


Depending on your situation, select the option that fits you best. (For data examples of each format, see Data Formats.)

  • Just want to take a quick look

If you want to see what the data look like, you can download the (1) Trial Datasets, which contain parts of the (2) Minimum Datasets.

  • Participate with minimum effort: you may need only the minimum data

If you are going to participate in the task using only the required data, please download the (2) Minimum Datasets for your target language(s).

  • Participate with full effort: you can get all the data

If you are strongly committed to the task, download any or all of the (3) Additional Datasets in addition to the (2) Minimum Datasets.

Note that you need an account to download any of the data provided for the task. Please create your SHINRA account on the SHINRA: Sign in page.

We split the data into several datasets because the whole collection may be too large to download at once.

If downloading large files over the internet is difficult for you, we can ship the data by postal mail instead. Please send an email to shinra2020ml-info _(at)_ googlegroups.com, and we will ship all the data (both the Minimum Datasets and the Additional Datasets) on a USB memory stick. It costs $100.


(1) Trial Datasets

These datasets are a small part of the Minimum Datasets, provided as examples of the training data classified into categories and of the target data to be classified.

These datasets are intended for those who:

      • want to see the data before deciding to participate.
      • want to run a trial training first.


(2) Minimum Datasets

The Minimum Datasets contain the training data classified into ENE (Extended Named Entity) categories and the target data (Cirrus dumps) to be classified.

These datasets are intended for those who:

      • have decided to participate.
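As a rough guide to working with the categorized training data described above, here is a minimal loading sketch in Python. It assumes a JSON-lines layout and hypothetical field names ("pageid", "ENEs"); the actual layout is defined on the Data Formats page, so adjust accordingly.

    import json
    from collections import Counter

    # Minimal sketch, assuming one JSON object per line with hypothetical
    # "pageid" and "ENEs" fields. Check the Data Formats page for the real
    # field names before relying on this.
    def count_ene_categories(path):
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                for ene in record.get("ENEs", []):
                    counts[ene] += 1
        return counts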


(3) Additional Datasets

You can download any of the following additional data: categorized Japanese Wikipedia articles, language link information, Wikipedia dump data, and the script to build training data. Note that the Additional Datasets may or may not help you achieve maximum performance; according to our local expert, you may NOT get much benefit from them.


(3-1) Japanese Wikipedia articles classified into Extended Named Entity categories

(3-2) Language Link information between Wikipedia of different languages

Language link information between Wikipedias in different languages is available either as the SQL data provided by MediaWiki or as cleaned-up JSON data. Note that the SQL data does not include the page IDs used in the training and target data of the task; that information has been added to the JSON data, so the JSON data is much easier to use.

Attention! In the JSON data, the language links from Hindi were created by inverting the language links to Hindi, because the SQL data contains no language links from Hindi.
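As a small illustration of why the JSON data is easier to use, the following Python sketch builds a lookup table keyed by the page IDs that the SQL data lacks. The JSON-lines layout and the field name "pageid" are assumptions for illustration; check the Data Formats page for the actual structure.

    import json

    # Minimal sketch, assuming the cleaned-up language-link data is one JSON
    # object per line keyed by a (hypothetical) "pageid" field.
    def load_language_links(path):
        """Map each source page ID to its full language-link record."""
        links = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                links[record["pageid"]] = record
        return links

    # Usage (file name is a placeholder):
    # links = load_language_links("language_links.json")
    # links[12345] would then give the linked pages for page ID 12345.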

(3-3) Script to build the training data using (3-1) and (3-2)

shinra2020_ml_train_maker

(3-4) Wikipedia dump data in 31 languages

Wikipedia dumps are available in the following formats. 

1) Wiki Dump: XML format dump data of Wikipedia articles

2) Cirrus Dump: Wikipedia dump data for Elasticsearch. It includes not only the articles but also information used for search. Since XML tags are removed, it is generally easier to handle for NLP purposes. See Misc Dumps Format for more details about the Cirrus Dump.

There are three types of files.

       (a) XXwiki-yymmdd-pages-articles.xml.bz2: (wikidump)
                    XML format dump data of Wikipedia articles

       (b) XXwiki-yymmdd-cirrussearch-content.json: (cirrusdump-content)
                    Encyclopedia article pages in the Wikipedia Main namespace.

       (c) XXwiki-yymmdd-cirrussearch-general.json: (cirrusdump-general)
                    Wikipedia pages in any namespace other than the Main namespace, such as talk pages, templates, etc.

Notice:
(b) (cirrusdump-content) is not available for Greek.
For English (en), Arabic (ar), and Vietnamese (vi), (c) (cirrusdump-general) also includes all the article pages of (b) (cirrusdump-content).
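To give a feel for how the cirrussearch files can be consumed, here is a minimal Python reading sketch. It assumes the usual Cirrus layout, in which lines alternate between an Elasticsearch "index" metadata line (carrying the page ID) and the page document itself with fields such as "title" and "text"; confirm the details against the Misc Dumps Format page, and open the file with gzip if your copy is still compressed.

    import json

    # Minimal sketch: stream (page_id, document) pairs out of a
    # cirrussearch-content dump, assuming Elasticsearch bulk format
    # (alternating metadata and document lines).
    def iter_cirrus_pages(path):
        with open(path, encoding="utf-8") as f:
            while True:
                meta_line = f.readline()
                doc_line = f.readline()
                if not meta_line or not doc_line:
                    break
                meta = json.loads(meta_line)
                doc = json.loads(doc_line)
                yield meta["index"]["_id"], doc

    # Example: print the title of the first page
    # (replace the placeholder file name with the one you downloaded).
    for page_id, doc in iter_cirrus_pages("XXwiki-yymmdd-cirrussearch-content.json"):
        print(page_id, doc.get("title"))
        break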

You can download the files in one of the following:

(3-4-1) All languages

These files contain the data for all 31 languages (each of the three files contains one of the three file types above). They are therefore very large files.

(3-4-2) Each Language

These files contain all the data for a single language. They are easier to download if you plan to participate in only some of the 30 target languages.

License

These contents are licensed under the CC BY-SA 3.0 license. For details, please check Reusing Wikipedia content.