Depending on your situation, select the option that fits you best. (For example data in each format, see Data Formats.)
Just want to take a peek: the (1) Trial Datasets may be all you need.
Participate with minimum effort: you may need only the minimum data.
If you plan to participate in the task with only the required data, please download the (2) Minimum Datasets for your target language.
Participate with full effort: you can get all the data.
We split the data into several datasets, because it may be too large to download at once.
If downloading a large file over the internet is difficult for you, we can deliver all the data by traditional postal mail instead. Please send an email to shinra2020ml-info _(at)_ googlegroups.com, and we will ship all the data (both the Minimum Datasets and the Additional Datasets) on a USB memory stick. This costs $100.
(1) Trial Datasets
These datasets are a subset of the Minimum Datasets, provided as examples of the training data classified into categories and of the target data to be classified.
They are intended for the following people:
- Those who want to see the data before deciding whether to participate.
- Those who want to run a trial training first.
(2) Minimum Datasets
The Minimum Datasets contain the training data classified into ENE categories and the target data (Cirrus dump) to be classified.
They are intended for the following people:
- Those who have decided to participate.
(3) Additional Datasets
You can download any of the following additional data: Japanese Wikipedia articles classified into categories, language link information, Wikipedia dump data, and a script to build the training data. Note that the Additional Datasets may or may not help you achieve the best performance; in our local expert's experience, you may NOT get much benefit from them.
(3-1) Japanese Wikipedia articles classified into Extended Named Entity categories
(3-2) Language Link information between Wikipedia of different languages
Language link information between different-language Wikipedias is available either as the SQL data provided by MediaWiki or as cleaned-up JSON data. Note that the SQL data does not include the page IDs used in the training and target data of the task; that information has been restored in the JSON data, so the JSON data is much easier to use.
Attention! In the JSON data, the language links from Hindi were created from the language links to Hindi, because the SQL data contains no language links from Hindi.
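Because the JSON links carry page IDs, they can be joined directly against the training and target data. The sketch below shows one way to build a lookup table; the field names `pageid`, `lang`, and `title` are illustrative assumptions, not the documented schema, so check them against the actual files:

```python
import json

def load_langlink_index(lines):
    """Build a {(lang, title) -> pageid} lookup from language-link JSON.

    One JSON object per line is assumed here; the field names 'pageid',
    'lang', and 'title' are guesses for illustration only.
    """
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        index[(rec["lang"], rec["title"])] = rec["pageid"]
    return index
```

With a real file, you would pass `open(path, encoding="utf-8")` as `lines`.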
(3-3) Script to build the training data using (3-1) and (3-2)
(3-4) Wikipedia dump data in 31 languages
Wikipedia dumps are available in the following formats.
1) Wiki Dump: XML-format dump data of Wikipedia articles
2) Cirrus Dump: Wikipedia dump data for Elasticsearch. It includes not only the articles but also information for search purposes. In general, the XML tags are removed, which makes it easier to handle for NLP purposes. See Misc Dumps Format for more details about the Cirrus dump.
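Concretely, a Cirrus dump is newline-delimited JSON in the Elasticsearch bulk format: lines alternate between an `{"index": ...}` action object (carrying the page ID as `_id`) and the page document itself. A minimal reader might look like the sketch below; the gzip packaging and the exact document fields should be verified against the downloaded files:

```python
import gzip
import json

def iter_cirrus_docs(fileobj):
    """Yield (action, document) pairs from a Cirrus dump file object.

    Cirrus dumps use the Elasticsearch bulk format: each {"index": {...}}
    action line is followed by the JSON document for that page.
    """
    while True:
        action_line = fileobj.readline()
        if not action_line:
            break
        doc_line = fileobj.readline()
        yield json.loads(action_line), json.loads(doc_line)

def open_cirrus(path):
    """The dumps are distributed gzip-compressed; read them as UTF-8 text."""
    return gzip.open(path, "rt", encoding="utf-8")
```

For example, iterating over `iter_cirrus_docs(open_cirrus(path))` gives you `action["index"]["_id"]` (the page ID) and fields such as `doc["title"]` and `doc["text"]` for each page.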
There are three types of files.
(a) XML format dump data of Wikipedia articles
(b) XXwiki-yymmdd-cirrussearch-content.json: (cirrusdump-content)
Encyclopedia article pages in the Wikipedia Main namespace.
(c) XXwiki-yymmdd-cirrussearch-general.json: (cirrusdump-general)
Wikipedia pages in any namespace other than the Main namespace, such as talk pages, templates, etc.
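Following the naming pattern above (`XX` = language code, `yymmdd` = dump date), the two Cirrus file types can be told apart mechanically. This is only a sketch: the number of date digits and the optional `.gz` suffix in the regex are assumptions.

```python
import re

# "XX" is the language code and "yymmdd" the dump date; allowing 6-8 date
# digits and an optional .gz suffix here is an assumption.
CIRRUS_NAME = re.compile(
    r"^(?P<lang>[a-z-]+)wiki-(?P<date>\d{6,8})"
    r"-cirrussearch-(?P<kind>content|general)\.json(?:\.gz)?$"
)

def classify_cirrus_file(name):
    """Return (language, date, 'content' or 'general'), or None if no match."""
    m = CIRRUS_NAME.match(name)
    return m.group("lang", "date", "kind") if m else None
```

This can be handy for sorting a directory of downloaded dumps by language and file type before processing.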
You can download the files in one of the following ways:
- (a) or (b), for all 31 languages → (3-4-1) All Languages
- (a), (b), and (c), for each language → (3-4-2) Each Language
(3-4-1) All languages
These files contain the data for all 31 languages (each of the three files contains one of the three types of data), so they are very large.
(3-4-2) Each Language
These files contain all the data for a single language. They are easier to download if you plan to participate in only some of the 30 languages.