SHINRA2020-ML: Data Formats

Data formats and their corresponding data download pages

Data Format | Data Download Page
Training data for each language | (1) Trial Datasets, (2) Minimum Datasets
Japanese Wikipedia categorized into Extended Named Entity | (3-1) Japanese Wikipedia articles classified into Extended Named Entity categories
Language links for 31 language Wikipedias | (3-2) Language Link information between Wikipedia of different languages
Wikipedia Dump: Wiki Dump (XML) | (3-4) Wikipedia dump data in 31 languages (format: 1) WikiDump)
Wikipedia Dump: Cirrus Dump (JSON) | (1) Trial Datasets, (2) Minimum Datasets, (3-4) Wikipedia dump data in 31 languages (format: 2) CirrusDump)
Extended Named Entity Definition | (3-5) Extended Named Entity Definition
Submission Format | -

Note that in the following examples, each record, which occupies a single line in the actual data, is split over multiple lines with whitespace inserted for human readability.

Training data for each language [JSON]

example

{
   "pageid": 187830,
   "title": "Präfektur Tokio",
   "ja_pageid": 774362,
   "ja_title": "東京都",
   "_stamp": "HAND.AIP.201910",
   "ENEs": [
      {
         "ENE_id": "1.5.1.2",
         "ENE_name": "Province"
      }
   ]
}
Data description
name | optionality | explanation | note
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump.
title | | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump.
ja_pageid | | Wikipedia page ID of the corresponding Japanese page from which the target page is linked. |
ja_title | | Page title of the corresponding Japanese page from which the target page is linked. |
_stamp | | Flag to distinguish between annotation types. | The value is either “AUTO.TOHOKU.201906” or “HAND.AIP.201910”. “AUTO” denotes that the value is estimated by the system, and “HAND” denotes human annotation.
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. | Info on ENE categories assigned for the Japanese page. [NOTICE] Each Wikipedia page in the training data has a language link from Japanese Wikipedia and is assigned the same ENE category (ver.8.0) as the link source page.
ENE_id | | ENE (ver.8.0) id of the category. |
ENE_name | | ENE (ver.8.0) name of the category. |
Notice:
There are cases where multiple records in the training data share the same pageid, which means that the page is linked from multiple Japanese pages.
In such cases, please get the ENE_ids of the page from all the records with the same pageid in the training data.
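
The merging step described in the notice above can be sketched as follows. This is a minimal sketch, assuming the training data is stored as one JSON record per line; the function name collect_ene_ids and the second sample record (including its ENE_id "9.9.9.9") are hypothetical and only serve to illustrate a shared pageid.

```python
import json
from collections import defaultdict


def collect_ene_ids(lines):
    """Map each pageid to the set of ENE_ids found across all of its records."""
    ene_ids_by_page = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for ene in record["ENEs"]:
            ene_ids_by_page[record["pageid"]].add(ene["ENE_id"])
    return dict(ene_ids_by_page)


# The first record is the example above; the second is invented for illustration.
sample_training = [
    '{"pageid": 187830, "title": "Präfektur Tokio", "ja_pageid": 774362, '
    '"ja_title": "東京都", "_stamp": "HAND.AIP.201910", '
    '"ENEs": [{"ENE_id": "1.5.1.2", "ENE_name": "Province"}]}',
    '{"pageid": 187830, "title": "Präfektur Tokio", "ja_pageid": 0, '
    '"ja_title": "(hypothetical)", "_stamp": "AUTO.TOHOKU.201906", '
    '"ENEs": [{"ENE_id": "9.9.9.9", "ENE_name": "Hypothetical"}]}',
]

for pageid, ene_ids in collect_ene_ids(sample_training).items():
    print(pageid, sorted(ene_ids))
```

A set is used so that an ENE_id repeated across records with the same pageid is counted once.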

Japanese Wikipedia categorized into Extended Named Entity [JSON]

example


{
   "pageid": 774362,
   "title": "東京都",
   "ENEs": {
      "AUTO.TOHOKU.201906": [
         {
            "ENE_id": "1.5.1.2",
            "prob": 0.9403673410415649
         }
      ],
      "HAND.AIP.201910": [
         {
            "ENE_id": "1.5.1.2",
            "prob": 1.0
         }
      ]
   }
}

Data description
name | optionality | explanation | note
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump.
title | | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump.
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. | Info on ENE categories assigned for the Japanese page.
ENE_id | | ENE (ver.8.0) id of the category. |
prob | | The probability of the ENE (ver.8.0) category estimated for the page. | For human-categorized pages, the value is set to 1.0.
“AUTO.TOHOKU.201906”, “HAND.AIP.201910” | | Flags to distinguish between annotation types. | “HAND” denotes human annotation. [NOTICE] If both “AUTO.TOHOKU.201906” and “HAND.AIP.201910” are included in ENEs, please use the latter, which is annotated by humans.
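
The rule of preferring human annotation over the automatic estimate can be sketched as follows. This is only an illustration: the function name preferred_enes is hypothetical, and falling back to the automatic annotation when no human annotation exists is an assumption, not something the specification states.

```python
HAND_KEY = "HAND.AIP.201910"
AUTO_KEY = "AUTO.TOHOKU.201906"


def preferred_enes(record):
    """Return the human annotations if present, otherwise the automatic ones."""
    enes = record["ENEs"]
    if HAND_KEY in enes:
        return enes[HAND_KEY]
    # Assumed fallback: use the system estimate when no human annotation exists.
    return enes.get(AUTO_KEY, [])


# The record from the example above.
sample_ja_record = {
    "pageid": 774362,
    "title": "東京都",
    "ENEs": {
        AUTO_KEY: [{"ENE_id": "1.5.1.2", "prob": 0.9403673410415649}],
        HAND_KEY: [{"ENE_id": "1.5.1.2", "prob": 1.0}],
    },
}

print(preferred_enes(sample_ja_record))
```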

Language links for 31 language Wikipedias [JSON]

example


{
   "source":{
      "pageid":774362,
      "lang":"ja",
      "title":"東京都"
   },
   "destination":{
      "pageid":30057,
      "lang":"en",
      "title":"Tokyo"
   }
}

{
   "source":{
      "pageid":774362,
      "lang":"ja",
      "title":"東京都"
   },
   "destination":{
      "pageid":187830,
      "lang":"de",
      "title":"Präfektur Tokio"
   }
}

Data description
name | optionality | explanation | note
source | | Info on the start page (source page) of the interlanguage link. |
destination | | Info on the target page (destination page) of the interlanguage link. |
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump.
lang | | Language code of the page. | ‘wgPageContentLanguage’ in web page, ‘language’ in Cirrus Dump.
title | | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump.
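
One way to use the link records is to index them by source page, so that the page in each target language can be looked up from a Japanese pageid. The following is a minimal sketch, assuming one JSON record per line; the function name index_links is hypothetical.

```python
import json
from collections import defaultdict


def index_links(lines):
    """Map (source lang, source pageid) to {destination lang: destination pageid}."""
    links = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        link = json.loads(line)
        src, dst = link["source"], link["destination"]
        # If a source page had two links to the same language, the later one wins here.
        links[(src["lang"], src["pageid"])][dst["lang"]] = dst["pageid"]
    return dict(links)


# The two records from the example above, each on a single line.
sample_links = [
    '{"source": {"pageid": 774362, "lang": "ja", "title": "東京都"}, '
    '"destination": {"pageid": 30057, "lang": "en", "title": "Tokyo"}}',
    '{"source": {"pageid": 774362, "lang": "ja", "title": "東京都"}, '
    '"destination": {"pageid": 187830, "lang": "de", "title": "Präfektur Tokio"}}',
]

print(index_links(sample_links)[("ja", 774362)])
```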

Wikipedia Dump: Wiki Dump [XML]

example

  <page>
    <title>Präfektur Tokio</title>
    <ns>0</ns>
    <id>187830</id>
    <revision>
      <id>183496717</id>
      <parentid>179947161</parentid>
      <timestamp>2018-12-07T21:06:42Z</timestamp>
      .....
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{Infobox Japanische Präfektur
|Name           = Tokio
|Kanji          = 東京都
.....
}}

Die '''Präfektur Tokio''' ({{jaS|東京都|Tōkyō-to}}, {{enS|Tokyo Prefecture}} oder in Eigenbezeichnung '''Tokyo Metropolis''', oft nur ''Tokyo'') ist eine der [[Präfektur (Japan)|Präfekturen Japans]] und liegt größtenteils in der [[Kantō-Ebene]]. 
.....
{{SORTIERUNG:Prafektur Tokio}}
[[Kategorie:Japanische Präfektur|Tokio]]
[[Kategorie:Präfektur Tokio| ]]</text>
      <sha1>mw331f8297bhfnc6e99vpnjargx6155</sha1>
    </revision>
  </page>
Data description
name | optionality | explanation | note
page | | Wikipedia page info. |
title | | Title of the page. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump.
ns | | Number of the Wikipedia namespace. | ‘0’ means (Main/Article), i.e., the main namespace (or article namespace) of the Wikipedia page. For further details, see Wikipedia: Help:MediaWiki namespace.
id | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump.
revision | | Revision info of the page. |
id (child element of revision) | | Revision ID. |
timestamp | | Timestamp of the revision. |
text | | Text of the page revision. |
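
The page elements can be read in a streaming fashion with Python's standard xml.etree.ElementTree module. The following is a minimal sketch: the function names local and iter_pages are hypothetical, and the sample wraps a shortened page in a bare mediawiki root element. Full MediaWiki dumps declare an XML namespace on the root element, so the sketch strips any namespace prefix from tag names.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop an XML namespace prefix, e.g. '{uri}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page id, title, ns, text) for each page element in a dump stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        # Direct children of page: title, ns, id, revision.
        fields = {local(child.tag): child for child in elem}
        revision = fields["revision"]
        text = next((c.text for c in revision if local(c.tag) == "text"), None)
        yield int(fields["id"].text), fields["title"].text, int(fields["ns"].text), text or ""
        elem.clear()  # release the page content once it has been processed


# Shortened version of the example above, wrapped in an assumed root element.
sample_xml = """<mediawiki>
  <page>
    <title>Präfektur Tokio</title>
    <ns>0</ns>
    <id>187830</id>
    <revision>
      <id>183496717</id>
      <timestamp>2018-12-07T21:06:42Z</timestamp>
      <text xml:space="preserve">Die '''Präfektur Tokio''' ist eine der Präfekturen Japans.</text>
    </revision>
  </page>
</mediawiki>""".encode("utf-8")

for page_id, title, ns, text in iter_pages(io.BytesIO(sample_xml)):
    if ns == 0:  # keep only the main (article) namespace
        print(page_id, title, len(text))
```

Because only the direct children of page are collected, the page ID is not confused with the revision ID nested inside revision.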

Wikipedia Dump: Cirrus Dump [JSON]

example

{
   "index": {
      "_type": "page",
      "_id": "187830"
   }
}
{
   "template": [
      .....
   ],
   "content_model": "wikitext",
   "opening_text": "Die Präfektur Tokio (japanisch 東京都 Tōkyō-to, englisch Tokyo Prefecture oder in Eigenbezeichnung Tokyo Metropolis, oft nur Tokyo) ist eine der Präfekturen Japans und liegt größtenteils in der Kantō-Ebene. .....",
   "wiki": "dewiki",
   "auxiliary_text": [
      "Tōkyō-to 東京都 Basisdaten Verwaltungssitz: Shinjuku, Tokio Region: Kantō ....."
   ],
   "language": "de",
   "title": "Präfektur Tokio",
   "text": "Die Präfektur Tokio (japanisch 東京都 Tōkyō-to, englisch Tokyo Prefecture oder in Eigenbezeichnung Tokyo Metropolis, oft nur Tokyo) ist eine der Präfekturen Japans und liegt größtenteils in der Kantō-Ebene. .....",
   "defaultsort": "Prafektur Tokio",
   "timestamp": "2018-12-07T21:06:42Z",
   "redirect": [
      {
         "namespace": 0,
         "title": "Präfektur Tōkyō"
      }, 
      {
         "namespace": 0,
         "title": "Präfektur Tokyo"
      }, 
      .....
   ],
   "wikibase_item": "Q1490",
      .....
   "source_text": "{{Infobox Japanische Präfektur\n|Name           = Tokio\n|Kanji          = 東京都\n.....",
   .....
   "namespace_text": "",
   "namespace": 0,
   "text_bytes": 34389,
   "incoming_links": 1550,
   "category": [
      "Japanische Präfektur",
      "Präfektur Tokio"
   ],
   "outgoing_link": [
      "Südkorea",
      .....
   ],
   "popularity_score": 3.7743927508022694e-06,
   "create_timestamp": "2004-04-19T16:48:45Z"
}
Data description
name | optionality | explanation | note
_type | | One of ‘page’ or ‘namespace’. |
_id | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘pageid’ in Training data.
opening_text | | Text before the first heading. |
language | | Language code of the page. | ‘wgPageContentLanguage’ in web page, ‘lang’ in Language links.
title | | Page title. | ‘wgTitle’ in web page.
text | | Text of the page. |
timestamp | | Timestamp of the revision. |
redirect | | Info on the pages that redirect to this page. |
namespace | | Number of the Wikipedia namespace. | ‘0’ means (Main/Article), i.e., the main namespace (or article namespace) of the Wikipedia page. For further details, see Wikipedia: Help:MediaWiki namespace.
wikibase_item | | Wikidata entity ID. |
source_text | | Source text of the page. |
incoming_links | | Number of pages that link to this page. |
category | | List of categories to which this page belongs. |
outgoing_link | | List of links that lead to other pages. |

[Reference]: MediaWiki: Data dumps/Misc dumps format
For CirrusSearch, see MediaWiki: Help:CirrusSearch.
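
As the example shows, each page in the Cirrus Dump consists of two JSON objects: an index object followed by a content object. The following is a minimal sketch of reading such pairs, assuming each object occupies exactly one line and the two kinds of lines strictly alternate; the function name iter_cirrus_pages is hypothetical. Note that ‘_id’ is a string in the example, whereas ‘pageid’ in the training data is a number, so the sketch converts it to an integer.

```python
import json


def iter_cirrus_pages(lines):
    """Yield (page id, content document) pairs from alternating index/content lines."""
    non_blank = (line for line in lines if line.strip())
    for index_line in non_blank:
        content_line = next(non_blank, None)
        if content_line is None:
            break  # trailing index line without a content line
        index = json.loads(index_line)["index"]
        document = json.loads(content_line)
        yield int(index["_id"]), document


# Shortened version of the example above, one JSON object per line.
sample_cirrus = [
    '{"index": {"_type": "page", "_id": "187830"}}',
    '{"title": "Präfektur Tokio", "language": "de", "namespace": 0, "wikibase_item": "Q1490"}',
]

for page_id, document in iter_cirrus_pages(sample_cirrus):
    if document.get("namespace") == 0:  # keep only article pages
        print(page_id, document["title"], document["language"])
```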

Extended Named Entity Definition (English/Japanese) [JSON]

example

{
   "ENE_id":"1.4.7.2",
   "definition":{
      "en":"A name of a political party, which is an organized group of
       people who come together to engage in political activities. A 
         smaller group inside a political party is not included here, 
         but in 1.4.7.0 Political_Organization_Other Category. ",
      "ja":"政治活動を行う政党や会派の名前。派閥など、政党内の小グループについては
         「政治的組織名_その他」とする。"
   },
   "name":{
      "en":"Political_Party",
      "ja":"政党名"
   },
   "parent_category":"1.4.7",
   "children_category":[
   ]
}
Data description
name | optionality | explanation | note
ENE_id | | ENE (Extended Named Entity) (ver.8.0) id of the category. |
definition | | Definition of the category. |
en | | English. |
ja | | Japanese. |
name | | Name of the category in ENE (ver.8.0). |
parent_category | | Immediate upper category in ENE (ver.8.0). |
children_category | | Immediate lower category in ENE (ver.8.0). |
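
The parent_category field makes it possible to walk up the ENE hierarchy from any category. The following is a minimal sketch, assuming the definitions have been loaded into a dictionary keyed by ENE_id; the function name ancestors is hypothetical, and the entries for "1.4.7" and "1.4" below are invented placeholders rather than actual definition records.

```python
def ancestors(ene_id, definitions):
    """Return the chain of parent ENE_ids, nearest first, as far as the definitions allow."""
    chain = []
    seen = {ene_id}
    current = definitions.get(ene_id)
    while current is not None:
        parent = current.get("parent_category")
        if not parent or parent in seen:
            break  # reached the top, or guard against a cycle
        chain.append(parent)
        seen.add(parent)
        current = definitions.get(parent)
    return chain


# "1.4.7.2" follows the example above; the other two entries are hypothetical.
sample_definitions = {
    "1.4.7.2": {"ENE_id": "1.4.7.2", "name": {"en": "Political_Party", "ja": "政党名"},
                "parent_category": "1.4.7", "children_category": []},
    "1.4.7": {"ENE_id": "1.4.7", "parent_category": "1.4", "children_category": ["1.4.7.2"]},
    "1.4": {"ENE_id": "1.4", "parent_category": None, "children_category": ["1.4.7"]},
}

print(ancestors("1.4.7.2", sample_definitions))
```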

Submission Format [JSON]

example

{
   "pageid": 34550,
   "title": "Der kleine Prinz",
   "ENEs": [
      {
         "ENE_id": "1.7.19.3",
         "ENE_name": "Movie",
         "score": 0.684
      }, 
      {
         "ENE_id": "1.7.19.6",
         "ENE_name": "Book",
         "score": 0.924
      }, 
      {
         "ENE_id": "1.7.19.2",
         "ENE_name": "Broadcast_Program",
         "score": 0.213
      }, 
      {
         "ENE_id": "1.7.19.4",
         "ENE_name": "Show",
         "score": 0.107
      }
   ]
}
Data description
name | optionality | explanation | note
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump.
title | optional | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump.
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. | Info on ENE categories estimated for the page.
ENE_id | | The ENE (ver.8.0) id of the category estimated by the system. | NOTICE: The ENE_id is evaluated regardless of the score.
ENE_name | optional | The ENE (ver.8.0) name of the category estimated by the system. |
score | optional but highly recommended | A score for each ENE (ver.8.0) category predicted for the page. | NOTICE: The score should preferably be normalized to the range from 0 to 1. The range of non-normalized scores should be specified in the system description report.
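
A submission record can be produced along the following lines. This is a minimal sketch, assuming one JSON record per line in the submission file; the function name make_submission_line is hypothetical, and rejecting scores outside the 0 to 1 range is a design choice of this sketch, since the specification only recommends normalization.

```python
import json


def make_submission_line(pageid, predictions, title=None):
    """Serialize one page's predictions as a single-line JSON record.

    predictions: iterable of (ENE_id, score) pairs with scores already in [0, 1].
    """
    record = {"pageid": pageid}
    if title is not None:
        record["title"] = title  # optional field
    enes = []
    for ene_id, score in predictions:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} for {ene_id} is outside [0, 1]")
        enes.append({"ENE_id": ene_id, "score": score})
    record["ENEs"] = enes
    # ensure_ascii=False keeps non-ASCII titles readable in the output.
    return json.dumps(record, ensure_ascii=False)


# Two of the predictions from the example above.
line = make_submission_line(
    34550,
    [("1.7.19.6", 0.924), ("1.7.19.3", 0.684)],
    title="Der kleine Prinz",
)
print(line)
```

Since every listed ENE_id is evaluated regardless of its score, only the categories the system actually predicts should be passed in.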

※ All Wikipedia-related data is timestamped January 20, 2019.