What’s New
- New addition: SHINRA2021-ML Development Data. (2021/5/25)
Data Formats
Note that in the following examples each single-line record is split over multiple lines, with whitespace inserted for readability.
Training data for each language [JSON]
example
{ "pageid": 187830, "title": "Präfektur Tokio", "ja_pageid": 774362, "ja_title": "東京都", "_stamp": "HAND.AIP.201910", "ENEs": [ { "ENE_id": "1.5.1.2", "ENE_name": "Province" } ] }
name | optionality | explanation | note |
---|---|---|---|
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
title | | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump. |
ja_pageid | | Wikipedia page ID of the corresponding Japanese page from which the target page is linked. | |
ja_title | | Page title of the corresponding Japanese page from which the target page is linked. | |
_stamp | | Flag to distinguish between annotation types. The value is either “AUTO.TOHOKU.201906” or “HAND.AIP.201910”. “AUTO” denotes that the value was estimated by the system, and “HAND” denotes human annotation. | |
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. Info on ENE categories assigned for the Japanese page. | [NOTICE] Each Wikipedia page in the training data has a language link from Japanese Wikipedia and is assigned the same ENE category (ver.8.0) as the link source page. |
ENE_id | | ENE (ver.8.0) id of the category. | |
ENE_name | | ENE (ver.8.0) name of the category. | |
Notice:
Multiple records in the training data may share the same pageid, which means that the page is linked from multiple Japanese pages.
In such cases, collect the ENE_ids of the page from all the records with the same pageid in the training data.
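The merge described in the notice can be sketched as follows. This is a minimal sketch, assuming the training file holds one JSON record per line; the function name `collect_ene_ids` is hypothetical, and the second sample record is invented purely to show two records sharing a pageid.

```python
import json
from collections import defaultdict


def collect_ene_ids(lines):
    """Map each pageid to the set of ENE_ids found across all of its records."""
    ene_ids_by_page = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for ene in record["ENEs"]:
            ene_ids_by_page[record["pageid"]].add(ene["ENE_id"])
    return dict(ene_ids_by_page)


sample = [
    '{"pageid": 187830, "ja_pageid": 774362, "ENEs": [{"ENE_id": "1.5.1.2", "ENE_name": "Province"}]}',
    # Invented record (hypothetical ja_pageid and ENE_id), only to illustrate the merge.
    '{"pageid": 187830, "ja_pageid": 1, "ENEs": [{"ENE_id": "1.5.1.0"}]}',
]
merged = collect_ene_ids(sample)
print(merged[187830])
```

A set is used so that an ENE_id assigned through several Japanese pages is kept only once.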
Japanese Wikipedia categorized into Extended Named Entity [JSON]
example
{ "pageid": 774362, "title": "東京都", "ENEs": { "AUTO.TOHOKU.201906": [ { "ENE_id": "1.5.1.2", "prob": 0.9403673410415649 } ], "HAND.AIP.201910": [ { "ENE_id": "1.5.1.2", "prob": 1.0 } ] } }
name | optionality | explanation | note |
---|---|---|---|
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
title | | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump. |
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. Info on ENE categories assigned for the Japanese page. | |
ENE_id | | ENE (ver.8.0) id of the category. | |
prob | | The probability of the ENE (ver.8.0) category estimated for the page. | For human-categorized pages, the value is set to 1.0. |
“AUTO.TOHOKU.201906”, “HAND.AIP.201910” | | Flags to distinguish between annotation types. “HAND” denotes human annotation. | [NOTICE] If both “AUTO.TOHOKU.201906” and “HAND.AIP.201910” are included in ENEs, please use the latter, which was annotated by humans. |
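The preference for human annotation stated in the notice could be applied like this. It is a sketch only: the helper name `preferred_enes` is hypothetical, and falling back to an empty list when neither flag is present is an assumption, not documented behavior.

```python
import json

HAND = "HAND.AIP.201910"
AUTO = "AUTO.TOHOKU.201906"


def preferred_enes(record):
    """Return the human annotation if present, otherwise the system estimate."""
    enes = record["ENEs"]
    if HAND in enes:
        return enes[HAND]
    # Assumption: a record without either flag yields no categories.
    return enes.get(AUTO, [])


record = json.loads(
    '{"pageid": 774362, "title": "東京都", "ENEs": {'
    '"AUTO.TOHOKU.201906": [{"ENE_id": "1.5.1.2", "prob": 0.9403673410415649}], '
    '"HAND.AIP.201910": [{"ENE_id": "1.5.1.2", "prob": 1.0}]}}'
)
chosen = preferred_enes(record)
print(chosen)
```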
Language links for 31 language Wikipedias [JSON]
example
{ "source":{ "pageid":774362, "lang":"ja", "title":"東京都" }, "destination":{ "pageid":30057, "lang":"en", "title":"Tokyo" } } { "source":{ "pageid":774362, "lang":"ja", "title":"東京都" }, "destination":{ "pageid":187830, "lang":"de", "title":"Präfektur Tokio" } }
name | optionality | explanation | note |
---|---|---|---|
source | | Info on the start page (source page) of the interlanguage link. | |
destination | | Info on the target page (destination page) of the interlanguage link. | |
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
lang | | Language code of the page. | ‘wgPageContentLanguage’ in web page, ‘language’ in Cirrus Dump. |
title | | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump. |
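To look up the pages that a Japanese page links to in each language, the link records can be indexed by source pageid. The sketch below assumes one link record per line; `index_links` is a hypothetical name, and the sample reuses the two records from the example.

```python
import json
from collections import defaultdict


def index_links(lines):
    """Return {source pageid: {destination lang: [destination pageid, ...]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip():
            continue
        link = json.loads(line)
        src, dst = link["source"], link["destination"]
        index[src["pageid"]][dst["lang"]].append(dst["pageid"])
    return index


sample = [
    '{"source": {"pageid": 774362, "lang": "ja", "title": "東京都"}, '
    '"destination": {"pageid": 30057, "lang": "en", "title": "Tokyo"}}',
    '{"source": {"pageid": 774362, "lang": "ja", "title": "東京都"}, '
    '"destination": {"pageid": 187830, "lang": "de", "title": "Präfektur Tokio"}}',
]
links = index_links(sample)
print(links[774362]["de"])
```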
Wikipedia Dump: Wiki Dump [XML]
example
<page> <title>Präfektur Tokio</title> <ns>0</ns> <id>187830</id> <revision> <id>183496717</id> <parentid>179947161</parentid> <timestamp>2018-12-07T21:06:42Z</timestamp> ..... <model>wikitext</model> <format>text/x-wiki</format> <text xml:space="preserve">{{Infobox Japanische Präfektur |Name = Tokio |Kanji = 東京都 ..... }} Die '''Präfektur Tokio''' ({{jaS|東京都|Tōkyō-to}}, {{enS|Tokyo Prefecture}} oder in Eigenbezeichnung '''Tokyo Metropolis''', oft nur ''Tokyo'') ist eine der [[Präfektur (Japan)|Präfekturen Japans]] und liegt größtenteils in der [[Kantō-Ebene]]. ..... {{SORTIERUNG:Prafektur Tokio}} [[Kategorie:Japanische Präfektur|Tokio]] [[Kategorie:Präfektur Tokio| ]]</text> <sha1>mw331f8297bhfnc6e99vpnjargx6155</sha1> </revision> </page>
name | optionality | explanation | note |
---|---|---|---|
page | | Wikipedia page info. | |
title | | Title of the page. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump. |
ns | | Number of the Wikipedia namespace. | ‘0’ means (Main/Article), i.e., the main namespace (or article namespace) of the Wikipedia page. For further details, see Wikipedia: Help:MediaWiki namespace. |
id | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
revision | | Revision info of the page. | |
id (child element of revision) | | Revision ID. | |
timestamp | | Timestamp of the revision. | |
text | | Text of the page revision. | |
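A streaming reader for the elements listed above might look like the following sketch. It uses `xml.etree.ElementTree.iterparse` from the Python standard library; `iter_pages` and `local` are hypothetical names, and the `xmlns` value on the sample root element is an assumption about the dump header, which is why the code strips any namespace from tag names.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield a dict with title, ns, id and text for each <page> element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {}
        for child in elem:  # direct children only, so <id> is the page ID
            name = local(child.tag)
            if name in ("title", "ns", "id"):
                fields[name] = child.text
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        fields["text"] = sub.text or ""
        yield fields
        elem.clear()  # release the page content once it has been consumed


# Abbreviated sample; the root element and its xmlns are assumptions.
sample = io.BytesIO(
    """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Präfektur Tokio</title><ns>0</ns><id>187830</id>
<revision><id>183496717</id><timestamp>2018-12-07T21:06:42Z</timestamp>
<text xml:space="preserve">Die '''Präfektur Tokio''' ist eine der Präfekturen Japans.</text>
</revision></page></mediawiki>""".encode("utf-8")
)
pages = list(iter_pages(sample))
print(pages[0]["title"], pages[0]["id"])
```

Note that the values come back as strings, so the page ID needs `int(...)` before it is compared with `pageid` in the JSON files.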
Wikipedia Dump: Cirrus Dump [JSON]
example
{ "index": { "_type": "page", "_id": "187830" } } { "template": [ ..... ], "content_model": "wikitext", "opening_text": "Die Präfektur Tokio (japanisch 東京都 Tōkyō-to, englisch Tokyo Prefecture oder in Eigenbezeichnung Tokyo Metropolis, oft nur Tokyo) ist eine der Präfekturen Japans und liegt größtenteils in der Kantō-Ebene. .....", "wiki": "dewiki", "auxiliary_text": [ "Tōkyō-to 東京都 Basisdaten Verwaltungssitz: Shinjuku, Tokio Region: Kantō ...... ], "language": "de", "title": "Präfektur Tokio", "text": "Die Präfektur Tokio (japanisch 東京都 Tōkyō-to, englisch Tokyo Prefecture oder in Eigenbezeichnung Tokyo Metropolis, oft nur Tokyo) ist eine der Präfekturen Japans und liegt größtenteils in der Kantō-Ebene. ....., "defaultsort": "Prafektur Tokio", "timestamp": "2018-12-07T21:06:42Z", "redirect": [ { "namespace": 0, "title": "Präfektur Tōkyō" }, { "namespace": 0, "title": "Präfektur Tokyo" }, ..... ], "wikibase_item": "Q1490", ..... "source_text": "{ { Infobox Japanische Präfektur\n |Name = Tokio\n| |Kanji = 東京都\n| ..... ", ..... "namespace_text": "", "namespace": 0, "text_bytes": 34389, "incoming_links": 1550, "category": [ "Japanische Präfektur", "Präfektur Tokio" ], "outgoing_link": [ "Südkorea", ..... ], "popularity_score": 3.7743927508022694e-06, "create_timestamp": "2004-04-19T16:48:45Z" }
name | optionality | explanation | note |
---|---|---|---|
_type | | One of ‘page’ or ‘namespace’. | |
_id | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘pageid’ in Training data. |
opening_text | | Text before the first heading. | |
language | | Language code of the page. | ‘wgPageContentLanguage’ in web page, ‘lang’ in Language links. |
title | | Page title. | ‘wgTitle’ in web page. |
text | | Text of the page. | |
timestamp | | Timestamp of the revision. | |
redirect | | Redirect info of the pages which redirect to the page. | |
namespace | | Number of the Wikipedia namespace. | ‘0’ means (Main/Article), i.e., the main namespace (or article namespace) of the Wikipedia page. For further details, see Wikipedia: Help:MediaWiki namespace. |
wikibase_item | | Wikidata entity ID. | |
source_text | | Source text. | |
incoming_links | | Number of pages that link to this page. | |
category | | List of categories to which this page belongs. | |
outgoing_link | | List of links that lead to other pages. | |
[Reference]: MediaWiki: Data dumps/Misc dumps format
For CirrusSearch, see MediaWiki: Help:CirrusSearch.
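The Cirrus Dump example consists of two JSON objects per page: an index entry carrying `_id`, followed by the page document. A reader for that layout could be sketched as below, assuming each object sits on its own line and the two alternate; `iter_cirrus` is a hypothetical name and the sample document is abbreviated from the example.

```python
import json


def iter_cirrus(lines):
    """Yield (pageid, document) pairs from alternating index/document lines."""
    it = (line for line in lines if line.strip())
    for header_line in it:
        body_line = next(it, None)
        if body_line is None:
            break  # trailing index entry without a document
        header = json.loads(header_line)
        # '_id' is a string in the dump; convert it to match 'pageid' elsewhere.
        yield int(header["index"]["_id"]), json.loads(body_line)


sample = [
    '{"index": {"_type": "page", "_id": "187830"}}',
    '{"language": "de", "title": "Präfektur Tokio", "wikibase_item": "Q1490", "namespace": 0}',
]
docs = list(iter_cirrus(sample))
print(docs[0][0], docs[0][1]["title"])
```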
Extended Named Entity Definition (English/Japanese) [JSON]
example
{ "ENE_id":"1.4.7.2", "definition":{ "en":"A name of a political party, which is an organized group of people who come together to engage in political activities. A smaller group inside a political party is not included here, but in 1.4.7.0 Political_Organization_Other Category. ", "ja":"政治活動を行う政党や会派の名前。派閥など、政党内の小グループについては 「政治的組織名_その他」とする。" }, "name":{ "en":"Political_Party", "ja":"政党名" }, "parent_category":"1.4.7", "children_category":[ ] }
name | optionality | explanation | note |
---|---|---|---|
ENE_id | | ENE (Extended Named Entity) (ver.8.0) id of the category. | |
definition | | Definition of the category. | |
en | | English. | |
ja | | Japanese. | |
name | | Name of the category in ENE (ver.8.0). | |
parent_category | | Immediate upper category in ENE (ver.8.0). | |
children_category | | Immediate lower category in ENE (ver.8.0). | |
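The definition records can be turned into simple lookups, for example from ENE_id to English name, plus the set of categories without children. This is a sketch under the assumption of one definition record per line; `index_definitions` is a hypothetical name and the sample is the record from the example with the definition text shortened.

```python
import json


def index_definitions(lines):
    """Return ({ENE_id: English name}, {ENE_ids with no children})."""
    names, leaves = {}, set()
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        names[entry["ENE_id"]] = entry["name"]["en"]
        if not entry["children_category"]:
            leaves.add(entry["ENE_id"])
    return names, leaves


sample = [
    '{"ENE_id": "1.4.7.2", "definition": {"en": "A name of a political party, ...", "ja": "..."}, '
    '"name": {"en": "Political_Party", "ja": "政党名"}, '
    '"parent_category": "1.4.7", "children_category": []}',
]
names, leaves = index_definitions(sample)
print(names["1.4.7.2"], "1.4.7.2" in leaves)
```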
Submission Format [JSON]
example
{ "pageid": 34550, "title": "Der kleine Prinz", "ENEs": [ { "ENE_id": "1.7.19.3", "ENE_name": "Movie", "score": 0.684 }, { "ENE_id": "1.7.19.6", "ENE_name": "Book", "score": 0.924 }, { "ENE_id": "1.7.19.2", "ENE_name": "Broadcast_Program", "score": 0.213 }, { "ENE_id": "1.7.19.4", "ENE_name": "Show", "score": 0.107 } ] }
name | optionality | explanation | note |
---|---|---|---|
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
title | optional | Page title. | ‘wgTitle’ in web page, ‘title’ in Cirrus Dump. |
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. Info on ENE categories estimated for the page. | |
ENE_id | | The ENE (ver.8.0) id of the category estimated by the system. | NOTICE: The ENE_id is evaluated regardless of the score. |
ENE_name | optional | The ENE (ver.8.0) name of the category estimated by the system. | |
score | optional but highly recommended | A score for each ENE (ver.8.0) category predicted for the page. | NOTICE: The score should preferably be normalized to the range 0 to 1. If scores are not normalized, their range must be specified in the system description report. |
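A record in the format above could be produced as in the following sketch. The helper name `make_submission_record` is hypothetical, and writing one record per line is an assumption based on the other JSON files. Since the ENE_id is evaluated regardless of the score, the sketch expects the caller to pass only the categories it actually wants to submit.

```python
import json


def make_submission_record(pageid, predictions, title=None):
    """Serialize one page's predictions; predictions is [(ENE_id, score), ...]."""
    record = {"pageid": pageid}
    if title is not None:  # 'title' is optional
        record["title"] = title
    record["ENEs"] = [
        {"ENE_id": ene_id, "score": score} for ene_id, score in predictions
    ]
    # ensure_ascii=False keeps non-ASCII titles readable in the output file.
    return json.dumps(record, ensure_ascii=False)


line = make_submission_record(
    34550,
    [("1.7.19.6", 0.924), ("1.7.19.3", 0.684)],
    title="Der kleine Prinz",
)
print(line)
```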
Development Data [JSON]
example
{ "pageid": 187830, "ENEs": [ { "ENE_id": "1.5.1.2", "ENE_name": "Province" } ] }
name | optionality | explanation | note |
---|---|---|---|
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. Info on ENE categories assigned for the Japanese page. | |
ENE_id | | ENE (ver.8.0) id of the category. | |
ENE_name | | ENE (ver.8.0) name of the category. | |
SHINRA2020-ML System Results [JSON] (New addition. Mar 2021)
example
{ "pageid": "9999999", "results": [ { "team": "TEAM1", "method": "BERT", "ENEs": [ { "ENE_id": "0" }, { "ENE_id": "1.7.21.0" } ] }, { "team": "TEAM2", "method": "ML-BERT", "ENEs": [ { "ENE_id": "0", "score": 0.5366164445877075} ] } ] }
name | optionality | explanation | note |
---|---|---|---|
pageid | | Wikipedia page ID. | ‘wgArticleID’ in web page, ‘_id’ in Cirrus Dump. |
ENEs | | ENE (Extended Named Entity) (ver.8.0) category info. Info on ENE categories estimated for the page. | |
ENE_id | | The ENE (ver.8.0) id of the category estimated by the system. | NOTICE: The ENE_id is evaluated regardless of the score. |
ENE_name | optional | The ENE (ver.8.0) name of the category estimated by the system. | |
score | optional | A score for each ENE (ver.8.0) category predicted for the page. | NOTICE: The score should preferably be normalized to the range 0 to 1. The range of non-normalized scores should have been specified in the system description report. |
team | | The group ID of the SHINRA2020-ML participant group. | |
method | | The ID to distinguish between the methods used for the runs. The ID is specified using any combination of ASCII alphanumeric characters other than ‘_’. Example: ‘BERT’. | |
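One way to read a system results record is to count how many runs predicted each ENE_id for a page. The sketch below is only an illustration of traversing the structure; `count_votes` is a hypothetical name, and vote counting is not described as part of the task.

```python
import json
from collections import Counter


def count_votes(record):
    """Count, per ENE_id, the number of runs that predicted it for the page."""
    votes = Counter()
    for run in record["results"]:
        # A set guards against the same ENE_id being listed twice in one run.
        for ene_id in {ene["ENE_id"] for ene in run["ENEs"]}:
            votes[ene_id] += 1
    return votes


record = json.loads(
    '{"pageid": "9999999", "results": ['
    '{"team": "TEAM1", "method": "BERT", "ENEs": [{"ENE_id": "0"}, {"ENE_id": "1.7.21.0"}]}, '
    '{"team": "TEAM2", "method": "ML-BERT", "ENEs": [{"ENE_id": "0", "score": 0.5366164445877075}]}]}'
)
votes = count_votes(record)
print(votes.most_common())
```

In this example the pageid is a string, unlike the integer pageid in the other JSON files, so it may need `int(...)` before joining with them.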
※ The timestamp of all Wikipedia-related data is January 20, 2019.