Wikipedia CJK Corpora
Wikipedia web pages in different languages are rarely linked except for the cross-lingual link between web pages about the same subject. Collected in June 2010, this data collection consists of 10GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multi-lingual adaptation of the YAWN system (see Related Information). Data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval, to enhance cross-lingual link discovery (a way of automatically finding potential links between documents written in different languages). Through cross-lingual link discovery, users are able to discover documents in languages which they are either familiar with, or which have a richer set of documents than in their language of choice.
Geographical area of data collection
Cite this collection
Data file types
Creative Commons Attribution-Share Alike 4.0 (CC-BY-SA)