Wikipedia CJK Corpora

Wikipedia pages in different languages are rarely linked to one another, except through the cross-lingual links between pages about the same subject. Collected in June 2010, this data collection consists of 10 GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multi-lingual adaptation of the YAWN system (see Publications). The data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval. The collection is intended to enhance cross-lingual link discovery, a way of automatically finding potential links between documents written in different languages. Through cross-lingual link discovery, users can discover documents in other languages that they are familiar with, or in languages that have a richer set of documents than their language of choice.

Publications

Tang, Ling-Xiang, Geva, Shlomo, Trotman, Andrew, Xu, Yue, and Itakura, Kelly. (2011). Overview of the NTCIR-9 crosslink task: cross-lingual link discovery. In Kando, Noriko, Ishikawa, Daisuke, and Sugimoto, Miho (Eds.) Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, National Institute of Informatics, National Center of Sciences, Tokyo, pp. 437-463. http://eprints.qut.edu.au/49127/
Tang, Ling-Xiang, Itakura, Kelly, Geva, Shlomo, Trotman, Andrew, and Xu, Yue. (2011). The effectiveness of cross-lingual link discovery. In Kishida, Kazuaki, Sanderson, Mark, Webber, William, Kando, Noriko, Ishikawa, Noriko, and Sugimoto, Miho (Eds.) Proceedings of The Fourth International Workshop on Evaluating Information Access, National Institute of Informatics (Japan), National Institute of Informatics, Tokyo, pp. 1-8. http://eprints.qut.edu.au/49128/
Schenkel, Ralf, Suchanek, Fabian, and Kasneci, Gjergji. (n.d.). YAWN: A semantically annotated Wikipedia XML Corpus. Max Planck Institute. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.8900&rep=rep1&type=pdf
Tang, Ling-Xiang, Cavanagh, Daniel, Trotman, Andrew, Geva, Shlomo, and Xu, Yue. (2011). Automated cross-lingual link discovery in Wikipedia. In Kando, Noriko, Ishikawa, Daisuke, and Sugimoto, Miho (Eds.) Proceedings of the NTCIR-9 Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, National Institute of Informatics, National Center of Sciences, Tokyo, pp. 512-519. http://eprints.qut.edu.au/49125/

Research areas

Validation tool
Anchor identification
Evaluation metrics
Wikipedia
Assessment tool
Cross-lingual link discovery (CLLD)
Evaluation tool
Information Retrieval and Web Search
Link recommendation

Cite this collection

Tang, Ling-Xiang; Geva, Shlomo (2012): Wikipedia CJK Corpora. Queensland University of Technology. (Dataset) https://doi.org/10.4225/09/587dac9dd6f7b

Related information

NII Test Collection for IR Systems (NTCIR) Project http://research.nii.ac.jp/ntcir/index-en.html
National Institute of Informatics (NII), Japan http://www.nii.ac.jp/en/
Wikipedia database dump of articles http://dumps.wikimedia.org/backup-index.html

Data file types

xml

Licence


Creative Commons Attribution-Share Alike 4.0 (CC-BY-SA)
http://creativecommons.org/licenses/by-sa/4.0/

Copyright

©

Dates of data collection

From 2010-06-01 to 2010-06-30

Connections

Has association with
Shlomo Geva (Researcher)
Was collected by
Ling-Xiang (Eric) Tang (Researcher)

Contacts

Name: Shlomo Geva

Other

Date record created:
2012-03-21
Date record modified:
2022-02-15T13:40:41
Record status:
Published - Open Access