Wikipedia CJK Corpora

Name: Wikipedia CJK Corpora
Creator: Shlomo Geva
License: http://creativecommons.org/licenses/by-sa/4.0/
Keywords: Validation tool, Anchor identification, Evaluation metrics, Wikipedia, Assessment tool, Cross-lingual link discovery (CLLD), Evaluation tool, Information Retrieval and Web Search, Link recommendation, research data, data collections, research project

Tang, Ling-Xiang; Geva, Shlomo

doi:10.4225/09/587dac9dd6f7b

Wikipedia pages in different languages are rarely linked to one another, apart from the cross-lingual links between pages about the same subject. Collected in June 2010, this data collection consists of 10 GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multi-lingual adaptation of the YAWN system (see Publications). The data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval. The collection is intended to enhance cross-lingual link discovery, a way of automatically finding potential links between documents written in different languages. Through cross-lingual link discovery, users can discover documents in languages they are familiar with, or in languages that have a richer set of documents than their language of choice.
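As a rough illustration of how the XML articles might be inspected after download, the following Python sketch walks a local directory and reports the root element and plain-text length of each file. This record does not document the element names used in the collection, so the sketch assumes only well-formed XML; the function names and the directory argument are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_article(xml_data):
    """Return the root tag and plain-text length of one XML article.

    Accepts str or bytes. No element names are assumed, because the
    schema of the collection is not described in this record.
    """
    root = ET.fromstring(xml_data)
    # itertext() yields every text node below the root, whatever the tags are.
    text = "".join(root.itertext())
    return {"root_tag": root.tag, "text_chars": len(text.strip())}


def summarize_corpus(corpus_dir):
    """Summarize every *.xml file below corpus_dir (a hypothetical local path)."""
    summaries = []
    for path in sorted(Path(corpus_dir).rglob("*.xml")):
        # Reading bytes lets the parser honour any encoding declared in the file.
        summary = summarize_article(path.read_bytes())
        summary["file"] = path.name
        summaries.append(summary)
    return summaries
```

Calling `summarize_corpus` on the directory where the files were unpacked would return one dictionary per article, which is a convenient starting point for checking the tag structure before writing a schema-specific parser.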

Publications

Tang, Ling-Xiang, Geva, Shlomo, Trotman, Andrew, Xu, Yue, and Itakura, Kelly. (2011). Overview of the NTCIR-9 crosslink task: cross-lingual link discovery. In Kando, Noriko, Ishikawa, Daisuke, and Sugimoto, Miho (Eds.) Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, National Institute of Informatics, National Center of Sciences, Tokyo, pp. 437-463. http://eprints.qut.edu.au/49127/

Tang, Ling-Xiang, Itakura, Kelly, Geva, Shlomo, Trotman, Andrew, and Xu, Yue. (2011). The effectiveness of cross-lingual link discovery. In Kishida, Kazuaki, Sanderson, Mark, Webber, William, Kando, Noriko, Ishikawa, Noriko, and Sugimoto, Miho (Eds.) Proceedings of The Fourth International Workshop on Evaluating Information Access, National Institute of Informatics (Japan), National Institute of Informatics, Tokyo, pp. 1-8. http://eprints.qut.edu.au/49128/

Schenkel, Ralf, Suchanek, Fabian, and Kasneci, Gjergji. (n.d.). YAWN: A semantically annotated Wikipedia XML corpus. Max Planck Institute. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.8900&rep=rep1&type=pdf

Tang, Ling-Xiang, Cavanagh, Daniel, Trotman, Andrew, Geva, Shlomo, and Xu, Yue. (2011). Automated cross-lingual link discovery in Wikipedia. In Kando, Noriko, Ishikawa, Daisuke, and Sugimoto, Miho (Eds.) Proceedings of the NTCIR-9 Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, National Institute of Informatics, National Center of Sciences, Tokyo, pp. 512-519. http://eprints.qut.edu.au/49125/

Research areas

Validation tool

Anchor identification

Evaluation metrics

Wikipedia

Assessment tool

Cross-lingual link discovery (CLLD)

Evaluation tool

Information Retrieval and Web Search

Link recommendation

Cite this collection

Tang, Ling-Xiang; Geva, Shlomo (2012): Wikipedia CJK Corpora. Queensland University of Technology. (Dataset) https://doi.org/10.4225/09/587dac9dd6f7b

Related information

NII Test Collection for IR Systems (NTCIR) Project http://research.nii.ac.jp/ntcir/index-en.html

National Institute of Informatics (NII), Japan http://www.nii.ac.jp/en/

Wikipedia database dump of articles http://dumps.wikimedia.org/backup-index.html

Access the data

http://warehouse.ntcir.nii.ac.jp/openaccess/crosslink/crosslink_documents.html

Data file types

xml

Licence

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
http://creativecommons.org/licenses/by-sa/4.0/

Dates of data collection

From 2010-06-01 to 2010-06-30

Contacts

Name: Shlomo Geva

Email: s.geva@qut.edu.au

Other

Date record created:

2012-03-21

Date record modified:

2022-02-15T13:40:41

Record status:

Published - Open Access
