NTU Multilingual Corpus

Welcome to the NTU Multilingual Corpus!
Corpus Search:

We are currently developing a corpus search tool that allows searching over the full corpus. Queries can be made by concepts, words, lemmas, parts-of-speech, etc., and can be intersected for finer results.
Results can also be displayed with crosslingual sentence alignment and/or sentiment scores.
You can try our search tool here.
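
As a rough illustration of what intersecting queries means, the sketch below indexes a toy corpus by word, lemma and part-of-speech and returns the sentences that match every criterion. It is not the NTUMC search tool's implementation: the sample sentences, the Penn-Treebank-style tags and the names build_index and search are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy corpus: (sentence_id, [(word, lemma, pos), ...]).
SENTENCES = [
    (1, [("dogs", "dog", "NNS"), ("bark", "bark", "VBP")]),
    (2, [("the", "the", "DT"), ("dog", "dog", "NN"), ("barked", "bark", "VBD")]),
    (3, [("cats", "cat", "NNS"), ("sleep", "sleep", "VBP")]),
]


def build_index(sentences):
    """Map each (field, value) pair to the set of sentence ids containing it."""
    index = defaultdict(set)
    for sid, tokens in sentences:
        for word, lemma, pos in tokens:
            index[("word", word)].add(sid)
            index[("lemma", lemma)].add(sid)
            index[("pos", pos)].add(sid)
    return index


def search(index, **criteria):
    """Return ids of sentences satisfying every criterion (set intersection)."""
    hits = [index.get((field, value), set()) for field, value in criteria.items()]
    return set.intersection(*hits) if hits else set()


index = build_index(SENTENCES)
print(search(index, lemma="dog"))             # sentences containing the lemma "dog"
print(search(index, lemma="dog", pos="VBD"))  # ...that also contain a VBD token
```

Note that this sketch intersects at the sentence level: a sentence matches if some token has the lemma and some (possibly different) token has the tag. Token-level intersection would need the index to store token positions as well.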


POS Distribution:

Below you can access the distribution of parts-of-speech across the NTUMC. We also make available mappings to the 12 universal POS tags, as described in "A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald. These mappings were made using Version 1.03, which had no official mappings for any of the part-of-speech tagsets we currently use. For this reason, the mappings were created specifically for the NTUMC and may differ from earlier or later versions of the official mappings provided by Petrov, Das and McDonald.

The lists below are dynamically updated and sortable, and display the five most frequent words assigned to each part-of-speech. (UPOS refers to the Universal Part-of-Speech Tagset.)
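
A minimal sketch of how such a list could be computed from tagged tokens is given below. The mapping fragment is illustrative only and is not the NTUMC mapping; the Penn-Treebank-style source tags, the sample tokens and the function name pos_distribution are assumptions made for this example. Tags without an entry fall back to X, the catch-all category of the universal tagset.

```python
from collections import Counter, defaultdict

# Illustrative fragment only: Penn-Treebank-style tags mapped to universal
# tags (Petrov et al.). This is NOT the mapping used for the NTUMC.
TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET", "IN": "ADP", "CD": "NUM",
    ".": ".",
}


def pos_distribution(tagged_tokens, mapping=None, top_n=5):
    """Return (tag, token count, top_n most frequent words) per tag.

    If a mapping is given, each tag is first converted through it;
    tags missing from the mapping are counted under "X".
    """
    words_by_tag = defaultdict(Counter)
    for word, tag in tagged_tokens:
        if mapping is not None:
            tag = mapping.get(tag, "X")
        words_by_tag[tag][word.lower()] += 1

    rows = []
    for tag, counts in words_by_tag.items():
        total = sum(counts.values())
        top_words = [w for w, _ in counts.most_common(top_n)]
        rows.append((tag, total, top_words))
    # Most frequent tags first; ties broken alphabetically by tag.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical sample data.
TOKENS = [
    ("The", "DT"), ("dog", "NN"), ("barked", "VBD"), (".", "."),
    ("The", "DT"), ("dogs", "NNS"), ("sleep", "VBP"), (".", "."),
]

for tag, total, top_words in pos_distribution(TOKENS, mapping=TO_UPOS):
    print(tag, total, top_words)
```

Calling pos_distribution(TOKENS) without a mapping gives the same table over the original tags, which corresponds to the non-UPOS lists.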

English
English (UPOS)
Mandarin
Mandarin (UPOS)
Japanese
Japanese (UPOS)
Indonesian
Indonesian (UPOS)

References:

Canonical Citation:

Liling Tan and Francis Bond. 2012. Building and annotating the linguistically diverse NTU-MC (NTU-multilingual corpus). In International Journal of Asian Language Processing 22(4) pp 161–174.

Other References:

Francis Bond, Shan Wang, Eshley Huini Gao, Hazel Shuwen Mok, and Jeanette Yiwen Tan. 2013. Developing parallel sense-tagged corpora with wordnets. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW 2013). Sofia. pp 149–158.

Yu Jie Seah and Francis Bond. 2014. Annotation of Pronouns in a Multilingual Corpus of Mandarin Chinese, English and Japanese. In 10th Joint ACL-ISO Workshop on Interoperable Semantic Annotation, Reykjavik.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.

Shan Wang and Francis Bond. 2014. Building The Sense-Tagged Multilingual Parallel Corpus. In 9th Edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik.


Contributors: Francis Bond, Liling Tan, Tuan Anh Le, Luís Morgado da Costa.


Francis Bond <bond@ieee.org>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303