Open Multilingual Wordnet
This page provides access to open wordnets in a variety of
languages, all linked to
the Princeton Wordnet of
English (PWN). The goal is to make it easy to use wordnets
in multiple languages. The individual wordnets have been made
by many different projects and vary greatly in size and
accuracy. We have (i) extracted and normalized the data,
(ii) linked it to Princeton WordNet 3.0 and (iii) put it in one
place. The Open Multilingual Wordnet and its components are
open: they can be freely used,
modified, and shared by anyone for any purpose.
There is a fuller list of wordnets at the Global Wordnet Association's
Wordnets
in the World page.
If you use these wordnets, please cite the original projects
that created them (linked in Table 1). If you got value from this
aggregation/normalization, please cite Bond and Paik (2012).
You can access the wordnets through the Python
Natural Language Toolkit (NLTK) wordnet interface.
We have an extended version with
automatically extracted data for over 150 languages
from Wiktionary
and the Unicode Common Locale
Data Repository (Bond and Foster, 2013).
Documentation, News and Updates
Search
We have a simple search
interface (search the extended
wordnet). It uses the SQL database originally developed by the Japanese
Wordnet project.
34 Open Wordnets Merged
Wordnet | Lang | Synsets | Words | Senses | Core | Licence | Data | Citation |
Albanet | als | 4,675 | 5,988 | 9,599 | 31% | CC BY 3.0 | als.zip (+xml) | cite:als; (.bib) |
Arabic WordNet (AWN v2) | arb | 9,916 | 17,785 | 37,335 | 47% | CC BY SA 3.0 | arb.zip (+xml) | cite:arb; (.bib) |
BulTreeBank Wordnet (BTB-WN) | bul | 4,959 | 6,720 | 8,936 | 99% | CC BY 3.0 | bul.zip (+xml) | cite:bul; (.bib) |
Chinese Open Wordnet | cmn | 42,312 | 61,533 | 79,809 | 100% | wordnet | cmn.zip (+xml) | cite:cmn; (.bib) |
Chinese Wordnet (Taiwan) | qcn | 4,913 | 3,206 | 8,069 | 28% | wordnet | qcn.zip (+xml) | cite:qcn; (.bib) |
DanNet | dan | 4,476 | 4,468 | 5,859 | 81% | wordnet | dan.zip (+xml) | cite:dan; (.bib) |
Greek Wordnet | ell | 18,049 | 18,227 | 24,106 | 57% | Apache 2.0 | ell.zip (+xml) | cite:ell; (.bib) |
Princeton WordNet | eng | 117,659 | 148,730 | 206,978 | 100% | wordnet | eng.zip (+xml) | cite:eng; (.bib) |
Persian Wordnet | fas | 17,759 | 17,560 | 30,461 | 41% | Free to use | fas.zip (+xml) | cite:fas; (.bib) |
FinnWordNet | fin | 116,763 | 129,839 | 189,227 | 100% | CC BY 3.0 | fin.zip (+xml) | cite:fin; (.bib) |
WOLF (Wordnet Libre du Français) | fra | 59,091 | 55,373 | 102,671 | 92% | CeCILL-C | fra.zip (+xml) | cite:fra; (.bib) |
Hebrew Wordnet | heb | 5,448 | 5,325 | 6,872 | 27% | wordnet | heb.zip (+xml) | cite:heb; (.bib) |
Croatian Wordnet | hrv | 23,120 | 29,008 | 47,900 | 100% | CC BY 3.0 | hrv.zip (+xml) | cite:hrv; (.bib) |
IceWordNet | isl | 4,951 | 11,504 | 16,004 | 99% | CC BY 3.0 | isl.zip (+xml) | |
MultiWordNet | ita | 35,001 | 41,855 | 63,133 | 83% | CC BY 3.0 | ita.zip (+xml) | cite:ita; (.bib) |
ItalWordnet | ita | 15,563 | 19,221 | 24,135 | 48% | ODC-BY 1.0 | ita.zip (+xml) | cite:iwn (.bib) |
Japanese Wordnet | jpn | 57,184 | 91,964 | 158,069 | 95% | wordnet | jpn.zip (+xml) | cite:jpn; (.bib) |
Multilingual Central Repository | cat | 45,826 | 46,531 | 70,622 | 81% | CC BY 3.0 | cat.zip (+xml) | cite:cat; (.bib) |
Multilingual Central Repository | eus | 29,413 | 26,240 | 48,934 | 71% | CC BY 3.0 | eus.zip (+xml) | cite:eus; (.bib) |
Multilingual Central Repository | glg | 19,312 | 23,124 | 27,138 | 36% | CC BY 3.0 | glg.zip (+xml) | cite:glg; (.bib) |
Multilingual Central Repository | spa | 38,512 | 36,681 | 57,764 | 76% | CC BY 3.0 | spa.zip (+xml) | cite:spa; (.bib) |
Wordnet Bahasa | ind | 38,085 | 36,954 | 106,688 | 94% | MIT | ind.zip (+xml) | cite:ind; (.bib) |
Wordnet Bahasa | zsm | 36,911 | 33,932 | 105,028 | 96% | MIT | zsm.zip (+xml) | cite:zsm; (.bib) |
Open Dutch WordNet | nld | 30,177 | 43,077 | 60,259 | 67% | CC BY SA 4.0 | nld.zip (+xml) | cite:nld; (.bib) |
Norwegian Wordnet | nno | 3,671 | 3,387 | 4,762 | 66% | wordnet | nno.zip (+xml) | cite:nno; (.bib) |
Norwegian Wordnet | nob | 4,455 | 4,186 | 5,586 | 81% | wordnet | nob.zip (+xml) | cite:nob; (.bib) |
plWordNet | pol | 33,826 | 45,387 | 52,378 | 54% | wordnet | pol.zip (+xml) | cite:pol; (.bib) |
OpenWN-PT | por | 43,895 | 54,071 | 74,012 | 84% | CC BY-SA | por.zip (+xml) | cite:por; (.bib) |
Romanian Wordnet | ron | 56,026 | 49,987 | 84,638 | 94% | CC BY SA | ron.zip (+xml) | cite:ron; (.bib) |
Lithuanian WordNet | lit | 9,462 | 11,395 | 16,032 | 35% | CC BY SA 3.0 | lit.zip (+xml) | cite:lit; (.bib) |
Slovak WordNet | slk | 18,507 | 29,150 | 44,029 | 58% | CC BY SA 3.0 | slk.zip (+xml) | |
sloWNet | slv | 42,583 | 40,233 | 70,947 | 86% | CC BY SA 3.0 | slv.zip (+xml) | cite:slv; (.bib) |
Swedish (SALDO) | swe | 6,796 | 5,824 | 6,904 | 99% | CC-BY 3.0 | swe.zip (+xml) | cite:swe; (.bib) |
Thai Wordnet | tha | 73,350 | 82,504 | 95,517 | 81% | wordnet | tha.zip (+xml) | cite:tha; (.bib) |
25 synsets shared from 117,677 (0%)
Language codes linked to Lewis, M. Paul (ed.), 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International. Online version: http://www.ethnologue.com/
Data has, for each language, the script to make the tab file, the tab file, the wordnet LMF file and the LICENSE file. You can also get this with wordnet-LMF and lemon-rdf encoded files (+xml). If you want all the languages in one file, it is here: data for all of the wordnets, data for all of the wordnets with wordnet-LMF and lemon-rdf (big file). The code used to generate the extended wordnet is available here under the MIT license. It is neither well documented nor cleaned for release (sorry).
Core refers to the percentage of synsets covered from the
semi-automatically compiled list of 5000 "core" word senses in
Princeton WordNet (approximately the 5000 most frequently used word
senses). Core synsets are marked with ✪ in the interface. The original list is
available at http://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt
(Boyd-Graber et al., 2006). We also provide our own version, converted to wn30 synsets.
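The Core figure can be read as a simple set ratio: the share of the core synsets that also appear in a given wordnet. The following is a minimal sketch of that calculation, not the code used to build the table; the function name and the toy synset keys are hypothetical and only illustrate the arithmetic.

```python
def core_coverage(wordnet_synsets, core_synsets):
    """Return the percentage of core synsets that the wordnet covers."""
    core = set(core_synsets)
    if not core:
        raise ValueError("the core list must not be empty")
    covered = core & set(wordnet_synsets)
    return 100.0 * len(covered) / len(core)


# Toy keys, invented for illustration only (not real PWN offsets).
toy_core = {"c1", "c2", "c3", "c4"}
toy_wordnet = {"c1", "c2", "x9"}

# Two of the four toy core synsets are covered.
print(round(core_coverage(toy_wordnet, toy_core)))  # 50
```

Synsets outside the core list (such as `x9` above) do not affect the figure, which is why a small wordnet can still reach a high Core percentage.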
The wordnets are linked to the
Suggested Upper Merged
Ontology (Sumo: Niles and Pease, 2001;
Pease, 2011); the
TempoWordNet
(Dias et al., 2014); the
Multilingual, layered sentiment lexicons
(ML-SentiCon: Cruz et al., 2014);
and SentiWordNet3.0
(Baccianella et al., 2010).
The fullest list of wordnets is the Global Wordnet
Association's Wordnets
in the World.
Mapping between wordnet versions was done using the mappings from TALP at UPC
(Daudé et al. 2000).
Formats
Tab files
The wn-data-*.tab files are tab-separated files. Each line holds either a synset-lemma pair or a synset, sub-id and definition/example:
# name␉lang␉url␉license
offset-pos␉lang:lemma␉word
offset-pos␉lang:def␉sid␉definition
offset-pos␉lang:exe␉sid␉example
...
name | the name of the project |
lang | the ISO 639-3 three-letter code for the language |
url | the url of the project |
license | a short name for the license |
offset | the 8-digit Princeton WordNet 3.0 synset offset |
pos | one of [a,v,n,r] (we treat 's' as 'a') |
lemma | the lemma (word separator normalized to ' ') |
sid | the sub id of the definition/example (starting from 0) |
Example:
# Wordnet Bahasa␉ind␉http://wn-msa.sourceforge.net/␉MIT
00019613-n␉ind:def␉0␉masalah fisik yang nyata
00019613-n␉ind:lemma␉inti
00019613-n␉ind:lemma␉unsur
11407591-n␉ind:def␉0␉Novelis dan kritikus Perancis
11407591-n␉ind:def␉1␉pembela Dreyfus
11407591-n␉ind:lemma␉Emile Zola
11407591-n␉ind:lemma␉Zola
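The format above is simple enough to read with a few lines of Python. The following is a minimal sketch, assuming real tab characters between fields and a single header line starting with `#`; the function name `parse_tab` and the returned structure are our own choices for illustration, not part of any released tool.

```python
from collections import defaultdict


def parse_tab(lines):
    """Parse OMW tab-file lines into (header, lemmas, definitions, examples)."""
    header = None
    lemmas = defaultdict(list)   # synset key -> list of lemmas
    defs = defaultdict(dict)     # synset key -> {sub id: definition}
    exes = defaultdict(dict)     # synset key -> {sub id: example}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header: "# name<TAB>lang<TAB>url<TAB>license"
            fields = line.lstrip("#").strip().split("\t")
            if header is None and len(fields) == 4:
                header = dict(zip(("name", "lang", "url", "license"), fields))
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        synset = fields[0]
        _lang, _, kind = fields[1].partition(":")
        if kind == "lemma":
            lemmas[synset].append(fields[2])
        elif kind in ("def", "exe") and len(fields) >= 4:
            target = defs if kind == "def" else exes
            target[synset][int(fields[2])] = fields[3]
    return header, dict(lemmas), dict(defs), dict(exes)


SAMPLE = (
    "# Wordnet Bahasa\tind\thttp://wn-msa.sourceforge.net/\tMIT\n"
    "00019613-n\tind:def\t0\tmasalah fisik yang nyata\n"
    "00019613-n\tind:lemma\tinti\n"
    "00019613-n\tind:lemma\tunsur\n"
    "11407591-n\tind:def\t0\tNovelis dan kritikus Perancis\n"
    "11407591-n\tind:def\t1\tpembela Dreyfus\n"
    "11407591-n\tind:lemma\tEmile Zola\n"
    "11407591-n\tind:lemma\tZola\n"
)

header, lemmas, defs, exes = parse_tab(SAMPLE.splitlines())
print(header["lang"], lemmas["11407591-n"])  # ind ['Emile Zola', 'Zola']
```

Splitting on tabs only (rather than on any whitespace) matters because lemmas and definitions may contain spaces, as "Emile Zola" does.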
For this data to be really useful you need to combine it with the
synset relations from the Princeton wordnet.
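One way to do this combination is through the NLTK wordnet interface mentioned earlier on this page. The sketch below is illustrative only: it assumes NLTK is installed with its WordNet 3.0 and Open Multilingual Wordnet data already downloaded, and the helper names `split_key` and `hypernym_lemmas` are hypothetical.

```python
def split_key(key):
    """Split a synset key such as '00019613-n' into (pos, offset)."""
    offset, _, pos = key.partition("-")
    return pos, int(offset)


def hypernym_lemmas(key, lang):
    """Map each PWN hypernym of the synset to its lemmas in `lang`."""
    # Imported here so split_key() stays usable without NLTK installed.
    from nltk.corpus import wordnet as wn

    pos, offset = split_key(key)
    synset = wn.synset_from_pos_and_offset(pos, offset)
    # The hypernym relation comes from PWN; the lemmas come from the
    # wordnet for the requested language.
    return {h.name(): h.lemma_names(lang) for h in synset.hypernyms()}


print(split_key("00019613-n"))  # ('n', 19613)

# With the NLTK data available, one could then call, for example:
#   hypernym_lemmas("00019613-n", "ind")
```

The same pattern works for other relations exposed by NLTK synsets, such as `hyponyms()`.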
Wordnet LMF Files
Wordnet-LMF format files are made by combining the tab files with the Princeton wordnet. Note: individual wordnet projects may have better versions of the wordnet LMF files.
Known Problems
- We discard any synsets not linked to PWN (such as new synsets
in the Arabic wordnet). The
Global Wordnet Association (including us) is working to build a better
version that can handle these links.
- If the wordnet has a different structure, we only show those
concepts with synonymous or near synonymous links to PWN. So for
Danish, Polish and Norwegian, we only have a small subset of the
entire wordnet.
- We currently only make use of synset level sentiment analysis
from ML-SentiCon (Cruz et al., 2014);
we do not show the language specific lemma level analysis.
- We currently can't add wordnets that don't link to PWN (such
as Gaelic).
- We are focused on adding lemmas, so we do not yet have all the extra
information from other projects, such as:
- Definitions and examples from wordnets such as Spanish
- Orthographic variation and pronunciation in the Hebrew Wordnet
We plan to add this information as time permits.
- We should strip diacritics from the Arabic wordnet to make lookup easier.
- We may yet be missing some available wordnets: please help us add
more. Any wordnet with an open license that links to the
Princeton Wordnet is welcome.
- The interface is not very multilingual.
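The diacritics item in the list above can be sketched in a few lines. This is only an illustration of the idea, not code from the project: it assumes that "diacritics" means the Arabic short-vowel and related marks U+064B to U+0652 plus the superscript alef U+0670, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the marks to remove are U+064B..U+0652 (tanwin, short vowels,
# shadda, sukun) and U+0670 (superscript alef). Letters are left untouched.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_arabic_diacritics(lemma):
    """Return the lemma with the marks listed above removed."""
    return ARABIC_DIACRITICS.sub("", lemma)


def build_lookup(lemmas):
    """Index lemmas by their unvocalized form."""
    index = defaultdict(list)
    for lemma in lemmas:
        index[strip_arabic_diacritics(lemma)].append(lemma)
    return dict(index)


vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628"  # "kitab" with short vowels
bare = "\u0643\u062A\u0627\u0628"                   # the same word without them

lookup = build_lookup([vocalized])
print(lookup[bare] == [vocalized])  # True
```

Keeping the original vocalized lemma as the value means no information is lost; only the search key is normalized.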
(BibTeX Complete References)
- als
Ervin Ruci (2008)
- On
the current state of Albanet and related applications,
Technical Report, University of Vlora
- all
Francis Bond and Kyonghee Paik (2012)
- A survey of wordnets and their licenses
In Proceedings of the 6th Global WordNet Conference
(GWC 2012). Matsue. 64–71
- Francis Bond and Ryan Foster (2013)
- Linking
and extending an open multilingual wordnet. In 51st Annual
Meeting of the Association for Computational Linguistics:
ACL-2013. Sofia. 1352–1362
- arb Black W.,
Elkateb S., Rodriguez H., Alkhalifa M., Vossen P., Pease A.,
Bertran M., Fellbaum C., (2006)
- The Arabic WordNet Project, Proceedings of LREC 2006
- Lahsen Abouenour, Karim Bouzoubaa, Paolo Rosso (2013)
- On the evaluation and improvement of Arabic WordNet coverage and usability,
Language Resources and Evaluation 47(3) pp 891–917
- bul Simov, Kiril and Osenova, Petya (2010)
- Constructing of an Ontology-based Lexicon for Bulgarian, Proceedings of LREC 2010
-
cat,
glg,
spa,
Aitor Gonzalez-Agirre, Egoitz Laparra and German Rigau (2012)
- Multilingual
Central Repository version 3.0: upgrading a very large lexical
knowledge base. In Proceedings of the 6th Global WordNet
Conference (GWC 2012) Matsue, Japan.
-
eus
Elisabete Pociello, Eneko Agirre and Izaskun Aldezabal (2010)
- Methodology and construction of the Basque WordNet, Language Resources and Evaluation,
Springer Netherlands 45(2) pp 121–142
- core
Boyd-Graber, J., Fellbaum, C., Osherson, D., and Schapire, R. (2006)
- Adding
dense, weighted connections to WordNet. In: Proceedings
of the Third Global WordNet Meeting, Jeju Island, Korea,
January 2006
- cmn
Shan Wang and Francis Bond (2013)
- Building
the Chinese Open Wordnet (COW): Starting from Core
Synsets. In Proceedings of the 11th Workshop on
Asian Language Resources, a Workshop of The 6th
International Joint Conference on Natural Language
Processing (IJCNLP-6). Nagoya, Japan. pp.10–18.
- qcn
Huang, C.-R., Hsieh, S.-K., Hong, J.-F., Chen, Y.-Z., Su, I.-L., Chen, Y.-X.,
and Huang, S.-W. (2010).
- Chinese wordnet: Design and implementation of a cross-lingual
knowledge processing infrastructure. In Journal of Chinese Information Processing. 24(2) pp 14–23. (in Chinese)
- dan
Pedersen, B. S., Nimb, S., Asmussen, J., Sørensen, N. H., Trap-Jensen, L. and Lorentzen, H. (2009)
- DanNet -- the challenge of compiling a WordNet
for Danish by reusing a monolingual dictionary
Language Resources and Evaluation, Volume 43:3 pp. 269-299
- eng
Christiane Fellbaum. (ed.) (1998)
- WordNet: An Electronic Lexical Database, MIT Press
- ell
Sofia Stamou, Goran Nenadic and Dimitris Christodoulakis (2004)
- Exploring Balkanet Shared Ontology for Multilingual Conceptual Indexing,
Proceedings of LREC 2004
- fra
Benoit Sagot and Darja Fišer (2008)
- Building a free French wordnet from multilingual
resources, E. L. R. A. (ELRA) (ed.), Proceedings of
the Sixth International Language Resources and Evaluation
(LREC’08), Marrakech, Morocco
- heb
Noam Ordan and Shuly Wintner (2007)
- Hebrew WordNet: a test case of aligning lexical databases across languages.
International Journal of Translation 19(1):39–58, 2007
- hrv
Oliver A., Šojat, K., Srebačić, M. (2015)
- Automatic Expansion of Croatian Wordnet
In Proceedings of the 29th CALS international conference:
Applied Linguistic Research and Methodology Zadar (Croatia)
- Raffaelli, Ida; Bekavac, Božo; Agić, Željko; Tadić, Marko. (2008)
- Building Croatian WordNet.
In Proceedings of the Fourth Global WordNet Conference pp349-359
- ita
Emanuele Pianta, Luisa Bentivogli and
Christian Girardi. (2002)
- MultiWordNet: Developing an Aligned Multilingual Database.
In Proceedings of the First International Conference on Global WordNet,
Mysore, India, January 21-25, 2002, pp. 293-302.
- Antonio Toral, Stefania Bracale, Monica Monachini and Claudia Soria (2010)
- Rejuvenating the Italian WordNet: upgrading,
standardising, extending
In Proceedings of the 5th International Conference of
the Global WordNet Association (GWC-2010)
Mumbai
- ind,zsm
Nurril Hirfana Mohamed Noor,
Suerya Sapuan and Francis Bond (2011)
- Creating
the open Wordnet Bahasa
In Proceedings of the 25th Pacific Asia Conference
on Language, Information and Computation (PACLIC 25)
pages 258–267. Singapore
- jpn
Hitoshi
Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama and Kyoko
Kanzaki (2008)
- Development of Japanese WordNet.
In LREC-2008, Marrakech.
- fas
Montazery, Mortaza and Heshaam Faili (2010)
- Automatic Persian WordNet Construction, the 23rd
International Conference on Computational Linguistics
pp. 846–850
- fin
Lindén K., Carlson. L., (2010)
- FinnWordNet — WordNet på finska via översättning, LexicoNordica
— Nordic Journal of Lexicography, 17 pp 119–140
- lit
Garabík, Radovan and Pileckytė, Indrė (2013)
- From Multilingual Dictionary to Lithuanian
WordNet. In: Natural Language Processing, Corpus
Linguistics, E-Learning. Ed. Katarína Gajdošová — Adriána
Žáková. Lüdenscheid: RAM-Verlag, pp. 74–80.
- sentiwn
Baccianella, Stefano, Esuli, Andrea and Sebastiani, Fabrizio (2010)
-
SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment
Analysis and Opinion Mining., Proceedings of the Seventh conference
on International Language Resources and Evaluation (LREC'10) , Valletta, Malta, 2010
- ml-senticon
Cruz, Fermín L., José A. Troyano, Beatriz Pontes, F. Javier Ortega, (2014)
-
Building layered, multilingual sentiment lexicons at synset and lemma
levels, Expert Systems with Applications , 2014
- mapp
Jordi Daudé, Lluís Padró and German Rigau
(2000)
- Mapping WordNets Using Structural Information.
38th Annual Meeting of the Association for Computational Linguistics (ACL'2000),
Hong Kong
- nld
Marten Postma, Emiel van Miltenburg, Roxane Segers, Anneleen Schoen and Piek Vossen (2016)
- Open Dutch WordNet, Proceedings of the Eighth Global
Wordnet Conference Bucharest, Romania.
- nno,nob
Fjeld, Ruth Vatvedt and Nygaard, Lars (2009)
-
NorNet - a monolingual wordnet of modern Norwegian
In Proceedings of the NODALIDA 2009 workshop WordNets
and other Lexical Semantic Resources — between Lexical Semantics,
Lexicography, Terminology and Formal Ontologies.
pages 13–16. Estonia
- pol
Maciej Piasecki, Stanisław Szpakowicz and Bartosz Broda. (2009)
- A
Wordnet from the Ground Up. Wroclaw: Oficyna Wydawnicza
Politechniki Wroclawskiej, Poland.
- por
Valeria de Paiva and Alexandre Rademaker (2012)
- Revisiting a Brazilian wordnet. In Proceedings of
Global Wordnet Conference, Matsue. Global Wordnet
Association. (also with Gerard de Melo's contribution)
- ron
Tufiș, Dan, Ion, Radu, Bozianu, Luigi, Ceaușu, Alexandru and Ștefănescu, Dan. (2008)
- Romanian Wordnet: Current State, New Applications and Prospects.
In Proceedings of the 4th Global WordNet Conference, GWC-2008
Eds. Tanacs, Attila, Csendes, Dora, Vincze, Veronika, Fellbaum, Christiane and Vossen, Piek.
Szeged, Hungary, pp. 441–452
- slv
Fišer, Darja, and Novak, Jernej,
and Erjavec, Tomaž (2012)
- sloWNet 3.0: development, extension and cleaning.
In Proceedings of the 6th International Global Wordnet
Conference (GWC 2012). The Global WordNet Association,
pp. 113-117
-
sumo
Adam Pease (2011)
- Ontology: A Practical Guide. Articulate Software
Press, Angwin, CA. ISBN 978-1-889455-10-5.
-
sumo
Niles, I and Adam Pease (2001)
- Toward a Standard Upper Ontology. In
Proceedings of the 2nd International Conference
on Formal Ontology in Information Systems
(FOIS-2001), Chris Welty and Barry Smith, eds.
-
swe
Borin, Lars and Forsberg, Markus and Lönngren, Lennart (2013)
-
SALDO: a touch of yin to WordNet's yang.
Language Resources and Evaluation 47(4) pp 1191–1211
- tempo
Gaël Dias, Mohammed Hasanuzzaman, Stéphane Ferrari, Yann Mathet (2014)
-
TempoWordNet for Sentence Time Tagging.
Proceedings of the Companion Publication of the 23rd
International Conference on World Wide Web Companion
pages 833–838, Switzerland
- tha
Thoongsup S., Charoenporn T., Robkop K., Sinthurahat T.,
Mokarat C., Sornlertlamvanich V., Isahara H. (2009)
- Thai Wordnet Construction. Proceedings of the 7th
Workshop on Asian Language Resources (ALR7), Joint
conference of the 47th Annual Meeting of the Association
for Computational Linguistics (ACL) and the 4th
International Joint Conference on Natural Language
Processing (IJCNLP) Suntec, Singapore
Contributors: Francis Bond, Lars Nygaard, Adam Pease, John McCrae, Luís Morgado da Costa and all the wordnet projects.
Francis Bond
<bond@ieee.org>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303