HG2051 Project Two

This assignment constitutes 30% of your final grade. Please work in groups of two or three, and submit a combined report both on NTULearn and in hard copy. If the results or data are too big to submit that way, give them to me in person or send me a link.

Deadline: Mon Nov 27 2017, 17:00.

Pick one of the following:

  1. Use the disambiguated Swadesh lists (Morgado da Costa et al., 2016) to evaluate the wordnets in the OMW. (★) Swadesh lists are lists of basic concepts designed for the purposes of historical-comparative linguistics by Morris Swadesh.
    New: project2.py provides some scaffolding to help you read in the data.
  2. Disambiguate the individual words in all (or most) MWEs in wordnet (★★)
  3. Choose your own task of equivalent difficulty (★ - ★★★)

Evaluating with Swadesh Lists

It is hard to evaluate resources in many languages. One way is to compare them to existing hand-built multilingual resources (and hope that the compilers did not also use this data). In this task we use the Swadesh lists to see how good the coverage of the different wordnets in OMW is.

However, there is no guarantee that all the Swadesh lists are well formatted or correct, so in this assignment you will do an in-depth error analysis of the evaluation for at least three wordnets. If the wordnets you choose have very few synsets that are in the Swadesh lists, then analyse more than three.

This is a relatively easy task. Make sure you explain properly what assumptions you are making for synsets with multiple entries, and format your results clearly.
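As a starting point, the comparison could be sketched as below. This is only an illustration under stated assumptions: the function name coverage, the dictionary layout and the toy data are all invented here (they are not taken from project2.py), and the rule "a synset matches if any Swadesh word is one of its lemmas" is exactly the kind of assumption you should state and justify in your report.

```python
def coverage(swadesh, wordnet):
    """Compare Swadesh entries with the lemmas of one wordnet, synset by synset.

    swadesh: dict mapping a synset ID (e.g. '14845743-n') to a set of words
    wordnet: dict mapping a synset ID to the set of lemmas in that wordnet
    Assumption: a synset counts as matched if ANY Swadesh word is a lemma.
    """
    report = {"matched": [], "mismatched": [], "missing": []}
    for ss_id, words in sorted(swadesh.items()):
        lemmas = wordnet.get(ss_id)
        if not lemmas:
            # the wordnet has no entry at all for this concept
            report["missing"].append(ss_id)
        elif {w.lower() for w in words} & {l.lower() for l in lemmas}:
            report["matched"].append(ss_id)
        else:
            # both sides have words, but none agree: inspect these by hand
            report["mismatched"].append(ss_id)
    return report


# Toy data, invented for illustration; the 0000000x IDs are placeholders
swadesh_toy = {
    "14845743-n": {"water"},
    "00000001-n": {"dog"},
    "00000002-n": {"rock", "stone"},
}
wordnet_toy = {
    "14845743-n": {"water", "H2O"},
    "00000002-n": {"boulder"},
}

report = coverage(swadesh_toy, wordnet_toy)
for label, ids in report.items():
    print(label, len(ids), ids)
```

The mismatched list is where the error analysis starts: each entry is either a gap in the wordnet, a problem in the Swadesh list, or a formatting difference between the two.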

If you find things that should perhaps be treated specially (such as chemical formulae or scientific names), see if you can identify them as such with a general rule rather than one by one.
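One possible way to flag such entries is with simple patterns, sketched below. The patterns and the function name classify are assumptions made for this example, not part of the assignment; they are rough heuristics that will both miss cases and over-match.

```python
import re

# Assumption: a formula is a run of element-like symbols (capital letter,
# optional lower-case letter, optional count) containing at least one digit.
FORMULA = re.compile(r"^(?:[A-Z][a-z]?\d*)+$")
# Assumption: a scientific name is a capitalised genus followed by a
# lower-case species, joined by a space or an underscore.
BINOMIAL = re.compile(r"^[A-Z][a-z]+[ _][a-z]+$")


def classify(lemma):
    """Label a lemma as 'formula', 'scientific name' or 'ordinary'."""
    if FORMULA.match(lemma) and any(ch.isdigit() for ch in lemma):
        return "formula"
    if BINOMIAL.match(lemma):
        return "scientific name"
    return "ordinary"


for lemma in ["water", "H2O", "Canis_familiaris", "TV"]:
    print(lemma, "->", classify(lemma))
```

Requiring a digit keeps abbreviations such as TV out of the formula class, at the cost of missing digit-free formulae such as NaCl. The binomial pattern will also accept any capitalised two-word lemma whose second word is lower-case, so check its output by hand.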

How to convert offset-pos to synsets

>>> from nltk.corpus import wordnet as wn
>>> wn.of2ss('14845743-n')
Synset('water.n.01')
>>> wn.of2ss('14845743-n').lemma_names()
['water', 'H2O']
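If you need to take the IDs apart yourself (for example to validate a badly formatted line), the conversion can be sketched as follows. The helper names split_offset_pos, join_offset_pos and to_synset are invented for this example; the only NLTK call used is synset_from_pos_and_offset, and to_synset assumes that nltk and its wordnet data are installed.

```python
def split_offset_pos(ss_id):
    """Split an ID like '14845743-n' into (offset, pos)."""
    offset, sep, pos = ss_id.rpartition("-")
    if not sep or not offset.isdigit() or pos not in {"n", "v", "a", "r", "s"}:
        raise ValueError(f"not an offset-pos ID: {ss_id!r}")
    return int(offset), pos


def join_offset_pos(offset, pos):
    """Rebuild the ID, padding the offset to eight digits."""
    return f"{offset:08d}-{pos}"


def to_synset(ss_id):
    """Look the ID up in NLTK's WordNet (needs nltk and its wordnet data)."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose
    offset, pos = split_offset_pos(ss_id)
    return wn.synset_from_pos_and_offset(pos, offset)


print(split_offset_pos("14845743-n"))
print(join_offset_pos(1740, "n"))
```

Raising an error on malformed IDs, rather than skipping them silently, makes formatting problems in the lists show up during the error analysis.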

Disambiguating Multi-Word Expressions

Multi-word expressions come in various types, from those where the meaning is fully predictable from the individual words (compositional: like colour television) to those where knowing the individual words does not allow you to predict the meaning (non-compositional: like red head "person with red hair").

Write a program that takes a list of multi-word nouns (you can expand to other parts of speech) from a dictionary and determines the meaning (i.e. the synset) of each element. If the expression is compositional (e.g. in English, is an N1 N2 a kind of N2?) you can get the meaning of the head by looking at the hypernyms. For other words, you can try to disambiguate by looking at the definitions and examples: in general, if you can find a close link, then that will be the right meaning. You can use the wordnet gloss corpus to get disambiguated definitions.
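The two heuristics described above can be sketched on toy data as below. Everything here is an assumption made for illustration: the sense labels, the glosses and the function names are invented, and word overlap between definitions is just one simple way of measuring a "close link".

```python
STOPWORDS = {"a", "an", "the", "of", "in", "that", "with", "for", "to", "or", "and", "is"}


def content_words(text):
    """Lower-cased words of a definition, minus punctuation and stopwords."""
    return {w.strip(".,;()").lower() for w in text.split()} - STOPWORDS


def head_sense(mwe_hypernyms, head_senses):
    """Return the first sense of the head found among the MWE's hypernyms."""
    for sense in head_senses:
        if sense in mwe_hypernyms:
            return sense
    return None


def best_sense(mwe_gloss, candidate_glosses):
    """Return (sense, overlap) for the gloss sharing most words with the MWE gloss."""
    target = content_words(mwe_gloss)
    scored = [(len(target & content_words(g)), s) for s, g in candidate_glosses.items()]
    overlap, sense = max(scored, key=lambda pair: pair[0])
    return (sense, overlap) if overlap > 0 else (None, 0)


# Toy example for "colour television"; labels and glosses are invented
mwe_gloss = "a television set that displays images in colour"
television_senses = {
    "television_set": "an electronic set that receives signals and displays images on a screen",
    "television_medium": "a system for broadcasting images over a distance",
}

print(head_sense(["television_set", "device"], list(television_senses)))
print(best_sense(mwe_gloss, television_senses))
```

Returning None when nothing links up matters for evaluation: for non-compositional expressions the right answer is often that no sense of the component fits.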

This is a hard task: for non-compositional words there will be no good sense, and even for compositional words accuracy is typically less than 70%.

Own Task

You can suggest your own task and, if I approve it, do that instead. This can fit in with other research you are doing, but it should not duplicate work done for other assessment: you must do something new for this class.


You should deliver a paper, the program and the output files.
