HG2051 Project Two

This assignment constitutes 30% of your final grade. Please work in groups of two or three, and submit a combined report by email to me at bond@ieee.org. If the results or data are too big to email, give them to me in person or send me a link.

Deadline: Friday, Nov 20 12:00 noon.

Pick one of the following:

Disambiguating Multi-Word Expressions

Multi-word expressions come in various types, from those where the meaning is fully predictable from the individual words (compositional: like colour television) to those where knowing the individual words does not allow you to predict the meaning (non-compositional: like red head "person with red hair").

Write a program that takes a list of multi-word nouns (you can expand to other parts of speech) from a dictionary and determines the meaning (i.e. the synset) of the elements. If the expression is compositional (e.g. in English, if an N1 N2 is a kind of N2), you can get the meaning of the head by looking at the hypernyms. For other words, you can try to disambiguate by looking at the definitions and examples: in general, if you can find a close link, then that will be the right meaning. You can use the WordNet gloss corpus to get disambiguated definitions.

This is a hard task: for non-compositional words there will be no good sense, and even for compositional words accuracy is typically less than 70%.
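
The two heuristics described above (checking the hypernyms, then falling back to overlap between definitions) can be sketched as follows. This is only a minimal illustration, not a reference solution: the sense inventory, the sense IDs, the stopword list and all function names are made up for the example, and in a real project the definitions and hypernym links would come from a wordnet rather than a hand-written dictionary.

```python
import re

# Hypothetical toy sense inventory standing in for a real wordnet.
# sense id -> (definition, list of direct hypernym sense ids)
SENSES = {
    "device.n.1": ("an instrument made for a particular purpose", []),
    "television.n.1": ("an electronic device that receives signals and displays pictures", ["device.n.1"]),
    "television.n.2": ("the industry of broadcasting programmes", []),
    "colour_television.n.1": ("a television that displays pictures in full colour", ["television.n.1"]),
    "person.n.1": ("a human being", []),
    "head.n.1": ("the upper part of the body containing the brain", []),
    "red_head.n.1": ("a person with red hair", ["person.n.1"]),
}

# Hypothetical lemma index: word -> candidate sense ids.
LEMMAS = {
    "television": ["television.n.1", "television.n.2"],
    "head": ["head.n.1"],
}

# A very small stopword list, chosen only for this example.
STOPWORDS = {"a", "an", "the", "of", "in", "that", "with", "by", "and", "to"}


def tokens(text):
    """Return the set of lower-cased content words in a definition."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def hypernym_closure(sense_id):
    """Collect every sense reachable by following hypernym links upwards."""
    seen, stack = set(), list(SENSES[sense_id][1])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SENSES[current][1])
    return seen


def head_by_hypernym(mwe_id, head):
    """Compositional case: a sense of the head that is a hypernym of the MWE."""
    ancestors = hypernym_closure(mwe_id)
    for candidate in LEMMAS.get(head, []):
        if candidate in ancestors:
            return candidate
    return None


def head_by_gloss(mwe_id, head):
    """Fallback: the head sense whose definition shares most words with the MWE's."""
    mwe_tokens = tokens(SENSES[mwe_id][0])
    best, best_score = None, 0
    for candidate in LEMMAS.get(head, []):
        score = len(mwe_tokens & tokens(SENSES[candidate][0]))
        if score > best_score:
            best, best_score = candidate, score
    return best  # None when no definition shares any content word


def disambiguate_head(mwe_id, head):
    """Try the hypernym check first, then definition overlap."""
    sense = head_by_hypernym(mwe_id, head)
    if sense is not None:
        return sense, "hypernym"
    return head_by_gloss(mwe_id, head), "gloss"


print(disambiguate_head("colour_television.n.1", "television"))
print(disambiguate_head("red_head.n.1", "head"))
```

In this toy data the compositional example is resolved through its hypernym, while the non-compositional one finds no link at all, which is the "no good sense" situation mentioned above. Plain word overlap is a crude measure; lemmatising the definitions or using sense-tagged glosses would be natural refinements.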

Part of Speech Tagging

This is a more straightforward task. Morphologically complex languages are much harder, so try to pick something relatively analytic.

NLTK has several POS taggers; you may want to try more than one and compare them.

If you have text from different genres, you may want to test the cross-domain accuracy.
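
One possible way to set up such a comparison is sketched below: two very simple taggers are trained on one genre and scored both on held-out text from the same genre and on text from a different one. Everything here is an assumption made for illustration: the tiny hand-tagged sentences, the tag names, and the two baseline taggers, which are written in plain Python so the example runs without any corpus. In the actual project you would substitute real tagged corpora and NLTK's own taggers (for example `nltk.DefaultTagger` and `nltk.UnigramTagger`).

```python
from collections import Counter, defaultdict

# Invented toy data: (word, tag) pairs, with made-up coarse tags.
train_news = [
    [("the", "DET"), ("minister", "NOUN"), ("announced", "VERB"),
     ("the", "DET"), ("budget", "NOUN")],
    [("the", "DET"), ("market", "NOUN"), ("fell", "VERB")],
    [("shares", "NOUN"), ("fell", "VERB")],
]
test_news = [
    [("the", "DET"), ("market", "NOUN"), ("announced", "VERB"),
     ("the", "DET"), ("budget", "NOUN")],
]
test_chat = [
    [("lol", "INTJ"), ("the", "DET"), ("budget", "NOUN"), ("sucks", "VERB")],
]


def most_common_tag(tagged_sents):
    """Return the single most frequent tag in the training data."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    return counts.most_common(1)[0][0]


def train_unigram(tagged_sents):
    """Map each lower-cased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def accuracy(tagger, gold_sents):
    """Fraction of tokens for which the tagger's tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tagger([word for word, _ in sent])
        for guess, (_, gold) in zip(predicted, sent):
            correct += guess == gold
            total += 1
    return correct / total


default = most_common_tag(train_news)
model = train_unigram(train_news)

# Each tagger takes a list of words and returns a list of tags.
taggers = {
    "default": lambda words: [default for _ in words],
    "unigram": lambda words: [model.get(w.lower(), default) for w in words],
}

results = {
    (name, domain): accuracy(tagger, data)
    for name, tagger in taggers.items()
    for domain, data in (("news", test_news), ("chat", test_chat))
}

for (name, domain), score in sorted(results.items()):
    print(f"{name:8s} {domain:5s} {score:.2f}")
```

Reporting every tagger on every test set, as in the `results` table here, makes the drop from in-domain to cross-domain accuracy easy to see. Whatever taggers you use, keep the test sentences out of the training data.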


You should deliver a paper, the program and the output files.
