HG2051 Project Two
This assignment constitutes 30% of your final grade. Please work
on this in groups of two or three, and submit a combined report by
email to me at firstname.lastname@example.org. If the results/data are too
big, give them to me in person or send me a link.
Deadline: Friday, Nov 20 12:00 noon.
Pick one of the following:
- Disambiguate the individual words in all (or most) MWEs in wordnet (★★)
- For machine translation in English return:
- For loose cannon
[("loose", None), ("cannon", None)]
- Possibly also give "unsure" or a confidence score
- Train and test a POS tagger for a new language (★)
- Find a corpus (not from NLTK) with POS tags
(e.g. from the pan-localization project)
- Read it in and convert it into a list of tuples (word, tag)
- Divide it into train/dev/test
- Train the best POS tagger that you can
- Discuss which POS tags/words are hard and why
- It could be that you find errors in the corpus: discuss how they could be fixed
- Bonus: convert to Universal POS and see if the accuracy improves.
- Choose your own task of equivalent difficulty (★ - ★★★)
- Find all examples of some phenomenon in a corpus semi-automatically and analyze them
- Serial verb constructions
- Taboo words (try the massive Enron corpus)
- Productive affixes (un-, -less, -ish)
- Identify loan words and their sources in the dictionary of a language
Run it by me early
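For the own-task option, "semi-automatically" usually means letting a program propose candidates that you then check by hand. The following is a minimal sketch for the productive-affix example, assuming a lowercase, alphabetic token list; the function name affix_candidates and the rule "keep a word only if its stripped base also occurs in the corpus" are illustrative assumptions, not a prescribed method.

```python
import re
from collections import defaultdict

# One pattern per affix; the base must be at least three letters long
# so that words like "fish" are not proposed as f + -ish.
AFFIX_PATTERNS = {
    "un-": re.compile(r"^un(?P<base>[a-z]{3,})$"),
    "-less": re.compile(r"^(?P<base>[a-z]{3,})less$"),
    "-ish": re.compile(r"^(?P<base>[a-z]{3,})ish$"),
}

def affix_candidates(tokens):
    """Return {affix: sorted candidate words} for manual inspection.

    A word is kept only if the base left after stripping the affix
    is itself attested in the token list (a crude filter, assumed here
    purely for illustration).
    """
    vocab = {t.lower() for t in tokens}
    found = defaultdict(set)
    for word in vocab:
        for affix, pattern in AFFIX_PATTERNS.items():
            match = pattern.match(word)
            if match and match.group("base") in vocab:
                found[affix].add(word)
    return {affix: sorted(words) for affix, words in found.items()}

tokens = "the unhappy child is happy unless the childish fish finish".split()
print(affix_candidates(tokens))
```

The attested-base filter misses forms with spelling changes (reddish, penniless) and keeps some false positives, which is why the candidates still need manual analysis.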
Disambiguating Multi-Word Expressions
Multi-word expressions come in various types, from those where the
meaning is fully predictable from the individual words
(compositional: like colour television) to those where
knowing the individual words does not allow you to predict the meaning
(non-compositional: like red head "person with red hair").
Write a program that takes a list of multi-word nouns (you can
expand to other parts-of-speech) from a dictionary and determines the
meaning (i.e. the synset) of the elements. If it is compositional
(e.g. in English, is N1 N2 an N2) you can get the
meaning of the head by looking at the hypernyms. For other words, you
can try to disambiguate by looking at the definitions and examples: in
general, if you can find a close link, then that will be the right
meaning. You can use the wordnet gloss corpus to get disambiguated definitions.
This is a hard task: for non-compositional words there will be
no good sense, and even for compositional words accuracy is typically
less than 70%.
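The two heuristics described above (hypernyms for the head, gloss overlap for the other words) can be sketched as follows. This is only an illustration under strong assumptions: the sense inventory, synset names, glosses and hypernym links are all invented toy data standing in for a real wordnet, and the function names and the min_overlap threshold are hypothetical choices.

```python
# Toy data: every synset name, gloss and hypernym link below is invented
# for illustration; a real run would read them from a wordnet.
GLOSSES = {
    "colour_television.n.01": "a television device that shows pictures in full colour such as red or blue",
    "television.n.01": "an electronic device that receives broadcast signals and shows pictures",
    "television.n.02": "the broadcasting industry that transmits programmes",
    "colour.n.01": "a visual property such as red or blue that pictures and objects show",
    "colour.n.02": "a flag that identifies a regiment",
    "loose_cannon.n.01": "a person who is unpredictable and likely to cause damage",
    "loose.a.01": "not tight or not fastened",
    "cannon.n.01": "a large heavy gun mounted on wheels",
}
HYPERNYMS = {
    "colour_television.n.01": ["television.n.01"],
    "television.n.01": ["device.n.01"],
    "loose_cannon.n.01": ["person.n.01"],
}
SENSES = {
    "colour": ["colour.n.01", "colour.n.02"],
    "television": ["television.n.01", "television.n.02"],
    "loose": ["loose.a.01"],
    "cannon": ["cannon.n.01"],
}
STOPWORDS = {"a", "an", "the", "that", "in", "such", "as", "or", "and",
             "of", "to", "is", "who", "on", "not"}

def content_words(text):
    """Lowercased gloss words with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def ancestors(synset):
    """All synsets reachable by following hypernym links upwards."""
    seen, stack = set(), list(HYPERNYMS.get(synset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(HYPERNYMS.get(current, []))
    return seen

def disambiguate(mwe_synset, words, min_overlap=2):
    """Return [(word, synset or None), ...] for the words of one MWE."""
    up = ancestors(mwe_synset)
    mwe_gloss = content_words(GLOSSES[mwe_synset])
    result = []
    for i, word in enumerate(words):
        candidates = SENSES.get(word, [])
        # Heuristic 1: the head (last word) sense is a hypernym of the MWE.
        if i == len(words) - 1:
            via_hypernym = [s for s in candidates if s in up]
            if via_hypernym:
                result.append((word, via_hypernym[0]))
                continue
        # Heuristic 2: pick the sense whose gloss shares most words
        # with the MWE gloss; give up (None) below the threshold.
        best, best_score = None, 0
        for sense in candidates:
            score = len(mwe_gloss & content_words(GLOSSES[sense]))
            if score > best_score:
                best, best_score = sense, score
        result.append((word, best if best_score >= min_overlap else None))
    return result

print(disambiguate("colour_television.n.01", ["colour", "television"]))
print(disambiguate("loose_cannon.n.01", ["loose", "cannon"]))
```

Returning None when no sense clears the threshold keeps the non-compositional cases explicit; the overlap score could also be reported as a rough confidence value.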
Part of Speech Tagging
This is a more straightforward task. Morphologically complex
languages are much harder, so try to pick something relatively analytic.
NLTK has several POS taggers; you may want to try more than one and compare them.
If you have text from different genres, you may want to test the cross-domain accuracy.
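As a starting point, the pipeline of reading a tagged corpus, splitting it, training and evaluating can be sketched in plain Python. This is a baseline sketch only, under several assumptions: the corpus is taken to be one sentence per line in word/TAG format, the sample text is invented, and the tagger is a simple most-frequent-tag lookup rather than any particular NLTK tagger. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def read_tagged(text, sep="/"):
    """Parse 'word/TAG word/TAG' lines into sentences of (word, tag) tuples.
    Assumes every token contains the separator."""
    sentences = []
    for line in text.strip().splitlines():
        pairs = [tuple(token.rsplit(sep, 1)) for token in line.split()]
        if pairs:
            sentences.append(pairs)
    return sentences

def split_corpus(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle reproducibly and split into train/dev/test by sentence."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    a = int(len(shuffled) * train)
    b = a + int(len(shuffled) * dev)
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def train_baseline(train_sents):
    """Most-frequent-tag tagger; unknown words get the commonest tag overall."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
            tag_totals[tag] += 1
    default = tag_totals.most_common(1)[0][0]
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lexicon.get(word, default)

def evaluate(tagger, sents):
    """Return (accuracy, Counter of (gold, predicted) confusions)."""
    correct = total = 0
    errors = Counter()
    for sent in sents:
        for word, gold in sent:
            predicted = tagger(word)
            total += 1
            if predicted == gold:
                correct += 1
            else:
                errors[(gold, predicted)] += 1
    return (correct / total if total else 0.0), errors

# Invented sample text, for illustration only.
SAMPLE = """the/DET dog/NOUN barks/VERB
cats/NOUN see/VERB the/DET dog/NOUN
the/DET cats/NOUN sleep/VERB
dogs/NOUN see/VERB cats/NOUN"""

train, dev, test = split_corpus(read_tagged(SAMPLE))
tagger = train_baseline(train)
accuracy, errors = evaluate(tagger, dev + test)
print(accuracy, errors.most_common(5))
```

The confusion counts returned by evaluate are a convenient way into the discussion of which tags and words are hard. Training on one genre and calling evaluate on another gives a cross-domain figure in the same way.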
You should deliver a paper, the program and the output files.
- The deliverable is a paper of no more than twelve pages
including diagrams, with up to three pages of references, formatted
according to the ACL 2013 format, following the Linguistic Style Guidelines.
However, do not make your paper anonymous: put your
name, matriculation number and email address under the paper title.
- Include your entire program (including internal
documentation) as an appendix: this does not count
toward the 15-page total.
- Include quantitative results
- Include representative examples
- For this assignment, it is permissible not to gloss languages
you cannot read (but try to if you can)
- You should properly reference everything, including but not limited to:
- The resources you use (wordnets)
- The languages you investigate
- The program should be executable by me (possibly with some external libraries).