This assignment constitutes 30% of your final grade for HG2051. Please work on the final program and report individually.
Deadline: Oct 19 17:00
It is often the case that languages have two (or more) orthographic variants for the same word. For example, in English, we have colour and color, reflecting differences in usage between the UK (and many other countries) and the US. Sometimes differences come from incomplete lexicalization: e-mail, email; data-base, database. Sometimes these are regular (colour, color; authour, author) and sometimes not (gaol, jail).
Choose a language other than English and write a program that compares the words within each synset, identifies similar ones and categorizes them in some way. Use the data from the Open Multilingual Wordnet, through the NLTK interface. Note that you must have downloaded the omw.zip corpus through nltk.download(); it is in the corpus section.
Your output should look something like this:
synset ↹ var1 ↹ var2 ↹ category
color.n.01 ↹ colour ↹ color ↹ us/gb:u
electronic_mail.n.01 ↹ e-mail ↹ email ↹ hyphen-none
always.r.01 ↹ ever ↹ e'er ↹ v-quote
↹ is TAB ('\t')
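As a rough illustration of how the category column might be filled in, here is a minimal sketch that uses only the three English examples above. The function names categorize and format_row are hypothetical, and the rules are assumptions made for illustration; the rules for your chosen language will differ and need to be worked out from its data.

```python
def categorize(w1, w2):
    """Return a rough category label for two distinct spellings.

    The rules below are illustrative assumptions based only on the
    three sample rows; they are not a complete or recommended scheme.
    """
    # Same word once hyphens are removed, e.g. e-mail / email
    if w1.replace("-", "") == w2.replace("-", ""):
        return "hyphen-none"
    # "ou" in one spelling corresponds to "o" in the other, e.g. colour / color
    if w1.replace("ou", "o") == w2 or w2.replace("ou", "o") == w1:
        return "us/gb:u"
    # "v" in one spelling corresponds to an apostrophe in the other, e.g. ever / e'er
    if w1.replace("v", "'") == w2 or w2.replace("v", "'") == w1:
        return "v-quote"
    # Fallback when no rule applies
    return "different"


def format_row(synset_name, w1, w2):
    """Build one TAB-separated output line: synset, var1, var2, category."""
    return "\t".join([synset_name, w1, w2, categorize(w1, w2)])


if __name__ == "__main__":
    print(format_row("color.n.01", "colour", "color"))
    print(format_row("electronic_mail.n.01", "e-mail", "email"))
    print(format_row("always.r.01", "ever", "e'er"))
```

Keeping the rules in one function makes it easy to add or reorder them as you discover new patterns in your language.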
Here is some scaffolding to help you get started:

    import nltk
    from nltk.corpus import wordnet as wn
    from nltk.metrics import edit_distance

    lng = 'eng'
    maxdist = 2

    ## get all the synsets
    all_synsets = list(wn.all_synsets())

    vars = []
    for ss in all_synsets[:1000]:
        # check synsets
        lemmas = ss.lemma_names(lang=lng)
        for l1 in lemmas:
            for l2 in lemmas:
                ### check if they are similar
                if l1 != l2 and edit_distance(l1, l2) < maxdist:
                    # try to categorize
                    cat = 'different'
                    if l1.replace('ise', "ize") == l2:
                        cat = 'us/gb:ize'
                    # store the result
                    vars.append((ss, l1, l2, cat))

    # utf-8 so that non-English lemmas are written correctly
    out = open('variants.txt', mode='w', encoding='utf-8')
    for (ss, l1, l2, cat) in vars:
        print("\t".join([ss.name(), l1, l2, cat]), file=out)
    ### close the file
    out.close()
Here is the code: wn-var.py
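Note that the nested loops in the scaffolding visit every pair twice, once as (l1, l2) and once as (l2, l1). The following is a minimal sketch, assuming the order within a pair does not matter to you, of visiting each unordered pair once with itertools.combinations. The names levenshtein and similar_pairs are hypothetical; levenshtein is a plain-Python stand-in for nltk.metrics.edit_distance so that the sketch runs without NLTK, and in your program you would probably keep the NLTK function.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar_pairs(lemmas, maxdist=2):
    """Yield each unordered pair of distinct lemmas with distance < maxdist.

    The strict "<" mirrors the comparison used in the scaffolding.
    """
    for w1, w2 in combinations(sorted(set(lemmas)), 2):
        if levenshtein(w1, w2) < maxdist:
            yield (w1, w2)


if __name__ == "__main__":
    # Hand-written stand-in for ss.lemma_names(lang=lng)
    for pair in similar_pairs(["colour", "color", "hue"]):
        print(pair)
```

Sorting the lemmas first makes the output order stable between runs, which helps when comparing results after you change your rules.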
You should try to improve this in various ways, for example:
Your write-up should talk a little about why you think the variants exist (loans, historical relatedness, or some other reason).
See the Open Multilingual Wordnet for more information about how these were built and the format they are in, as well as how to cite them.
Assignment One for HG2051 2017