Investigating Orthographic Variation

This assignment constitutes 30% of your final grade for HG2051. Please work on the final program and report individually.

Deadline: Oct 19 17:00

Different spellings with similar meanings

It is often the case that languages have two (or more) orthographic variants for the same word. For example, in English, we have colour and color, reflecting differences in usage between the UK (and many other countries) and the US. Sometimes differences come from incomplete lexicalization: e-mail, email; data-base, database. Sometimes these differences are regular (colour, color; authour, author) and sometimes they are not (gaol, jail).

Choose a language other than English and write a program that compares the words within each synset, identifies similar ones, and categorizes them in some way. Use the data from the Open Multilingual Wordnet, through the NLTK interface. Note that you must first download the corpus through nltk.download(); it is in the corpus section.

Your output should look something like this:

synset ↹ var1 ↹ var2 ↹ category
color.n.01 ↹ colour ↹ color ↹ us/gb:u
electronic_mail.n.01 ↹ e-mail ↹ email ↹ hyphen-none
always.r.01 ↹ ever ↹ e'er ↹ v-quote

↹ is TAB ('\t')
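
As a rough illustration of the output format above, the following sketch writes the three example rows as tab-separated lines. It uses Python's standard csv module rather than a manual join; the function name write_rows and the in-memory buffer are assumptions made for this example, not part of the assignment.

```python
import csv
import io


def write_rows(rows, out):
    """Write (synset, var1, var2, category) tuples as tab-separated lines.

    write_rows is a hypothetical helper for this sketch.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["synset", "var1", "var2", "category"])  # header line
    writer.writerows(rows)


# The example rows from the assignment text.
rows = [
    ("color.n.01", "colour", "color", "us/gb:u"),
    ("electronic_mail.n.01", "e-mail", "email", "hyphen-none"),
    ("always.r.01", "ever", "e'er", "v-quote"),
]

buffer = io.StringIO()  # stands in for a real output file
write_rows(rows, buffer)
print(buffer.getvalue(), end="")
```

Note that csv.writer quotes any field that itself contains a tab or a double quote, which a plain "\t".join would not do; either approach is probably fine for wordnet lemmas.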

Here is some scaffolding to help you get started:

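import nltk

from nltk.corpus import wordnet as wn
from nltk.metrics import edit_distance

lng = 'jpn'     # example value: code of the language you chose
maxdist = 3     # example value: pairs closer than this count as similar

## get all the synsets
all_synsets = list(wn.all_synsets())

variants = []
for ss in all_synsets[:1000]: # check only the first 1000 synsets
    lemmas = ss.lemma_names(lang=lng)
    for l1 in lemmas:
        for l2 in lemmas:
            ### check if they are similar
            if l1 != l2 and edit_distance(l1, l2) < maxdist:
                # try to categorize
                cat = 'different'
                if l1.replace('ise', "ize") == l2:
                    cat = 'us/gb:ize'
                # store the result
                variants.append((ss, l1, l2, cat))

### write one tab-separated line per pair
out = open('variants.tsv', 'w', encoding='utf-8')  # example file name
for (ss, l1, l2, cat) in variants:
    print("\t".join([ss.name(), l1, l2, cat]), file=out)
### close the file
out.close()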
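
One possible way to go beyond the single hard-coded 'ise'/'ize' check in the scaffolding above is to derive a category label from the characters that actually differ. The sketch below does this with difflib from the Python standard library (not NLTK's edit_distance), and it also yields each pair only once instead of in both orders. The names describe_difference and variant_pairs, and the use of ∅ for an empty segment, are assumptions made for this illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def describe_difference(w1, w2):
    """Return a label naming the segments that differ, e.g. 'u/∅'.

    Hypothetical helper: each differing span is shown as
    '<part of w1>/<part of w2>', with ∅ marking an empty part.
    """
    matcher = SequenceMatcher(None, w1, w2, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            parts.append(f"{w1[i1:i2] or '∅'}/{w2[j1:j2] or '∅'}")
    return ",".join(parts) or "same"


def variant_pairs(lemmas):
    """Return each unordered pair of distinct lemmas exactly once."""
    return combinations(sorted(set(lemmas)), 2)


# Small demonstration on English examples from the assignment text.
for l1, l2 in variant_pairs(["colour", "color", "colour"]):
    print(l1, l2, describe_difference(l1, l2))
```

Labels produced this way are only raw material: grouping them into meaningful categories (hyphenation, script differences, regular spelling alternations) still needs language-specific judgement.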

Here is the code:

You should try to improve this in various ways, for example:

Your write-up should briefly discuss why you think the variants exist (loans, historical relatedness, or other reasons).


See the Open Multilingual Wordnet for more information about how these were built and the format they are in, as well as how to cite them.


Assignment One for HG2051 2017