5. Lexical Resources and WordNet.

Lecture notes

Further reading

Before Class (code, output)

  1. Wordlist Corpora
    Find the 50 most frequent words (see Week 2) in Jane Austen's Emma.
    Then find the 50 most frequent words that are not stopwords.
  2. A Pronouncing Dictionary
    Most words in the dictionary have a first phonetic code that matches their first letter.
    E.g.: for ('fir', ['F', 'ER1']), 'f' is the same as 'F'.
    Sometimes they do not match.
    E.g.: for ('yves', ['IY1', 'V']), 'y' is not the same as 'IY1'.
    What proportion of the words start with the same code?
    What are some common mismatches?
  3. Start a WordNet browser by doing any one of the following:
    1. Use NTU's online Open Multilingual Wordnet
    2. Download and install WordNet 2.1 for Windows
    3. Use Princeton's online WordNet 3.0 Search
    4. Use NLTK's WordNet browser from Python by running:
            >>> import nltk
            >>> nltk.app.wordnet()
          
      Note: this did not work on my machine (FCB, 2017)
    Whatever you use to access WordNet, try the following:
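
For the Pronouncing Dictionary questions in item 2 above, one possible approach is sketched below. The function name first_code_stats and the small sample list are assumptions made for illustration: only the 'fir' and 'yves' entries come from the text above, and the other two were typed in by hand. Stress digits are stripped before comparing, which is one reading of "same code" rather than the only one.

```python
from collections import Counter

def first_code_stats(entries):
    """Compare each word's first letter with its first phonetic code.

    entries: iterable of (word, phones) pairs, e.g. ('fir', ['F', 'ER1']).
    Returns (proportion of matches, Counter of (letter, code) mismatches).
    """
    matches = 0
    total = 0
    mismatches = Counter()
    for word, phones in entries:
        if not word or not phones:
            continue  # skip empty entries rather than guess
        total += 1
        letter = word[0].upper()
        first_code = phones[0].rstrip("012")  # drop the stress digit, if any
        if letter == first_code:
            matches += 1
        else:
            mismatches[(letter, first_code)] += 1
    proportion = matches / total if total else 0.0
    return proportion, mismatches

# Hand-made sample in the same (word, phones) format (illustrative only).
sample = [
    ("fir", ["F", "ER1"]),
    ("yves", ["IY1", "V"]),
    ("cat", ["K", "AE1", "T"]),
    ("dog", ["D", "AO1", "G"]),
]
proportion, mismatches = first_code_stats(sample)
print(proportion)
print(mismatches.most_common(5))

# With NLTK and its cmudict data installed, the full dictionary could be
# passed in instead of the sample:
#   import nltk
#   proportion, mismatches = first_code_stats(nltk.corpus.cmudict.entries())
```

Keeping the counting separate from the data source means the same function can be tried on a handful of entries before running it over the whole dictionary.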

Practical work (code, output)

  1. Find the 50 most frequent bigrams (see Week 6) in Jane Austen's Emma.
    Then find the 50 most frequent bigrams that do not include a stopword.
  2. A Pronouncing Dictionary
    Find python in the Pronouncing Dictionary and print its pronunciation.
    Find marathon and print its pronunciation.
    Find all the words whose last syllable rhymes with python.
    Find all the words whose last syllable rhymes with marathon.
    ★ Write a function that converts ARPAbet to IPA.
    Use it to print the pronunciations of python and marathon.
  3. Load WordNet inside Python.
  4. ★ Tabulate the average polysemy per word length for all words in WordNet, and then separately for each part of speech. (Hint: polysemy is the number of synsets per word; you can get all words with [w for w in wn.all_lemma_names()]; for just nouns, use [w for w in wn.all_lemma_names('n')].)
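
As a starting point for the polysemy table, the grouping step might be sketched as follows. The helper name average_polysemy_by_length and the toy sense counts are assumptions for illustration (the counts are made up, not real WordNet figures). The WordNet calls are shown only in comments, so the block runs without NLTK.

```python
from collections import defaultdict

def average_polysemy_by_length(words, count_senses):
    """Return {word length: average number of senses} for the given words.

    words: iterable of strings.
    count_senses: function mapping a word to its number of synsets.
    """
    totals = defaultdict(lambda: [0, 0])  # length -> [sense total, word count]
    for word in words:
        entry = totals[len(word)]
        entry[0] += count_senses(word)
        entry[1] += 1
    return {length: s / n for length, (s, n) in sorted(totals.items())}

# Toy data with made-up sense counts, just to exercise the function.
toy_senses = {"dog": 2, "cat": 4, "bank": 5, "python": 3}
table = average_polysemy_by_length(toy_senses, toy_senses.get)
for length, average in table.items():
    print(f"{length:>6} {average:6.2f}")

# With NLTK and its WordNet data installed, something along these lines
# should work (None means all parts of speech):
#   from nltk.corpus import wordnet as wn
#   for pos in (None, 'n', 'v'):
#       words = set(wn.all_lemma_names(pos))
#       table = average_polysemy_by_length(
#           words, lambda w: len(wn.synsets(w, pos)))
```

Note that multiword lemma names contain underscores, so their length here includes those characters; whether to count or strip them is a choice left to the exercise.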

★ These problems are only for overachievers :-)
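
For the ★ ARPAbet exercise, a minimal sketch of the conversion is given below. It makes some simplifying assumptions: stress digits are dropped rather than rendered as IPA stress marks (placing those correctly would need syllable boundaries), and unstressed AH and ER are written as ə and ɚ. The names ARPABET_TO_IPA, REDUCED and arpabet_to_ipa are illustrative, and the two phone lists were typed in by hand and are assumed to match the dictionary entries.

```python
# Mapping from ARPAbet codes (without stress digits) to IPA symbols.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "ɡ",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

# Vowels that are usually written differently when unstressed (digit 0).
REDUCED = {"AH": "ə", "ER": "ɚ"}

def arpabet_to_ipa(phones):
    """Convert a list of ARPAbet codes, e.g. ['F', 'ER1'], to an IPA string."""
    symbols = []
    for phone in phones:
        base = phone.rstrip("012")   # code without the stress digit
        stress = phone[len(base):]   # '', '0', '1' or '2'
        if stress == "0" and base in REDUCED:
            symbols.append(REDUCED[base])
        else:
            symbols.append(ARPABET_TO_IPA[base])
    return "".join(symbols)

# Phone lists assumed to match the dictionary entries for these words.
print(arpabet_to_ipa(["P", "AY1", "TH", "AA0", "N"]))
print(arpabet_to_ipa(["M", "EH1", "R", "AH0", "TH", "AA2", "N"]))

# With NLTK and its cmudict data installed, the phones could be looked up:
#   import nltk
#   prons = nltk.corpus.cmudict.dict()
#   print(arpabet_to_ipa(prons["python"][0]))
```

An unknown code raises a KeyError, which seems preferable to silently skipping a sound.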


HG251: Language and the Computer Francis Bond.