4. NLTK Text Corpora and Conditional Frequencies.

Lecture notes

Further reading

Before Class (code, output)

  1. The Gutenberg Corpus
    1. How many words (tokens) are there in Jane Austen's novel Persuasion?
    2. How many times does the word persuasion occur?
    3. Make a concordance for persuasion in the novel.
    4. How many letters (including punctuation and spaces) are there in the novel?
    5. How many sentences are there? Find and print the longest sentence.
  2. The Brown Corpus
    1. Make a frequency distribution for the "news" category of the Brown Corpus.
    2. Print counts of the modal verbs can, could, may, might, must, will.
    3. Print counts of the wh- words what, when, where, who, how, why.
    4. Make a frequency distribution for the "romance" category.
    5. Print counts of the modal verbs and the wh- words.
    6. Compare the counts for the two genres. Are there any clear differences?
  3. Take a look at words from the Swadesh list.
    1. NLTK has Swadesh lists for many languages; list them with nltk.corpus.swadesh.fileids()
      You can access the list for a single language, e.g. English, with nltk.corpus.swadesh.words('en')
      The file ids are language codes (e.g. 'en' for English)
      * Choose any language and print out the list, one entry per line
    2. Choose any three languages, making sure you know one of them
      * Print them in parallel, joined by semicolons, e.g.:
      I; eu; ich
      you (singular), thou; tu du; du, Sie
      he; el, ea; er
      ...

      Hint: you can loop over a range of numbers and use each number as a list index
    3. Think about how you could test for similarity.
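
The hint about a range of numbers and list indexing can be sketched as follows. This is only an illustration under stated assumptions, not the course solution: the three short lists are hand-typed stand-ins (copied from the example rows above) for what nltk.corpus.swadesh.words() would return, and parallel_lines is a hypothetical helper name.

```python
def parallel_lines(*word_lists):
    """Return one line per entry, joining the aligned words with '; '."""
    # Swadesh lists are aligned by position, so index i refers to the same
    # concept in every list; stop at the shortest list to stay in range.
    n = min((len(words) for words in word_lists), default=0)
    return ["; ".join(words[i] for words in word_lists) for i in range(n)]


# Hand-typed stand-ins (assumption): with NLTK installed these would come
# from nltk.corpus.swadesh.words('en'), words('de') and so on.
english = ["I", "he"]
other = ["eu", "el, ea"]
german = ["ich", "er"]

for line in parallel_lines(english, other, german):
    print(line)
```

Returning the lines instead of printing inside the helper keeps it easy to check and to reuse with any number of languages.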

Practical work (code, output)

  1. Comparing authors
    1. Make a frequency distribution for Jane Austen's novel Persuasion.
    2. Print counts of the modal verbs can, could, may, might, must, will.
    3. Print counts of the personal pronouns he, him, himself, she, her, herself.
    4. Make a frequency distribution for Herman Melville's novel Moby Dick.
    5. Print counts of the modal verbs and the personal pronouns.
    6. Compare the counts for the two authors. Are there any clear differences?
    7. Now try to do the same thing with a single Conditional Frequency Distribution
      Hint: make a list of [(novel1, word1), (novel1, word2), ..., (novel2, word1), (novel2, word2), ...] to feed to nltk.ConditionalFreqDist()
  2. Measure the similarity of two languages' Swadesh lists
    1. Hint: measure the similarity of each pair of words,
      e.g. as the number of distinct letters the two words share, divided by the number of distinct letters in either word:
      'thou' and 'tu' is |tu|/|thou| = 2/4 = 0.5
      'we' and 'wir' is |w|/|weir| = 1/4 = 0.25
    2. If there is more than one word for a given entry
      • take the first one from each side [easy]
      • take the maximum score [harder]
    3. Sum these scores over all entries in the lists
    4. Do you think you can get a better similarity score?
  3. Find the pair of languages that are most similar according to the Swadesh lists.
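
One possible way to put the similarity hints together is sketched below. It is a minimal sketch under stated assumptions: the function names are hypothetical, entries with several words are assumed to be comma-separated, and the tiny toy dictionary (words taken from the examples on this page) stands in for the real lists from nltk.corpus.swadesh.words().

```python
from itertools import combinations


def word_similarity(a, b):
    """Shared distinct letters divided by all distinct letters in either word."""
    letters_a, letters_b = set(a.lower()), set(b.lower())
    union = letters_a | letters_b
    return len(letters_a & letters_b) / len(union) if union else 0.0


def entry_similarity(entry_a, entry_b, use_max=False):
    """Score one entry; an entry may hold several comma-separated words."""
    words_a = [w.strip() for w in entry_a.split(",")]
    words_b = [w.strip() for w in entry_b.split(",")]
    if not use_max:
        # [easy] compare only the first word on each side
        return word_similarity(words_a[0], words_b[0])
    # [harder] take the best score over every pairing of alternatives
    return max(word_similarity(a, b) for a in words_a for b in words_b)


def list_similarity(list_a, list_b, use_max=False):
    """Sum the entry scores over two position-aligned lists."""
    return sum(entry_similarity(a, b, use_max) for a, b in zip(list_a, list_b))


def most_similar_pair(lists, use_max=False):
    """Return the pair of language codes with the highest summed score."""
    return max(
        combinations(sorted(lists), 2),
        key=lambda pair: list_similarity(lists[pair[0]], lists[pair[1]], use_max),
    )


print(word_similarity("thou", "tu"))  # 0.5
print(word_similarity("we", "wir"))   # 0.25

# Toy stand-in for {code: nltk.corpus.swadesh.words(code)} (assumption).
toy_lists = {
    "en": ["I", "thou", "he"],
    "de": ["ich", "du", "er"],
    "other": ["eu", "tu", "el, ea"],
}
print(most_similar_pair(toy_lists))
```

Note that this letter-set score also counts spaces and brackets in entries such as "you (singular)", and ignores letter order and repeated letters, which is one place to look for a better similarity score.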

    Extension problem (no need to do it if you don't want to)

  4. Using IDLE as an editor, as shown in More Python: Reusing Code, write a Python program generate.py to do the following.
    1. In Generating Random Text with Bigrams, a function generate_model() is defined. Copy this function definition exactly as shown.
      def generate_model(cfdist, word, num=15):
          for i in range(num):
              print(word, end=' ')
              word = cfdist[word].max()
      
    2. Make a conditional frequency distribution of all the bigrams in Jane Austen's novel Emma, like this:
          emma_text = nltk.corpus.gutenberg.words('austen-emma.txt')
          emma_bigrams = nltk.bigrams(emma_text)
          emma_cfd = nltk.ConditionalFreqDist(emma_bigrams)
        
    3. Try to generate 100 words of random Emma-like text:
          generate_model(emma_cfd, 'The', 100)
        
    4. To avoid getting stuck in a loop, the generation function needs to choose at random among the possible continuation words. Replace the line that picks cfdist[word].max() with:
          words = list(cfdist[word])
          word = random.choice(words)
        
    5. Before using functions from the random module, your program needs:
      import random
        
    6. Now try again to generate 100 words of random Emma-like text:
          generate_model(emma_cfd, 'The', 100)
        
      Repeat this several times to check if the texts are random.
    7. Make a conditional frequency distribution of all the bigrams in Melville's novel Moby Dick, like this:
          moby_text = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
          moby_bigrams = nltk.bigrams(moby_text)
          moby_cfd = nltk.ConditionalFreqDist(moby_bigrams)
        
    8. Now generate 100 words of random Moby Dick-like text:
          generate_model(moby_cfd, 'The', 100)
        
      Repeat this several times to check if the texts are random.
    9. Can you observe different styles in the two types of generated texts?
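
The random-choice version of the generator can be tried without downloading any corpus, as in the sketch below. It rests on several assumptions: build_cfd is a hypothetical stand-in that mimics nltk.ConditionalFreqDist(nltk.bigrams(tokens)) with a dictionary of Counters, generate_words is a hypothetical variant of generate_model() that returns the words instead of printing them, and the token list is a made-up toy sentence rather than a novel.

```python
import random
from collections import Counter, defaultdict


def build_cfd(tokens):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        cfd[first][second] += 1
    return cfd


def generate_words(cfdist, word, num=15, rng=random):
    """Return up to num words, choosing each continuation at random."""
    output = []
    for _ in range(num):
        output.append(word)
        choices = list(cfdist[word])  # distinct words seen after `word`
        if not choices:
            break  # dead end: nothing ever followed this word
        word = rng.choice(choices)
    return output


# Toy token list (assumption); with NLTK this would be
# nltk.corpus.gutenberg.words('austen-emma.txt').
tokens = "the cat sat on the mat and the dog sat on the cat".split()
cfd = build_cfd(tokens)
print(" ".join(generate_words(cfd, "the", 20)))
```

Because the choice is made from the list of distinct continuations, every follower is equally likely regardless of how often it occurred; the same function should also accept a real nltk.ConditionalFreqDist, since that behaves like a dictionary of counters.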

HG2051: Language and the Computer Francis Bond.