12. Summary and Review

For revision: reread chapters 1-6, redo the tutorial problems.

NLTK Chapters: 1-6


  1. Strings, lists and tuples are different kinds of sequence object, supporting common operations such as indexing, slicing, len(), sorted(), and membership testing using in.
  2. Texts are represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing, and the len() function on lists.
  3. A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".
  4. Strings have some special tests associated with them:
  5. Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python'].
  6. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.
  7. A dictionary is used to map between arbitrary types of information, such as a string and a number:
    freq['cat'] = 12.
  8. We create dictionaries using the brace notation:
    pos = {}
    pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}
  9. A tuple consists of a number of values separated by commas:
    t = 12345, 54321, 'hello!'
  10. Tuples are normally enclosed in parentheses ()
  11. List comprehensions provide a concise way to create lists. The resulting list definition tends often to be clearer than lists built using those constructs. Each list comprehension consists of an expression followed by a for clause, then zero or more for or if clauses. The result will be a list resulting from evaluating the expression in the context of the for and if clauses which follow it. If the expression would evaluate to a tuple, it must be parenthesized.
    to get nouns from a tagged corpus: nouns = [ w for (w, t) in tagged_text if t.startswith('N')]
  12. A string formatting expression template % arg_tuple consists of a format string template that contains conversion specifiers like %-6s and %0.2d.
    print "Tokens = %d; Types = %d; Tokens/Type = %0.2d" % (len(words), len(set(words), 1.0*len(words)/len(set(words)))

Iterating over sequences

Python Expression Comment
for item in s iterate over the items of s
for item in sorted(s) iterate over the items of s in order
for item in set(s) iterate over unique elements of s
for item in reversed(s) iterate over elements of s in reverse
for item in set(s).difference(t) iterate over elements of s not in t
for item in random.shuffle(s) iterate over elements of s in random order

List Methods

Operation Result
s[i] = x item i of s is replaced by x
s[i:j] = t slice of s from i to j is replaced by the contents of the iterable t
del s[i:j] same as s[i:j] = []
s[i:j:k] = t the elements of s[i:j:k] are replaced by those of t
del s[i:j:k] removes the elements of s[i:j:k] from the list
s.append(x) same as s[len(s):len(s)] = [x]
s.extend(x) same as s[len(s):len(s)] = x
s.count(x) return number of i‘s for which s[i] == x
s.index(x[, i[, j]]) return smallest k such that s[k] == x and i <= k < j
s.insert(i, x) same as s[i:i] = [x]
s.pop([i]) same as x = s[i]; del s[i]; return x
s.remove(x) same as del s[s.index(x)]
s.reverse() reverses the items of s in place
s.sort([cmp[, key[, reverse]]]) sort the items of s in place

String Methods

Method Functionality
s.find(t) index of first instance of string t inside s (-1 if not found)
s.rfind(t) index of last instance of string t inside s (-1 if not found)
s.index(t) like s.find(t) except it raises ValueError if not found
s.rindex(t) like s.rfind(t) except it raises ValueError if not found
s.join(text) combine the words of the text into a string using s as the glue
s.split(t) split s into a list wherever a t is found (whitespace by default)
s.splitlines() split s into a list of strings, one per line
s.lower() a lowercased version of the string s
s.upper() an uppercased version of the string s
s.title() a titlecased version of the string s
s.strip() a copy of s without leading or trailing whitespace
s.replace(t, u) replace instances of t with u inside s

String Comparison Operators

Function Meaning
s.startswith(t) test if s starts with t
s.endswith(t) test if s ends with t
t in s test if t is contained inside s
s.islower() test if all cased characters in s are lowercase
s.isupper() test if all cased characters in s are uppercase
s.isalpha() test if all characters in s are alphabetic
s.isalnum() test if all characters in s are alphanumeric
s.isdigit() test if all characters in s are digits
s.istitle() test if s is titlecased (all words in s have have initial capitals)

Dictionary Methods

Example Description
d = {} create an empty dictionary and assign it to d
d[key] = value assign a value to a given dictionary key
d.keys() the list of keys of the dictionary
list(d) the list of keys of the dictionary
sorted(d) the keys of the dictionary, sorted
key in d test whether a particular key is in the dictionary
for key in d iterate over the keys of the dictionary
d.values() the list of values in the dictionary
dict([(k1,v1), (k2,v2), ...]) create a dictionary from a list of key-value pairs
d1.update(d2) add all items from d2 to d1
defaultdict(int) a dictionary whose default value is zero

Frequency Distributions

NLTK's Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing distributions of frequency
Example Description
fdist = FreqDist(samples) create a frequency distribution containing the given samples
fdist[sample] += 1 increment the count for this sample
fdist['monstrous'] count of the number of times a given sample occurred
fdist.freq('monstrous') frequency of a given sample
fdist.N() total number of samples
fdist.most_common(n) the n most common samples and their frequencies
for sample in fdist: iterate over the samples
fdist.max() sample with the greatest count
fdist.tabulate() tabulate the frequency distribution
fdist.plot() graphical plot of the frequency distribution
fdist.plot(cumulative=True) cumulative plot of the frequency distribution
fdist1 |= fdist2 update fdist1 with counts from fdist2
fdist1 < fdist2 test if samples in fdist1 occur less frequently than in fdist2

Conditional Frequency Distributions

NLTK's Conditional Frequency Distributions: commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters
Example Description
cfdist = ConditionalFreqDist(pairs) create a conditional frequency distribution from a list of pairs
cfdist.conditions() the conditions
cfdist[condition] the frequency distribution for this condition
cfdist[condition][sample] frequency for the given sample for this condition
cfdist.tabulate() tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions) tabulation limited to the specified samples and conditions
cfdist.plot() graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions) graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2 test if samples in cfdist1 occur less frequently than in cfdist2

Control and Program Style

  1. We process each word in a text using a for statement, such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
  2. We can also loop through things using a while loop.
    while (n != 20):
  3. We can also do this in a single command (a list comprehension). To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set([w.lower() for w in text if w.isalpha()]).
  4. We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
  5. A function is a block of code that has been assigned a name and can be reused. Functions are defined using the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for actual data values.
  6. A function is called by specifying its name followed by one or more arguments inside parentheses, like this: mult(3, 4), e.g., len(text1).
  7. Python programs more than a few lines long should be entered using a text editor, saved to a file with a .py extension, and accessed using an import statement.
  8. Python functions permit you to associate a name with a particular block of code, and re-use that code as often as necessary.
  9. A function serves as a namespace: names defined inside a function are not visible outside that function, unless those names are declared to be global.
  10. Some functions are not available by default, but must be accessed using Python's import statement.
  11. To find out about some variable v, type help(v) in the Python interactive interpreter to read the help entry for this kind of object.
  12. Python's assignment and parameter passing use object references; e.g. if a is a list and we assign b = a, then any operation on a will modify b, and vice versa.
  13. The is operation tests if two objects are identical internal objects, while == tests if two objects are equivalent. This distinction parallels the type-token distinction.
  14. It is good to test functions with specialized test data (unit tests) to check that they work.
    To test + we make some data test = [(1, 1, 2), (1, 0, 1), (2, 2, 4)]
     for (a, b,c ) in test:
    	if (a + b != c):
    		print "test failed: %s + %s not equal to %s" % (a, b, c) 

Lexical Resources

  1. A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using len(text) and word types using len(set(text)).
  2. We obtain the vocabulary of a text t using sorted(set(t)).
  3. We operate on each item of a text using [f(x) for x in text].
  4. A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
  5. A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g., the Brown Corpus, nltk.corpus.brown.
  6. Often you can access many views:
  7. Some text corpora are categorized, e.g., by genre or topic; sometimes the categories of a corpus overlap each other.
  8. A conditional frequency distribution is a collection of frequency distributions, each one for a different condition. They can be used for counting word frequencies, given a context or a genre.
  9. WordNet is a semantically-oriented dictionary of English, consisting of synonym sets or synsets and organized into a network.

Dealing with raw text

  1. In this book we view a text as a list of words. A "raw text" is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.
  2. We can read text from a file f using text = open(f).read().
  3. We can read text from a URL u using text = urlopen(u).read().
  4. We can write text to a file by opening the file for writing: file = open('output.txt', 'w'), then adding content to the file: file.write("Monty Python"), and finally closing the file: file.close().
  5. We can iterate over the lines of a text file using for line in open(f):.
  6. Texts found on the web may contain unwanted material (such as headers, footers, markup), that need to be removed before we do any linguistic processing.
  7. Tokenization is the segmentation of a text into basic units or tokens such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().
  8. Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. appear).

Regular Expressions

  1. Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
  2. If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.
  3. Regular expressions use meta-characters: . ^ $ * + ? { [ ] \ | ( )
  4. When backslash is used before certain characters, e.g. \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g. \., \|, \$, these characters lose their special meaning and are matched literally.
  5. Regular expressions can be used to discover knowledge through patterns in text:
    「([non-ascii]+)」\(([ \w]+)\) finds translation pairs
    自然言語処理 」(natural language processing)
    \b(\w+) such as (w+) finds hypernym-hyponyms
    animals such as cats, dogs and raccoons
  6. We can match and then see what we matched
          m=re.match(pattern, string)
          if m:
              print ('Match found: ', m.group())

Regular Expression Summary

Operator Behavior
. Wildcard, matches any character
^abc Matches some pattern abc at the start of a string
abc$ Matches some pattern abc at the end of a string
[abc] Matches one of a set of characters
[A-Z0-9] Matches one of a range of characters
|ing|s Matches one of the specified strings (disjunction)
* Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+ One or more of previous item, e.g. a+, [a-z]+
? Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n} Exactly n repeats where n is a non-negative integer
{n,} At least n repeats
{,n} No more than n repeats
{m,n} At least m and no more than n repeats
a(b|c)+ Parentheses that indicate the scope of the operators
([A-Z][a-z]+)\1 Parentheses also mark a match group

Regular Expression Predefined Sequences

Symbol Function
\b Word boundary (zero width)
\d Any decimal digit (equivalent to [0-9])
\D Any non-digit character (equivalent to [^0-9])
\s Any whitespace character (equivalent to [ \t\n\r\f\v]
\S Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t The tab character
\n The newline character
\1 The first match group (\2 is the second, ...)

Tagging and Classification

  1. Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts of speech. Parts of speech are assigned short labels, or tags, such as NN, VB,
  2. The process of automatically assigning parts of speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
  3. Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations including: predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.
  4. Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
  5. A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger and n-gram taggers. These can be combined using a technique known as backoff.
  6. Taggers can be trained and evaluated using tagged corpora. Useful features include:
  7. Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we backoff to a more general model (such as a unigram tagger).
  8. Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.
  9. N-gram taggers can be defined for large values of n, but once n is larger than 3 we usually encounter the sparse data problem; even with a large quantity of training data we only see a tiny fraction of possible contexts.
  10. Modeling the linguistic data found in corpora can help us to understand linguistic patterns, and can be used to make predictions about new language data.
  11. Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input.
  12. Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, and determining entailment relations, and many other tasks.
  13. Decision trees are automatically constructed tree-structured flowcharts that are used to assign labels to input values based on their features. Although they're easy to interpret, they are not very good at handling cases where feature values interact in determining the proper label.
  14. Most of the models that are automatically constructed from a corpus are descriptive they let us know which features are relevant to a given patterns or construction, but they don't give any information about causal relationships between those features and patterns.

Empirical Linguistics and Language Engineering

  1. It is important to evaluate your results against (i) a baseline and if possible (ii) a gold standard.
  2. We normally have to trade off recall (coverage) vs precision (accuracy).
    We can get more results OR better ones.
  3. When training a supervised classifier, you should split your corpus into three datasets: a training set for building the classifier model; a dev-test set for helping select and tune the model's features; and a test set for evaluating the final model's performance.
  4. When evaluating a supervised classifier, it is important that you use fresh (unseen) data, that was not included in the training or dev-test set. Otherwise, your evaluation results may be unrealistically optimistic.

Sample Question

The exam consists of 5 questions, with many sub-questions.

Answers don't have to be perfect, do the best you can: better to sketch an outline than do nothing.

Question X: Tuples

This question tests your understanding of tuples. Remember you will not be marked down for small errors in syntax or typos, so long as you demonstrate that you understand how tuples work in python.

(0) What is the value of ('Speaking', 'VBG', 'speak')[2]?


(1) You have a list of words: sent = ['Hello', 'there', 'Prof'] Write some code to make a list of tuples, where the first element is the word, and the second element is the word in lowercase.

  new = [(w, w.lower()) for w in sent]

(2) You have a tagged corpus, consisting of a list of (word, pos) tuples, and a function lemmatize, that takes a word and part of speech and returns the lemma. Show how you can make a new corpus consisting of (word, lemma, pos) tuples.

  [(w, lemmatize(w,p), p) for (w,p) in tagged]

(3) Write a function that takes a list of words and returns a list of tuples consisting of a word type and its frequency, sorted by the frequency, with the most frequent first.

  def FreqDist(words):
      dist = sorted([(words.count(w), w) for w in set(words)], reverse=True) 
      return [(w,c) for (c, w) in dist]


  def FreqDist(words):
      dist = sorted([(-words.count(w), w) for w in set(words)]) 
      return [(w,-c) for (c, w) in dist]

  def FreqDist(words):
      dist = sorted([(words.count(w), w) for w in set(words)]).reverse() 
      return [(w,c) for (c, w) in dist]

Note: lists of tuples are sorted according to their first member, numbers are sorted smallest to largest.

That's all folks

HG251: Language and the Computer Francis Bond, 2011.