HG3051: Corpus Linguistics

Francis Bond, 2011, 2012, 2014, 2018. Classes are shared with the post-graduate course HG7032: Topics in Corpus Linguistics.

This course is an introduction to the fast-growing field of corpus linguistics. It aims to familiarise students with key concepts and common methods used in the construction of language corpora, as well as tools that have been developed for searching and using major corpora such as the British National Corpus. Students will be given hands-on experience in pre-editing, annotating, and searching corpora. Criteria and methods used for evaluating corpora and analytical tools will also be discussed. Together, these skills lay the groundwork for research using big data.

The main aim of this module is for students to master the use of text corpora in linguistic research and applications.

Course Content

This course introduces basic corpus skills for linguists:

Course Page: http://compling.hss.ntu.edu.sg/courses/hg3051

There is no textbook; readings will be assigned each week.

Course Outline

| Lecture | Date | Topic | Readings | Assessment/Extra Information/Tools | Fun |
|---|---|---|---|---|---|
| 1 | Jan 16 | Basic Concepts, What can we do with Corpora? | Corpus and Text: Basic Principles (Sinclair 2005) in Wynne (2005) | | |
| 2 | Jan 23 | Markup and Annotation | BNC Manual, BYU Interface Syntax; Adding Linguistic Annotation (Geoffrey Leech 2005) and Metadata for Corpus Work (Lou Burnard 2005) in Wynne (2005) | NTU-MC Tagsets: cmn, eng, jpn, ind, universal, universal (old version); Email results of the two tasks (ISLRN and tagset mapping) | Phonetic Punctuation (Victor Borge) |
| 3 | Jan 30 | Multimodal and Multilingual Corpora | Koehn (2005); Martin et al. (2007); Character Encoding in Corpus Construction (Anthony McEnery and Richard Xiao 2005) in Wynne (2005) | Lab 1 due; Email corpus choice | |
| 4 | Feb 06 | A Survey of Available Corpora | Various Corpora | Lab 2 due | |
| 5 | Feb 13 | Collocation, Frequency, Corpus Statistics | Dunning (1993) | Social Science Statistics; Corpus test Wizard | |
| 6 | Feb 20 | DIY Corpora, Web as Corpus, Processing Raw Text, SQL | NLTK Chapter 11; SQLite tutorial | SQLite; SQLMAN (Developer/admin tool); sqlitebrowser (DB Browser); NTU Multilingual Corpus: English, Chinese, Wordnet | |
| 7 | Feb 27 | Lexical and Grammatical Studies, Variation | Biber et al. (1998) Chapters 2, 3 | Lab 3 due | |
| | Mar 4-9 | Recess | | | |
| 8 | Mar 13 | Case studies: Pronouns and Classifiers | Bond et al. (1995), Bond (2005), Seah and Bond (2014) | Project 1 Due | |
| 9 | Mar 20 | Contrastive and Diachronic Studies | Stubbs 7, 8 | Lab 4 due | |
| 10 | Mar 27 | Corpora and Language Engineering | Newman 2007 | Project 2 Description | |
| 11 | Apr 03 | Representativeness and Balance; Project Presentations | Ide and Macleod (2001) | Project 2 Presentation | |
| 12 | Apr 10 | Conclusions and Review | Stubbs 9 | | |
| | Apr 24 | | | Project 2 Due (and Project 3) | |

Slides may be updated at any time! Labs may also change.

Recommended Readings

Projects that became papers

Assessment (HG3051)

Learning Outcomes

On completion of this module, students should be able to:

Assessment for HG7032 Topics in Corpus Linguistics


Francis Bond <bond@ieee.org>
Computational Linguistics Lab
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303