HG3051: Corpus Linguistics

Francis Bond, 2011, 2012, 2014, 2018. Classes are shared with the post-graduate course HG7032: Topics in Corpus Linguistics.

This course is an introduction to the fast-growing field of corpus linguistics. It aims to familiarise students with key concepts and common methods used in the construction of language corpora, as well as tools that have been developed for searching and using major corpora such as the British National Corpus. Students will be given hands-on experience in pre-editing, annotating, and searching corpora. Criteria and methods used for evaluating corpora and analytical tools will also be discussed. This lays the groundwork for research using big data.

The main aim of this module is to master the uses of text corpora in linguistics research and applications.

Course Content

This course introduces basic corpus skills for linguists:

Course Page: http://compling.hss.ntu.edu.sg/courses/hg3051.

There is no textbook; readings will be assigned each week.

Course Outline

| Lecture | Date | Topic | Readings/Extra Information/Tools | Assessment | Fun |
|---|---|---|---|---|---|
| 1 | Jan 16 | Basic Concepts, What can we do with Corpora? | Corpus and Text: Basic Principles (Sinclair 2005) in Wynne (2005) | | |
| 2 | Jan 23 | Markup and Annotation | BNC Manual, BYU Interface Syntax; Adding Linguistic Annotation (Geoffrey Leech 2005) and Metadata for Corpus Work (Lou Burnard 2005) in Wynne (2005); NTU-MC Tagsets: cmn, eng, jpn, ind, universal, universal (old version); BNC Login Password: ug coursecode, Org: Nanyang Technological University | Email results of the two tasks to your paired group (ISLRN and tagset mapping) | Phonetic Punctuation (Victor Borge) |
| 3 | Jan 30 | Multimodal and Multilingual Corpora | Koehn (2005), Martin et al. (2007), and Character Encoding in Corpus Construction (Anthony McEnery and Richard Xiao 2005) in Wynne (2005); VACE multimodal corpus | Discuss the results of the tasks; Lab 1 due; Email corpus choice for Lab 2 | |
| 4 | Feb 06 | A Survey of Available Corpora | Various Corpora | Present Lab 2; Upload slides here | |
| 5 | Feb 13 | Collocation, Frequency, Corpus Statistics | Dunning (1993); Corpus test Wizard; Social Science Statistics | Project 1 | Passives at the Language Log |
| 6 | Feb 20 | DIY Corpora, Web as Corpus, Processing Raw Text, SQL | NLTK Chapter 1; SQLite tutorial; SQLite; SQLMAN (Developer/admin tool); sqlitebrowser (DB Browser); NTU Multilingual Corpus: English, Chinese, Wordnet | Lab 3 | Bobby Tables |
| 7 | Feb 27 | Lexical and Grammatical Studies, Variation | Biber et al. (1998) Chapters 2, 3 | Lab 3 due; Upload Here | |
| | Mar 4-9 | Recess | | | |
| 8 | Mar 13 | Case studies: Pronouns and Classifiers | Bond et al. (1995), Bond (2005), Seah and Bond (2014) | Project 1 Due; Upload Here | |
| 9 | Mar 20 | Guest Lecture: Corpora in Sociolinguistics by Ivan Panović; Contrastive and Diachronic Studies | Picking the right cherries? A comparison of corpus-based and qualitative analyses of news articles about masculinity; Stubbs 7, 8 | Lab 4 intro | |
| 10 | Mar 27 | 02 Corpora and Language Engineering | | Project 2 Description; Lab 4 due; Upload Here | |
| 11 | Apr 03 | Representativeness and Balance; Project Presentations | Ide and Macleod (2001); Newman 2007 | Project 2 Presentation | |
| 12 | Apr 10 | Conclusions and Review | Stubbs 9 | | |
| | Apr 24 | | | Project 2 Due (and Project 3 for HG7032); Upload Here | |

Slides may be updated at any time! Labs may also change.

Recommended Readings

Projects that became papers

Assessment (HG3051)

Learning Outcomes

On completion of this module, students should be able to:

Assessment for HG7032 Topics in Corpus Linguistics

Francis Bond <bond@ieee.org>
Computational Linguistics Lab
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303