Lab 4: Designing a Corpus
Due before lecture 9.
Imagine that NTU has asked you to create a corpus of a particular
language, or a specialized corpus (by register, historical period,
topic, etc.) of a given language, or to extend an existing corpus with
new information. Write a brief 2–4 page outline (any format, but
submit it as a PDF and make sure you include your name) in which you
address the following issues and features and explain why you have
made the decisions that you have.
You do not actually have to build this corpus.
You may wish to make this the basis of the corpus you build or extend in
Upload the final lab report.
It should be called hg3051-lab4-name-misc.pdf.
- Is it an archive, an electronic text library, a corpus, or a sub-corpus?
- What types of written and/or spoken texts will be in the corpus?
- More specifically, briefly discuss the following characteristics
of your corpus: mode, text origin, constitution, medium, style,
topic, date, and author(s).
- Estimate the cost (time, person-hours) to construct the corpus.
- How will you distribute it to others?
- What types of annotation will there be (tagging, text identification, etc.)?
- More specifically, what information about each text will be included in the header, index, or source files?
- Will it be grammatically tagged? Why or why not?
- How will you handle the following types of text features: (for
written) non-ASCII characters, quotations, lists, headings, proper
names, and pagination; (for spoken) speaker change, syntax,
accent/dialect, interruptions, pauses, and inaudible segments?
- What are some copyright/ethical problems that you might face? How will you deal with these?
- Note that you cannot record someone without getting their permission beforehand: it is definitely unethical and probably illegal.
- How representative will your corpus be of the entire population (i.e. all possible texts)? What steps will you take to make the corpus representative?
- Who will be the main users of your corpus? What types of information will they likely be looking for?
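When deciding what information to record about each text, it can help to write down one concrete header record. The following is only a minimal sketch in Python, not a required format: the class name `TextHeader`, the function `to_header_json`, the extra fields `text_id` and `permission_obtained`, and all of the example values are hypothetical. The remaining field names simply mirror the characteristics listed above (mode, text origin, constitution, medium, style, topic, date, and author(s)).

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TextHeader:
    """Hypothetical per-text header record for a corpus design outline."""
    text_id: str          # assumed identifier scheme, not prescribed by the lab
    mode: str             # e.g. "written" or "spoken"
    text_origin: str
    constitution: str
    medium: str
    style: str
    topic: str
    date: str
    authors: list = field(default_factory=list)
    # Assumed extra field: records whether permission was obtained in advance.
    permission_obtained: bool = False


def to_header_json(header: TextHeader) -> str:
    """Serialize a header to JSON, keeping non-ASCII characters readable."""
    return json.dumps(asdict(header), ensure_ascii=False, sort_keys=True)


# Illustrative values only; they do not describe any real text.
example = TextHeader(
    text_id="txt0001",
    mode="spoken",
    text_origin="original",
    constitution="single",
    medium="audio recording",
    style="informal conversation",
    topic="food",
    date="2020-01-15",
    authors=["Speaker A"],
    permission_obtained=True,
)

header_json = to_header_json(example)
print(header_json)
```

Writing out even one such record tends to expose decisions the outline still has to justify, for example how speakers are anonymized or which date format is used.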
Computational Linguistics Lab
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303