HG3051 Lab 4: Designing a Corpus

Due before lecture 9.

Imagine that NTU has asked you to create a corpus of a particular language, or a specialized corpus (by register, historical period, topic, etc.) of a given language, or to extend an existing corpus with new information. Write a brief 2–4 page outline (any format, but submit it as a PDF and make sure you include your name) that addresses the following issues and features and explains why you made the decisions you did. You do not actually have to build this corpus.

You may wish to make this the basis of the corpus you build or extend in Project Two.

Upload the final lab report here as a PDF.
It should be called hg3051-lab4-name-misc.pdf or hg7032-lab4-name-misc.pdf.

  1. Is it an archive, an electronic text library, a corpus, or a sub-corpus?
  2. What types of written and/or spoken texts will be in the corpus?
  3. More specifically, briefly discuss the following characteristics of your corpus: mode, text origin, constitution, medium, style, topic, date, and author(s).
  4. How will you distribute it to others?
  5. What types of annotation will there be (tagging, text identification, etc.)?
  6. More specifically, what information about each text will be included in the header, index, or source files?
  7. Will it be grammatically tagged? Why or why not?
  8. How will you handle the following types of text features: (for written) non-ASCII characters, quotations, lists, headings, proper names, and pagination; (for spoken) speaker change, syntax, accent/dialect, interruptions, pauses, and inaudible segments?
  9. What are some copyright/ethical problems that you might face? How will you deal with these?
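
One way to think about items 3 and 6 is as a fixed record stored alongside each text. The following is only a sketch, assuming Python and JSON; the class name TextHeader, the helper header_to_json, the field names, and the example values are all hypothetical, and your own design may need different fields or an established format such as a TEI header.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class TextHeader:
    """Per-text metadata. Field names are illustrative, not a required schema."""
    text_id: str       # hypothetical identifier for the text within the corpus
    mode: str          # e.g. "written" or "spoken"
    text_origin: str   # where the text came from
    constitution: str  # e.g. whole text or sample
    medium: str        # e.g. book, web page, broadcast
    style: str
    topic: str
    date: str          # stored as an ISO 8601 string, e.g. "2010-05-01"
    authors: List[str] = field(default_factory=list)


def header_to_json(header: TextHeader) -> str:
    """Serialize a header to JSON, keeping non-ASCII characters readable."""
    return json.dumps(asdict(header), ensure_ascii=False, indent=2)


# Made-up example values, purely for illustration.
example = TextHeader(
    text_id="demo-0001",
    mode="written",
    text_origin="student essay",
    constitution="whole text",
    medium="web page",
    style="informal",
    topic="food",
    date="2010-05-01",
    authors=["Anonymous"],
)

print(header_to_json(example))
```

Writing the header down as a record like this makes it easier to justify each field in your outline: if you cannot say why a field is there or how you would fill it in, it probably does not belong in the header.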

HG3051 (Corpus Linguistics) main page.

Francis Bond <bond@ieee.org>
Computational Linguistics Lab
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303