HG3051 Lab 3: SQL

This lab will give you practice using SQL. We will use SQLite (e.g. using the DB Browser for SQLite) to search through the NTU Multilingual Corpus (NTU-MC). The NTU-MC is a multilingual sense-tagged corpus. You'll be provided with working databases for both English and Mandarin Chinese (but you don't need to have any command of Mandarin Chinese to complete this assignment).

Please answer each question providing both the SQL query, the retrieved answer and where necessary some discussion.

Email the results to: hg3051c@gmail.com, replacing the ??? with your answers.

This is a simplified explanation of the tables you'll have to use:

table: concept
sid - sentence id (unique)
cid - concept id (unique only in a given sentence)
clemma - concept lemma (may be a multi-word expression)
tag - concept tag (may be a synset, or a number of other predefined tags that you don't need to know about)
tags - IGNORE
comment - IGNORE
ntag - IGNORE
usrname - IGNORE

table: word
sig - sentence id (unique)
wid - word id (unique only in a given sentence)
word - word surface form
pos - part-of-speech (consult http://compling.hss.ntu.edu.sg/ntumc/ for the tag sets)
cfrom - IGNORE
cto - IGNORE
comment - IGNORE
usrname - IGNORE

### not actually used 
table: cwl (concept to word links for each sentence) 
sid - sentence id (unique)
wid - word id (unique only in a given sentence) 
cid - concept id (unique only in a given sentence)
usrname - IGNORE

Q1: (2)

What is the most frequent word in the English Corpus? And in the (Mandarin) Chinese Corpus?


Q2: (2)

What is the total number of words tagged as Adjective, Noun, Verb and Adverb (using the Universal Part-of-Speech Tagset) for English? And for Chinese? (Please use the mappings provided by NTU-MC).

Your answer should have eight numbers (one for each POS+Language combination) and possibly eight queries.


Q3: (4)

What are the five concepts with the most variety in lemmas for English and Mandarin (i.e. the synsets with the most different lemmas). Please ignore tags that are not synsets, i.e. the tag must be of the form: xxxxxxxx-p (where x are numbers and p is the pos code).


Look the synsets up in the corpus search and the open multilingual wordnet and look at the lemmas (in either English, Mandarin or both): why you think these concepts have so many lemmas.


Q4: (2)

Compare the frequency of usage of adjectives (using Petrov, Das and McDonald's Universal Part-of-Speech Tagset) between English and Chinese. Use the Frequency Comparison provided at http://sigil.collocations.de/wizard.html. Using the Universal Part-of-Speech Tagset and the NTU-MC as sample, can we say that any of the two languages differ (significantly) in the usage of adjectives? (Please show any new queries used, the input you give to the wizard, and paste the statistical results along with a short answer).


HG3051 (Corpus Linguistics) main page.

Francis Bond <bond@ieee.org>
Computational Linguistics Lab
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Level 3, Room 55, 14 Nanyang Drive, Singapore 637332
Tel: (+65) 6592 1568; Fax: (+65) 6794 6303