This lab will give you practice using SQL. We will use SQLite (e.g. using the DB Browser for SQLite) to search through the NTU Multilingual Corpus (NTU-MC). The NTU-MC is a multilingual sense-tagged corpus. You'll be provided with working databases for both English and Mandarin Chinese (but you don't need to have any command of Mandarin Chinese to complete this assignment).
Please answer each question providing both the SQL query, the retrieved answer and where necessary some discussion.
Email the results to: firstname.lastname@example.org, replacing the ??? with your answers.
This is a simplified explanation of the tables you'll have to use:
table: concept sid|cid|clemma|tag|tags|comment|ntag|usrname sid - sentence id (unique) cid - concept id (unique only in a given sentence) clemma - concept lemma (may be a multi-word expression) tag - concept tag (may be a synset, or a number of other predefined tags that you don't need to know about) tags - IGNORE comment - IGNORE ntag - IGNORE usrname - IGNORE table: word sid|wid|word|pos|lemma|cfrom|cto|comment|usrname sig - sentence id (unique) wid - word id (unique only in a given sentence) word - word surface form pos - part-of-speech (consult http://compling.hss.ntu.edu.sg/ntumc/ for the tag sets) cfrom - IGNORE cto - IGNORE comment - IGNORE usrname - IGNORE ### not actually used table: cwl (concept to word links for each sentence) sid|wid|cid|usrname sid - sentence id (unique) wid - word id (unique only in a given sentence) cid - concept id (unique only in a given sentence) usrname - IGNORE
What is the most frequent word in the English Corpus? And in the (Mandarin) Chinese Corpus?
What is the total number of words tagged as Adjective, Noun, Verb and Adverb (using the Universal Part-of-Speech Tagset) for English? And for Chinese? (Please use the mappings provided by NTU-MC).
Your answer should have eight numbers (one for each POS+Language combination) and possibly eight queries.
What are the five concepts with the most variety in lemmas for English and Mandarin (i.e. the synsets with the most different lemmas). Please ignore tags that are not synsets, i.e. the tag must be of the form: xxxxxxxx-p (where x are numbers and p is the pos code).
Look the synsets up in the corpus search and the open multilingual wordnet and look at the lemmas (in either English, Mandarin or both): why you think these concepts have so many lemmas.
Compare the frequency of usage of adjectives (using Petrov, Das and McDonald's Universal Part-of-Speech Tagset) between English and Chinese. Use the Frequency Comparison provided at http://sigil.collocations.de/wizard.html. Using the Universal Part-of-Speech Tagset and the NTU-MC as sample, can we say that any of the two languages differ (significantly) in the usage of adjectives? (Please show any new queries used, the input you give to the wizard, and paste the statistical results along with a short answer).
HG3051 (Corpus Linguistics) main page.Francis Bond <email@example.com>