This assignment constitutes 30% of your final grade. Please work
on the task individually. Unless specifically requested otherwise,
your report may later be posted on-line with a CC-BY license.
Deadline: Nov 14 17:00
Choose one of A, B or C:
- A Examine some examples of phishing mail, either from
your own experience or from a collection such as the Phishing
Identify and describe, with examples, what kinds of grammatical,
lexical, orthographic or meta-data features could be used to
distinguish phishing mail from normal mail.
- Summarize relevant literature on phishing detection
- Describe how these features can be identified
- Finally, write your own phishing mail, aimed at NTU
students, trying to get their username and password, and describe
how you have made it convincing.
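As a rough illustration of how lexical, orthographic and meta-data features of the kind asked for in option A might be identified automatically, here is a minimal Python sketch using only the standard library. The feature names, the `URGENCY_WORDS` list and the sample message are all hypothetical and chosen for illustration. They are not taken from any published detector, and a real report should motivate its features from the literature and from actual examples.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical urgency vocabulary; a real study would derive this from data.
URGENCY_WORDS = {"urgent", "immediately", "verify", "suspended", "expire"}


def extract_features(raw_message):
    """Return a few simple surface features of a plain-text e-mail."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart mail is ignored in this sketch
        body = ""
    tokens = re.findall(r"[a-z']+", body.lower())

    # Meta-data feature: does Reply-To point to a different domain than From?
    _, from_addr = parseaddr(msg.get("From", ""))
    _, reply_addr = parseaddr(msg.get("Reply-To", ""))
    from_domain = from_addr.rpartition("@")[2].lower()
    reply_domain = reply_addr.rpartition("@")[2].lower()

    urls = re.findall(r"https?://[^\s>\"']+", body)
    return {
        "reply_to_mismatch": bool(reply_domain) and reply_domain != from_domain,
        "url_count": len(urls),
        # Links to a bare IP address rather than a host name
        "ip_url": any(re.match(r"https?://\d{1,3}(\.\d{1,3}){3}", u) for u in urls),
        # Lexical feature: number of tokens from the urgency list
        "urgency_words": sum(t in URGENCY_WORDS for t in tokens),
        # Orthographic feature: number of exclamation marks
        "exclamations": body.count("!"),
    }


# Invented sample message using reserved example domains and addresses.
SAMPLE = """From: IT Helpdesk <helpdesk@example.edu>
Reply-To: support@example.net
Subject: Account notice

Your account will be suspended! Verify immediately at http://192.0.2.10/login
"""

features = extract_features(SAMPLE)
print(features)
```

Counts like these only flag surface cues. Grammatical features (for example agreement errors) would need a parser or tagger, which this sketch deliberately leaves out.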
- B Identify and describe a source of linguistic meta-data
(at any level: phonological, morphological, syntactic, semantic or pragmatic)
that could be harvested from the Internet.
- words that rhyme harvested from poetry
- positive or negative sentiment from reviews
- multilingual/bilingual data
- Show what property or properties are annotated by the meta-data, with examples
- Outline how the data could be harvested and pre-processed;
optionally, harvest and pre-process some.
- Estimate how much annotated data is available
- You should choose a new source of data, not one that is already
described in the literature
(although you can make a variation on an existing approach).
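To make the harvesting and pre-processing steps in option B concrete, here is a minimal Python sketch for the rhyming-words example. It assumes the poems have already been downloaded as plain text and that they are written in rhyming couplets (AABB). Both are simplifying assumptions, and the function names and sample verse are invented for illustration. Real poetry would need rhyme-scheme detection and some filtering of near-rhymes.

```python
import re
from collections import Counter


def line_final_words(poem):
    """Return the lower-cased last word of every non-empty line."""
    words = []
    for line in poem.splitlines():
        tokens = re.findall(r"[A-Za-z']+", line)
        if tokens:
            words.append(tokens[-1].lower())
    return words


def couplet_pairs(poem):
    """Count candidate rhyme pairs, assuming an AABB couplet scheme."""
    finals = line_final_words(poem)
    pairs = Counter()
    # Pair line 1 with line 2, line 3 with line 4, and so on.
    for first, second in zip(finals[0::2], finals[1::2]):
        if first != second:  # identical words are not informative
            pairs[tuple(sorted((first, second)))] += 1
    return pairs


# Invented sample verse, used only to exercise the functions.
SAMPLE_POEM = """The cat sat on the mat,
and wore a little hat.
The sun went down at night,
and stars gave out their light.
"""

pairs = couplet_pairs(SAMPLE_POEM)
print(pairs)
```

Summing such counts over a sample of poems, and scaling by the size of the collection, is one simple way to estimate how much annotated data is available.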
- C Analyze the chat corpus provided here.
- Identify expressions unique to chat/online communication and
suggest why they are used here.
- Refer to expressions by date, time and, where necessary,
participant (e.g. Mon 20:40 S).
This corpus was produced and anonymized by Lim Hogzhen as part
of her FYP "A comparison of Code-switching Patterns between
Text-Messaging and Speech" in 2014. The participants agreed to
release it under an
attribution license. The two participants are Samantha
(not her real name) and Ophelia (not her real name), both
Singaporean undergraduate students at NTU (female; aged
20-25; both use a mixture of Chinese and English). There is some
mark-up of code switching (e.g. <ra:w:n>, bolding or
underlining); you can just ignore it.
Whichever topic you choose, you should find and cite relevant literature.
- The deliverable is a paper of no more than eight pages including diagrams and
- Formatted according to the
guidelines for submitting written work for the Division of LMS
- If you want to make it even more beautiful, as I am sure you do,
take a look at my (Computational)
Linguistic Style Guidelines: a guide for the flummoxed.
- Submit both hardcopy (stapled, two-sided, no folder, no cover page)
and softcopy (via NTULearn).
Assignment Three for