General Tagging Guidelines
- Before beginning tagging you should:
- Read the full tagging documentation
- Read the text you will be tagging (at least in the language you will be tagging)
- Read the description of how the data is prepared
- Then you tag the document:
- Click on an untagged word: untagged
- It will be presented in red, with some context above and below
- Wordnet senses will be displayed in the top right hand corner
- Look at the tag alternatives at the bottom of the tagging screen
- If it is a multi-word expression, there may be multiple choices
- Click the lemma in the lower left to see the senses for each choice
- Choose the correct tag for each choice (click the radio button)
- You can also add a comment in the comment box
- Clicking the tag will commit your choice
- Repeat until done
- Tagged words are shown as such: tagged
- You can click on them to retag them if you change your mind
- Words you probably don't need to tag are shown in dark grey.
- You can stop tagging at any time, and resume again later.
- The interface has only been fully tested under Firefox and Chrome. Other browsers may behave strangely.
- After everyone has tagged, you can compare your tags with one or more classmate(s) (Project 2)
The alignment tool will show the results
You then do a final round of tagging
- Discuss your annotations with the other annotators
- Anywhere where you agreed is fine
- Anywhere where there was no clear majority tag should be retagged
- You should add comments for at least 5 interesting examples
- Wordnets are made in a variety of ways, so you should read the papers
describing the wordnet you are using.
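The "no clear majority" rule from the final round of tagging can be sketched as follows. This is only an illustration, not part of the tagging interface: the function names are hypothetical, and "clear majority" is assumed here to mean that more than half of the annotators chose the same tag.

```python
from collections import Counter

def majority_tag(tags):
    """Return the tag chosen by more than half of the annotators, else None."""
    if not tags:
        return None
    tag, count = Counter(tags).most_common(1)[0]
    return tag if count > len(tags) / 2 else None

def needs_retagging(annotations):
    """annotations maps a 'sentence:word' ID to the tags given by each annotator."""
    return sorted(wid for wid, tags in annotations.items()
                  if majority_tag(tags) is None)

# Hypothetical annotations from three annotators
annotations = {
    "111:2": ["02408011-a", "02408011-a", "02408011-a"],  # everyone agreed
    "111:5": ["06898352-n", "06898352-n", "w"],            # clear majority
    "111:7": ["x", "e", "w"],                              # no majority: retag
}
print(needs_retagging(annotations))
```

With only two annotators, any disagreement counts as "no clear majority" under this assumption.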
- The texts are first split
into sentences, and hand-checked.
- They are then
tokenized into words and
lemmatized. For example:
- The snow-covered men saw the does off.
- This process gets it wrong 5-10% of the time
(e.g. lemmatizing saw as saw instead of see, or does as do instead of doe)
- Next, we attempt to look up the words in the wordnet, including
multi-word expressions. If a word is in wordnet, then we make it
a candidate, even if the POS suggests it should not be tagged, as
we cannot trust the POS tagger. If a word is not in wordnet, but
is open class (NOUN, ADJ, ADV, VERB, PRON, NUM) we add it as a
concept. By default, we do not add prepositions, conjunctions,
punctuation or determiners.
- Finally, there are some words that we strongly expect to be
false candidates (a as a determiner, be preceding a
participle (be sleeping)). We pre-tag these as x.
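The candidate selection described above can be sketched roughly as follows. This is a simplified illustration and not the actual preparation code: the toy lexicon, the token format (lemma, UPOS, participle flag) and the function name are all assumptions made for the example.

```python
OPEN_CLASS = {"NOUN", "ADJ", "ADV", "VERB", "PRON", "NUM"}

# Hypothetical toy lexicon standing in for a real wordnet lookup
TOY_WORDNET = {"a", "man", "be", "sleep"}

def select_candidates(tokens, wordnet=TOY_WORDNET):
    """tokens: list of (lemma, upos, is_participle) triples for one sentence.

    Returns (lemma, pretag) pairs; pretag is 'x' for expected false
    candidates and None for words the annotator still has to tag.
    """
    candidates = []
    for i, (lemma, upos, _) in enumerate(tokens):
        # In the wordnet: always a candidate, whatever the POS tagger says.
        # Not in the wordnet: a candidate only if it is open class.
        if lemma not in wordnet and upos not in OPEN_CLASS:
            continue
        pretag = None
        if lemma == "a" and upos == "DET":
            pretag = "x"  # a as a determiner
        elif lemma == "be" and i + 1 < len(tokens) and tokens[i + 1][2]:
            pretag = "x"  # be preceding a participle
        candidates.append((lemma, pretag))
    return candidates

# "A man is sleeping ."
sentence = [("a", "DET", False), ("man", "NOUN", False),
            ("be", "AUX", False), ("sleep", "VERB", True),
            (".", "PUNCT", False)]
print(select_candidates(sentence))
```

Note that the punctuation is dropped because it is neither in the toy lexicon nor open class, while an unknown open-class word would still be kept as a candidate.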
Actual Tagging documentation
- If the word has been lemmatized correctly, and it is an open class word and there is an appropriate sense in wordnet:
- Select the appropriate sense: 0, 1, …, n
Use various strategies to try to rule out senses. If one
sense is very general and another very specific, then you want
to assign the more specific sense, so long as it applies. Try
using the hypernym of the gloss (if it defines a noun or
verb). Try using the semantic relations of the different senses
(click on the sense number to look them up). If the context is
vague and does not clearly rule out sense(s), then you can
comment on this. There may be information
about the argument structure (does it have 1, 2 or 3 arguments: v1, v2 or v3).
- This is the preferred outcome: be a little forgiving in interpreting the definition
- Feel free to add a comment at any time
- Sometimes there are multiple senses that seem similar but
differ in part-of-speech: try to choose the appropriate pos
| Tag | Part of speech | Examples |
|---|---|---|
| n | Noun | dog, me, smiling |
| v | Verb | smile, look up |
| a | Adjective | quick, warm, many |
| r | Adverb | quickly, most, un-[clasp] |
| x | Other | Hello, Thank you |
- You can see the pos predicted by the POS tagger by mousing over the word to be tagged.
FIXME: also show UPOS?
- Senses that agree with the POS tagger's tag are shown in red. FIXME
- In the wordnet viewer you can make it show all POSs or only one of them by clicking on:
All N V A R
- There is more discussion about POS issues in the language specific guidelines
- If there are too many senses you can filter them by:
- Clicking on the POS
- Clicking on the individual senses to hide them (and then on the POS or All to show it again)
FIXME: not naming the sense properly
- If it is a name (or part of a name) choose an appropriate name tag:
- per person
e.g. Irene or Irene Adler (tag both Irene and Adler as per)
- org organization
e.g. Scotland Yard
- dat date/time
e.g. the 31st, 2 o'clock
- loc location
e.g. Riding Thorpe Manor; Norfolk; Baker Street
- oth any other proper name
e.g. Samsung Galaxy SII
- year year
- If there is an error in the corpus (such as tokenization, lemmatization or spelling)
tag as e
Give comments about what it should be in the comment box
- 今 日 should be 今日 kyou "today"
- three-toed should be three - toed
- does is analysed as doe+s not do+es
- If there is an error in the wordnet (missing sense or concept)
tag as w
- Describe the new case in the comment box (see Suggesting changes to wordnet)
- I arrowed them meaning "gave them the unpleasant job"
- I program in python meaning "the computer language"
- If the same thing occurs multiple times, you only need to add
a comment to the first occurrence
- If you think there is no need to tag the word
tag as x
- Closed class part of speech that shouldn't be tagged
Note that we differ from the standard wordnet by including pronouns (see Yu Jie Seah (2013)) and Exclamatives
- Fred is swimming; Fred has swum: we don't tag auxiliary be or have
But we do tag copula
But maybe you are the surgeons 'be identical to'
- I looked up the stairs: we don't tag normal prepositions.
But NOT structural pronouns (tag with x)
- Dummy it: It is raining;
- Existential there: There seems to be a problem;
- Relative pronouns: The book which I bought;
- Bad multi-word expression
- If the wrong lemma is multi-word (kick_the_bucket):
- e.g. In the dark, I kicked the bucket and hurt my toe not "die"
- In this case mark kick_the_bucket as x and tag kicked and bucket
- If the wrong lemma is single word (kick):
- e.g. My pet snail kicked the bucket so I buried it not "hit with foot"
- In this case tag kick_the_bucket and mark kicked and bucket as x
- Even for compositional multi-word expressions, just tag the largest expression:
I sent him the source code "program instructions ..."
I sent him the source code "the symbolic arrangement of data or instructions in a computer ..."
In this case you just have to tag the larger expression
(tag the parts with x: no need to tag)
- Note: You must always check one of these choices!
- If it can be tagged with an existing synset (preferred)
- write in the comments something like:
This lemma is a synonym of 012345678-x
e.g. for laidback, it is a synonym of
"02408011-a: laid-back, mellow", so we
add synonym of 02408011-a in the comment
- If you need to suggest a new synset (there is no appropriate concept in wordnet):
At the end of the day, Muslims break their fast (buka puasa) with
a communal meal at home or at the mosque.
Regardless of the tag they may currently have (i.e. likely 'e' or 'w'),
foreign words (like 'buka puasa', above) should be tagged with the
synset that corresponds to that concept in the corresponding language.
In this case, we don't want to add 'buka puasa' to the English wordnet.
Buka puasa is an Indonesian word, and it should be present / or added
to that wordnet. If you are ever asked to tag a word like this there are two
things you should do:
Suggested changes to wordnet can be added to wordnet.
- E.g., laidback is a synonym of
"02408011-a: laid-back, mellow", so we
add laidback as an English lemma for the synset
- E.g., Python is a hyponym of "06898352-n programming
language", so we make a new synset, link it to
06898352-n as a hyponym, and give it a name
and English definition. If possible add lemmas in other languages.
- Mark bad entries
- This lemma should NOT be in 012345678-x
e.g., 运弓法 "archery" should not be in 07274425-n bowing
- You can also add other comments, like:
I can't really decide between 012345678-x and 012345678-y
Don't forget to check for orthographic variants of existing
synsets. E.g. stir-fry is in wordnet as stir fry,
so stir-fry should just be added to that synset
Anything that you are not sure of or that won't fit in the comment,
note the sentence ID and word ID (e.g. 111:2) and write it up in your
Please give the synset id (012345678-x) not a word when linking.
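As a rough illustration of why synset ids are easier to work with than words, the sketch below pulls ids out of free-text comments. It assumes that ids consist of eight digits, a hyphen and one of the POS letters n, v, a, r or x, as in the real examples above (02408011-a, 06898352-n); the function name is hypothetical.

```python
import re

# Assumed id shape: eight digits, a hyphen, then a POS letter
SYNSET_ID = re.compile(r"\b\d{8}-[nvarx]\b")

def synset_ids(comment):
    """Return every synset id mentioned in a comment, in order."""
    return SYNSET_ID.findall(comment)

print(synset_ids("synonym of 02408011-a"))
print(synset_ids("I can't really decide between 06898352-n and 07274425-n"))
print(synset_ids("synonym of laid-back"))  # a word, not an id: nothing found
```

A comment that links by word only, like the last one, cannot be resolved automatically, which is why the id is requested.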
- A word can be in two multi-word expressions:
6740: I lived in that farm , where I had a room down below , and could get in and out every night , and no one the wiser
tag as both get_in and get_out
- What are some more examples of bad lemmas?
When the morphological analyzer has made an error:
In these cases, if you can find an appropriate sense,
put it in the tag box; don't choose a Meta tag.
- well dressed man dressed not dress
- Sherlock Holmes Sherlock_Holmes not sherlock_holmes
- Common errors for English
When there is a meaning difference, go with the closest meaning
In general prefer adjective, then verb
In general prefer adjective, then verb, then noun
- plural noun/noun/adjective:
In general prefer adjective, then noun (if both senses are appropriate):
So, damp in damp weather is an
adjective, even though a noun sense exists. But nouns can also pre-modify nouns:
cotton in cotton shirts is a noun.
Singular or plural normally has a meaning difference: pick the right meaning
- Common errors for Japanese
- When does the system miss an entry?
- Differences in spelling/hyphenation; errors in tokenization and lemmatization
night bird, nightbird, night-bird
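One way to check for such orthographic variants before suggesting a new entry is to compare forms after stripping spaces, hyphens and underscores. This is only a sketch with hypothetical function names; lowercasing is an extra assumption that may wrongly merge names with common words.

```python
import re

def variant_key(form):
    """Collapse spaces, hyphens and underscores, and lowercase the form."""
    return re.sub(r"[\s_\-]+", "", form).lower()

def find_variants(form, lemmas):
    """Return the lemmas that differ from form only in spacing or hyphenation."""
    key = variant_key(form)
    return [lemma for lemma in lemmas
            if lemma != form and variant_key(lemma) == key]

print(find_variants("night-bird", ["night bird", "nightbird", "night"]))
print(find_variants("stir-fry", ["stir fry", "stir", "fry"]))
```

If a variant is found, the new spelling should be added to the existing synset rather than proposed as a new concept.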
Detailed Guidelines for English
How to determine part of speech for the word/collocation. This
is not always as obvious as it seems! There are four particularly
tricky cases. These are all tricky because the part of speech of a
word is not always the same as the grammatical function that the
word is performing in the sentence or phrase. For instance, nouns
can function similarly to what are traditionally called
adjectives, and verbs can take on the roles of nouns or
adjectives.
Adjective vs. noun modifying a noun
Adjectives usually appear before the noun they modify:
- favorable conditions are forecast
- damp weather is behind us
And sometimes after the noun:
- conditions unfavorable for flying
Nouns can also serve as modifiers, similar to adjectives:
- entertainment value
- children's books
- cotton shirts
The general rule of thumb for deciding whether it is a noun or
adjective is to check the sense list first for whether there is an
adjective sense corresponding to the word, and, if not, then whether
there's a corresponding noun sense. So, damp in damp weather is an
adjective, even though a noun sense exists. And cotton in cotton
shirts is a noun, which is modifying another noun.
If there is no adjective sense in WordNet, then you should make sure
that it is not truly an adjective that is missing from WordNet. A good
clue that you have an adjective is if you try to modify it with very
or rather and it sounds ok: very/rather favorable conditions
(ok) vs. very/rather cotton shirts (not ok).
Another good clue is if you can make a comparative or superlative form
out of it (damper/dampest/more favorable/most favorable
conditions are all adjectives, but cottoner/cottonest/more
cotton/most cotton shirts are not). If either of these tests comes
up ok (that is, very/rather x sounds good or either
x-er/x-est or more/most x sounds good), and there is no
matching adjective sense, then you need to add a new sense to wordnet.
Note that these tests are only valid if they come up ok. Then you know
you have an adjective for sure. If the tests are not ok, then it may
still be an adjective. This is because the tests only work for certain
kinds of adjectives, but not all. If the tests are not ok (that is,
none of very/rather x and x-er/x-est/more x/most x
sound good), then check for a matching noun sense. If there is no
matching noun sense, then do not assign any sense. (But see
below first regarding present and past participles, since it might be
a verb!) If a noun sense does exist, then the word can be considered
a noun, and be tagged to the noun sense.
The noun-sense rule applies only when the word is modifying a noun.
If the word is being used predicatively (that is, after some form of
the verb be, or where the verb could be replaced by a verb such as
seem, look, appear, etc.), there may be some confusion as to whether what
follows the verb is an adjective or a noun. So, in
- the weather was damp
damp is an adjective. Notice that you can replace was
with seemed/looked/appeared and still get a grammatical
sentence: the weather seemed/looked/appeared damp. But note the
difference between the pairs:
- he was drunk [pos = adjective]
- he was really drunk [pos = adjective]
- he was a drunk [pos = noun]
- he was obviously a drunk [pos = noun]
In the second pair, drunk is clearly a noun, not an adjective. It is
the complement of the verb be here. Two reliable ways to
recognize a noun are if it is (or can be) preceded by a determiner
(such as a or the) or an adjective (he was a silly
drunk).
In summary: when you have the situation of modifier + noun, the
modifier will be an adjective when:
- There is a corresponding adjective sense in WordNet for the meaning it is being used with, OR
- There is no corresponding adjective sense, but any of the tests comes up ok (very/rather x sounds good, or x-er/x-est/more x/most x sounds good), in which case Sense not in WordNet should be selected
The modifier will be a noun when:
- There is no corresponding adjective, and the very/rather and –er/–est/more x/most x tests are not ok, and a corresponding noun sense exists
If it is neither an adjective nor a noun, it might be a verb (see below)
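The summary above amounts to a small decision procedure, sketched here for illustration. The function name and the returned labels are hypothetical; the three inputs are judgments the annotator makes by looking at the sense list and trying the very/rather and comparative tests.

```python
def classify_modifier(has_adj_sense, passes_adj_test, has_noun_sense):
    """Decide how to treat a word that modifies a noun.

    has_adj_sense:   a matching adjective sense exists in WordNet
    passes_adj_test: very/rather x or x-er/x-est/more x/most x sounds ok
    has_noun_sense:  a matching noun sense exists in WordNet
    """
    if has_adj_sense:
        return "adjective"
    if passes_adj_test:
        # It is an adjective, but the sense is missing from WordNet
        return "adjective, sense not in WordNet"
    if has_noun_sense:
        return "noun"
    return "neither: check whether it is a verb"

print(classify_modifier(True, True, True))     # damp in damp weather
print(classify_modifier(False, False, True))   # cotton in cotton shirts
```

The order of the checks matters: a noun sense is only considered after both the adjective sense lookup and the adjective tests have failed.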
Adjective vs. present participle (-ing form) of verb
The -ing form of verbs can function as adjectives. For instance,
- you are frightening me [pos = verb]
- that is a frightening prospect [pos = adjective (modifier)]
- that prospect is frightening to me [pos = adjective (modifier)]
- that prospect is frightening me [pos = verb]
- we are working hard today [pos = verb]
- the car is in good working order [pos = adjective (modifier)]
How to tell? The easy case is when the word is modifying a noun. In
general, these are adjectives if there is a corresponding adjective
sense in WordNet. Such adjective senses exist for frightening and
working. However, this is not the case for clicking
and playing, so that in the following sentences,
- the door opened with a clicking sound
- the baseball player’s playing record was impeccable
the appropriate verb senses of click and play would
be selected instead. (This is because these are verbs playing the part
of adjectives, but are not adjectives in themselves.) When the word
appears predicatively (after some form of the verb be), the rule
can't always be applied since it might be impossible to tell whether
it is being used as a verb or an adjective.
- That dress is striking on you [adjective]
- The workers are striking for better working conditions [verb]
- The women are striking [ambiguous]
Without more information, you cannot know whether the third
sentence means that the women are picketing, or whether they are
beautiful. For ambiguous cases like this, if the context does not make
it clear, choose whichever you think is most appropriate and add a comment
saying that it is hard to tell.
Adjective vs. past participle (usually -en form) of verb
Past tense participles can also function as adjectives. The past
tense participle is the form of the verb that appears with the past
tense auxiliary have. It usually, though not always, ends in
-en or -ed: written, destroyed, and spun are past
participles of write, destroy and spin, respectively. The rule
of thumb will be similar to the present participle cases. Where the
word modifies a noun, check first for a corresponding adjective sense.
If no adjective sense exists, then assign the verb sense (if there is
one that matches the meaning as used in the sentence).
- that wasn’t the intended result [adjective]
- make out an itemized list [verb functioning as modifier]
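For participles that modify a noun, the rule of thumb (adjective sense first, verb sense otherwise) can be sketched as below. The two small sets are a toy stand-in for the WordNet sense lists, filled only with the examples mentioned in these guidelines; the function name is hypothetical.

```python
# Toy sense inventories built from the examples in these guidelines
ADJECTIVE_SENSES = {"frightening", "working", "intended"}
VERB_SENSES = {"click", "play", "itemize", "intend", "frighten", "work"}

def tag_participle_modifier(form, verb_lemma):
    """Return (pos, lemma) for a participle that modifies a noun.

    Prefer an adjective sense of the participle itself; otherwise fall
    back to the verb it is derived from. None means neither was found.
    """
    if form in ADJECTIVE_SENSES:
        return ("a", form)
    if verb_lemma in VERB_SENSES:
        return ("v", verb_lemma)
    return None

print(tag_participle_modifier("intended", "intend"))    # adjective
print(tag_participle_modifier("itemized", "itemize"))   # verb functioning as modifier
print(tag_participle_modifier("clicking", "click"))     # verb functioning as modifier
```

This only covers the easy attributive case; predicative uses still need the annotator's judgment.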
Again, the hard cases occur when the word appears predicatively (i.e., after some form of the verb "to be", or where the preceding verb can be replaced by a verb such as seem/look/appear, etc.).
- The sentence was written down for clarity. [verb]
- The sentence was written as opposed to spoken. [adjective]
- The sentence was written [ambiguous]
In the first sentence, written is a verb. A good test of this is
to put the auxiliary verb in the progressive – The sentence WAS
BEING written down for clarity. That makes it clear it is an act
or action that occurred. The second sentence cannot be phrased that
way and still have the same sense: The sentence was being written
as opposed to spoken. In the third sentence, it is not clear
whether "written" refers to an act of writing, or the attribute or
quality of being written. For ambiguous cases like this, do not assign
a sense, and the lexicographers will make the determination.
Noun vs. present participle (-ing form) of verb
To complicate things further, the present participle of verbs can function as a
noun. Often, the distinction is easy to make, if it appears where a
noun is called for grammatically, and there is a corresponding noun
sense in WordNet.
- he made a killing in the stock market [noun]
- the discordant ringing of nonmusical metallic
If no noun sense exists, then assign the verb sense, if
one exists, as for
- The merry frolicking of the lark [verb functioning as noun]
However, if the word is being used as a
verb, then a noun sense should never be assigned! This is easy if
there is no noun sense, as for frolicking
or when it is obviously depicting an ongoing action
- the lark is given to frolicking merrily [verb]
- make a sound like a car engine that is firing too soon [verb]
You can test this out, too. A verb can
never be modified by a or the or a possessive pronoun such as
my/your/our, etc. Try it with the 2 sentences above--it hurts!
But, again, there will be cases where this determination will be
impossible to make:
- She dislikes writing thank-you letters [verb]
- She had no knowledge of the writing of
the letter [noun]
- She had been talking about writing [ambiguous]
It is not clear whether writing in the third sentence
refers to the act of writing something (e.g., a letter), or whether
writing is the object itself (i.e., her writing, or an author's writing,
marks on a piece of paper, etc.). For ambiguous cases like this, assign
a sense and comment on the difficulty.
Using Wordnet Relations to determine sense (or senses)
In WordNet, senses are in part defined by their relations to
other senses. For this reason, the WordNet relations can be very
useful in narrowing down which of the senses applies to a particular
occurrence of the form. The relations for any word or collocation can
be viewed through the WordNet browser. From the WordNet entry of the
word you are tagging, clicking on one of the sense buttons
will display the full entry: you may want to middle click to open in a new tab.
The main relations that are of help are Hypernyms, Derivationally
related forms, and Domain. Not all relations will appear for all words
and all parts of speech for a word.
Hypernym (ISA relation)
The immediate hypernym is the most relevant one here. It is the first indented
relation just below the definition (preceded by an arrow =>). The
hypernym relation will tell you what kind of thing (object or action)
the word refers to. The higher up you go in the hypernym relations,
the more general the senses get (and so often less informative). There
is a new indentation for each level up you go. For instance, two
senses of the noun center that are rather close are:
- A: center, centre, middle, heart, eye -- (an area that is
approximately central within some larger region; "it is in the center
of town"; "they ran forward into the heart of the struggle"; "they
were in the eye of the storm")
- B: center, centre, midpoint -- (a
point equidistant from the ends of a line or the extremities of a
figure)
If you look at the hypernyms for the noun senses
of center, you can see that Sense A is an area while Sense B is
a point; what they have in common is a notion of centrality. Both are
at some level locations, and eventually all nouns are entities (so
that knowing that something is a kind of entity is not of much help at
all).
Domain (is this term restricted to one topic, area or field?)
Where they exist, the domain relations can be quite helpful in
narrowing senses down. A word’s domain will tell you whether it is
restricted to some field or area such as Law or Art. Take the
noun work. It has 7 senses, and if you look at its domains, you
can see that one of its senses is restricted to the domain of physics,
having to do with the transfer of energy.
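Narrowing senses by domain can be pictured with the small sketch below. The sense labels, abridged glosses and the data layout are hypothetical and only stand in for what the WordNet browser shows; only two of the senses of work are included.

```python
# Hypothetical, abridged sense list for the noun work
WORK_SENSES = [
    {"label": "A", "gloss": "activity directed toward making or doing something",
     "domain": None},
    {"label": "B", "gloss": "a manifestation of energy; the transfer of energy",
     "domain": "physics"},
]

def filter_by_domain(senses, context_domain=None):
    """Keep senses that are unrestricted or restricted to the context's domain."""
    return [s for s in senses if s["domain"] in (None, context_domain)]

# In a text that is not about physics, the physics sense is ruled out
print([s["label"] for s in filter_by_domain(WORK_SENSES)])
# In a physics text, both senses remain possible
print([s["label"] for s in filter_by_domain(WORK_SENSES, "physics")])
```

A domain restriction rules a sense out when the text is clearly about something else, but it never selects a sense on its own.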
Thanks to Christiane Fellbaum for sharing some documentation from the wordnet gloss tagging project.