HG2051 Group Project

due date: 25 November 2019 (Updated)

groups: You may work in groups of 2 or 3 people or by yourself

grading: The project is worth 20% of your final grade. See the Grading section for more information.

Overview

This project involves 3 components:

There are two task options:

  1. The predefined Swadesh-WordNet Interlingual Metrics task; AKA the guided tour
  2. A task that you define yourself (and I approve); AKA choose your own adventure

Choose one of these tasks.

General Notes

Swadesh-WordNet Interlingual Metrics

For this project you will work with the Open Multilingual Wordnet (OMW) and the disambiguated Swadesh lists (Morgado da Costa, Bond, and Kratochvíl 2016) to estimate monolingual lexical specificity and interlingual semantic overlap. These will be explained below. You should compute these metrics for four languages, one of which must be English.

For the first week, you should:

For the second and third weeks, you should:

The definition of the metrics will be added later. For now just try and finish the above.

Note: This task is similar to the first task of Project 2 from the 2017 offering of HG2051, so you may find some useful information there. It is not the same, however, so be careful how much you rely on it.

Getting Started

Load Swadesh Data

First you will write the load_swadesh() function in the swim.py module. The swadesh.tsv file format looks like this:

# comments begin with '#'
# comments and blank lines should be ignored
# other lines are tab-separated triples

05269901-n  asi:lemma   ekera
05269901-n  arq:lemma   ʕəðˁma
05269901-n  pmr:lemma   gri
05269901-n  buy:lemma   bak

The first column contains Princeton WordNet "offsets", which are historically used as synset ids. Later you will convert them to the standard ids (e.g., where 05269901-n is bone.n.01, etc.), but for now leave them as-is. The second column contains the language code suffixed by :lemma. The third column contains the lemma string.
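The file format described above can be read with a few lines of Python. The following is only a minimal sketch, not the required solution: the function name parse_swadesh_lines() and the nested-dictionary return shape are assumptions made for illustration, and your load_swadesh() should follow the specification given in this section.

```python
def parse_swadesh_lines(lines):
    """Parse tab-separated Swadesh lines into {language: {offset: [lemmas]}}.

    Hypothetical helper: the name and the return shape are assumptions.
    """
    data = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith('#'):
            continue
        # Each remaining line is a triple; a malformed line raises ValueError
        offset, lang_field, lemma = line.split('\t')
        lang = lang_field.split(':', 1)[0]  # 'asi:lemma' -> 'asi'
        data.setdefault(lang, {}).setdefault(offset, []).append(lemma)
    return data


sample = [
    "# comments begin with '#'",
    "",
    "05269901-n\tasi:lemma\tekera",
    "05269901-n\tarq:lemma\tʕəðˁma",
    "05269901-n\tpmr:lemma\tgri",
    "05269901-n\tbuy:lemma\tbak",
]
swadesh = parse_swadesh_lines(sample)
print(swadesh['asi'])  # {'05269901-n': ['ekera']}
```

To read a real file, you could pass an open file object instead of the list, since iterating over a file yields its lines.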

The load_swadesh() function will:

Choose Three Non-English Languages

Once you have load_swadesh() working, choose 3 languages other than English to examine. The criteria for choosing languages are:

  1. The language exists in both the OMW and in the Swadesh lists.
  2. Two of the three non-English wordnets should have > 50,000 entries. The third one is up to you.

Monolingual Metrics

First you will implement a set of monolingual word-based (i.e., lemma-based) metrics. These come from McCrae and Prangnawarat, 2016. Look at Section 3.3 (you don't have to read the whole paper) and implement the following:

NOTE: The first two functions are useful for defining the last one. Breaking the problem into smaller chunks makes it more manageable.

Part of the challenge here is translating the mathematical notation in the paper into code. This is a useful skill to learn, so I suggest you try it on your own, but if you're really stuck I can help out.

Also, use the provided swim.synsets() function instead of wn.synsets() (assuming you did from nltk.corpus import wordnet as wn or similar), as the provided version avoids inflating the English results with morphological variants. This helps keep the results for English and the other languages comparable. You might also find swim.lemmas() useful.

Once you have these metrics defined, use them for some analysis in the notebook. See the notebook (now updated) for more information.

Update to swim.common_ssids() (new)

The first version of swim.common_ssids() returned synset objects, and an updated version returned synset ids. The current version also filters out synsets that do not contain lemmas for all requested languages. Be sure to use the new version of this function in your module.
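To make the filtering step concrete, here is a small sketch of the idea on toy data. It is not the code of swim.common_ssids(): the function name ids_with_all_languages(), the {language: {id: [lemmas]}} input shape, and the id 00000000-n with its lemma are all invented for illustration.

```python
def ids_with_all_languages(data, languages):
    """Return the sorted ids that have at least one lemma in every language.

    Hypothetical helper; `data` is assumed to map language -> id -> lemmas.
    """
    per_language = [
        {ssid for ssid, lemmas in data.get(lang, {}).items() if lemmas}
        for lang in languages
    ]
    if not per_language:
        return []
    # Keep only the ids present in every language's set
    return sorted(set.intersection(*per_language))


toy = {
    'asi': {'05269901-n': ['ekera'], '00000000-n': ['placeholder']},
    'pmr': {'05269901-n': ['gri']},
}
print(ids_with_all_languages(toy, ['asi', 'pmr']))  # ['05269901-n']
```

The made-up id 00000000-n is dropped because it has no lemma in pmr.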

Interlingual Metrics (new)

Finally you will implement the following interlingual metric:

This computes the proportion of lemma–synset links that share the synset between two languages. Consider how the swim.lemmas() function returns a list of lemmas in some language for a synset id and how swim.synsets() returns a list of synsets for a word in some language. You can combine these to "expand" a synset to a list of synsets that share the lemmas of the first synset. This is what the provided swim.expand_homographs() does (recall that homographs are words that are spelled the same but have different meanings, like "paper": the material, an article, to cover with paper, etc.). The following illustration may help to explain the process:

Illustration of WordNet synset and lemma relations and related metrics.

The above illustration shows synsets on the top line, starting with sun.n.01 (the bold one), which is expanded to lemmas below in English (blue lines; "Sun", "sun") and Spanish (red lines; "sol"). Each of those lemmas is in turn expanded to their synsets. At the bottom of the illustration are expected values for the calls of the monolingual and interlingual metrics on this example.
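The expansion from a synset to its lemmas and back to synsets can be sketched with toy data. This is not the implementation of swim.expand_homographs(): the function names below are hypothetical stand-ins for swim.lemmas() and swim.synsets(), and the lemma-synset links are invented for illustration rather than taken from any real wordnet. The sketch stops short of computing any metric.

```python
# Invented lemma-synset links per language (NOT real wordnet data)
TOY_LINKS = {
    'eng': {
        'sun.n.01': ['Sun', 'sun'],
        'sunlight.n.01': ['sun', 'sunlight'],
        'sunday.n.01': ['Sun', 'Sunday'],
    },
    'spa': {
        'sun.n.01': ['sol'],
        'sunlight.n.01': ['sol'],
        'sunday.n.01': ['domingo'],
    },
}


def toy_lemmas(ssid, lang):
    """Stand-in for swim.lemmas(): lemmas of a synset id in a language."""
    return TOY_LINKS[lang].get(ssid, [])


def toy_synsets(lemma, lang):
    """Stand-in for swim.synsets(): synset ids that contain a lemma."""
    return [ssid for ssid, lemmas in TOY_LINKS[lang].items() if lemma in lemmas]


def expand_via_lemmas(ssid, lang):
    """Expand a synset id to every synset id that shares one of its lemmas."""
    expanded = set()
    for lemma in toy_lemmas(ssid, lang):
        expanded.update(toy_synsets(lemma, lang))
    return expanded


eng = expand_via_lemmas('sun.n.01', 'eng')
spa = expand_via_lemmas('sun.n.01', 'spa')
print(sorted(eng))        # ['sun.n.01', 'sunday.n.01', 'sunlight.n.01']
print(sorted(eng & spa))  # ['sun.n.01', 'sunlight.n.01']
```

A set is used here so that a synset reached through two different lemmas is only counted once; whether that is what you want depends on the metric you are computing.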

After you've implemented swim.synset_overlap(), you should examine its outputs on the common synset IDs for your selected languages. Do genealogically related languages show more similarity than less-related languages? Do you think the size of the respective wordnet for each language has a bigger effect?

Finally, you should implement one more metric, but it is up to you. You can try to fix deficiencies you've found with synset_overlap(), or try something new, such as finding which two lemmas have the highest overlap ("sun" and "sol" in the example above). Or perhaps your experiments inspired you to try something else altogether. Implement the function in swim.py, then test it in project.ipynb as you've done with the other functions.
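As one illustration of the "highest overlap" idea, the sketch below scores every cross-language pair of lemmas by the Jaccard similarity of their synset sets and returns the best pair. Jaccard similarity is an assumption chosen for this example, not a metric prescribed by the project, and both the function names and the lemma-to-synset data are invented.

```python
from itertools import product

# Invented lemma -> synset-id sets (NOT real wordnet data)
ENG = {
    'Sun': {'sun.n.01', 'sunday.n.01'},
    'sun': {'sun.n.01', 'sunlight.n.01'},
}
SPA = {
    'sol': {'sun.n.01', 'sunlight.n.01'},
}


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_lemma_pair(synsets_a, synsets_b):
    """Return the (lemma_a, lemma_b) pair whose synset sets overlap most.

    Hypothetical helper; raises ValueError if either mapping is empty.
    """
    pairs = product(synsets_a, synsets_b)
    return max(pairs, key=lambda p: jaccard(synsets_a[p[0]], synsets_b[p[1]]))


print(best_lemma_pair(ENG, SPA))  # ('sun', 'sol')
```

With this toy data, "Sun" and "sol" share one of three synsets while "sun" and "sol" share both of theirs, so the second pair wins.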

Self-designed Task

This task is up to you. Maybe you have a research question you wish to answer, or perhaps you have an idea for a fun toy project. It should be at least as difficult as the other task above, it should make use of the Python (and maybe the NLTK) concepts we have covered in class, and it should be linguistically motivated.

Note: You must run it by me first. So talk it over with your group and let me know by about Tuesday 01 October.

Some ideas:

Report (new)

For either the Swadesh-WordNet task or the self-designed task, you should provide a report. This is where you explain what you did, who did what, how you did it, why you did it that way, other things you tried, explanations in case something didn't work for you, your interpretation of the results, what you would like to improve, etc. The report should be about 8 to 12 pages.

More specifically, it should include:

Also include:

Note that the report is 50% of the project grade. Don't wait until the last minute to start writing.

Deliverable

For the first task, please do not include swadesh.tsv.

Grading

The project is worth 20% of your overall grade. All group partners receive the same grade unless there is disagreement about involvement. The points are distributed as follows: