Nov 3rd (Tue) 9:30--15:30:
- Take the data from Omniglot
- Complete the following tasks to the best of your ability
Divide the tasks among the group
- Read all the files that end in .tsv in the Omniglot folder.
- Match the language name in each file name with the ISO 639 language codes.
Will require some tweaking
- Create dictionaries of the form:
lang_iso[ISOlangcode] = language name
lang_iso[language name] = ISOlangcode
E.g. the file 'Chinese (Cantonese).tsv' should be linked to 'Yue Chinese' (yue),
according to ISO 639-3.
If a language cannot be matched to an ISO code,
use the full name of the language as it appears in the file name.
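A minimal sketch of the matching step, assuming a small hand-made excerpt of the ISO 639-3 table. The names ISO_NAMES, OVERRIDES, list_tsv_files and build_lang_iso are hypothetical, and indexing the dictionary by the file-name spelling as well as the ISO name is an extra assumption. The real table is much larger and the overrides are where the "tweaking" would go.

```python
import glob
import os

# Hypothetical excerpt of an ISO 639-3 table: code -> reference name.
# In practice this would be loaded from the full ISO 639-3 code table.
ISO_NAMES = {"yue": "Yue Chinese", "fra": "French", "deu": "German"}

# Hand-written overrides for file names that do not match an ISO name directly.
OVERRIDES = {"Chinese (Cantonese)": "yue"}


def list_tsv_files(folder):
    """Return the paths of all files ending in .tsv inside folder."""
    return sorted(glob.glob(os.path.join(folder, "*.tsv")))


def language_from_filename(filename):
    """Strip the directory and the .tsv extension from a file name."""
    return os.path.splitext(os.path.basename(filename))[0]


def build_lang_iso(filenames):
    """Return a two-way dictionary between ISO codes and language names."""
    name_to_code = {name.lower(): code for code, name in ISO_NAMES.items()}
    lang_iso = {}
    for filename in filenames:
        language = language_from_filename(filename)
        code = OVERRIDES.get(language) or name_to_code.get(language.lower())
        if code is None:
            # No ISO match: fall back to the full name from the file name.
            lang_iso[language] = language
        else:
            lang_iso[code] = ISO_NAMES[code]
            lang_iso[ISO_NAMES[code]] = code
            lang_iso[language] = code  # assumption: also index by file-name spelling
    return lang_iso
```

For example, build_lang_iso(list_tsv_files("Omniglot")) would build the dictionary for every file in a folder called Omniglot.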
- Create dictionaries for each language:
Maybe break this task up
- translations[english_phrase][ISOlangcode] = list(tuples of (translated phrases, transliteration, comment))
If there is no transliteration for a particular language or word,
use the special value None.
- You might want to write them out to a file, one entry per line, to make them easy to merge:
phrase \t ISO \t translation \t transliteration \t comment
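One possible way to hold the translations and write them out in the five-column layout above. This is only a sketch: the helper names add_translation, rows and write_translations are hypothetical, and storing a missing comment as None (like a missing transliteration) is an assumption.

```python
import csv
from collections import defaultdict

# translations[english_phrase][iso_code] -> list of
# (translation, transliteration, comment) tuples
translations = defaultdict(lambda: defaultdict(list))


def add_translation(phrase, iso, translation, transliteration=None, comment=None):
    """Store one entry; empty transliterations or comments become None."""
    entry = (translation, transliteration or None, comment or None)
    translations[phrase][iso].append(entry)


def rows():
    """Yield [phrase, ISO, translation, transliteration, comment] lists."""
    for phrase in sorted(translations):
        for iso in sorted(translations[phrase]):
            for translation, transliteration, comment in translations[phrase][iso]:
                # None is written out as an empty field.
                yield [phrase, iso, translation, transliteration or "", comment or ""]


def write_translations(path):
    """Write all entries to a tab-separated file, one entry per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows())
```

Files written this way by different group members can be merged by reading each line back and calling add_translation on its fields.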
- For each file, parse the tab separated values assuming they are tables,
where the first row defines the headers.
For each line of the table, you should extract, at least, the English phrase
(to be used as keys to the translations dictionaries described above), the translation of each
English phrase in that language, and the transliteration (whenever available).
- Load the translations dictionaries with the data collected from each file.
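The parsing step could look roughly like this, using the standard csv module. The column names English, Translation and Transliteration are assumptions; the real Omniglot files may label their columns differently, so check the first row of each file. parse_table and parse_file are hypothetical names.

```python
import csv

# Assumed header names; the real files may use different labels.
PHRASE_COL = "English"
TRANSLATION_COL = "Translation"
TRANSLIT_COL = "Transliteration"


def parse_table(lines):
    """Yield (phrase, translation, transliteration) from tab-separated lines.

    The first line is taken as the header row. A missing or empty
    transliteration is returned as None.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        phrase = (row.get(PHRASE_COL) or "").strip()
        translation = (row.get(TRANSLATION_COL) or "").strip()
        if not phrase or not translation:
            continue  # skip blank or incomplete rows
        translit = (row.get(TRANSLIT_COL) or "").strip() or None
        yield phrase, translation, translit


def parse_file(path):
    """Parse one .tsv file and return its rows as a list of tuples."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(parse_table(handle))
```

Each tuple returned by parse_file can then be stored under the ISO code (or fallback name) matched to that file.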
AIM TO HAVE THIS DONE (at least a first go) BY 13:30
- Make a table showing the coverage (in %)
of each language for the collection of all English phrases presented in all files.
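A sketch of the coverage calculation, assuming the translations dictionary has the shape translations[english_phrase][ISOlangcode] = list of tuples. Coverage here means the share of all English phrases (across all files) for which a language has at least one translation. The function names are hypothetical.

```python
def coverage(translations):
    """Return {iso_code: percentage of all English phrases translated}."""
    total = len(translations)
    counts = {}
    for by_language in translations.values():
        for iso, entries in by_language.items():
            if entries:  # count a phrase once per language
                counts[iso] = counts.get(iso, 0) + 1
    return {iso: 100.0 * n / total for iso, n in counts.items()}


def format_coverage(table):
    """Format the coverage table as tab-separated text, highest first."""
    lines = ["Language\tCoverage (%)"]
    for iso, percent in sorted(table.items(), key=lambda item: (-item[1], item[0])):
        lines.append(f"{iso}\t{percent:.1f}")
    return "\n".join(lines)
```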
- Compare each language with every other language (using the original text or
the transliteration, or better both), as you did for Project 1.
Output the pairs in order of closeness.
There will be n(n-1)/2 unordered pairs (n(n-1) if order matters), so leave some time for this.
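One way to rank the language pairs is sketched below. The measure used in Project 1 is not stated here, so this swaps in difflib.SequenceMatcher from the standard library as a stand-in, averaged over the phrases both languages share; replace similarity() with whatever measure you used before. Only the first entry for each phrase is compared, which is a simplifying assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a string similarity between 0.0 and 1.0 (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closeness(translations, field=0):
    """Rank language pairs by mean similarity over shared phrases.

    field=0 compares the translations, field=1 the transliterations.
    Returns a list of ((iso_a, iso_b), score), closest pair first.
    """
    languages = sorted({iso for by_lang in translations.values() for iso in by_lang})
    scores = {}
    for first, second in combinations(languages, 2):  # unordered pairs
        ratios = []
        for by_lang in translations.values():
            if by_lang.get(first) and by_lang.get(second):
                a = by_lang[first][0][field]
                b = by_lang[second][0][field]
                if a and b:  # skip missing values such as None
                    ratios.append(similarity(a, b))
        if ratios:
            scores[(first, second)] = sum(ratios) / len(ratios)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling closeness(translations, field=1) gives the same ranking based on transliterations, so both rankings can be reported side by side.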
- Hand in the deliverables
- Go home, happy that it is over, proud of what you have accomplished
- Per person:
- 1-2 page paper in any format, finally given as pdf,
describing what you have done, and
what you think could be done given more time.
Include title, author, date
HG2051 Project 3: Parsing omniglot
The file should be named name_surname.pdf
- program (with comments in the code: can be shared by two people)
The program should be named after one of the authors
- program output
- any input needed by your program
Project Three for