Final Projects

Within the four months between the start of the semester and the final presentations, each of the four student teams implemented a great project on low-resource languages. All of the projects represent new research contributions, and some of them led to subsequent publications.

There is also a summary of the whole class in the following paper:

Alexis Palmer and Michaela Regneri:
Short-term projects, long-term benefits: Four student NLP projects for low-resource languages. In Proceedings of the ComputEL 2014 -- The ACL Workshop on the use of computational methods in the study of endangered languages.

The following gives a short overview of the four projects and links to the repositories.

Small-vocabulary speech recognition for any language

Team lex4all: Anjana Vakil and Max Paulus
Project repository | Poster

This project builds on existing research for small-vocabulary (up to roughly 100 distinct words) speech recognition. The result of this project is an easy-to-use interface that allows a user with no knowledge of speech technologies to build and test a system to recognize words spoken in the target language. In its current implementation, the system uses the English-language recognizer from the freely available Microsoft Speech Platform (http://msdn.microsoft.com/en-us/library/hh361572); for this reason, the system is available for Windows only.

Further details about the system (including where to download the code and a discussion of substituting other high-resource language recognizers) are described in the following paper:

Anjana Vakil, Max Paulus, Alexis Palmer, and Michaela Regneri:
lex4all: A language-independent tool for building and evaluating pronunciation lexicons for small-vocabulary speech recognition.
In Proceedings of the ACL 2014 System Demonstrations.

Language identification for many (many) languages

Team Sugali: Guy Emerson, Susanne Fertmann and Liling Tan
Project repository | Poster

This project addresses the task of language identification. Given a string of text in an arbitrary language, can we train a system to recognize what language the text is written in? The project uses four sources of data: the Universal Declaration of Human Rights, Wikipedia, ODIN, and some portions of the data available from Omniglot. The resulting system covers well over 1000 languages.

The corpus and how to access it are described in the following paper:

Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri:
SeedLing: Building and using a seed corpus for the Human Language Project.
In Proceedings of the ComputEL 2014 -- The ACL Workshop on the use of computational methods in the study of endangered languages.

A lemmatizer for Uspanteko

Team MayaLemm: Christine Bocionek, Liesa Heuschkel and Aleksandra Piwowarek
Project repository | Poster

This project resulted in a lemmatizer for the Mayan language Uspanteko. The (already available) source data had been cleaned, standardized and distributed through the Archive of Indigenous Languages of Latin America.

The lemmatization algorithm is based on longest common substring matching: the closest match for an inflected form is returned as the lemma. Additionally, a table for irregular verb inflections was generated using the annotated source corpus (roughly 50,000 words) and an Uspanteko-Spanish dictionary, to map inflected forms translated with the same Spanish morpheme.
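The matching idea described above can be illustrated with a minimal sketch. It is not the team's implementation: the function names, the tie-breaking rule (prefer the shorter lemma) and the English toy data are assumptions made purely for illustration, and the irregular-forms table here is a hand-written dictionary rather than one derived from a corpus and a bilingual dictionary.

```python
from difflib import SequenceMatcher

# Hypothetical toy data (English, for readability); not taken from the project.
LEMMAS = ["walk", "talk", "sing", "go"]
IRREGULAR = {"went": "go"}  # stand-in for a table of irregular inflections


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def lemmatize(form: str, lemmas: list, irregular: dict = None) -> str:
    """Return the candidate lemma sharing the longest substring with form.

    Irregular forms are looked up first. Ties are broken in favour of the
    shorter lemma (an assumption of this sketch).
    """
    if irregular and form in irregular:
        return irregular[form]
    return max(
        lemmas,
        key=lambda lemma: (longest_common_substring_length(form, lemma), -len(lemma)),
    )


print(lemmatize("walked", LEMMAS))           # closest match by shared substring
print(lemmatize("went", LEMMAS, IRREGULAR))  # resolved through the irregular table
```

Without the irregular table, a form such as "went" shares almost nothing with its lemma, which is why a separate lookup for irregular inflections is useful alongside substring matching.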

Manual evaluation of 100 sentences, for which a linguist on the team with knowledge of Spanish determined citation forms, showed an accuracy of 59% for the lemmatization algorithm.

Named Entity Recognition for Slovak & Persian

Team NER: Omid Moradiannasab and Michal Petko
Project repository | Poster

This project tackles the task of named entity recognition (NER). The students developed a single platform to do NER in both Slovak and Persian, their native languages. The approach is primarily based on using gazetteers (for person names and locations), as well as regular expressions (for temporal expressions).
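As a rough illustration of how gazetteer lookup and regular expressions can be combined, consider the following sketch. It does not reproduce the students' platform: the gazetteer entries, the date pattern and all function names are hypothetical, and English toy data stands in for Slovak and Persian.

```python
import re

# Hypothetical toy gazetteers; real ones would hold many more entries.
PERSON_GAZETTEER = {"Alan Turing", "Ada Lovelace"}
LOCATION_GAZETTEER = {"London", "New York"}

# Assumed date format DD.MM.YYYY, chosen only for this example.
DATE_PATTERN = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")


def build_gazetteer_pattern(entries):
    """Compile one regex matching any gazetteer entry as a whole phrase."""
    # Longest entries first, so multi-word names win over shorter overlaps.
    alternatives = sorted(entries, key=len, reverse=True)
    body = "|".join(re.escape(entry) for entry in alternatives)
    return re.compile(r"\b(?:" + body + r")\b")


def tag_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    patterns = [
        ("PERSON", build_gazetteer_pattern(PERSON_GAZETTEER)),
        ("LOCATION", build_gazetteer_pattern(LOCATION_GAZETTEER)),
        ("TIME", DATE_PATTERN),
    ]
    found = []
    for label, pattern in patterns:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, span) for _, label, span in sorted(found)]


print(tag_entities("Alan Turing visited New York on 12.03.1950."))
```

Exact string matching like this only finds forms that are listed verbatim in a gazetteer, so inflected variants of a name would be missed unless they are listed or normalized first.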

The project could use some existing resources for both languages, but also devoted quite some time to producing new gazetteers. For Slovak, additional challenges were presented by the language's large number of inflectional cases and resulting variability in form, as well as ambiguity. In Persian, the main challenges were the detection of word boundaries (many names are multi-word expressions) and frequent ambiguities between NEs and proper nouns.

For evaluation, the students hand-labeled over 35,000 words of Slovak (containing 545 NE instances) and about 600 paragraphs of Persian data (with 306 NE instances). Performance varies across named entity categories: temporal expression matching is most reliable (F-score 0.96 for Slovak, 0.89 for Persian), followed by locations (0.78 Slovak, 0.92 Persian) and person names (0.63 Slovak, 0.87 Persian).
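For readers unfamiliar with the metric, the sketch below shows how an F-score is typically computed from counts of correct and incorrect predictions. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the text above does not state which variant was used, and the counts in the example are invented for illustration rather than taken from the project's evaluation.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 entities found correctly, 2 spurious, 2 missed.
print(round(f1_score(8, 2, 2), 2))
```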