Project Seminar: Current Topics in Machine Translation

Summer semester, 2010

The goal of this seminar is to familiarize students with current hot topics in Machine Translation research. Students will form groups and choose one of the presented projects to work on. We also encourage students with little or no programming experience to participate, as we would like the groups to be balanced, e.g. a programmer and a linguist can work together to achieve a goal. This will also help you figure out how to work with people from different backgrounds and with different expectations.

This document contains brief descriptions of the potential projects. Empty circle bullets denote optional requirements. It is also possible to propose your own project. Please email your two choices of projects to Dr. Andreas Eisele (eisele at dfki.de) by May 3, 2010.

50% of the grade will be based on the project report (incl. source code), 30% on the presentation, and 20% on the system's performance. The report and the presentation will be graded on clarity, as well as on the creativity and ambition shown in the execution of the project. Incorporating novel ideas is highly encouraged.

Schedule

#	Date	Topic
1	21.04.2010	General Introduction
2	03.05.2010	Selection of projects
3	?	Progress report
4	?	Progress report
5	?	Progress report
6	23.07.2010	Final session

A: Improving word alignment with PoS annotations

Supervisor: Andreas
Miriam, Dominikus

GIZA++ is a freely available word alignment system that includes training programs for the IBM models, the HMM model and Model 6. The goal of this project is to improve word alignment quality by incorporating PoS information into the alignment process. This may involve the following steps:

PoS tag both sides of the parallel corpus
Find interesting PoS patterns for aligned word pairs
Incorporate PoS during the alignment or after alignment (to filter the phrase tables)
Extract PoS-annotated phrase tables, possibly with additional related features
Translate with resulting models

Compare performance of the model with the plain model and the factored models
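
As a starting point for the step "Find interesting PoS patterns for aligned word pairs", the following minimal sketch counts how often each pair of source and target PoS tags is linked by the aligner. It assumes that the alignments have already been converted to one line per sentence pair of the form "i-j i-j ..." (source index, target index), which is not GIZA++'s native output format. The function name, the tag sets and the toy data are illustrative only.

```python
from collections import Counter


def pos_pair_counts(src_tags, tgt_tags, alignments):
    """Count (source PoS, target PoS) pairs over all aligned word positions.

    src_tags / tgt_tags: one list of PoS tags per sentence.
    alignments: one string per sentence pair, e.g. "0-0 1-1"
                (assumed format: source index, hyphen, target index).
    """
    counts = Counter()
    for s_tags, t_tags, align in zip(src_tags, tgt_tags, alignments):
        for link in align.split():
            i, j = map(int, link.split("-"))
            counts[(s_tags[i], t_tags[j])] += 1
    return counts


# Toy example: "das Haus" aligned to "the house" (tags are illustrative).
src_tags = [["ART", "NN"]]
tgt_tags = [["DT", "NN"]]
alignments = ["0-0 1-1"]

counts = pos_pair_counts(src_tags, tgt_tags, alignments)
for pair, n in counts.most_common():
    print(pair, n)
```

Tag pairs that occur very rarely in such a table are candidates for alignment errors and could be used for filtering; whether that helps in practice is exactly what the project would need to test.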

Prerequisites:

Text processing (in any programming language) in a Unix/Linux environment.
Knowledge of C++ is a plus (for understanding GIZA++).

B: Structure-aware translation models

Supervisor: Yu
Anne, Ruth, Olga

The aim of this project is to develop a new rule extraction method to induce linguistically motivated models from a parallel corpus with existing dependency parses for at least one side. This includes:

Apply a state-of-the-art dependency parser to one side of a parallel corpus
Study the extraction algorithm used in hierarchical SMT [Lopez]
Develop an algorithm that extracts linguistically sound rules
Test with a hierarchical decoder and compare with existing extraction algorithms
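
One possible way to make "linguistically sound" concrete is to keep only those source spans that cover a complete dependency subtree. The sketch below illustrates such a check; it is an assumption made for illustration, not the criterion prescribed by the project or the algorithm described in [Lopez]. The head-index representation and the function name are hypothetical.

```python
def is_complete_subtree(heads, start, end):
    """Check whether tokens start..end (inclusive) form one complete subtree.

    heads[i] is the index of the head of token i, or -1 for the root.
    The span qualifies if exactly one token inside it has its head
    outside the span, and no token outside the span attaches to a
    token inside it.
    """
    def inside(index):
        return start <= index <= end

    span_roots = [i for i in range(start, end + 1) if not inside(heads[i])]
    if len(span_roots) != 1:
        return False
    return all(
        not inside(head)
        for i, head in enumerate(heads)
        if not inside(i)
    )


# Toy parse of "the old man sleeps": "the" and "old" depend on "man",
# "man" depends on "sleeps", and "sleeps" is the root.
heads = [2, 2, 3, -1]

print(is_complete_subtree(heads, 0, 2))  # "the old man"
print(is_complete_subtree(heads, 1, 2))  # "old man" ("the" is left out)
```

A rule extractor could call such a check on every candidate phrase pair and discard spans that cut through a subtree; looser criteria are equally conceivable.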

Prerequisites: Knowledge of Python/Java is a plus.

[Lopez] Adam Lopez. 2008. Machine Translation by Pattern Matching. Ph.D. thesis, University of Maryland, College Park, MD, USA.

C: Dependency parsing for a more robust RBMT engine

Supervisor: Sabine
Andreas, Maria, Patricia

Many translation errors produced by a rule-based MT engine are due to the deterministic mechanisms built into the system. This project investigates the first phase, parsing analysis, of the RBMT system "Lucy" and seeks solutions to alleviate these problems. It includes the following parts:

Identify critical parse failures in the translation analysis produced by the Lucy system
Parse the critical sentences with a state-of-the-art dependency parser
Compare the parses and provide a thorough analysis

Integrate the improved parses into Lucy
Tree visualization for more fine-grained analysis (possibly an individual project)

Prerequisites:

Knowledge of current parsing technology

D: Inducing parallel lexicons from non-parallel corpora

Supervisor: Jia
Casey, Ceslav, Lilian

The aim of this project is to acquire bilingual information for language pairs for which parallel corpora of sufficient size are not available. It starts with lexicon acquisition and can be extended to much larger units such as sentence pairs. This may include:

Crawl the Web/Wikipedia/other interesting sources for pages in interesting languages (Croatian, Estonian, Greek, Latvian, Lithuanian, Romanian, Slovenian, Bulgarian or Polish).
Translate collected texts to English and/or German using existing MT systems such as Google, Bing, ...
Identify the cases when the MT engines failed to translate a word
Guess a better translation of such an unknown word, e.g. consult very large LMs with the (translated) context and find English/German words that would make sense at this position in a text.

Apply the resulting lexicon to induce parallel sentences from comparable corpora
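
For the step "Identify the cases when the MT engines failed to translate a word", a simple heuristic is to look for tokens that the engine copied unchanged from the source into its output. The sketch below assumes whitespace-tokenized text and a set of known target-language words; the function name, the vocabulary and the example sentence are made up for illustration. Names, numbers and punctuation are also copied verbatim by MT systems, so a real filter would need to handle them separately.

```python
def untranslated_tokens(source, translation, target_vocab):
    """Return tokens of the translation that look like untranslated source words.

    A token is flagged if it also occurs in the source sentence and is not
    a known target-language word. Comparison is case-insensitive.
    """
    source_words = {tok.lower() for tok in source.split()}
    return [
        tok
        for tok in translation.split()
        if tok.lower() in source_words and tok.lower() not in target_vocab
    ]


# Toy example: the last source word was passed through unchanged.
source = "Mačka sjedi na prostirci"
translation = "The cat sits on the prostirci"
target_vocab = {"the", "cat", "sits", "on", "mat"}

flagged = untranslated_tokens(source, translation, target_vocab)
print(flagged)  # ['prostirci']
```

The flagged tokens, together with their translated context, are the input for the subsequent step of guessing a better translation.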

Prerequisites:

Basic text processing in any programming language

E: Morphological classification for new languages via parallel corpora

This project takes the opposite angle to project D: it uses bilingual data to induce linguistic information for only one side. Only morphology is considered in this project. One possible approach is as follows:

Find groups of related forms (based on similarity in form and via alignments in parallel corpora)
Cluster the groups into paradigms (groups that vary in the same way)
Use existing morphologies to find interpretations for the different forms in each group
Induce morphological interpretations for new forms in this way
Test against held-out parts of the morphologies
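
The step "Cluster the groups into paradigms (groups that vary in the same way)" can be sketched as follows, under the simplifying assumption that related forms share a common stem as prefix and differ only in their suffixes. Each group is reduced to its set of suffixes, and groups with the same suffix set end up in the same paradigm. The function names and the English toy data are illustrative; languages with prefixing or stem alternation would need a different signature.

```python
import os
from collections import defaultdict


def suffix_signature(forms):
    """Strip the longest common prefix and return the set of remaining suffixes."""
    stem = os.path.commonprefix(list(forms))
    return frozenset(form[len(stem):] for form in forms)


def cluster_into_paradigms(groups):
    """Map each suffix signature to the list of form groups that share it."""
    paradigms = defaultdict(list)
    for forms in groups:
        paradigms[suffix_signature(forms)].append(sorted(forms))
    return dict(paradigms)


# Toy groups of related forms, as they might come out of the first step.
groups = [
    {"walk", "walks", "walked"},
    {"talk", "talks", "talked"},
    {"cat", "cats"},
]

paradigms = cluster_into_paradigms(groups)
for signature, members in paradigms.items():
    print(sorted(signature), members)
```

In this toy example the two verb groups share the signature with the empty suffix, "s" and "ed", while "cat"/"cats" forms a paradigm of its own.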

Prerequisites:

Knowledge of morphological analysis

F: Post-editing interface for machine translations

This project involves building a web-based interface that allows users to acquire and post-edit machine translations. The interface should not only facilitate the post-editing procedure, but also help collect evidence for more reliable evaluation of MT output. Members of this project need to do the following:

Try out "caitra" [Koehn]
Design an interface for users to post-edit translations generated by MT systems
Implement the interface in Django

Get familiar with "wikitrans"
Integrate the post-editing interface with "wikitrans"
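
One kind of evidence such an interface could collect is how much the user had to change the MT output. The sketch below computes a word-level edit distance between the MT output and its post-edited version, normalized by the length of the post-edited text. This is an assumption about what might be logged, not a requirement of the project; unlike established post-editing metrics it does not model word shifts, and the function names are hypothetical.

```python
def word_edit_distance(mt_output, post_edited):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp = mt_output.split()
    ref = post_edited.split()
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def edit_rate(mt_output, post_edited):
    """Edit distance divided by the number of words in the post-edited text."""
    ref_len = len(post_edited.split())
    if ref_len == 0:
        return 0.0
    return word_edit_distance(mt_output, post_edited) / ref_len


# Toy example: two words have to be changed to fix the word order.
mt_output = "the house is small very"
post_edited = "the house is very small"

print(word_edit_distance(mt_output, post_edited))
print(edit_rate(mt_output, post_edited))
```

Storing the MT output, the post-edited text and this score for every edited segment would allow MT systems to be compared by the amount of correction they require.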

Prerequisites:

Ability to program in Python
Knowledge of Django is a plus (for further integration).

[Koehn] Philipp Koehn and Barry Haddow. 2009. Interactive Assistance to Human Translators using Statistical Machine Translation Methods. MT Summit XII.

Created on 21 Apr 2010.