Project Seminar: Current Topics in Machine Translation
Summer semester, 2010
The goal of this seminar is to familiarize students with current hot topics in Machine Translation research. Students will form groups and choose one of the presented projects to work on. We also encourage students with little or no programming experience to participate, as we would like the groups to be balanced, i.e. a programmer and a linguist can work together to achieve a goal. This will also help you figure out how to work with people from different backgrounds and with different expectations.
This document contains brief descriptions of the potential projects. Empty circle bullets denote optional requirements. It is also possible to propose your own project. Please email your two choices of projects to Dr. Andreas Eisele (eisele at dfki.de) by May 3, 2010.
50% of the grade will be based on the project report (incl. source code), 30% on the presentation, and 20% on the system's performance. Grading will be based on the clarity of the report and the presentation, as well as the creativity and ambition shown in the execution of the project. Incorporating novel ideas is highly encouraged.
Schedule
| # | Date | Topic |
| 1 | 21.04.2010 | General Introduction |
| 2 | 03.05.2010 | Selection of projects |
| 3 | ? | Progress report |
| 4 | ? | Progress report |
| 5 | ? | Progress report |
| 6 | 23.07.2010 | Final session |
A: Improving word alignment with PoS annotations
Supervisor: Andreas
Miriam, Dominikus
GIZA++ is a freely available word alignment system that includes training programs for the IBM models, the HMM model, and Model 6. The goal of this project is to improve word alignment quality by incorporating PoS information into the alignment process. This may involve the following steps:
- PoS tag both sides of the parallel corpus
- Find interesting PoS patterns for aligned word pairs
- Incorporate PoS during the alignment or after alignment (to filter the phrase tables)
- Extract PoS-annotated phrase tables, possibly with additional related features
- Translate with resulting models
- Compare performance of the model with the plain model and the factored models
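As a minimal sketch of the second step above, PoS patterns can be found by counting which tag pairs co-occur on aligned words. The sketch assumes that both sides are already tagged and that the word alignment has been converted to the 0-based `i-j` link format used by Moses; the function name and the toy sentence pair are illustrative only and are not part of GIZA++.

```python
from collections import Counter

def pos_pair_counts(src_tags, tgt_tags, alignment):
    """Count (source PoS, target PoS) pairs over aligned word positions.

    src_tags, tgt_tags: lists of PoS tags, one per token.
    alignment: string of 0-based "i-j" links, e.g. "0-0 1-2".
    """
    counts = Counter()
    for link in alignment.split():
        i, j = (int(x) for x in link.split("-"))
        counts[(src_tags[i], tgt_tags[j])] += 1
    return counts

# Toy example: "das Haus ist klein" / "the house is small"
src = ["ART", "NN", "VAFIN", "ADJD"]
tgt = ["DT", "NN", "VBZ", "JJ"]
counts = pos_pair_counts(src, tgt, "0-0 1-1 2-2 3-3")
```

Accumulated over a whole corpus, rare tag pairs in such a table are candidates for alignment errors, which is one way the counts could feed into filtering.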
Prerequisites:
- Text processing (in any programming language) in a Unix/Linux environment.
- Knowledge of C++ is a plus (for understanding GIZA++).
B: Structure-aware translation models
Supervisor: Yu
Anne, Ruth, Olga
The aim of this project is to develop a new rule extraction method to induce linguistically motivated models from a parallel corpus with existing dependency parses for at least one side. This includes:
- Apply a state-of-the-art dependency parser to one side of a parallel corpus
- Study the extraction algorithm used in hierarchical SMT [Lopez]
- Develop an algorithm that extracts linguistically sound rules
- Test with a hierarchical decoder and compare with existing extraction algorithms
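What counts as a "linguistically sound" rule is left open here. One possible criterion, used below purely as an assumption for illustration, is that the source side of an extracted phrase must cover exactly one complete dependency subtree. The sketch assumes the parse is given as a list of 0-based head indices, with -1 marking the sentence root.

```python
def is_complete_subtree(heads, start, end):
    """Check whether tokens start..end (inclusive) form one full subtree.

    heads[i] is the 0-based index of the head of token i, or -1 for the root.
    """
    span = set(range(start, end + 1))
    # Tokens inside the span whose head lies outside it.
    roots = [i for i in span if heads[i] not in span]
    if len(roots) != 1:
        return False
    # No token outside the span may depend on a token inside it.
    return all(heads[i] not in span
               for i in range(len(heads)) if i not in span)

# Toy example: "the old man sleeps"
# the -> man, old -> man, man -> sleeps, sleeps -> root
heads = [2, 2, 3, -1]
```

With this parse, "the old man" (tokens 0 to 2) passes the check, while "old man" (tokens 1 to 2) fails because "the" depends on a token inside the span. A filter of this kind could be applied to candidate phrase pairs before rule extraction.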
Prerequisites: Knowledge of Python/Java is a plus.
- [Lopez] Adam Lopez. 2008. Machine Translation by Pattern Matching. Ph.D. thesis, University of Maryland, College Park, MD, USA.
C: Dependency parsing for a more robust RBMT engine
Supervisor: Sabine
Andreas, Maria, Patricia
Many translation errors produced by a rule-based MT engine are due to the deterministic mechanism built into the system. This project investigates the first phase, parse analysis, of the RBMT system "Lucy" and looks for ways to alleviate these problems. It includes the following parts:
- Identify critical parse failures in the translation analysis produced by the Lucy system
- Parse the critical sentences with a state-of-the-art dependency parser
- Compare the parses and give a thorough analysis
- Integrate the improved parses into Lucy
- Tree visualization for more fine-grained analysis (possibly an individual project)
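For the comparison step, a simple starting point is to measure how often two analyses attach a word to the same head. The sketch below assumes that both the Lucy analysis and the dependency parse have already been converted to lists of 0-based head indices over the same tokenization; that conversion is the hard part and is not shown. The function name is hypothetical.

```python
def compare_parses(heads_a, heads_b):
    """Return (agreement, differing_positions) for two head-index lists.

    agreement is the fraction of tokens with the same head in both parses.
    """
    if len(heads_a) != len(heads_b):
        raise ValueError("parses must cover the same tokens")
    diffs = [i for i, (a, b) in enumerate(zip(heads_a, heads_b)) if a != b]
    agreement = 1.0 - len(diffs) / len(heads_a) if heads_a else 1.0
    return agreement, diffs

# Toy example: the two parses disagree on the head of token 0.
agreement, diffs = compare_parses([2, 2, 3, -1], [1, 2, 3, -1])
```

The list of differing positions points directly at the tokens worth inspecting by hand or in a tree visualization.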
Prerequisites:
- Knowledge of current parsing technology
D: Inducing parallel lexicons from non-parallel corpora
Supervisor: Jia
Casey, Ceslav, Lilian
The aim of this project is to acquire bilingual information from language pairs for which parallel corpora of sufficient size are not available. It starts with lexicon acquisition and can be extended to much larger units such as sentence pairs. This may include:
- Crawl the Web/Wikipedia/other interesting sources for pages in interesting languages (Croatian, Estonian, Greek, Latvian, Lithuanian, Romanian, Slovenian, Bulgarian or Polish).
- Translate collected texts to English and/or German using existing MT systems such as Google, Bing, ...
- Identify the cases when the MT engines failed to translate a word
- Guess a better translation of such an unknown word, e.g. consult very large LMs with the (translated) context and find English/German words that would make sense at this position in a text.
- Apply the resulting lexicon to induce parallel sentences from comparable corpora
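One simple heuristic for spotting translation failures, given here as a sketch under the assumption that the MT engine copies unknown words into its output unchanged, is to flag output tokens that also occur in the source sentence and are missing from a target-language word list. The function name, the word list and the example sentences are illustrative.

```python
def untranslated_tokens(source, mt_output, target_vocab):
    """Return MT output tokens that look like untranslated source words.

    A token is flagged if it also appears in the source sentence
    (case-insensitive) and is not in the target-language vocabulary.
    """
    src_tokens = {t.lower() for t in source.split()}
    return [t for t in mt_output.split()
            if t.lower() in src_tokens and t.lower() not in target_vocab]

# Toy example with a made-up MT output that left one word untranslated.
vocab = {"the", "dog", "ate", "potato"}
flagged = untranslated_tokens("pas je pojeo krumpir",
                              "the dog ate krumpir", vocab)
```

Names, numbers and punctuation are also copied through legitimately, so a real implementation would need to filter those out before treating a flagged token as a lexicon gap.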
Prerequisites:
- Basic text processing in any programming language
E: Morphological classification for new languages via parallel corpora
This project takes the opposite angle to project D: it uses bilingual data to induce linguistic information for only one side. Only morphology is considered in this project. One possible approach is as follows:
- Find groups of related forms (based on similarity in form and via alignments in parallel corpora)
- Cluster the groups into paradigms (groups that vary in the same way)
- Use existing morphologies to find interpretations for the different forms in each group
- Induce morphological interpretations for new forms in this way
- Test against held-out parts of the morphologies
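The first two steps can be sketched using form similarity alone. The code below groups word forms by a shared fixed-length prefix, which is a deliberately crude stand-in for a real stem and ignores the alignment evidence mentioned above, and then clusters the groups by their set of endings. The function name and the `stem_len` parameter are assumptions made for illustration.

```python
from collections import defaultdict

def suffix_signatures(words, stem_len=4):
    """Cluster candidate stems by the set of suffixes they occur with.

    Returns a dict mapping a frozenset of suffixes (a "paradigm")
    to the list of prefixes that show exactly that set of suffixes.
    """
    groups = defaultdict(set)
    for w in words:
        if len(w) >= stem_len:
            # Treat the first stem_len characters as the stem.
            groups[w[:stem_len]].add(w[stem_len:])
    paradigms = defaultdict(list)
    for stem, suffixes in groups.items():
        paradigms[frozenset(suffixes)].append(stem)
    return paradigms

words = ["walk", "walks", "walked", "talk", "talks", "talked",
         "jump", "jumps"]
paradigms = suffix_signatures(words)
```

Here "walk" and "talk" end up in the same paradigm because both occur with the endings "", "s" and "ed", while "jump" forms its own group. Alignments from a parallel corpus could then be used to confirm or reject groups whose members do not translate to related forms.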
Prerequisites:
- Knowledge of morphological analysis
F: Post-editing interface for machine translations
This project involves building a web-based interface that allows users to obtain and post-edit machine translations. The interface should not only facilitate the post-editing procedure, but also help to collect evidence for more reliable evaluation of MT outputs. Members of this project need to do the following:
- Gain hands-on experience with "caitra" [Koehn]
- Design an interface for users to post-edit translations generated by MT systems
- Implement the interface in Django
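One kind of evidence such an interface could log is how much the user changed each MT output. As a minimal sketch, the word-level edit distance between the MT output and its post-edited version can be computed in plain Python; choosing this particular measure is an assumption made here, not a requirement of the project, and the function name is hypothetical.

```python
def word_edit_distance(mt_output, post_edited):
    """Word-level Levenshtein distance between two sentences.

    Counts the minimum number of word insertions, deletions and
    substitutions needed to turn mt_output into post_edited.
    """
    a, b = mt_output.split(), post_edited.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]

distance = word_edit_distance("the house small is", "the house is small")
```

Stored alongside each post-edited sentence, this value gives a rough per-sentence indication of how much correction the MT output needed.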
Prerequisites:
- Ability to program in Python
- Knowledge of Django is a plus (for further integration).
- [Koehn] Philipp Koehn and Barry Haddow. 2009. Interactive Assistance to Human Translators using Statistical Machine Translation Methods. MT Summit XII.
Created on 21 Apr 2010.