Structured and Unstructured Cache Models for SMT Domain Adaptation

I will present a French-to-English translation system for Wikipedia biography articles. We train the system on out-of-domain corpora and adapt it to biographies. We propose two forms of domain adaptation. The first biases the system towards words that are likely in biographies and encourages repetition of words across the document as a whole.

Since biographies in Wikipedia follow a regular structure, the second model we present exploits this structure as a sequence of topic segments, where each segment discusses a narrower subtopic of the biography domain. In this structured model, the system is encouraged to use words likely in the current segment's topic rather than in biographies as a whole. We implement both systems using recently proposed cache-based translation techniques. We show that a system trained on Europarl and News Commentary can be adapted for biographies with an improvement of 0.5 BLEU using our models. Further, we show that the structure-aware model outperforms the system that treats the entire document as a single segment.

Exploiting regularities in topic organization has become popular in fields such as coherence modeling and automatic summarization. This work extends some of these ideas to improve machine translation performance.

This is joint work with Bonnie Webber.