Breaking the Resource Bottleneck for Multilingual Processing
Rebecca Hwa
Training an application for
natural language processing is a challenging machine
learning task: how can a machine automatically and efficiently
induce a model of the complex structures of human language?
Unsupervised learning is not well-suited for this problem because
human languages contain too much ambiguity; on the other hand, fully-supervised
methods require large quantities of manually-annotated
training data, which are difficult to obtain. The annotation
bottleneck is worse for non-English languages because fewer resources
have been developed for them. One way to alleviate the problem
is to build new resources by bootstrapping from existing English
resources. In this talk, I will present my work on inducing annotated
resources for Chinese to train applications such as parsers and
taggers by bootstrapping from English resources.
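One common way to bootstrap annotations across languages is to project them through word alignments in a parallel corpus. The sketch below is a hypothetical, simplified illustration of that general idea for POS tags, not the speaker's actual method; the function name, the first-link-wins heuristic, and the toy data are all assumptions for illustration.

```python
def project_tags(en_tags, alignment):
    """Project English POS tags onto target-language tokens.

    en_tags:   list of (english_word, tag) pairs
    alignment: list of (en_index, zh_index) word-alignment links
    Returns a dict mapping zh_index -> projected tag.
    """
    projected = {}
    for en_i, zh_i in alignment:
        # If a target word aligns to several source words, keep the
        # first projected tag (a simple, debatable heuristic).
        projected.setdefault(zh_i, en_tags[en_i][1])
    return projected

# Toy example: a 3-word English sentence aligned one-to-one
# with a 3-token Chinese sentence.
en = [("I", "PRON"), ("love", "VERB"), ("books", "NOUN")]
links = [(0, 0), (1, 1), (2, 2)]
print(project_tags(en, links))  # {0: 'PRON', 1: 'VERB', 2: 'NOUN'}
```

In practice, projected annotations are noisy (alignment errors, cross-linguistic divergences), so such output is typically filtered or used to train a model that smooths over the noise.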