Project 7: Active Learning for Named Entity Recognition

We propose to study active learning for named entity recognition. To our knowledge, active learning has not been applied to this task before.

Named entity recognition would be vital to the task of bootstrapping new resources for the game. For example, if we wanted to model the game on an existing story such as The Hobbit or Neuromancer, it would be useful to extract all of the names of characters, places, and organizations mentioned in the books. Having these named entities identified would allow the parser to treat them appropriately as proper nouns, even without having the particular vocabulary items in its lexicon. This would allow our named entity recognizer to be a self-contained module in the game system.

Because named entity recognition is commonly treated as a statistical NLP task, it requires labeled training examples. However, creating a sufficiently large set of training examples for a new domain is time-consuming and tedious. We therefore propose a feasibility study on whether active learning can be used to create the annotated resources quickly. Active learning works by interactively requesting that a person annotate the most useful examples in order to reduce the error rate of the statistical models. As previous research has shown, this leads to faster-converging learning rates than randomly selecting examples. We propose a simulation experiment to determine the most effective ways of doing the active learning. The data set that we will use is the CoNLL 2002/2003 named entity task. This is a good data set because: 1) it is a large collection of already-labeled examples, which we can use to simulate active learning rather than having to label data ourselves, and 2) there are well-established results for this task. Many groups have built systems for it, so we have an idea of how well they perform, and we have models to draw on for our ensemble of classifiers.
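To make the simulation concrete, here is a minimal sketch of the kind of experiment loop we have in mind. It is written in Python with scikit-learn, uses synthetic data in place of featurized CoNLL tokens, and substitutes simple uncertainty sampling with a single classifier for the committee-based selection described below; every name and parameter is illustrative rather than part of the actual system.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for featurized CoNLL tokens (hypothetical setup).
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    rng = np.random.default_rng(0)
    idx = rng.permutation(len(y))
    labeled = list(idx[:50])   # small seed set whose labels are "revealed"
    pool = list(idx[50:])      # rest of the corpus, labels hidden

    for round_ in range(10):
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        # Confidence = probability of the most likely class; query the
        # pool examples the current model is least sure about.
        conf = model.predict_proba(X[pool]).max(axis=1)
        order = np.argsort(conf)                  # least confident first
        queried = [pool[i] for i in order[:25]]
        pool = [pool[i] for i in order[25:]]
        labeled.extend(queried)                   # simulated human annotator
        # Accuracy of the current model on the remaining pool gives one
        # point on the learning curve.
        print(round_, model.score(X[pool], y[pool]))

Replacing the least-confident selection with a random one gives the baseline learning curve we would compare against.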

The method that we will use for the active learning is "query by committee", in which an ensemble of different classifiers is used to posit labels for an example. The agreement between the classifiers is a way of measuring their confidence in a label, and a low degree of confidence serves as an approximate measure of the usefulness of the example. A human annotator is then asked to supply the correct labels for the most useful examples. An alternative way of forming the committee is to train a single type of classifier on different parts of the data, using "bagging". Bagging creates different subsets of the existing labeled examples by sampling with replacement; one type of classifier is trained on each subset, and the resulting models are treated as a committee just as before. We would like to run experiments with variations on these techniques.
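As a hedged illustration of the committee construction (again Python with scikit-learn on synthetic data, with all names and parameters chosen purely for the example), the sketch below builds a bagged committee from one classifier type and ranks the pool by vote entropy, a standard disagreement measure:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
    labeled = np.arange(100)     # pretend these are already annotated
    pool = np.arange(100, 1000)  # candidates for annotation

    # Bagging: resample the labeled set with replacement and train one
    # classifier type per bootstrap sample to form the committee.
    rng = np.random.default_rng(1)
    committee = []
    for k in range(7):
        boot = rng.choice(labeled, size=len(labeled), replace=True)
        committee.append(
            DecisionTreeClassifier(random_state=k).fit(X[boot], y[boot]))

    # Each committee member votes on every pool example.
    votes = np.stack([m.predict(X[pool]) for m in committee])

    def vote_entropy(column):
        # Disagreement on one example: entropy of the committee's votes.
        _, counts = np.unique(column, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()

    disagreement = np.apply_along_axis(vote_entropy, 0, votes)
    most_useful = pool[np.argsort(disagreement)[::-1][:10]]
    print("hand these to the annotator:", most_useful)

Swapping the decision trees for a mix of genuinely different classifier types would give the heterogeneous-committee variant; comparing the two is one of the variations we would like to test.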

The result of our project will be a proposed methodology for effectively applying active learning to annotating the characters and entities of a new game domain.

Chris Callison-Burch callison-burch@ed.ac.uk