Tagging and Parsing with Cascaded Markov Models

Automation of Corpus Annotation

Thorsten Brants
Universität des Saarlandes, Computerlinguistik
Postfach 151150, D-66041 Saarbrücken, Germany
thorsten@coli.uni-sb.de

Dissertation for the attainment of the degree of
Doktor der Ingenieurwissenschaften (Dr.-Ing.)
at the Technische Fakultät of
Universität des Saarlandes.

First reviewer: Prof. Dr. Hans Uszkoreit
Second reviewer: Prof. Dr. Wolfgang Wahlster

Colloquium: July 7th, 1999


Saarbrücken Dissertations in Computational Linguistics and Language Technology; Volume 6


Abstract

The methods presented in this thesis aim at the automation of corpus annotation and the processing of large corpora. Automation enables the efficient generation of linguistically interpreted corpora, which are, on the one hand, a prerequisite for theoretical linguistic investigations and the development of grammatical processing models. On the other hand, they form the basis for the further development of corpus-based taggers and parsers and thereby take part in a bootstrapping process.

The presented methods are based on Markov Models, which model spoken or written utterances as probabilistic sequences. For written language processing, their most prominent application is probably part-of-speech tagging, i.e., the assignment of morpho-syntactic categories to words. We show that the technique used for part-of-speech tagging can be transferred to higher levels of linguistic annotation: Markov Models are suitable for a broader class of labeling tasks and for the generation of hierarchical structures.
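To make the tagging model concrete, the following sketch shows how the most probable tag sequence can be computed with the Viterbi algorithm for a first-order (bigram) Markov Model. It is a minimal illustration, not the implementation used in this thesis; the probability tables are placeholders that would in practice be estimated from an annotated corpus and smoothed for unseen events.

def viterbi(words, tags, trans, emit, start):
    """Return the most probable tag sequence for `words`.

    trans[(t1, t2)] -- probability of tag t2 following tag t1
    emit[(t, w)]    -- probability of tag t emitting word w
    start[t]        -- probability of tag t at the sentence start
    """
    # best[i][t] = (probability of the best path ending in tag t at position i, predecessor tag)
    best = [{t: (start.get(t, 0.0) * emit.get((t, words[0]), 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prob, prev = max(
                ((best[i - 1][s][0] * trans.get((s, t), 0.0) * emit.get((t, words[i]), 0.0), s)
                 for s in tags),
                key=lambda x: x[0])
            column[t] = (prob, prev)
        best.append(column)
    # Trace back the highest-probability path.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))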

While part-of-speech tagging assigns a category to each word, the presented method of tagging grammatical functions assigns a function to each word/tag pair. Going up in the hierarchy, Markov Models determine phrase categories for a given structural element.
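The following continuation of the sketch above is one way to picture this second level of tagging: the states of the Markov Model are grammatical functions rather than parts of speech, and each observation is a word/tag pair. The function labels and all probabilities are purely illustrative; the models in this thesis are trained on an annotated treebank.

functions = ["SB", "HD"]                       # e.g. subject and head (illustrative labels)
sentence = [("Peter", "NE"), ("schläft", "VVFIN")]

start = {"SB": 0.7, "HD": 0.3}
trans = {("SB", "HD"): 0.9, ("SB", "SB"): 0.1,
         ("HD", "SB"): 0.2, ("HD", "HD"): 0.8}
# Emissions are keyed by (function, (word, tag)) instead of (tag, word).
emit = {("SB", ("Peter", "NE")): 0.5,
        ("HD", ("schläft", "VVFIN")): 0.6}

print(viterbi(sentence, functions, trans, emit, start))   # -> ['SB', 'HD']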

The technique is further extended to implement a shallow parsing model. Instead of a single word or a single symbol, each state of the proposed Markov Models emits context-free partial parse trees. Each layer of the resulting structure is represented by its own Markov Model, hence the name Cascaded Markov Models. The output of each layer of the cascades is a probability distribution over possible bracketings and labelings for that layer. This output forms a lattice and is passed as input to the next layer.
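The sketch below illustrates the data flow of the cascade, not the full parsing algorithm: each layer receives a lattice of labeled spans, adds new phrase hypotheses scored by its own model, and passes the enriched lattice upwards. The edge representation, the toy layer, and the probabilities are assumptions made purely for illustration.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Edge:
    start: int      # position before the first covered token
    end: int        # position after the last covered token
    label: str      # part-of-speech or phrase category
    prob: float     # probability assigned by the layer that produced the edge

def run_cascade(lattice: List[Edge],
                layers: List[Callable[[List[Edge]], List[Edge]]]) -> List[Edge]:
    """Pass a lattice of hypotheses through a stack of layer models."""
    for layer in layers:
        lattice = layer(lattice)
    return lattice

def toy_np_layer(lattice: List[Edge]) -> List[Edge]:
    """Illustrative layer: propose an NP over every adjacent ART + NN pair."""
    output = list(lattice)                  # ambiguous input is passed on unchanged
    for left in lattice:
        for right in lattice:
            if left.end == right.start and left.label == "ART" and right.label == "NN":
                output.append(Edge(left.start, right.end, "NP",
                                   left.prob * right.prob * 0.9))
    return output

tokens = [Edge(0, 1, "ART", 0.99), Edge(1, 2, "NN", 0.95)]
print(run_cascade(tokens, [toy_np_layer]))   # adds an NP edge spanning both tokens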

After presenting the methods, we investigate two applications of Cascaded Markov Models: creation of resources in corpus annotation and partial parsing as pre-processing for other applications.

During corpus annotation, an instance of the model and a human annotator interact. Cascaded Markov Models create the syntactic structure of a sentence layer by layer, so that the human annotator can follow the process and correct the automatic output where necessary. The result is very efficient corpus annotation. Additionally, we exploit a feature that is particular to probabilistic models: alternative assignments and their probabilities carry important information about the reliability of automatic annotations. Unreliable assignments can be identified automatically and may trigger additional actions in order to achieve high accuracy.
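One simple way to operationalize this reliability criterion is sketched below: the probability of the best assignment is compared with that of its strongest competitor, and decisions with a small margin are flagged for the human annotator. The threshold value is an assumption chosen for illustration; suitable values depend on the task and are determined empirically.

def unreliable(best_prob, second_prob, threshold=100.0):
    """Flag a decision whose best analysis does not clearly beat the runner-up."""
    if second_prob == 0.0:
        return False                    # no competing analysis at all
    return best_prob / second_prob < threshold

# Two highest-scoring analyses for a single decision (toy numbers):
print(unreliable(3.2e-12, 2.9e-12))     # True  -> present to the annotator
print(unreliable(3.2e-12, 1.0e-16))     # False -> accept automatically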

The second application uses Cascaded Markov Models without human supervision. The possibly ambiguous output of a lower layer is passed directly to the next layer. This type of processing is well suited for partial parsing (chunking), e.g., the recognition of noun phrases, prepositional phrases, and their constituents. Partial parsing delivers less information than deep parsing, but with much higher accuracy and speed. Both are important features for processing large corpora and for use in applications like message extraction and information retrieval.

We evaluate the proposed methods on German and English corpora representing the domains of newspaper text and transliterated spoken dialogue. In addition to standard measures like accuracy, precision, and recall, we present learning curves obtained with different amounts of training data and take selected alternative assignments into account. For part-of-speech tagging and chunking of German and English corpora, our results (96.3% - 97.7% accuracy for tagging; 85% - 91% recall and 88% - 94% precision for chunking) are on a par with state-of-the-art results found in the literature. For the tasks of assigning grammatical functions and phrase labels and for the interactive annotation task, our results are the first to be published.

As their first practical application, the presented methods enabled the efficient annotation of the NEGRA corpus. They are now being used successfully for the annotation of several other corpora in different languages and domains, employing different annotation schemes.

Contribution of this Thesis

This thesis investigates tasks at different levels of syntactic natural language processing. These tasks are performed during text corpus annotation, where both a high degree of automation and elaborate interaction between the automatic process and a human annotator are required for the efficient generation of accurate language resources. The contribution of this thesis consists of:

The first statistical approach to the assignment of a general set of grammatical functions. These include functions like subject, direct object, head, modifier, pre- and postnominal genitive, ... Tagging grammatical functions can be seen as part-of-speech tagging at a higher level. This approach uses Markov Models to represent phrases and functions.

The assignment of phrase categories to a given structure. This is done by an extension of the previous method. Part-of-speech tagging, assignment of grammatical functions, and assignment of phrase categories together form a complete approach to the labeling problem of a syntactic structure.

The systematic use of alternative assignments. Their probabilities provide a measure to detect unreliable annotations and may trigger additional processing steps.

Cascaded Markov Models. These recognize hierarchical structures by means of Markov Models. Each layer of the resulting structure is represented by a separate Markov Model. The output of a lower layer, consisting of phrase hypotheses and their probabilities, is passed as input to the next higher layer.

Two applications of the presented methods: interactive corpus annotation, which is a new technique for the efficient creation of corpus resources, and partial parsing.

The methods are empirically tested using corpora of different languages (German and English) and different domains (newspaper text and transliterated dialogues).

Outline

Chapter 2 introduces definitions for Markov Models and context-free grammars that will be used throughout this thesis, together with the corresponding basic algorithms.

Chapter 3 gives an overview of work that is related to the investigations of this thesis. It covers the areas of statistical part-of-speech tagging, tagging grammatical functions, stochastic context-free parsing and extensions thereof, as well as work on corpus annotation.

Chapter 4 introduces new techniques based on Markov Models that contribute to the automation of corpus annotation. These techniques are: the assignment of grammatical functions, the assignment of phrase categories, and the assignment of partial structures. The processes explore several selected hypotheses in parallel in order to estimate the reliability of the top-ranked analysis and to provide alternative assignments.

Chapter 5 presents two applications of Cascaded Markov Models: corpus annotation and partial parsing. Parsing as presented here is an extension of part-of-speech tagging; or, seen from the model's perspective, part-of-speech tagging is a special case of parsing with Cascaded Markov Models.

Chapter 6 discusses evaluation methods and presents the metrics that are used in this dissertation.

Chapter 7 reports on the evaluation of the proposed components of a partial parsing system. The corpora used cover the languages German and English and the domains of written and spoken language. We present results for tagging, for assigning grammatical functions, for assigning phrase categories, and for applying Cascaded Markov Models. The final step of Cascaded Markov Models is evaluated in both the interactive annotation mode and the partial parsing mode.

Chapter 8 gives conclusions and indicates open questions and future directions.



Thorsten Brants, April 1999