Thorsten Brants - publications in chronological order


Matthew Crocker and Thorsten Brants, 2000. Wide-Coverage Probabilistic Sentence Processing. Journal of Psycholinguistic Research 29(6):647-669, November 2000.

Abstract bibtex
This paper describes a fully implemented, broad coverage model of human syntactic processing. The model uses probabilistic parsing techniques which combine phrase structure, lexical category, and limited subcategory probabilities with an incremental, left-to-right parsing strategy, enabling the system to achieve good accuracy on typical, "garden variety" language (i.e. when tested on corpora). Furthermore, the incremental probabilistic ranking of the preferred analyses during parsing also naturally explains observed human behaviour for a range of garden-path structures. We do not make strong psychological claims about the specific probabilistic mechanism discussed here, which is limited by a number of practical considerations. Rather, we argue that incremental probabilistic parsing models are, in general, extremely well suited to explaining this dual nature - generally good and occasionally pathological - of human linguistic performance.


Thorsten Brants and Matthew Crocker, 2000. Probabilistic Parsing and Psychological Plausibility. In Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken/Luxembourg/Nancy.

Abstract Postscript / PDF / bibtex
Given the recent evidence for probabilistic mechanisms in models of human ambiguity resolution, this paper investigates the plausibility of exploiting current wide-coverage, probabilistic parsing techniques to model human linguistic performance. In particular, we investigate the performance of standard stochastic parsers when they are revised to operate incrementally, and with reduced memory resources. We present techniques for ranking and filtering analyses, together with experimental results. Our results confirm that stochastic parsers which adhere to these psychologically motivated constraints achieve good performance. Memory can be reduced to 1% (compared to exhaustive search) without reducing recall and precision. Additionally, these models exhibit substantially faster performance. Finally, we argue that this general result is likely to hold for more sophisticated, and psycholinguistically plausible, probabilistic parsing models.


Thorsten Brants, 2000. Inter-Annotator Agreement for a German Newspaper Corpus. In Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece.

Abstract Postscript / PDF / bibtex
This paper presents the results of an investigation of inter-annotator agreement for the NEGRA corpus, consisting of German newspaper texts. The corpus is syntactically annotated with part-of-speech and structural information. Agreement for part-of-speech is 98.6%; the labeled F-score for structures is 92.4%. The two annotations are used to create a common final version by discussing differences and by several iterations of cleaning. Initial and final versions are compared. We identify categories causing large numbers of differences and categories that are handled inconsistently.


Thorsten Brants and Oliver Plaehn, 2000. Interactive Corpus Annotation. In Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece.

Abstract Postscript / PDF / bibtex
We present an easy-to-use graphical tool for syntactic corpus annotation. This tool, Annotate, interacts with a part-of-speech tagger and a parser running in the background. The parser incrementally suggests single phrases bottom-up based on cascaded Markov models. A human annotator confirms or rejects the parser's suggestions. This semi-automatic process facilitates a very rapid and efficient annotation.


Thorsten Brants, 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, Seattle, WA.

Abstract Postscript / PDF / bibtex
Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech tagger. Contrary to claims found elsewhere in the literature, we argue that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework. A recent comparison has even shown that TnT performs significantly better for the tested corpora. We describe the basic model of TnT, the techniques used for smoothing and for handling unknown words. Furthermore, we present evaluations on two corpora.
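The smoothing mentioned in the abstract is linear interpolation of unigram, bigram, and trigram tag probabilities, with the weights set by deleted interpolation. As a rough illustration only - a toy re-implementation on invented data, not the TnT code itself - the weight estimation can be sketched as:

```python
from collections import Counter

def deleted_interpolation(tag_seqs):
    """Estimate interpolation weights for unigram, bigram and trigram
    tag probabilities via deleted interpolation (toy sketch)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    n = 0
    for tags in tag_seqs:
        n += len(tags)
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), c in tri.items():
        # remove the current trigram from its own counts ("deleted")
        f3 = (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0
        f2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        f1 = (uni[t3] - 1) / (n - 1) if n > 1 else 0.0
        # credit the trigram count to the best-predicting estimate
        best = max((f3, 3), (f2, 2), (f1, 1))[1]
        if best == 3:
            l3 += c
        elif best == 2:
            l2 += c
        else:
            l1 += c
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total
```

The resulting weights would then be used in the smoothed estimate P(t3 | t1, t2) = l1*P(t3) + l2*P(t3 | t2) + l3*P(t3 | t1, t2).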


Thorsten Brants, 1999. Tagging and Parsing with Cascaded Markov Models - Automation of Corpus Annotation. Saarbrücken Dissertations in Computational Linguistics and Language Technology, Volume 6. German Research Center for Artificial Intelligence and Saarland University, Saarbrücken, Germany.

Abstract more info / dissertation series / bibtex
This thesis presents new techniques for parsing natural language. They are based on Markov Models, which are commonly used in part-of-speech tagging for sequential processing on the word level. We show that Markov Models can be successfully applied to other levels of syntactic processing. First, two classification tasks are handled: the assignment of grammatical functions and the labeling of non-terminal nodes. Then, Markov Models are used to recognize hierarchical syntactic structures. Each layer of a structure is represented by a separate Markov Model. The output of a lower layer is passed as input to a higher layer, hence the name: Cascaded Markov Models. Instead of simple symbols, the states emit partial context-free structures. The new techniques are applied to corpus annotation and partial parsing and are evaluated using corpora of different languages and domains.


Matthew Crocker and Thorsten Brants, 1999. Incremental probabilistic models of human linguistic performance. The 5th Conference on Architectures and Mechanisms for Language Processing. Edinburgh, U.K.

Abstract HTML / bibtex
Models of human language processing increasingly advocate probabilistic mechanisms for parsing and disambiguation. These models resolve local syntactic and lexical ambiguity by promoting the analysis which has the greatest probability of being correct. In this talk we will outline a new probabilistic parsing model which is a generalisation of the Hidden Markov Models which have previously been defended as psychological models of lexical category disambiguation. The model uses layered, or cascaded, Markov models (CMMs) to build up a syntactic analysis.


Thorsten Brants, Wojciech Skut, and Hans Uszkoreit, 1999. Syntactic Annotation of a German Newspaper Corpus. In Proceedings of the ATALA Treebank Workshop. Paris, France.

Abstract Postscript / PDF / bibtex
We report on the syntactic annotation of a German newspaper corpus. The annotation consists of context-free structures, additionally allowing crossing branches, with labeled nodes (phrases) and edges (grammatical functions). Furthermore, we present a new, interactive semi-automatic annotation process that allows efficient and reliable annotation.


Hans Uszkoreit, Thorsten Brants, and Brigitte Krenn (eds.), 1999. Proceedings of the Workshop on Linguistically Interpreted Corpora (LINC-99). Bergen, Norway.

more info / bibtex


Thorsten Brants, 1999. Cascaded Markov Models. In Proceedings of 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99). Bergen, Norway.

Abstract Postscript / PDF / bibtex
This paper presents a new approach to partial parsing of context-free structures. The approach is based on Markov Models. Each layer of the resulting structure is represented by its own Markov Model, and output of a lower layer is passed as input to the next higher layer. An empirical evaluation of the method yields very good results for NP/PP chunking of German newspaper texts.
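The layering can be illustrated with a deliberately simplified sketch: each layer is an ordinary Markov model decoded with Viterbi, and the label sequence produced by one layer becomes the observation sequence of the next. All probability tables below are invented toy values, and real Cascaded Markov Models emit partial context-free structures rather than bare labels:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding for a single Markov model layer."""
    def lp(x):                      # safe log; probability 0 -> effectively impossible
        return math.log(x) if x > 0 else -1e9
    prev = {s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(obs[0], 0)), [s])
            for s in states}
    for o in obs[1:]:
        cur = {}
        for s in states:
            score, path = max(
                (prev[p][0] + lp(trans_p[p].get(s, 0)) + lp(emit_p[s].get(o, 0)),
                 prev[p][1])
                for p in states)
            cur[s] = (score, path + [s])
        prev = cur
    return max(prev.values())[1]

# layer 1: words -> part-of-speech tags (toy numbers)
pos = viterbi(
    ["the", "cat", "sleeps"],
    ["DT", "NN", "VB"],
    {"DT": 0.8, "NN": 0.1, "VB": 0.1},
    {"DT": {"NN": 1.0}, "NN": {"VB": 1.0}, "VB": {"DT": 1.0}},
    {"DT": {"the": 1.0}, "NN": {"cat": 1.0}, "VB": {"sleeps": 1.0}},
)

# layer 2: the first layer's output becomes this layer's input
chunks = viterbi(
    pos,
    ["NP", "VP"],
    {"NP": 0.9, "VP": 0.1},
    {"NP": {"NP": 0.5, "VP": 0.5}, "VP": {"NP": 1.0}},
    {"NP": {"DT": 0.5, "NN": 0.5}, "VP": {"VB": 1.0}},
)
```

Here the word sequence is first tagged as DT NN VB, and that tag sequence is then chunked as NP NP VP by the next layer; cascading more layers in the same fashion yields deeper partial structures.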


Hans Uszkoreit, Thorsten Brants, Denys Duchier, Brigitte Krenn, Lars Konieczny, Stephan Oepen, and Wojciech Skut, 1998. Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen. Kognitionswissenschaft 7(3).

A longer version of this paper appeared as CLAUS Report #99 (see below). bibtex


Hans Uszkoreit, Thorsten Brants, Denys Duchier, Brigitte Krenn, Lars Konieczny, Stephan Oepen, and Wojciech Skut, 1998. Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen. CLAUS Report #99, Saarland University, Computational Linguistics, Saarbrücken.

Abstract Postscript / pdf / bibtex
Using relative clause extraposition in German as an example, this paper shows how methods of linguistic modelling, corpus-linguistic investigation, and psycholinguistic experimentation interact in an integrative research approach, one aimed at a better understanding and a linguistically and cognitively adequate modelling of problems of linguistic performance. Starting from the word-order theory formulated by Hawkins (1994), hypotheses about the positional distribution of relative clauses are formulated and tested against corpus data and acceptability measurements. All of the empirical studies described confirm the expected influence of length factors on relative clause distribution, but at the same time reveal an interesting asymmetry between production and reception data.


Brigitte Krenn, Thorsten Brants, Wojciech Skut, Hans Uszkoreit, 1998. A Linguistically Interpreted Corpus of German Newspaper Text. In Proceedings of the ESSLLI Workshop on Recent Advances in Corpus Annotation. Saarbrücken, Germany.

Abstract Postscript / PDF / bibtex
In this paper, we report on the development of an annotation scheme and annotation tools for unrestricted German text. Our representation format is based on argument structure, but also permits the extraction of other kinds of representations. We discuss several methodological issues and the analysis of some phenomena. Additional focus is on the tools developed in our project and their applications.


Wojciech Skut and Thorsten Brants, 1998. Chunk Tagger - Statistical Recognition of Noun Phrases. In Proceedings of the ESSLLI Workshop on Automated Acquisition of Syntax and Parsing. Saarbrücken, Germany.

Abstract Postscript / PDF / bibtex
We describe a stochastic approach to partial parsing, i.e., the recognition of syntactic structures of limited depth. The technique utilises Markov Models, but goes beyond usual bracketing approaches, since it is capable of recognising not only the boundaries, but also the internal structure and syntactic category of simple as well as complex NPs, PPs, APs and adverbials. We compare tagging accuracy for different applications and encoding schemes.


Wojciech Skut and Thorsten Brants, 1998. A Maximum-Entropy Partial Parser for Unrestricted Text. In Proceedings of the Sixth Workshop on Very Large Corpora. Montreal, Canada.

Abstract Postscript / PDF / bibtex
This paper describes a partial parser that assigns syntactic structures to sequences of part-of-speech tags. The program uses the maximum entropy parameter estimation method, which allows a flexible combination of different knowledge sources: the hierarchical structure, parts of speech and phrasal categories. In effect, the parser goes beyond simple bracketing and recognises even fairly complex structures. We give accuracy figures for different applications of the parser.


Thorsten Brants, 1998. Estimating HMM Topologies. In Jonathan Ginzburg, Zurab Khasidashvili, Carl Vogel, Jean-Jacques Lévy, Enric Vallduví (eds.), The Tbilisi Symposium on Logic, Language and Computation: Selected Papers. CSLI Publications, Stanford, California.

Abstract Postscript / PDF / bibtex
There are several ways of estimating parameters for HMMs used as natural language models. One can use word n-grams and n-grams of automatically derived categories for speech recognition, or part-of-speech n-grams for part-of-speech tagging, estimated either from a manually tagged corpus or with the Baum-Welch algorithm. This paper shows how to use another method for parameter estimation: Model Merging. It exploits the advantages of the other methods, is applicable both to speech recognition and to part-of-speech tagging and, unlike other techniques, induces not only transition and output probabilities but also the model topology, i.e. the number of states and their respective possible outputs. Thus it automatically generates categories but, unlike other categorization algorithms, is capable of recognizing whether a word belongs to more than one category. With added optimizations, the algorithm is used to generate language models for part-of-speech tagging. Their accuracy in a tagging task is better than that of HMMs derived by standard techniques.


Thorsten Brants and Wojciech Skut, 1998. Automation of Treebank Annotation. In Proceedings of New Methods in Language Processing (NeMLaP-98). Sydney, Australia.

Abstract Postscript / PDF / bibtex
This paper describes applications of stochastic and symbolic NLP methods to treebank annotation. In particular we focus on (1) the automation of treebank annotation, (2) the comparison of conflicting annotations for the same sentence and (3) the automatic detection of inconsistencies. These techniques are currently employed for building a German treebank.


Thorsten Brants, Roland Hendriks, Sabine Kramp, Brigitte Krenn, Cordula Preis, Wojciech Skut, and Hans Uszkoreit, 1997. Das NEGRA-Annotationsschema. NEGRA Project Report, Saarland University, Computational Linguistics. Saarbrücken, Germany.

Abstract info about the corpus / bibtex
This annotation scheme was developed during the construction of the NEGRA corpus. After three years of work (of which building the corpus was only one aspect of the project), 20,000 annotated sentences (approx. 350,000 tokens) are available, together with this repeatedly revised version of the scheme.


Wojciech Skut, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit, 1997. Annotating Unrestricted German Text. In Fachtagung der Sektion Computerlinguistik der Deutschen Gesellschaft für Sprachwissenschaft. Heidelberg, Germany.

Abstract Postscript / PDF / bibtex
This paper discusses the development of an annotation scheme for unrestricted German text. We argue for a uniform representation format based on argument structure but allowing us to recover other kinds of representations. We also discuss several methodological issues and the analysis of some phenomena. The presented annotation format has been successfully tested in corpus annotation.


Thorsten Brants, 1997. Internal and External Tagsets in Part-of-Speech Tagging. In Proceedings of Eurospeech. Rhodes, Greece.

Abstract Postscript / PDF / bibtex
We present an approach to statistical part-of-speech tagging that uses two different tagsets, one for its internal and one for its external representation. The internal tagset is used in the underlying Markov model, while the external tagset constitutes the output of the tagger. The internal tagset can be modified and optimized to increase tagging accuracy (with respect to the external tagset). We evaluate this approach in an experiment and show that it performs significantly better than approaches using only one tagset.


Thorsten Brants, Wojciech Skut, and Brigitte Krenn, 1997. Tagging Grammatical Functions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-97). Providence, RI, USA.

Abstract Postscript / PDF / bibtex
This paper addresses issues in automated treebank construction. We show how standard part-of-speech tagging techniques extend to the more general problem of structural annotation, especially for determining grammatical functions and syntactic categories. Annotation is viewed as an interactive process where manual and automatic processing alternate. Efficiency and accuracy results are presented. We also discuss further automation steps.


Thorsten Brants, 1997. The NeGra Export Format. CLAUS Report #98. Saarland University, Computational Linguistics, Saarbrücken.

Abstract Postscript / PDF / bibtex
This paper describes the export format version 3 of corpora used in the NeGra project. We use a line-oriented and ASCII-based format that is both easy to read by humans and easy to parse by machines. It is intended for data exchange and for efficient processing with standard Unix tools and C programs.


Wojciech Skut and Brigitte Krenn and Thorsten Brants and Hans Uszkoreit, 1997. An Annotation Scheme for Free Word Order Languages. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97). Washington, DC, USA.

Abstract Postscript / PDF / bibtex
We describe an annotation scheme and a tool developed for creating linguistically annotated corpora for non-configurational languages. Since the requirements for such a formalism differ from those posited for configurational languages, several features have been added, influencing the architecture of the scheme. The resulting scheme reflects a stratificational notion of language, and makes only minimal assumptions about the interrelation of the particular representational strata.


Thorsten Brants, 1996. Estimating Markov Model Structures. In Proceedings of the Fourth Conference on Spoken Language Processing (ICSLP-96). Philadelphia, PA, USA.

Abstract Postscript / PDF / bibtex
We investigate the derivation of Markov model structures from text corpora. The structure of a Markov model is its number of states plus the set of outputs and transitions with non-zero probability. The domain of the investigated models is part-of-speech tagging.
Our investigations concern two methods to derive Markov models and their structures. Both are able to form categories and allow words to belong to more than one of them. The first method is model merging, which starts with a large and corpus-specific model and successively merges states to generate smaller and more general models. The second method is model splitting, which is the inverse procedure and starts with a small and general model. States are successively split to generate larger and more specific models.
In an experiment, we show that the combination of these techniques yields tagging accuracies that are at least equivalent to those of standard approaches.


Thorsten Brants, 1996. TnT - A Statistical Part-of-Speech Tagger. Technical Report, Saarland University, Computational Linguistics. Saarbrücken, Germany.

Abstract Postscript / PDF / bibtex
TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words.
TnT is not optimized for a particular language. Instead, it is optimized for training on a large variety of corpora. Adapting the tagger to a new language, new domain, or new tagset is very easy. Additionally, TnT is optimized for speed.


Thorsten Brants, 1996. Better Language Models with Model Merging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-96). Philadelphia, PA, USA.

Abstract Postscript / PDF / bibtex
This paper investigates model merging, a technique for deriving Markov models from text or speech corpora. Models are derived by starting with a large and specific model and by successively combining states to build smaller and more general models. We present methods to reduce the time complexity of the algorithm and report on experiments on deriving language models for a speech recognition task. The experiments show the advantage of model merging over the standard bigram approach. The merged model assigns a lower perplexity to the test set and uses considerably fewer states.
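The intuition behind merging can be checked on a toy corpus: starting from a class-based bigram model with one class (state) per word, merging two words that occur in interchangeable contexts shrinks the model without changing the likelihood of the data. This is only an illustrative sketch with made-up data, not the paper's merging algorithm, which also has to choose which states to merge and to bound the likelihood loss:

```python
import math
from collections import Counter

def bigram_logprob(corpus, classes):
    """Log-likelihood of `corpus` under a class-based bigram model:
    P(w_i | w_{i-1}) = P(class_i | class_{i-1}) * P(w_i | class_i),
    with maximum-likelihood estimates. `classes` maps word -> class."""
    seq = [classes[w] for w in corpus]
    cls_bi = Counter(zip(seq, seq[1:]))
    cls_prev = Counter(seq[:-1])          # transition denominators
    memb = Counter(zip(seq, corpus))      # (class, word) emission counts
    cls_all = Counter(seq)                # emission denominators
    lp = 0.0
    for i in range(1, len(corpus)):
        p_trans = cls_bi[(seq[i - 1], seq[i])] / cls_prev[seq[i - 1]]
        p_emit = memb[(seq[i], corpus[i])] / cls_all[seq[i]]
        lp += math.log(p_trans * p_emit)
    return lp

corpus = ["the", "cat", "sleeps", "the", "dog", "sleeps"]
identity = {w: w for w in corpus}              # one state per word
merged = dict(identity, cat="N", dog="N")      # merge two states into one
lp_before = bigram_logprob(corpus, identity)
lp_after = bigram_logprob(corpus, merged)
```

Because "cat" and "dog" are distributionally identical in the toy data, the merge costs nothing in likelihood while reducing the number of states; on real corpora, model merging trades a small likelihood loss for a much smaller model.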


Thorsten Brants, 1995. Some Experiments with the CRATER Corpus. Report for the CRATER Project.

Abstract Postscript / PDF / bibtex
This paper reports statistical information about the CRATER corpus and experiments performed with it. The corpus is compiled, part-of-speech annotated and manually edited by the Corpus Resources and Terminology Extraction project. The experiments concern statistical part-of-speech tagging and statistically motivated tagset modification.


Thorsten Brants, 1995. Estimating HMM Topologies. In Proceedings of the Tbilisi Symposium on Language, Logic, and Computation. Tbilisi, Georgia.

Abstract Postscript / PDF / bibtex
There are several ways of estimating parameters for HMMs used as natural language models. One can use word n-grams and n-grams of automatically derived categories for speech recognition, or part-of-speech n-grams for part-of-speech tagging, estimated either from a manually tagged corpus or with the Baum-Welch algorithm. This paper shows how to use another method for parameter estimation: Model Merging. It exploits the advantages of the other methods, is applicable both to speech recognition and to part-of-speech tagging and, unlike other techniques, induces not only transition and output probabilities but also the model topology, i.e. the number of states and their respective possible outputs. Thus it automatically generates categories but, unlike other categorization algorithms, is capable of recognizing whether a word belongs to more than one category. With added optimizations, the algorithm is used to generate language models for part-of-speech tagging. Their accuracy in a tagging task is better than that of HMMs derived by standard techniques.


Thorsten Brants and Christer Samuelsson, 1995. Tagging the Teleman Corpus. In Proceedings of the 10th Nordic Conference of Computational Linguistics (NODALIDA-95). Helsinki, Finland.

Abstract Postscript / PDF / bibtex
Experiments were carried out comparing the Swedish Teleman and the English Susanne corpora using an HMM-based and a novel reductionistic statistical part-of-speech tagger. They indicate that tagging the Teleman corpus is the more difficult task, and that the performance of the two different taggers is comparable.

Also available as CLAUS Report #54


Thorsten Brants, 1995. Tagset Reduction Without Information Loss. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95). Cambridge, MA, USA.

Abstract Postscript / PDF / bibtex
A technique for reducing a tagset used for n-gram part-of-speech disambiguation is introduced and evaluated in an experiment. The technique ensures that all information that is provided by the original tagset can be restored from the reduced one. This is crucial, since we are interested in the linguistically motivated tags for part-of-speech disambiguation. The reduced tagset needs fewer parameters for its statistical model and allows more accurate parameter estimation. Additionally, there is a slight but not significant improvement of tagging accuracy.
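The losslessness requirement can be made concrete: two original tags may share one reduced tag as long as no word is ambiguous between them, because then the pair (word, reduced tag) still determines the original tag. A small illustrative sketch with an invented lexicon and tags - not the paper's actual reduction procedure:

```python
def can_merge(lexicon, t1, t2):
    """Merging t1 and t2 is lossless iff no word takes both tags."""
    return all(not ({t1, t2} <= tags) for tags in lexicon.values())

def restore(word, reduced_tag, groups, lexicon):
    """Recover the original tag from a (word, reduced tag) pair.
    `groups` maps each reduced tag to the set of original tags it covers."""
    candidates = [t for t in groups[reduced_tag] if t in lexicon[word]]
    assert len(candidates) == 1, "reduction was not information-preserving"
    return candidates[0]

# toy lexicon: word -> set of possible original tags
lexicon = {"red": {"JJ"}, "dog": {"NN"}, "runs": {"VBZ", "NNS"}}
```

In this lexicon JJ and NN could share one reduced tag (no word takes both), whereas VBZ and NNS could not, since "runs" is ambiguous between them and the merge would lose information.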


Bernhard Kipper, Thorsten Brants, Marcus Plach, and Ralph Schäfer, 1995. Bayessche Netze: Ein einführendes Beispiel. Graduiertenkolleg Kognitionswissenschaft, Bericht Nr. 4. Saarbrücken, Germany.

Abstract Postscript / PDF / bibtex
Bayesian networks are a widely noted formalism for representing and processing uncertain knowledge. While several introductory works on the formalism of Bayesian networks exist, what these introductions lack is an illustration of the mechanisms used within Bayesian networks by means of concrete (numerical) examples. The present work is intended to close exactly this gap: the basic structure of Bayesian networks is explained by modelling an example scenario. In the resulting example network, the probabilistic methods applied in Bayesian networks are furthermore worked through with concrete numerical values.


Thorsten Brants, 1994. Parameteroptimierung für ein statistisches Sprachmodell. In Proceedings der 1. Fachtagung der Gesellschaft für Kognitionswissenschaft. Freiburg, Germany.

Abstract bibtex
This work deals with determining and optimizing the parameters of Hidden Markov Models (HMMs) when they are used as language models. The parameters are estimated from text corpora. Not only are the probabilities of state transitions and outputs to be derived from the corpus, as is commonly done, but also the number of states and the associated formation of word categories. The goal is to combine linguistic and statistical requirements effectively.


Last changed Tue Jan 17 10:48:05 CET 2000, Thorsten Brants