Thorsten Brants - publications in reverse chronological order
Matthew Crocker and Thorsten Brants, 2000. Wide-Coverage
Probabilistic Sentence Processing. Journal of Psycholinguistic
Research 29(6):647-669, November 2000.
This paper describes a fully implemented, broad coverage model of
human syntactic processing. The model uses probabilistic parsing
techniques which combine phrase structure, lexical category, and
limited subcategory probabilities with an incremental, left-to-right
the system to achieve good accuracy on typical, "garden variety"
language (i.e. when tested on corpora). Furthermore, the incremental
probabilistic ranking of the preferred analyses during parsing also
naturally explains observed human behaviour for a range of garden-path
structures. We do not make strong psychological claims about the
specific probabilistic mechanism discussed here, which is limited by a
number of practical considerations. Rather, we argue incremental
probabilistic parsing models are, in general, extremely well suited to
explaining this dual nature - generally good and occasionally
pathological - of human linguistic performance.
Thorsten Brants and Matthew Crocker, 2000. Probabilistic Parsing and
Psychological Plausibility. In Proceedings of the 18th International
Conference on Computational Linguistics, Saarbrücken/Luxembourg/Nancy.
Given the recent evidence for probabilistic mechanisms in models of
human ambiguity resolution, this paper investigates the plausibility
of exploiting current wide-coverage, probabilistic parsing techniques
to model human linguistic performance. In particular, we investigate
the performance of standard stochastic parsers when they are revised
to operate incrementally, and with reduced memory resources. We
present techniques for ranking and filtering analyses, together with
experimental results. Our results confirm that stochastic parsers
which adhere to these psychologically motivated constraints achieve
good performance. Memory can be reduced to 1% (compared to
exhaustive search) without reducing recall and precision.
Additionally, these models exhibit substantially faster performance.
Finally, we argue that this general result is likely to hold for more
sophisticated, and psycholinguistically plausible, probabilistic
parsing models.
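
The ranking and filtering step can be pictured with a small sketch. It
is not the authors' method: it assumes a simple beam filter in which
every partial analysis carries a probability and anything far below the
current best is dropped. The function name prune_analyses and the
parameters beam and max_kept are hypothetical.

```python
def prune_analyses(analyses, beam=0.01, max_kept=None):
    """Rank (analysis, probability) pairs and drop unlikely ones.

    Hypothetical beam filter: an analysis survives if its probability is
    at least `beam` times the probability of the best analysis. An
    optional `max_kept` caps how many analyses stay in memory.
    """
    if not analyses:
        return []
    # Rank analyses from most to least probable.
    ranked = sorted(analyses, key=lambda pair: pair[1], reverse=True)
    threshold = ranked[0][1] * beam
    kept = [pair for pair in ranked if pair[1] >= threshold]
    if max_kept is not None:
        kept = kept[:max_kept]
    return kept
```

Calling such a filter after each word keeps the parser incremental and
bounds the number of analyses held in memory at any point.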
Thorsten Brants, 2000. Inter-Annotator Agreement for a German Newspaper
Corpus. In Second International Conference on Language Resources and
Evaluation (LREC-2000), Athens, Greece.
This paper presents the results of an investigation of
inter-annotator agreement for the NEGRA corpus, consisting of German
newspaper texts. The corpus is syntactically annotated with
part-of-speech and structural information. Agreement for
part-of-speech is 98.6%; the labeled F-score for structures is
92.4%. The two annotations are used to create a common final
version by discussing differences and by several iterations of
cleaning. Initial and final versions are compared. We identify
categories causing large numbers of differences and categories that
are handled inconsistently.
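
The two agreement figures can be illustrated with a minimal sketch. It
assumes that each annotation is a list of tags plus a list of
(label, start, end) constituents. That span representation is a
simplification, since NEGRA structures may contain crossing branches,
and the paper's exact scoring procedure may differ. The function names
are hypothetical.

```python
from collections import Counter


def tag_agreement(tags_a, tags_b):
    """Fraction of tokens to which both annotators assigned the same tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    if not tags_a:
        return 1.0
    same = sum(a == b for a, b in zip(tags_a, tags_b))
    return same / len(tags_a)


def labeled_f_score(nodes_a, nodes_b):
    """Labeled F-score between two lists of (label, start, end) nodes.

    A node counts as matched only if label and span are identical in
    both annotations. Annotation A is treated as the reference.
    """
    count_a, count_b = Counter(nodes_a), Counter(nodes_b)
    matched = sum((count_a & count_b).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(count_b.values())
    recall = matched / sum(count_a.values())
    return 2 * precision * recall / (precision + recall)
```

Because the F-score is symmetric in precision and recall, swapping the
two annotators leaves the result unchanged.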
Thorsten Brants and Oliver Plaehn, 2000. Interactive Corpus Annotation. In Second International Conference on Language Resources and
Evaluation (LREC-2000), Athens, Greece.
We present an easy-to-use graphical tool for syntactic corpus annotation. This
tool, Annotate, interacts with a part-of-speech tagger and a parser running in
the background. The parser incrementally suggests single phrases bottom-up
based on cascaded Markov models. A human annotator confirms or rejects the
parser's suggestions. This semi-automatic process facilitates very rapid and
efficient annotation.
Thorsten Brants, 2000.
TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, Seattle, WA.
Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech
tagger. Contrary to claims found elsewhere in the literature, we
argue that a tagger based on Markov models performs at least as well
as other current approaches, including the Maximum Entropy
framework. A recent comparison has even shown that
TnT performs significantly better for the tested corpora. We
describe the basic model of TnT, the techniques used for
smoothing and for handling unknown words. Furthermore, we present
evaluations on two corpora.
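
As a rough illustration of smoothing in a trigram tagger, the sketch
below estimates a tag trigram probability by linear interpolation of
unigram, bigram, and trigram relative frequencies. The abstract does
not name the smoothing method, so the interpolation scheme, the fixed
weights in lambdas, and the function names are all assumptions made
for this example.

```python
from collections import Counter


def train_counts(tag_sequences):
    """Collect unigram, bigram, and trigram tag counts from tagged data."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            uni[tag] += 1
            if i >= 1:
                bi[(tags[i - 1], tag)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], tag)] += 1
    return uni, bi, tri


def interpolated_trigram(t1, t2, t3, counts, lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) as a weighted sum of three relative frequencies.

    The weights are fixed, hypothetical values that sum to one; a real
    tagger would estimate them from data.
    """
    uni, bi, tri = counts
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

The unigram term keeps the estimate above zero for any tag seen in
training, even when the bigram and trigram contexts are unseen.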
Thorsten Brants, 1999.
Tagging and Parsing with Cascaded Markov Models - Automation of Corpus Annotation.
Saarbrücken Dissertations in Computational Linguistics and Language Technology, Volume 6.
German Research Center for Artificial Intelligence and Saarland University, Saarbrücken, Germany.
This thesis presents new techniques for parsing natural language. They
are based on Markov Models, which are commonly used in part-of-speech
tagging for sequential processing on the word level. We show that
Markov Models can be successfully applied to other levels of syntactic
processing. First, two classification tasks are handled: the
assignment of grammatical functions and the labeling of non-terminal
nodes. Then, Markov Models are used to recognize hierarchical
syntactic structures. Each layer of a structure is represented by a
separate Markov Model. The output of a lower layer is passed as input
to a higher layer, hence the name: Cascaded Markov Models. Instead of
simple symbols, the states emit partial context-free structures. The
new techniques are applied to corpus annotation and partial parsing
and are evaluated using corpora of different languages and
domains.
Matthew Crocker and Thorsten Brants, 1999.
Incremental probabilistic models of human linguistic performance.
The 5th Conference on Architectures and Mechanisms for Language Processing.
Edinburgh, U.K.
Models of human language processing increasingly advocate
probabilistic mechanisms for parsing and disambiguation. These models
resolve local syntactic and lexical ambiguity by promoting the
analysis which has the greatest probability of being correct. In this
talk we will outline a new probabilistic parsing model which is a
generalisation of the Hidden Markov Models which have previously been
defended as psychological models of lexical category
disambiguation. The model uses layered, or cascaded, Markov models
(CMMs) to build up a syntactic analysis.
Thorsten Brants, Wojciech Skut, and Hans Uszkoreit, 1999.
Syntactic Annotation of a German Newspaper Corpus.
In Proceedings of the ATALA Treebank Workshop. Paris, France.
We report on the syntactic annotation of a German newspaper corpus.
The annotation consists of context-free structures, additionally
allowing crossing branches, with labeled nodes (phrases) and edges
(grammatical functions). Furthermore, we present a new, interactive
semi-automatic annotation process that allows efficient and reliable
annotations.
Hans Uszkoreit, Thorsten Brants, and Brigitte Krenn (eds.), 1999.
Proceedings of the Workshop on Linguistically Interpreted Corpora (LINC-99).
Bergen, Norway.
Thorsten Brants, 1999.
Cascaded Markov Models.
In Proceedings of 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99).
Bergen, Norway.
This paper presents a new approach to partial parsing
of context-free structures. The approach is based
on Markov Models. Each layer of the resulting structure is
represented by its own Markov Model, and output of a lower layer is
passed as input to the next higher layer. An empirical evaluation
of the method yields very good results for NP/PP chunking
of German newspaper texts.
Hans Uszkoreit, Thorsten Brants, Denys Duchier, Brigitte
Krenn, Lars Konieczny, Stephan Oepen, and Wojciech Skut, 1998.
Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen.
Kognitionswissenschaft 7(3).
A longer version of this paper appeared as CLAUS Report #99 (see below).
Hans Uszkoreit, Thorsten Brants, Denys Duchier, Brigitte
Krenn, Lars Konieczny, Stephan Oepen, and Wojciech Skut, 1998.
Studien zur performanzorientierten Linguistik. Aspekte der Relativsatzextraposition im Deutschen.
CLAUS Report #99, Saarland University, Computational Linguistics, Saarbrücken.
Using relative clause extraposition in German as an example, the paper
shows how methods of linguistic modelling, corpus-linguistic
investigation, and psycholinguistic experimentation work together in
an integrative research approach that aims at a better understanding
and a linguistically as well as cognitively adequate modelling of
problems of linguistic performance.
Starting from the theory of word order formulated by Hawkins (1994),
hypotheses about the positional distribution of relative clauses are
formulated and tested against corpus data and acceptability
measurements. All empirical investigations described confirm the
expected influence of length factors on the distribution of relative
clauses, but at the same time show an interesting asymmetry between
production and reception data.
Brigitte Krenn, Thorsten Brants, Wojciech Skut, Hans Uszkoreit, 1998.
A Linguistically Interpreted Corpus of German Newspaper Text.
In Proceedings of the ESSLLI Workshop on Recent Advances in Corpus Annotation.
Saarbrücken, Germany.
In this paper, we report on the development of an annotation scheme and
annotation tools for unrestricted German text. Our representation
format is based on argument structure, but also permits the extraction
of other kinds of representations. We discuss several methodological
issues and the analysis of some phenomena. Additional focus is on the
tools developed in our project and their applications.
Wojciech Skut and Thorsten Brants, 1998.
Chunk Tagger - Statistical Recognition of Noun Phrases.
In Proceedings of the ESSLLI Workshop on Automated Acquisition of Syntax and Parsing.
Saarbrücken, Germany.
We describe a stochastic approach to partial parsing,
i.e., the recognition of syntactic structures of limited depth. The
technique utilises Markov Models, but goes beyond usual
bracketing approaches, since it is capable of recognising not only the
boundaries, but also the internal structure and syntactic category of
simple as well as complex NPs, PPs, APs and adverbials. We compare
tagging accuracy for different applications and encoding schemes.
Wojciech Skut and Thorsten Brants, 1998.
A Maximum-Entropy Partial Parser for Unrestricted Text.
In Proceedings of the Sixth Workshop on Very Large Corpora.
Montreal, Canada.
This paper describes a partial parser that assigns syntactic structures
to sequences of part-of-speech tags. The program uses the maximum
entropy parameter estimation method, which allows a flexible
combination of different knowledge sources: the hierarchical structure,
parts of speech and phrasal categories. In effect, the parser goes beyond
simple bracketing and recognises even fairly complex structures. We
give accuracy figures for different applications of the parser.
Thorsten Brants, 1998.
Estimating HMM Topologies.
In Jonathan Ginzburg, Zurab Khasidashvili, Carl Vogel, Jean-Jacques Lévy, Enric Vallduví (eds.), The Tbilisi Symposium on Logic, Language and Computation: Selected Papers.
CSLI Publications, Stanford, California.
There are several ways of estimating parameters for HMMs when used
for natural language models. One can use word-n-grams and n-grams of
automatically derived categories for speech recognition. Or one can use
part-of-speech n-grams for part-of-speech tagging, either by using a
manually tagged corpus or by using the Baum-Welch algorithm. This paper
shows how to use another method for parameter estimation: Model
Merging. It exploits the advantages of the other methods, is applicable
both for speech recognition and part-of-speech tagging and, unlike other
techniques, it not only induces transition and output probabilities but
also the model topology, i.e. the number of states and their respective
possible outputs. Thus it automatically generates categories but,
unlike other categorization algorithms, can recognize that a word
belongs to more than one category. With added optimizations, the
algorithm is used to generate language models for part-of-speech
tagging. Their accuracy in a tagging task is better than the accuracy
of HMMs derived by standard techniques.
Thorsten Brants and Wojciech Skut, 1998.
Automation of Treebank Annotation.
In Proceedings of New Methods in Language Processing (NeMLaP-98).
Sydney, Australia.
This paper describes applications of stochastic and symbolic NLP
methods to treebank annotation. In particular we focus on (1) the
automation of treebank annotation, (2) the comparison of conflicting
annotations for the same sentence and (3) the automatic detection of
inconsistencies. These techniques are currently employed for building a
German treebank.
Thorsten Brants, Roland Hendriks, Sabine Kramp, Brigitte Krenn, Cordula Preis, Wojciech Skut, and Hans Uszkoreit, 1997.
Das NEGRA-Annotationsschema.
NEGRA Project Report, Saarland University, Computational Linguistics.
Saarbrücken, Germany.
This annotation scheme was developed during the construction of the
NEGRA corpus. After three years of work (the construction of the corpus
being only one aspect of the project), 20,000 annotated sentences
(approx. 350,000 tokens) are available, together with this repeatedly
revised version of the scheme.
Wojciech Skut, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit, 1997.
Annotating Unrestricted German Text.
In Fachtagung der Sektion Computerlinguistik der Deutschen Gesellschaft für Sprachwissenschaft.
Heidelberg, Germany.
This paper discusses the development of an annotation scheme for
unrestricted German text. We argue for a uniform representation format
based on argument structure but allowing us to recover other kinds of
representations. We also discuss several methodological issues and the
analysis of some phenomena.
The presented annotation format
has been successfully tested in corpus annotation.
Thorsten Brants, 1997.
Internal and External Tagsets in Part-of-Speech Tagging.
In Proceedings of Eurospeech.
Rhodes, Greece.
We present an approach to statistical part-of-speech tagging that uses
two different tagsets, one for its internal and one for its external
representation. The internal tagset is used in the underlying Markov
model, while the external tagset constitutes the output of the tagger.
The internal tagset can be modified and optimized to increase tagging
accuracy (with respect to the external tagset). We evaluate this
approach in an experiment and show that it performs significantly better
than approaches using only one tagset.
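
The division of labour between the two tagsets can be sketched as
follows. The code is an illustration, not the paper's system: a plain
first-order Viterbi decoder runs over a hypothetical internal tagset,
and a mapping table converts the result to the external tagset. The
tag names and the probability tables passed in are invented by the
caller; trans_p and emit_p are assumed to hold an entry for every
internal tag.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Most probable state sequence of a first-order Markov model."""
    if not words:
        return []
    # Each column maps state -> (best path probability, previous state).
    first = {s: (start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0), None)
             for s in states}
    columns = [first]
    for word in words[1:]:
        last = columns[-1]
        column = {}
        for s in states:
            prob, prev = max(
                ((last[p][0] * trans_p[p].get(s, 0.0), p) for p in states),
                key=lambda pair: pair[0],
            )
            column[s] = (prob * emit_p[s].get(word, 0.0), prev)
        columns.append(column)
    # Follow the back pointers from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


def tag_external(words, states, start_p, trans_p, emit_p, to_external):
    """Decode with internal tags, then report external tags only."""
    internal = viterbi(words, states, start_p, trans_p, emit_p)
    return [to_external[tag] for tag in internal]
```

In this sketch the internal tagset may split an external tag into
finer ones, for example finite and infinitive verbs, which gives the
Markov model more context while the user still sees one verb tag.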
Thorsten Brants, Wojciech Skut, and Brigitte Krenn, 1997.
Tagging Grammatical Functions.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-97).
Providence, RI, USA.
This paper addresses issues in automated treebank construction. We show
how standard part-of-speech tagging techniques extend to the more
general problem of structural annotation, especially for determining
grammatical functions and syntactic categories. Annotation is viewed as
an interactive process where manual and automatic processing alternate.
Efficiency and accuracy results are presented. We also discuss further
automation steps.
Thorsten Brants, 1997.
The NeGra Export Format.
CLAUS Report #98. Saarland University, Computational Linguistics, Saarbrücken.
This paper describes the export format version 3 of corpora used in the
NeGra project. We use a line-oriented and ASCII-based format that is
both easy to read by humans and easy to parse by machines. It is
intended for data exchange and for efficient processing with standard
Unix tools and C programs.
Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit, 1997.
An Annotation Scheme for Free Word Order Languages.
In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97).
Washington, DC, USA.
We describe an annotation scheme and a tool developed for creating
linguistically annotated corpora for non-configurational languages.
Since the requirements for such a formalism differ from those posited
for configurational languages, several features have been added,
influencing the architecture of the scheme. The resulting scheme
reflects a stratificational notion of language, and makes only minimal
assumptions about the interrelation of the particular representational
strata.
Thorsten Brants, 1996.
Estimating Markov Model Structures.
In Proceedings of the Fourth Conference on Spoken Language Processing (ICSLP-96).
Philadelphia, PA, USA.
We investigate the derivation of Markov model structures from text
corpora. The structure of a Markov model is its number of states plus
the set of outputs and transitions with non-zero probability. The domain
of the investigated models is part-of-speech tagging.
Our investigations concern two methods to derive Markov models and
their structures. Both are able to form categories and allow words to
belong to more than one of them. The first method is model
merging, which starts with a large and corpus-specific model and
successively merges states to generate smaller and more general models.
The second method is model splitting, which is the inverse
procedure and starts with a small and general model. States are
successively split to generate larger and more specific models.
In an experiment, we show that the combination of these techniques
yields tagging accuracies that are at least equivalent to those of
standard approaches.
Thorsten Brants, 1996.
TnT - A Statistical Part-of-Speech Tagger.
Technical Report, Saarland University, Computational Linguistics. Saarbrücken, Germany.
TnT, the short form of Trigrams'n'Tags, is a very efficient
statistical part-of-speech tagger that is trainable on different
languages and virtually any tagset. The component for parameter
generation trains on tagged corpora. The system incorporates several
methods of smoothing and of handling unknown words.
TnT is not optimized for a particular language. Instead, it is
optimized for training on a large variety of corpora. Adapting the
tagger to a new language, new domain, or new tagset is very
easy. Additionally, TnT is optimized for speed.
Thorsten Brants, 1996.
Better Language Models with Model Merging.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-96).
Philadelphia, PA, USA.
This paper investigates model merging, a technique for deriving
Markov models from text or speech corpora. Models are derived by
starting with a large and specific model and by successively combining
states to build smaller and more general models. We present methods to
reduce the time complexity of the algorithm and report on experiments on
deriving language models for a speech recognition task. The experiments
show the advantage of model merging over the standard bigram approach.
The merged model assigns a lower perplexity to the test set and uses
considerably fewer states.
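
The idea of starting with a specific model and merging states can be
sketched on a simplified case. The code below is not the paper's
algorithm: it assumes a class bigram model in which every word belongs
to exactly one class, starts with one class per word, and greedily
performs the single merge that loses the least training log-likelihood.
It tries every pair of classes, so it does not include the paper's
methods for reducing time complexity. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

START = "<s>"  # sentence-start context; must not be used as a class name


def log_likelihood(corpus, word_class):
    """Training log-likelihood of a class bigram model (ML estimates)."""
    context, transition = Counter(), Counter()
    class_count, word_count = Counter(), Counter()
    for sentence in corpus:
        prev = START
        for word in sentence:
            cls = word_class[word]
            context[prev] += 1
            transition[(prev, cls)] += 1
            class_count[cls] += 1
            word_count[word] += 1
            prev = cls
    ll = 0.0
    for (prev, cls), n in transition.items():
        ll += n * math.log(n / context[prev])  # class transitions
    for word, n in word_count.items():
        ll += n * math.log(n / class_count[word_class[word]])  # outputs
    return ll


def merge_best_pair(corpus, word_class):
    """Merge the two classes whose union keeps the likelihood highest.

    Returns (log_likelihood, new_word_class), or None if fewer than two
    classes remain.
    """
    best = None
    for a, b in combinations(sorted(set(word_class.values())), 2):
        candidate = {w: (a if c == b else c) for w, c in word_class.items()}
        ll = log_likelihood(corpus, candidate)
        if best is None or ll > best[0]:
            best = (ll, candidate)
    return best
```

Repeating merge_best_pair until a target number of classes is reached
gives successively smaller and more general models.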
Thorsten Brants, 1995.
Some Experiments with the CRATER Corpus.
Report for the CRATER Project.
This paper reports statistical information about the CRATER corpus and
experiments performed with it. The corpus is compiled, part-of-speech
annotated and manually edited by the Corpus Resources and Terminology
Extraction project. The experiments concern statistical part-of-speech
tagging and statistically motivated tagset modification.
Thorsten Brants, 1995.
Estimating HMM Topologies.
In Proceedings of the Tbilisi Symposium on Language, Logic, and Computation.
Tbilisi, Georgia.
There are several ways of estimating parameters for HMMs when used
for natural language models. One can use word-n-grams and
n-grams of automatically derived categories for speech
recognition. Or one can use part-of-speech n-grams for
part-of-speech tagging, either by using a manually tagged corpus or by
using the Baum-Welch algorithm. This paper shows how to use another
method for parameter estimation: Model Merging. It exploits the
advantages of the other methods, is applicable both for speech
recognition and part-of-speech tagging and, unlike other techniques, it
not only induces transition and output probabilities but also the model
topology, i.e. the number of states and their respective possible
outputs. Thus it automatically generates categories but, unlike
other categorization algorithms, can recognize that a word
belongs to more than one category. With added optimizations, the
algorithm is used to generate language models for part-of-speech
tagging. Their accuracy in a tagging task is better than the
accuracy of HMMs derived by standard techniques.
Thorsten Brants and Christer Samuelsson, 1995.
Tagging the Teleman Corpus.
In Proceedings of the 10th Nordic Conference of Computational Linguistics (NODALIDA-95).
Helsinki, Finland.
Experiments were carried out comparing the Swedish Teleman and the
English Susanne corpora using an HMM-based and a novel reductionistic
statistical part-of-speech tagger. They indicate that tagging the
Teleman corpus is the more difficult task, and that the performance of
the two different taggers is comparable.
Also available as CLAUS Report #54
Thorsten Brants, 1995.
Tagset Reduction Without Information Loss.
In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95). Cambridge, MA, USA.
A technique for reducing a tagset used for n-gram part-of-speech
disambiguation is introduced and evaluated in an experiment. The
technique ensures that all information that is provided by the original
tagset can be restored from the reduced one. This is crucial, since we
are interested in the linguistically motivated tags for part-of-speech
disambiguation. The reduced tagset needs fewer parameters for its
statistical model and allows more accurate parameter estimation.
Additionally, there is a slight but not significant improvement of
tagging accuracy.
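
One way to guarantee that original tags can be restored is sketched
below. It rests on an assumption not spelled out in the abstract:
restoration looks up the word in a lexicon, so two tags may share a
reduced tag only if no word can carry both. The lexicon format and the
function names are hypothetical.

```python
def is_lossless(lexicon, reduce_map):
    """Check that no word has two original tags with the same reduced tag.

    lexicon maps each word to the set of original tags it can take;
    reduce_map maps original tags to reduced tags (unlisted tags are
    kept unchanged).
    """
    for tags in lexicon.values():
        reduced = {reduce_map.get(tag, tag) for tag in tags}
        if len(reduced) != len(set(tags)):
            return False
    return True


def restore(word, reduced_tag, lexicon, reduce_map):
    """Recover the original tag from the word and its reduced tag."""
    candidates = [tag for tag in lexicon[word]
                  if reduce_map.get(tag, tag) == reduced_tag]
    if len(candidates) != 1:
        raise ValueError("reduced tag does not identify a unique original tag")
    return candidates[0]
```

A tagger can then be trained and run on the reduced tagset, with
restore applied to its output to recover the original tags.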
Bernhard Kipper, Thorsten Brants, Marcus Plach, and Ralph Schäfer, 1995.
Bayessche Netze: Ein einführendes Beispiel.
Graduiertenkolleg Kognitionswissenschaft, Bericht Nr. 4. Saarbrücken, Germany.
Bayesian networks are a widely noted formalism for representing and
processing uncertain knowledge. Although several introductory works on
the formalism of Bayesian networks exist, what these introductions lack
is an illustration of the mechanisms used within Bayesian networks by
means of concrete (numerical) examples. The present work is intended to
close exactly this gap: the basic structure of Bayesian networks is
explained by modelling an example scenario. In the resulting example
network, the probabilistic methods used in Bayesian networks are then
worked through with concrete numerical values.
Thorsten Brants, 1994.
Parameteroptimierung für ein statistisches Sprachmodell.
In Proceedings der 1. Fachtagung der Gesellschaft für Kognitionswissenschaft.
Freiburg, Germany.
This work is concerned with determining and optimizing the parameters of Hidden Markov Models (HMMs) when they are used as language models. The parameters are determined from text corpora. Not only the probabilities of state transitions and outputs are to be derived from the corpus, as is generally done, but also the number of states and the associated formation of word categories. The goal is to combine linguistic and statistical requirements effectively.
Last changed Tue Jan 17 10:48:05 CET 2000, Thorsten Brants