Treebank Conversion

Automatic conversion of the French Treebank into PLTAG (psycholinguistically motivated tree-adjoining grammar) format and automatic generation of a PLTAG lexicon.

MOTIVATION:

Manually annotated treebanks are precious resources for training supervised parsers. However, the existing large treebanks (the Penn Treebank for English, the Negra / TiGer treebank for German, and the French Treebank) use different annotation schemata and storage formats, and encode slightly different types of information. For example, they differ in whether they include morphological information, in the extent to which they encode functional relationships between words and / or constituents, and in how much hierarchical structure they specify explicitly, ranging from binary trees at one end to almost completely flat trees at the other.

Our group has recently developed a strictly incremental parser for psycholinguistically motivated tree-adjoining grammar (PLTAG), which evaluates well not only in terms of parsing performance, but also in terms of predicting syntactic processing difficulty in humans. Extending our current model to other languages and testing it on psycholinguistic measures for those languages is an important next step. French is a particularly good candidate for such a model extension, because a large treebank to train the parser is available, as well as psycholinguistic test sets in the form of eye-tracking corpora. The first step in extending the parsing model to French is to convert the existing treebank into an easily usable format and to automatically generate a PLTAG lexicon from the converted treebank.

BACKGROUND: You would first familiarize yourself with the French Treebank (see Abeille et al., 2003; Seddah et al., 2009).

Further necessary background includes the paper on the basic TAG grammar extraction algorithm (Xia et al., 2000), as well as the work on treebank conversion for a PLTAG grammar (Demberg and Keller, 2008).

PROJECT STEPS: The main steps of your project would be the following:

1) Assess the existing resources.

2) Determine how to deal with anticipated challenges, such as the relative flatness of the French Treebank and its encoding of modifier vs. argument relationships.

3) Modify an existing algorithm for treebank conversion in order to generate a French LTAG treebank.

4) Extract a PLTAG lexicon from the converted treebank, potentially based on an existing Java implementation (Demberg and Keller, 2008).
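To give a feel for the flatness problem mentioned in step 2, here is a minimal Java sketch of one possible restructuring step. It is in the spirit of TAG extraction approaches such as Xia et al. (2000), where the head and its arguments share one level and each modifier is attached on its own level, so that modifiers can later be cut out as auxiliary trees. It is not the existing implementation: the names FlatTreeSketch, Node, Role and unflatten are hypothetical, and the sketch assumes that every child has already been classified as head, argument or modifier (in practice this would have to be derived from the treebank's function labels and head rules). As a simplification, only modifiers at the left and right edges of a constituent are peeled off; modifiers standing between the head and its arguments are left in place.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Demberg and Keller (2008) code. */
class FlatTreeSketch {

    /** Assumed to be given for every child of a constituent. */
    enum Role { HEAD, ARGUMENT, MODIFIER }

    static final class Node {
        final String label;
        final Role role;
        final List<Node> children = new ArrayList<>();

        Node(String label, Role role, Node... kids) {
            this.label = label;
            this.role = role;
            for (Node k : kids) children.add(k);
        }

        /** Penn-style bracketing; a leaf is printed as its bare label. */
        String bracketed() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c.bracketed());
            return sb.append(')').toString();
        }
    }

    /**
     * Returns a copy of the tree in which every modifier at the edge of a
     * constituent sits on its own level: (X mod X') or (X X' mod).
     */
    static Node unflatten(Node flat) {
        if (flat.children.isEmpty()) return flat;

        List<Node> kids = new ArrayList<>();
        for (Node c : flat.children) kids.add(unflatten(c));

        // Find the span [lo, hi] that remains after skipping edge modifiers.
        int lo = 0, hi = kids.size() - 1;
        while (lo <= hi && kids.get(lo).role == Role.MODIFIER) lo++;
        while (hi >= lo && kids.get(hi).role == Role.MODIFIER) hi--;

        if (lo > hi) { // only modifiers: nothing to anchor on, keep flat
            Node same = new Node(flat.label, flat.role);
            same.children.addAll(kids);
            return same;
        }

        // Number of modifiers still to attach; the outermost level
        // inherits the role of the original constituent.
        int remaining = lo + (kids.size() - 1 - hi);
        Node current = new Node(flat.label, remaining == 0 ? flat.role : Role.HEAD);
        current.children.addAll(kids.subList(lo, hi + 1));

        // Arbitrary choice: left modifiers first (nearest first), then right.
        for (int i = lo - 1; i >= 0; i--) {
            remaining--;
            current = new Node(flat.label, remaining == 0 ? flat.role : Role.HEAD,
                               kids.get(i), current);
        }
        for (int i = hi + 1; i < kids.size(); i++) {
            remaining--;
            current = new Node(flat.label, remaining == 0 ? flat.role : Role.HEAD,
                               current, kids.get(i));
        }
        return current;
    }

    /** Toy flat clause: "Souvent Jean voit Marie ici". */
    static Node example() {
        return new Node("SENT", Role.HEAD,
            new Node("ADV", Role.MODIFIER, new Node("Souvent", Role.HEAD)),
            new Node("NP", Role.ARGUMENT, new Node("Jean", Role.HEAD)),
            new Node("VN", Role.HEAD, new Node("voit", Role.HEAD)),
            new Node("NP", Role.ARGUMENT, new Node("Marie", Role.HEAD)),
            new Node("ADV", Role.MODIFIER, new Node("ici", Role.HEAD)));
    }

    public static void main(String[] args) {
        System.out.println(example().bracketed());
        System.out.println(unflatten(example()).bracketed());
    }
}
```

On the toy clause, the two adverbs end up on separate SENT levels around a core that contains only the subject, the verb nucleus and the object. Whether left or right modifiers should attach lower, and how to treat modifiers inside the core, are exactly the kinds of design decisions step 2 asks you to work out.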

This project can be extended to also adapt the existing incremental PLTAG parser for English to parse the converted French treebank.

REQUIREMENTS: Good programming skills in Java are essential for this project.

REFERENCES (which are not provided as links):

Anne Abeille, Lionel Clement, and Francois Toussenel. 2003. Building a treebank for French. In Treebanks: Building and Using Parsed Corpora. Kluwer, Dordrecht.

Fei Xia, Martha Palmer, and Aravind Joshi. 2000. A uniform method of grammar extraction and its applications. In Proceedings of the joint SIGDAT conference on empirical methods in NLP and very large corpora, pages 53 - 62, Morristown, NJ, USA.

Djame Seddah, Marie Candito, and Benoit Crabbe. 2009. Proceedings of the 11th International Conference on Parsing Technologies (IWPT), pages 150 - 161, Paris, October 2009. Association for Computational Linguistics.

Skut, W., Krenn, B., Brants, T., and Uszkoreit, H. (1997). An annotation scheme for free word order languages. In 5th International Conference of Applied Natural Language, pages 88–94, Washington, USA.
