Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Computational Linguistics Colloquium

Thursday, 4 December, 16:15
Conference Room, Building C7 4

Adapting a WSJ-trained Lexicalized-Grammar Parser to New Domains

Stephen Clark
Oxford University

In this talk I will describe some experiments on adapting the C&C CCG parser to new domains. The parser was originally developed using CCGbank, the CCG version of the Penn Treebank, and is therefore tuned to newspaper text. The two new domains we consider are (1) biomedical abstracts and (2) questions for a QA system (using the term "domain" somewhat loosely in the latter case).
The porting approach we use is to train the parser at lower levels of representation than full syntactic derivations. The lexicalized nature of CCG (in which words are assigned syntactic categories that include subcategorization information) makes it possible to use a level of representation intermediate between POS tags and full derivations. For the biomedical data, we find that simply retraining the POS tagger leads to a large improvement in performance, and that using annotated data at the intermediate CCG lexical category level improves parsing accuracy further. A similar result is obtained for the question data, but the impact of retraining at the CCG lexical category level is much greater. We suggest that this is because the syntax of questions differs more from that of newspaper text than does the syntax of biomedical sentences, and we discuss some measures supporting this idea.
The parsing accuracies obtained for both biomedical and question data are in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical domain on the same evaluation resource. The conclusion is that porting newspaper-trained parsers to new domains may not be as difficult as first thought (at least for parsers which use lexicalized grammars), but we note that different levels of representation may have different impacts on the porting process, depending on the characteristics of the target domain.

If you would like to meet with the speaker, please contact Rebecca Dridan.