Research Interests
[Lexical Semantics] [Discourse Processing] [Text Mining] [Lexicon Representation]
Discourse Processing Plus Lexical Semantics
I've recently (September 2008) started a junior research group on "Computational Modelling of Discourse and Semantics". We aim to investigate how to combine shallow semantic parsing with discourse processing, with the goal of improving on the state of the art in both semantic role labelling and discourse processing. More information can be found here.
Since May 2007, I've been working in the SALSA project. The aim of SALSA is to provide a large, frame-based lexicon for German. My own contribution focuses on (semi-)automatic data expansion, bootstrapping from existing SALSA data to new lexical units. I'm also working on the treatment of multi-word expressions and non-literal language. Collaborators include Aljoscha Burchardt, Andrea Kowalski, Sebastian Padó, and Marco Pennacchiotti.
With colleagues in Edinburgh, I work on several aspects of automatic discourse processing. In joint work with Alex Lascarides, I investigated ways of automatically determining discourse relations, such as contrast or explanation, even when they are not signalled by an overt discourse marker such as "but". In particular, we looked at the role that automatically labelled training examples can play here. Identifying discourse relations is a sub-task of discourse parsing (i.e. determining the rhetorical structure of a discourse) and is important for many NLP tasks, among them Question Answering, Information Extraction and Summarisation.
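As a rough illustration of what "automatically labelled training examples" can look like, the sketch below assumes one common strategy: find sentences that contain an overt marker, use the marker to pick the relation label, and then remove the marker so that a learner only sees the two spans. The marker inventory, the function name `auto_label` and the example sentence are all hypothetical, and the sketch is not a description of the system used in this work.

```python
import re

# Hypothetical, tiny marker inventory (an assumption for illustration).
# Real inventories are larger and must deal with ambiguous markers.
MARKERS = {
    "but": "contrast",
    "because": "explanation",
}


def auto_label(sentence):
    """Split a sentence at an overt marker and return (left, right, relation).

    The marker itself is dropped, so the resulting example resembles a
    relation that is not overtly signalled. Returns None if no marker
    from MARKERS occurs sentence-internally.
    """
    for marker, relation in MARKERS.items():
        pattern = rf",?\s+\b{re.escape(marker)}\b\s+"
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match and match.start() > 0:
            left = sentence[:match.start()].strip()
            right = sentence[match.end():].strip()
            if left and right:
                return left, right, relation
    return None


example = auto_label("The museum was closed, but the archive stayed open.")
print(example)
```

Examples produced this way are noisy, since a marker does not always signal the relation it is mapped to, which is one reason the usefulness of such data is an empirical question.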
Another focus of my work is discourse segmentation. Together with Mirella Lapata, I developed models for segmenting text into paragraphs. Automatic paragraph segmentation has applications in tasks such as text-to-text summarisation, machine translation, and the transcription of speeches or lectures. In related work, we looked at the automatic segmentation of discourse into its elementary units (i.e., discourse chunking), which is a necessary pre-processing step for discourse parsing. We also showed that discourse chunking is useful in its own right for sentence compression.
Between September 2005 and April 2007 I was working as a postdoc on the MITCH project, a joint research project between the ILK Research Group at Tilburg University and Naturalis, the Dutch National Museum of Natural History. The goal of the MITCH project is to develop tools to make the data at Naturalis (fieldbooks, specimen databases, scientific publications) more accessible, e.g. by automatically cleaning and enriching them and by linking different data sources.
One focus of my research was the automatic detection and correction of errors in textual databases (i.e. databases which contain some proportion of free text). While error detection is an active research area, most work is not geared towards textual databases. Work on outlier detection, for instance, often requires that the data is numerical or at least categorical (i.e., it is assumed that data values are atomic). However, textual databases typically contain free-text fields. The values of these fields are relatively long text strings which should not be treated as atoms.
Given these shortcomings of existing error detection methods, I developed two new techniques that are specifically aimed at detecting errors in textual databases. The first method (horizontal error detection) aims to detect fields which contain a wrong value, e.g., "South America" instead of "South Africa" in a Country field. These errors are detected by comparing the value of a field to the values of the other fields in the same record. The second method (vertical error detection) aims to detect whether a text string was entered in the wrong column. For example, the string "died in captivity" might occur in a Location column, but it would be better placed in a Special Remarks column. To detect this type of error, the problem was recast as a text classification task: a classifier is trained to predict which column a text string should be in, and a potential error is signalled if the predicted column deviates from the original column. Both error detection methods are data-driven and do not require manually labelled training data. Instead, the database itself is exploited to artificially create (noisy) training data. Collaborators in this work were Antal van den Bosch, Marieke van Erp, Tijn Porcelijn, and Steve Hunt.
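The vertical error detection idea can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used in the project: the choice of a word-based Naive Bayes classifier, the names `ColumnClassifier` and `flag_vertical_errors`, and the toy records are all hypothetical. What it does share with the description above is the setup: every (string, column) pair in the database serves as a noisy training example, and a string is flagged when the predicted column differs from the column it is stored in.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class ColumnClassifier:
    """Multinomial Naive Bayes over word tokens with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # column -> token counts
        self.doc_counts = Counter()              # column -> number of strings
        self.vocab = set()

    def train(self, rows):
        # The database labels itself: each value is a training example
        # whose (possibly wrong) label is the column it is stored in.
        for row in rows:
            for column, value in row.items():
                tokens = tokenize(value)
                if not tokens:
                    continue
                self.doc_counts[column] += 1
                self.word_counts[column].update(tokens)
                self.vocab.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_column, best_score = None, float("-inf")
        for column in sorted(self.doc_counts):
            counts = self.word_counts[column]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[column] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_column, best_score = column, score
        return best_column


def flag_vertical_errors(rows, classifier):
    """Return (row index, stored column, predicted column, value) tuples."""
    flags = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            if not tokenize(value):
                continue
            predicted = classifier.predict(value)
            if predicted != column:
                flags.append((i, column, predicted, value))
    return flags


# Toy records (invented for illustration); the last one has a misplaced value.
rows = [
    {"Location": "near the river bank", "Special Remarks": "died in captivity"},
    {"Location": "forest near the village", "Special Remarks": "specimen died after capture"},
    {"Location": "river bank near the forest", "Special Remarks": "kept in captivity"},
    {"Location": "died in captivity", "Special Remarks": "damaged specimen"},
]

classifier = ColumnClassifier()
classifier.train(rows)
flags = flag_vertical_errors(rows, classifier)
print(flags)
```

On these toy records the only flagged value is "died in captivity" in the Location column of the last record, for which the classifier predicts Special Remarks. Note that the misplaced string is itself part of the training data, which is what makes the training data noisy.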
Together with Sander Canisius, I've also been looking at field segmentation, i.e. segmenting semi-structured texts (in our case fieldbooks that describe the circumstances under which a specimen was collected) into chunks which contain a given type of information (for example information about the biotope in which a specimen was collected or about the person who collected it). Field segmentation essentially allows one to automatically convert reports into databases. Because training data is usually unavailable, it is impractical to model this as a standard supervised learning task. We therefore investigated different strategies for bootstrapping from existing databases for the domain, thereby circumventing the need to manually annotate training data.
For my PhD thesis, I developed a method to automatically create a lexical inheritance hierarchy for a flat input lexicon. Lexical inheritance hierarchies capture generalisations over lexical entries (e.g. that all transitive verbs subcategorise for a subject and an object). However, not all generalisations over lexical entries in a given grammatical framework, such as HPSG, are linguistically meaningful. Some are merely accidental. In my PhD research I developed a method for (semi-)automatically distinguishing between meaningful and less meaningful generalisations. The method makes use of supervised machine learning and utilises existing inheritance hierarchies and insights from set theory to automatically create a training set for the learner. In earlier research, I also explored using a modified decision tree learning algorithm to automatically create tree-structured inheritance hierarchies (i.e. those without nodes inheriting from more than one parent). In collaboration with Harald Lüngen, I tested this approach on a morphological lexicon.
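To make the notion of a generalisation over lexical entries concrete, the following sketch factors the features shared by a group of entries out into a parent node, leaving each entry with only its own residual features. It illustrates the data structure only and is not the thesis method: the feature strings, the toy lexicon and the function names `factor_out` and `expand` are assumptions made for the example, and nothing here decides whether a generalisation is meaningful or accidental.

```python
# Toy flat lexicon: each entry lists all of its features explicitly.
flat_lexicon = {
    "see": {"cat=verb", "subcat=subject", "subcat=object"},
    "eat": {"cat=verb", "subcat=subject", "subcat=object"},
    "sleep": {"cat=verb", "subcat=subject"},
}


def factor_out(lexicon, members):
    """Build a parent node from the features shared by `members`.

    Returns (parent_features, residual), where residual maps each member
    to the features it still has to state itself.
    """
    shared = set.intersection(*(lexicon[m] for m in members))
    residual = {m: lexicon[m] - shared for m in members}
    return shared, residual


def expand(parent_features, own_features):
    """Recover an entry's full feature set via inheritance."""
    return parent_features | own_features


# A candidate "verb" node over all three entries.
verb_node, verb_residual = factor_out(flat_lexicon, ["see", "eat", "sleep"])
print(sorted(verb_node))
print({word: sorted(feats) for word, feats in verb_residual.items()})
```

Any subset of entries yields some candidate parent in this way, which is why a flat lexicon gives rise to many possible generalisations and why a method is needed to tell the meaningful ones apart from the accidental ones.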