Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Information for students


Currently there are several research topics available for Master's theses, Bachelor's theses, and software projects. We also welcome suggestions for other topics in the broad context of the SALSA project (computational lexical semantics/frame semantics). If you have an idea, just send an email to salsa-mit@coli.uni-sb.de.

Master's Theses

Automatically Modelling the Meaning of Idioms

Expressions with idiomatic meanings such as set in stone or the penny drops pose significant challenges for NLP systems. Their linguistic behaviour usually differs from what would be expected if they were used literally. For instance, the verb drop can normally take a whole range of PP-complements, e.g. an into-PP (the plane dropped into the sea), a below-PP (the temperature dropped below zero) or an on-PP (the gold dropped on the ground). However, if drop is used in the idiom the penny drops, PP-complements headed by into or below are relatively unlikely, and on-PPs tend to realise a different semantic role than the prototypical location role, namely the issue on which "the penny dropped" (the penny dropped on why things turn out the way they do) or the person for whom "the penny dropped" (e.g. after reaching the hotel, the penny dropped on them). While there are some resources which list idiomatic expressions and provide information about their syntactic-semantic behaviour (e.g. idiom dictionaries), these are expensive to create manually. The aim of this project is to investigate ways in which such information can be bootstrapped from corpora.

The first step is to determine the meaning of the idiom by finding words with which it frequently co-occurs (e.g. for the penny drops: realised, understand, confused, think, explain). From these it should be possible to find synonyms or near-synonyms for the meaning of the idiom, e.g. realise, as these should co-occur with the same set of words. As the second step, a suitable semantic frame for the idiom could be determined, in this case "coming_to_believe", which is the frame for realise. The third step would involve mapping the semantic roles of the frame to the syntactic complements of the (head word of the) idiom. For example, the "cognizer" role is typically realised as an on-PP for the penny drops. The mapping can be found by computing the semantic similarity between the idiom's observed complement fillers and known role fillers for the "coming_to_believe" frame, which can be extracted from frame-semantically annotated corpora such as FrameNet.
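As a concrete illustration of the first step, here is a minimal Python sketch that builds co-occurrence profiles for an idiom and for candidate paraphrase verbs and compares them with cosine similarity. The toy corpus and the candidate list are purely illustrative; in the project this would run over a large tokenised corpus.

    from collections import Counter
    from math import sqrt

    def context_vector(sentences, target, window=5):
        """Count the words co-occurring with the target expression
        within a fixed window: a crude distributional profile."""
        counts, n = Counter(), len(target)
        for sent in sentences:
            for i in range(len(sent) - n + 1):
                if sent[i:i + n] == target:
                    lo, hi = max(0, i - window), min(len(sent), i + n + window)
                    counts.update(sent[lo:i] + sent[i + n:hi])
        return counts

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = lambda v: sqrt(sum(c * c for c in v.values()))
        return dot / (norm(a) * norm(b)) if a and b else 0.0

    # Toy corpus; real input would be a large corpus.
    corpus = [
        "finally the penny drops and she understands the problem".split(),
        "suddenly she understands the problem and can explain it".split(),
    ]
    idiom = context_vector(corpus, "the penny drops".split())
    for candidate in ["understands", "explain"]:
        sim = cosine(idiom, context_vector(corpus, [candidate]))
        print(candidate, round(sim, 3))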

This research topic builds on existing work on frame assignment to unknown words (e.g. Burchardt et al. (2005)) and role mapping (e.g. Padó et al. (2008)). The topic could also be split into two (i.e. meaning determination/frame assignment on the one hand, and role mapping on the other). Programming skills are essential, and some familiarity with statistical modelling/machine learning would be useful.

Improving Word Sense Disambiguation by Exploiting Frame-Semantic Information

Word sense disambiguation (WSD) is a crucial processing step for many NLP tasks, and it is a topic that has received much attention over the years. However, the problem is still far from solved. The aim of this project is to shed new light on an old problem by investigating to what extent frame-semantic information can be exploited to determine the correct sense (or frame) of a target word. For example, since frames describe specific situations, they do not co-occur randomly: frames describing related situations are more likely to occur together. How frames relate to each other is encoded in the FrameNet frame hierarchy. For example, the "try_defendant" frame is closely related to the "verdict" frame. Hence, in "The government did not dare to try him; they were sure he would be acquitted." the word try is more likely to have the sense "try_defendant" than "attempt", because it co-occurs with acquit, which evokes the "verdict" frame. Another useful source of information is the role fillers. For example, have can evoke several frames, including "ingestion" (Peter had a sandwich) and "possession" (Peter has a car). In the former sense, have is almost exclusively accompanied by complements (usually in direct object position) referring to edible substances; in the latter sense, a complement is much less likely to refer to something edible. The aim of this project is to investigate how far one can get by exploiting these information sources.
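The following toy sketch shows one way the two cues might be combined into a single disambiguation score. All counts and preference values are invented, and the crude linear combination of a raw count with a probability is for illustration only; in the project these quantities would be estimated from frame-annotated data and combined in a principled model.

    # Toy frame co-occurrence counts (invented numbers; in practice
    # estimated from a frame-annotated corpus).
    COOC = {("try_defendant", "verdict"): 42, ("attempt", "verdict"): 3}

    # Toy selectional preference: how likely the object role of each
    # frame is filled by something edible (again invented numbers).
    EDIBLE_PREF = {"ingestion": 0.9, "possession": 0.1}

    def frame_score(candidate, context_frames, object_edible=False, alpha=0.5):
        """Crude linear combination of the two information sources."""
        cooc = sum(COOC.get((candidate, f), 0) for f in context_frames)
        pref = EDIBLE_PREF.get(candidate, 0.5) if object_edible else 0.5
        return alpha * cooc + (1 - alpha) * pref

    # "... did not dare to try him; ... he would be acquitted."
    for cand in ["try_defendant", "attempt"]:
        print(cand, frame_score(cand, ["verdict"]))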

Active Learning for Semantic Role Labelling

Semantic role labelling (SRL) systems are typically based on supervised machine learning and require training corpora which are manually annotated with semantic role information. Manual corpus annotation is, however, time-consuming, especially when it involves semantic annotation. One way to alleviate the resulting data sparseness problem is active learning: instead of annotating data more or less at random, only those instances are annotated that are most useful to the machine learner. For many tasks this approach has been shown to reduce the amount of training data required to reach a given performance level. However, so far little work has investigated the use of active learning for semantic role labelling. This project aims to fill that gap.
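To make the framework concrete, a minimal pool-based uncertainty-sampling loop might look as follows. This is only a sketch, assuming scikit-learn and pre-extracted feature vectors; in a real annotation study the oracle would be a human annotator rather than an array of gold labels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def uncertainty_sampling(X_pool, y_oracle, seed_size=10, rounds=50):
        """X_pool: feature matrix; y_oracle simulates the annotator.
        Assumes the random seed set contains more than one class."""
        rng = np.random.default_rng(0)
        labelled = list(rng.choice(len(X_pool), seed_size, replace=False))
        pool = [i for i in range(len(X_pool)) if i not in labelled]
        clf = LogisticRegression(max_iter=1000)
        for _ in range(rounds):
            clf.fit(X_pool[labelled], y_oracle[labelled])
            probs = clf.predict_proba(X_pool[pool])
            # query the instance the model is least confident about
            query = pool[int(np.argmin(probs.max(axis=1)))]
            labelled.append(query)   # "annotate" it
            pool.remove(query)
        return clf, labelled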

For this topic some familiarity with machine learning and semantic role labelling would be useful.


Bachelor Theses

Machine Support for Semantic Annotation

Manually annotating corpora with (frame-)semantic information is very time-consuming. Manually annotated corpora are therefore usually relatively small and expensive to produce. The aim of this Bachelor's thesis is to investigate to what extent manual corpus annotation can usefully be supported by automatic methods. One interesting question is, for example, whether it makes sense to have a corpus pre-annotated by a semantic role labeller (SRL). On the one hand, such a procedure can save time under certain circumstances, since correct analyses produced by the SRL system can be adopted directly by the annotator. On the other hand, SRL systems also make errors, which then have to be corrected by hand. A further open question is whether pre-annotation by an SRL system tempts annotators to accept the system's analyses without carefully checking whether they are correct; this could lead to a drop in annotation quality. This Bachelor's project will investigate these questions.
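One very simple way to quantify the first trade-off is to compare the SRL pre-annotation with a gold annotation slot by slot, as in the sketch below. The slot-wise alignment and the role labels are invented for illustration.

    def preannotation_stats(system_labels, gold_labels):
        """Share of pre-annotated role labels an annotator could accept
        unchanged vs. the number that would need correcting."""
        accepted = sum(s == g for s, g in zip(system_labels, gold_labels))
        return {"accept_rate": accepted / len(gold_labels),
                "corrections": len(gold_labels) - accepted}

    print(preannotation_stats(["Agent", "Theme", "Theme", "Location"],
                              ["Agent", "Theme", "Goal", "Location"]))
    # -> {'accept_rate': 0.75, 'corrections': 1}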

Automatic Detection of Errors in Semantically Annotated Corpora

Many recent successes in NLP can be directly attributed to the increased availability of manually annotated corpora, which can be used to train part-of-speech taggers, parsers, word sense disambiguators etc. However, manually annotated data is never entirely free of errors and inconsistencies, and any noise in the annotations typically translates into a performance drop for the models trained on these resources. It is therefore crucial that the training data is as error-free as possible, and one way to ensure this is by developing techniques that can automatically detect errors and inconsistencies in the annotations. While automatic error detection has long been an active research area in the data-mining community (e.g., outlier detection), error detection in linguistically annotated data has only recently become a focus of attention. So far, however, the research effort has been devoted almost exclusively to syntactically or POS-annotated corpora. This project will adapt and extend existing error detection techniques to data annotated with (frame-)semantic information as part of the SALSA project. A number of existing error detection methods could be used as a starting point (unsupervised statistical modelling, symbolic methods, supervised machine learning etc.).
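As an illustration of the symbolic direction, the "variation n-gram" idea of Dickinson and Meurers (flag identical contexts that received different labels) carries over naturally; the sketch below uses an invented, simplified data format, and deciding what counts as a "context" for frame-semantic annotation would be part of the project.

    from collections import defaultdict

    def variation_candidates(instances):
        """instances: (context, label) pairs; returns contexts that were
        labelled inconsistently and therefore deserve a manual check."""
        seen = defaultdict(set)
        for context, label in instances:
            seen[context].add(label)
        return {c: labels for c, labels in seen.items() if len(labels) > 1}

    # Toy example: the same predicate-argument context, two frames.
    data = [(("try", "him"), "try_defendant"),
            (("try", "him"), "attempt"),
            (("try", "again"), "attempt")]
    print(variation_candidates(data))
    # -> {('try', 'him'): {'try_defendant', 'attempt'}}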

As the SALSA data is in German, this project requires at least a passive knowledge of the language. Some familiarity with frame-semantics would also be beneficial.

Supervised Approaches to Distinguishing Literal and Non-Literal Usage

Many idiomatic expressions like break the ice or get one's feet wet can also have a literal meaning. Which interpretation is correct depends very much on the discourse context. NLP systems need to be able to distinguish literal from non-literal usage because idiomatic expressions typically differ in their linguistic behaviour from their literal counterparts. For example, the with-complement of break in a literal usage of break the ice (break the ice on the water with a heavy stick) typically refers to an instrument used for the breaking. In the non-literal case, however, the with-complement is more likely to refer to the person or entity with whom/which the ice is to be broken (break the ice with potential clients). The aim of this thesis project is to develop strategies for distinguishing literal and non-literal usages of idiomatic expressions. We have a reasonably large data set available in which literal vs. non-literal usages are annotated, so it will be possible to use supervised machine learning for this task. The main focus will be on the design of intelligent features which can help to distinguish the two usages. Prerequisites: programming skills; interest in (and ideally some prior experience with) machine learning methods and semantic processing.
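A minimal supervised baseline might look like the sketch below, assuming scikit-learn; the feature names and values are illustrative stand-ins for the features the thesis would actually design.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def features(ctx):
        """ctx: pre-extracted context information for one occurrence."""
        return {"with_pp_head_animate": ctx.get("with_animate", False),
                "subject_concrete": ctx.get("subj_concrete", False),
                "discourse_topic": ctx.get("topic", "unknown")}

    # Two toy training instances; the real data set is much larger.
    train = [({"with_animate": True, "topic": "business"}, "nonliteral"),
             ({"subj_concrete": True, "topic": "weather"}, "literal")]
    X = [features(ctx) for ctx, _ in train]
    y = [label for _, label in train]

    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    print(model.predict([features({"with_animate": True, "topic": "sales"})]))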

Software Projects

We currently offer a number of software projects:

Wikipedia as a Linguistic Corpus

See the KVV (course catalogue).

Data Analysis Tool for Word Sense Disambiguation

See the KVV (course catalogue).

"Normalisierung von Mehrwortausdrücken"

Multiword expressions (MWEs) such as ans Werk gehen ('to set to work') pose a challenge for semantic processing. The problem starts as early as annotation, because multiword expressions, like almost all other linguistic expressions, can occur in different (surface) forms, e.g. through inflection: beisst ins Gras, biss in das Gras, ins Gras gebissen ('bites the dust', 'bit the dust', 'bitten the dust'). The task of this software project is to design and implement methods that derive "normalised" forms from the multiword expressions in the TIGER corpus and map new instances back to them.
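A first normalisation step might simply map every instance to a canonical key over its lemmas, as in the sketch below. The tiny lemma table is a stand-in for TIGER's lemma annotation or a morphological analyser, and the handling of the contraction ins = in + das is deliberately crude.

    # Illustrative lemma table; in the project, lemmas would come from
    # the TIGER annotation or a morphological analyser.
    LEMMA = {"beisst": "beissen", "biss": "beissen", "gebissen": "beissen",
             "ins": "in", "in": "in", "das": "das", "gras": "gras"}

    def normalise(tokens, stopwords=frozenset({"das"})):
        """Map an MWE instance to a canonical key: the sorted multiset
        of its lemmas, minus function words dropped as stopwords."""
        lemmas = (LEMMA.get(t.lower(), t.lower()) for t in tokens)
        return tuple(sorted(l for l in lemmas if l not in stopwords))

    for form in ["beisst ins Gras", "biss in das Gras", "ins Gras gebissen"]:
        print(form, "->", normalise(form.split()))
    # all three map to ('beissen', 'gras', 'in')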

Prerequisites: good knowledge of German, programming skills, interest in morphology, semantics, and empirical methods.


If you have any questions, or if you are interested in one of the topics, write to: salsa-mit@coli.uni-sb.de.