Thesis Topics

At the moment, I'm offering the following Master and Bachelor thesis topics:
- Automatic detection of conversational structure (Master or Bachelor)
- Applying algorithms from bioinformatics to word sense disambiguation (Bachelor; Master may be possible)
This thesis will be co-supervised with Stefan Diemer (English Linguistics).
Introduction:
Spoken interactions, such as informal conversations but also lectures etc., typically consist of different discourse phases. For example, in conversational storytelling, the narrative might begin with a justification for telling the story, then continue with the main part, and end with a conclusion in which the narrative point is evaluated and commented on. The transitions between these phases are typically marked by linguistic cues which signal to the listeners that a certain phase is about to begin, e.g. that the speaker is about to tell a story. Such cues play an important role in guiding human conversations; humans use them subconsciously to adapt their communicative behaviour, for example, they tend to know when a speaker expects them to comment on something that was just said.

While humans are very good at detecting conversational structure, this task is still challenging for machines. Dialogue systems, for example, can typically only deal well with task-oriented dialogues whose structure is relatively rigid and can thus be more or less hard-wired. Part of the problem is that it is not yet well understood which linguistic cues signal which aspects of conversational structure.

The aim of this thesis project is to shed some light on this question and to develop a method for the automatic detection of discourse phases in spoken interactions. The project will be carried out in close cooperation with the English Linguistics Department (Chair Norrick), who are investigating conversational structure in an ongoing corpus-based research project using the Saarbrücken Corpus of Spoken English (SCoSE), which contains transcripts of various types of spoken language data (conversations between students, interviews with senior citizens, instructional dialogues, classroom discourse etc.).
1. Hypothesis Formation: identify linguistic cues for distinguishing different discourse phases through data exploration and discussions with linguists working on the project in the English department (esp. PD Dr. Stefan Diemer).
2. Hypothesis Testing and Refinement:
- implement an automatic system to detect discourse phases based on the cues identified in (1)
- apply the system to the data and analyse the results
- if necessary, refine the hypotheses and repeat (2)
3. Knowledge Gain: the project and the tools developed will help to test hypotheses about linguistic cues of conversational structure in a more systematic way and will lead to a better understanding of how structure is signalled.
4. Methods for Automatic Structure Detection: the automatic analysis method to be developed will also have potential benefits for automatic systems, e.g. dialogue systems.
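To make the hypothesis-testing step more concrete, a first system iteration could be as simple as matching utterances against a small cue lexicon. The cue phrases and phase labels below are purely illustrative placeholders, not findings from the SCoSE project:

```python
# Minimal sketch of cue-based discourse phase detection.
# The cue lexicon is invented for illustration -- real cues would
# come from the corpus study described above.
CUE_LEXICON = {
    "story_opening": ["did i ever tell you", "you know what", "one time"],
    "story_closing": ["anyway", "so yeah", "that was that"],
}

def detect_phase_transitions(utterances):
    """Return (index, phase) pairs for utterances containing a known cue."""
    transitions = []
    for i, utt in enumerate(utterances):
        text = utt.lower()
        for phase, cues in CUE_LEXICON.items():
            if any(cue in text for cue in cues):
                transitions.append((i, phase))
    return transitions

dialogue = [
    "Did I ever tell you about my trip to Norway?",
    "We got completely lost in the mountains.",
    "Anyway, we made it back eventually.",
]
print(detect_phase_transitions(dialogue))
# [(0, 'story_opening'), (2, 'story_closing')]
```

In later iterations, the hand-written lexicon would be replaced or supplemented by cues identified in the corpus study, and simple substring matching could give way to statistical classification.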
Requirements:
We are looking for an enthusiastic student who is interested in (English) linguistics, especially conversational structure, and who is willing to collaborate with linguists on a corpus-based study. Some programming skills are necessary (e.g. Perl, Python), but the implementation will not be too challenging.

Literature:
- Marti A. Hearst. 1997. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics, 23(1):33-64.
- Lev Pevzner and Marti A. Hearst. 2002. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28(1):19-36.
- Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse Segmentation of Multi-Party Conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan.
- D. Field, S. Worgan, N. Webb, M. Hepple, and Y. Wilks. 2008. Automatic Induction of Dialogue Structure from the Companions Dialogue Corpus. In Proceedings of the 4th International Workshop on Human-Computer Conversation, Bellagio, Italy.
- Jaime Arguello and Carolyn Rosé. 2006. Topic Segmentation of Dialogue. In Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech, New York, NY.
- Caroline Sporleder and Mirella Lapata. 2006. Broad Coverage Paragraph Segmentation across Languages and Domains. ACM Transactions on Speech and Language Processing, 3(2):1-35.
- Diane J. Litman and Rebecca J. Passonneau. 1995. Combining Multiple Knowledge Sources for Discourse Segmentation. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 108-115.
- Jay M. Ponte and W. Bruce Croft. 1997. Text Segmentation by Topic. In Proceedings of the European Conference on Digital Libraries, pp. 113-125.
- James Ballantine. 2004. Topic Segmentation in Spoken Dialogue. BA thesis, Macquarie University.
- Rebecca J. Passonneau and Diane J. Litman. 1993. Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of ACL 1993.
Applying Algorithms from Bioinformatics to Word Sense Disambiguation (Bachelor; Master may be possible)
(Contact me if you're interested in this topic for a Master thesis.)
This thesis will be co-supervised with Jan Baumbach (Computational Systems Biology).
Motivation and Summary:
Automatic word sense disambiguation is important for most NLP applications. Yet despite many years of active research, there are still no algorithms that can reliably distinguish different word senses. One frequently used disambiguation approach tries to assign word senses in context in such a way that all senses fit together. To this end, the degree of semantic similarity between the possible senses in the context is computed, and the senses are then assigned so that the similarity is globally optimised. This global optimisation, however, is often non-trivial. The aim of this project is to test state-of-the-art clustering algorithms from bioinformatics on the problem of word sense disambiguation and to evaluate them in various scenarios.
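As an illustration of the optimisation problem described above, the following sketch exhaustively searches for the sense assignment that maximises the summed pairwise similarity in a context. The sense inventory and similarity values are invented toy data; a real system would use e.g. WordNet-based similarity measures, and the point of this project is precisely to replace the brute-force search (infeasible for longer contexts) with a clustering algorithm:

```python
from itertools import product

# Toy sense inventory and hand-made similarity table (invented values;
# in practice these would come from a lexical resource such as WordNet).
SENSES = {
    "bank":  ["bank#finance", "bank#river"],
    "money": ["money#currency"],
}

SIM = {
    ("bank#finance", "money#currency"): 0.9,
    ("bank#river", "money#currency"): 0.1,
}

def sim(a, b):
    """Symmetric lookup in the similarity table; unknown pairs score 0."""
    return SIM.get((a, b)) or SIM.get((b, a)) or 0.0

def best_assignment(words):
    """Exhaustively pick one sense per word so that the summed pairwise
    similarity is maximal (only feasible for very short contexts)."""
    best, best_score = None, float("-inf")
    for combo in product(*(SENSES[w] for w in words)):
        score = sum(sim(a, b) for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(words, best))

print(best_assignment(["bank", "money"]))
# {'bank': 'bank#finance', 'money': 'money#currency'}
```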
Requirements:
Basic programming skills (e.g. for corpus processing, interfacing with WordNet etc.) and interest in the topic. It is not necessary to implement a large system; an implementation of the clustering algorithm is already available.
Automatically Modelling the Meaning of Idioms
Expressions with idiomatic meanings such as set in stone or the penny drops pose significant challenges to NLP systems. Their linguistic behaviour usually differs from what would be expected if they were used literally. For instance, the verb drop can normally take a whole range of PP-complements, e.g. an into-PP (the plane dropped into), a below-PP (the temperature dropped below zero) or an on-PP (the gold dropped on the ground); however, if drop is used in the idiom the penny drops, PP-complements headed by into or below are relatively unlikely, and on-PPs tend to realise a different semantic role than the prototypical location role, namely the issue on which "the penny dropped" (the penny dropped on why things turn out the way they do) or the person for whom "the penny dropped" (e.g. after reaching the hotel, the penny dropped on them). While there exist some resources which list idiomatic expressions and provide information about their syntactic-semantic behaviour (e.g. idiom dictionaries), these are expensive to create manually.

The aim of this project is to investigate ways in which such information can be bootstrapped from corpora:
1. Determine the meaning of the idiom by finding words with which it frequently co-occurs (e.g. for the penny drops: realised, understand, confused, think, explain). From these it should be possible to find synonyms or near-synonyms for the idiom, e.g. realise, as these should co-occur with the same set of words.
2. Determine a suitable semantic frame for the idiom, in this case "coming_to_believe", which is the frame for realise.
3. Map the semantic roles of the frame to the syntactic complements of the (head word of the) idiom. For example, the "cognizer" role is typically realised as an on-PP for the penny drops.
The role mapping can be done by computing the semantic similarity between the observed complements and known role fillers for the "coming_to_believe" frame, which can be extracted from frame-semantically annotated corpora such as FrameNet.
This research topic builds on existing work on frame assignment to unknown words (e.g. Burchardt et al. (2005)) and role mapping (e.g. Padó et al. (2008)). The topic could also be split into two parts (i.e. meaning determination/frame assignment, and role mapping). Programming skills are essential, and some familiarity with statistical modelling/machine learning would be useful.
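The first step above (finding near-synonyms for an idiom via shared co-occurrence patterns) can be sketched as a simple vector-space comparison. The co-occurrence counts below are invented for illustration; in the project they would be extracted from a corpus:

```python
import math

# Invented co-occurrence counts (context word -> frequency); in practice
# these would be collected from corpus contexts of each expression.
COOC = {
    "the penny drops": {"realise": 5, "understand": 4, "confused": 2},
    "realise":         {"realise": 1, "understand": 6, "confused": 3},
    "eat":             {"food": 7, "hungry": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_synonym(idiom, candidates):
    """Pick the candidate whose co-occurrence profile is closest to the idiom's."""
    return max(candidates, key=lambda w: cosine(COOC[idiom], COOC[w]))

print(nearest_synonym("the penny drops", ["realise", "eat"]))
# realise
```

A candidate synonym found this way (here realise) then points to the semantic frame to be assigned in the second step.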