Thesis Topics

At the moment, I'm offering the following Master and Bachelor thesis topics:
- Automatic detection of conversational structure (Master or Bachelor)
- Applying algorithms from bioinformatics to word sense disambiguation (Bachelor; Master may be possible)
This thesis will be co-supervised with Stefan Diemer (English Linguistics).
Introduction:
Spoken interactions, such as informal conversations but also lectures etc., typically consist of different discourse phases. For example, in conversational storytelling, the narrative might begin with a justification for telling the story, then continue with the main part, and end with a conclusion in which the narrative point is evaluated and commented on. The transitions between these phases are typically marked by linguistic cues which signal to the listeners that a certain phase is about to begin, e.g. that the speaker is about to tell a story. Such cues play an important role in guiding human conversations; humans use them subconsciously to adapt their communicative behaviour, for example, they tend to know when a speaker expects them to comment on something that was just said.

While humans are very good at detecting conversational structure, this task is still challenging for machines. Dialogue systems, for example, can typically only deal well with task-oriented dialogues whose structure is relatively rigid and can thus be more or less hard-wired. Part of the problem is that it is not yet well understood which linguistic cues signal which aspects of conversational structure.

The aim of this thesis project is to shed some light on this question and to develop a method for the automatic detection of discourse phases in spoken interactions. The project will be carried out in close cooperation with the English Linguistics Department (Chair Norrick), who are investigating conversational structure in an ongoing corpus-based research project using the Saarbrücken Corpus of Spoken English (SCoSE), which contains transcripts of various types of spoken language data (conversations between students, interviews with senior citizens, instructional dialogues, classroom discourse etc.).
1. Hypothesis Formation: identify linguistic cues for distinguishing different discourse phases through data exploration and discussions with linguists working on the project in the English department (esp. PD Dr. Stefan Diemer).
2. Hypothesis Testing and Refinement:
- implement an automatic system to detect discourse phases based on the cues identified in (1)
- apply the system to the data and analyse the results
- if necessary, refine the hypotheses and repeat (2)
3. Knowledge Gain: the project and the tools developed will help to test hypotheses about linguistic cues of conversational structure in a more systematic way and will lead to a better understanding of how structure is signalled.
4. Methods for Automatic Structure Detection: the automatic analysis method to be developed will also have potential benefits for automatic systems, e.g. dialogue systems.
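To make the hypothesis-testing step more concrete, a first system iteration could be as simple as matching utterances against a small cue lexicon. The cue phrases and phase labels below are purely illustrative placeholders, not findings from the SCoSE project:

```python
# Minimal sketch of cue-based discourse phase detection.
# The cue lexicon is invented for illustration -- real cues would
# come from the corpus study described above.
CUE_LEXICON = {
    "story_opening": ["did i ever tell you", "you know what", "one time"],
    "story_closing": ["anyway", "so yeah", "that was that"],
}

def detect_phase_transitions(utterances):
    """Return (index, phase) pairs for utterances containing a known cue."""
    transitions = []
    for i, utt in enumerate(utterances):
        text = utt.lower()
        for phase, cues in CUE_LEXICON.items():
            if any(cue in text for cue in cues):
                transitions.append((i, phase))
    return transitions

dialogue = [
    "Did I ever tell you about my trip to Norway?",
    "We got completely lost in the mountains.",
    "Anyway, we made it back eventually.",
]
print(detect_phase_transitions(dialogue))
# [(0, 'story_opening'), (2, 'story_closing')]
```

In later iterations, the hand-written lexicon would be replaced or supplemented by cues identified in the corpus study, and simple substring matching could give way to statistical classification.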
Requirements:
We are looking for an enthusiastic student who is interested in (English) linguistics, especially conversational structure, and who is willing to collaborate with linguists on a corpus-based study. Some programming skills are necessary (e.g. Perl, Python), but the implementation will not be too challenging.

Literature:
- Marti A. Hearst. 1997. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics, 23(1):33-64.
- Lev Pevzner and Marti A. Hearst. 2002. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28(1):19-36.
- Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse Segmentation of Multi-Party Conversation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan.
- D. Field, S. Worgan, N. Webb, M. Hepple, and Y. Wilks. 2008. Automatic Induction of Dialogue Structure from the Companions Dialogue Corpus. In Proceedings of the 4th International Workshop on Human-Computer Conversation, Bellagio, Italy.
- Jaime Arguello and Carolyn Rosé. 2006. Topic Segmentation of Dialogue. In Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech, New York, NY.
- Caroline Sporleder and Mirella Lapata. 2006. Broad Coverage Paragraph Segmentation across Languages and Domains. ACM Transactions on Speech and Language Processing, 3(2):1-35.
- Diane J. Litman and Rebecca J. Passonneau. 1995. Combining Multiple Knowledge Sources for Discourse Segmentation. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 108-115.
- Jay M. Ponte and W. Bruce Croft. 1997. Text Segmentation by Topic. In Proceedings of the European Conference on Digital Libraries, pp. 113-125.
- James Ballantine. 2004. Topic Segmentation in Spoken Dialogue. BA thesis, Macquarie University.
- Rebecca J. Passonneau and Diane J. Litman. 1993. Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of ACL 1993.
Applying Algorithms from Bioinformatics to Word Sense Disambiguation (Bachelor; Master may be possible)
(Contact me if you're interested in this topic for a Master thesis.)
This thesis will be co-supervised with Jan Baumbach (Computational Systems Biology).
Motivation and Summary:
Automatic word sense disambiguation is important for most NLP applications. Yet despite many years of active research, there are still no algorithms that can reliably distinguish different word senses. One frequently used disambiguation approach tries to assign word senses in context in such a way that all senses fit together. To this end, the degree of semantic similarity between the possible senses in the context is computed, and the senses are then assigned so that the similarity is globally optimised. This global optimisation, however, is often non-trivial. The aim of this project is to test state-of-the-art clustering algorithms from bioinformatics on the problem of word sense disambiguation and to evaluate them in various scenarios.
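As an illustration of the optimisation problem described above, the following sketch exhaustively searches for the sense assignment that maximises the summed pairwise similarity in a context. The sense inventory and similarity values are invented toy data; a real system would use e.g. WordNet-based similarity measures, and the point of this project is precisely to replace the brute-force search (infeasible for longer contexts) with a clustering algorithm:

```python
from itertools import product

# Toy sense inventory and hand-made similarity table (invented values;
# in practice these would come from a lexical resource such as WordNet).
SENSES = {
    "bank":  ["bank#finance", "bank#river"],
    "money": ["money#currency"],
}

SIM = {
    ("bank#finance", "money#currency"): 0.9,
    ("bank#river", "money#currency"): 0.1,
}

def sim(a, b):
    """Symmetric lookup in the similarity table; unknown pairs score 0."""
    return SIM.get((a, b)) or SIM.get((b, a)) or 0.0

def best_assignment(words):
    """Exhaustively pick one sense per word so that the summed pairwise
    similarity is maximal (only feasible for very short contexts)."""
    best, best_score = None, float("-inf")
    for combo in product(*(SENSES[w] for w in words)):
        score = sum(sim(a, b) for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(words, best))

print(best_assignment(["bank", "money"]))
# {'bank': 'bank#finance', 'money': 'money#currency'}
```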
Requirements:
Basic programming skills (e.g. for corpus processing, interfacing with WordNet etc.) and interest in the topic. It is not necessary to implement a large system; an implementation of the clustering algorithm is already available.
Automatically Modelling the Meaning of Idioms
Expressions with idiomatic meanings such as set in stone or the penny drops pose significant challenges to NLP systems. Their linguistic behaviour usually differs from what would be expected if they were used literally. For instance, the verb drop can normally take a whole range of PP-complements, e.g. an into-PP (the plane dropped into), a below-PP (the temperature dropped below zero) or an on-PP (the gold dropped on the ground); however, if drop is used in the idiom the penny drops, PP-complements headed by into or below are relatively unlikely, and on-PPs tend to realise a different semantic role than the prototypical location role, namely the issue on which "the penny dropped" (the penny dropped on why things turn out the way they do) or the person for whom "the penny dropped" (e.g. after reaching the hotel, the penny dropped on them). While there exist some resources which list idiomatic expressions and provide information about their syntactic-semantic behaviour (e.g. idiom dictionaries), these are expensive to create manually.

The aim of this project is to investigate ways in which such information can be bootstrapped from corpora:
1. Determine the meaning of the idiom by finding words with which it frequently co-occurs (e.g. for the penny drops: realised, understand, confused, think, explain). From these it should be possible to find synonyms or near-synonyms for the idiom, e.g. realise, as these should co-occur with the same set of words.
2. Determine a suitable semantic frame for the idiom, in this case "coming_to_believe", which is the frame for realise.
3. Map the semantic roles of the frame to the syntactic complements of the (head word of the) idiom. For example, the "cognizer" role is typically realised as an on-PP for the penny drops.
The role mapping can be done by computing the semantic similarity between the observed complements and known role fillers for the "coming_to_believe" frame, which can be extracted from frame-semantically annotated corpora such as FrameNet.
This research topic builds on existing work on frame assignment to unknown words (e.g. Burchardt et al. (2005)) and role mapping (e.g. Padó et al. (2008)). The topic could also be split into two parts (i.e. meaning determination/frame assignment, and role mapping). Programming skills are essential, and some familiarity with statistical modelling/machine learning would be useful.
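The first step above (finding near-synonyms for an idiom via shared co-occurrence patterns) can be sketched as a simple vector-space comparison. The co-occurrence counts below are invented for illustration; in the project they would be extracted from a corpus:

```python
import math

# Invented co-occurrence counts (context word -> frequency); in practice
# these would be collected from corpus contexts of each expression.
COOC = {
    "the penny drops": {"realise": 5, "understand": 4, "confused": 2},
    "realise":         {"realise": 1, "understand": 6, "confused": 3},
    "eat":             {"food": 7, "hungry": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_synonym(idiom, candidates):
    """Pick the candidate whose co-occurrence profile is closest to the idiom's."""
    return max(candidates, key=lambda w: cosine(COOC[idiom], COOC[w]))

print(nearest_synonym("the penny drops", ["realise", "eat"]))
# realise
```

A candidate synonym found this way (here realise) then points to the semantic frame to be assigned in the second step.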