Computational Linguistics & Phonetics
Fachrichtung 4.7, Universität des Saarlandes

Unlocking the Secrets of the Past: Text Mining for Historical Documents (WS 2011/12)

What? Project seminar (Projektseminar): Computational Linguistics (BSc and MSc)
Who? Caroline Sporleder (csporled AT coli)
(and possibly Martin Schreiber, Kultur- u. Mediengeschichte)
When? 3-week block seminar (Blockseminar), 22.2.-09.3.2012
Where? Building C7.2, conference room (Konferenzraum) 2.11 (coming from the Campus bus stop, take the stairs on the right as you enter; the room is on the first floor)

Course Information
This course offers hands-on experience with specific text mining tasks, such as named entity recognition and disambiguation, relation extraction and template filling, segmentation of semi-structured text, automatic link detection between documents, and error detection and correction. The text mining techniques will be implemented and tested on real-world examples from the cultural heritage domain, such as historical documents. The cultural heritage domain is a good testbed for NLP methods because a wealth of information in this domain is contained in raw, unprocessed and often relatively unstructured texts (in contrast to the biomedical domain, where a lot of data is already in a fairly structured form). Text mining can make such documents more accessible to researchers and laypersons alike. Moreover, language change over time, unorthodox orthography, and errors introduced during digitisation (e.g. OCR errors) make this domain particularly challenging (and thus interesting!) for natural language processing.

Course Structure
This is an interdisciplinary course that is open to students from Computational Linguistics / Computer Science as well as students from History. The aim is to design, implement and test practical NLP and text mining solutions that make historical documents more accessible. Possible topics include: detecting and correcting (OCR) errors, information extraction from historical manuscripts, finding links between documents, converting unstructured documents into searchable databases, and knowledge discovery from historical documents.
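To give a flavour of one of the topics listed above, here is a minimal sketch of lexicon-based OCR error correction in Python. It is only an illustration under stated assumptions, not the method used in the course: the tiny word list `LEXICON`, the function names `correct_token` and `correct_line`, and the similarity cutoff of 0.75 are all hypothetical choices. A real system would need a large, period-appropriate lexicon and would have to cope with historical spelling variation.

```python
import difflib

# Hypothetical mini-lexicon (assumption for illustration only); a real system
# would use a large word list suited to the period and language of the documents.
LEXICON = ["the", "church", "parish", "register", "baptism", "marriage", "burial"]


def correct_token(token, lexicon=LEXICON, cutoff=0.75):
    """Return the most similar lexicon entry, or the token itself if none is close enough."""
    if token.lower() in lexicon:
        # Known word: leave it untouched (including its capitalisation).
        return token
    # get_close_matches ranks candidates by difflib's SequenceMatcher ratio
    # and drops everything below the cutoff.
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line):
    """Correct a whitespace-tokenised line token by token."""
    return " ".join(correct_token(tok) for tok in line.split())


if __name__ == "__main__":
    # "parlsh" (i read as l) and "regifter" (long s read as f) are typical OCR confusions.
    print(correct_line("the parlsh regifter"))
```

Tokens with no sufficiently similar lexicon entry are returned unchanged, which keeps the sketch from "correcting" names and rare words it does not know; choosing that cutoff well is exactly the kind of design question the practical part would have to address.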

The course consists of a theoretical and a practical part. In the theoretical part, students give a presentation on topics relevant to the course. In the practical part, small interdisciplinary groups will work on implementing a system that solves a real problem relevant for the documents discussed in the seminar.

Course Objectives
  • obtain hands-on experience with text mining techniques (design, implementation, testing)
  • learn about specific problems and challenges that arise when developing NLP tools for the cultural heritage domain
  • work in interdisciplinary teams (finding out what users of NLP technology want, communicating with non-experts, developing solutions according to user specifications)
Course Requirements
  • for Coli students: ability to program and to work with existing NLP tools (i.e., you should be able to figure out yourself how to install and run them without requiring too much help); for CS students, I assume this as given ;-)
  • for CS students: interest in natural language processing and willingness to familiarise yourself with the basic concepts in a relatively short time; for Coli students, I assume this as given ;-)
  • for all: interest in history / historical data; ability and willingness to work in a team
Course Credit (Coli)
  • Project seminar (MSc/BSc): class presentation and practical work including a project report (an additional oral exam can be arranged)
Place in the Curriculum (Coli)
  • as project seminar in the B.Sc. programme: 5th/6th semester of the standard period of study
  • as project seminar in the M.Sc. programme
Credit Points (Coli)
  • as project seminar (MSc/BSc): 5 CP
Course Credit (Computer Science)
  • Seminar (MSc/BSc): class presentation and practical work including a project report, plus another shorter report on the theoretical part
Credit Points (Computer Science)
  • 7 CP (note: this is more than for Coli, hence the additional short report)