Text Mining for Historical Documents: Topics and Papers

Background Reading

Gunilla Budde, Dagmar Freist and Hilke Günther-Arndt (eds.), Geschichte. Studium - Wissenschaft - Beruf (Berlin, 2008); Keith Jenkins, Re-Thinking History (London, 1991); and Winfried Schulze, Einführung in die Neuere Geschichte (Stuttgart, 2002).

On inquisitors and Cathars:
Malcolm Lambert, The Cathars (Oxford, 1999) and Emmanuel Le Roy Ladurie, Montaillou: Cathars and Catholics in a French Village 1294-1324 (London, 1980).

On reading primary sources see also:
Reading primary sources: Slave narratives. With commentary by Kathryn Walbert http://www.learnnc.org/lp/editions/thinking-guide-slave-narrative/

Presentation Topics (History)

Was sind Quellen und welchen Nutzen ziehen Historiker aus ihnen?
(What are historical sources and what can a historian learn from them?)
Von der Scherbe bis zum Popsong: Typen von Quellen.
(From potsherds to pop songs: types of primary sources.)
Fundorte von Quellen.
(Locations of primary sources/How and where to find primary sources)
Historische Erkenntnis: Geschichte als historische Betrachtungsweise.
(Historical knowledge: "history" as a historical point of view)
"Historische Methode" und Quellenkritik.
(The "historical method" and the critical assessment of sources)
Die "Echtheit" von Quellen: Fallbeispiel.
(The authenticity of primary sources: case study)
Quellenkritik: Fallbeispiel.
(The critical assessment of primary sources: case study)
Die Digitalisierung von Quellen: Chancen, Probleme, Perspektiven.
(The digitization of primary sources: possibilities, problems, outlook)

Presentation Topics (Coli)

Digitization Issues
Information Extraction
- Named Entities: Background (Alexander)
- Named Entity Disambiguation and Linking (Tassilo)
- Information Extraction: Background
Semantic Web
- Semantic Web Background (Fabian)
- Inferring Meta-Data
- Ontologies (Antonia)
Multi-Modal Data
- Speech (Ghamdan)
- Images and Video
Natural Language Processing Across Domains

Papers (Coli)

1. Digitization Issues

Detection and Correction of OCR or transcription errors

Main reading:
- Stoyan Mihov, Klaus U. Schulz, Christoph Ringlstetter, Veselka Dojchinova, Vanja Nakova, Kristina Kalpakchieva, Ognjan Gerasimov, Annette Gotscharek and Claudia Gehrcke: A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques. Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05), pp. 162-166, 2005.
  pdf
- Martin Reynaert. Non-interactive OCR post-correction for giga-scale digitization projects. In A. Gelbukh (Ed.), Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008. Lecture Notes in Computer Science Vol. 4919/2008, Berlin / Heidelberg: Springer, pp. 617-630.
  pdf
Further reading (optional):
- Okan Kolak; Philip Resnik. OCR Post-Processing for Low Density Languages. EMNLP-2005.
  pdf
- Christoph Ringlstetter, Klaus U. Schulz, Stoyan Mihov and Katerina Louka: The Same is Not The Same - Postcorrection of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition. Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05), pp. 406-410, 2005.
  pdf
Dealing with non-standard orthography

Main reading:
- Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, Christiane Wanzeck. Information Access to Historical Documents from the Early New High German Period. In: L. Burnard, M. Dobreva, N. Fuhr, A. Lüdeling (eds): Digital Historical Corpora - Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, 2007.
  pdf
- Loes Braun, Floris Wiesman, Ida Sprinkhuizen-Kuyper. Information Retrieval from Historical Corpora. Proceedings of the 3rd Dutch-Belgian Information Retrieval Workshop (DIR), Leuven, Belgium, pp. 106-112.
  pdf
Further reading (optional):
- Peter Schneider. Computer assisted spelling normalization of 18th century English. Language and Computers, New Frontiers of Corpus Research. Papers from the Twenty First International Conference on English Language Research on Computerized Corpora Sydney 2000. Pam Peters, Peter Collins and Adam Smith (eds.), pp. 199-211.
  pdf
- Andrea Ernst-Gerlach, Norbert Fuhr. Retrieval in text collections with historic spelling using linguistic and spelling variants. Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. 2007. pp. 333-341.
  pdf
- Thomas Pilz, Wolfram Luther, Norbert Fuhr, Ulrich Ammon. Rule-based Search in Text Databases with Nonstandard Orthography. Literary and Linguistic Computing 2006 21(2):179-186.
  pdf (you need to be logged into a UdS account to access this paper)
- Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. A cross-language approach to historic document retrieval. In Mounia Lalmas, Stefan M. Rüger, Theodora Tsikrika, and Alexei Yavlinsky, editors, Advances in Information Retrieval: 28th European Conference on IR Research (ECIR 2006), volume 3936 of Lecture Notes in Computer Science, pages 407-419. Springer Verlag, Heidelberg, 2006.
  pdf
Detection of inclusions in other languages
- Beatrice Alex. Integrating Language Knowledge Resources to Extend the English Inclusion Classifier to a New Language. In: Proceedings of LREC 2006, Genoa, Italy.
  pdf
- Beatrice Alex. An Unsupervised System for Identifying English Inclusions in German Text. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005) - Student Research Workshop. Ann Arbor, Michigan.
  pdf

2. Information Extraction

Named Entities: Background
- David Nadeau, Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 2007.
  pdf
- Kate Byrne. Nested Named Entity Recognition in Historical Archive Text. ICSC2007, IEEE International Conference on Semantic Computing, Irvine, California.
  pdf
Named Entity Disambiguation and Linking
- Gideon Mann; David Yarowsky. Unsupervised Personal Name Disambiguation. CoNLL-03. 2003
  pdf
- Michael Fleischman; Eduard Hovy. Multi-Document Person Name Resolution. Proceedings of the ACL-2004 Workshop on Reference Resolution and Its Applications. 2004.
  pdf
- Amit Bagga; Breck Baldwin. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. ACL-COLING-1998. 1998.
  pdf
Information Extraction: Background

Main Reading
- Ralph Grishman. Information extraction: techniques and challenges. Information Extraction (International Summer School SCIE-97). 1997.
  pdf
- Brin, Sergey (1998). Extracting Patterns and Relations from the World Wide Web. WebDB Workshop at EDBT'98.
  pdf
Further Reading (optional):
- Patwardhan, S. and Riloff, E. (2006) "Learning Domain-Specific Information Extraction Patterns from the Web", ACL 2006 Workshop on Information Extraction Beyond the Document.
  pdf

3. Semantic Web

Semantic Web Background
- Antoine Isaac, Henk Matthezing, Stefan Schlobach, Claus Zinn. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. Library Review, 57(3), March 2008.
  pdf
- Sean B. Palmer. The Semantic Web: An Introduction. 2001.
  http://infomesh.net/2001/swintro/
Inferring Meta-Data
- Gazendam, L., Malaisé, V., Schreiber, G. and Brugman, H. (2006). Deriving Semantic Annotations of an audiovisual program from contextual texts. To appear in Proceedings of First International workshop on Semantic Web Annotations for Multimedia (SWAMM 2006). 23 May 2006, Edinburgh, Scotland.
  pdf
- Veronique Malaisé, Antoine Isaac, Luit Gazendam and Hennie Brugman (2007). Anchoring Dutch Cultural Heritage Thesauri to WordNet: two case studies. LATECH'07, ACL 2007 Workshop, Prague, June 28th 2007.
  pdf
Ontologies

Main Reading
- Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum. "Yago - A Large Ontology from Wikipedia and WordNet" Elsevier Journal of Web Semantics. 2008.
  pdf
Further Reading (optional):
- Fabio Ciravegna, Sam Chapman, Alexiei Dingli and Yorick Wilks, Learning to Harvest Information for the Semantic Web, in Proceedings of the 1st European Semantic Web Symposium, Heraklion, Greece, May 10-12, 2004.
  pdf
- David Schlangen; Manfred Stede; Elena Paslaru Bontas. Feeding OWL: Extracting and Representing the Content of Pathology Reports. Proceedings of the Workshop on NLP and XML (NLPXML-2004): RDF/RDFS and OWL in Language Technology. 2004.
  pdf

4. Multi-Modal Data

Speech
- Van der Werff, L.B. and Heeren, W.F.L. and Ordelman, R.J.F. and de Jong, F.M.G. (2007) Radio Oranje: Enhanced Access to a Historical Spoken Word Collection. In: Proceedings of the 17th Meeting of Computational Linguistics in the Netherlands, 12 Jan 2007, Leuven, Belgium. pp. 207-218
  pdf
- Samuel Gustman, Dagobert Soergel, Douglas Oard, William Byrne, Michael Picheny, Bhuvana Ramabhadran, Douglas Greenberg. Supporting access to large digital oral history archives. Proceedings of the Joint Conference on Digital Libraries. 2002.
  pdf
Images and Video
- Cees G.M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, and Marcel Worring. Adding Semantics to Detectors for Video Retrieval. IEEE Transactions on Multimedia, 2007.
  pdf
- A. Popescu, H. Le Borgne, P.-A. Moëllic. Conceptual Image Retrieval over the Wikipedia Corpus. Working notes for the CLEF 2008 Workshop, Aarhus, Denmark, 17-19 September 2008.
  pdf