Unlocking the Secrets of the Past: Text Mining for Historical Documents (WS 2009/10)

Topics and Papers

Presentation Topics

Pre-Processing
- OCR Errors
- Detection and Correction of OCR errors (1) (Souhail Bouricha)
- Detection and Correction of OCR errors (2) (Michal Richter)
Non-Standard Language
- Dealing with non-standard orthography (1) (Todd Shore)
- Dealing with non-standard orthography (2) (Johannes Braunias)
- Adapting NLP Tools (1) (Yevgeni Berzak)
- Adapting NLP Tools (2)
Preservation of Digital Data
- Preservation Issues (Sebastian Steenbuck)
Semantic Web
- Semantic Web Background (Hu Jingwen)
- Ontologies (Weijia Shao)
- Vocabulary Alignment
Meta-Data
- Inferring Meta-Data (1)
- Inferring Meta-Data (2) (Daniel Müller)
Information Extraction Basics
- Named Entity Recognition (Uwe Boltz)
- Named Entity Disambiguation and Linking (Andreas Schwarte)
Text Mining
- Converting Fieldbooks to Databases (Carsten Ehrler)
- Event Recognition (1) (Alberto Gonzalez Palomo)
- Event Recognition (2) (Chenhua Chen)
- User Studies
Multi-Modal Data
- Speech
- Images and Video (Christopher Haccius)
Personalisation
- Personalisation (1)
- Personalisation (2) (Sven Steudter)

Papers

Background Reading on Digital Cultural Heritage

(no presentation)

Howard Besser. The Transformation of the Museum and the Way it's Perceived. In: Katherine Jones-Garmil (ed.), The Wired Museum, Washington: American Association of Museums, pages 153-169, 1997.
html (pre-publication version)
Howard Besser. The Changing Role of Photographic Collections With the Advent of Digitization. In: Katherine Jones-Garmil (ed.), The Wired Museum, Washington: American Association of Museums, pages 115-127, 1997.
html (pre-publication version)

Pre-Processing

Background Reading (no presentation)
- Haigh, Susan (1996). Optical character recognition (OCR) as a digitization technology. Network Notes no. 37, Information Technology Services, National Library of Canada.
  html
OCR Errors
- Daniel Lopresti. Performance evaluation for text processing of noisy inputs. In Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), pages 759-763, 2005.
  pdf
- Daniel Lopresti. Optical character recognition errors and their effects on natural language processing. In Proceedings of the ACM SIGIR Workshop on Analytics for Noisy Unstructured Text Data, pages 9-16, 2008.
  pdf
Detection and Correction of OCR errors (1)
- Martin Reynaert. Non-interactive OCR post-correction for giga-scale digitization projects. In A. Gelbukh (Ed.), Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008. Lecture Notes in Computer Science Vol. 4919/2008, Berlin / Heidelberg: Springer, pp. 617-630.
  pdf
Detection and Correction of OCR errors (2)
- Christoph Ringlstetter, Klaus U. Schulz, Stoyan Mihov and Katerina Louka: The Same is Not The Same - Postcorrection of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition. Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05), pp. 406-410, 2005.
  pdf
- Okan Kolak; Philip Resnik. OCR Post-Processing for Low Density Languages. EMNLP-2005.
  pdf

Non-Standard Language

Background Reading (no presentation)
- Marco Pennacchiotti, Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian. In Proceedings of GOTAL 2008. Gothenburg, Sweden. August, 2008.
  pdf
Non-Standard Orthography (1)
- Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, Christiane Wanzeck. Information Access to Historical Documents from the Early New High German Period. In: L. Burnard, M. Dobreva, N. Fuhr, A. Lüdeling (eds): Digital Historical Corpora - Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, 2007.
  pdf
Non-Standard Orthography (2)
- Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. A cross-language approach to historic document retrieval. In Mounia Lalmas, Stefan M. Rüger, Theodora Tsikrika, and Alexei Yavlinsky, editors, Advances in Information Retrieval: 28th European Conference on IR Research (ECIR 2006), volume 3936 of Lecture Notes in Computer Science, pages 407-419. Springer Verlag, Heidelberg, 2006.
  pdf
Adapting NLP Tools (1)
- Taesun Moon and Jason Baldridge. 2007. Part-of-Speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts. In Proceedings of EMNLP/CONLL-2007. Prague.
  pdf
Adapting NLP Tools (2)
- Eiríkur Rögnvaldsson and Sigrún Helgadóttir. Morphological tagging of Old Norse texts and its use in studying syntactic variation and change. Proceedings of the LREC-08 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 2008.
  pdf

Preservation of Digital Data

Background Reading (no presentation)
- The Digital Preservation Management Tutorial hosted by the Inter-university Consortium for Political and Social Research (ICPSR)
  html
- Myron P. Gutmann, Nancy Y. McGovern, Bryan Beecher, T.E. Raghunathan. How Safe is Safe Enough when we Preserve Social Science Data? Third International Conference on e-Social Science, Ann Arbor, MI, 2007.
  pdf
- Bootie Cosgrove-Mather: Coming Soon: A Digital Dark Age? Digital Memory Threatened As File Formats Evolve.
  html
Preservation Issues
- Howard Besser. Digital longevity. In: Maxine Sitts (ed.) Handbook for Digital Projects: A Management Tool for Preservation and Access, Andover MA: Northeast Document Conservation Center, 2000, pages 155-166.
  html
- Mary Baker, Mema Roussopoulos, Mehul Shah, Petros Maniatis, Prashanth Bungale, TJ Giuli, David S. H. Rosenthal: A Fresh Look at the Reliability of Long-term Digital Storage. In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006.
  pdf
Note: If you choose this topic, you do not have to present the details of the model introduced in the second paper. Focus instead on the risks to digital data and what can be done to minimise them.

Semantic Web

Semantic Web Background
- Antoine Isaac, Henk Matthezing, Stefan Schlobach, Claus Zinn. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. Library Review, 57(3), March 2008.
  html
- Sean B. Palmer. The Semantic Web: An Introduction. 2001.
  http://infomesh.net/2001/swintro/
Ontologies
- Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum. "Yago - A Large Ontology from Wikipedia and WordNet" Elsevier Journal of Web Semantics. 2008.
  pdf
Inferring Meta-Data (1)
- Veronique Malaisé, Antoine Isaac, Luit Gazendam and Hennie Brugman (2007). Anchoring Dutch Cultural Heritage Thesauri to WordNet: two case studies. LATECH'07, ACL 2007 Workshop, Prague, June 28th 2007.
  pdf
Inferring Meta-Data (2)
- Tandeep Sidhu; Judith Klavans; Jimmy Lin. Concept Disambiguation for Improved Subject Access Using Multiple Knowledge Sources. In: Proceedings of the ACL Workshop on Language Technology for Cultural Heritage Data (LaTeCH-07), 2007.
  pdf
Vocabulary Alignment
- Isaac, Antoine; Schlobach, Stefan; Matthezing, Henk; Zinn, Claus. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. In Library Review, Volume 57, Number 3, 2008, pp. 187-199.
  pdf

Information Extraction Basics

Named Entity Recognition
- David Nadeau, Satoshi Sekine. A survey of named entity recognition and classification. Journal of Linguisticae Investigationes 30:1, 2007.
  pdf
- Kate Byrne. Nested Named Entity Recognition in Historical Archive Text. ICSC2007, IEEE International Conference on Semantic Computing, Irvine, California.
  pdf
Named Entity Disambiguation and Linking (one of the following:)
- Gideon Mann; David Yarowsky. Unsupervised Personal Name Disambiguation. CoNLL-03. 2003
  pdf
- Michael Fleischman; Eduard Hovy. Multi-Document Person Name Resolution. Proceedings of the ACL-2004 Workshop on Reference Resolution and Its Applications. 2004.
  pdf
- Amit Bagga; Breck Baldwin. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. ACL-COLING-1998. 1998.
  pdf

Text Mining

Converting Fieldbooks to Databases (one of the following:)
- Piroska Lendvai and Steve Hunt. From field notes towards a knowledge base. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, 2008.
  pdf
- Sander Canisius and Caroline Sporleder. Bootstrapping information extraction from field books. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 827-836.
  pdf
Event Recognition (1)
- Kate Byrne and Ewan Klein. Automatic extraction of archaeological events from text. In Computer Applications in Archaeology (CAA-09), 2009.
  pdf
Event Recognition (2)
- Tuukka Ruotsalo, Lora Aroyo, Guus Schreiber. Knowledge-Based Linguistic Annotation of Digital Cultural Heritage Collections. In IEEE Intelligent Systems 24:2, pp. 64-75.
  pdf
  (If you have problems viewing and printing this paper, you can get a hard copy from me.)
User Studies
- B. Alex, C. Grover, B. Haddow, M. Kabadjov, E. Klein, M. Matthews, S. Roebuck, R. Tobin, X. Wang. Assisted curation: does text mining really help? In Pacific Symposium on Biocomputing, 2008, pp. 556-567.
  pdf

Multi-Modal Data

Speech (one of the following:)
- Van der Werff, L.B. and Heeren, W.F.L. and Ordelman, R.J.F. and de Jong, F.M.G. (2007) Radio Oranje: Enhanced Access to a Historical Spoken Word Collection. In: Proceedings of the 17th Meeting of Computational Linguistics in the Netherlands, 12 Jan 2007, Leuven, Belgium, pp. 207-218.
  pdf
- Samuel Gustman, Dagobert Soergel, Douglas Oard, William Byrne, Michael Picheny, Bhuvana Ramabhadran, Douglas Greenberg. Supporting access to large digital oral history archives. Proceedings of the Joint Conference on Digital Libraries. 2002.
  pdf
Images and Video
- L. Gazendam, V. Malaisé, A. de Jong, C. Wartena, H. Brugman, and G. Schreiber. Automatic annotation suggestions for audiovisual archives: Evaluation aspects. J. Interdisciplinary Science Reviews, 2009.
  pdf

Personalisation

Personalisation (1)
- Ion Androutsopoulos, Vassiliki Kokkinaki, Aggeliki Dimitromanolaki, Jo Calder, Jon Oberlander, Elena Not. Generating Multilingual Personalized Descriptions of Museum Exhibits - The M-PIRO Project. In: Proceedings of the 29th Conference on Computer Applications and Quantitative Methods in Archaeology, Gotland, Sweden, 2001.
  pdf
Note: Possibly useful background reading:
- I. Androutsopoulos, J. Oberlander and V. Karkaletsis, "Source Authoring for Multilingual Generation of Personalised Object Descriptions". Natural Language Engineering, 13(3):191-233, 2007.
  pdf
- A. Isard, J. Oberlander, I. Androutsopoulos and C. Matheson, "Speaking the Users' Languages". IEEE Intelligent Systems, special issue on "Advances in Natural Language Processing", 18(1):40-45, 2003.
  pdf
Personalisation (2)
- Karl Grieser; Timothy Baldwin; Steven Bird. Dynamic Path Prediction and Recommendation in a Museum Environment. In: Proceedings of the ACL Workshop on Language Technology for Cultural Heritage Data (LaTeCH-07), 2007.
  pdf