Unlocking the Secrets of the Past: Text Mining for Historical Documents (WS 2010/11)

Topics and Papers

Presentation Topics

Pre-Processing
Non-Standard Language
Preservation of Digital Data
- Preservation Issues
Semantic Web
Meta-Data
- Inferring Meta-Data (1)
- Inferring Meta-Data (2)
Information Extraction Basics
Text Mining
Multi-Modal Data
- Speech
- Images and Video
Personalisation
- Personalisation (1)
- Personalisation (2)

Papers

Background Reading on NLP for Cultural Heritage

(no presentation)

Caroline Sporleder. Natural Language Processing for Cultural Heritage Domains. Language and Linguistics Compass, Vol 4, Issue 9, September 2010, pp. 750-768, Wiley-Blackwell.
pre-print

Background Reading on Digital Cultural Heritage

(no presentation)

Howard Besser. The Transformation of the Museum and the Way it's Perceived. In: Katherine Jones-Garmil (ed.), The Wired Museum, Washington: American Association of Museums, pages 153-169, 1997.
html (pre-publication version)
Howard Besser. The Changing Role of Photographic Collections With the Advent of Digitization. In: Katherine Jones-Garmil (ed.), The Wired Museum, Washington: American Association of Museums, pages 115-127, 1997.
html (pre-publication version)

Pre-Processing

Background Reading (no presentation)
- Haigh, Susan (1996). Optical character recognition (OCR) as a digitization technology. Network Notes no. 37, Information Technology Services, National Library of Canada.
  html
OCR Errors Michael Barz
Both of the following:
- Daniel Lopresti. Performance evaluation for text processing of noisy inputs. In Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), pages 759-763, 2005.
  pdf
- Daniel Lopresti. Optical character recognition errors and their effects on natural language processing. In Proceedings of the ACM SIGIR Workshop on Analytics for Noisy Unstructured Text Data, pages 9-16, 2008.
  pdf
Detection and Correction of OCR errors (1) Cornelius Leidinger
Choose one of the following:
- Martin Reynaert. Non-interactive OCR post-correction for giga-scale digitization projects. In A. Gelbukh (Ed.), Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008. Lecture Notes in Computer Science Vol. 4919/2008, Berlin / Heidelberg: Springer, pp. 617-630.
  pdf
- Martin Volk, Lenz Furrer, and Rico Sennrich. Strategies for Reducing and Correcting OCR Errors. In: C. Sporleder, A. van den Bosch, and K. Zervanou (Eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH Workshop Series, Series: Theory and Applications of Natural Language Processing. Heidelberg: Springer, 2011.
  pdf
Detection and Correction of OCR errors (2) Jonas Hempel
One or both of the following:
- Christoph Ringlstetter, Klaus U. Schulz, Stoyan Mihov and Katerina Louka. The Same is Not The Same - Postcorrection of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition. Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05), pp. 406-410, 2005.
  pdf
- Okan Kolak; Philip Resnik. OCR Post-Processing for Low Density Languages. EMNLP-2005.
  pdf

Non-Standard Language

Background Reading (no presentation)
- Marco Pennacchiotti, Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian. In Proceedings of GOTAL 2008. Gothenburg, Sweden. August, 2008.
  pdf
Non-Standard Orthography (1) Mariona Coll Ardanuy
Choose one or more of the following:
- Andreas Hauser, Markus Heller, Elisabeth Leiss, Klaus U. Schulz, Christiane Wanzeck. Information Access to Historical Documents from the Early New High German Period. In: L. Burnard, M. Dobreva, N. Fuhr, A. Lüdeling (eds): Digital Historical Corpora - Architecture, Annotation, and Retrieval. Dagstuhl Seminar Proceedings, 2007.
  pdf
- Marcel Bollmann; Florian Petran; Stefanie Dipper. Rule-Based Normalization of Historical Texts. In: Proceedings of the RANLP-2011 Workshop on Language Technologies for Digital Humanities and Cultural Heritage, pp. 34-42, 2011.
  pdf
Non-Standard Orthography (2)
- Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. A cross-language approach to historic document retrieval. In Mounia Lalmas, Stefan M. Rüger, Theodora Tsikrika, and Alexei Yavlinsky, editors, Advances in Information Retrieval: 28th European Conference on IR Research (ECIR 2006), volume 3936 of Lecture Notes in Computer Science, pages 407-419. Springer Verlag, Heidelberg, 2006.
  pdf
Adapting NLP Tools (1)
Choose one of the following:
- Taesun Moon and Jason Baldridge. 2007. Part-of-Speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts. In Proceedings of EMNLP/CONLL-2007. Prague.
  pdf
- Jirka Hana; Anna Feldman; Katsiaryna Aharodnik. A low-budget tagger for Old Czech. In: Proceedings of the ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011). pp. 10-18, 2011.
  pdf
Adapting NLP Tools (2)
- Eiríkur Rögnvaldsson and Sigrún Helgadóttir. Morphological tagging of Old Norse texts and its use in studying syntactic variation and change. Proceedings of the LREC-08 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 2008.
  pdf
- Eiríkur Rögnvaldsson and Sigrún Helgadóttir. Morphological tagging of Old Icelandic texts and its use in studying syntactic variation and change. In: C. Sporleder, A. van den Bosch, and K. Zervanou (Eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH Workshop Series, Series: Theory and Applications of Natural Language Processing. Heidelberg: Springer, 2011.
  pdf
Note: These two papers present essentially the same approach, so you can make use of both when preparing your presentation.
Adapting NLP Tools (3) Nikolina Koleva
- Nils Reiter, Oliver Hellwig, Anette Frank, Irina Gossmann, Borayin Maitreya Larios, Julio Rodrigues, and Britta Zeller. Adapting NLP Tools and Frame-Semantic Resources for the Semantic Analysis of Ritual Descriptions. In: C. Sporleder, A. van den Bosch, and K. Zervanou (Eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH Workshop Series, Series: Theory and Applications of Natural Language Processing. Heidelberg: Springer, 2011.
  pdf

Preservation of Digital Data

Background Reading (no presentation)
- The Digital Preservation Management Tutorial hosted by the Inter-university Consortium for Political and Social Research (ICPSR)
  html
- Myron P. Gutmann, Nancy Y. McGovern, Bryan Beecher, T.E. Raghunathan. How Safe is Safe Enough when we Preserve Social Science Data? Third International Conference on e-Social Science, Ann Arbor, MI, 2007.
  pdf
- Bootie Cosgrove-Mather: Coming Soon: A Digital Dark Age? Digital Memory Threatened As File Formats Evolve.
  html
- Ray A. Williamson. The Opportunities and Challenges of Preservation Technologies. Archives and Museum Informatics 13, pp. 211-225, 1999/2001.
  pdf
Preservation Issues Christian Wellner
- Howard Besser. Digital longevity. In: Maxine Sitts (ed.) Handbook for Digital Projects: A Management Tool for Preservation and Access, Andover MA: Northeast Document Conservation Center, 2000, pages 155-166
  html
- Mary Baker, Mema Roussopoulos, Mehul Shah, Petros Maniatis, Prashanth Bungale, TJ Giuli, David S. H. Rosenthal. A Fresh Look at the Reliability of Long-term Digital Storage. In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006.
  pdf
Note: If you choose this topic you don't have to present the details of the model introduced in the second paper. What you should talk about are the risks to digital data and what can be done to minimise them.

Semantic Web

Semantic Web Background
- Sean B. Palmer. The Semantic Web: An Introduction. 2001.
  http://infomesh.net/2001/swintro/
Semantic Web Peter Stahl
Choose one (or more) of the following. The first one together with a general semantic web overview might be a good choice.
- V. R. Benjamins, J. Contreras, M. Blázquez, J. M. Dodero, A. Garcia, E. Navas, F. Hernandez and C. Wert. Cultural Heritage and the Semantic Web. The Semantic Web: Research and Applications Lecture Notes in Computer Science, 2004, Volume 3053/2004, 433-444.
  pdf
- Shenghui Wang, Antoine Isaac, Stefan Schlobach, Lourens van der Meij, Balthasar Schopman. Instance-based Semantic Interoperability in the Cultural Heritage. The Semantic Web Journal. 3:1, pp. 45-64, 2012.
  pdf
- Antoine Isaac, Henk Matthezing, Stefan Schlobach, Claus Zinn. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. Library Review, 57(3), March 2008.
  html
- Dana Dannèlls; Mariana Damova; Ramona Enache; Milen Chechev. A Framework for Improved Access to Museum Databases in the Semantic Web. In: Proceedings of the RANLP-2011 Workshop on Language Technologies for Digital Humanities and Cultural Heritage. 2011, pp. 3-10.
  pdf
- Karl Grieser, Timothy Baldwin, Fabian Bohnert, Liz Sonenberg. Using ontological and document similarity to estimate museum exhibit relatedness. Journal on Computing and Cultural Heritage (JOCCH), Volume 3 Issue 3, March 2011
  pdf
- Michael Ashley, Ruth Tringham, Cinzia Perlingieri. Last House on the Hill: Digitally remediating data and media for preservation and access. Journal on Computing and Cultural Heritage (JOCCH), Volume 4 Issue 4, December 2011
  pdf
Ontologies Besnik Fetahu
- Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum. "Yago - A Large Ontology from Wikipedia and WordNet" Elsevier Journal of Web Semantics. 2008.
  pdf
Inferring Meta-Data (1) Ehsan Khoddammohammadi (this topic or the next)
- Veronique Malaisé, Antoine Isaac, Luit Gazendam and Hennie Brugman (2007). Anchoring Dutch Cultural Heritage Thesauri to WordNet: two case studies. LATECH'07, ACL 2007 Workshop, Prague, June 28th 2007.
  pdf
Inferring Meta-Data (2)
- Tandeep Sidhu; Judith Klavans; Jimmy Lin. Concept Disambiguation for Improved Subject Access Using Multiple Knowledge Sources. In: Proceedings of the ACL Workshop on Language Technology for Cultural Heritage Data (LaTeCH-07), 2007.
  pdf
Vocabulary Alignment
- Isaac, Antoine; Schlobach, Stefan; Matthezing, Henk; Zinn, Claus. Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. In Library Review, Volume 57, Number 3, 2008, pp. 187-199.
  pdf

Information Extraction Basics

Named Entity Recognition Farzaneh Ansari
- David Nadeau, Satoshi Sekine. A survey of named entity recognition and classification. Journal of Linguisticae Investigationes 30:1, 2007.
  pdf
- Kate Byrne. Nested Named Entity Recognition in Historical Archive Text. ICSC2007, IEEE International Conference on Semantic Computing, Irvine, California.
  pdf
Note: the first paper is an overview paper; the main focus of the presentation should be on the second paper.
Named Entity Disambiguation and Linking Silas Weinbach

Choose one (or more) of the following papers:
- Gideon Mann; David Yarowsky. Unsupervised Personal Name Disambiguation. CoNLL-03. 2003
  pdf
- Michael Fleischman; Eduard Hovy. Multi-Document Person Name Resolution. Proceedings of the ACL-2004 Workshop on Reference Resolution and Its Applications. 2004.
  pdf
- Amit Bagga; Breck Baldwin. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. ACL-COLING-1998. 1998.
  pdf
More Information Extraction Mehdi Hosseini

Choose one of the following:
- Paul Clough, Neil Ireson, Jennifer Marlow. Extending Domain-Specific Resources to Enable Semantic Access to Cultural Heritage Data. Journal of Digital Information, Vol 10, No 6 (2009).
  html
- Lars Borin, Dimitrios Kokkinakis, Leif J. Olsson. Naming the Past: Named Entity and Animacy Recognition in 19th Century Swedish Literature. In Proceedings of the ACL 2007 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), 2007, pp. 1-8.
  pdf
- David Elson; Nicholas Dames; Kathleen McKeown. Extracting Social Networks from Literary Fiction. Proceedings of ACL-2010, pp. 138-147, 2010.
  pdf
- David Elson and Kathleen McKeown. Automatic attribution of quoted speech in literary narrative. In AAAI 2010, Atlanta, Georgia, 2010.
  pdf

Text Mining

Converting Fieldbooks to Databases Ervina Cergani
Choose one (or more) of the following:
- Piroska Lendvai and Steve Hunt. From field notes towards a knowledge base. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, 2008.
  pdf
- Sander Canisius and Caroline Sporleder. Bootstrapping information extraction from field books. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, pp. 827-836.
  pdf
Event Recognition (1) Mainack Mondal
- Kate Byrne and Ewan Klein. Automatic extraction of archaeological events from text. In Computer Applications in Archaeology (CAA-09), 2009.
  pdf
- Kate Byrne. Putting Hybrid Cultural Data on the Semantic Web. Journal of Digital Information, Vol 10, No 6 (2009)
  pdf
Note: both papers essentially present the same approach so you can make use of both for your presentation.
Event Recognition (2)
- Tuukka Ruotsalo, Lora Aroyo, Guus Schreiber. Knowledge-Based Linguistic Annotation of Digital Cultural Heritage Collections. In IEEE Intelligent Systems 24:2, pp. 64-75.
  pdf
  (If you have problems viewing and printing this paper, you can get a hard copy from me.)
Automatic Text Analysis
Choose one of the following:
- Sravana Reddy; Kevin Knight. What We Know About The Voynich Manuscript. In: Proceedings of the ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH-2011), pp. 78-86, 2011.
  pdf
- Tze-I Yang; Andrew Torget; Rada Mihalcea. Topic Modeling on Historical Newspapers. In: Proceedings of the ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH-2011), pp. 96-104, 2011.
  pdf
- Saif Mohammad. From Once Upon a Time to Happily Ever After: Tracking Emotions in Novels and Fairy Tales. In: Proceedings of the ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH-2011), pp. 105-114, 2011.
  pdf
User Studies Benedict Fehringer
- B. Alex, C. Grover, B. Haddow, M. Kabadjov, E. Klein, M. Matthews, S. Roebuck, R. Tobin, X. Wang. Assisted curation: does text mining really help? In: Pacific Symposium on Biocomputing (2008), pp. 556-567.
  pdf

Multi-Modal Data

Speech Krishna Narasimhan
Choose one (or more) of the following:
- Van der Werff, L.B. and Heeren, W.F.L. and Ordelman, R.J.F. and de Jong, F.M.G. (2007) Radio Oranje: Enhanced Access to a Historical Spoken Word Collection. In: Proceedings of the 17th Meeting of Computational Linguistics in the Netherlands, 12 Jan 2007, Leuven, Belgium. pp. 207-218
  pdf
- Samuel Gustman, Dagobert Soergel, Douglas Oard, William Byrne, Michael Picheny, Bhuvana Ramabhadran, Douglas Greenberg. Supporting access to large digital oral history archives. Proceedings of the Joint Conference on Digital Libraries. 2002.
  pdf
Images and Video Evangelia Kiagia
- L. Gazendam, V. Malaisé, A. de Jong, C. Wartena, H. Brugman, and G. Schreiber. Automatic annotation suggestions for audiovisual archives: Evaluation aspects. J. Interdisciplinary Science Reviews, 2009.
  pdf

Personalisation

Personalisation (1)
Choose one of the following two:
- Ion Androutsopoulos, Vassiliki Kokkinaki, Aggeliki Dimitromanolaki, Jo Calder, Jon Oberlander, Elena Not. Generating Multilingual Personalized Descriptions of Museum Exhibits - The M-PIRO Project. In: Proceedings of the 29th Conference on Computer Applications and Quantitative Methods in Archaeology, Gotland, Sweden, 2001.
  pdf
- Stasinos Konstantopoulos, Vangelis Karkaletsis, Dimitrios Vogiatzis, and Dimitris Bilidas. Authoring Semantic and Linguistic Knowledge for the Dynamic Generation of Personalized Descriptions. In: C. Sporleder, A. van den Bosch, and K. Zervanou (Eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH Workshop Series, Series: Theory and Applications of Natural Language Processing. Heidelberg: Springer, 2011.
  pdf
Note: Possibly useful background reading:
- I. Androutsopoulos, J. Oberlander and V. Karkaletsis, "Source Authoring for Multilingual Generation of Personalised Object Descriptions". Natural Language Engineering, 13(3):191-233, 2007.
  pdf
- A. Isard, J. Oberlander, I. Androutsopoulos and C. Matheson, "Speaking the Users' Languages". IEEE Intelligent Systems, special issue on "Advances in Natural Language Processing", 18(1):40-45, 2003.
  pdf
Personalisation (2)
Choose one of the following:
- Karl Grieser; Timothy Baldwin; Steven Bird. Dynamic Path Prediction and Recommendation in a Museum Environment. In: Proceedings of the ACL Workshop on Language Technology for Cultural Heritage Data (LaTeCH-07), 2007.
  pdf
- Angeliki Antoniou, George Lepouras. Modeling visitors' profiles: A study to investigate adaptation aspects for museum learning technologies. Journal on Computing and Cultural Heritage (JOCCH) Volume 3 Issue 2, September 2010.
  pdf
  Note: This paper isn't about NLP, but it is interesting and potentially applicable to browsing behaviour etc.