Foundational Course at ESSLLI 2006
18th European Summer School in Logic, Language and Information
Málaga, Spain
31 July-11 August, 2006
Introduction to Corpus Resources, Annotation and Access
References
Tokenisation and Morpho-Syntactic Annotation
[Tokenisation]
- Gregory Grefenstette and Pasi Tapanainen (1994): What is a word, what is a sentence? Problems of tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research, pp. 79-87. Budapest, Hungary.
- Andrei Mikheev (2002): Periods, Capitalized Words, etc. Computational Linguistics, 28(3):289-318.
- Andrei Mikheev (2003): Text segmentation. In: Ruslan Mitkov, editor: The Oxford Handbook of Computational Linguistics, pp. 376-394. Oxford University Press.
- Helmut Schmid (2007?): Tokenizing. In: Anke Lüdeling and Merja Kytö, editors: Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin.
- Tokeniser:
[Part-of-Speech Tagging]
[Morphological Annotation]
[Word Distributions]
- Kenneth W. Church and Patrick Hanks (1990): Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.
- Ted Dunning (1993): Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
- Harald Baayen (2001): Word frequency distributions. Kluwer Academic Publishers.
- Stefan Evert (2004): The statistics of word cooccurrences: word pairs and collocations. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
- Marco Baroni (2007?): Distributions in Text. In: Anke Lüdeling and Merja Kytö, editors: Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin.
- Oliver Christ, Bruno M. Schulze, Anja Hofmann, Esther König (1999): The IMS corpus workbench: Corpus Query Processor. Technical report, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
- Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell (2004): The Sketch Engine. In Proceedings of the 11th EURALEX International Congress. Lorient, France.
- Collocations and multiword expressions online:
[Evaluation]
Exercise: Tree Tagger
Exercise: CQP
- IMS Corpus Workbench
- CQP download
- Oliver Christ (1994): A modular and flexible architecture for an integrated corpus query system. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research. Budapest, Hungary.
- Oliver Christ, Bruno M. Schulze, Anja Hofmann, Esther König (1999): The IMS corpus workbench: Corpus Query Processor. Technical report (out of date), Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
- Stefan Evert (2002): Corpus encoding tutorial: First steps. Draft, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
- Stefan Evert (2005): The CQP query language tutorial. Technical report, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
Semantic Annotation
[Word Senses]
[WordNet]
[Prague Dependency Treebank]
- Eva Hajičová, Jarmila Panevová, and Petr Sgall (2000): A manual for tectogrammatic tagging of the Prague Dependency Treebank. UFAL/CKL Technical Report TR-2000-09, Charles University, Prague.
- Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora Hladká (2003): The Prague Dependency Treebank: A three-level annotation scenario. In: Anne Abeillé, editor: Treebanks: Building and Using Parsed Corpora. Kluwer Academic Publishers.
- Petr Sgall, Jarmila Panevová, and Eva Hajičová (2004): Deep syntactic annotation: Tectogrammatical representation and beyond. In Proceedings of the HLT-NAACL Workshop on "Frontiers in Corpus Annotation". Boston, MA.
- Jan Hajič and Zdeňka Urešová (2005): The Prague Dependency Treebank and Valency Annotation. Tutorial at RANLP, Borovets.
- PDT online
[FrameNet]
* general and English:
- Collin F. Baker, Charles J. Fillmore, and John B. Lowe (1998): The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pp. 86-90.
- Thierry Fontenelle, editor (2003): FrameNet and frame semantics. Special issue of the International Journal of Lexicography, 16(3).
* German:
* Spanish:
- Carlos Subirats and Hiroaki Sato (2004): Spanish FrameNet and FrameSQL. In Proceedings of the LREC Workshop on "Building Lexical Resources from Semantically Annotated Corpora".
* Japanese:
- Kyoko Hirose Ohara, Seiko Fujii, Toshio Ohori, Ryoko Suzuki, Hiroaki Saito, and Shun Ishizaki (2004): The Japanese FrameNet project: An introduction. In Proceedings of the LREC Workshop on "Building Lexical Resources from Semantically Annotated Corpora".
* FrameNet online:
[PropBank]
[OntoBank / OntoNotes]
- Eduard Hovy (2006): Corpus creation by annotation. Invited talk at LREC 2006. Genoa, Italy.
- Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel (2006): OntoNotes: The 90% Solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. New York City, NY.
- OntoBank online: link not yet available, check Eduard Hovy's website
[Word Sense Disambiguation and Role Labeling]
Exercise: SALTO Annotation Tool
More Levels of Corpus Annotation
[The Prague Treebank]
- Eva Hajičová (1999): The Prague Dependency Treebank: Crossing the sentence boundary. In Proceedings of the 2nd Workshop on Text, Speech, Dialogue, pp. 20-27. Mariánské Lázně, Czech Republic.
- Eva Hajičová, Jarmila Panevová, and Petr Sgall (2000): Coreference in annotating a large corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pp. 497-500. Athens, Greece.
- Oana Postolache, Ivana Kruijff-Korbayová, and Geert-Jan Kruijff (2005): Data-driven approaches for information structure identification. In Proceedings of the joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 9-16. Vancouver, Canada.
- PDT online
[Rhetorical Structure Theory and the RST Discourse Treebank]
[The Penn Discourse TreeBank]
- Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber (2004): The Penn Discourse TreeBank. In Proceedings of the 4th International Conference on Language Resources and Evaluation. Lisbon, Portugal.
- Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber (2004): Annotating discourse connectives and their arguments. In Proceedings of the HLT/NAACL Workshop on "Frontiers in Corpus Annotation". Boston, MA.
- Eleni Miltsakaki, Nikhil Dinesh, Rashmi Prasad, Aravind Joshi, and Bonnie Webber (2005): Experiments on sense annotations and sense disambiguation of discourse connectives. In Proceedings of the 4th Workshop on "Treebanks and Linguistic Theories". Barcelona, Spain.
- Bonnie Webber, Aravind Joshi, Eleni Miltsakaki, Rashmi Prasad, Nikhil Dinesh, Alan Lee, and Kate Forbes (2005): A short introduction to the Penn Discourse TreeBank. In Copenhagen Working Papers in Language and Speech Processing.
- Bonnie Webber, Matthew Stone, Aravind Joshi, and Alistair Knott (2003): Anaphora and discourse structure. Computational Linguistics, 29(4):545-587.
- The PDTB Research Group (2006): The Penn Discourse TreeBank 1.0. Annotation Manual. IRCS Technical Report IRCS-06-01, Institute for Research in Cognitive Science, University of Pennsylvania.
- PDTB online
[Anaphora and Coreference]
- Ruslan Mitkov, Richard Evans, Constantin Orasan, Catalina Barbu, Lisa Jones, and Violeta Sotirova (2000): Coreference and anaphora: Developing annotating tools, annotated resources and annotation strategies. In Proceedings of the Discourse, Anaphora and Reference Resolution Conference, pp. 49-58. Lancaster, UK.
- Eva Hajičová, Jarmila Panevová, and Petr Sgall (2000): Coreference in annotating a large corpus. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, pp. 497-500. Athens, Greece.
- Erhard Hinrichs, Sandra Kübler, Karin Naumann, Heike Telljohann, Julia Trushkina, and Heike Zinsmeister (2005): Recent developments in linguistic annotations of the TüBa-D/Z Treebank. Poster at the 27th Annual Meeting of the German Linguistic Society (Deutsche Gesellschaft für Sprachwissenschaft). Köln, Germany.
[Kiel Corpus of Read Speech]
[MATE]
[NITE]