Our course presents an introduction to corpus resources, combining the
theoretical background of corpora, resource examples, annotation
levels, and tools for exploitation.
First, we motivate the use of corpus resources in empirical linguistics
and describe the properties and problems of corpus data, the levels of
annotation, and standardisation efforts.
We then relate the annotation levels to appropriate tools and uses for
exploitation:
Tokenisation, tagging, and lemmatisation are introduced; we present
CQP for querying corpora with linear patterns, e.g. to extract
collocations, and Unix tools for shallow statistical analyses, e.g. the
type-token distinction, sorting, and bigrams.
Treebanks are introduced, with cross-linguistic examples; we
describe typical complexities such as PP attachment, and present
ANNOTATE as a tool for annotation and TIGERSearch as a query tool.
SensEval is introduced as a framework for defining and utilising
semantically annotated corpus data; we demonstrate the exploitation of
word senses.
Finally, we present the web as a corpus. BootCaT is a toolkit for
collecting data from the web, e.g. for creating corpora for minority
languages.
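
To give a flavour of the shallow statistical analyses mentioned above
(the type-token distinction, sorting, and bigrams), here is a minimal
sketch in Python rather than with the Unix tools used in the course.
The function names and the sample sentence are our own illustrative
assumptions, and the regular-expression tokeniser is deliberately
crude; it is not the tokenisation procedure presented in the course.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and return its word tokens.

    Assumption: a token is a run of ASCII letters, optionally followed
    by an apostrophe and more letters. Real tokenisers are far more careful.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_stats(tokens):
    """Return (number of tokens, number of types, type-token ratio)."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ratio = n_types / n_tokens if n_tokens else 0.0
    return n_tokens, n_types, ratio


def bigrams(tokens):
    """Return all pairs of adjacent tokens.

    Sentence boundaries are ignored, so pairs may span two sentences.
    """
    return list(zip(tokens, tokens[1:]))


def frequency_list(items):
    """Count items and sort by descending frequency, ties alphabetically."""
    return sorted(Counter(items).items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat on the log."
    tokens = tokenise(sample)
    print(type_token_stats(tokens))
    for item, freq in frequency_list(tokens)[:3]:
        print(freq, item)
    for pair, freq in frequency_list(bigrams(tokens))[:3]:
        print(freq, " ".join(pair))
```

The sorted frequency list plays the role of a sort-and-count pipeline
on the command line: tokens are the running words, types are the
distinct word forms, and the bigram list is simply the token sequence
paired with itself shifted by one position.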