The SALSA project

Over the last decade, computational linguistics has provided abundant evidence of how grammar research can benefit from corpus-based methods. The field could clearly benefit in the same way from corpora annotated at the level of semantics. However, semantic corpus annotation is still in its initial stages and consists almost exclusively of word sense annotation (an exception being the Prague TreeBank for Czech).

The aim of the SALSA project is to create a large semantically annotated corpus and to investigate methods for its utilization. In a first step, we exhaustively annotate a 1.5-million-word German corpus by hand with frame-semantic roles. In addition, we will selectively annotate word senses and anaphoric links. For the semantic role annotation, we use the FrameNet database of frames, extending it to a light version of a German FrameNet. In the next step, we will train statistical systems on the annotated corpus in order to extend the corpus further (semi-)automatically.

The SALSA corpus can be used in a number of ways, e.g. for the automatic acquisition of lexical semantic information, for training statistical parsers on a combination of syntactic and semantic role information, and for improving linguistically guided techniques for information access and extraction.

The Annotation Scheme

Figure: Parse tree provided by the TIGER corpus
As a basis for the semantic annotation in SALSA, we use the TIGER corpus, a German newspaper corpus annotated for syntactic structure. In this corpus, we tag all frame-evoking elements with their appropriate frames and specify their frame elements. In the annotation, we represent frame structures as flat trees of depth 1. The root node of a frame tree is labeled with the frame name. The edges are labeled either with abbreviated frame element names or as (parts of) the frame-evoking element (FEE). The terminal nodes of the frame trees are sets of nodes of the syntactic annotation trees. The figure above shows an example from the TIGER corpus: the tree with straight edges describes the syntactic structure, and the two trees with arched edges describe the frames REQUEST and CONVERSATION, introduced by the verb fordert auf (demand), which has a separable verb prefix, and by the noun Gespräch (conversation), respectively.
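The frame tree structure described above can be illustrated with a minimal Python sketch. It is not the SALSA annotation format or tool, only one possible in-memory representation under stated assumptions: the class names FrameTree and FrameEdge, the syntactic node ids (such as "tok_fordert" or "np_1"), and the abbreviated frame element labels "SPKR" and "ADDR" are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class FrameEdge:
    """One edge of a frame tree (hypothetical representation)."""
    label: str               # abbreviated frame element name, or "FEE"
    targets: FrozenSet[str]  # set of node ids in the syntactic tree


@dataclass
class FrameTree:
    """Flat tree of depth 1: a root labeled with the frame name plus edges."""
    frame: str
    edges: List[FrameEdge] = field(default_factory=list)

    def add_edge(self, label: str, node_ids: Iterable[str]) -> None:
        # Each terminal of the frame tree is a set of syntactic nodes.
        self.edges.append(FrameEdge(label, frozenset(node_ids)))

    def fee_nodes(self) -> FrozenSet[str]:
        # Collect all syntactic nodes that form (parts of) the FEE.
        nodes = set()
        for edge in self.edges:
            if edge.label == "FEE":
                nodes |= edge.targets
        return frozenset(nodes)


# REQUEST is evoked by "fordert ... auf"; the verb and its separated
# prefix are two syntactic nodes that together form the FEE.
# All node ids and element labels below are assumed for illustration.
request = FrameTree("REQUEST")
request.add_edge("FEE", {"tok_fordert", "tok_auf"})
request.add_edge("SPKR", {"np_1"})
request.add_edge("ADDR", {"np_2"})

# CONVERSATION is evoked by the noun "Gespräch".
conversation = FrameTree("CONVERSATION")
conversation.add_edge("FEE", {"tok_gespraech"})

# Both frames annotate the same sentence, side by side.
sentence_frames = [request, conversation]
```

Keeping the targets as sets of node ids, rather than single nodes, lets one edge cover a discontinuous element such as a separable verb, and keeps the frame layer independent of the syntactic tree it points into.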