Research Seminar Semantic Processing

Brainstorming Session: A Corpus of Violations of Linguistic Quality in Multi-Document Summarization and its Uses

We present LQVSumm, a collection of annotations of specific violations of linguistic quality in automatically produced extractive summaries. We develop an annotation scheme for such violations and annotate about 2,000 automatically created summaries. Among other types of violations, we mark pronouns that lack antecedents, adjacent sentences that are not semantically related, and ungrammatical sentences. An inter-annotator agreement study shows that the degree of subjectivity of our annotation scheme is manageable, and a statistical corpus analysis suggests that detecting instances of violations may help to judge a summary's linguistic quality reliably. In this brainstorming part of the session, we hope to collect ideas for creating models that automatically detect instances of the violations.
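
As a starting point for discussion, here is a minimal sketch of how agreement between two annotators could be quantified. The abstract does not state which agreement coefficient the study used, so Cohen's kappa is an assumption here, and the label names and toy annotations are hypothetical rather than taken from LQVSumm.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    Assumption: each annotator assigns exactly one categorical label
    per item; the actual LQVSumm scheme may be structured differently.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labeled independently
    # according to their own label distributions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations for six sentences (not real corpus data).
annotator_a = ["pronoun", "none", "none", "ungrammatical", "none", "pronoun"]
annotator_b = ["pronoun", "none", "ungrammatical", "ungrammatical", "none", "none"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

A span-based annotation scheme would first need a rule for aligning the spans marked by different annotators before a per-item coefficient like this one applies.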