Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Projekt Schreibgebrauch

Tools and Analyses for the Orthographic Monitoring of Current German Writing Practice

This project aims to monitor writing habits using natural language processing (NLP) methods. We develop tools that facilitate the orthographic evaluation of day-to-day written German. Our data basis consists of texts by professional writers (newspaper and journal articles, books), students' writing, and Internet texts such as blog or forum posts. The methods and results of the data collection are prepared in a way that allows the Council for German Orthography to use them directly as building blocks for future standardization efforts.

Four institutions are involved in the project: the Institut für Deutsche Sprache, Mannheim; the department for Computational Linguistics at Saarland University, Saarbrücken; and two dictionary publishers, Bibliographisches Institut GmbH (Dudenverlag), Berlin, and Wahrig at Brockhaus, Gütersloh. The joint research project is funded by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung, BMBF).

The department for Computational Linguistics at Saarland University (group of Professor Manfred Pinkal) is responsible for all aspects of the project that concern the areas of computational linguistics and language technology. This includes in particular the following two subtasks:

Adaptation of NLP tools for Internet texts

The systematic monitoring of writing practices is substantially supported by, and often requires, automatic linguistic analysis. Part-of-speech taggers assign part-of-speech labels to the words in a text with high accuracy, lemmatizers reduce word forms to their corresponding dictionary entries, and parsers and chunkers determine the grammatical structure of sentences.

These tools are usually trained on standard text (such as newspaper or journal articles, or literary works) and perform poorly, if at all, when applied to texts that do not follow standard spelling or grammar conventions. Such texts include in particular Internet texts as found in blogs, Twitter messages or discussion boards. For the monitoring of writing practice, however, Internet texts play a vital role, because (a) they are produced spontaneously and are not edited, which makes uncertainties in language usage (e.g. misspellings, grammatical errors, missing or "free" punctuation) particularly visible, and (b) deliberate deviations from the established rules, such as modified spellings ("willz" instead of "willst") or contractions ("gibste", "haste"), point to potential future developments in spelling and grammar.

We adapt NLP tools to non-standard texts, which makes these text types accessible for the analysis of writing practices.
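One simple way to picture such an adaptation is a normalization step that maps non-standard forms to standard ones before a tagger trained on standard text sees them. The following is only a minimal sketch under that assumption, not a description of the project's actual tools. The lookup table `NORMALIZATION` and the function names are hypothetical, and its entries are just the examples quoted above.

```python
import re

# Hypothetical lookup table (an assumption for illustration); the entries
# are the non-standard forms quoted in the text above.
NORMALIZATION = {
    "willz": ["willst"],
    "gibste": ["gibst", "du"],
    "haste": ["hast", "du"],
}

# Words, or single punctuation marks, as separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def normalize(tokens):
    """Replace known non-standard forms by their standard equivalents.

    Returns the normalized token list and a list of
    (original_index, original_token, replacement_tokens) records, so that
    the original spelling remains available for orthographic analysis.
    """
    normalized = []
    changes = []
    for index, token in enumerate(tokens):
        replacement = NORMALIZATION.get(token.lower())
        if replacement is None:
            normalized.append(token)
            continue
        if token[0].isupper():
            # Preserve initial capitalization, e.g. at the start of a sentence.
            replacement = [replacement[0].capitalize()] + replacement[1:]
        normalized.extend(replacement)
        changes.append((index, token, replacement))
    return normalized, changes


tokens = tokenize("Haste das gesehen? Du willz doch nur spielen.")
normalized, changes = normalize(tokens)
print(normalized)
print(changes)
```

A pure lookup like this ignores context: "haste" can also be a regular verb form, so a realistic system would need to disambiguate before replacing. Keeping the record of changes matters here, because the non-standard spellings are themselves the object of study.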

Context and Collocation Analysis

One crucial step in the systematic monitoring of writing practices is the identification of relevant applications of spelling rules in large corpora. This task is often non-trivial. To answer the question of whether German adjectives in fixed expressions, such as "gelbe Karte" (lit. yellow card) or "Neue Welt" (New World), should be capitalized, we have to decide whether the expression is intended to be interpreted literally. In the case of "gelbe Karte", for instance, we have to distinguish usages that refer to a referee's action during a soccer match, or to a metaphorical use derived from it, from usages that refer to a yellow index card or postcard.

Identifying relevant usages requires an analysis of the context. This holds for questions of capitalization, for separate versus compound spelling, and especially for punctuation.
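To make the role of context concrete, the sketch below counts capitalized and lowercase spellings of "gelbe Karte" separately for occurrences with and without football-related words nearby. It is a deliberately crude keyword-window heuristic and stands in for, rather than reproduces, the project's methods. The cue word list `FOOTBALL_CUES`, the window size and the function name are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical cue words for the football sense (an assumption for
# illustration); a real system would derive such cues from corpus data.
FOOTBALL_CUES = {"schiedsrichter", "foul", "spieler", "spiel", "platzverweis"}

# Inflected forms of the adjective and the noun.
ADJ_RE = re.compile(r"[Gg]elbe[nr]?")
NOUN_RE = re.compile(r"Karten?")


def count_spelling_by_sense(text, window=5):
    """Count (sense, spelling) pairs for occurrences of 'gelbe Karte'.

    The sense is guessed from cue words within `window` tokens on either
    side of the expression. Sentence-initial occurrences are capitalized
    anyway and are not treated specially in this sketch.
    """
    tokens = re.findall(r"\w+", text)
    counts = Counter()
    for i in range(len(tokens) - 1):
        adjective, noun = tokens[i], tokens[i + 1]
        if not (ADJ_RE.fullmatch(adjective) and NOUN_RE.fullmatch(noun)):
            continue
        context = tokens[max(0, i - window):i] + tokens[i + 2:i + 2 + window]
        has_cue = any(token.lower() in FOOTBALL_CUES for token in context)
        sense = "football" if has_cue else "other"
        spelling = "capitalized" if adjective[0].isupper() else "lowercase"
        counts[(sense, spelling)] += 1
    return counts


sample = (
    "Der Schiedsrichter zeigte dem Spieler die Gelbe Karte. "
    "Auf dem Tisch lag eine gelbe Karte aus Papier und ein alter Stift."
)
print(count_spelling_by_sense(sample))
```

The window here runs across sentence boundaries and relies on a fixed word list, so it would misclassify metaphorical uses that mention no football vocabulary. Those cases are exactly where finer-grained context analysis is needed.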

Within the project, we develop fine-grained methods for context analysis that combine deep grammatical processing with statistical, corpus-based and co-occurrence-based methods (n-gram analyses, disambiguation methods using distributional semantics) as well as sophisticated methods of collocation analysis.
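As an illustration of the n-gram and collocation side, the following sketch scores adjacent word pairs by pointwise mutual information (PMI), a common association measure. Whether the project uses PMI or another measure is not stated here, so this is an assumed example. The function name and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from relative frequencies. Pairs seen fewer than
    `min_count` times are skipped, because PMI is unreliable for rare
    events.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigram_counts[first] / n_unigrams
        p_second = unigram_counts[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


corpus = "die gelbe karte und die gelbe karte und die rote karte".split()
for pair, score in sorted(bigram_pmi(corpus).items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A high score means the two words occur together more often than their individual frequencies would predict, which is one signal that a pair such as "gelbe Karte" behaves like a fixed expression. On a corpus as tiny as this toy input the scores carry no real information; the measure only becomes meaningful on large corpora.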