Project WHITEBOARD

PROJECT SHEET

Principal Investigator: Hans Uszkoreit
Project Leader: Günter Neumann

Funding: German Ministry of Education and Research
Duration: 2000-2003

Objectives and Approaches:

The project aims at designing, implementing, investigating and evaluating a new system architecture that facilitates the combination of different language technologies for a range of practical applications.

The greatest challenge for effective information, document and knowledge management, including the production of documents, is the powerful medium of human language. Seen from the viewpoint of automatic data processing, texts are the prime example of so-called unstructured information. Seen from the point of view of human users, however, texts exhibit structures that are much more complex than those of database entries.

We cannot expect in the foreseeable future that machines will be able to reliably determine the rich structure of sentences and paragraphs that humans analyse when they successfully interpret texts. However, language technologies offer numerous means for a partial analysis of texts that can be employed for information retrieval, information extraction, language checking, and many other applications.

Processing methods and tools differ along several dimensions. They may be restricted to certain levels of linguistic description such as lexicon, morphology, syntax, semantics or text structure. According to the depth of analysis they may be called shallow or deep methods. With respect to the way that knowledge about language is derived and applied, they may be linguistic or statistical. Furthermore, certain methods may be especially suited for specific languages or applications. Methods often overlap in their functionality but differ in their strengths and weaknesses.

Finding optimal combinations of heterogeneous techniques and processing components is one of the most difficult tasks in language processing.

The novel architecture to be developed and explored in WHITEBOARD is based on the concept of an annotated text. The different LT components enrich an XML-encoded text with layers of new meta-information that are also represented in XML. Each component can exploit or disregard previously assigned annotations.

In contrast to most monolithic systems or to individual complex interfaces between heterogeneous components, the WHITEBOARD architecture has a single shared data structure, which at the same time is the input, throughput, and output of the system. Unstructured data are thus transformed into structured data in a stepwise fashion. In the ideal case, the information gain will be monotonic.
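The idea of a single shared, annotated text can be illustrated with a minimal Python sketch. It is not the WHITEBOARD implementation: the element names (document, text, layer, span), the character-offset scheme, and the two toy components are assumptions made purely for illustration. Each component reads the shared XML structure, may consult layers added earlier, and appends a new layer without modifying the text or the existing layers.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: <document><text>...</text><layer name="...">...</layer></document>
def new_document(text):
    doc = ET.Element("document")
    ET.SubElement(doc, "text").text = text
    return doc

def add_layer(doc, name, spans):
    """Append one annotation layer; spans are (start, end, label) character offsets."""
    layer = ET.SubElement(doc, "layer", {"name": name})
    for start, end, label in spans:
        ET.SubElement(layer, "span",
                      {"start": str(start), "end": str(end), "label": label})
    return layer

def tokenizer(doc):
    """Toy first component: whitespace tokenization of the raw text."""
    text = doc.findtext("text")
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((start, end, "token"))
        pos = end
    add_layer(doc, "tokens", spans)

def capitalized_marker(doc):
    """Toy second component: exploits the token layer added earlier."""
    text = doc.findtext("text")
    tokens = doc.find("layer[@name='tokens']")
    spans = []
    for span in tokens.findall("span"):
        start, end = int(span.get("start")), int(span.get("end"))
        if text[start:end][0].isupper():
            spans.append((start, end, "capitalized"))
    add_layer(doc, "capitalized", spans)

# The same structure is input, throughput, and output of every step.
doc = new_document("WHITEBOARD combines shallow and deep analysis")
for component in (tokenizer, capitalized_marker):
    component(doc)

print(ET.tostring(doc, encoding="unicode"))
```

Because components only ever add layers in this sketch, the information in the document grows monotonically, which mirrors the ideal case described above.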

The envisaged architecture permits the pragmatic combination of different processing approaches.

In this way, the system combines shallow and deep analysis methods. The range of combined components includes:

the morphological processing system MORPHIX
the tagger TnT and the phrase parser Chunkie
the information extraction system SMES
the efficient HPSG parsing system PET
HPSG grammars for German, English (Stanford's LinGO grammar) and Japanese
the controlled language checking system FLAG

Recently achieved efficiency gains in deep grammatical processing with HPSG permit the combination of the deep linguistic analysis with different types of shallow processing. The investigation of new methods for combining deep and shallow processing is a core component of the project.

Two applications are realized for the purpose of evaluating and demonstrating the results. One application is information extraction (IE). As the automatic understanding of entire texts will remain out of reach for quite some time, the strategy for approaching this goal is the gradual improvement of our IE technology.

The second application is controlled language checking. Here again, we cannot expect today's technology to deliver a comprehensive and correct analysis of an entire text. We might be able, however, to specialize our deep analysis so that it applies with sufficient precision in certain contexts that are critical for the correct diagnosis and correction of errors.