Computational Linguistics & Phonetics, Fachrichtung 4.7, Universität des Saarlandes

Computational Linguistics Colloquium

Thursday, February 14, 16:15, Building 17, Seminar Room

Question Answering Technology: Getting to Know the New Kid on the Block

Marc Light
The Mitre Corporation
Boston

Question answering (QA) systems aim to allow users to ask questions such as "Which New England communities have reported outbreaks of encephalitis this year?" and to receive succinct answers. Such systems can be viewed as fine-grained search engines that return short snippets of text containing the answer to a question, as opposed to a list of relevant documents.

For the past three years, the National Institute of Standards and Technology has hosted an evaluation of QA systems funded by DARPA and ARDA. The best system this year was able to provide a correct answer among its top five responses for 70% of the questions in the test set. The test questions were taken from search engine logs and the answers were to be found in a document collection consisting of over a million newswire-like texts.

In general, the performance of these systems outstripped expectations. Despite this success, there is little understanding of why these systems work: which aspects of the systems and the evaluation were crucial for the performance, what would cause a decline in performance, and which aspects account for the systems' errors.

In this talk, we take a detailed look at the performance of components of an idealized question answering system on two different tasks: the TREC Question Answering task and a set of reading comprehension exams. We carry out three types of analysis: inherent properties of the data, feature analysis, and performance bounds. Based on these analyses, we explain some of the performance results of the current generation of QA systems and make predictions about future work. In particular, we present four findings: (1) QA system performance is correlated with answer redundancy, (2) relative overlap scores are more effective than absolute overlap scores, (3) equivalence classes on scoring functions can be used to quantify performance bounds, and (4) perfect answer typing still leaves a great deal of ambiguity for a QA system because sentences often contain several items of the same type.

This is joint work with Gideon Mann, Ellen Riloff, and Eric Breck.

If you would like to meet with the speaker, please contact Detlef Prescher.