PhD thesis
Supervisor: Doug Arnold; Internal examiner: Louisa Sadler; External examiner: John Carroll.
This thesis addresses the issue of how Natural Language Processing (NLP) systems using "constraint-based" grammar formalisms can be made robust, i.e. able to deal with input which is in some way ill-formed or extra-grammatical. In NLP systems which use constraint-based grammars the operation of unification typically plays a central role. Accordingly, the central concern of this thesis is to propose an approach to robust unification.
The first part of the thesis underlines the importance of robustness in NLP, provides an overview of the sort of phenomena that require it, and reviews the state of the art. From this, it appears that no methods currently exist for robust processing with grammars of any real linguistic sophistication.
The class of constraint-based grammars studied here is that based on Typed Feature Logic (TFL), of which Head-Driven Phrase Structure Grammar is the instance chosen for exemplification. The formalism is described in the second part of the thesis.
Grammars based on TFL involve the notion of a signature, which defines the kinds of objects (types) assumed to exist in the grammar. Processing typically involves combining information about pieces of the input by unification. From this perspective, the need for robustness can be seen as arising because pieces of the input provide information which is inconsistent with information from other pieces of the input and/or from the grammar. The first kind of inconsistency, between pieces of the input, is tolerated, since it does not violate the grammar. It is processed using "robust types", which are created by extending the signature to a lattice. Inconsistency with the grammar, on the other hand, is punished by stripping away the offending information. Weights attached to that information on the basis of the grammar disappear with it, which makes the ungrammaticality measurable. The conceptual and formal apparatus for this is developed and exemplified in the third part of the dissertation.
This chapter examines the robustness problem and some of the proposed solutions. It is shown that in the current situation a robust system either has ad hoc recovery components somewhere in the processing chain, ignores ungrammaticalities or is unaware of them, or uses a statistical solution. Research into using symbolic methods which preserve deep processing is scant.
This chapter is an informal introduction to HPSG, and provides the background information for the next chapter by presenting the machinery that is used to analyse sentences in implementations of HPSG. It describes the different components of an HPSG as they are used in implementations, especially the ones based on the logic presented by Carpenter (1992). Feature structures, constructed on the basis of a signature, are used to construct lexical entries, lexical rules, and phrase structure rules. These are the means by which analyses for sentences are built, combined by unification.
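To make the role of unification concrete, here is a minimal sketch in Python. It is not the machinery of the thesis or of Carpenter (1992): it assumes untyped feature structures represented as nested dictionaries with string atoms, and it ignores types, appropriateness conditions and structure sharing. The function name `unify` and the feature names are illustrative only.

```python
from typing import Optional, Union

# A feature structure is either an atom (str) or a dict from features to values.
FS = Union[str, dict]

def unify(a: FS, b: FS) -> Optional[FS]:
    """Return the unification of a and b, or None if they clash."""
    if isinstance(a, str) or isinstance(b, str):
        # In this sketch an atom unifies only with an identical atom.
        return a if a == b else None
    result = dict(a)
    for feat, val in b.items():
        if feat in result:
            sub = unify(result[feat], val)
            if sub is None:
                return None  # one clash makes the whole unification fail
            result[feat] = sub
        else:
            result[feat] = val  # information present only in b is added
    return result

# Hypothetical fragments: a noun and the agreement a verb expects of its subject.
noun = {"HEAD": "noun", "AGR": {"NUM": "sg"}}
expected = {"AGR": {"NUM": "sg", "PER": "3"}}
merged = unify(noun, expected)
clash = unify(noun, {"AGR": {"NUM": "pl"}})
```

Here `merged` combines the information from both structures, while `clash` is `None` because the two values for `NUM` are incompatible. It is this all-or-nothing failure that a robust variant of unification has to avoid.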
In this chapter, the first part of a formalism for robust unification is presented. Relative to a non-robust formalism, the solution space is enlarged through unification relaxation. This does not mean that anything goes: the relaxation is limited to what can be useful, by exploiting the properties defined in the signature.
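The idea of a relaxation that is constrained by the signature can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy signature, the function names, the representation of a "robust type" as a set of the clashing types, and in particular the rule that two incompatible types may only be combined when they share a supertype below `top`. The actual construction in the thesis may differ.

```python
# Hypothetical toy signature: each type maps to its immediate supertypes.
# The hierarchy is tree-shaped, so two types are compatible only when one
# of them subsumes the other.
SUPERTYPES = {
    "top": [],
    "num": ["top"],
    "sg": ["num"],
    "pl": ["num"],
    "case": ["top"],
    "nom": ["case"],
}

def ancestors(t):
    """Return all types that subsume t, including t itself."""
    seen = {t}
    stack = [t]
    while stack:
        for parent in SUPERTYPES[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def strict_unify(t1, t2):
    """Return the more specific of two comparable types, else None."""
    if t1 in ancestors(t2):
        return t2
    if t2 in ancestors(t1):
        return t1
    return None

def robust_unify(t1, t2):
    """Like strict_unify, but tolerate clashes between related types."""
    result = strict_unify(t1, t2)
    if result is not None:
        return result
    # Assumed restriction: relax only if the types share a supertype
    # other than the most general type.
    shared = (ancestors(t1) & ancestors(t2)) - {"top"}
    if shared:
        return frozenset({t1, t2})  # stand-in for a "robust type"
    return None
```

With this signature, `robust_unify("sg", "pl")` yields a combined type because both are kinds of `num`, whereas `robust_unify("sg", "nom")` still fails because the two types have nothing in common but `top`.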
This chapter discusses the feature structures of Carpenter (1992) and the changes to them that robust unification necessitates. Based on that material, a technique is presented to reduce the size of the robust signature without affecting its robustness.
This chapter contains a worked example of what was presented in the two previous chapters. The second part compares the approach with similar proposals in the literature.
In this chapter, weights are added to the robust unification of Chapters 4 and 5 as a measure of the amount of information. The weights are derived from the feature structures in the definition of grammar objects. The feature structure graphs are adapted so that ill-formedness causes a loss of information, and so that this loss is recorded. This makes it possible to rank analyses by the quantity of information they have lost.
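The ranking idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the formalism of the thesis: it assumes flat feature structures (a dictionary from features to atomic values), hand-picked integer weights per feature rather than weights derived from the grammar, and a hypothetical function `weighted_unify` that drops a clashing feature and adds its weight to a running total.

```python
def weighted_unify(a, b, weights):
    """Unify two flat feature structures, stripping clashing features.

    Returns the resulting structure and the total weight of the
    information that was stripped away.
    """
    result = dict(a)
    lost = 0
    for feat, val in b.items():
        if feat not in result:
            result[feat] = val
        elif result[feat] != val:
            del result[feat]                # strip the offending information
            lost += weights.get(feat, 1)    # and record what it was worth
    return result, lost

# Hypothetical weights and inputs, chosen only for the example.
WEIGHTS = {"NUM": 2, "PER": 1, "CASE": 3}
required = {"NUM": "sg", "PER": "3", "CASE": "nom"}
candidates = {
    "number clash": {"NUM": "pl", "PER": "3", "CASE": "nom"},
    "case clash": {"NUM": "sg", "PER": "3", "CASE": "acc"},
    "well-formed": {"NUM": "sg", "PER": "3", "CASE": "nom"},
}

# Rank the analyses from least to most information lost.
scored = {name: weighted_unify(required, fs, WEIGHTS) for name, fs in candidates.items()}
ranking = sorted(scored, key=lambda name: scored[name][1])
```

In this example the well-formed candidate loses nothing and ranks first, followed by the number clash and then the case clash, simply because `CASE` was given a higher weight than `NUM`.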
This chapter contains further examples of robust processing. It also discusses some techniques for influencing the robust behaviour of the grammar, and finally sketches the beginnings of some possible extensions: relations and word order.
This chapter touches on some questions that remain to be answered.
This appendix describes an implementation of a version of the formalism presented in Chapters 4, 5 and 7. It includes a small grammar which is used to explain the internal representation and the compilation procedure, along with the restrictions I was working with at the time. It also contains a description of the parser and its data structures.
See Abstracts, handouts and talks.
I hope to make the bibliography I collected available on-line here in some form or another. It comprises some 520 publications on robustness (in the strict sense).
From "De Standaard On-line" (13/05/2002), "Minder geslaagde doctoraten in humane wetenschappen" ("Fewer successful doctorates in the humanities"), by Alexandra De Laet:
Because for a doctorate you have to be able to immerse yourself completely, explains the vice-rector [Roger Bouillon, vice-rector for research at KU Leuven (my addition)]. "You keep thinking: in the shower, at weekends, on a walk in the woods. You do not earn a doctorate by working on something for six or eight hours a day: you have to temporarily orient your whole way of thinking towards it."