Parsing: Bottom-Up and Top-Down
Abstract:This lecture has two parts. In the first part, we
- explain the basic idea underlying naive bottom-up parsing strategies for CFGs.
- give a simple Prolog implementation of a naive bottom-up parser.
- discuss the limits of this work, both implementational, and algorithmic.
In the second part, we
- introduce two naive top-down parsing/recognition algorithms, one depth-first, the other breadth-first.
- give the top-down depth-first recognition algorithm an efficient implementation in Prolog.
- extend this recognizer to a parser.
Bottom-up and top-down parsing are in a way the most basic parsing strategies on the market. The more elaborate algorithms we will see later in this course can all be seen as improved versions of one of them (or a combination of both). Therefore you should make sure that you really understand how the algorithms to be presented now work. Although we sometimes stress their naivite.
Table of Contents
Bottom-Up Parsing and RecognitionThe basic idea of
↗bottom-up parsing and recognition is to begin with the concrete data provided by the input string --- that is, the words we have to parse/recognize --- and try to build bigger and bigger pieces of structure using this information. Eventually we hope to put all these pieces of structure together in a way that shows that we have found a sentence.
An Example Run Let's look at an example, to see if we've got it right. If you ask Prolog
How Well Have we Done?How good is this parser? Well, it's pretty easy to understand --- but performance-wise it's awful. Our English grammar is pretty basic --- but the query
recognize_bottomup([jules,believed,the,robber,who,shot,marsellus,fell]).will make Prolog hesitate, and
recognize_bottomup([jules,believed,the,robber,who,shot,the,robber,who,shot,marsellus,fell]).is painfully slow. In fact, this parser is so bad, it's useful --- it acts as a grim warning of how badly things can be done. We'll see in later lectures that we can do a
lot better. Moreover, it's useful to try and understand
why it is so awful. For it is bad in two quite distinct ways: at the level of
implementation and at the level of
↗algorithm .
Top Down Parsing and Recognition Now we discuss top-down parsing, and we will take care to remove the implementational inefficiency of our previous implementation: we
won't use
append/3, but will work with difference lists instead. As we shall see, this makes a huge difference to the performance. Whereas
recognize_bottomup/1 is unusable for anything except very small sentences and grammars, our next parser and recognizer will easily handle examples that
recognize_bottomup/1 finds difficult. However we won't remove the deeper algorithmic inefficiency. The top-down algorithms are still naive in that they double work (as we have argued for our bottom-up recognizer in
» Algorithmic Problems).
Top Down Recognition in PrologIt is easy to implement a top-down
depth-first recognizer in Prolog --- for this is the strategy Prolog itself uses in its search.
Top Down Parsing in PrologAs is so often the case in Prolog, moving from a recognizer to a parser is simply a matter of adding additional arguments to record the structure.
Exercise
Design a simple ambiguous context free grammar. Show that your grammar is ambiguous by giving an example of a string that has two distinct parse trees. Draw the two trees.
The following grammar (
epsilon.pl ) contains a rule producing an empty terminal symbol:
S→ε written as
s --->[] in our Prolog notation.
s ---> [].
s ---> [left,s,right].
lex(a,left).
lex(b,right).
Does this grammar admit the string
a b? What happens if you try to use
bottomup_recognizer.pl with this grammar to recognize the string
a b? Why?
Our naive bottom-up recognizer only recognizes strings of category
s. Change it, so that it recognizes strings of any category, as long as the whole string is spanned by that category, and returns the category that it found. For example:
?- recognize_bottomup_any([the,man],Cat).
Cat = np
yes
?- recognize_bottomup_any([the,man,dances],Cat).
Cat = s
yes
?- recognize_bottomup_any([the,man,dances,a],Cat).
no
Using the
ourEng.pl grammar, give a detailed top-down
depth-first analysis of
Marsellus knew Vincent loved Mia. That is, start with:
 Figure 1: Starting State for Exercise. Ex. |
and then show all the steps of your work, including the backtracking.
Using the
ourEng.pl grammar, give a detailed top-down
breadth-first analysis of
Marsellus knew Vincent loved Mia. That is, start with:
 Figure 1: Starting Bag for Exercise. Ex. |
and then show all the steps of your work.
Can
topdown_recognizer.pl or
topdown_parser.pl handle the following grammar (
leftrec.pl )?
s ---> [left,right].
left ---> [left,x].
left ---> [x].
right ---> [right,y].
right ---> [y].
lex(a,x).
lex(b,y).
If you are not sure try it. Why does it not work? How can it be fixed?
As we said in
» Top Down Parsing in Prolog, we need a pretty printer that translates list representations of parses (like
[s, [np, [pn, vincent]], [vp, [tv, shot], [np, [pn, marsellus]]]]) to trees. As an example, see
a pretty printer for so called semantic tableaux that was written in Prolog. Just select some example from the boxes and then click on "send". Here, the output is a html-table. Write some pretty printer for the parses we have seen before (it might also produce latex or plain ascii or whatever format you like).