Parsing: Bottom-Up and Top-Down

Abstract:
This lecture has two parts. In the first part, we
  1. explain the basic idea underlying naive bottom-up parsing strategies for CFGs.
  2. give a simple Prolog implementation of a naive bottom-up parser.
  3. discuss the limits of this work, both implementational, and algorithmic.
In the second part, we
  1. introduce two naive top-down parsing/recognition algorithms, one depth-first, the other breadth-first.
  2. give the top-down depth-first recognition algorithm an efficient implementation in Prolog.
  3. extend this recognizer to a parser.
Bottom-up and top-down parsing are in a way the most basic parsing strategies on the market. The more elaborate algorithms we will see later in this course can all be seen as improved versions of one of them (or a combination of both). Therefore you should make sure that you really understand how the algorithms to be presented now work. Although we sometimes stress their naivite.

Table of Contents

Bottom-Up Parsing and Recognition
The basic idea of bottom-up parsing and recognition is to begin with the concrete data provided by the input string --- that is, the words we have to parse/recognize --- and try to build bigger and bigger pieces of structure using this information. Eventually we hope to put all these pieces of structure together in a way that shows that we have found a sentence.

Bottom-Up Recognition in Prolog
The main goal of this section is to write a simple bottom-up recognizer in Prolog.

An Example Run
Let's look at an example, to see if we've got it right. If you ask Prolog

How Well Have we Done?
How good is this parser? Well, it's pretty easy to understand --- but performance-wise it's awful. Our English grammar is pretty basic --- but the queryrecognize_bottomup([jules,believed,the,robber,who,shot,marsellus,fell]).will make Prolog hesitate, andrecognize_bottomup([jules,believed,the,robber,who,shot,the,robber,who,shot,marsellus,fell]).is painfully slow. In fact, this parser is so bad, it's useful --- it acts as a grim warning of how badly things can be done. We'll see in later lectures that we can do a lot better. Moreover, it's useful to try and understand why it is so awful. For it is bad in two quite distinct ways: at the level of implementation and at the level of algorithm .

Top Down Parsing and Recognition
Now we discuss top-down parsing, and we will take care to remove the implementational inefficiency of our previous implementation: we won't use append/3, but will work with difference lists instead. As we shall see, this makes a huge difference to the performance. Whereas recognize_bottomup/1 is unusable for anything except very small sentences and grammars, our next parser and recognizer will easily handle examples that recognize_bottomup/1 finds difficult. However we won't remove the deeper algorithmic inefficiency. The top-down algorithms are still naive in that they double work (as we have argued for our bottom-up recognizer in » Algorithmic Problems).

Top Down Recognition in Prolog
It is easy to implement a top-down depth-first recognizer in Prolog --- for this is the strategy Prolog itself uses in its search.

Top Down Parsing in Prolog
As is so often the case in Prolog, moving from a recognizer to a parser is simply a matter of adding additional arguments to record the structure.

The Code
Code Summary for this Chapter.

Practical Session
Hands-on experience with the implementations

Exercises
Ex.


Exercise

  1. Design a simple ambiguous context free grammar. Show that your grammar is ambiguous by giving an example of a string that has two distinct parse trees. Draw the two trees.
  2. The following grammar (epsilon.pl ) contains a rule producing an empty terminal symbol: Sε written as s --->[] in our Prolog notation.
    
    s ---> [].
    
    
    s ---> [left,s,right].
    
    
    lex(a,left).
    
    
    lex(b,right).
    
    
    Does this grammar admit the string a b? What happens if you try to use bottomup_recognizer.pl with this grammar to recognize the string a b? Why?
  3. Our naive bottom-up recognizer only recognizes strings of category s. Change it, so that it recognizes strings of any category, as long as the whole string is spanned by that category, and returns the category that it found. For example:
    
    ?- recognize_bottomup_any([the,man],Cat).
    
    Cat = np
    yes
    
    ?- recognize_bottomup_any([the,man,dances],Cat).
    
    Cat = s
    yes
    
    ?- recognize_bottomup_any([the,man,dances,a],Cat).
    
    no
    
  4. Using the ourEng.pl grammar, give a detailed top-down depth-first analysis of Marsellus knew Vincent loved Mia. That is, start with:
    Ex.
    Figure 1: Starting State for Exercise.
    Ex.
    and then show all the steps of your work, including the backtracking.
  5. Using the ourEng.pl grammar, give a detailed top-down breadth-first analysis of Marsellus knew Vincent loved Mia. That is, start with:
    Ex.
    Figure 1: Starting Bag for Exercise.
    Ex.
    and then show all the steps of your work.
  6. Can topdown_recognizer.pl or topdown_parser.pl handle the following grammar (leftrec.pl )?
    
    s ---> [left,right].
    
    
    left ---> [left,x].
    
    
    left ---> [x].
    
    
    right ---> [right,y].
    
    
    right ---> [y].
    
    
    lex(a,x).
    
    
    lex(b,y).
    
    
    If you are not sure try it. Why does it not work? How can it be fixed?
  7. As we said in » Top Down Parsing in Prolog, we need a pretty printer that translates list representations of parses (like [s, [np, [pn, vincent]], [vp, [tv, shot], [np, [pn, marsellus]]]]) to trees. As an example, see a pretty printer for so called semantic tableaux that was written in Prolog. Just select some example from the boxes and then click on "send". Here, the output is a html-table. Write some pretty printer for the parses we have seen before (it might also produce latex or plain ascii or whatever format you like).