6.2 Bottom-Up Recognition in Prolog

The main goal of this section is to write a simple bottom-up recognizer in Prolog.

The main goal of this section is to write a simple bottom-up recognizer in Prolog. But before we do this, we have to decide how to represent CFGs in Prolog. The representation that we are going to introduce distinguished between phrase structure rules and lexical rules by representing them in different ways. As we mentioned above, it is often useful to be able to make this distinction when dealing with natural languages. For representing phrase structure rules, we shall use a notation that is quite close to the one used in DCGs. In fact, there are only two differences. First, whereas DCGs use the symbol

-->

for the rewrite arrow, we shall use the symbol

--->

Second, in DCGs we would write as:

s --> np,vp.

However we shall use instead the following notation:

s ---> [np,vp].

Here's an example. The phrase structure rules of our English grammar become in this notation (from file ourEng.pl: View Download ):

s ---> [np, vp].

np ---> [pn].

np ---> [pn,rel].

np ---> [det, nbar].

nbar ---> [n].

nbar ---> [n, rel].

rel ---> [wh, vp].

vp ---> [iv].

vp ---> [tv, np].

vp ---> [dv, np, pp].

vp ---> [sv, s].

pp ---> [p, np].

How does Prolog know about the symbol --->? Well, it needs to be told what it means, and we can do this using an operator definition as follows:

?- op(700,xfx,--->).

That is, we have declared ---> to be a binary infix operator. The best place to put this definition is probably in the recognizer, not in each and every grammar. But note: this means you will have to consult the recognizer before you consult any of the the grammars, as otherwise Prolog won't know what ---> means.

Now, we can represent phrase structure rules. Lexical rules we shall represent using the predicate lex/2. For example, will be represented as lex(vincent,pn). Here are the lexical rules of the little English grammar that we have seen above in the new notation.

lex(vincent,pn). lex(mia,pn). lex(marsellus,pn). lex(jules,pn). lex(a,det). lex(the,det). lex(her,det). lex(his,det). lex(gun,n). lex(robber,n). lex(man,n). lex(woman,n). lex(who,wh). lex(that,wh). lex(to,p). lex(died,iv). lex(fell,iv). lex(loved,tv). lex(shot,tv). lex(knew,tv). lex(gave,dv). lex(handed,dv). lex(knew,sv). lex(believed,sv).

Incidentally --- we shall use this notation for grammars throughout the course. All our parser/recognizers will make us of grammars in this format.

It's now time to define our very first recognizer --- a simple (though inefficient) recognizer which carries out the algorithm sketched above. Here it is. The predicate recognize_bottomup/1 takes as input a list of symbols (for example, [vincent,shot,marsellus]) and tries to build the list [s] (that is, a sentence). Here is its definition (from file bottomup_recognizer.pl ):

recognize_bottomup([s]). %%% the recursive case: choose a part of the incoming %%% string which matches with the right hand side of a %%% rule and replace it with the category symbol %%% on the left hand side of that rule. Recursivel %%% recognize the result. recognize_bottomup(String) :- /*** uncomment the following line to see which states the recognizer goes through ***/ % nl,write('STATE: '),write(String), split(String,Front,Middle,Back), ( Cat ---> Middle ; (Middle = [Word], lex(Word,Cat)) ), /*** uncomment to see which rules are applied ***/ % tab(5),write('RULE: '),write((Cat ---> Middle)),nl, split(NewString,Front,[Cat],Back), recognize_bottomup(NewString). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% split(+list, ?list, ?list, ?list) %%% splits a list into three segments

How does this work? Well, the first clause, recognize_bottomup([s]), tells us that we have succeeded if we find the list [s]. Incidentally, if you glance down at the following clause, you will see that recognize_bottomup/1 is recursive. This first clause is the base clause of the recursion.

So what about the second clause? First we have:

split(String,Front,Middle,Back)

The predicate split/4 splits a list into three parts. It is defined as follows:

split(ABC, A, B, C) :- append(A, BC, ABC), append(B, C, BC).

split/4 uses the standard append/3 predicate to split up the incoming list by calling it with uninstantiated varibles in the first two arguments. append/3 is called twice: The first time the incoming list is split in two parts, and the second time one of the parts is again split into two parts, resulting in three parts altogether. Unfortunately, using append/3 in this way is very inefficient.

So --- split/4 splits the string into three parts: Front, Middle, and Back. Next comes a disjunction:

Cat ---> Middle ; (Middle = [Word], lex(Word,Cat))

It succeeds if we have either a phrase structure rule with Middle as its right hand side, or if Middle is actually a word that we can map to a category by a lexical rule.

Now for the key step. Suppose we have such a rule. Then

split(NewString,Front,[Cat],Back)

builds a new string out of our input string by replacing the list Middle with the list [Cat]. That is, from

Front Middle Rest

we get the new string

Front Cat Rest

Note how we used the predicate split/4 here in reverse to "unsplit", that is to join the lists.

In short: we have used our rule right to left to build a new string. The rest is simple. We recursively call

recognize_bottomup(NewString)

on the new string we have built. If we have a sentence on our hands, this recursion will eventually produce [s], and we will succeed using the first clause. Note that every call to recognize_bottomup/1 makes use of append/3 to decompose the input list. So, via backtracking, we will eventually find all possible ways of decomposing the input list --- thus if the input really is a sentence, we will eventually succeed in showing this.