The Basics
Abstract:
Context Free Grammars (CFGs) describe a larger class of formal languages than regular languages do.Many natural language structures can be neatly described with context free grammars. Another reason why CFGs are widely used in the field of NLP is the availability of many efficient tools (compilers, parsers, etc.).
↗context free grammar s (CFGs) describe a larger class of formal languages than regular languages do. As we will see below, the language
anbn can in fact be described by a CFG. Even more, many natural language structures can be neatly described with context free grammars. This makes them especially useful for writing grammars for NLP applications. Another reason why CFGs are widely used in the field of NLP is the availability of many efficient tools (compilers, parsers, etc.).
The
→ is called the
↗rewrite arrow , and the four expressions listed above are called
↗context free rule s (you may also see them called
↗rewrite rule s, or something similar).
Apart from the
→, there are two sorts of symbols in this grammar:
↗terminal symbol s and
↗non-terminal symbol s. In the above grammar, the terminal symbols are `a' and `b', and the non-terminal symbols are `S', `A', and `B'.
- Terminal symbols
never occur to the left of a rewrite arrow. - Non-terminal symbols
may occur either to the right or the left of the rewrite arrow (for example, the `S' in the above grammar occurs both to the right and the left of → in the second rule).
Every context free grammar has one special symbol, the
↗start symbol (or
↗sentence symbol ), which is usually written `S'. Some context free grammars also use another special symbol, namely
ε, for the empty string. The
ε symbol can only occur on the right hand side of context free rules. For example, we could add the rule
S→ε to the above grammar, but we could
not add
ε→A.
The simplest interpretation of context free grammars is the
↗rewrite interpretation . This views CFGs as rules which let us rewrite the start symbol to other strings: each rule is interpreted as saying that we can replace an occurrence of the symbol on the left side of the rule by the symbol(s) on the right side.
For example, the above grammar lets us rewrite `S' to `aabb':
S |
ASB | Rule 2 |
aSB | Rule 3 |
aSb | Rule 4 |
aABb | Rule 1 |
aaBb | Rule 3 |
aabb | Rule 4 |
Such a sequence is called a
↗derivation of the symbols in the last row. For example, the above sequence is a derivation of the string `aabb'. Note that there may be many derivations of the same string. For example,
S |
ASB | Rule 2 |
ASb | Rule 4 |
aSb | Rule 3 |
aABb | Rule 1 |
aAbb | Rule 4 |
aabb | Rule 3 |
is another derivation of `aabb'.
Why are such grammars called `context free'?
Because all rules contain only one symbol on the left hand side --- and wherever we see that symbol while doing a derivation, we are free to replace it with the stuff on the right hand side. That is, the `context' in which a symbol on the left hand side of a rule occurs is unimportant --- we can
always use the rule to make the rewrite while doing a derivation.
There are more complex kinds of grammars, with more than one symbol on the left hand side of the rewrite arrow, in which the symbols to the right and left have to be taken into account before making a rewrite. Such grammars can be linguistically important, but we won't study them in this course. The
↗language generated by a context free grammar is the set of terminal symbols that can be derived starting from the start symbol `S'. For example, the above grammar generates the language
anbn∖{ε} (the language consisting of all strings consisting of a block of 'a's followed by a block of 'b's of equal length, except the empty string). And if we added the rule
S→ε to this grammar we would generate the language
anbn.
A language is called
↗context free if it is generated by some context free grammar. For example, the language
anbn∖{ε} is context free, and so is the language
anbn, for we have just seen that these can be produced by CFGs. Not all languages are context free. For example,
anbncn is not. That is, no matter how hard you try to find CFG rules that generate this language, you won't succeed. No CFG can do the job.