Regular Languages

Abstract:
In this lecture we will
  1. introduce regular expressions, an alternative notation for specifying languages that are accepted by FSAs.
  2. look a bit closer at the relation between regular expressions and FSAs and provide an algorithm for constructing FSAs from regular expressions.
  3. learn about certain nice properties of regular languages (the languages characterized by regular expressions or FSAs) and see how to make use of these properties in an implementation.

Table of Contents

Regular Languages and Relations
Regular languages are exactly those languages that can be represented by regular expressions. Every automaton corresponds to a regular expression and thus to a regular language.

Properties of Regular Languages
We will now leave the topic of transducers and rewrite rules and come back to regular languages. This class of languages has a number of interesting properties, of which the following will concern us in this section:
  1. Closure properties
    If we apply the operations of concatenation and Kleene closure, as well as the basic set theoretic operations of union, intersection and complementation to any (pair of) regular languages, the result will always be a regular language, too. We will discuss why this is so, and see how we can construct automata for the languages resulting from these operations based on the one(s) for the respective input language(s). In » Implementing Operations on FSAs we will implement some of these construction rules.
  2. Pumping Lemma
    It can be proven that if a regular language contains any word exceding a minimum length, the language must also contain all words that are related to that word in a certain way. This result is interesting because it allows us to conclude that a language that does not have this property cannot be a regular language. We will make use of this kind of argument in the next chapter.

Implementing Operations on FSAs
We discussed several properties of regular languages in the last sections, and almost all proofs we saw were conducted so as to lead directly to corresponding methods for constructing and transforming FSAs. In this section, we will implement some of these methods. We will set up a general framework for operations on FSAs. Within this framework we will give implementations for the operations of union and intersection, and leave the implementation of further ioperations to you as an exercise.

The Complete Code
Code Summary for this Chapter.

Practical Session
Hint: In newer versions of SWI-Prolog (at least) you can use the built-in predicate atom_chars/2 to convert an atom into a list of one-character atoms. For instance calling

atom_chars(hallo, X).

yields:
X = [h, a, l, l, o]

Further Reading


Exercise

  1. What language does the regular expression (hi(ha(ho))) describe? Sketch the step-by-setp construction of an automaton along the lines of our proof in » From Regular Expressions to FSAs. To make this task a bit less cumbersome, here are some simplifications that you can make:
    • You don't have to treat hi, ha, and ho as complex, that is you can read each of them as a whole word over one edge.
    • You don't have to give names to the states. Just draw empty circles.
    • The solution you hand in needn't contain all intermediate step. That means you can re-use for the next one what you've drawn in one step and don't have to copy over and over.
  2. Now draw an automaton for (hi(ha(ho))) as you would normally do it (that is, don't stick to the proof in » From Regular Expressions to FSAs). Compare!
  3. In » From Regular Expressions to FSAs we looked at one direction of the equivalence between regular expressions and FSAs. We showed that every language that can be characterized by a regular expression is also accepted by some FSA. Now let's look at the other direction and prove that for every FSA there is a regular expression that characterizes the same language.
    Let M=({q1,,qn},Σ,δ,q1,F) be a deterministic FSA, and L be the language accepted by it. We are going to show that there is also a regular expression for L.
    Before we begin, let's define the following notion: Rijk are those strings that M reads when it goes from i to j without passing through a state numbered higher than k (by passing through, we mean entering and leaving again). So i and j may have numbers greater than k. We will use Rijk to "zoom" in and out of our automaton, determining that only a part of it may be used.
    Now our induction will be over k. Starting from single states we successively allow ourselves to use larger parts of our automaton. We have to show that (given any choice of i,jn):
    • Basis
      There is a regular expression that characterizes Rij0. This is the language of all strings that take M from i to j without going through any state numbered higher than 0 (that is infact, without going through any state at all).
    • Induction
      There is a regular expression for Rijk, given that there is one for Rlmk1 for any choice of l,mn.
    Give proofs for basis and induction! To prove the basis, you can of course simply enumerate as a disjunction all the words that can be read in M in one step from i to j. For the inductive step, think about the following: If a word can be read from i to j in M without going through any state higher than k, it can always be split into a sequence of parts none of which takes M through a state higher than k1 (in the sense of entering and leaving). How can this be done?
  4. Add a "pretty-print" operation (you can call it pr) to the operations defined in » Implementing Operations on FSAs. It should use the backbone of the scan/1-predicate to crawl along an automaton and print out messages with information about any state it visits and about any edge it passes.
    For instance loading haha.pl and then calling scan(pr(a1)). should result in output like the following:
    automaton a1:
    initial state 1
    from 1 to 2 over h
    
    etc...
    
    final state 4
    yes
    
    Hint:
    Hint: One way of producing formated output in Prolog is using the format/2 predicate. The first argument of format/2 is a string (i.e. something that's enclosed between double quotes). This string may contain one or more placeholders. The second argument is a list of things to be printed instead of these placeholders.
    For instance the call
    
    format("My ~w has ~w legs.~n My ~w has ~w legs.~n", [dog, 4, sister, 2]).
    
    
    produces the following output:
    
    My dog has 4 legs.
    
    
    My sister has 2 legs.
    
    yes
    
    The string given as first argument in this call contains the placeholder symbol ~w four times. These four placeholders are then replaced by the items on the list in the order in which they occur. Furthermore the strings contains the special symbol ~n twice. This is printed as a linebreak.
    There are lots of other ways of using the format-predicate, for instance with other kinds of placeholders and for file output. You will find them documented with most Prolog implementations. For this exercise, ~w and ~n should do.
  5. In » Intersection, Complementation we saw how to build complement automata, that is how to make an automaton M that accepts some language L into an automaton M¯ that accepts L¯.
    The method we discussed there works for FSAs that are complete, deterministic and ε-free. We did not implement a complementation-operation in our Prolog system in » Implementing Operations on FSAs because we did not want to bother with establishing these properties.
    Add a complementation-operation comp to the implemetation. Your operation may simply assume that the input FSA is complete, deterministic and ε-free. The file deterministic.pl contains a complete, deterministic (and of course ε-free) version of our very first laughing machine. Use this file to test you implementation.