Algorithmic Forays Part 2
by Eli Bendersky

In the last article learned the basics of finite state machines and saw a simple application. As promised, we will now explore a much more interesting use for FSMs: regular expressions. If the topic of regular expressions is familiar to you, feel free to skip the first section, since it serves as a basic introduction to the subject.

Regular expressions (regexes)

Think of an identifier in C++ (such as this, temp_var2, assert etc.). How would you describe what qualifies for an identifier? Lets see if we remember... an identifier consists of letters, digits and the underscore character (_), and must start with a letter (it's possible to start with an underscore, but better not to, as these identifiers are reserved by the language).

A regular expression is a notation that allows us to define such things precisely:

letter (letter | digit | underscore) *

In regular expressions, the vertical bar | means "or", the parentheses are used to group sub expressions (just like in math) and the asterisk (*) means "zero or more instances of the previous". So the regular expression above defines the set of all C++ identifiers (a letter followed by zero or more letters, digits or underscores). Lets see some more examples:

abb*a - all words starting with a, ending with a, and having at least one b in-between. For example: "aba", "abba", "abbbba" are words that fit, but "aa" or "abab" are not.
0|1|2|3|4|5|6|7|8|9 - all digits. For example: "1" is a digit, "fx" is not.
x(y|z)*(a|b|c) - all words starting with x, then zero or more y or z, and end with a, b or c. For example: "xa", "xya", "xzc", "xyzzzyzyyyzb".
xyz - only the word "xyz".

There is another symbol we'll use: eps (usually denoted with the Greek letter Epsilon) eps means "nothing". So, for example the regular expression for "either xy or xyz" is: xy(z|eps).

People familiar with regexes know that there are more complicated forms than * and |. However, anything can be built from *, | and eps. For instance, x? (zero or one instance of x) is a shorthand for (x|eps). x+ (one or more instances of x) is a shorthand for xx*. Note also the interesting fact that * can be represented with +, namely: x* is equivalent to (x+)|eps.

[Perl programmers and those familiar with Perl syntax (Python programmers, that would include you) will recognize eps as a more elegant alternative to the numerical notation {m, n} where both m and n are zero. In this notation, x* is equivalent to x{0,} (unbound upper limit), x+ is x{1,} and all other cases can be built from these two base cases. - Ed.]

Recognizing strings with regexes

Usually a regex is implemented to solve some recognition problem. For example, suppose your application asks a question and the user should anwswer Yes or No. The legal input, expressed as a regex is (yes)|(no). Pretty simple and not too exciting - but things can get much more interesting.

Suppose we want to recognize the following regex: (a|b)*abb, namely all words consisting of a's and b's and ending with abb. For example: "ababb", "aaabbbaaabbbabb". Say you'd like to write the code that accepts such words an rejects others. The following function does the job:

bool recognize(string str)
{
    string::size_type len = str.length();
    
    // can't be shorter than 3 chars
    if (len < 3)
        return false;
    
    // last 3 chars must be "abb"
    if (str.substr(len - 3, 3) != "abb")
        return false;
    
    // must contain no chars other than "a" and "b"
    if (str.find_first_not_of("ab") != string::npos)
        return false;
    
    return true;
}

It's pretty clean and robust - it will recognize (a|b)*abb and reject anything else. However, it is clear that the techniques employed in the code are very "personal" to the regex at hand.
If we slightly change the regex to (a|b)*abb(a|b)*, for instance (all sequences of a's and b's that have abb somewhere in them), it would change the algorithm completely. (We'd then probably want to go over the string, a char at a time and record the appearance of abb. If the string ends and abb wasn't found, it's a rejection, etc.). It seems that for each regex, we should think of some algorithm to handle it, and this algorithm can be completely different from algorithms for other regexes.

So what is the solution? Is there any standartized way to handle regexes? Can we even dream of a general algorithm that can produce a recognizer function given a regex? We sure can (otherwise, what would this article be about :-) )! Read on.

FSMs to the rescue

It happens so that Finite State Machines are a very useful tool for regular expressions. More specifically, a regex (any regex!) can be represented as an FSM. To show how, however, we must present two additional definitions (which actually are very logical, assuming we use a FSM for a regex).

A start state is the state in which a FSM starts.
An accepting state is a final state in which the FSM returns with some answer.

It is best presented with an example:

The start state 0 is denoted with a "Start" arrow. 1 is the accepting state (it is denoted with the bold border). Now, try to figure out what regex this FSM represents.

It actually represents xy* - x and then 0 or more of y. Do you see how? Note that x leads the FSM to state 1, which is the accepting state. Adding y keeps the FSM in the accepting state. If a x appears when in state 1, the FSM moves to state 2, which is non accepting and "stuck", since any input keeps the FSM in state 2 (because xy* rejects strings where a x comes after y-s). But what happens with other letters ? For simplicity we'll now assume that for this FSM our language consists of solely x and y. If our input set would be larger (say the whole lowercase alphabet), we could define that each transition not shown (for instance, on input "z" in state 0) leads us into some "unaccepting" state.

I will now present the general algorithm for figuring out whether a given FSM recognizes a given word. It's called "FSM Simulation". But first lets define an auxiliary function: move(state, input) returns the state resulting from getting input in state state. For the sample FSM above, move(0, X) is 1, move (0, Y) is 0, etc.
So, the algorithm goes as follows:

state = start_state
input = get_next_input
while not end of input do
    state = move(state, input)
    input = get_next_input
end

if state is a final state
    return "ACCEPT"
else
    return "REJECT"

The algorithm is presented in very general terms and should be well understood. Lets "run" it on the simple xy* FSM, with the input "xyy". We start from the start state - 0; Get next input: x, end of input ? not yet; move(0, x) moves us to state 1; input now becomes y; not yet end of input; move(1, y) moves us to state 1; exactly the same with the second y; now it's end of input; state 1 is a final state, so "ACCEPT";

Piece of cake isn't it ? Well, it really is ! It's a straightforward algorithm and a very easy one for the computer to execute.

Lets go back to the regex we started with - (a|b)*abb. Here is the FSM that represents (recognizes) it:

Although it is much more complicated than the previous FSM, it is still simple to comprehend. This is the nature of FSMs - looking at them you can easily characterize the states they can be in, what transitions occur, and when. Again, note that for simplicity our alphabet consists of solely "a" and "b".

Paper and pencil is all you need to "run" the FSM Simulation algorithm on some simple string. I encourage you to do it, to understand better how this FSM relates to the (a|b)*abb regex.

Just for example take a look at the final state - 3. How can we reach it ? From 2 with the input b. How can we reach 2 ? From 1 with input b. How can we reach 1 ? Actually from any state with input a. So, "abb" leads us to accept a string, and it indeed fits the regex.

Coding it the FSM way

I want to go back to the promise I gave in the beginning of this article. I said that it's possible to generate code straight from any regex. That is, given a regular expression, a general tool can generate the code that will correctly recognize all strings that fit the regex, and reject the others. Lets see what it takes...

The task is indeed massive, but consider dividing it to two distinct stages:

Convert the regex to a FSM
Write FSM code to recognize strings

Hey, we already know how to do the second stage! It is actually the FSM Simulation algorithm we saw earlier. Most of the algorithm is the same from FSM to FSM (just dealing with input). The only part that changes is the "move" function, which represents the transition diagram of some FSM, and we learned how to code the transition function in the pvery basic, the technique is the same for any FSM!

Let's now write down the full code that recognizes (a|b)*abb.

#include <iostream>
#include <cassert>
#include <string>

using namespace std;

typedef int     fsm_state;
typedef char    fsm_input;


bool is_final_state(fsm_state state)
{
    return (state == 3) ? true : false;
}


fsm_state get_start_state(void)
{
    return 0;
}


fsm_state move(fsm_state state, fsm_input input)
{
    // our alphabet includes only 'a' and 'b'
    if (input != 'a' && input != 'b')
        assert(0);
    
    switch (state)
    {
        case 0:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 0;
            }
            break;
        
        case 1:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 2;
            }
            break;
        
        case 2:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 3;
            }
            break;
        
        case 3:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 0;
            }
            break;
        
        default:
            assert(0);
    }
}


bool recognize(string str)
{
    if (str == "")
        return false;
    
    fsm_state state = get_start_state();
    
    string::const_iterator i = str.begin();
    fsm_input input = *i;
    
    while (i != str.end())
    {
        state = move(state, *i);
        ++i;
    }
    
    if (is_final_state(state))
        return true;
    else
        return false;
}

// simple driver for testing
int main(int argc, char** argv)
{
    recognize(argv[1]) ? cout << 1 : cout << 0;

    return 0;
}

Take a good look at the recognize function. You should immediately see how closely it follows the FSM Simulation algorithm. The FSM is initialized to the start state, and the first input is read. Then, in a loop, the machine moves to its next state and fetches the next input, etc. until the input string ends. Eventually, we check whether we reached a final state.

Note that this recognize function will be the same for any regex. The only functions that change are the trivial is_final_state and get_start_state, and the more complicated transition function move. But move is very structural - it closely follows the graphical description of the FSM. As you'll see in future articles, such transition functions are easily generated from the description.

So, what have we got so far? We know how to write code that runs a state machine on a string. What don't we know? We still don't know how to generate the FSM from a regex. This is what the following articles are about. See you there!

Discuss this article in the forums

Date this article was posted to GameDev.net: 3/29/2004
(Note that this date does not necessarily correspond to the date the article was written)

See Also:
Featured Articles
General