GameDev.netAlgorithmic Forays Part 2

Algorithmic Forays Part 2
In the last article learned the basics of finite state machines and saw a simple application. As promised, we will now explore a much more interesting use for FSMs: regular expressions. If the topic of regular expressions is familiar to you, feel free to skip the first section, since it serves as a basic introduction to the subject. ## Regular expressions (regexes)Think of an identifier in C++ (such as A regular expression is a notation that allows us to define such things precisely: letter (letter | digit | underscore) * In regular expressions, the vertical bar | means "or", the parentheses are used to group sub expressions (just like in math) and the asterisk (*) means "zero or more instances of the previous". So the regular expression above defines the set of all C++ identifiers (a letter followed by zero or more letters, digits or underscores). Lets see some more examples: **abb*a**- all words starting with a, ending with a, and having at least one b in-between. For example: "aba", "abba", "abbbba" are words that fit, but "aa" or "abab" are not.**0|1|2|3|4|5|6|7|8|9**- all digits. For example: "1" is a digit, "fx" is not.**x(y|z)*(a|b|c)**- all words starting with x, then zero or more y or z, and end with a, b or c. For example: "xa", "xya", "xzc", "xyzzzyzyyyzb".**xyz**- only the word "xyz".
There is another symbol we'll use: People familiar with regexes know that there are more complicated forms than * and |. However, anything can be built from *, | and x+ (one or more instances of x) is a shorthand for xx*. Note also the interesting fact that * can be represented with +, namely: x* is equivalent to (x+)|.eps
## Recognizing strings with regexesUsually a regex is implemented to solve some recognition problem. For example, suppose your application asks a question and the user should anwswer Yes or No. The legal input, expressed as a regex is Suppose we want to recognize the following regex: bool recognize(string str) { string::size_type len = str.length(); // can't be shorter than 3 chars if (len < 3) return false; // last 3 chars must be "abb" if (str.substr(len - 3, 3) != "abb") return false; // must contain no chars other than "a" and "b" if (str.find_first_not_of("ab") != string::npos) return false; return true; } It's pretty clean and robust - it will recognize So what is the solution? Is there any standartized way to handle regexes? Can we even dream of a general algorithm that can produce a recognizer function given a regex? We sure can (otherwise, what would this article be about :-) )! Read on. ## FSMs to the rescueIt happens so that Finite State Machines are a very useful tool for regular expressions. More specifically, a regex (any regex!) can be represented as an FSM. To show how, however, we must present two additional definitions (which actually are very logical, assuming we use a FSM for a regex). - A start state is the state in which a FSM starts.
- An accepting state is a final state in which the FSM returns with some answer.
It is best presented with an example:
The start state 0 is denoted with a "Start" arrow. 1 is the accepting state (it is denoted with the bold border). Now, try to figure out what regex this FSM represents. It actually represents I will now present the general algorithm for figuring out whether a given FSM recognizes a given word. It's called "FSM Simulation". But first lets define an auxiliary function: state = start_state input = get_next_input while not end of input do state = move(state, input) input = get_next_input end if state is a final state return "ACCEPT" else return "REJECT" The algorithm is presented in very general terms and should be well understood. Lets "run" it on the simple Piece of cake isn't it ? Well, it really is ! It's a straightforward algorithm and a very easy one for the computer to execute. Lets go back to the regex we started with -
Although it is much more complicated than the previous FSM, it is still simple to comprehend. This is the nature of FSMs - looking at them you can easily characterize the states they can be in, what transitions occur, and when. Again, note that for simplicity our alphabet consists of solely "a" and "b". Paper and pencil is all you need to "run" the FSM Simulation algorithm on some simple string. I encourage you to do it, to understand better how this FSM relates to the Just for example take a look at the final state - 3. How can we reach it ? From 2 with the input b. How can we reach 2 ? From 1 with input b. How can we reach 1 ? Actually from any state with input a. So, "abb" leads us to accept a string, and it indeed fits the regex. ## Coding it the FSM wayI want to go back to the promise I gave in the beginning of this article. I said that it's possible to generate code straight from any regex. That is, given a regular expression, a general tool can generate the code that will correctly recognize all strings that fit the regex, and reject the others. Lets see what it takes... The task is indeed massive, but consider dividing it to two distinct stages: - Convert the regex to a FSM
- Write FSM code to recognize strings
Hey, we already know how to do the second stage! It is actually the FSM Simulation algorithm we saw earlier. Most of the algorithm is the same from FSM to FSM (just dealing with input). The only part that changes is the "move" function, which represents the transition diagram of some FSM, and we learned how to code the transition function in the pvery basic, the technique is the same for any FSM! Let's now write down the full code that recognizes #include <iostream> #include <cassert> #include <string> using namespace std; typedef int fsm_state; typedef char fsm_input; bool is_final_state(fsm_state state) { return (state == 3) ? true : false; } fsm_state get_start_state(void) { return 0; } fsm_state move(fsm_state state, fsm_input input) { // our alphabet includes only 'a' and 'b' if (input != 'a' && input != 'b') assert(0); switch (state) { case 0: if (input == 'a') { return 1; } else if (input == 'b') { return 0; } break; case 1: if (input == 'a') { return 1; } else if (input == 'b') { return 2; } break; case 2: if (input == 'a') { return 1; } else if (input == 'b') { return 3; } break; case 3: if (input == 'a') { return 1; } else if (input == 'b') { return 0; } break; default: assert(0); } } bool recognize(string str) { if (str == "") return false; fsm_state state = get_start_state(); string::const_iterator i = str.begin(); fsm_input input = *i; while (i != str.end()) { state = move(state, *i); ++i; } if (is_final_state(state)) return true; else return false; } // simple driver for testing int main(int argc, char** argv) { recognize(argv[1]) ? cout << 1 : cout << 0; return 0; } Take a good look at the Note that this So, what have we got so far? We know how to write code that runs a state machine on a string. What don't we know? We still don't know how to generate the FSM from a regex. This is what the following articles are about. See you there! © Copyright by Eli Bendersky, 2003. All rights reserved.
© 1999-2011 Gamedev.net. All rights reserved. |