GameDev.netAlgorithmic Forays Part 5

Algorithmic Forays Part 5
Last time we met, you learned about Thompson's Construction - an algorithm for building NFAs from regular expressions. I also presented a "starter" implementation of a basic NFA class in C++. In this article, you'll see the connection between these two. I will present a complete C++ implementation of Thompson's Construction. Additionally, you will see how to represent regular expressions with parse trees, and get acquainted with the code that builds complete NFAs from these regular expressions. ## Implementing Thompson's ConstructionThompson's Construction, as you probably remember (If you don't, it is a good time to skim through the previous article) tells us how to build NFAs from trivial regular expressions and then compose them into more complex NFAs. Let's start with the basics: ## The simplest regular expressionThe most basic regular expression is just some single character, for example Here is the implementation: // Builds a basic, single input NFA // NFA build_nfa_basic(input in) { NFA basic(2, 0, 1); basic.add_trans(0, 1, in); return basic; } Just to remind you about our NFA implementation: The first line of the function creates a new (with no transitions yet) NFA of size 2 (that is: with 2 states), and sets state 0 to be the initial state and state 1 to be the final state. The second line adds a transition to the NFA that says " Note that this procedure is suited for the construction of an ## Some changes to the NFA classLast article's implementation of a simple NFA class was just a starting point and we have quite a few changes to make. First of all, we need direct access to all of the class's data. Instead of providing get and set accessors (which I personally dislike), all of the class members ( Recall what I told you about the internal representation inside the NFA class - it's a matrix representing the transition table of a graph. For each Several new operations were added to the NFA class for our use in the NFA building functions. Their implementation can be found in nfa.cpp (one of the files in the .zip that comes with this article). For now, try to understand how they work (it's really simple stuff), later you'll see why we need them for the implementation of various Thompson Construction stages. It may be useful to have the code of nfa.cpp open in an editor, and to follow the code for the operations while reading these explanations. Here they are: - I want to append a new, empty state to the NFA. This state will have no transitions to it and no transitions from it. If this is the transition table before the appending (a sample table with size 5 - states 0 to 4):`append_empty_state`
Then this is the table after the appending:
The shaded cells are the transitions of the original table (be they empty or not), and the white cells are the new table cells - containing`NONE`.
- I want to rename all NFA's states, shifting them "higher" by some given number. For instance, if I have 5 states numbered 0 - 4, and I want to have the same states, just named 2 - 6, I will call`shift_states``shift_states(2)`, and will get the following table:
- I want to copy the states of some other NFA into the first table cells of my NFA. For instance, if I take the shifted table from above, and fill its first two states with a new small NFA, I will get (the new NFA's states are sky blue):`fill_states`
Note that using`fill_states`after`shift_states`is not incidental. These two operations were created to be used together - to concatenate two NFAs. You'll see how they are employed shortly.
Now I will explain how the more complex operations of Thompson's Construction are implemented. You should understand how the operations demonstrated above work, and also have looked at their source code (a good example of our NFA's internal table manipulation). You may still lack the "feel" of why these operations are needed, but this will soon be covered. Just understand how they work, for now. ## Implementing alternation: a|bHere is the diagram of NFA alternation from the previous article: ## Implementing concatenation: abHere is the diagram of NFA concatenation from the previous article: ## Implementing star: a*Here is the diagram of the NFA for ## Specialty of NFAs constructed by Thompson's ConstructionYou might have observed that all the NFAs constructed by Thompson's Construction have some very specific behavior. For instance, all the basic building blocks for single letters are similar, and the rest of the constructions just create new links between these states to allow for the alternation, concatenation and star operations. These NFAs are also special implementation-wise. For instance, note that in our NFA implementation, the first state is always the initial, and the last state is always final. You may have noted that this is useful in several operations. ## A complete NFA construction implementationWith these operations implemented, we now have a full NFA construction implementation in nfa.cpp! For instance, the regex NFA a = build_nfa_basic('a'); NFA b = build_nfa_basic('b'); NFA alt = build_nfa_alter(a, b); NFA str = build_nfa_star(alt); NFA sa = build_nfa_concat(str, a); NFA sab = build_nfa_concat(sa, b); NFA sabb = build_nfa_concat(sab, b); With these steps completed, ## Even closer to complete automationThough it has now become much simpler to construct NFAs from regular expressions, it's still not as automatic as we'd like it to be. One still has to explicitly specify the regex structure. A useful representation of structure is an expression tree. For example, the construction above closely reflects this tree structure:
Such trees in the world of parsing and compilation are called expression trees, or parse trees. For example, to implement an infix calculator (one that can calculate, for example: 3*4 + 5), the expressions are first turned into parse trees, and then these parse trees are walked to make the calculations. Note, this is, as always, an issue of representation. We have the regex in a textual representation: ## From a parse tree to a NFAA parse tree in our case is just a binary tree, since no operation has more than two arguments. Concatenation and alternations have two arguments, hence their nodes in the tree have two children. Star has one argument - hence only the left child. Chars are the tree leaves. Take a good look at the tree drawn above, you'll see this very clearly. Take a look at the file regex_parse.cpp that comes with this article. It has alot in it, but you only need to focus only on some specific things for now. First, let's look at the definition of typedef enum {CHR, STAR, ALTER, CONCAT} node_type; // Parse node // struct parse_node { parse_node(node_type type_, char data_, parse_node* left_, parse_node* right_) : type(type_), data(data_), left(left_), right(right_) {} node_type type; char data; parse_node* left; parse_node* right; }; This is a completely normal definition of a binary tree node that contains data and some type by which it is identified. Let us ignore, for the moment, the question of how such trees are built from regexes (if you're very curious - it's all in regex_parse.cpp), and instead think about how to build NFAs from such trees. It's very straight forward since the parse tree representation is natural for regexes. Here is the code of the NFA tree_to_nfa(parse_node* tree) { assert(tree); switch (tree->type) { case CHR: return build_nfa_basic(tree->data); case ALTER: return build_nfa_alter(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case CONCAT: return build_nfa_concat(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case STAR: return build_nfa_star(tree_to_nfa(tree->left)); default: assert(0); } } Not much of a rocket science, is it? The power of recursion and trees allows us to build NFAs from parse trees in just 18 lines of code... ## From a regex to a parse treeIf you've already looked at regex_parse.cpp, you surely noted that it contains quite a lot of code, much more than I've show so far. This code is the construction of parse trees from actual regexes (strings like I really hate to do this, but I won't explain how this regex-to-tree code works. This article series is complicated enough without getting into parsing. You'll just have to believe me it works (or study the code - it's there!). As a note, the parsing technique I employ to turn regexes into parse trees is called Recursive Descent parsing. It's an exciting topic and there is plenty of information available on it, if you are interested. ## We've come a long way...Let's see what we have accomplished: we have implemented the complete process of taking regular expressions and turning them into NFAs - automatically! Our program turns the regexes into parse trees, walks these parse trees and creates NFAs from them using Thompson's Construction. To see this in action, compile the files nfa.cpp and regex_parse.cpp together (make sure nfa.h resides in the same directory). This would be the command for those who use GCC:
g++ -o regex2nfa regex_parse.cpp nfa.cpp regex_parse.cpp contains a small main function that takes the first argument and displays the NFA resulting from it. For instance, running:
regex2nfa "(a|b)*abb" Displays: This NFA has 11 states: 0 - 10 The initial state is 0 The final state is 10 Transition from 0 to 1 on input EPS Transition from 0 to 7 on input EPS Transition from 1 to 2 on input EPS Transition from 1 to 4 on input EPS Transition from 2 to 3 on input a Transition from 3 to 6 on input EPS Transition from 4 to 5 on input b Transition from 5 to 6 on input EPS Transition from 6 to 1 on input EPS Transition from 6 to 7 on input EPS Transition from 7 to 8 on input a Transition from 8 to 9 on input b Transition from 9 to 10 on input b Which is, if you recall, the exact same NFA we dutifully crafted by hand for ## A note about the source codeAlong with this article, comes a .zip file containing all the code. It currently consists of 3 files: nfa.h, nfa.cpp and regex_parse.cpp (which contains a main() function). All the code, like all my future code for the series is in the public domain - you may do anything you wish with it. If you have any problem compiling and running it, feel free to contact me, I'll help. I'd also be quite happy to receive feedback and bug reports. © Copyright by Eli Bendersky, 2003. All rights reserved.
© 1999-2011 Gamedev.net. All rights reserved. |