Last time we met, you learned about Thompson's Construction - an algorithm for building NFAs from regular expressions. I also presented a "starter" implementation of a basic NFA class in C++. In this article, you'll see the connection between these two. I will present a complete C++ implementation of Thompson's Construction. Additionally, you will see how to represent regular expressions with parse trees, and get acquainted with the code that builds complete NFAs from these regular expressions. Implementing Thompson's ConstructionThompson's Construction, as you probably remember (If you don't, it is a good time to skim through the previous article) tells us how to build NFAs from trivial regular expressions and then compose them into more complex NFAs. Let's start with the basics: The simplest regular expressionThe most basic regular expression is just some single character, for example a. The NFA for such a regex is: Here is the implementation: // Builds a basic, single input NFA // NFA build_nfa_basic(input in) { NFA basic(2, 0, 1); basic.add_trans(0, 1, in); return basic; } Just to remind you about our NFA implementation: The first line of the function creates a new (with no transitions yet) NFA of size 2 (that is: with 2 states), and sets state 0 to be the initial state and state 1 to be the final state. The second line adds a transition to the NFA that says "in moves from state 0 to state 1". That's it - a simple regex, a simple construction procedure. Note that this procedure is suited for the construction of an eps transition as well (it was presented separately in the last article's discussion of Thompson's Construction). Some changes to the NFA classLast article's implementation of a simple NFA class was just a starting point and we have quite a few changes to make. First of all, we need direct access to all of the class's data. Instead of providing get and set accessors (which I personally dislike), all of the class members (size, initial, final and trans_table) have been made public. Recall what I told you about the internal representation inside the NFA class - it's a matrix representing the transition table of a graph. For each i and j, trans_table[i][j] is the input that takes the NFA from state i to state j. It's NONE if there's no such transition (hence a lot of space is wasted - the matrix representation, while fast, is inefficient in memory). Several new operations were added to the NFA class for our use in the NFA building functions. Their implementation can be found in nfa.cpp (one of the files in the .zip that comes with this article). For now, try to understand how they work (it's really simple stuff), later you'll see why we need them for the implementation of various Thompson Construction stages. It may be useful to have the code of nfa.cpp open in an editor, and to follow the code for the operations while reading these explanations. Here they are:
Now I will explain how the more complex operations of Thompson's Construction are implemented. You should understand how the operations demonstrated above work, and also have looked at their source code (a good example of our NFA's internal table manipulation). You may still lack the "feel" of why these operations are needed, but this will soon be covered. Just understand how they work, for now. Implementing alternation: a|bHere is the diagram of NFA alternation from the previous article: Implementing concatenation: abHere is the diagram of NFA concatenation from the previous article: Implementing star: a*Here is the diagram of the NFA for a* from the previous article: Specialty of NFAs constructed by Thompson's ConstructionYou might have observed that all the NFAs constructed by Thompson's Construction have some very specific behavior. For instance, all the basic building blocks for single letters are similar, and the rest of the constructions just create new links between these states to allow for the alternation, concatenation and star operations. These NFAs are also special implementation-wise. For instance, note that in our NFA implementation, the first state is always the initial, and the last state is always final. You may have noted that this is useful in several operations. A complete NFA construction implementationWith these operations implemented, we now have a full NFA construction implementation in nfa.cpp! For instance, the regex (a|b)*abb can be built as follows: NFA a = build_nfa_basic('a'); NFA b = build_nfa_basic('b'); NFA alt = build_nfa_alter(a, b); NFA str = build_nfa_star(alt); NFA sa = build_nfa_concat(str, a); NFA sab = build_nfa_concat(sa, b); NFA sabb = build_nfa_concat(sab, b); With these steps completed, sabb is the NFA representing (a|b)*abb . Note how simple it's to build NFAs this way! There's no need to specify individual transitions like we did before. In fact, it's not necessary to understand NFAs at all - just build the desired regex from its basic blocks, and that's it. Even closer to complete automationThough it has now become much simpler to construct NFAs from regular expressions, it's still not as automatic as we'd like it to be. One still has to explicitly specify the regex structure. A useful representation of structure is an expression tree. For example, the construction above closely reflects this tree structure:
Such trees in the world of parsing and compilation are called expression trees, or parse trees. For example, to implement an infix calculator (one that can calculate, for example: 3*4 + 5), the expressions are first turned into parse trees, and then these parse trees are walked to make the calculations. Note, this is, as always, an issue of representation. We have the regex in a textual representation: (a|b)*abb and we want it in NFA representation. Now we're wondering how to turn it from one representation to the other. My solution is to use an intermediate representation - a parse tree. Going from a regex to a parse tree is similar to parsing arithmetic expressions, and going from a parse tree to a NFA will be now demonstrated, using the Thompson's Construction building blocks described in this chapter. From a parse tree to a NFAA parse tree in our case is just a binary tree, since no operation has more than two arguments. Concatenation and alternations have two arguments, hence their nodes in the tree have two children. Star has one argument - hence only the left child. Chars are the tree leaves. Take a good look at the tree drawn above, you'll see this very clearly. Take a look at the file regex_parse.cpp that comes with this article. It has alot in it, but you only need to focus only on some specific things for now. First, let's look at the definition of parse_node: typedef enum {CHR, STAR, ALTER, CONCAT} node_type; // Parse node // struct parse_node { parse_node(node_type type_, char data_, parse_node* left_, parse_node* right_) : type(type_), data(data_), left(left_), right(right_) {} node_type type; char data; parse_node* left; parse_node* right; }; This is a completely normal definition of a binary tree node that contains data and some type by which it is identified. Let us ignore, for the moment, the question of how such trees are built from regexes (if you're very curious - it's all in regex_parse.cpp), and instead think about how to build NFAs from such trees. It's very straight forward since the parse tree representation is natural for regexes. Here is the code of the tree_to_nfa function from regex_parse.cpp: NFA tree_to_nfa(parse_node* tree) { assert(tree); switch (tree->type) { case CHR: return build_nfa_basic(tree->data); case ALTER: return build_nfa_alter(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case CONCAT: return build_nfa_concat(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case STAR: return build_nfa_star(tree_to_nfa(tree->left)); default: assert(0); } } Not much of a rocket science, is it? The power of recursion and trees allows us to build NFAs from parse trees in just 18 lines of code... From a regex to a parse treeIf you've already looked at regex_parse.cpp, you surely noted that it contains quite a lot of code, much more than I've show so far. This code is the construction of parse trees from actual regexes (strings like (a|b)*abb). I really hate to do this, but I won't explain how this regex-to-tree code works. This article series is complicated enough without getting into parsing. You'll just have to believe me it works (or study the code - it's there!). As a note, the parsing technique I employ to turn regexes into parse trees is called Recursive Descent parsing. It's an exciting topic and there is plenty of information available on it, if you are interested. We've come a long way...Let's see what we have accomplished: we have implemented the complete process of taking regular expressions and turning them into NFAs - automatically! Our program turns the regexes into parse trees, walks these parse trees and creates NFAs from them using Thompson's Construction. To see this in action, compile the files nfa.cpp and regex_parse.cpp together (make sure nfa.h resides in the same directory). This would be the command for those who use GCC:
g++ -o regex2nfa regex_parse.cpp nfa.cpp regex_parse.cpp contains a small main function that takes the first argument and displays the NFA resulting from it. For instance, running:
regex2nfa "(a|b)*abb" Displays: This NFA has 11 states: 0 - 10 The initial state is 0 The final state is 10 Transition from 0 to 1 on input EPS Transition from 0 to 7 on input EPS Transition from 1 to 2 on input EPS Transition from 1 to 4 on input EPS Transition from 2 to 3 on input a Transition from 3 to 6 on input EPS Transition from 4 to 5 on input b Transition from 5 to 6 on input EPS Transition from 6 to 1 on input EPS Transition from 6 to 7 on input EPS Transition from 7 to 8 on input a Transition from 8 to 9 on input b Transition from 9 to 10 on input b Which is, if you recall, the exact same NFA we dutifully crafted by hand for (a|b)*abb in the previous article. Now, we can generate it automatically! A note about the source codeAlong with this article, comes a .zip file containing all the code. It currently consists of 3 files: nfa.h, nfa.cpp and regex_parse.cpp (which contains a main() function). All the code, like all my future code for the series is in the public domain - you may do anything you wish with it. If you have any problem compiling and running it, feel free to contact me, I'll help. I'd also be quite happy to receive feedback and bug reports. © Copyright by Eli Bendersky, 2003. All rights reserved. |
|