GameDev.netAn Introduction to Hash Tables with Chaining

An Introduction to Hash Tables with Chaining
## Contents## Meet the Hash Table...Hash tables are data structures that are used when you are managing a large amount of data and need to be able find an item quickly. They are used a lot in compilers, for example, to quickly find symbol table entries for identifiers in the source code. They could also be used in a text game to look items up in the player's inventory. Any program which may need to look data up by name could use a hash table. ## How Does It WorkThe Although in theory this is great and all, there are always a few snags (of course). First, it is obvious that if the table is a fixed size there will be a limit on the number of entries. Second, how does the hash avoid sending two or more strings to the same index? Unfortunately, the solution to the second problem is beyond the scope of this article. It's called In this article, we'll look at one way to solve these problems and implement a hash table. The method is called ## What Do I Need to Know?Just C for all the code. It's pretty basic stuff, a little pointer arithmetic, but not much. A handy guide to the C library functions might help if you don't have them memorized. You will need to know about linked lists, so check out a tutorial on that first, if you need to. Lastly, a little knowledge of number theory might make the math more interesting, but you don't really need to know it (you can skip the math entirely and still write the code with no problem, for sure). ## ChainingChaining resolves collision by using a linked list. (If you don't know about linked lists, check out a tutorial on them before going on.) We create a linked list at each entry of the table. When a string hashes to an index, we stick it on the list. Note that this solves both collision and size. First, we'll create a structure for an entry. It will simply contain a string field and a typedef struct ENTRY_t /* hash table entry */ { char name [SIG_CHARS + 1]; struct ENTRY_t *next; /* your specific fields here */ } ENTRY; Any other fields can be filled in We'll have an array of pointers to these entries (the heads of the lists) which will comprise the table. The size of the array will be a prime number. For those who don't know, a prime number is one that can't be divided evenly by any other number except one (the number 1). 2, 3, 5, 7, 11 and 13 are the first few prime numbers. We'll need a bigger one, though, so let's go with, say, 383. You can choose the size to fit your needs, however some sizes work better than others (I'll come back to this). #define PRIME 383 ENTRY *table [PRIME]; /* each element is a linked list */ Now, we have an array of linked lists, and this is our table. The next step is to write the hash. This is a function that will, in our case, transform any string into an integer in the range 0 to 382. This is usually done in two steps: convert the string into an integer; get the integer into the correct range. The second part is easy. There are several ways to do it, but the simplest is to just take the remainder of the integer when divided by the size of the table (in other words, the number mod PRIME). As for the first part, there are many, many ways to accomplish this. For example, you could just add up the ASCII values of the characters in the string. Even better, we should make the hash dependant on the order of the characters. A standard function looks something like this: unsigned int Hash (const char *name) { unsigned int hash = 0; /* transform the string into an integer */ while (*name != '\0') { hash = hash * MULTIPLIER + *name; ++ name; } /* reduce the integer to the correct range */ return hash % PRIME; } The function adds up the ASCII values of the string, but it So, now the rest is easy. When we get a string, we will hash it and add it to the list at the correct index. Our funtion will be called Enter, it will take a string and return a newly created ENTRY* (note that no check is done to see if the name is already in the table): ENTRY *Enter (const char *name) { unsigned int index = Hash (name); ENTRY *node = malloc (sizeof (ENTRY)); if (node == NULL) return NULL; /* initialize the node (copy the name into it and anything else) */ strncpy (node->name, name, SIG_CHARS); node->name [SIG_CHARS] = '\0'; /* put the new node at the head of the list at table [index] */ node->next = table [index]; table [index] = node; return node; } The new node is put at the head of the list. Note that even though only SIG_CHARS characters are stored in the entry, the hash actually sees more than that. So if two different strings have the same first 10 characters, and SIG_CHARS is 10, they may still hash to different indexes - which is good. Now, we can write functions for lookup, deletion and whatever else, but all the hashing stuff will be the same. To look a string up, just get the index from the hash function and compare the string with the name of each node in the list at that index. Either the string is found and you return a pointer to the node, or it isn't there and you return NULL. As long as you are up on your linked lists, you shouldn't have any problems. ## How do I Pick MULTIPLIER and the Size of the TableLet's first introduce some terminology here (I am introducing my own terms, they may or may not be used by other people). Call the first hash - changing the string to an integer - the Let's look at the secondary hash first. Before, I said that the size (call it SIZE from now on) should be prime. But why? Now I have a gripe because I looked for this in a number of books and sites and couldn't find any good mathematical reasons. They all said something about patterns, but that was it. So I ended up figuring it out myself, after some preliminary head scratching, and now I bring it to you. Skip the next three paragraphs if you just don't care about the math. Call the result from the primary hash H. The secondary hash is then H % SIZE. Then what makes H % SIZE good or bad. Well, observe what happens if H and SIZE have a common factor, say F. So H = F * H' and SIZE = F * SIZE', where H' and SIZE' are what's left when F is divided out. Looking at (F * H') % (F * SIZE'), we note that the F If you want proof, it's a simple use of the division algorithm. We are looking for (ab) mod (ac), which is r in the equation ab = acq + r (r is in the range 0 to ac - 1). Since r = ab - acq, r is divisible by a. So let r' = r / a. Now, dividing out a we get b = cq + r'. If we can show that r' is in the range 0 to c - 1, then r' will be b mod c. But if r' is greater than or equal to c then r is greater than or equal to ac (by multiplying both sides by a), and this is a contradiction. Hence, r' = b mod c and ar' = r = a (b mod c) = (ab) mod (ac), the far right side being our original equation. So what? OK, say that SIZE is divisible by 2. Then any time H is divisible by 2, H % SIZE will be divisible by 2. Using a negative argument, any time H is not divisible by 2, H % SIZE won't be either. In other words, as far as the secondary hash is concerned, even numbers hash to even indexes and odd numbers hash to odd indexes. If SIZE was also divisible by 3, say, then multiples of 3 would hash to multiples of 3 and non-multiples of 3 would hash to non-multiples of 3. It still doesn't seem like a big deal because we would expect half the numbers to be even and the other half to be odd. Unfortunately, this is very unlikely. Now we see what those other articles meant when they talked about patterns! The sample set is more likely to be biased (especially, the smaller it is) and the problem is the the secondary hash is I've seen two seemingly contradictory views on which prime should be chosen for the size of the table. One says that it should be as near to a power of 2 as possible, and one says just the opposite - it should be as far away from a power of 2 as possible. I'm inclined to go with the second option because it appears to be more widely claimed. What about MULTIPLIER? The only thing to be said in general, is that it shouldn't have any factors in common with SIZE, as we've already seen. Again, prime numbers are a popular choice (they're so cool, aren't they!) This brings me to my next point which I cannot stress enough, so I'm giving it its own section. ## Testing the HashBefore writing an apparently great hash function and sticking it in your project, take the time to run sample sets of data through it. Make sure the sample set is similar to what would really be used. Observe the distribution of the hash into the table. Try out different values of MULTIPLIER to get the distribution even. A simple way to measure the distribution in the table is to sum the squares of the number of names hashed to the same index. The following code implements a hash testing program: #include <stdio.h> #include <math.h> #include <assert.h> #define PRIME 23 unsigned int table [PRIME] = {0}; unsigned int sampSize = 0; unsigned int multiplier; unsigned int Hash (const char *name) { unsigned int hash = 0; while (*name != '\0') { hash = hash * multiplier + *name; ++ name; } return hash % PRIME; } void Enter (const char *name) { /* count how many names hash to each index */ ++ table [Hash (name)]; } void TestSample (void) { char name [64]; FILE *file = fopen ("sample.txt", "r"); assert (file); /* read MULTIPLIER off the first line of the file */ fscanf (file, "%u", &multiplier); /* enter names from the file until it reaches the end */ while (fscanf (file, "%s", name) != EOF) { Enter (name); ++ sampSize; /* keep a total of words "in" the table */ } fclose (file); } void TallyResults (void) { double score; unsigned int index, sum, q = sampSize / PRIME, r = sampSize % PRIME; /* add up the squares of the number of words hashed to each index */ for (index = sum = 0; index < PRIME; ++ index) sum += table [index] * table [index]; /* compute a relative score and print it */ score = (double) sum / (double) (q * (PRIME * q + 2 * r) + r); printf ("score = %f\n", score); } int main (void) { TestSample (); TallyResults (); return 0; } You'll notice that very little is going on here. We aren't even adding anything to the table. We're just counting how many times each index is hashed to. TestSample opens a file called sample.txt. The file begins with an string-integer which will be used as MULTIPLIER and is followed by any number of strings. It reads in the strings and hashes them one by one. Enter just increments a running total of each index. When we get down to TallyResults, we add up the square of each index count. We square them so that bigger values (more strings hashing to the same place) are penalized more than smaller ones. After getting the total, we divide by the funny looking expression above the printf statement. This expression is the lowest (best) possible total we can have, when all the strings are distributed as evenly as possible. It's not hard to derive but not really important to do so. Just take a look at what q and r are and I'm sure you can figure it out. Dividing by this Any other statistical info that might be helpful can be gathered, too. The longest list and the number of empty lists for example, might be nice to know. You'll also notice that I made the size of the table much smaller. This is just to illustrate my point with an example. I ran the following set of strings through the program: sword mace axe arrow shield bag stone key skull jar bottle fairy potion water spoon book spear dagger katana helmet chain halberd pipe hat eyeofnewt soup wolfbane soap instantcoffee bugspray flint bones orb gold silver wine bread Then I played around with MULTIPLIER for a bit. The following table summarizes the results:
A little bit surprisingly, 1024 (a power of 2) did best; but 1048576 = 1024 This doesn't mean that 1024 is always the best, or even the best in this case (just the best one I found in about ten minutes of playing with it). It depends on the names being hashed as well as the size of the table. Even the size of an unsigned int affects it (maybe an unsigned __int32 would be better). Just make sure it works with a large sample set similar to what would really be used. If you are writing a C compiler, go through a few dozen C files and pick out a few hundred of the most popular identifiers, etc. Anyways, you see how this goes. Make sure you test your data because how good a MULTIPLIER is depends a lot on the sample set. If you are a real perfectionist, write some kind of AI to choose a MULTIPLIER for you. I have been tinkering with a Genetic Algorithm to do just that. ## What Other Hash Functions Are There?Anything that turns a string into an integer is a primary hash function, and anything that reduces that integer into the correct range is a secondary hash function. I've seen a handful of other secondary hash functions as well, which can be found in a good data structures book. Primary hash functions are easier to invent, so we'll focus on those. One way to spice it up is to use a bitwise exclusive or (xor) instead of an addition in the above hash function: while (*name != '\0') { hash = hash * multiplier ^ *name; ++ name; } A totally different type of hash function comes, again, from the Dragon Book (Aho et al. 1986): int Hash (const char *name) { unsigned int g, h = 0; while (*name) { h = (h << 4) + *name; if ((g = h & 0xF0000000) != 0) { h ^= g >> 24; h ^= g; } ++ name; } return h % PRIME; } I can't tell you exactly how this monstrosity works, but according to the tests they ran it through, it seems to. It was written by P.J. Weinberger. I've used it before, just for fun really, with good results. It's obviously a whole different breed than the simple weighted addition method. Basically, when writing your own, be sure to test it. Very little practical advice can be given in general. The whole point of writing a primary hash is to destroy patterns in the strings. As you might guess, this is easier said than done, especially if you don't know what strings will be going into your table. Most of the time, you will want to use all the characters of the strings and make their order count. I like the add-and-weight strategy used in this article because it is simple, fast and gives you a lot of versatility because you have so many MULTIPLIERs to choose from. If you are interested, ## Anything else about ChainingUsing a circular linked list instead of a regular one is a neat idea. When a name is looked up, the list can be Thanks for reading my article! ;-) Please comment at gaulerthegoat@yahoo.com. ## Works Cited- Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. >Compilers: Principles , Techniques, and Tools. ("Dragon Book") Reading: Addison Wesley Longman, 1986.
- Corman, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd ed. Cambridge: The MIT Press, 2001.
© 1999-2011 Gamedev.net. All rights reserved. |