>From: john@cooper.cooper.EDU (John Barkaus)
Newsgroups: comp.graphics
Subject: GIF file format responses 5/5
Date: 21 Apr 89 20:58:01 GMT
Organization: The Cooper Union (NY, NY)
LZW explained----Steve Blackstock
I hope this little document will help enlighten those of you out there
who want to know more about the Lempel-Ziv Welch compression algorithm, and,
specifically, the implementation that GIF uses.
Before we start, here's a little terminology, for the purposes of this
document:
"character": a fundamental data element. In normal text files, this is
just a single byte. In raster images, which is what we're interested in, it's
an index that specifies the color of a given pixel. I'll refer to an arbitray
character as "K".
"charstream": a stream of characters, as in a data file.
"string": a number of continuous characters, anywhere from one to very
many characters in length. I can specify an arbitrary string as "[...]K".
"prefix": almost the same as a string, but with the implication that a
prefix immediately precedes a character, and a prefix can have a length of
zero. So, a prefix and a character make up a string. I will refer to an
arbitrary prefix as "[...]".
"root": a single-character string. For most purposes, this is a
character, but we may occasionally make a distinction. It is [...]K, where
[...] is empty.
"code": a number, specified by a known number of bits, which maps to a
string.
"codestream": the output stream of codes, as in the "raster data"
"entry": a code and its string.
"string table": a list of entries; usually, but not necessarily, unique.
That should be enough of that.
LZW is a way of compressing data that takes advantage of repetition of
strings in the data. Since raster data usually contains a lot of this
repetition, LZW is a good way of compressing and decompressing it.
For the moment, lets consider normal LZW encoding and decoding. GIF's
variation on the concept is just an extension from there.
LZW manipulates three objects in both compression and decompression: the
charstream, the codestream, and the string table. In compression, the
charstream is the input and the codestream is the output. In decompression,
the codestream is the input and the charstream is the output. The string table
is a product of both compression and decompression, but is never passed from
one to the other.
The first thing we do in LZW compression is initialize our string table.
To do this, we need to choose a code size (how many bits) and know how many
values our characters can possibly take. Let's say our code size is 12 bits,
meaning we can store 0->FFF, or 4096 entries in our string table. Lets also
say that we have 32 possible different characters. (This corresponds to, say,
a picture in which there are 32 different colors possible for each pixel.) To
initialize the table, we set code#0 to character#0, code #1 to character#1,
and so on, until code#31 to character#31. Actually, we are specifying that
each code from 0 to 31 maps to a root. There will be no more entries in the
table that have this property.
Now we start compressing data. Let's first define something called the
"current prefix". It's just a prefix that we'll store things in and compare
things to now and then. I will refer to it as "[.c.]". Initially, the current
prefix has nothing in it. Let's also define a "current string", which will be
the current prefix plus the next character in the charstream. I will refer to
the current string as "[.c.]K", where K is some character. OK, look at the
first character in the charstream. Call it P. Make [.c.]P the current string.
(At this point, of course, it's just the root P.) Now search through the
string table to see if [.c.]P appears in it. Of course, it does now, because
our string table is initialized to have all roots. So we don't do anything.
Now make [.c.]P the current prefix. Look at the next character in the
charstream. Call it Q. Add it to the current prefix to form [.c.]Q, the
current string. Now search through the string table to see if [.c.]Q appears
in it. In this case, of course, it doesn't. Aha! Now we get to do something.
Add [.c.]Q (which is PQ in this case) to the string table for code#32, and
output the code for [.c.] to the codestream. Now start over again with the
current prefix being just the root P. Keep adding characters to [.c.] to form
[.c.]K, until you can't find [.c.]K in the string table. Then output the code
for [.c.] and add [.c.]K to the string table. In pseudo-code, the algorithm
goes something like this:
[1] Initialize string table;
[2] [.c.] <- empty;
[3] K <- next character in charstream;
[4] Is [.c.]K in string table?
(yes: [.c.] <- [.c.]K;
go to [3];
)
(no: add [.c.]K to the string table;
output the code for [.c.] to the codestream;
[.c.] <- K;
go to [3];
)
It's as simple as that! Of course, when you get to step [3] and there
aren't any more characters left, you just output the code for [.c.] and throw
the table away. You're done.
Wanna do an example? Let's pretend we have a four-character alphabet:
A,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we
initialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is
A, which is in the string table, so [.c.] becomes A. Next we get AB, which is
not in the table, so we output code #0 (for [.c.]),
and add AB to the string table as code #4. [.c.] becomes B. Next we get
[.c.]A = BA, which is not in the string table, so output code #1, and add BA
to the string table as code #5. [.c.] becomes A. Next we get AC, which is not
in the string table. Output code #0, and add AC to the string table as code
#6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table.
Output #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we
get AB, which IS in the string table, so [.c.] gets AB, and we look at ABA,
which is not in the string table, so output the code for AB, which is #4, and
add ABA to the string table as code #8. [.c.] becomes A. We can't get any more
characters, so we just output #0 for the code for A, and we're done. So, the
codestream is #0#1#0#2#4#0.
A few words (four) should be said here about efficiency: use a hashing
strategy. The search through the string table can be computationally
intensive, and some hashing is well worth the effort. Also, note that
"straight LZW" compression runs the risk of overflowing the string table -
getting to a code which can't be represented in the number of bits you've set
aside for codes. There are several ways of dealing with this problem, and GIF
implements a very clever one, but we'll get to that.
An important thing to notice is that, at any point during the
compression, if [...]K is in the string table, [...] is there also. This fact
suggests an efficient method for storing strings in the table. Rather than
store the entire string of K's in the table, realize that any string can be
expressed as a prefix plus a character: [...]K. If we're about to store [...]K
in the table, we know that [...] is already there, so we can just store the
code for [...] plus the final character K.
Ok, that takes care of compression. Decompression is perhaps more
difficult conceptually, but it is really easier to program.
Here's how it goes: We again have to start with an initialized string
table. This table comes from what knowledge we have about the charstream that
we will eventually get, like what possible values the characters can take. In
GIF files, this information is in the header as the number of possible pixel
values. The beauty of LZW, though, is that this is all we need to know. We
will build the rest of the string table as we decompress the codestream. The
compression is done in such a way that we will never encounter a code in the
codestream that we can't translate into a string.
We need to define something called a "current code", which I will refer
to as "
Discuss this article in the forums
See Also: © 1999-2011 Gamedev.net. All rights reserved. Terms of Use Privacy Policy
|