True-Color Software Blending
Here are a couple of techniques for Alpha-Blending (mixing source and destination according to an "Alpha" blend factor) and Additive-Blending (adding the source and destination colors using saturation arithmetic). They aren't as fast as what could be achieved with MMX instructions, or probably even plain assembly, but they should be pretty fast for optimized C. If you know of any different or better techniques, please let me know.
Alpha-Blending Two Pixels
First you mask off each color component of the source and destination pixels, multiply the source component by the Alpha value, multiply the destination component by the Inverse Alpha value, add the two, divide the result by the maximum Alpha value (in this case 256, which can be accomplished by a bit shift of 8 to the right), bit-and with the component mask again, and finally bit-or the three color components back together to produce the output pixel value. (Note, using an unsigned char Alpha mask only allows an alpha of up to 255, so even if the alpha channel is all 255, the colors of the source image will be ever so slightly reduced from their full intensity.)
RMask, GMask, and BMask are assumed to be the bit-masks for each color channel, usually derived from DirectDraw's DDPixelFormat, or a fixed set of masks for Windows DIBs. source and dest are pointers to the source and destination pixels, and alpha is an unsigned char pointer to the alpha channel pixel (which may be fixed for the whole blit). This should work for 16-bit unsigned short or 32-bit unsigned long source and dest pixels.
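The steps above can be sketched roughly as follows. This is a minimal reconstruction, not necessarily the original listing: the function name alpha_blend_pixel and its parameter names are made up for illustration, the pixels and masks are passed by value rather than through pointers, and it assumes no channel mask uses bits above bit 23, so that the intermediate products fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Blends one pixel channel by channel using arbitrary color masks.
 * Assumes no mask has bits set above bit 23, so mask * 256 fits in 32 bits. */
static uint32_t alpha_blend_pixel(uint32_t src, uint32_t dst, uint8_t alpha,
                                  uint32_t rmask, uint32_t gmask, uint32_t bmask)
{
    uint32_t a  = alpha;      /* 0..255 */
    uint32_t ia = 256u - a;   /* inverse alpha, 1..256 */

    /* Weight source and destination, divide by 256 with a shift,
     * then mask again to discard bits that spilled below the channel. */
    uint32_t r = (((src & rmask) * a + (dst & rmask) * ia) >> 8) & rmask;
    uint32_t g = (((src & gmask) * a + (dst & gmask) * ia) >> 8) & gmask;
    uint32_t b = (((src & bmask) * a + (dst & bmask) * ia) >> 8) & bmask;

    return r | g | b;         /* recombine the three channels */
}
```

With 5-6-5 masks (0xF800, 0x07E0, 0x001F), blending white (0xFFFF) onto black with an alpha of 255 gives 0xF7DE rather than 0xFFFF, which is the slight loss of intensity mentioned in the note above.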
Additive-Blending Two Pixels
This is one of the methods I've figured out for Saturated (values don't wrap when they hit the top) Additive Blending. The biggest problem is the saturation arithmetic (why do CPUs seem to only have modulo arithmetic instructions?), which in the general case appears to be handled best by a test and branch. Luckily it is possible to test directly against the mask for the current color component, as seen below. Addition also doesn't pollute the lower bits of the result, so no additional bit-and with the mask is required. Depending on your C compiler, a slightly different arrangement of the operations might be faster.
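A rough sketch of the test-and-branch approach described above might look like this. The function names are invented for illustration, pixels and masks are passed by value, and it assumes no channel mask includes bit 31, so the sum of two masked components cannot wrap a 32-bit integer.

```c
#include <stdint.h>

/* Hypothetical helper: saturated add of one color component.
 * Both operands have zero bits below the channel, so the sum does too;
 * only overflow above the channel needs handling. */
static uint32_t add_channel_saturated(uint32_t src, uint32_t dst, uint32_t mask)
{
    uint32_t t = (src & mask) + (dst & mask);
    if (t > mask)   /* test directly against the component mask */
        t = mask;   /* clamp to full intensity instead of wrapping */
    return t;
}

/* Hypothetical helper: additive blend of one pixel with arbitrary masks.
 * Assumes no mask includes bit 31. */
static uint32_t additive_blend_pixel(uint32_t src, uint32_t dst,
                                     uint32_t rmask, uint32_t gmask, uint32_t bmask)
{
    return add_channel_saturated(src, dst, rmask)
         | add_channel_saturated(src, dst, gmask)
         | add_channel_saturated(src, dst, bmask);
}
```

Because a clamped or in-range sum never has bits outside its mask, the three results can be OR-ed together without any further masking.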
Other forms of Blending
Other forms of blending are also possible, such as multiplicative, subtractive, divisive, maximum, minimum, etc., though they are slower and/or more difficult to implement on variable-color-mask pixels, which are required for most 16-bit high-color graphics. If you can work with byte-per-component pixels (24-bit or 32-bit) then all the possible software blending modes become very easy and (relatively) fast to implement.
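To illustrate why byte-per-component pixels make this easy, here is a small sketch of a few blend modes working on single 8-bit channels. Everything here is an assumption for illustration: the function names are made up, the pixel layout is taken to be 32-bit with blue in the low byte and red in bits 16 to 23, and the top byte is simply ignored.

```c
#include <stdint.h>

/* Multiplicative blend of one 8-bit channel; 255 acts as "1.0". */
static uint8_t blend_multiply(uint8_t src, uint8_t dst)
{
    return (uint8_t)(((uint32_t)src * dst) / 255u);
}

/* Subtractive blend: destination minus source, clamped at zero. */
static uint8_t blend_subtract(uint8_t src, uint8_t dst)
{
    return (uint8_t)(dst > src ? dst - src : 0);
}

/* Maximum blend: keep the brighter of the two components. */
static uint8_t blend_max(uint8_t src, uint8_t dst)
{
    return src > dst ? src : dst;
}

/* Apply a per-channel operation to the three low bytes of a 32-bit pixel.
 * The top byte of the result is left as zero. */
static uint32_t blend_xrgb8888(uint32_t src, uint32_t dst,
                               uint8_t (*op)(uint8_t, uint8_t))
{
    uint32_t out = 0;
    for (int shift = 0; shift <= 16; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)op(s, d) << shift;
    }
    return out;
}
```

The function pointer keeps the sketch short; in a real inner loop you would probably want each mode written out directly so the compiler can inline it.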
In general, pixel blending is much slower than plain solid or transparent bitmap blitting, so for high frame rate applications such as games, it's best to limit blending to a small portion of the screen. With other applications such as image processing, the speed penalty for blending large bitmaps together isn't much of an issue, and you can go to town.
February 10th, 1999 Update:
An interesting speed up for Alpha Blending, if you don't mind using a fixed alpha map (no interactive fade outs, etc.), is to pre-multiply the source bitmap RGB values by their corresponding Alpha values. That eliminates half of the multiplies per pixel, as you just have to multiply the destination by the inverse alpha and add to the pre-multiplied source. You could store the inverse alpha in your source alpha channel to avoid having to calculate the inverse alpha as well. Thanks Michael Tanczos at Game Programming '99.
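As a rough sketch of the pre-multiplied idea, assuming the same kind of arbitrary channel masks as before: the source is scaled by its alpha once, ahead of time, and the per-pixel blend only scales the destination by the inverse alpha. The names scale_pixel and blend_premultiplied are hypothetical, inv_alpha is assumed to be 256 minus the alpha used for pre-multiplying, and no mask may use bits above bit 23.

```c
#include <stdint.h>

/* Hypothetical helper: scale every channel of a pixel by factor/256.
 * Used once per source pixel (factor = alpha) as the pre-processing step,
 * and once per destination pixel (factor = inverse alpha) during the blit. */
static uint32_t scale_pixel(uint32_t pixel, uint32_t factor,
                            uint32_t rmask, uint32_t gmask, uint32_t bmask)
{
    uint32_t r = (((pixel & rmask) * factor) >> 8) & rmask;
    uint32_t g = (((pixel & gmask) * factor) >> 8) & gmask;
    uint32_t b = (((pixel & bmask) * factor) >> 8) & bmask;
    return r | g | b;
}

/* Hypothetical helper: blend a pre-multiplied source pixel onto dst.
 * Since alpha + inv_alpha == 256, no channel sum can exceed its mask,
 * so the two pixels can be added whole without carries between channels. */
static uint32_t blend_premultiplied(uint32_t src_premul, uint32_t dst,
                                    uint32_t inv_alpha,
                                    uint32_t rmask, uint32_t gmask, uint32_t bmask)
{
    return src_premul + scale_pixel(dst, inv_alpha, rmask, gmask, bmask);
}
```

The whole-pixel add only holds if the source really was pre-multiplied with the matching alpha; with mismatched values a channel could overflow into its neighbor.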
April 4th, 1999 Update:
What's a good way to say "gee, that seems so obvious in retrospect"? :) Thanks to Thomas Mauer and Matias Ignacio Suarez Ornani for pointing out that you can remove half of the multiplies per pixel without any pre-processing steps by doing the following: dest = dest + (source - dest) * alpha, with alpha treated as a fraction between 0 and 1. It takes the difference between the source and the dest, scales that by the alpha value, and adds it to the dest, using only one multiply per color channel. With arbitrary color masks it takes a few more bit-ands than the previous code, but the drop in multiplies really wins out (a multiply takes 9 clocks on a Pentium, while a bit-and takes half a clock).
Here's some sample code to implement the above algorithm. It works (and fast!) for 16-bit pixels, but MAY have overflow issues with 32-bit pixels due to the sign bit in the 32-bit integers. Some of the casts to (int) may be unnecessary, but better safe than sorry, as signed arithmetic must be used for the algorithm to work.
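What follows is a hedged reconstruction of that idea for 16-bit pixels only, not necessarily the original listing. The names lerp_channel and alpha_blend_pixel_diff are made up, pixels and masks are passed by value, and the code assumes that right-shifting a negative signed integer is an arithmetic shift. That is what common compilers do, but C leaves it implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helper: dest + (source - dest) * alpha / 256 for one channel.
 * Signed arithmetic is needed because (source - dest) can be negative.
 * Assumes >> on a negative int32_t is an arithmetic shift
 * (implementation-defined in C). */
static int32_t lerp_channel(uint16_t src, uint16_t dst, int32_t alpha, uint16_t mask)
{
    int32_t s = (int32_t)(src & mask);
    int32_t d = (int32_t)(dst & mask);
    /* One multiply per channel, then mask off bits below the channel. */
    return (d + (((s - d) * alpha) >> 8)) & (int32_t)mask;
}

/* Hypothetical helper: blend one 16-bit pixel with arbitrary masks. */
static uint16_t alpha_blend_pixel_diff(uint16_t src, uint16_t dst, uint8_t alpha,
                                       uint16_t rmask, uint16_t gmask, uint16_t bmask)
{
    int32_t a = alpha;  /* 0..255 */
    return (uint16_t)(lerp_channel(src, dst, a, rmask)
                    | lerp_channel(src, dst, a, gmask)
                    | lerp_channel(src, dst, a, bmask));
}
```

For 16-bit pixels the largest intermediate value is well inside the range of a 32-bit signed integer, which is why the sketch stops there rather than attempting 32-bit pixels.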
Copyright 1998-1999 by Seumas McNally.
Courtesy Of Longbow Digital Artists