Really Fast Software Blending
by Kevin Gadd

Concept

Alpha blending is slow. It always involves lots of multiplication, or division. No way around that. Either you use lookup tables (and array accessing requires multiplication), or you use multiplication, or you use division. You can get a good approximation of alpha blending with bit-shifting, but that’s just an approximation.

What I set out to do was to create as fast an alpha blending formula as possible, without wasting lots of memory on lookup tables, or lots of clock cycles on math. And as far as I can tell, I did it.

Benchmarks

Alright, before I waste your time with my code, here’s some proof that my method is fast:

Note the horrible floating point performance of the P2-350. I wrote two other versions of my alpha blending algorithm, one using a << 1 bitshift, and one using a * float multiply. All used the same skeleton as my enhanced blend routines. The tests were conducted using a small optimized VB application, and each test involved blending 10 million (that’s right, 10 million) pairs of 32-bit color values. Each test used a different set of values. The additive blend is faster than my alpha blend, as demonstrated by the benchmark – it only has to lookup 3 values instead of 6. I’ve provided the code used for each of the three alpha blend test functions, so you can see that I attempted to optimize all three.

Code

newcolor.channels.red = clipByte((source.channels.red+dest.channels.red) >> 1);
newcolor.channels.green = clipByte((source.channels.green+dest.channels.green) >> 1);
newcolor.channels.blue = clipByte((source.channels.blue+dest.channels.blue) >> 1);

float alphaS, alphaD;
alphaS = (float)alpha / (float)255; alphaD = 1 - alphaS;
newcolor.channels.red = clipByte((source.channels.red*alphaS)+(dest.channels.red*alphaD));
newcolor.channels.green = clipByte((source.channels.green*alphaS)+(dest.channels.green*alphaD));
newcolor.channels.blue = clipByte((source.channels.blue*alphaS)+(dest.channels.blue*alphaD));

f2level * sourceTable; f2level * destTable;
sourceTable = aTable(alpha); destTable = aTable(255 - alpha);
newcolor.channels.red = clipByte(sourceTable->values[source.channels.red] +
                        destTable->values[dest.channels.red]);
newcolor.channels.green = clipByte(sourceTable->values[source.channels.green] +
                          destTable->values[dest.channels.green]);
newcolor.channels.blue = clipByte(sourceTable->values[source.channels.blue] +
                         destTable->values[dest.channels.blue]);

How it works

I used lookup tables for this formula, but I used them in a way which I’ve never seen used before. I created a standard 256*256 element __int8 (Byte for you VB people) lookup table, but I created it in a way as to allow you to create a pointer to each group of 256 elements… this will probably make more sense later. I also came up with an extremely fast way of eliminating a large number of branching instructions in my code (I checked the assembly myself) by using lots of tricky boolean operations.

Let’s get to some code.

struct LookupLevel {
  unsigned __int8 Values[256];
};

struct LookupTable {
  LookupLevel Levels[256];
} AlphaTable;

__inline unsigned __int8 clipByte(int value) {
  value = (0 & (-(int)(value < 0))) | (value & (-(int)!(value < 0)));
  value = (255 & (-(int)(value > 255))) | (value & (-(int)!(value > 255)));
  return value;
}

int InitTable() {
float fValue,
fAlpha;
int iValue, iAlpha;
  for (iAlpha = 0; iAlpha < 256; iAlpha++) {
    fAlpha = ((float)iAlpha) / 255;
    for (iValue = 0; iValue < 256; iValue++) {
      fValue = ((float)iValue) / 255;
      AlphaTable.Levels[iAlpha].Values[iValue] = clipByte((int)((fValue * fAlpha) * 255));
    }
  }
  return true;
}

LookupLevel * LookupPointer(int alpha) {
  return &AlphaTable.Levels[clipByte(alpha)];
}

Explanation

Well, now that we have that (very small) piece of example code, I’m going to explain it in detail.

The two structs are layered in a way to allow you to easily retrieve a pointer to a single 256 element array, for a single alpha value. This is very efficient for static-alpha blits, in which the alpha values for the image are the same the entire time. Blitting images with alpha channels is somewhat slower, but still very fast. I’ve been using this technique for months, and have done some extensive optimization.

Now, first, we start with the clipByte function. That function uses some cryptic-looking bit manipulation to clip a value to a byte. When optimized by Visual C++ 6.0’s compiler, it results in exactly 15 assembly instructions, none of which involve any sort of branching (or cache-thrashing). A variation of this can also be used for masking and other techniques. Here’s what the assembly for clipByte looks like:

mov    ecx, DWORD PTR _value$[esp-4]
xor    eax, eax
test   ecx, ecx
setge  al
neg    eax
and    ecx, eax
cmp    ecx, 255
setle  al
neg    al
and    al, cl
cmp    ecx, 255
setg   cl
neg    cl
or     al, cl
ret    4

Pretty small, isn’t it? It’s fast, too.

Now, onto the InitTable function. Please excuse my horrible naming convention, I’m one of the world’s worst coders. InitTable is pretty simple, it calculates a value from 0-255 for each element of the lookup table. It’s moderately fast, and probably could be made faster if I ever needed to speed it up. (It currently takes my P3-750 between 0 and 10 milliseconds to run the initialization routine. More than fast enough for me.)

Alright, hopefully you understand it now. Let’s move on to the actual use of the lookup tables: Alpha blending.

I’m not going to demonstrate an alpha blended blit routine here, because it would take far too much time. Perhaps in another article. For now I’m going to demonstrate 3 kinds of color blending. Note that I use the 32-bit BGRA colorspace in all my code, but it is easy to adapt it to other colorspaces.

struct ColorChannels
{
  unsigned __int8 Blue;
  unsigned __int8 Green;
  unsigned __int8 Red;
  unsigned __int8 Alpha;
};

union Pixel32 {
  ColorChannels Channels;
  __int32 Value;
};

__int32 AlphaBlend(Pixel32 DestColor, Pixel32 SourceColor, __int32 Alpha) {
Pixel32 NewColor;
LookupLevel *SourceTable;
LookupLevel *DestTable;

  SourceTable = LookupPointer(Alpha);
  DestTable = LookupPointer(255 - Alpha);

  // The clipByte's aren't really necessary, but they're there to remove the
  // chance for an occasional overflow. They only eat a couple clock cycles anyway.
  NewColor.Channels.Red = clipByte(SourceTable->Values[SourceColor.Channels.Red] +
                          DestTable->Values[DestColor.Channels.Red]);
  NewColor.Channels.Green = clipByte(SourceTable->Values[SourceColor.Channels.Green] +
                            DestTable->Values[DestColor.Channels.Green]);
  NewColor.Channels.Blue = clipByte(SourceTable->Values[SourceColor.Channels.Blue] +
                           DestTable->Values[DestColor.Channels.Blue]);
  return NewColor.Value;
}

Explanation

Alright, there we have it. A fully functional alpha blending routine that will blend two colors. Pretty small, isn’t it? Now, here’s how it works:

It gets two pointers: One for the destination pixels, one for the source pixels. Each is a pointer to a 256-element __int8 (Byte for you VB people) lookup table. Then, it uses those tables to blend each color channel of the source and destination pixels. It’s quite fast, even in its unoptimized state. I have the clipByte() around the values, just in case one of the calculations comes out to 256 or something. You could remove the clipByte and probably not have any problems, but it’s there anyway.

When writing a blitter, I suggest doing the calculations inside the blitter itself. It would be much faster, as you’d only have to get those two pointers once per blit, and you could avoid the whole temporary color variable and save your results directly to the destination pixel.

Note that if you want to do this with 15-bit or 16-bit colors, you won’t be able to use a pixel struct, but the lookup table will work just fine with absolutely no modification whatsoever, as long as your alpha values are 0-255. You could also optimize it a bit more by removing the clipByte from the LookupPointer function, as well.

Now that you hopefully understand the alpha blend function, I’m going to show you an additive blend function, and a subtractive blend function. Quite handy functions to have.

__int32 SubtractiveBlend(Pixel32 DestColor, Pixel32 SourceColor, __int32 Alpha) {
Pixel32 NewColor;
LookupLevel *
SourceTable;

  SourceTable = LookupPointer(Alpha);

  // These clipByte's ARE necessary. Don't remove them.
  NewColor.Channels.Red = clipByte(DestColor.Channels.Red -
                          SourceTable->Values[SourceColor.Channels.Red]);
  NewColor.Channels.Green = clipByte(DestColor.Channels.Green -
                            SourceTable->Values[SourceColor.Channels.Green]);
  NewColor.Channels.Blue = clipByte(DestColor.Channels.Blue -
                           SourceTable->Values[SourceColor.Channels.Blue]);
  return NewColor.Value;
}

__int32 AdditiveBlend(Pixel32 DestColor, Pixel32 SourceColor, __int32 Alpha) {
Pixel32 NewColor;
LookupLevel *
SourceTable;

  SourceTable = LookupPointer(Alpha);

  // These clipByte's ARE necessary. Don't remove them.
  NewColor.Channels.Red = clipByte(SourceTable->Values[SourceColor.Channels.Red] +
                          DestColor.Channels.Red);
  NewColor.Channels.Green = clipByte(SourceTable->Values[SourceColor.Channels.Green] +
                            DestColor.Channels.Green);
  NewColor.Channels.Blue = clipByte(SourceTable->Values[SourceColor.Channels.Blue] +
                           DestColor.Channels.Blue);
  return NewColor.Value;
}

Explanation

Alright, there you have it. Additive Blend and Subtractive Blend functions. They’re actually faster than the alpha blend function, believe it or not. You COULD remove the clipByte call, but then you’d have tons of overflow problems, so just leave it there. I may have screwed up the algorithm a bit, as these aren’t the actual functions I use, I rewrote these from scratch.

Hopefully this article has shown you a few ways to optimize your code, and maybe even given you insipriation for how to optimize one of your graphics routines. If you like the article, have any questions, or would like to point something out to me, feel free to email me at janusfury@citlink.net.

Also, I might be writing another article soon on blitters and more graphics operations, if I get enough response to this article.

Discuss this article in the forums

Date this article was posted to GameDev.net: 11/20/2001
(Note that this date does not necessarily correspond to the date the article was written)

See Also:
Alpha Blending
Featured Articles