GameDev.net -- MMX Enhanced Alpha Blending

THIRD ATTEMPT: Who Needs MMX? (continued)

We are using packed data just like the MMX instructions do: two 16 bit values packed into a single 32 bit value. If you want to add or subtract a value from packed data, you must first build a packed copy of that value, with one copy shifted into place for each 16 bit half. Here are some examples:


DWORD PLUS256        = 256 | (256 << 16);
DWORD ALPHAPLUS      = ALPHA | (ALPHA << 16);
DWORD doubleColorKey = sColorKey | (sColorKey << 16);

The pixels are 16 bits each, so I shift my values to the left by 16 and 'or' them with a copy of themselves. Multiplication is different: simply multiply the whole 32 bit value by your (unpacked) multiplier. An example:

BLUEC  = ((((ALPHA * ((sb + PLUS256) - db)) >> 8) + db) - ALPHAPLUS) & 0x001F001F;
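To see the packed arithmetic at work, here is a small sketch that blends one 5 bit channel for two pixels at once and can be compared against a one-pixel version. The names pack2, blend1 and blend2 are mine, not from the listings, and I use uint32_t in place of DWORD. I also assume ALPHA is small enough (228 or less) that the product in the low half stays below 65536 and cannot carry into the high half; I have not checked the packed version beyond that.

```c
#include <stdint.h>

/* Hypothetical helper: copy a 16 bit value into both halves of a 32 bit word. */
static inline uint32_t pack2(uint32_t v)
{
    return v | (v << 16);
}

/* One channel, one pixel: roughly d + alpha * (s - d) / 256.
   The +256 bias keeps the intermediate value positive. */
static inline uint32_t blend1(uint32_t s, uint32_t d, uint32_t alpha)
{
    return (((alpha * ((s + 256u) - d)) >> 8) + d) - alpha;
}

/* One 5 bit channel, two pixels at once (one per 16 bit half).
   Assumption: alpha <= 228, so alpha * 287 fits in 16 bits. */
static inline uint32_t blend2(uint32_t s2, uint32_t d2, uint32_t alpha)
{
    uint32_t PLUS256   = pack2(256u);   /* added to both halves      */
    uint32_t ALPHAPLUS = pack2(alpha);  /* subtracted from both halves */
    return ((((alpha * ((s2 + PLUS256) - d2)) >> 8) + d2) - ALPHAPLUS)
           & 0x001F001Fu;
}
```

With ALPHA = 128, blending 31 over 0 in the high half and 0 over 31 in the low half gives 15 in both halves, the same values blend1 returns for each pixel on its own.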

As we enter the outer loop we check to see if the oddWidth flag has been set. If this flag is TRUE, our source buffer has an odd width, so we alpha blend a single pixel using the method in Listing 1 and then move to the inner loop, which processes two pixels at a time. I put this check in to make the routine more general purpose. If you know your sprites, tiles, etc. will never have an odd width, you can remove this section of code and gain a tiny bit of performance.
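The loop structure for one row might be sketched like this. The names blend_one, blend_pair and blend_row are hypothetical, and the two blend functions are stand-ins that simply return the source pixel so the control flow can be run on its own; the real routines would go in their place. I use memcpy for the 32 bit reads and writes so the sketch makes no alignment assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins (hypothetical): return the source unchanged. */
static inline uint16_t blend_one(uint16_t s, uint16_t d)    { (void)d;  return s;  }
static inline uint32_t blend_pair(uint32_t s2, uint32_t d2) { (void)d2; return s2; }

/* Blend one row of `width` 16 bit pixels from src into dst. */
static inline void blend_row(const uint16_t *src, uint16_t *dst, int width)
{
    int x = 0;
    int oddWidth = width & 1;

    if (oddWidth) {                  /* peel off one pixel the slow way */
        dst[0] = blend_one(src[0], dst[0]);
        x = 1;
    }

    for (; x < width; x += 2) {      /* what remains is an even count */
        uint32_t sTemp, dTemp, result;
        /* On little-endian machines such as x86 the first pixel
           ends up in the low 16 bits of the packed value. */
        memcpy(&sTemp, &src[x], sizeof sTemp);
        memcpy(&dTemp, &dst[x], sizeof dTemp);
        result = blend_pair(sTemp, dTemp);
        memcpy(&dst[x], &result, sizeof result);
    }
}
```

For a row of 5 pixels this handles pixel 0 alone and then pixels 1-2 and 3-4 as pairs, never touching anything past the end of the row.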

Inside the inner loop we read in two pixels from our source buffer and check to see if this packed value is equal to doubleColorKey. If it is, both pixels match the color key and we can skip the majority of the processing. If the two values are not equal, we read in two pixels from the destination buffer. Seeing the individual bits of each data element might be helpful, so I will refer to Table 1 as we go along. Table 1a and b show the pixels in their 32 bit packed state.

Now we need to separate out the red, green, and blue color components. This is easy enough to do if you know the color format of your buffer. We have 16 bits in which to describe the 3 color channels. Unfortunately, 3 does not divide evenly into 16, so the video chip manufacturers must make a choice. Either they use only 15 bits, giving each color channel 5 bits, or they use all 16 bits, giving one of the color channels 6 bits and the other two 5 bits. These two formats are referred to as 16bpp 5-5-5 and 16bpp 5-6-5. The numbers map respectively to the red, green and blue color channels. Other combinations are possible but very rare. Listing 3 is tailored for the 5-6-5 format. Below is the code that does this for us (See Table 1c-h).


sb = sTemp & 0x001F001F;
db = dTemp & 0x001F001F;
sg = (sTemp >> 5)  & 0x003F003F;
dg = (dTemp >> 5)  & 0x003F003F;
sr = (sTemp >> 11) & 0x001F001F;
dr = (dTemp >> 11) & 0x001F001F;
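A worked example may make the masks easier to follow. The sketch below wraps the same three masks in a function; the Channels2 struct and the name split565x2 are hypothetical, and uint32_t stands in for DWORD.

```c
#include <stdint.h>

/* Hypothetical container: each field holds one channel for two pixels,
   one value in each 16 bit half. */
typedef struct { uint32_t r, g, b; } Channels2;

/* Split two packed 5-6-5 pixels into their channels. */
static inline Channels2 split565x2(uint32_t p2)
{
    Channels2 c;
    c.b =  p2        & 0x001F001Fu;  /* bits 0-4 of each pixel   */
    c.g = (p2 >> 5)  & 0x003F003Fu;  /* bits 5-10 of each pixel  */
    c.r = (p2 >> 11) & 0x001F001Fu;  /* bits 11-15 of each pixel */
    return c;
}
```

Take a pure red pixel (0xF800) in the high half and a white pixel (0xFFFF) in the low half, packed as 0xF800FFFF. The function returns b = 0x0000001F, g = 0x0000003F and r = 0x001F001F: the red pixel contributes only to r, and the white pixel contributes the maximum to all three. The shifts drag bits of the high pixel down into the low half, but the masks throw those bits away.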

Next we run each color channel through Equation 4. The only difference between the code and Equation 4 is that after all the calculations we 'and' the result with 0x001F001F (or 0x003F003F for green). This masks each result to the 5 bit (or 6 bit) range we know it cannot exceed, discarding any stray bits the packed arithmetic left outside that range (See Table 1i and j). We then recombine the three alpha blended color channels into the result (See Table 1k).

Result = BLUEC | GREENC | REDC;
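For the 'or' to work, each channel has to be back in its own bit position first. The line above does not show where that happens, so the sketch below is my assumption: green is shifted left by 5 and red by 11, the reverse of the extraction step. The name join565x2 is hypothetical.

```c
#include <stdint.h>

/* Recombine three masked channels (two pixels each) into two packed
   5-6-5 pixels. Assumes r2 and b2 are already masked with 0x001F001F
   and g2 with 0x003F003F. */
static inline uint32_t join565x2(uint32_t r2, uint32_t g2, uint32_t b2)
{
    uint32_t BLUEC  = b2;        /* blue already sits in bits 0-4 */
    uint32_t GREENC = g2 << 5;   /* back to bits 5-10             */
    uint32_t REDC   = r2 << 11;  /* back to bits 11-15            */
    return BLUEC | GREENC | REDC;
}
```

For example, r2 = 0x001F001F, g2 = 0x0000003F and b2 = 0x0000001F recombine to 0xF800FFFF, a pure red pixel in the high half and a white pixel in the low half.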

We now need to see if either of the two source pixels that we read in is equal to the ColorKey value. If one of them is, we need to replace that pixel in the result with its corresponding destination pixel. All of this is done with the following code.


if ((sTemp >> 16) == sColorKey)
    Result = (Result & 0xFFFF) | (dTemp & 0xFFFF0000);
else if ((sTemp & 0xFFFF) == sColorKey)
    Result = (Result & 0xFFFF0000) | (dTemp & 0xFFFF);
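Putting the two color key checks together, the value written back for a pair of pixels could be sketched as follows. The function name resolve_color_key is hypothetical. Note that the 'else if' is only safe because the case where both pixels match the key has already been handled by the doubleColorKey test.

```c
#include <stdint.h>

/* Return the 32 bit value to write to the destination for one pixel pair.
   sTemp/dTemp are the packed source and destination pixels, Result is the
   packed blend of the two. */
static inline uint32_t resolve_color_key(uint32_t sTemp, uint32_t dTemp,
                                         uint32_t Result, uint32_t sColorKey)
{
    uint32_t doubleColorKey = sColorKey | (sColorKey << 16);

    if (sTemp == doubleColorKey)          /* both pixels keyed out */
        return dTemp;                     /* destination unchanged */

    if ((sTemp >> 16) == sColorKey)       /* only the high pixel   */
        Result = (Result & 0x0000FFFFu) | (dTemp & 0xFFFF0000u);
    else if ((sTemp & 0xFFFFu) == sColorKey)  /* only the low pixel */
        Result = (Result & 0xFFFF0000u) | (dTemp & 0x0000FFFFu);

    return Result;
}
```

With a key of 0xF81F, a source of 0xF81F1234, a destination of 0xAAAABBBB and a blended result of 0x11112222, the function returns 0xAAAA2222: the high pixel is taken from the destination and the low pixel keeps its blended value.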

That about wraps it up. Write your result to the destination buffer, increment your pointers to the next two pixels, and start over again. How much faster is Listing 3 than Listing 1, you ask?

Performance of the Third Attempt:

Milliseconds / function call = 8.52

Cycles per pixel = 39

Improvement over baseline = 35%

No doubt if Listing 3 were rewritten in assembly, one could squeeze out another 10% to 20% of performance, which would make it just as fast as my lookup table attempt without any messy tables.
