GameDev.netMMX Enhanced Alpha Blending

MMX Enhanced Alpha Blending
by John HebertMy original desire was to write a 16bpp fade routine so I could fade my screen to black. At first, this seemed like a trivial task. My first attempt took 17 seconds to fade a 640x480 screen to black, running on a pentium 100. Not long after seeing how long it took to fade my screen did I realize that a fade to black in the high color modes is simply a special case of the more general effect of alpha blending. In this article I will take you through the optimization process I went through in my search for the fastest possible alpha blending routine. I will start with the basic equation and end with my MMX version of alpha blending.
This equation says that the greater the Before we wrap this equation with some code we need to see if we can optimize it. The first thing that should stand out are the two multiplications in the equation. As costly as multiplication is let's see if we can do anything about it. Rearranging things a bit produces an equation with a single multiplication.
There is one more optimization we can make to the equation. Currently
I will state millisecond and cycles per pixel timings through out this article, it is important to know the exact conditions under which these timings were taken. First, here are some specs for the PC that was used for all the timings:
All of the functions I will be describing read from two buffers, manipulate the data, and then write the data out to a buffer. For these timings all of the buffers that were read from or written to resided in system memory, NOT video memory. Even though these routines will be able to handle transparency the timings were taken using a 320x240 pixel bitmap where none of the pixels in the bitmap were set to the transparent (or colorkey) value. Doing this levels the playing field. We want to take our timings on the worst case scenario which for us would be an image where we have to check all pixels for transparency but do not get to skip any of them because none of them match the colorkey value. There are many more factors that come into play when determining the "speed" of an algorithm. What is important is the difference in performance from one attempt to the next while holding these factors constant.
The basic idea for alpha blending a non-indexed color mode (16bpp, 24bpp, etc) pixel is to breakdown both the source and destination pixel into their red, green, and blue color components or channels. Perform the above alpha blending equation on each channel then recombine the channels to form the resulting alpha blended pixel. Listing 1 shows some sample code that puts all of this together. i = height; do { j = width; do { sTemp = *((WORD*)lpSprite); if ( sTemp != sColorKey ) { dTemp = *((WORD*)lpDest); sb = sTemp & 0x1f; db = dTemp & 0x1f; sg = (sTemp >> 5) & 0x3f; dg = (dTemp >> 5) & 0x3f; sr = (sTemp >> 11) & 0x1f; dr = (dTemp >> 11) & 0x1f; blue = (ALPHA * (sb - db) >> 8) + db; green = (ALPHA * (sg - dg) >> 8) + dg; red = (ALPHA * (sr - dr) >> 8) + dr; *((WORD*)lpDest) = blue | (green << 5) | (red << 11); } lpDest += 2; lpSprite += 2; }while ( --j > 0 ); lpDest += dPadding; lpSprite += sPadding; }while (--i > 0); The code above is designed for a 16bpp surface with a format of 5-6-5. It also checks to see if the srcPixel is equal to the ColorKey in which case it skips drawing this pixel. This gives us transparency as well as alpha blending. This is your basic alpha blending function. We will use the performance of this function as our baseline and will compare subsequent attempts against the numbers below to see how much improvement we have achieved.
I coded many second attempts to improve upon the original. All of them improved the speed at which I processed the pixels but for a price! I tried using lookup tables with some good success I believe this resulted in a 40%-45% improvement in speed but you had the overhead and sloppiness of the lookup tables to contend with. Also the lookup tables for 16 bit color are huge and I have heard can cause cache-thrashing on higher end machines. If your goal is a fullscreen fade you can realize a two-fold increase in speed by only fading every other line per frame. This worked extremely well and produces an interesting effect. However, this does not look so hot if you are trying to alpha blend an explosion with your background tiles. If you can live with a single grade of alpha blending (50/50), then there is an extremely fast method to achieve this. For both the source and destination color values you mask off the low bit in each color channel then shift by one to the right, then add the two results together which gives you a 50% faded look. Listing 2 is the inner loop of a routine that show how to do to this for 16bpp 5-6-5 buffers. j = width; do { sTemp = *((WORD*)lpSprite); if ( sTemp != sColorKey ) { dTemp = *((WORD*)lpDest); Result = ( (sTemp & 0xF7DE) >> 1) + ( (dTemp & 0xF7DE) >> 1); } *((DWORD*)lpDest) = Result; lpDest += 2; lpSprite += 2; }while ( --j > 0 ); The code above is designed for a 16bpp surface with a format of 5-6-5. It also checks to see if the srcPixel is equal to the ColorKey in which case it skips drawing this pixel. This gives us transparency as well as alpha blending. This is your basic alpha blending function. We will use the performance of this function as our baseline and will compare subsequent attempts against the numbers below to see how much improvement we have achieved.
If you look at who the big cycle-eaters are in the first attempt you see that we do 3 multiplies per pixel. At 10 cycles a piece we know it will take more then 30 cycles per pixel. I got a tip from someone on the former www.Angelic-Coders.com about how to eliminate one of the multiplications which was great but it had the side effect of limiting your
Well, we all need MMX and the other SIMD instruction sets, but who says we can't mimic them in "C. The advantage of MMX is that you can perform the same operation on multiple data elements all in the same cycle. If we pack two 16 bit pixels into a single 32 bit value and operate on them as a unit we could eliminate at least 3 multiplication per every 2 pixels! In order to pull this off we will need to modify our alpha blending equation. We cannot afford to have any negative values at any time during the calculation. Our 2 packed pixels are defined as an unsigned 32bit value and if our destination color channel is greater then our source we will get a negative value, which will cause strange things to happen. Negative values are acceptable if you are blending one pixel at a time because the result of equations 1, 2, and 3 will always be a positive value. We need to add the maximum value of a color channel to the source color channel so when we subtract the destination we are assured to have a positive value. For 16 bit color this maximum value is 64 (the green channel can have 6 bits, 111111 binary or 64 in decimal). Of course we need to subtract it back out later to maintain the integrity of the equation. Note that this trick only works for the 16 bit color modes because we are stuffing two 16 bit pixels into one 32 bit value. We will start with equation 3. result = ( ALPHA * ( s - d ) ) / 256 + d Add 64 to the source, this means we must add something to the left side of the equation but what that is will become more clear in a second. result = ( ALPHA * ( ( s + 64 ) - d ) ) / 256 + d After some manipulation of the equation it becomes clear that we need to add result = ( ALPHA * s) / 256 + (ALPHA * 64) / 256 - ( ALPHA * d ) / 256 + d The equivalent of adding to the left side is subtracting from the right side. result = ( ALPHA * s) / 256 + (ALPHA * 64) / 256 - ( ALPHA * d ) / 256 + d - (ALPHA * 64) / 256 Now rearrange the equation for efficiency and you get….
Listing 3 puts everything together. Take a look at it and then I will walk you through the code. DWORD PLUS64 = 64 | (64 << 16); DWORD ALPHABY4 = (ALPHA / 4) | ((ALPHA / 4) << 16); DWORD doubleColorKey = sColorKey | (sColorKey << 16); ... // Is the Sprite width odd or even? If ( width % 2 == 1 ) { oddWidth = TRUE; width = (width - 1) / 2; //div by 2, processing 2 pixels at a time. } else { width = width / 2; //div by 2, processing 2 pixels at a time. } i = height; do { if ( oddWidth ) { sTemp = *((WORD*)lpSprite); if ( sTemp != sColorKey ) { dTemp = *((WORD*)lpDest); sb = sTemp & 0x1f; db = dTemp & 0x1f; sg = (sTemp >> 5) & 0x3f; dg = (dTemp >> 5) & 0x3f; sr = (sTemp >> 11) & 0x1f; dr = (dTemp >> 11) & 0x1f; *((WORD*)lpDest) = (DWORD)((ALPHA * (sb - db) >> 8) + db | ((ALPHA * (sg - dg) >> 8) + dg) << 5 | ((ALPHA * (sr - dr) >> 8) + dr) << 11); } lpDest += 2; lpSprite += 2; } j = width; do { sTemp = *((DWORD*)lpSprite); if ( sTemp != doubleColorKey ) { dTemp = *((DWORD*)lpDest); sb = sTemp & 0x001F001F; db = dTemp & 0x001F001F; sg = (sTemp >> 5) & 0x003F003F; dg = (dTemp >> 5) & 0x003F003F; sr = (sTemp >> 11) & 0x001F001F; dr = (dTemp >> 11) & 0x001F001F; BLUEC = ((((ALPHA * ((sb + PLUS64) - db)) >> 8) + db) - ALPHABY4) & 0x001F001F; GREENC = (((((ALPHA * ((sg + PLUS64) - dg)) >> 8) + dg) - ALPHABY4) & 0x003F003F) << 5; REDC = (((((ALPHA * ((sr + PLUS64) - dr)) >> 8) + dr) - ALPHABY4) & 0x001F001F) << 1; Result = BLUEC | GREENC | REDC; if ( (sTemp >> 16) == sColorKey ) Result = (Result & 0xFFFF) | (dTemp & 0xFFFF0000); else if ( (sTemp & 0xFFFF) == sColorKey ) Result = (Result & 0xFFFF0000) | (dTemp & 0xFFFF); *((DWORD*)lpDest) = Result; } lpDest += 4; lpSprite += 4; }while ( --j > 0 ); lpDest += dbuf; lpSprite += sbuf; }while ( --i > 0 );
We are using packed data just like the MMX instructions do. We have two 16 bit values packed into a single 32 bit value. If you want to add or subtract a value from packed data you must shift into place multiple copies of the value you wish to add/subtract. Here are some examples: DWORD PLUS256 = 256 | (256 << 16); The pixels are 16 bits each so I shift my values to the left by 16 and 'or' them to a copy of themselves. Multiplication is different, simply multiply the whole 32 bit value times your multiplier. An example: BLUEC = ((((ALPHA * ((sb + PLUS256) - db)) >> 8) + db) - ALPHAPLUS) & 0x001F001F; As we enter into the outer loop we check to see if the oddWidth flag has been set. If this flag is TRUE then our source buffer has an odd numbered width. We need to alpha blend a single pixel using the method in listing 1 then move to the inner loop which processes two pixels at a time. I put this check in to make the routine more general purpose. If you know your sprites, tiles, etc. will never have an odd width then you can remove this section of code and gain a tiny bit of performance. Inside the inner loop we read in two pixels from our source buffer and I check to see if this value is equal to doubleColorKey. If it is we can skip the majority of the processing. If the two values are not equal we read in two pixels from the destination buffer. Seeing the individual bits of each data element might be helpful so I will refer to Table 1 as we go along. Table 1a and b show the pixels in their 32 bit packed state. Now we need to separate out the red, green, and blue color components. This is easy enough to do if you know the color format of your buffer. We have 16 bits in which to describe the 3 color channels. Unfortunately, 3 does not divide evenly into 16 so the video chip manufacturers must make a choice. Either they only use 15 bits, giving each color channel 5 bits or they use all 16 bits but give one of the color channels 6 bits and the other two 5 bits. These two formats are referred to as 16bpp 5-5-5 or 16bpp 5-6-5. The numbers map respectively to the red, green and blue color channels. It is possible that other combinations could exist but it is very rare. Listing 3 is tailored for the 5-6-5 format. Below is the code that does this for us (See table 1c-h). sb = sTemp & 0x001F001F; db = dTemp & 0x001F001F; sg = (sTemp >> 5) & 0x003F003F; dg = (dTemp >> 5) & 0x003F003F; sr = (sTemp >> 11) & 0x001F001F; dr = (dTemp >> 11) & 0x001F001F; Next we run each color channel through equation 4. The only difference between the code and equation 4 is that after all the calculations we 'and' the result with 0x001F001F (or 0x003F003F). This is done to clamp or bound the result to the range we know it cannot exceed (See Table 1i and j). We then recombine the three alpha blended color channels into the result (See Table 1k). Result = BLUEC | GREENC | REDC; We now need to see if either of the two source pixels that we read in is equal to the ColorKey value. If one of them is, we need to replace that pixel in the result with its corresponding destination pixel. All of this is done with the following code. if ((sTemp >> 16) == sColorKey) Result = (Result & 0xFFFF) | (dTemp & 0xFFFF0000); else if ((sTemp & 0xFFFF) == sColorKey) Result = (Result & 0xFFFF0000) | (dTemp & 0xFFFF); That about wraps it up. Write your result to the destination buffer and increment your pointers to the next two pixels and start over again. How much faster is listing 3 than listing 1 you ask?
No doubt if Listing 3 were rewritten in assembly one could squeeze out another 10% to 20% performance which would make it just as fast as my lookup table attempt without any messy tables.
There are two alpha blending examples using MMX on Intel's website. This is what I used to learn the MMX instruction set. Once I was comfortable with the ins and outs of the packed data instructions I put together my own MMX version of alpha blending. You can find the code in the zip file. It is well commented but the interlacing of the instructions makes it difficult to follow. The logic is the same as Listing 3 but instead of working on 2 pixels at a time it operates on 4 at a time! The code is very fast, here are the comparisons.
None of the functions I have designed take into consideration that we may be reading or writing to non-aligned addresses. I believe you pay a 3-cycle penalty for this. We need to minimize the number of non-aligned reads and writes. Another optimization that applies only to the MMX function is to test if your four source pixels are all equal to the colorkey value before running through all the calculations. For bitmaps with 50% or more set to the colorkey value this would significantly increase the drawing speed. I'm sure there are countless other optimizations as well.
The MMX version is more than 3 times as fast as the basic alpha blending routine and can alpha blend a 640x480 16bpp screen in about 16 milliseconds. I hope I was clear enough for you to be able to follow my theories. If you have questions about anything e-mail me and I will do my best to answer them. Also, if you see additional improvement please drop me a line, I am always interested in improving this code. Although I feel some of the ideas I have outlined here are original they would not have been possible without all the excellent information I mined from the web. Therefore, thy taketh from the web thy shall giveth back to the web. All that is contained in this document is public domain. All I ask is that the document be maintained as a whole.
A zip file named "ht_alpha.zip" contains all the source code and a sample application to demonstrate what I have talked about in this article. The zip file should contain a VC++ project and these main files:
There are also debug and release versions of the executable. To run the app you need to move one of these exe's into the directory where the bmp files are located (or visa versa). The sample application uses the function keys 1-7 and the ESC key. F1 - Use DX BltFast to render image (no alpha blending) By: John Hebert
You can also contact me through the CDX mailing list. CDX is an open source DirectX wrapper that has many great features, one of which is alpha blending contributed by myself. CDX can be found at: http://www.cdx.sk/
© 1999-2011 Gamedev.net. All rights reserved. |