Using VectorC to Take Advantage of MMX, 3DNow! and SSE
by Andrew Richards
CodePlay Ltd

I have a confession to make. I am a speed addict. No, not the drug, but squeezing the last drop of time from code. I will stay up all night to get just a 0.1% reduction in frame time. The result can be spectacular, but maybe the time would be better spent elsewhere. Tweaking game-play takes time. It takes hours of playing and modifying, testing, discussing, and arguing to make a great playing game. I'm not alone. Hundreds, even thousands (maybe more) of programmers have the same habit. And it's got to stop. The thousands of lines of hand-coded assembly language, the carefully constructed #defines, writing different versions for different processors - all time better spent elsewhere. Which is why I have been working on VectorC - the compiler that does the work for you.

OK, that's quite a claim, and one I can't completely back up. C is a low-level language, so it doesn't provide all the information that a compiler needs to really make your code fly. To get parallelism (which you need on the latest processors) you need to tell the compiler whether it can safely reorder your instructions. You also need to tell it about alignment, because the latest processors are very alignment sensitive. So getting the best out of VectorC requires a few "hints" added to your code. A little work is required, but it's a lot quicker than writing hand-coded assembly. If you want to know more, read on.

MMX

I'm going to start with MMX, because it is the simplest of the technologies I'm going to discuss and it's the most widely available. MMX was designed by Intel for dealing with graphics and sound - "Multimedia Extensions". Eight new 64-bit registers were added to hold vector data. New instructions were added to process this vector data - the idea being that processing 8 bytes of data in one instruction is the same as executing 8 instructions in sequence. You can never quite get that factor 8 speed improvement, but factors of 2 to 4 are attainable.

To get VectorC to produce fast MMX object code from your C source, you need to do 4 things:

1. Store your data in memory in a form that can be read by MMX instructions.
2. Only use operations that are supported by MMX.
3. Make sure that the compiler can reorder your instructions so that they can be combined.
4. Don't try to combine MMX and floating-point code. This restriction is relaxed when you are using 3DNow! or Streaming SIMD Extensions.

1. Store your data in memory in a form that can be read by MMX instructions

MMX can only load or store 4 and 8-byte values from memory and works fastest when these reads and writes are aligned. You also need to make sure that your data is in arrays or structures of chars, shorts or ints. You can't mix data sizes within a single MMX register. So, you could store RGBA (a colour with alpha) values as:


struct RGBA {unsigned char R, G, B, Alpha;};

This is a 4-byte structure which could be put into a 2-dimensional array to make a 24-bit image with transparency/alpha. There is a problem if you don't want to include the alpha - an RGB value is 3 bytes, which doesn't load easily into an MMX register. If you are dealing with a large number of pixels at a time - in a contiguous array - then the compiler can unroll loops to process 4 RGB values at a time, which is 12 bytes. 12 bytes fit into 2 MMX registers easily - one holds 8 bytes and the other 4. However, you need to make sure that the first pixel you deal with is aligned on a 4-byte boundary. It is often easier to store the alpha value anyway and always assign 0 to it when you write to the red, green and blue.
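The layout rules above can be checked in ordinary C. The following is only a sketch: the helper name IsAligned and the compile-time size check are made up for this illustration and are not part of VectorC.

```c
#include <stddef.h>
#include <stdint.h>

struct RGBA {unsigned char R, G, B, Alpha;};

/* Compile-time check: the array size becomes -1 (a compile error)
   unless the structure is exactly 4 bytes. */
typedef char RGBA_must_be_4_bytes [(sizeof (struct RGBA) == 4) ? 1 : -1];

/* Returns 1 if p sits on a multiple of 'boundary' bytes, otherwise 0.
   'boundary' is assumed to be a power of 2 such as 4 or 8. */
int IsAligned (const void *p, size_t boundary)
   {
   return ((uintptr_t) p & (boundary - 1)) == 0;
   }
```

Calling IsAligned on the first pixel of an image before handing it to an MMX routine is a cheap way to catch a misaligned buffer during development.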

2. Only use operations that are supported by MMX

Operations on MMX values are limited to those that Intel decided would be appropriate to sound and image processing. The operations available are addition, subtraction, multiplication (only on signed 16-bit integers) and saturation. Saturation prevents overflows by clamping the output to the maximum and minimum values of a given type. You have to use "if" statements in C to do this. Example 1 below shows how.
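To make the difference between ordinary (modulo) arithmetic and saturation concrete, here is a small sketch in plain C. The function names are invented for the illustration, and whether VectorC turns this exact pattern into a saturating MMX instruction is an assumption; the article only says that "if" statements are the way to express it.

```c
/* Modulo addition: the result wraps round, so 200 + 100 gives 44. */
unsigned char AddWrap (unsigned char a, unsigned char b)
   {
   return (unsigned char) (a + b);
   }

/* Saturated addition: the result is clamped to 255, the largest
   value an unsigned char can hold. */
unsigned char AddSaturate (unsigned char a, unsigned char b)
   {
   int sum = a + b;
   if (sum > 255) sum = 255;
   return (unsigned char) sum;
   }
```

For colour values the wrapped result turns a very bright pixel into a dark one, while the saturated result simply stays at full brightness.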

3. Make sure that the compiler can reorder your instructions so that they can be combined.

Vectorization often requires that the operations in your source code be combined. Consider an example that adds 2 24-bit images together:


typedef struct {unsigned char R, G, B, A;} RGB;

void AddBitmaps (RGB *source1, RGB *source2, RGB *dest, int Width, int Height)
   {
   int x, y;
   for (y=0; y<Height; y++)
      for (x=0; x<Width; x++) {
         dest [y * Width + x].R = source1 [y * Width + x].R + source2 [y * Width + x].R;
         dest [y * Width + x].G = source1 [y * Width + x].G + source2 [y * Width + x].G;
         dest [y * Width + x].B = source1 [y * Width + x].B + source2 [y * Width + x].B;
         dest [y * Width + x].A = source1 [y * Width + x].A + source2 [y * Width + x].A;
      }
   }

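The example program above cannot be compiled to use MMX because the 4 lines of code that process the R, G, B and A components of each pixel must be executed in sequence and not at the same time. As far as the compiler knows, dest might overlap source1 or source2, so each write could change a value that a later line reads. There are 2 solutions to the problem: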

Reorder the code yourself.


void AddBitmaps (RGB *source1, RGB *source2, RGB *dest, int Width, int Height)
   {
   int x, y;
   unsigned char R, G, B, A;
   for (y=0; y<Height; y++)
      for (x=0; x<Width; x++) {
         R = source1 [y * Width + x].R; G = source1 [y * Width + x].G;
         B = source1 [y * Width + x].B; A= source1 [y * Width + x].A;
         R += source2 [y * Width + x].R; G += source2 [y * Width + x].G;
         B += source2 [y * Width + x].B; A += source2 [y * Width + x].A;
         dest [y * Width + x].R = R; dest [y * Width + x].G = G;
         dest [y * Width + x].B = B; dest [y * Width + x].A = A;
      }
   }

Tell VectorC that the pointers you are using definitely point to different areas of memory. This lets the compiler reorder memory reads and writes itself. Use the "restrict" keyword for this. This works better than the above code when the loop is unrolled, so I recommend this solution.


void AddBitmaps (RGB restrict *source1, RGB restrict *source2,
                 RGB restrict *dest, int Width, int Height)
   {
   int x, y;
   for (y=0; y<Height; y++)
      for (x=0; x<Width; x++) {
         dest [y * Width + x].R = source1 [y * Width + x].R + source2 [y * Width + x].R;
         dest [y * Width + x].G = source1 [y * Width + x].G + source2 [y * Width + x].G;
         dest [y * Width + x].B = source1 [y * Width + x].B + source2 [y * Width + x].B;
         dest [y * Width + x].A = source1 [y * Width + x].A + source2 [y * Width + x].A;
      }
   }
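To see why the compiler has to be so cautious when it is not told that the pointers are independent, consider this small sketch in standard C. Both function names are made up for the illustration. The first adds bytes strictly in source order; the second does what a vectorized version would do, reading all 4 inputs before writing any output.

```c
/* Adds n bytes one at a time, exactly in source order. */
void AddInOrder (const unsigned char *a, const unsigned char *b,
                 unsigned char *dest, int n)
   {
   int i;
   for (i=0; i<n; i++)
      dest [i] = a [i] + b [i];
   }

/* Adds 4 bytes the way a vectorized version would:
   every read happens before any write. */
void AddFourAtOnce (const unsigned char *a, const unsigned char *b,
                    unsigned char *dest)
   {
   unsigned char t0 = a [0] + b [0], t1 = a [1] + b [1];
   unsigned char t2 = a [2] + b [2], t3 = a [3] + b [3];
   dest [0] = t0; dest [1] = t1; dest [2] = t2; dest [3] = t3;
   }
```

With separate buffers the two functions agree. If dest points 1 byte past a, and both a and b contain all 1s, the first writes 2, 3, 4, 5 and the second writes 2, 2, 2, 2. Without "restrict" the compiler cannot rule this case out, so it has to keep the slower, in-order version.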

4. Don't try to combine MMX and floating-point code

Unfortunately, when MMX was designed, DOS was still the main operating system. This led to a problem - how can a program save the processor state when swapping tasks or in an interrupt? You could add new code to save the new MMX registers during a task switch, but under DOS, a lot of different programs had this kind of code - so you would have had a serious compatibility problem. The solution that Intel came up with was to map the MMX registers to the floating-point registers, so after writing to an MMX register you can read the result from a floating-point register. However, this requires a mode-change - which is slow and also means that you can't use floating-point and MMX code in the same area of your code. 3DNow! and SSE give a partial solution to this problem, but only for floating-point code that can execute entirely within the restrictions of these 2 technologies.

3DNow!

3DNow! was designed by AMD as an extension to MMX to support single-precision floating-point arithmetic, which is used in a lot of 3D games. It uses the MMX registers but adds new floating-point instructions that can operate on 1 or 2 single-precision floating-point values at a time. Because MMX and the normal floating-point registers cannot be used in the same area of code, 3DNow! can only be used in areas of code whose only floating-point operations are within 3DNow's features.

3DNow! supports: addition, subtraction, multiplication, division, negation, absolute value, comparison, conversion to and from integers and reciprocal square root. It is also possible to compute division and reciprocal square root to just 12-bit precision for extra speed.

Streaming SIMD Extensions

Streaming SIMD Extensions (SSE) were designed by Intel and are available in its Pentium III processor and in the newest Celeron processors. These new instructions add 8 new 128-bit registers which do not suffer from MMX's restriction on mixing with FPU (floating-point unit) code. However, operating system support is required (Windows 98 has it and Windows NT4 can have it added with a service pack). SSE instructions operate on 1 or 4 single-precision floating-point values.

SSE supports: addition, subtraction, multiplication, division, negation, absolute value, square root, comparison, conversion to and from integers and reciprocal square root. It is also possible to compute division and reciprocal square root to just 12-bit precision for extra speed.

Both 3DNow! and Streaming SIMD Extensions can operate on normal single-precision floating-point values, so the compiler doesn't need to vectorize its code to get a speed increase. However, under some circumstances, SSE can be slower when operating on single floats than the normal FPU code! I don't yet know exactly when this is the case, so watch out: you may recompile your code for SSE and find it is slower. If you follow the tips given for vectorization, it may speed up considerably, or you may want to put this code in a separate source file that is compiled without SSE support.

Some Examples

Here are a couple of simple example routines that demonstrate the points I made above.

Example 1: Blending 2 24-bit Images

This function blends 2 images together to produce a new image. By varying "Factor" between 0 and 32767 it is possible to fade from one image to another.

The input and output pointers have 2 non-standard hints applied to them. It would be sensible to define macros to use instead of the full code sequence - this would make your code more portable because a #ifdef can be added to make these macros compile to nothing on other compilers.

The "restrict" keyword tells VectorC that the 2 input textures and the output texture are independent - so reads and writes can be reordered and combined.
"__declspec (alignedvalue (8))" says that the pointers are aligned to an 8-byte boundary. It is your responsibility to make sure that this is correct. If it is not, you may find that this code is much slower than if it did not use MMX (this depends on the processor you are running on and the sizes of the images). You could also use "__declspec (alignedvalue (4))".

Because this loop processes 4 bytes at a time and MMX works on 8 byte values, "__hint__ ((unroll (2)))" tells VectorC that it should unroll the loop to process 8 bytes at a time.
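One possible way to wrap these hints in macros is sketched below. Everything named here is an assumption made for the illustration: USING_VECTORC is a hypothetical symbol that you would define yourself in the VectorC build (it is not a macro the compiler is known to set), and RESTRICT, ALIGNED8, UNROLL2 and CopyRed are invented names.

```c
/* USING_VECTORC is hypothetical: define it yourself when building
   with VectorC. On any other compiler the hints compile to nothing. */
#ifdef USING_VECTORC
#define RESTRICT  restrict
#define ALIGNED8  __declspec (alignedvalue (8))
#define UNROLL2   __hint__ ((unroll (2)));
#else
#define RESTRICT
#define ALIGNED8
#define UNROLL2
#endif

typedef struct {unsigned char R, G, B, A;} RGB;

/* Copies the red channel only; it exists just to show the macros in use. */
void CopyRed (ALIGNED8 RGB RESTRICT *Source, ALIGNED8 RGB RESTRICT *Dest,
              int Count)
   {
   int i;
   for (i=0; i < Count; i++)
      {
      UNROLL2
      Dest [i].R = Source [i].R;
      }
   }
```

The same source then builds unchanged on other compilers, just without the extra information that lets VectorC vectorize the loop.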

Blending requires multiplying the 2 images by a fractional value. MMX is integer only, so we need to convert these multiplications to fixed-point. The only fixed-point multiplication that MMX supports is to multiply 2 signed 16-bit integers to a 32-bit signed integer and give the high 16-bits as the result. This is not ideal because it means that the maximum fractional value is 0.5. Later processors have an unsigned multiply which would allow us to take the fraction up to 1. So, we have to multiply the source by 2, make the factors range from 0 to 32767 (the maximum 16-bit signed value), and then shift the result right by 16. We also have to make sure that both factor variables (b1 and b2) are signed 16-bit ("short" in C).
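The arithmetic is easier to follow with numbers. The sketch below models the multiply in plain C; the function names MulHigh16 and ScaleColour are invented for the illustration, and the code assumes non-negative inputs (right-shifting a negative value is implementation-defined in C).

```c
/* Models the MMX fixed-point multiply: the high 16 bits of a signed
   16-bit by 16-bit product. Inputs are assumed to be non-negative. */
short MulHigh16 (short a, short b)
   {
   return (short) (((int) a * (int) b) >> 16);
   }

/* Scales an 8-bit colour value by roughly Factor / 32767, where
   0 <= Factor <= 32767. The value is doubled first to make up for
   the multiply only reaching a fraction of about 0.5. */
short ScaleColour (unsigned char Value, short Factor)
   {
   return MulHigh16 ((short) (Value * 2), Factor);
   }
```

For example, a colour value of 200 with a factor of 16384 (about one half) gives 400 * 16384 >> 16 = 100. At the maximum factor of 32767 a value of 255 comes out as 254, so the result is always very slightly low. Multiplying the 2 largest inputs, 32767 by 32767, gives only 16383, which is the 0.5 limit mentioned above.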

We then do a saturated conversion from the 16-bit signed intermediate values to 8-bit unsigned RGB values. This is strictly speaking unnecessary in this case, but I have added it because this is actually the fastest way to convert from 16-bit to 8-bit with MMX and is also useful if you want to change this routine to do other, similar, operations on 24-bit images. The normal form of converting from 16-bit to 8-bit (modulo arithmetic) can lead to very bright values overflowing and becoming very dark. It is better to "saturate" colour values.

There are several things you can do to this routine to adapt it for your own purposes. You could add "__hint__((prefetch))" to the definitions of Texture1 and Texture2 - this uses prefetch instructions on processors that support them (K6-2 and above and Pentium III and above). Prefetching speeds up memory reads. You could also try adding "__hint__((nontemporal))" to the definition of "Dest". This writes out without writing to the cache. This can massively speed up writing out data if you don't want to read it back in any time soon. It can also massively slow it down, so it is worth trying with and without. Non-temporal writes are available on Pentium III and Athlon and above.


typedef struct {unsigned char R, G, B, A;} RGB;

void BlendImages (__declspec (alignedvalue (8)) RGB restrict *Texture1,
                  __declspec (alignedvalue (8)) RGB restrict *Texture2,
                  __declspec (alignedvalue (8)) RGB restrict *Dest,
                  int Width, int Height, int Factor)
   {
   int x, y;
   short R, G, B, R1, G1, B1, R2, G2, B2, b1, b2;

   if (Factor < 0) Factor = 0;
   if (Factor > 32767) Factor = 32767;
   b1 = Factor;
   b2 = 32767 - Factor;
   for (y=0; y < Height; y++)
      for (x=0; x < Width; x++)
         {
         __hint__ ((unroll (2)));
         R1 = Texture1 [y * Width + x].R * 2;
         G1 = Texture1 [y * Width + x].G * 2;
         B1 = Texture1 [y * Width + x].B * 2;
         R2 = Texture2 [y * Width + x].R * 2;
         G2 = Texture2 [y * Width + x].G * 2;
         B2 = Texture2 [y * Width + x].B * 2;
         R = (R1 * b1 >> 16) + (R2 * b2 >> 16);
         G = (G1 * b1 >> 16) + (G2 * b2 >> 16);
         B = (B1 * b1 >> 16) + (B2 * b2 >> 16);
         if (R < 0) R = 0;
         if (R > 255) R = 255;
         if (G < 0) G = 0;
         if (G > 255) G = 255;
         if (B < 0) B = 0;
         if (B > 255) B = 255;
         Dest [y * Width + x].R = R;
         Dest [y * Width + x].G = G;
         Dest [y * Width + x].B = B;
         Dest [y * Width + x].A = 0;
         }
   }

Example 2: Rotating and Projecting 3D Vectors to Screen Coordinates

When drawing 3D objects on a 2D screen, you need to project vertices from 3D world coordinates to the 2D screen coordinates. This requires a rotation, translation, a test (to see if the point is behind the camera) and a division. 12-bit precision division is usually good enough for this.

All the code uses float (no double or long double). Notice also that the constant (0.1) has an "f" after it to signify single precision.

The input and output vectors are aligned on 16-byte boundaries. The VECTOR type is also defined to be of size 16, with a 4-byte Flag (to specify whether a vector is in front of the camera and so can be projected).

This routine will not be able to take advantage of MMX, but will speed up considerably when compiled for 3DNow! or SSE. This is the real advantage of VectorC. You need at least 3 versions of this routine - FPU, 3DNow! and SSE - and with VectorC you just compile the same source 3 times.


typedef struct {float x, y, z; int Flag;} VECTOR;

void RotateProjectVectors (float CameraRotation [3] [3], VECTOR CameraTranslation,
                           float Scale, float CentreX, float CentreY,
                           __declspec (alignedvalue (16)) VECTOR restrict *InVectors,
                           __declspec (alignedvalue (16)) VECTOR restrict *OutVectors,
                           int NumPoints)
   {
   int i;
   float x, y, z;

   for (i=0; i < NumPoints; i++)
      {
      x = CameraRotation [0] [0] * InVectors [i].x
        + CameraRotation [0] [1] * InVectors [i].y
        + CameraRotation [0] [2] * InVectors [i].z + CameraTranslation.x;
      y = CameraRotation [1] [0] * InVectors [i].x
        + CameraRotation [1] [1] * InVectors [i].y
        + CameraRotation [1] [2] * InVectors [i].z + CameraTranslation.y;
      z = CameraRotation [2] [0] * InVectors [i].x
        + CameraRotation [2] [1] * InVectors [i].y
        + CameraRotation [2] [2] * InVectors [i].z + CameraTranslation.z;
      if (z >= 0.1f)
         {
         OutVectors [i].x = Scale * x __hint__((precision(12))) / z + CentreX;
         OutVectors [i].y = Scale * y __hint__((precision(12))) / z + CentreY;
         OutVectors [i].z = z;
         OutVectors [i].Flag = 1;
         }
      else
         OutVectors [i].Flag = 0;
      }
   }
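The routine above relies on the caller to supply 16-byte aligned arrays of 16-byte vectors. One way to arrange that in standard C is sketched here. The names AllocVectors and Raw are invented for the illustration, and the size check assumes 4-byte float and int.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {float x, y, z; int Flag;} VECTOR;

/* Compile-time check that VECTOR is 16 bytes: the array size
   becomes -1 (a compile error) if it is not. */
typedef char VECTOR_must_be_16_bytes [(sizeof (VECTOR) == 16) ? 1 : -1];

/* Allocates Count vectors starting on a 16-byte boundary. *Raw receives
   the pointer that must later be passed to free (). Returns NULL if the
   allocation fails. */
VECTOR *AllocVectors (size_t Count, void **Raw)
   {
   uintptr_t Address;

   *Raw = malloc (Count * sizeof (VECTOR) + 15);
   if (*Raw == NULL) return NULL;
   /* Round the address up to the next multiple of 16. */
   Address = ((uintptr_t) *Raw + 15) & ~(uintptr_t) 15;
   return (VECTOR *) Address;
   }
```

The extra 15 bytes leave room to round the start address up, which is why the original pointer has to be kept separately for freeing.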

Date this article was posted to GameDev.net: 7/7/2000
(Note that this date does not necessarily correspond to the date the article was written)
