MMX

I'm going to start with MMX, because it is the simplest of the technologies I'm going to discuss and the most widely available. MMX ("Multimedia Extensions") was designed by Intel for dealing with graphics and sound. Eight new 64-bit registers were added to hold vector data, along with new instructions to process it. The idea is that processing 8 bytes of data in one instruction does the work of 8 instructions executed in sequence. You can never quite get that factor-of-8 speed improvement, but factors of 2 to 4 are attainable.

To get VectorC to produce fast MMX object code from your C source, you need to do 4 things:

1. Store your data in memory in a form that can be read by MMX instructions.
2. Only use operations that are supported by MMX.
3. Make sure that the compiler can reorder your instructions so that they can be combined.
4. Don't try to combine MMX and floating-point code. This restriction is relaxed when you are using 3DNow! or Streaming SIMD Extensions.

1. Store your data in memory in a form that can be read by MMX instructions

MMX can only load or store 4-byte and 8-byte values from memory, and works fastest when these reads and writes are aligned. You also need to make sure that your data is in arrays or structures of chars, shorts or ints. You can't mix data sizes within a single MMX register. So, you could store RGBA values (a colour with alpha) as:


struct RGBA {unsigned char R, G, B, Alpha;};

This is a 4-byte structure which could be put into a 2-dimensional array to make a 32-bit image (24-bit colour plus 8 bits of transparency/alpha). There is a problem if you don't want to include the alpha - an RGB value is 3 bytes, which doesn't load easily into an MMX register. If you are dealing with a large number of pixels at a time - in a contiguous array - then the compiler can unroll loops to process 4 RGB values at a time, which is 12 bytes. 12 bytes fit into 2 MMX registers easily - one holds 8 bytes and the other 4. However, you need to make sure that the first pixel you deal with is aligned on a 4-byte boundary. It is often easier to store the alpha value anyway and always assign 0 to it when you write to the red, green and blue.
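The padded layout described above can be sketched as follows. This is only an illustration: the names RGBX, set_pixel and is_aligned8 are made up for this example and are not part of VectorC, and the alignment check assumes a C99 compiler that provides uintptr_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-byte pixel. The fourth byte is padding when no alpha
   is needed, so each pixel matches a 4-byte load and two pixels fill
   one 8-byte MMX register. */
struct RGBX { unsigned char R, G, B, Pad; };

/* Write a colour and always store 0 in the unused byte. */
static void set_pixel(struct RGBX *p, unsigned char r,
                      unsigned char g, unsigned char b)
{
    assert(p != 0);
    p->R = r;
    p->G = g;
    p->B = b;
    p->Pad = 0;
}

/* Return 1 if the address is a multiple of 8, 0 otherwise. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}
```

A check like is_aligned8 could be used on the start of an image buffer before handing it to a vectorized loop. Whether an unaligned buffer is merely slower or needs separate handling depends on the compiler and the processor.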

2. Only use operations that are supported by MMX

Operations on MMX values are limited to those that Intel decided would be appropriate to sound and image processing. The operations available are addition, subtraction, multiplication (only on signed 16-bit integers) and saturation. Saturation means preventing overflow by clamping the result to the maximum and minimum values of the given type. C has no saturating arithmetic operators, so you have to use "if" statements to express it. An example is given below.
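As a minimal sketch of saturation written with "if" statements, here is unsigned 8-bit addition and subtraction that clamp instead of wrapping. The function names are invented for this example. The source only says that "if" statements are the way to express saturation, so treat the exact pattern a vectorizing compiler recognises as an assumption to check against its documentation.

```c
#include <assert.h>

/* Add two unsigned bytes, clamping the result to 255 instead of wrapping. */
static unsigned char add_saturate_u8(unsigned char a, unsigned char b)
{
    int sum = a + b;          /* computed in int, so it cannot overflow */
    if (sum > 255)
        sum = 255;
    return (unsigned char)sum;
}

/* Subtract two unsigned bytes, clamping the result to 0. */
static unsigned char sub_saturate_u8(unsigned char a, unsigned char b)
{
    int diff = a - b;
    if (diff < 0)
        diff = 0;
    return (unsigned char)diff;
}

/* Apply the saturating add to every element of two byte arrays. */
void add_saturate_array(const unsigned char *a, const unsigned char *b,
                        unsigned char *out, int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = add_saturate_u8(a[i], b[i]);
}
```

With plain unsigned char addition, 200 + 100 would wrap round to 44. With the clamp, it stays at 255, which is usually what you want when brightening an image.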

3. Make sure that the compiler can reorder your instructions so that they can be combined.

Vectorization often requires that the operations in your source code be combined. Consider an example that adds two RGBA images together:


typedef struct {unsigned char R, G, B, A;} RGB;

void AddBitmaps (RGB *source1, RGB *source2, RGB *dest, int Width, int Height)
   {
   int x, y;
   for (y=0; y<Height; y++)
      for (x=0; x<Width; x++) {
         dest [y * Width + x].R = source1 [y * Width + x].R
                                + source2 [y * Width + x].R;
         dest [y * Width + x].G = source1 [y * Width + x].G
                                + source2 [y * Width + x].G;
         dest [y * Width + x].B = source1 [y * Width + x].B
                                + source2 [y * Width + x].B;
         dest [y * Width + x].A = source1 [y * Width + x].A
                                + source2 [y * Width + x].A;
      }
   }

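The example program above cannot be compiled to use MMX as written. As far as the compiler can tell, dest might overlap source1 or source2, so a write to one component of dest could change a value that a later line reads. The 4 lines of code that process the R, G, B and A components of each pixel must therefore be executed in sequence and not at the same time. There are 2 solutions to the problem: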

Reorder the code yourself.


void AddBitmaps (RGB *source1, RGB *source2, RGB *dest, int Width, int Height)
   {
   int x, y;
   unsigned char R, G, B, A;
   for (y=0; y<Height; y++)
      for (x=0; x<Width; x++) {
         R = source1 [y * Width + x].R; G = source1 [y * Width + x].G;
         B = source1 [y * Width + x].B; A = source1 [y * Width + x].A;
         R += source2 [y * Width + x].R; G += source2 [y * Width + x].G;
         B += source2 [y * Width + x].B; A += source2 [y * Width + x].A;
         dest [y * Width + x].R = R; dest [y * Width + x].G = G;
         dest [y * Width + x].B = B; dest [y * Width + x].A = A;
      }
   }

Tell VectorC that the pointers you are using definitely point to different areas of memory, using the "restrict" keyword. This lets the compiler reorder memory reads and writes itself. This works better than the code above when the loop is unrolled, so I recommend this solution.


void AddBitmaps (RGB restrict *source1, RGB restrict *source2,
                 RGB restrict *dest, int Width, int Height)
   {
   int x, y;
   for (y=0; y<Height; y++)
      for (x=0; x<Width; x++) {
         dest [y * Width + x].R = source1 [y * Width + x].R
                                + source2 [y * Width + x].R;
         dest [y * Width + x].G = source1 [y * Width + x].G
                                + source2 [y * Width + x].G;
         dest [y * Width + x].B = source1 [y * Width + x].B
                                + source2 [y * Width + x].B;
         dest [y * Width + x].A = source1 [y * Width + x].A
                                + source2 [y * Width + x].A;
      }
   }
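For comparison, here is a sketch of the same idea in standard C99 spelling, where "restrict" is written after the "*" because it qualifies the pointer itself. The listing above uses the placement shown for VectorC. The name AddBitmapsC99 and the flattened single loop are choices made for this example only.

```c
#include <assert.h>

typedef struct { unsigned char R, G, B, A; } RGB;

/* C99 form: each pointer is declared restrict, promising the compiler
   that dest does not overlap source1 or source2. */
void AddBitmapsC99(const RGB * restrict source1, const RGB * restrict source2,
                   RGB * restrict dest, int Width, int Height)
{
    int i;
    int count = Width * Height;   /* pixels are stored contiguously */
    assert(Width >= 0 && Height >= 0);
    for (i = 0; i < count; i++) {
        /* Each component wraps modulo 256, as in the original example. */
        dest[i].R = (unsigned char)(source1[i].R + source2[i].R);
        dest[i].G = (unsigned char)(source1[i].G + source2[i].G);
        dest[i].B = (unsigned char)(source1[i].B + source2[i].B);
        dest[i].A = (unsigned char)(source1[i].A + source2[i].A);
    }
}
```

The promise is yours to keep. If you call this with a dest that does overlap a source, the behaviour is undefined, so only use "restrict" when you control every caller.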

4. Don't try to combine MMX and floating-point code

Unfortunately, when MMX was designed, DOS was still the main operating system. This led to a problem - how can a program save the processor state when swapping tasks or in an interrupt? You could add new code to save the new MMX registers during a task switch, but under DOS, a lot of different programs had this kind of code - so you would have had a serious compatibility problem. The solution that Intel came up with was to map the MMX registers onto the floating-point registers, so after writing to an MMX register you can read the result from a floating-point register. However, switching between the two requires a mode change (the EMMS instruction), which is slow, and it also means that you can't use floating-point and MMX code in the same area of your code. 3DNow! and SSE give a partial solution to this problem, but only for floating-point code that can execute entirely within the restrictions of these 2 technologies.
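One common way to keep floating-point out of an integer pixel loop is fixed-point arithmetic. The sketch below scales bytes by a fraction expressed in 128ths, so every intermediate value fits in a signed 16-bit integer. The names scale_fixed and DarkenBytes and the choice of 7 fraction bits are assumptions made for this example, not something VectorC requires.

```c
#include <assert.h>

/* Scale an 8-bit value by factor/128, where factor is 0..128.
   The largest product is 255 * 128 = 32640, which still fits in a
   signed 16-bit integer. */
static unsigned char scale_fixed(unsigned char value, short factor)
{
    short product;
    assert(factor >= 0 && factor <= 128);
    product = (short)(value * factor);
    return (unsigned char)(product >> 7);   /* divide by 128 */
}

/* Scale every byte of a buffer without using any floating-point. */
void DarkenBytes(const unsigned char *src, unsigned char *dest,
                 int count, short factor)
{
    int i;
    for (i = 0; i < count; i++)
        dest[i] = scale_fixed(src[i], factor);
}
```

Here a factor of 64 means one half and 128 means no change. If the scale starts life as a float, convert it to this form once, outside the loop, so the loop body itself stays integer-only.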
