Some Examples

Here are a couple of simple example routines that demonstrate the points I made above.

Example 1: Blending 2 24-bit Images

This function blends 2 images together to produce a new image. By varying "Factor" between 0 and 32767 it is possible to fade from one image to another.

The input and output pointers have 2 non-standard hints applied to them. It would be sensible to define macros for these hints rather than writing out the full code sequence each time. This makes your code more portable, because an #ifdef can make the macros compile to nothing on other compilers.

  • The "restrict" keyword tells VectorC that the 2 input textures and the output texture are independent - so reads and writes can be reordered and combined.
  • "__declspec (alignedvalue (8))" says that the pointers are aligned to an 8-byte boundary. It is your responsibility to make sure that this is correct. If it is not, you may find that this code is much slower than if it did not use MMX (this depends on the processor you are running on and the sizes of the images). You could also use "__declspec (alignedvalue (4))".
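
As a rough sketch of the macro idea, the hints could be wrapped as shown here. The macro names ALIGNED8 and RESTRICT, the symbol USING_VECTORC (assumed to be defined by you in the VectorC build, not by the compiler) and the small ClearAlpha routine are all hypothetical and only there for illustration.

```c
/* Hypothetical wrapper macros. USING_VECTORC is an assumed symbol that you
   would define yourself (for example on the command line) for VectorC builds. */
#ifdef USING_VECTORC
#define ALIGNED8  __declspec (alignedvalue (8))
#define RESTRICT  restrict
#else
#define ALIGNED8  /* no alignment hint on other compilers */
#define RESTRICT  /* no aliasing hint on other compilers */
#endif

typedef struct {unsigned char R, G, B, A;} RGB;

/* Illustrative routine only: the hints sit where the full code sequence
   would otherwise be written, and vanish on other compilers. */
void ClearAlpha (ALIGNED8 RGB RESTRICT *Dest, int Count)
{
    int i;

    for (i = 0; i < Count; i++)
        Dest [i].A = 0;
}
```

On a compiler other than VectorC both macros expand to nothing, so the declaration reduces to an ordinary "RGB *Dest" parameter.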

Because this loop processes 4 bytes at a time and MMX works on 8 byte values, "__hint__ ((unroll (2)))" tells VectorC that it should unroll the loop to process 8 bytes at a time.

Blending requires multiplying the 2 images by a fractional value. MMX is integer only, so we need to convert these multiplications to fixed-point. The only fixed-point multiplication that MMX supports multiplies 2 signed 16-bit integers to give a 32-bit signed product and returns the high 16 bits as the result. This is not ideal because it means that the maximum fractional value is just under 0.5 (32767/65536). Later processors have an unsigned multiply which would allow us to take the fraction up to 1. So, we have to multiply the source by 2, make the factors range from 0 to 32767 (the maximum 16-bit signed value), and then shift the result right by 16. We also have to make sure that both factor variables (b1 and b2) are signed 16-bit ("short" in C).
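
The arithmetic can be checked with a small scalar model. This is only a sketch of the fixed-point scheme described above, not the code VectorC generates; the helper names MulHigh16 and BlendChannel are made up for this illustration.

```c
/* Scalar model of "multiply 2 signed 16-bit values, keep the high 16 bits".
   Assumes >> on a negative int is an arithmetic shift (true on common
   compilers, but implementation-defined in C). */
short MulHigh16 (short a, short b)
{
    return (short) (((int) a * (int) b) >> 16);
}

/* Blend one 8-bit channel of two images. Factor is clamped to 0..32767. */
short BlendChannel (unsigned char c1, unsigned char c2, int Factor)
{
    short b1, b2;

    if (Factor < 0) Factor = 0;
    if (Factor > 32767) Factor = 32767;
    b1 = (short) Factor;
    b2 = (short) (32767 - Factor);

    /* Doubling the source turns the "just under 0.5" maximum into "just under 1". */
    return (short) (MulHigh16 ((short) (c1 * 2), b1) +
                    MulHigh16 ((short) (c2 * 2), b2));
}
```

For example, BlendChannel (255, 0, 32767) gives 254 rather than 255, because 32767/32768 is slightly less than 1 and the shift truncates. BlendChannel (200, 100, 16384) gives 149.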

We then do a saturated conversion from the 16-bit signed intermediate values to 8-bit unsigned RGB values. This is strictly speaking unnecessary in this case, but I have added it because this is actually the fastest way to convert from 16-bit to 8-bit with MMX and is also useful if you want to change this routine to do other, similar, operations on 24-bit images. The normal form of converting from 16-bit to 8-bit (modulo arithmetic) can lead to very bright values overflowing and becoming very dark. It is better to "saturate" colour values.
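
The difference between the two conversions is easy to see in plain C. The sketch below uses two hypothetical helpers, WrapToByte and SaturateToByte, to contrast modulo arithmetic with saturation on a 16-bit intermediate value.

```c
/* Modulo conversion: only the low 8 bits survive, so a very bright
   value such as 300 wraps round to a dark 44. */
unsigned char WrapToByte (short v)
{
    return (unsigned char) v;
}

/* Saturating conversion: values are clamped to the range 0..255,
   so 300 becomes 255 and negative values become 0. */
unsigned char SaturateToByte (short v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (unsigned char) v;
}
```

The "if" tests in SaturateToByte have the same form as the clamping tests in the blend routine.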

There are several things you can do to this routine to adapt it for your own purposes. You could add "__hint__((prefetch))" to the definitions of Texture1 and Texture2 - this uses prefetch instructions on processors that support them (K6-2 and above and Pentium III and above). Prefetching speeds up memory reads. You could also try adding "__hint__((nontemporal))" to the definition of "Dest". This writes out without writing to the cache. This can massively speed up writing out data if you don't want to read it back in any time soon. It can also massively slow it down, so it is worth trying with and without. Non-temporal writes are available on Pentium III and Athlon and above.

typedef struct {unsigned char R, G, B, A;} RGB;

void BlendImages (__declspec (alignedvalue (8)) RGB restrict *Texture1,
                  __declspec (alignedvalue (8)) RGB restrict *Texture2,
                  __declspec (alignedvalue (8)) RGB restrict *Dest,
                  int Width, int Height, int Factor)
{
    int r, x, y;
    short R, G, B, R1, G1, B1, R2, G2, B2, b1, b2;

    if (Factor < 0) Factor = 0;
    if (Factor > 32767) Factor = 32767;
    b1 = Factor;
    b2 = 32767 - Factor;

    for (y=0; y < Height; y++)
        for (x=0; x < Width; x++)
        {
            __hint__ ((unroll (2)));

            R1 = Texture1 [y * Width + x].R * 2;
            G1 = Texture1 [y * Width + x].G * 2;
            B1 = Texture1 [y * Width + x].B * 2;
            R2 = Texture2 [y * Width + x].R * 2;
            G2 = Texture2 [y * Width + x].G * 2;
            B2 = Texture2 [y * Width + x].B * 2;

            R = (R1 * b1 >> 16) + (R2 * b2 >> 16);
            G = (G1 * b1 >> 16) + (G2 * b2 >> 16);
            B = (B1 * b1 >> 16) + (B2 * b2 >> 16);

            if (R < 0) R = 0;
            if (R > 255) R = 255;
            if (G < 0) G = 0;
            if (G > 255) G = 255;
            if (B < 0) B = 0;
            if (B > 255) B = 255;

            Dest [y * Width + x].R = R;
            Dest [y * Width + x].G = G;
            Dest [y * Width + x].B = B;
            Dest [y * Width + x].A = 0;
        }
}

Example 2: Rotating and Projecting 3D Vectors to Screen Coordinates

When drawing 3D objects on a 2D screen, you need to project vertices from 3D world coordinates to the 2D screen coordinates. This requires a rotation, translation, a test (to see if the point is behind the camera) and a division. 12-bit precision division is usually good enough for this.
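
To make the four steps concrete, here is a minimal plain-C sketch that handles a single point at full precision. The function name ProjectPoint and its argument names are invented for this illustration, and it carries no VectorC hints.

```c
typedef struct {float x, y, z; int Flag;} VECTOR;

/* Rotate, translate, test and project one point.
   Flag is 1 if the point was in front of the camera, otherwise 0. */
VECTOR ProjectPoint (float Rot [3] [3], VECTOR Trans, float Scale,
                     float CentreX, float CentreY, VECTOR In)
{
    VECTOR Out = {0.0f, 0.0f, 0.0f, 0};
    float x, y, z;

    /* Rotation followed by translation into camera space */
    x = Rot [0] [0] * In.x + Rot [0] [1] * In.y + Rot [0] [2] * In.z + Trans.x;
    y = Rot [1] [0] * In.x + Rot [1] [1] * In.y + Rot [1] [2] * In.z + Trans.y;
    z = Rot [2] [0] * In.x + Rot [2] [1] * In.y + Rot [2] [2] * In.z + Trans.z;

    /* Test: only points in front of the camera are divided by z */
    if (z >= 0.1f)
    {
        Out.x = Scale * x / z + CentreX;
        Out.y = Scale * y / z + CentreY;
        Out.z = z;
        Out.Flag = 1;
    }
    return Out;
}
```

With an identity rotation, a translation of (0, 0, 5), a Scale of 100 and a screen centre of (320, 240), the point (1, 2, 0) lands at screen position (340, 280).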

All the code uses float (no double or long double). Notice also that the constant (0.1) has an "f" after it to signify single precision.

The input and output vectors are aligned on 16-byte boundaries. The VECTOR type is also defined to be of size 16, with a 4-byte Flag (to specify whether a vector is in front of the camera and so can be projected).
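
Since the alignment hint is a promise rather than a check, it may be worth verifying both assumptions in your own code. The sketch below uses standard C11 features (_Static_assert and _Alignas), not VectorC syntax, and assumes 4-byte float and int. The names IsAligned16 and ExamplePoints are hypothetical.

```c
#include <stdint.h>

typedef struct {float x, y, z; int Flag;} VECTOR;

/* Compile-time check: three floats plus the flag should come to 16 bytes. */
_Static_assert (sizeof (VECTOR) == 16, "VECTOR must be 16 bytes");

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int IsAligned16 (const void *p)
{
    return ((uintptr_t) p % 16) == 0;
}

/* A static array forced onto a 16-byte boundary. */
_Alignas (16) VECTOR ExamplePoints [4];
```

Because each VECTOR is exactly 16 bytes, every element of an aligned array is itself aligned. A debug build could check IsAligned16 (InVectors) and IsAligned16 (OutVectors) before calling the routine.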

This routine will not be able to take advantage of MMX, but will speed up considerably when compiled for 3DNow! or SSE. This is the real advantage of VectorC: you need at least 3 versions of this routine (FPU, 3DNow! and SSE), and with VectorC you just compile the same source 3 times, once for each target.

typedef struct {float x, y, z; int Flag;} VECTOR;

void RotateProjectVectors (float CameraRotation [3] [3],
                           VECTOR CameraTranslation,
                           float Scale, float CentreX, float CentreY,
                           __declspec (alignedvalue (16)) VECTOR restrict *InVectors,
                           __declspec (alignedvalue (16)) VECTOR restrict *OutVectors,
                           int NumPoints)
{
    int i;
    float x, y, z;

    for (i=0; i < NumPoints; i++)
    {
        x = CameraRotation [0] [0] * InVectors [i].x
          + CameraRotation [0] [1] * InVectors [i].y
          + CameraRotation [0] [2] * InVectors [i].z
          + CameraTranslation.x;
        y = CameraRotation [1] [0] * InVectors [i].x
          + CameraRotation [1] [1] * InVectors [i].y
          + CameraRotation [1] [2] * InVectors [i].z
          + CameraTranslation.y;
        z = CameraRotation [2] [0] * InVectors [i].x
          + CameraRotation [2] [1] * InVectors [i].y
          + CameraRotation [2] [2] * InVectors [i].z
          + CameraTranslation.z;

        if (z >= 0.1f)
        {
            OutVectors [i].x = Scale * x __hint__((precision(12))) / z + CentreX;
            OutVectors [i].y = Scale * y __hint__((precision(12))) / z + CentreY;
            OutVectors [i].z = z;
            OutVectors [i].Flag = 1;
        }
        else
            OutVectors [i].Flag = 0;
    }
}