Part 4 : Digging Deeper

So now you have a pretty fast C++ class, but you're still not happy. Time to go even deeper.

1. Loop optimizations

Loop unrolling used to be a "big thing". What is it? Well, some loops can simply be written outright. Consider:


for( int i = 0; i < 3; i++ ) array[i] = i;

This is logically the same as:


array[0] = 0; array[1] = 1; array[2] = 2;

The second version is slightly faster, because no loop has to be set up: initializing, testing, and incrementing i all take some time. Most compilers can already do this for you though, so in most cases you will gain very little and pay for it with a lot of code bloat. My best advice here is: if you can't find anything else to speed up, try it, but don't be surprised if it doesn't make a difference.
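For loops that are too long to write out completely, the usual compromise is to unroll by a fixed factor. Here is a minimal sketch of that idea, assuming a factor of four; the function names fill_rolled and fill_unrolled are made up for this illustration:

```cpp
#include <cassert>
#include <cstddef>

// Straightforward loop: one compare, one increment and one store per element.
void fill_rolled(int* array, std::size_t count)
{
    assert(array != nullptr || count == 0);
    for (std::size_t i = 0; i < count; i++) array[i] = static_cast<int>(i);
}

// Hand-unrolled by four: the compare and increment are paid once per four stores.
void fill_unrolled(int* array, std::size_t count)
{
    assert(array != nullptr || count == 0);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
    {
        array[i]     = static_cast<int>(i);
        array[i + 1] = static_cast<int>(i + 1);
        array[i + 2] = static_cast<int>(i + 2);
        array[i + 3] = static_cast<int>(i + 3);
    }
    // Remainder loop: handles the last 0 to 3 elements when count is not a multiple of four.
    for (; i < count; i++) array[i] = static_cast<int>(i);
}
```

Both functions produce the same array. Whether the unrolled one is actually faster depends on your compiler and its settings, so time it before you keep it.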

2. Bit shifting

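Bit shifting works for integers only. It is basically a way to multiply or divide by a power of two that is cheaper than a straight multiplication (and CERTAINLY cheaper than a division). Be careful with negative numbers: on most compilers, right-shifting a negative value rounds toward negative infinity, whereas integer division rounds toward zero.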

To understand how to use it, think of it using these formulae (here ^ means "to the power of", not the C++ XOR operator):


x << y  ⇔  x * 2^y
x >> y  ⇔  x / 2^y

I think André LaMothe made a big deal of this in his "Tricks of the Game Programming Gurus" books; that's probably where you heard about it. It's where I heard about it anyway. In some cases, it can be very rewarding indeed. Consider the following (simplistic) code:


i *= 256;

versus


i = i << 8;

Logically, they are the same. For this simple example, the compiler might even turn the first into the second, but as you get more complex ( i = (i << 8) + (i << 4) is equivalent to i *= 272, for example ) the compiler might not be able to make the conversion. Note the parentheses: + binds tighter than <<, so without them the expression means something else entirely.
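To make those equivalences concrete, here is a small sketch. The helper names (times_256, times_272, div_16, times_pow2) are invented for this example, and unsigned is used so the shifts are well defined for every input:

```cpp
#include <cassert>
#include <climits>

// Multiply by 256 = 2^8 with a single shift.
constexpr unsigned times_256(unsigned i) { return i << 8; }

// Multiply by 272 = 256 + 16. The parentheses matter: + binds tighter than <<.
constexpr unsigned times_272(unsigned i) { return (i << 8) + (i << 4); }

// Divide by 16 = 2^4, rounding down.
constexpr unsigned div_16(unsigned i) { return i >> 4; }

// General form: x * 2^y. Shifting by the full width of the type or more
// is undefined, hence the check.
unsigned times_pow2(unsigned x, unsigned y)
{
    assert(y < sizeof(unsigned) * CHAR_BIT);
    return x << y;
}

// Checked at compile time: the shifted forms agree with plain arithmetic.
static_assert(times_256(3) == 3 * 256, "shift by 8 multiplies by 256");
static_assert(times_272(3) == 3 * 272, "two shifts and an add multiply by 272");
static_assert(div_16(100) == 100 / 16, "shift right by 4 divides by 16");
```

If you want to know whether your compiler already performs this conversion on its own, compare the generated assembler for both forms.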

3. Pointer dereference hell

Do you have code like this in your program?


for( int i = 0; i < numPixels; i++ )
{
   rendering_context->back_buffer->surface->bits[i] = some_value; 
}

The exaggeration probably makes the problem stand out. This is a long loop, and all that pointer-indirection is going to eat more time than Homer eats donuts.

You might think this is a contrived example, but I've seen a lot of code like this released on the 'net.

Why not do this?


unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits;
for( int i = 0; i < numPixels; i++ )
{
   back_surface_bits[i] = some_value;
}

You're avoiding a lot of dereferencing here, which can only improve speed, and that's a good thing!

Goltrpoat on Gamedev.net pointed out to me that it could be faster still. Here's his very valid suggestion:


unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits;
for( int i = 0; i < numPixels; i++,back_surface_bits++ )
{
   *back_surface_bits = some_value;
}
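Here is a compilable sketch of the same idea. The struct definitions are assumptions on my part (the snippets above never show them), and fill_pixels is a made-up name. There is a reason the compiler often can't do this for you: a store through an unsigned char pointer is allowed to alias any object, so the compiler may have to assume that writing a pixel could change one of the pointers in the chain, and reload them on every iteration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types, just enough to make the pointer chain compile.
struct Surface          { unsigned char* bits; };
struct BackBuffer       { Surface* surface; };
struct RenderingContext { BackBuffer* back_buffer; };

void fill_pixels(RenderingContext* rendering_context,
                 std::size_t numPixels,
                 unsigned char some_value)
{
    assert(rendering_context != nullptr);
    assert(rendering_context->back_buffer != nullptr);
    assert(rendering_context->back_buffer->surface != nullptr);

    // Walk the pointer chain once, outside the loop.
    unsigned char* back_surface_bits = rendering_context->back_buffer->surface->bits;
    unsigned char* const end = back_surface_bits + numPixels;

    // The pointer itself is the loop variable, so there is no separate index to maintain.
    for (; back_surface_bits != end; ++back_surface_bits)
    {
        *back_surface_bits = some_value;
    }
}
```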

The previous item was just a special (albeit frequent) case of the following:

4. Unnecessary calculations within loops

Consider this loop:


for( int i = 0; i < numPixels; i++ )
{
   float brighten_value = view_direction*light_brightness*( 1.0f / view_distance );
   back_surface_bits[i] *= brighten_value;
}

The calculation of brighten_value is not only expensive, it's unnecessary. The calculation is not influenced by anything that happens within the loop, so you can simply move it outside the loop, and keep re-using that value inside the loop.
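A sketch of the loop with the calculation hoisted out could look like this. The function name brighten and its parameter list are assumptions, as are the float types. The clamp to 255 is my addition too, so that the result always fits back into an unsigned char:

```cpp
#include <cassert>
#include <cstddef>

void brighten(unsigned char* back_surface_bits, std::size_t numPixels,
              float view_direction, float light_brightness, float view_distance)
{
    assert(view_distance != 0.0f);

    // Computed once: nothing inside the loop changes any of these inputs.
    const float brighten_value = view_direction * light_brightness * (1.0f / view_distance);

    for (std::size_t i = 0; i < numPixels; i++)
    {
        float scaled = back_surface_bits[i] * brighten_value;
        // Clamp so the conversion back to unsigned char is always in range.
        if (scaled > 255.0f) scaled = 255.0f;
        if (scaled < 0.0f)   scaled = 0.0f;
        back_surface_bits[i] = static_cast<unsigned char>(scaled);
    }
}
```

The loop body is now one multiplication per pixel instead of two multiplications and a division.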

This problem can occur in other ways too - unnecessary initialization in functions that you call within the loop, or in object constructors. Be careful when you code, and always ask yourself: "do I really need to do this?"

5. Inline assembler

Inline assembler is the last resort. If you really, REALLY know what you are doing, and why it will be faster, you can use inline assembler, or even pure assembler with C-style linkage so it can be called from your C/C++ program. However, if you use inline assembler, you'll either have to work with conditional compilation (testing whether the processor you're writing assembler for is supported on the platform you are compiling for), or give up on source compatibility with other platforms. For straight 80x86 assembler, you probably won't mind much, but when you get to MMX, SSE, or 3DNow! instructions, you are limiting the range of machines your code will run on.
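To give an idea of what the conditional compilation looks like, here is a deliberately trivial sketch. It assumes GCC or Clang extended asm syntax on an x86 target and falls back to plain C++ everywhere else; the function names are made up, and adding two integers is obviously not something you would ever write assembler for:

```cpp
#include <cassert>

// Portable reference version, used as the fallback and as a sanity check.
int add_ints_portable(int a, int b) { return a + b; }

int add_ints(int a, int b)
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    // GCC/Clang extended asm, AT&T operand order: a += b.
    // "+r"(a) is a read-write register operand, "r"(b) a read-only one,
    // and "cc" tells the compiler that the flags register is modified.
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    // Any other compiler or processor: use the plain C++ version.
    return add_ints_portable(a, b);
#endif
}

// Hand-written assembler is easy to get wrong, so compare it against the portable version.
void check_add_ints()
{
    assert(add_ints(2, 3) == add_ints_portable(2, 3));
    assert(add_ints(-7, 7) == add_ints_portable(-7, 7));
}
```

Keeping a portable version next to the assembler one costs very little, and it gives you something to test against and to time against.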

When you get this low, a disassembler may be useful too. You can instruct most compilers to generate intermediate assembler code, so you can browse through it and see whether you can improve the efficiency of that code by hand.

Again, that's really a question of knowing your tools. In Visual Studio, you can do this using the /FA compiler switch (which selects the kind of listing) and the /Fa switch (which names the listing file).

Next : Math Optimizations