Performance Programming Applied to C++
by Joris Timmermans (aka MadKeithV)

Introduction

We all want speed in this business - whether we like to admit it or not. Frames Per Second ratings are thrown around like there is no tomorrow, and everyone in the PR department is always ranting about how their new engine is "faster" than everyone else's and can do "more" as well.

I'm not here to talk about how to make your code "faster" than anyone else's. I just want to teach you to make it faster, and more efficient, than the code you usually write.

I'd like to cover three things that are intricately tied together:

  1. Code Execution Time
  2. Code / Program Size
  3. Programming effort

I'm a strong believer in keeping a balance between those three, and only in certain cases should you sacrifice parts of the latter two to improve the first.

In this article, I'm going to point out some things that may help you improve the efficiency and performance of your code. I will start with the easiest ways of optimizing and work my way to more complex and involved techniques. First, I'll begin with what may seem a not-so obvious point : the compiler.

For the more advanced programmers among you, please keep in mind that I'm trying to write as simple as possible, without cluttering my text with too many details.

Part 1 : Know your tools

It may seem trivial, but how much do you REALLY know about your compiler? Do you know what processors it can generate code for? Do you know what kinds of optimizations it can perform? Do you know about its language incompatibilities?

It helps to know these things when you are implementing anything, and specially when you are trying to make something that goes fast.

For example - in a recent question on the GameDev boards, someone asked about Microsoft Visual C++'s "Release Mode". This is a standard compiler option, if you use this specific compiler, you should know what it does. If you don't, you've spent a lot of money on something you don't fully know how to use. In short, it removes all the debugging code, and performs any code optimizations the compiler might have, producing a much smaller executable, that runs faster. There is slightly more to it than that, but if you're interested, read the documentation that comes with the compiler.

See - if you didn't know this already, I've just handed you a way to make your code a lot faster, without you having to program a single thing!

The target platform is also important. These days, the lowest you would aim for is probably still an Intel Pentium processor, but if you're using a 10-year old compiler, it's not going to give you Pentium-optimized code. Getting a more recent compiler may really improve the speed, once again without making a single code change.

Some other things to keep in mind: does your compiler have a profiling tool? If you don't know, you can't expect to make fast code. If you don't know what a profiling tool IS, you need to learn. A profiling tool is something that tracks where the execution time goes in your program. You run your code using the profiler, perform some operations with it, and after you exit your application, you get back a report of the times spent in each function. You can use this to find your speed "bottlenecks", the parts in your code where you spend the most time. Optimizing these parts is much more effective in raising the overall speed of your application than randomly applying optimizations everywhere.

Don't say "But I know where my bottlenecks are!" They can be very unexpected, specially when working with third-party APIs and libraries. I ran into a problem like that only a few weeks ago, when it turned out a silly state change that occurred (unnecessarily) every frame, was taking up 25% of total execution time. A single line of code adding a test to see if that state was already set, dropped that function out of the top 50 list of most expensive functions in my profiling.

Using the profiler is deceptively simple, in most cases. Interpreting the results isn't always that simple. You have to try to identify the critical path in your application, that is the path in which most of the application time is spent. Optimizing that path will result in a noticeable performance improvement for your users.

An example would be, where loading a particular file shows up as the largest single time expenditure on a function, but you know it only happens once, at load-time of your application. Optimizing this function might save you a few seconds in the total run-time of the program, but it won't result in a performance increase during "normal" use. In fact, it's safe to say that your profiling run wasn't long enough, because during normal operation, the time taken in that function with respect to the total run-time for the application would get less and less, while your critical-path functions would float up to the top of the list gradually.

I hope that gives you a start in using these tools.

The profiling tool is definitely A Good Thing. Use it.

If you don't have a profiler, there is at least one that you can try out for free, and that's Intel's VTune profiler. It's on a one-month trial basis.

You can get it here (http://developer.intel.com/software/products/vtune/index.htm). I haven't used it yet, but as soon as I have time to try it, I'll try to write up some pointers on how to use it.

In the next sections, I'm moving on to letting your C/C++compiler do what you want it to do.

Part 2: Inlining, and the inline keyword

What is inlining? I'll answer that through the description of the inline keyword.

It tells the compiler to "expand the function inline", which is a lot like the way that macros (#define's) work in C and C++, but with a small difference. Inline functions are type-safe, and subject to further compiler optimization. This is a REALLY good thing, because you'll have the speed of a macro (that means you won't suffer from function-call overhead ), with the type-safety of a function, with a bunch of other benefits.

What are those benefits? Well, most compilers can only optimize code in a single module at a time. That's usually a single .h/.cpp combination. Using inline functions means that those functions will actually end up within the same module as the calling module, enabling certain optimizations, such as eliminating return-value copying, eliminating superfluous temporary variables, and a host of other possibilities. If you want to learn more, have a look at my references, specifically the performance C++ book, at the end of this article.

The dreaded inline keyword. I have to mention this, because there seem to be a lot of misconceptions about it. The inline keyword does not force the compiler to inline that particular function, but rather gently requests it. Quoting from MSDN:

"The inline keyword tells the compiler that inline expansion is preferred. However, the compiler can create a separate instance of the function (instantiate) and create standard calling linkages instead of inserting the code inline."

Things that usually make the compiler ignore your request are: using loops in the inline function, calling other inline functions from within the function, and recursion.

The above quote also hints at something else: the linkage for a function declared inline is internal. That means that your linker will choke on an inline function implemented in another object file, making them far less useful than I'd like. In human language, using ye olde declaration in .h file and implementation in .cpp file doesn’t work with inline functions, generally. The ANSI standard has a way to do it, but unfortunately, Visual C++ does not implement it.

So, you ask, what is the solution? Simple, implement the inline functions within the same module. This is easy enough to do - just write the entire function in the .h file, and include it wherever you need to use the function. Not as clean as you might want, but it works.

I don't like it much, in the interest of implementation hiding (I'm an Object Orientation freak), but I do use it in a lot of my classes lately. The good part is, I don't have to write in the inline keyword - if you write a function entirely within a class declaration, the compiler will automatically attempt to inline it. This way, I've ended up with entire classes contained in only a header file, since all of the functions needed to be inline. I'd suggest you ONLY do this when you really need the speed, and when you're not about to share the code with too many people - because the visibility of the implementation may lead to very annoying assumptions from the rest of your team. Trust me, it happens.

Part 3: To be in the Fast Class

Fast classes, the holy grail of C++ programming. Any of you that followed my "C++ operator overloading" thread know why I'm writing this.

I'll use the same example - a 3d vector class - because it's so applicable to our particular industry. Also note that the next rant is almost an exact description of my adventures over the past few weeks writing a vector class. I was actually making these mistakes up to a month ago!

Say you need this vector class, because you're doing a lot of vector math, and writing it out each time is simply too much work. You want to improve your coding efficiency, without sacrificing too much of the speed, so you decide to make a vector class, CVector3f (3f for 3 floats). You want to implement a few operators (+, -, *), making use of that great C++ feature of operator overloading, because you know it will improve the readability and maintainability of your code.

In your first draft, you've quickly implemented a constructor, copy constructor, destructor and three operators. You haven't paid special attention to performance, and haven’t used inlining, but simply put your declaration in the header file, and the implementation in the .cpp file.

So what can you do to make it faster? Well, one thing I've already suggested, that's inlining the class functions in your header file. It will remove the overhead of a function call for those member functions that the compiler manages to inline. It probably won't make that much difference in your execution speed for large functions, though it will be noticeable in this vector class because the functions are so small.

Another thing to consider: do we really need the destructor? The compiler can generate an empty destructor for you, and it's probably at least as efficient than the one you've written. In our vector class, we have nothing that explicitly needs destruction, so why waste programming time on it?

The operators can probably be sped up as well. Chances are, you've written your operators like this:

CVector3f operator+( CVector3f v ) { CVector3f returnVector; returnVector.m_x = m_x + v.m_x; returnVector.m_y = m_y + v.m_y; returnVector.m_z = m_z + v.m_z; return returnVector; }

There's so much hidden redundant code in this function, that it almost makes me queasy. Think about it, the first line declares and constructs a temporary variable. That means the default constructor for this object gets called, but we don't NEED it to be initialized, because we're going to assign all new values anyway.

The return at the end is similar - returnVector is a local variable, so it cannot be returned straight off. Instead, the copy constructor is called on it, something that usually takes quite a bit of processor time, specially in such small functions as this one. A more insidious one is the parameter passed. This one is also a copy of the original argument, so another memory allocation and copy constructor call.

What if we wrote another constructor, one with three arguments for x, y and z, and used it as follows:

CVector3f operator+( const CVector3f &v ) const { return CVector3f( m_x + v.m_x, m_y + v.m_y, m_z + v.m_z ) }

That is minus two copy constructor calls, it makes a difference. Notice that I've also added the const keywords. Not a speed improvement, but certainly a safety improvement. A more compiler-internal point to make here, is that the function I've written allows the compiler to more easily make it's own optimizations as well. There are very few assumptions in this function, it's all very explicit, making it a very likely candidate for inlining or other, more complex optimizations such as the "return value optimization" (see the references at the end of this article for more information).

The point I am trying to make here, is that there can be a lot of "hidden" overhead in C++ code. Constructors/Destructors, and the way inheritance and aggregation work, can make a simple-looking function perform a lot of complex initialization behind the scenes. Knowing when this occurs, and how to avoid or reduce its effects, is a very important part of learning how to write C++ with no surprises.

Know your language, it can only help you.

Part 4 : Digging Deeper

So now you have a pretty fast C++ class, but you're still not happy. Time to go even deeper.

1. Loop optimizations

Loop unrolling used to be a "big thing". What is it? Well, some loops can simply be written outright. Consider:

for( int i = 0; i < 3; i++ ) array[i] = i;

this is logically the same as

array[0] = 0; array[1] = 1, array[2] = 2;

The second version is slightly faster, because no loop has to be set up - the initialization and incrementing of i takes some time. Most compilers can already do this though, so in most cases, you probably won't get much gain, and a huge code bloat. My best advice here is, if you can't find anything else to speed up, try it, but don't be surprised if it doesn't make a difference.

2. Bit shifting

Bit shifting works for integers only. It is basically a way to multiply or divide by a power of two in a way that's much faster than a straight multiplication (and CERTAINLY faster than a division).

To understand how to use it, think of it using these formulae:

x << y ó x * 2y x >> y ó x / 2y

I think André LaMothe made a big deal of this in his "Tricks of the Game Programming Gurus" books, that’s probably where you heard about it. It's where I heard about it anyway. In some cases, it can be very rewarding indeed. Consider the following (simplistic) code:

i *= 256;

versus

i = i << 8;

Logically, they are the same. For this simple example, the compiler might even turn the first into the second, but as you get more complex ( i = i << 8 + i << 4 is equivalent to i *= 272 for example )the compiler might not be able to make the conversion.

3. Pointer dereference hell

Do you have code like this in your program?

for( int i = 0; i < numPixels; i++ ) { rendering_context->back_buffer->surface->bits[i] = some_value; }

The exaggeration probably makes the problem stand out. This is a long loop, and all that pointer-indirection is going to eat more time than Homer eats donuts.

You might think this is a contrived example, but I've seen a lot of code that looks like this in code released on the 'net.

Why not do this?

unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits; for( int i = 0; i < numPixels; i++ ) { back_surface_bits[i] = some_value; }

You're avoiding a lot of dereferencing here, which can only improve speed, and that's a good thing!


Goltrpoat on Gamedev.net pointed out to me that it could be faster still, here's his, very valid, suggestion:

unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits; for( int i = 0; i < numPixels; i++,back_surface_bits++ ) { *back_surface_bits = some_value; }

The previous item was just a special (albeit frequent) case of the following:

4. Unnecessary calculations within loops.

Consider this loop:

for( int i = 0; i < numPixels; i++ ) { float brighten_value = view_direction*light_brightness*( 1 / view_distance ); back_surface_bits[i] *= brighten_value; }

The calculation of brighten_value is not only expensive, it's unnecessary. The calculation is not influenced by anything that happens within the loop, so you can simply move it outside the loop, and keep re-using that value inside the loop.

This problem can occur in other ways too - unnecessary initialization in functions that you call within the loop, or in object constructors. Be careful when you code, always think "do I really need to do this?".

5. Inline assembler

The last resort, if you really, REALLY know what you are doing, and why it will be faster, you can use inline assembler, or even pure assembler with C-style linkage so it can be called from your C/C++ program. However, if you use inline assembler, either you'll have to work with conditional compilation (testing to see if the processor you're writing assembler for is supported on the platform you are compiling for), or give up on source-compatibility with other platforms. For straight 80x86 assembler, you probably won't mind much, but when you get to MMX, SSE, or 3DNow! instructions, you are limiting your possibilities.

When you get this low, a disassembler may be useful too. You can instruct most compilers to generate intermediate assembler code, so you can browse through it to see if you can improve that functionality's efficiency by hand.

Again, that's really a question of knowing your tools. In Visual Studio, you can do this using the /FA and /Fa compiler switches.

Part 5 : Math Optimizations

1. Rewriting simple maths expressions for performance.

I'll just list a few of the things you can do, that may seem very obvious, but you shouldn't pass over!

  • a*b + a*c = a*(b+c); This gets rid of one multiplication, with no change to the meaning of the expression.
  • b/a + c/a = (1/a)*(b+c); A variation on the previous item, but this time, replacing two divisions by a division and a multiplication. On every platform that I am aware of, divisions are slower than multiplications, so this rewrite will provide a speed improvement.
  • (a || b ) && c ) = c && ( a || b ); Perhaps not so obvious, but the C++ standard requires lazy evaluation. In this case, whenever c happens to be false, in the first case ( a || b ) has to be evaluated, whereas in the second case, it can be skipped as the entire expression can never evaluate to true anymore.

I am sure there are more, you can always suggest these to me, so I can add them to this article.

2. Taking advantage of the architecture.

Say you're working on that nice PentiumIII of yours. You have 32bit registers, but you are performing operations on 16bit numbers. In some cases, depending on the operation and the chance of overflow, you could perform two operations in one register, at once.

This is what SIMD (Single Instruction, Multiple Data - exactly what we were talking about), like SSE, 3DNow! and MMX were designed for - allowing this kind of thing.

I'm by no means an expert, so I can't give you specific examples, but I thought I'd give you a general idea on how you might be able to use it - sometimes even without MMX or SSE.

Part 6 and conclusion: Use that brain of yours

Sometimes, all the code optimization in the world won't make your program run faster. Sometimes, the optimizations lie in a different direction. Take a step back, rethink your algorithm, try to think up ways of doing it differently. Who knows what you might come up with!

As an example, think of sorting algorithms. With all the programming skill in the world, bubble sort will still be slower than quicksort, even though they perform the same function. Sorting is an area where there has been lots of research into making it fast. For your specific program, or algorithm, this is probably not the case. Perhaps you can find that little shift of thought that allows you to beat your previous idea by an order of magnitude.

Another thing is knowing how the library functions (and classes) you use work, exactly. What they do, what they don't do, things they do that you don't need. CString might be a nice class, but if you use it just to pass a filename, you'll be adding overhead you don't need, because you won't be performing any of the special operations on it, nor will you need the special reference counting it uses. New/delete are two more examples. They are generic memory-allocation constructs, and they probably do more than you need. If it's critical, write your own operator new and delete, and you might get a very healthy speed increase for your particular class.

In the end, there are a lot of ways to make code run faster. I've listed some, and I hope they've helped you a little. My references section has some more food for thought, or at least interesting reading. I alluded to it in the introduction as well - the best thing is to know what you are doing, and what you are working with. Read up on the programming language you are using, many other people are likely to have tried writing faster code using that language. Read the documentation on the web that you can find on programming and optimization. Check out manufacturers' webpages, they have some good things there too.

Browse through my references, that's a start, at least ;). References:


The end! Really ;-).

Send any comments or suggestions to Mad Keith (MadKeithV@yahoo.com) .

Discuss this article in the forums


Date this article was posted to GameDev.net: 7/29/2000
(Note that this date does not necessarily correspond to the date the article was written)

See Also:
C and C++
Optimization

© 1999-2011 Gamedev.net. All rights reserved. Terms of Use Privacy Policy
Comments? Questions? Feedback? Click here!