Upcoming Events
Unite 2010
11/10 - 11/12 @ Montréal, Canada

GDC China
12/5 - 12/7 @ Shanghai, China

Asia Game Show 2010
12/24 - 12/27  

GDC 2011
2/28 - 3/4 @ San Francisco, CA

More events...
Quick Stats
93 people currently visiting GDNet.
2406 articles in the reference section.

Help us fight cancer!
Join SETI Team GDNet!
Link to us Events 4 Gamers
Intel sponsors gamedev.net search:

Enginuity, Part 3
Profiling, Settings, and the Kernel;
or, Stop Watching That Military Control Panel

It's time for another helping of game engine goodness. This time we're going to add a simple profiler to the engine, look at working with game settings, and finally move on to the heart of the engine - the Kernel.

This article builds on the code built up in the previous one, so if you've not read it, go do that already.

Runtime Profiler

Have you ever had one of those days when everything seems to be going really slowly? Everyone seems quiet and muffled, as if there's some invisible, thick fog throughout the air. And that's *after* you've had your morning coffee.

Your code will have times like that, too. But while you can just go to sleep and assume that things will be better in the morning, your code isn't going to change itself. So when your framerate's dropped lower than Sanford Wallace, what do you do? It's time for that wonderful process of Optimization.

"Optimizing?" you cry, "Optimizing?! Optimizing is for professional-edition compilers, masochists, and little girly-men! I do not need to optimize!" Er, OK. I'll let you get back to chopping down trees somewhere. To put it to you straight: ALL professional development houses profile and optimize their games. I don't just mean at the end of development, either - it's an ongoing process. You don't want to build something which relies on a particular aspect of an algorithm somewhere else, if you then go in and replace that algorithm because it's too slow. (Well, you don't want to build something which relies on a particular aspect of an algorithm somewhere else *full stop* - to never make assumptions about implementation details is a fairly important object-orientation concept). Because optimization usually involves going back through your code and looking for slow points, it'll help you spot bugs as well.

But before you start ripping apart your algorithms and converting things to assembler, how do you know which parts you should be working on? It's quite possible that the math code you thought was horribly slow isn't actually being called very much, so it's not going to be responsible for your speed problems. What you need is a way of timing bits of your code, to see where you're losing momentum. Enter the Runtime Profiler.

A Profiler, in case any of you haven't figured it out yet, is a tool for timing bits of your code. Professional Profiling tools can analyze your entire game loop, processing the debug information generated by your compiler to give you the very line numbers of the sluggish bits of code responsible for your problem; and, understandably, they cost large amounts of money. So we're not going to do that. We're going for something much simpler - more of a 'manual' approach. I don't mean that you're going to need to go buy a stopwatch; I mean that when you want to time something, you're going to have to make a couple of calls, and the engine will handle it. You'll be able to see what percentage of the game loop's time gets spent on your chosen blocks of code, along with minimum and maximum peaks (so you can watch for when you walk into a crowded room, for example).

Here's a rough overview of how the profiler will look to the user (i.e. you):

  • A CProfileSample object is created with a given name at the beginning of the block of code. It starts the timer.
  • At the end of the block of code, the CProfileSample object is destroyed (either explicitly, or by letting it fall out of scope). It stops the timer.
  • When starting up the engine, you set a pointer to an object derived from CProfilerOutputHandler, which is responsible for logging/drawing the statistics the way you want.
  • Static functions of CProfileSample allow you to reset the statistics for individual samples or for all samples in the system.

Not exactly hard. You can just enclose a block of questionable code in curly braces, put a CProfileSample sample("MyProfileSample"); line at the beginning of it, and the profiler will do the rest. Neat, hmm? Of course, such ease-of-use comes with a price, and that price is the difficulty of the implementation. But given that you can just download the source code I already made for you, what do you care? ;)

Two of the important things about a profiler are that it has to be passive (because if it interferes with the code we're timing, the test is pointless), and that it has to minimize overhead. There's no point using the profiler if it takes longer to time the code than it does to execute it! So, we avoid the more costly operations - things like dynamic memory allocation (hence creating the CProfileSample on the stack rather than from the heap).

Another useful feature we want to add is a parent/child relationship setup. Say I profile my ErrorLogging routine, and it's horribly slow - ideally, I'd like to then profile *within* that sample to see which parts of the code are slowing it down. It's more useful when you're higher up - for example, I profiled my Server routine (yes, I've written it already :) ), and found that it was using up to 6% of the loop time (compared with the other tasks which were using around 0.2%). So, I added extra profiling samples within the server routine, and discovered that the section I had expected to be slowest - the socketGroup updater - was actually fine, and it was the gamestate broadcast message routine that was causing the problem.

Also - a relatively simple addition - it might well be useful to see how many times per frame a block is executed. If you see something using 20% of your loop time, but it's being called 1000 times, you perhaps don't need to worry so much about optimising the routine itself - it would possibly be more efficient to try and make fewer calls to it.

So how do we work? Upon creation of a CProfileSample object, it needs to do a few things:

  • Record the current 'topmost' sample - that's the parent sample, the sample 'enclosing' our new one
  • Retrieve the sample's statistics and data from storage by searching using the profile name, or create a new set of data for the sample
  • Increment the number of times this sample has been executed this frame
  • Flag this sample as being active
  • Increment the global number of open samples
  • Lastly, as close as possible to finishing, record the current time as the sample's start time.

There's also the issue of the 'topmost sample.' Given that we're eventually going to display the statistics in the form of percentages, we need to know what '100%' *is*, time-wise, so that we can work out what fraction of it 0.06 seconds would be. The easiest way to do this is to rule that there will be a 'main game loop' sample, which will have no parent sample (it will be topmost). When creating a CProfileSample as a topmost sample, it should record the start and end times in special global variables; that way, the times will be available to all samples to calculate their percentages. More on this later, when we actually look at the statistics function.

(It's worth noting, in case some of you find this all sounds a little familiar, that I'm basing my design on Steve Rabin's profiler in Game Programming Gems 1, though I've added my own little improvements to it. The GPG books are an excellent series, and should be on the shelf of any serious game developer).

We can wrap almost all of our profiler up in the CProfileSample class. Take a look:

class CProfileSample
{
public:
  CProfileSample(std::string sampleName);
  ~CProfileSample();

  static void Output();

  static void ResetSample(std::string sampleName);
  static void ResetAll();

  static IProfilerOutputHandler *outputHandler;

protected:
  //index into the array of samples
  int iSampleIndex;
  int iParentIndex;

  inline float GetTime(){ return ((float)SDL_GetTicks())/1000.0f; }

  static struct profileSample
  {
    profileSample()
    {
      bIsValid=false; 
      dataCount=0;
      averagePc=minPc=maxPc=-1;
    }

    bool bIsValid;    //whether or not this sample is valid to be used
    bool bIsOpen;     //is this sample currently being profiled?
    unsigned int callCount; //number of times this sample has been executed, this frame
    std::string name; //name of the sample
    
    float startTime;  //starting time on the clock, in seconds
    float totalTime;  //total time recorded across all executions of this sample, this frame
    float childTime;  //total time taken by children of this sample, this frame

    int parentCount;  //number of parents this sample has (useful for neat indenting)

    float averagePc;  //average percentage of game loop time taken up
    float minPc;      //minimum percentage of game loop time taken up
    float maxPc;      //maximum percentage of game loop time taken up
    unsigned long dataCount; //number of times values have been stored since
                             //sample creation/reset
  } samples[MAX_PROFILER_SAMPLES];
  static int lastOpenedSample;
  static int openSampleCount;
  static float rootBegin, rootEnd;
};

There's quite a lot there... in the public section of the class, you've got the standard constructor and deconstructor - slightly more significant than usual in this class as we'll see in a minute - and the constructor takes the name of the sample you want to use (that is, the 'unique name' you use to recognise the block of code you're profiling). The rest are all static members; an Output() function to send all the statistics to the output handler, ResetSample/ResetAll functions for resetting sample values, and a static pointer to the aforementioned output handler, an object derived from IProfilerOutputHandler (which is an abstract class).

Then we move down into the protected section of the class, and things get a little more interesting.

The first thing I want to draw your attention to is the profileSample structure. Because each CProfileSample object is created and destroyed every time you execute the code you're profiling, we can't keep it's vital statistics as member data; this is the 'storage' I referred to earlier. There's a basic constructor to flag the structure as invalid (so that it will be initialised when it gets used). The comments should be fairly self-explanatory; all of those members will, of course, be covered in more detail when we come to use them.

Besides the profileSample structure there's not much in the protected section. You've got iSampleIndex and iParentIndex - the sample numbers for this sample and the parent sample respectively - and an inlined GetTime() function, which returns the time since a fixed point (such as app startup) in seconds.

There's also four static variables at the end there. lastOpenedSample records the index of the sample that was last started - the 'top of the stack,' so to speak, and it is used to set the iParentIndex upon creation of a CProfileSample. openSampleCount simply tracks how many samples are currently open - it gets used to set parentCount. Finally, rootBegin and rootEnd are the 'special global variables' I said the topmost sample would have to store its start and end times in. Got all that? Good. Let's look at the functions themselves.

CProfileSample::CProfileSample(std::string sampleName)
{
  //The first thing we need to do is restore our previous pieces of sample
  //data from storage. That is, look in the samples[] array to see if there's
  //a valid sample with our name on it
  int i=0;
  //If we don't find it, we're going to need to create a new sample, and rather
  //than looping through the list a second time we store the first non-valid
  //index that we find this time round
  int storeIndex=-1;
  for(i=0;i<MAX_PROFILER_SAMPLES;++i)
  {
    if(!samples[i].bIsValid)
    {
      if(storeIndex<0)storeIndex=i;
    }else{
      if(samples[i].name==sampleName)
      {
        //this is the sample we want
        //check that it's not already open
        //assert only works in debug builds, but given that you don't use
        //the profiler in release builds, it doesn't really matter
        assert(!samples[i].bIsOpen &&
          "Tried to profile a sample which was already being profiled");
        //first, store its index
        iSampleIndex=i;
        //the parent sample is the last opened sample
        iParentIndex=lastOpenedSample;
        lastOpenedSample=i;
        //and the number of parents is the number of open samples (excluding ourselves)
        samples[i].parentCount=openSampleCount;
        ++openSampleCount;
        samples[i].bIsOpen=true;
        //increment the number of times we've been called
        ++samples[i].callCount;
        ///finally (more or less) store the current time to start the sample
        samples[i].startTime=GetTime();
        //if this has no parent, it must be the 'main loop' sample, so copy
        //to the global timer as well
        if(iParentIndex<0)rootBegin=samples[i].startTime;
        //done
        return;
      }
    }
  }
  //we've not found it, so it must be a new sample
  //use the storeIndex value to store the new sample
  assert(storeIndex>=0 && "Profiler has run out of sample slots!");
  samples[storeIndex].bIsValid=true;
  samples[storeIndex].name=sampleName;
  iSampleIndex=storeIndex;
  iParentIndex=lastOpenedSample;
  lastOpenedSample=storeIndex;
  samples[i].parentCount=openSampleCount;
  openSampleCount++;
  samples[storeIndex].bIsOpen=true;
  samples[storeIndex].callCount=1;

  //init the statistics for this sample
  samples[storeIndex].totalTime=0.0f;
  samples[storeIndex].childTime=0.0f;
  samples[storeIndex].startTime=GetTime();
  if(iParentIndex<0)rootBegin=samples[storeIndex].startTime;
}

Pretty simple. It seeks out the existing sample, and if it can't find it, it set up a new one; either way, the sample is set up for this run of the profiler. That should be fairly simple (I hope) - we do a little more processing in the destructor, and the bulk of the statistics calculations in the Output() function. Here's the destructor:

CProfileSample::~CProfileSample()
{
  float fEndTime=GetTime();
  //phew... ok, we're done timing
  samples[iSampleIndex].bIsOpen=false;
  //calculate the time taken this profile, for ease of use later on
  float fTimeTaken = fEndTime - samples[iSampleIndex].startTime;

  if(iParentIndex>=0)
  {
    samples[iParentIndex].childTime+=fTimeTaken;
  }else{
    //no parent, so this is the end of the main loop sample
    rootEnd=fEndTime;
  }
  samples[iSampleIndex].totalTime+=fTimeTaken;
  lastOpenedSample=iParentIndex;
  --openSampleCount;
}

As you can see, the first thing we do is to stop the clock. Once we've done that, we can relax a little when it comes to speed overhead... we cleanup the sample, updating the statistics for this sample on this frame, and the parent sample (if there is one). We then 'pop the stack' - reset the lastOpenedSample and openSampleCount values to what they were before we started profiling (assuming nobody has done anything stupid like try to construct and destruct CProfileSamples out-of-order). Also fairly simple. So let's move on to the Output() function, which calculates all the statistics for a given sample across the entire app, and sends them to the OutputHandler:

void CProfileSample::Output()
{
  assert(outputHandler && "Profiler has no output handler set");
  
  outputHandler->BeginOutput();

  for(int i=0;i<MAX_PROFILER_SAMPLES; ++i)
  {
    if(samples[i].bIsValid)
    {
      float sampleTime, percentage;
      //calculate the time spend on the sample itself (excluding children)
      sampleTime = samples[i].totalTime-samples[i].childTime;
      percentage = ( sampleTime / ( rootEnd - rootBegin ) ) * 100.0f;

      //add it to the sample's values
      float totalPc;
      totalPc=samples[i].averagePc*samples[i].dataCount;
      totalPc+=percentage; samples[i].dataCount++;
      samples[i].averagePc=totalPc/samples[i].dataCount;
      if((samples[i].minPc==-1)||(percentage<samples[i].minPc))samples[i].minPc=percentage;
      if((samples[i].maxPc==-1)||(percentage>samples[i].maxPc))samples[i].maxPc=percentage;

      //output these values
      outputHandler->Sample(samples[i].minPc,
                samples[i].averagePc,
                samples[i].maxPc,
                samples[i].callCount,
                samples[i].name,
                samples[i].parentCount);

      //reset the sample for next time
      samples[i].callCount=0;
      samples[i].totalTime=0;
      samples[i].childTime=0;
    }
  }

  outputHandler->EndOutput();
}

So we begin with a quick check that we actually *have* an output handler, followed by a call to outputHandler->BeginOutput() to allow our output handler to clear buffers, print headers, or whatever it wants.

Then we move on to the more interesting stuff. Firstly, the time spent on the sample in itself is calculated - that is, the total time spent on the sample, minus the time spent on children. This time is then turned into a percentage of the time taken for the whole loop to run, making it easier to work with (because it'll be roughly the same across different-speed machines; a chunk of code will be slower, but so will everything else). Next we calculate our statistics. We take the current average, and multiply it by the number of samples taken so far (dataCount) - that gives us the total time spent on this sample since the profiler was started/reset. We add our new value, increment the number of values in the data, and divide again to get the average back. Then, a pair of quick checks let us update the maximum and minimum percentages, if required.

Then we move on to sending the data to the OutputHandler. If you don't see how that's done, I probably can't help you. :)

Lastly, we reset the sample for the next loop. Given that the Output function should be called once per frame - ideally at the end of the game loop, outside of the top-level sample (as we'll see later this article) - it's the 'tick point' that we can use to reset things without worrying. Once all the samples are processed, the outputHandler's EndOutput() function is called, again to allow it to perform whatever clean-up processing is necessary.

The only other functions in CProfileSample are ResetSample and ResetAll - all they need to do is set the bIsValid flags for the appropriate sample(s), so I won't go over them here. Instead, let's move on to the output handler we're going to use for the time being - CProfileLogHandler.

Before we do that, it's necessary to look briefly at the IProfilerOutputHandler class. You've seen all the functions it provides already:

class IProfilerOutputHandler
{
public:
  virtual void BeginOutput()=0;
  virtual void Sample(float fMin, float fAvg, float fMax,
                      int callCount, std::string name, int parentCount)=0;
  virtual void EndOutput()=0;
};

It's just a simple abstract (interface) class, allowing you to derive any class from it to handle profiler output. The class I've derived from it for you is CProfilerLogHandler:

class CProfileLogHandler : public IProfilerOutputHandler  
{
public:
  void BeginOutput();
  void EndOutput();
  void Sample(float fMin, float fAvg, float fMax,
              int callCount, std::string name, int parentCount);
};

void CProfileLogHandler::BeginOutput()
{
  CLog::Get().Write(LOG_APP,IDS_PROFILE_HEADER1);
  CLog::Get().Write(LOG_APP,IDS_PROFILE_HEADER2);
}

void CProfileLogHandler::EndOutput()
{
  CLog::Get().Write(LOG_APP,"\n");
}

void CProfileLogHandler::Sample(float fMin, float fAvg, float fMax,
                   int callCount, std::string name, int parentCount)
{
  char namebuf[256], indentedName[256];
  char avg[16], min[16], max[16], num[16];

  sprintf(avg, "%3.1f", fAvg);
  sprintf(min, "%3.1f", fMin);
  sprintf(max, "%3.1f", fMax);
  sprintf(num, "%3d",   callCount);

  strcpy( indentedName, name.c_str());
  for( int indent=0; indent<parentCount; ++indent )
  {
    sprintf(namebuf, " %s", indentedName);
    strcpy( indentedName, namebuf);
  }

  CLog::Get().Write(LOG_APP,IDS_PROFILE_SAMPLE,min,avg,max,num,indentedName);
}

And that, ladies and gentlemen, prints a profile table to the APP logfile. IDS_PROFILE_HEADER1 and IDS_PROFILE_HEADER2 are the two header lines that make up the table column names and underlining; IDS_PROFILE_SAMPLE is a simple string to put the statistics into the right places. The only really noteworthy thing here is the fact that the name gets indented relative to the parentCount - making it much easier to see parent/child relationships in the logs. EndOutput() logs a newline, just to space things out in my logs (because we're calling this once per frame, so we get a lot of them in the logs). Here's a sample of the output:

  Min :   Avg :   Max :   # : Profile Name
--------------------------------------------
  0.0 :  96.6 : 100.0 :   1 : Kernel task loop
  0.0 :   0.6 :   6.3 :   1 :  Renderer
  0.0 :   0.1 :   6.3 :   1 :  Server
  0.0 :   0.7 : 100.0 :   1 :   Flushing socketGroup
  0.0 :   0.0 :   6.2 :   1 :   Processing messages
  0.0 :   1.4 :   6.3 :   1 :   Server state update
  0.0 :   0.5 :   6.7 :   1 :  Client

So you should see that the Avg (average) values roughly add up to 100% (give or take a little due to floating-point truncation). The Min and Max values tend to be of limited use - they're better if you examine profiler output over a specific section of the game by calling ResetAll at the beginning of the section, because otherwise startup in particular tends to produce some freak values (as you see there). So, I can immediately see that if I need to make my game faster, I ought to look into the Server State Update code first - as opposed to the message processor, which I thought would be the worst culprit (and it actually has an average time of <0.1%).



Settings

Contents
  Runtime Profiler
  Settings
  The Kernel

  Source code
  Printable version
  Discuss this article

The Series
  Part I
  Part II
  Part III
  Part IV
  Part V