Introduction to Debugging
Diagnosis and Prescription
We've covered the common tools used to collect information about the state and behaviour of our programs. How do we use that information to identify a bug, and then to work out a fix for it?
This is the point at which the computer can't generally help us. It can tell us that the program might have crashed from an access violation, and we can use the watch window to see that we're dereferencing a null pointer, but what's the bug there? Are we failing to check that the pointer is not null immediately before that line, or should it have been caught higher-up in code? What does a null pointer mean in this context? Identifying the bug itself requires that you translate the information you've gathered up to the "conceptual" level of your program, where you're dealing with objects and behaviours instead of source code; as such, it's the point at which you really need to have an understanding of that conceptual level, or you won't be able to address the root problem and may mistakenly address a symptom instead.
Here's an example. I've discovered that killing another player in a death-match game correctly produces a ragdoll (i.e. the player changes from a regular animated model to a limp, physics-controlled corpse) on my own machine, but on all other machines that player just disappears. That's incorrect behaviour; the ragdoll should be appearing in the same way on all machines. So, I set up a breakpoint on the "player death" code on my own machine, and watch to see exactly what happens. I observe that firstly, my machine calls SendNetworkMessage(MSG_PLAYERKILLED), which sends a "PlayerKilled" message to the other machines. Then, it calls RagdollSystem::Create() to create a ragdoll object, and gives it the same position/orientation as the player that was just killed, as well as setting up the initial forces (the force of the fatal shot). Huh, so what's telling the other machines to create ragdolls? Oops... as a matter of fact, nothing. Checking the message-receiving code for the PlayerKilled message reveals that there's no call to RagdollSystem::Create() there for setting up a ragdoll. So, my bug is that non-local machines aren't being instructed to create ragdolls at all. If I don't tell the program to do it, it isn't going to happen.
Now, working through that bug required a fair amount of knowledge of the system on my part. I needed an understanding of how the player-to-ragdoll swap-over process was supposed to work; I also needed an understanding of the network-messaging system between machines, to know what functions like SendNetworkMessage do. (This bug is a watered-down version of a bug I did meet in the wild; the real bug was sending a message and creating the ragdolls on the other machines, but it still wasn't working. I think it turned out that the ragdoll we were creating was associated with the player that we'd just killed, and as such it was also being marked as 'dead' and thus getting garbage collected along with it or something like that).
Diagnosis tends to be the stage at which any work you've invested in clarity of design and coding style will pay off in spades. What does it mean if "foo" is null? Probably not nearly as much as if "currentPlayerWeapon" is null. Using names and structures that match the conceptual model of your program will help you to translate information between the two.
Once you've identified your bug on a conceptual level, developing the fix is sort of working backwards: you work out the fix on a conceptual level, and then translate that down to code. There's often a number of different ways to fix a bug, and each one will have pros and cons, each one will have tradeoffs. Returning to my ragdoll example, I know that I need to instruct the other machines to create ragdolls. What's the best way of achieving that? I could add code to the "PlayerKilled" message handler to set up the ragdoll in there; seems like it should work, right? So I go ahead and implement that. Now, however, it means that killing a player will always generate a ragdoll... so I guess I'd better hope that the designers don't ask me to implement a disintegrator gun. A different fix might have been to split "killing the player" and "creating the ragdoll" into two separate network messages. The "create ragdoll" message could include all necessary information to create a ragdoll, including position, orientation, and initial forces; that way I'd actually make it a lot easier to support things like ragdolls as part of scripted sequences or in-game cut-scenes. What's the catch? Well, now I'm sending two network messages instead of just one. So here, my trade-off is between flexibility and network performance; I have to make the call based on how much I value one over the other. If I'm pretty sure that our designers are never going to ask for player deaths without ragdolls, and our network usage is already pretty high, then the first fix is probably better. If, on the other hand, we're fairly early on in the project and the designers could ask for anything, and the network usage is currently very low, then I'd probably opt for the second fix.
Response is a stage often treated as part of Prescription, which is fairly natural; Prescription is figuring out what the fix is, and Response is actually implementing that. For most projects, response is a simple matter of changing the code and rebuilding the executable. However, it's not so simple if you need to apply the fix while the system is running. What if you're looking at fixing a server issue in a massively multiplayer game? Can you really afford to stop the server, kicking hundreds of players offline, while you rebuild, test the fix, and restart the server? Sometimes you'll have no other options, but issue response is definitely something you should bear in mind when designing software for which it may be a non-trivial problem. Perhaps you should split your code up into DLLs and support reloading DLLs at runtime, so that you can rebuild the DLL elsewhere and then ask the server to reload it without going offline. Or perhaps you should support the transfer of everything the program is currently handling to another program (perhaps another instance of the same program) while you take the first one offline to apply fixes. The most sensible approach will depend on your project.
This is the point at which the process comes full circle. You've applied the fix; now you need to check that it works and that the issue can be safely closed. It's sometimes one of the hardest things to accomplish, because you're trying to cause the program to do something that it has been explicitly told not to do. Usually, this step should be performed by the person who identified the issue in the first place, usually using the same tools and techniques that were used at the issue recognition stage; it's also another point at which work in the issue recognition stage will pay off, as you have a greater understanding of where the bug was and what sorts of things would cause it to manifest itself.