Whose bug is this anyway?!?

December 18, 2012

At a certain point in every programmer’s career we each find a bug that seems impossible because the code is right, dammit! So it must be the operating system, the tools or the computer that’s causing the problem. Right?!?

Today’s story is about some of those bugs I’ve discovered in my career.

This bug is Microsoft’s fault… or not

Several months after the launch of Diablo in late 1996, the StarCraft team put on the hustle and started working extra long hours to get the game done. Since the game was “only two months from launch” it seemed to make sense to work more hours every day (and some weekends too). There was much to do because, even though the team started with the Warcraft II game engine, almost every system needed rework. All of the scheduling estimates were willfully wrong (my own included), so this extra effort kept on for over a year.

I wasn’t originally part of the StarCraft dev team, but after Diablo launched, when it became clear that StarCraft needed more “resources” (AKA people), I joined the effort. Because I came aboard late I didn’t have a defined role, so instead I just “used the force” to figure out what needed to happen to move the project forward (more details in a previous post on this blog).

I got to write fun features like implementing parts of the computer AI, which was largely developed by Bob Fitch. One was a system to determine the best place to create “strong-points” — places that AI players would gather units for defense and staging areas for attacks. I was fortunate because there were already well-designed APIs that I could query to learn which map areas were joined together by the path-finding algorithm and where concentrations of enemy units were located in order to select good strong-points, as it would otherwise be embarrassing to fortify positions that could be trivially bypassed by opponents.

I re-implemented some components like the “fog of war” system I had written for previous incarnations of the ‘Craft series. StarCraft deserved to have a better fog-of-war system than its predecessor, Warcraft II, with finer resolution in the fog-map, and we meant to include line-of-sight visibility calculations so that units on higher terrain would be invisible to those on lower terrain, greatly increasing the tactical complexity of the game: when you can’t see what the enemy is doing the game is far more complicated. Similarly, units around a corner would be out of sight and couldn’t be detected.

The new fog of war was the most enjoyable part of the project for me as I needed to do some quick learning to make the system functional and fast. Earlier efforts by another programmer were graphically displeasing and, moreover, ran so slowly as to be unworkable. I learned about texture filtering algorithms and Gouraud shading, and wrote the best x86 assembly language of my career, a skill now almost unnecessary for modern game development. Like many others I hope that StarCraft is eventually open-sourced, in my case so I can look with fondness on my coding efforts, though perhaps my memories are better than seeing the actual code!

But my greatest contribution to the StarCraft code was fixing defects. With so many folks working extreme hours writing brand new code the entire development process was haunted by bugs: two steps forward, one step back. While most of the team coded new features, I spent my days hunting down the problems identified by our Quality Assurance (QA) test team.

The trick for effective bug-fixing is to discover how to reliably reproduce a problem. Once you know how to replicate a bug it’s possible to discover why the bug occurs, and then it’s often straightforward to fix. Unfortunately reproducing a “will o’ the wisp” bug that only occasionally deigns to show up can take days or weeks of work. Even worse is that it is difficult or impossible to determine beforehand how long a bug will take to fix, so long hours investigating were the order of the day. My terse status updates to the team were along the lines of “yeah, still looking for it”. I’d sit down in the morning and basically spend all day cracking on, sometimes fixing hundreds of issues, but many times fixing none.

One day I came across some code that wasn’t working: it was supposed to choose a behavior for a game unit based on the unit’s class (“harvesting unit”, “flying unit”, “ground unit”, etc.) and state (“active”, “disabled”, “under attack”, “busy”, “idle”, etc.). I don’t remember the specifics after so many years, but something along the lines of this:

if (UnitIsHarvester(unit))
    return X;

if (UnitIsFlying(unit)) {
    if (UnitCannotAttack(unit))
        return Z;
    return Y;
}

... many more lines

if (! UnitIsHarvester(unit))    // "!" means "not"
    return Q;

return R;   // <<< BUG: this code is never reached!

After staring at the problem for too many hours I guessed it might be a compiler bug, so I looked at the assembly language code.

For the non-programmers out there, compilers are tools that take the code that programmers write and convert it into “machine code”, which are the individual instructions executed by the CPU.

// Add two numbers in C, C#, C++ or Java
A = B + C;

; Add two numbers in 80386 assembly
mov     eax, [B]    ; move B into a register
add     eax, [C]    ; add C to that register
mov     [A], eax    ; save results into A

After looking at the assembly code I concluded that the compiler was generating the wrong results, and sent a bug report off to Microsoft — the first compiler bug report I’d ever submitted. And I received a response in short order, which in retrospect is surprising: considering that Microsoft wrote the most popular compiler in the world it’s hard to imagine that my bug report got any attention at all, much less a quick reply!

You can probably guess — it wasn’t a bug, there was a trivial error I had been staring at all along but didn’t notice. In my exhaustion — weeks of 12+ hour days — I had failed to see that it was impossible for the code to work properly. It’s not possible for a unit to be neither “a harvester” nor “not a harvester”. The Microsoft tester who wrote back politely explained my mistake. I felt crushed and humiliated at the time, only slightly mitigated by the knowledge that the bug was now fixable.
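To make the mistake concrete, here is a minimal sketch of the same shape of logic. The `Unit` struct, its `isHarvester` field and the `Behavior` names are stand-ins I'm inventing for illustration, not the actual StarCraft code, and the flying-unit checks in the middle are left out.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the real StarCraft types.
enum class Behavior { X, Q, R };

struct Unit {
    bool isHarvester;
};

bool UnitIsHarvester(const Unit& unit) { return unit.isHarvester; }

Behavior ChooseBehavior(const Unit& unit) {
    if (UnitIsHarvester(unit))
        return Behavior::X;   // every harvester leaves here

    if (!UnitIsHarvester(unit))
        return Behavior::Q;   // every non-harvester leaves here

    return Behavior::R;       // never reached: no unit is left over
}
```

Every unit takes one of the first two returns, so no input can ever produce `Behavior::R`.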

Incidentally, this is one of the reasons that crunch time is a failed development methodology, as I’ve mentioned in past posts on this blog; developers get tired and start making stupid mistakes. It’s far more effective to work reasonable hours, go home, have a life, and come back fresh the next day.

When I started ArenaNet with two of my friends the “no crunch” philosophy was a cornerstone of our development effort, and one of the reasons we didn’t buy foosball tables and arcade machines for the office. Work, go home at a reasonable time, come back fresh!

This bug is actually Microsoft’s fault

Several years later, while working on Guild Wars, we discovered a catastrophic bug that caused game servers to crash on startup. Unfortunately, this bug didn’t occur in the “dev” (“development”) branch that the programming team used for everyday work, nor did it occur in the “stage” (“staging”) branch used by the game testers for final verification; it only occurred in the “live” branch which our players used to play the game. We had “pushed” a new build out to end-users, and now none of them could play the game! WTF!

Having thousands of angry players amps up the pressure to get that kind of problem fixed quickly. Fortunately we were able to “roll back” the code changes and restore the previous version of the code in short order, but now we needed to understand how we broke the build. Like many problems in programming, it turned out that several issues taken together conspired to cause the bug.

There was a compiler bug in Microsoft Visual Studio 6 (MSVC6), which we used to build the game. Yes! Not our fault! Well, except that our testing failed to uncover the problem. Whoops.

Under certain circumstances the compiler would generate incorrect results when processing templates. What are templates? They’re useful, but they’ll blow your mind; read this if you dare.

C++ is a complex programming language so it is no surprise that compilers that implement the language have their own bugs. In fact the C++ language is far more complicated than other mainstream languages, as shown in this article that visualizes the complexity of C++ compared to the Ruby language. Ruby is a complex and fully-featured language, but as the diagram shows C++ is twice as complex, so we would expect it to have twice as many bugs, all other things being equal.

When we researched the compiler bug it turned out to be one that we already knew about, and that had already been fixed by the Microsoft dev team in MSVC6 Service Pack 5 (SP5). In fact all of the programmers had already upgraded to SP5. Sadly, though we had each updated our work computers we neglected to upgrade the build server, which is the computer that gathers the code, artwork, game maps and other assets and turns them into a playable game. So while the game would run perfectly on each programmer’s computer, it would fail horribly when built by the build server. But only in the live branch!

Why only in live? Hmmm… Well, ideally all branches (dev, stage, live) would be identical to eliminate the opportunity for bugs just like this one, but in fact there were a number of differences. For a start we disabled many debugging capabilities for the live branch that were used by the programming and test teams. These capabilities could be used to create gold and items, or spawn monsters, or even crash the game.

We wanted to make sure that the ArenaNet and NCsoft staff didn’t have access to cheat functions because we wanted to create a level playing field for all players. Many MMO companies have had to fire folks who abused their godlike “GM” powers so we thought to eliminate that problem by removing capability.

A further change was to eliminate some of the “sanity checking” code that’s used to validate that the game is functioning properly. This type of code, known as asserts or assertions by programmers, is used to ensure that the game state is proper and correct before and after a computation. These assertions come with a cost, however: each additional check that has to be performed takes time; with enough assertions embedded in the code the game can run quite slowly. We had decided to disable assertions in the live code to reduce the CPU utilization of the game servers, but this had the unintended consequence of causing the C++ compiler to generate the incorrect results which led to the game crash. A program that doesn’t run uses a lot less CPU, but that wasn’t actually the desired result.
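For programmers who haven't seen the pattern, here is a rough sketch of how assertions get compiled out of a build. `LIVE_BUILD`, `GAME_ASSERT`, `ReportAssertFailure` and `ApplyDamage` are names I'm making up for this example; this is not the Guild Wars code, just the general idea.

```cpp
#include <cassert>   // the standard assert macro uses the same idea, keyed on NDEBUG

// Hypothetical sketch; all names here are invented for illustration.
static int g_assertFailures = 0;

inline void ReportAssertFailure(const char* /*expression*/) {
    ++g_assertFailures;   // a real game would log the failure and likely halt
}

#ifdef LIVE_BUILD
    // Checks compile to nothing: no CPU cost, but no validation either.
    #define GAME_ASSERT(expr) ((void)0)
#else
    #define GAME_ASSERT(expr) \
        ((expr) ? (void)0 : ReportAssertFailure(#expr))
#endif

int ApplyDamage(int health, int damage) {
    GAME_ASSERT(damage >= 0);        // validate input before the computation
    int result = health - damage;
    GAME_ASSERT(result <= health);   // validate state after the computation
    return result;
}
```

The point to notice is that defining `LIVE_BUILD` changes the source text the compiler actually sees, so a live build is a genuinely different program from a dev build, which is how a bug can hide in one branch only.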

The bug was easily fixed by upgrading the build server, but in the end we decided to leave assertions enabled even for live builds. The anticipated cost-savings in CPU utilization (or more correctly, the anticipated savings from being able to purchase fewer computers in the future) were lost due to the programming effort required to identify the bug, so we felt it better to avoid similar issues in future.

Lesson learned: everyone, programmers and build servers alike, should be running the same version of the tools!
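One way to enforce that lesson is to have the build refuse to run when the installed tools differ from the version the team agreed on. This is only a sketch under my own assumptions: `ToolVersion` and `ToolchainMatches` are invented names, and how you actually read the installed version depends on your tools.

```cpp
#include <cassert>

// Hypothetical sketch; ToolVersion and ToolchainMatches are invented names.
struct ToolVersion {
    int majorVersion;
    int minorVersion;
    int servicePack;
};

inline bool operator==(const ToolVersion& a, const ToolVersion& b) {
    return a.majorVersion == b.majorVersion
        && a.minorVersion == b.minorVersion
        && a.servicePack == b.servicePack;
}

// True only when the machine has exactly the version the team standardized on.
bool ToolchainMatches(const ToolVersion& installed, const ToolVersion& required) {
    return installed == required;
}
```

A build script could call a check like this first and stop with an error on a mismatch, whether it runs on a programmer's machine or on the build server.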

Your computer is broken

After my experience reporting a non-bug to the folks at Microsoft, I was notably more shy about suggesting that bugs might be caused by anything other than the code I or one of my teammates wrote.

During the development of Guild Wars (GW) I had occasion to review many bug reports sent in from players’ computers. As GW players may remember, in the (hopefully unlikely) event that the game crashed it would offer to send the bug report back to our “lab” for analysis. When we received those bug reports we triaged to determine who should handle each report, but of course bugs come in all manner of shapes and sizes and some don’t have a clear owner, so several of us would take turns at fixing these bugs.

Periodically we’d come across bugs that defied belief and we’d be left scratching our heads. While it wasn’t impossible for the bugs to occur, and we could construct hypothetically plausible explanations that didn’t involve redefining the space-time continuum, they just “shouldn’t” have occurred. It was possible they could be memory corruption or thread race issues, but given the information we had it just seemed unlikely.

Mike O’Brien, one of the co-founders and a crack programmer, eventually came up with the idea that they were related to computer hardware failures rather than programming failures. More importantly he had the bright idea for how to test that hypothesis, which is the mark of an excellent scientist.

He wrote a module (“OsStress”) which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second.
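Here is a rough sketch of that kind of self-check. It is not the real OsStress code: every function name below is hypothetical, and instead of a precomputed table of known answers this version recomputes the expected value without going through the memory block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a memory/CPU self-check; not the actual OsStress code.

// Deterministic pseudo-random sequence (a simple linear congruential step).
inline std::uint32_t NextValue(std::uint32_t x) {
    return x * 1664525u + 1013904223u;
}

// Mixing step in the style of FNV-1a, applied to 32-bit words.
// Any single-bit change in `value` changes the resulting hash.
inline std::uint32_t Mix(std::uint32_t hash, std::uint32_t value) {
    return (hash ^ value) * 16777619u;
}

// Write the sequence into the memory block.
void FillBlock(std::vector<std::uint32_t>& block, std::uint32_t seed) {
    std::uint32_t x = seed;
    for (std::uint32_t& slot : block) {
        x = NextValue(x);
        slot = x;
    }
}

// Hash the values as they are read back from memory.
std::uint32_t HashBlock(const std::vector<std::uint32_t>& block) {
    std::uint32_t hash = 2166136261u;
    for (std::uint32_t value : block)
        hash = Mix(hash, value);
    return hash;
}

// The "known answer": the same hash computed without touching the block.
std::uint32_t ExpectedHash(std::size_t count, std::uint32_t seed) {
    std::uint32_t hash = 2166136261u;
    std::uint32_t x = seed;
    for (std::size_t i = 0; i < count; ++i) {
        x = NextValue(x);
        hash = Mix(hash, x);
    }
    return hash;
}

// Returns true when everything was stored and read back correctly.
bool StressTestPasses(std::size_t count, std::uint32_t seed) {
    std::vector<std::uint32_t> block(count);
    FillBlock(block, seed);
    return HashBlock(block) == ExpectedHash(count, seed);
}
```

On healthy hardware `StressTestPasses` always returns true. A single flipped bit anywhere in the block makes the two hashes disagree, which is the signal a game could use to blame the machine rather than the code.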

On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 would have at least one crash bug. Our programming team could spend weeks researching the bugs for just one day at that rate!

When the stress test failed Guild Wars would alert the user by closing the game and launching a web browser to a Hardware Failure page which detailed the several common causes that we discovered over time:

Memory failure: in the early days of the IBM PC, when hardware failures were more common, computers used to have “RAM parity bits” so that in the event a portion of the memory failed the computer hardware would be able to detect the problem and halt computation, but parity RAM fell out of favor in the early ’90s. Some computers use “Error Correcting Code” (ECC) memory, but because of the additional cost it is more commonly found on servers rather than desktop computers. Related articles: Google: Computer memory flakier than expected and doctoral student unravels ‘tin whisker’ mystery.
Overclocking: while less common these days, many gamers used to buy lower clock rate — and hence less expensive — CPUs for their computers, and would then increase the clock frequency to improve performance. Overclocking a CPU from 1.8 GHz to 1.9 GHz might work for one particular chip but not another. I’ve overclocked computers myself without experiencing an increase in crash-rate, but some users ratchet up the clock frequency so high as to cause spectacular crashes as the signals bouncing around inside the CPU don’t show up at the right time or place.
Inadequate power supply: many gamers purchase new computers every few years, but purchase new graphics cards more frequently. Graphics cards are an inexpensive system upgrade which generate remarkable improvements in game graphics quality. During the era when Guild Wars was released many of these newer graphics cards had substantially higher power needs than their predecessors, and in some cases a computer power supply was unable to provide enough power when the computer was “under load”, as happens when playing games.
Overheating: Computers don’t much like to be hot and malfunction more frequently in those conditions, which is why computer datacenters are usually cooled to 68-72F (20-22C). Computer games try to maximize video frame-rate to create better visual fidelity; that increase in frame-rate can cause computer temperatures to spike beyond the tolerable range, causing game crashes.

In college I had an external hard-drive on my Mac that would frequently malfunction during spring and summer when it got too hot. I purchased a six-foot SCSI cable that was long enough to reach from my desk to the mini-fridge (nicknamed Julio), and kept the hard-drive in the fridge year round. No further problems!

Once the Guild Wars tech support team was alerted to the overheating issue they had success fixing many otherwise intractable crash bugs. When they received certain types of crash reports they encouraged players to create more air flow by relocating furniture, adding external fans, or just blowing out the accumulated dust that builds up over years, and that solved many problems.

While implementing the computer stress test solution seems beyond the call of duty it had a huge payoff: we were able to identify computers that were generating bogus bug reports and ignore their crashes. When millions of people play a game in any given week, even a low defect rate can result in more bug reports than the programming team can field. By focusing our efforts on the bugs that were actually our fault the programming team was able to spend time creating features that players wanted instead of triaging unfixable bugs.
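The triage step itself can be very simple. As a sketch, assuming each crash report carries a flag saying whether the hardware check had failed on that machine (the `CrashReport` fields and the function name are my own invention for this example):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch; the CrashReport fields are invented for illustration.
struct CrashReport {
    std::string summary;
    bool hardwareCheckFailed;   // true if the stress test failed on that machine
};

// Keep only the reports from machines that passed the hardware check.
std::vector<CrashReport> ReportsWorthInvestigating(const std::vector<CrashReport>& all) {
    std::vector<CrashReport> kept;
    for (const CrashReport& report : all) {
        if (!report.hardwareCheckFailed)
            kept.push_back(report);
    }
    return kept;
}
```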

Ever more bugs

I don’t think that we’ll ever reach a stage where computer programs don’t have bugs: user expectations are rising faster than the technical abilities of programmers. The Warcraft 1 code base was approximately 200,000 lines of code (including in-house tools), whereas Guild Wars 1 eventually grew to 6.5 million lines of code (including tools). Even if it’s possible to write fewer bugs per line of code, the vast increase in the number of lines of code means it is difficult to reduce the total bug count. But we’ll keep trying.

To close out this post I wanted to share one of my favorite tongue-in-cheek quotes from Bob Fitch, who I worked with back in my Blizzard days. He posited that "all programs can be optimized, and all programs have bugs; therefore all programs can be optimized to one line that doesn't work." And that's why we have bugs.
