Game design, game programming and more

Whose bug is this anyway?!?

At a certain point in every programmer’s career we each find a bug that seems impossible because the code is right, dammit! So it must be the operating system, the tools or the computer that’s causing the problem. Right?!?

Today’s story is about some of those bugs I’ve discovered in my career.

This bug is Microsoft’s fault… or not

Several months after the launch of Diablo in late 1996, the StarCraft team put on the hustle and started working extra-long hours to get the game done. Since the game was “only two months from launch” it seemed to make sense to work more hours every day (and some weekends too). There was much to do because, even though the team started with the Warcraft II game engine, almost every system needed rework. All of the scheduling estimates were wildly wrong (my own included), so this extra effort kept on for over a year.

I wasn’t originally part of the StarCraft dev team, but after Diablo launched, when it became clear that StarCraft needed more “resources” (AKA people), I joined the effort. Because I came aboard late I didn’t have a defined role, so instead I just “used the force” to figure out what needed to happen to move the project forward (more details in a previous post on this blog).

I got to write fun features like implementing parts of the computer AI, which was largely developed by Bob Fitch. One of mine was a system to determine the best places to create “strong-points”: locations where AI players would gather units for defense and use as staging areas for attacks. I was fortunate that there were already well-designed APIs I could query to learn which map areas were joined together by the path-finding algorithm, and where concentrations of enemy units were located, so I could select good strong-points; it would otherwise be embarrassing to fortify positions that opponents could trivially bypass.
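As a rough illustration of the kind of query-and-score logic involved (the names, fields, and weights below are invented for this sketch, not StarCraft's actual AI API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical strong-point scoring. The idea from the article: only
// consider spots reachable by ground (so they can't be trivially
// bypassed), and prefer spots near enemy concentrations.
struct Spot {
    int regionId;       // path-finding region the spot belongs to
    int enemyStrength;  // enemy unit concentration near the spot
    int chokeWidth;     // narrow approaches are easier to defend
};

int ScoreStrongPoint(const Spot &s, int ourRegionId) {
    if (s.regionId != ourRegionId)
        return -1;  // not ground-connected to us: fortifying it is pointless
    // Favor contested areas with narrow approaches (weights are arbitrary).
    return s.enemyStrength * 2 + (100 - s.chokeWidth);
}

std::size_t PickBestSpot(const std::vector<Spot> &spots, int ourRegionId) {
    std::size_t best = 0;
    int bestScore = -1;
    for (std::size_t i = 0; i < spots.size(); ++i) {
        int score = ScoreStrongPoint(spots[i], ourRegionId);
        if (score > bestScore) { bestScore = score; best = i; }
    }
    return best;
}
```

The key design point survives even in a toy version: the path-connectivity check comes first, so unreachable locations are rejected before any "how good is this spot?" arithmetic runs.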

I re-implemented some components like the “fog of war” system I had written for previous incarnations of the ’Craft series. StarCraft deserved a better fog-of-war system than its predecessor, Warcraft II, with finer resolution in the fog-map, and we meant to include line-of-sight visibility calculations so that units on higher terrain would be invisible to those on lower terrain, greatly increasing the tactical depth of the game: when you can’t see what the enemy is doing, the game is far more challenging. Similarly, units around a corner would be out of sight and couldn’t be detected.

The new fog of war was the most enjoyable part of the project for me as I needed to do some quick learning to make the system functional and fast. Earlier efforts by another programmer were graphically displeasing and, moreover, ran so slowly as to be unworkable. I learned about texture filtering algorithms and Gouraud shading, and wrote the best 80386 assembly language of my career — a skill now almost unnecessary for modern game development. Like many others I hope that StarCraft is eventually open-sourced, in my case so I can look with fondness on my coding efforts, though perhaps my memories are better than seeing the actual code!

But my greatest contribution to the StarCraft code was fixing defects. With so many folks working extreme hours writing brand new code the entire development process was haunted by bugs: two steps forward, one step back. While most of the team coded new features, I spent my days hunting down the problems identified by our Quality Assurance (QA) test team.

The trick to effective bug-fixing is discovering how to reliably reproduce a problem. Once you know how to replicate a bug it’s possible to discover why the bug occurs, and then it’s often straightforward to fix. Unfortunately, reproducing a “will o’ the wisp” bug that only occasionally deigns to show up can take days or weeks of work. Even worse, it is difficult or impossible to determine beforehand how long a bug will take to fix, so long hours of investigation were the order of the day. My terse status updates to the team were along the lines of “yeah, still looking for it”. I’d sit down in the morning and basically spend all day cracking on, sometimes fixing dozens of issues, but many times fixing none.

One day I came across some code that wasn’t working: it was supposed to choose a behavior for a game unit based on the unit’s class (“harvesting unit”, “flying unit”, “ground unit”, etc.) and state (“active”, “disabled”, “under attack”, “busy”, “idle”, etc.). I don’t remember the specifics after so many years, but something along the lines of this:

if (UnitIsHarvester(unit))
    return X;

if (UnitIsFlying(unit)) {
    if (UnitCannotAttack(unit))
        return Z;
    return Y;
}

... many more lines

if (! UnitIsHarvester(unit))    // "!" means "not"
    return Q;

return R;   <<< BUG: this code is never reached!

After staring at the problem for too many hours I guessed it might be a compiler bug, so I looked at the assembly language code.

For the non-programmers out there, compilers are tools that take the code that programmers write and convert it into “machine code”, which are the individual instructions executed by the CPU.

// Add two numbers in C, C#, C++ or Java
A = B + C;

; Add two numbers in 80386 assembly
mov     eax, [B]    ; move B into a register
add     eax, [C]    ; add C to that register
mov     [A], eax    ; save results into A

After looking at the assembly code I concluded that the compiler was generating the wrong results, and sent a bug report off to Microsoft — the first compiler bug report I’d ever submitted. And I received a response in short order, which in retrospect is surprising: considering that Microsoft wrote the most popular compiler in the world it’s hard to imagine that my bug report got any attention at all, much less a quick reply!

You can probably guess — it wasn’t a bug, there was a trivial error I had been staring at all along but didn’t notice. In my exhaustion — weeks of 12+ hour days — I had failed to see that it was impossible for the code to work properly. It’s not possible for a unit to be neither “a harvester” nor “not a harvester”. The Microsoft tester who wrote back politely explained my mistake. I felt crushed and humiliated at the time, only slightly mitigated by the knowledge that the bug was now fixable.
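With the duplicate harvester test removed, the final branch becomes reachable and the function behaves as intended. A minimal sketch of the corrected control flow (the unit classes and behavior codes here are illustrative stand-ins, not the real StarCraft types):

```cpp
#include <cassert>

// Illustrative stand-ins for the real unit-classification API.
enum class UnitClass { Harvester, Flyer, GroundCombat, GroundSupport };

struct Unit {
    UnitClass cls;
    bool canAttack;
};

// Behavior codes standing in for the X/Y/Z/Q/R returns in the article.
enum Behavior { X, Y, Z, Q, R };

Behavior ChooseBehavior(const Unit &unit) {
    if (unit.cls == UnitClass::Harvester)
        return X;

    if (unit.cls == UnitClass::Flyer) {
        if (!unit.canAttack)
            return Z;
        return Y;
    }

    // The buggy version had "if (!UnitIsHarvester(unit)) return Q;" here.
    // Since every unit reaching this point is already known NOT to be a
    // harvester (the first test returned early), that condition was always
    // true and the final return below could never execute.
    if (unit.cls == UnitClass::GroundCombat)
        return Q;

    return R;   // reachable now, for the remaining unit class
}
```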

Incidentally, this is one of the reasons that crunch time is a failed development methodology, as I’ve mentioned in past posts on this blog; developers get tired and start making stupid mistakes. It’s far more effective to work reasonable hours, go home, have a life, and come back fresh the next day.

When I started ArenaNet with two of my friends the “no crunch” philosophy was a cornerstone of our development effort, and one of the reasons we didn’t buy foosball tables and arcade machines for the office. Work, go home at a reasonable time, come back fresh!

This bug is actually Microsoft’s fault

Several years later, while working on Guild Wars, we discovered a catastrophic bug that caused game servers to crash on startup. Unfortunately, this bug didn’t occur in the “dev” (“development”) branch that the programming team used for everyday work, nor did it occur in the “stage” (“staging”) branch used by the game testers for final verification; it only occurred in the “live” branch which our players used to play the game. We had “pushed” a new build out to end-users, and now none of them could play the game! WTF!

Having thousands of angry players amps up the pressure to get that kind of problem fixed quickly. Fortunately we were able to “roll back” the code changes and restore the previous version of the code in short order, but now we needed to understand how we broke the build. Like many problems in programming, it turned out that several issues taken together conspired to cause the bug.

There was a compiler bug in Microsoft Visual Studio 6 (MSVC6), which we used to build the game. Yes! Not our fault! Well, except that our testing failed to uncover the problem. Whoops.

Under certain circumstances the compiler would generate incorrect results when processing templates. What are templates? They’re useful, but they’ll blow your mind; read this if you dare.

C++ is a complex programming language so it is no surprise that compilers that implement the language have their own bugs. In fact the C++ language is far more complicated than other mainstream languages, as shown in this article that visualizes the complexity of C++ compared to the Ruby language. Ruby is a complex and fully-featured language, but as the diagram shows C++ is twice as complex, so we would expect it to have twice as many bugs, all other things being equal.

When we researched the compiler bug it turned out to be one that we already knew about, and that had already been fixed by the Microsoft dev team in MSVC6 Service Pack 5 (SP5). In fact all of the programmers had already upgraded to SP5. Sadly, though we had each updated our work computers, we neglected to upgrade the build server, which is the computer that gathers the code, artwork, game maps and other assets and turns them into a playable game. So while the game would run perfectly on each programmer’s computer, it would fail horribly when built by the build server. But only in the live branch!

Why only in live? Hmmm… Well, ideally all branches (dev, stage, live) would be identical to eliminate the opportunity for bugs just like this one, but in fact there were a number of differences. For a start we disabled many debugging capabilities for the live branch that were used by the programming and test teams. These capabilities could be used to create gold and items, or spawn monsters, or even crash the game.

We wanted to make sure that the ArenaNet and NCsoft staff didn’t have access to cheat functions because we wanted to create a level playing field for all players. Many MMO companies have had to fire folks who abused their godlike “GM” powers so we thought to eliminate that problem by removing capability.

A further change was to eliminate some of the “sanity checking” code that’s used to validate that the game is functioning properly. This type of code, known as asserts or assertions by programmers, is used to ensure that the game state is proper and correct before and after a computation. These assertions come with a cost, however: each additional check that has to be performed takes time; with enough assertions embedded in the code the game can run quite slowly. We had decided to disable assertions in the live code to reduce the CPU utilization of the game servers, but this had the unintended consequence of causing the C++ compiler to generate the incorrect results which led to the game crash. A program that doesn’t run uses a lot less CPU, but that wasn’t actually the desired result.
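For readers unfamiliar with how assertions can be compiled in and out per branch, here is a generic sketch of the mechanism; this is a typical pattern, not ArenaNet's actual macro:

```cpp
#include <cstdio>
#include <cstdlib>

// In a live build, defining DISABLE_ASSERTS compiles the checks away
// entirely: the condition is never even evaluated, so it costs zero CPU,
// which is exactly the savings the article describes trading away.
#ifdef DISABLE_ASSERTS
#define GAME_ASSERT(cond) ((void)0)
#else
#define GAME_ASSERT(cond)                                           \
    do {                                                            \
        if (!(cond)) {                                              \
            std::fprintf(stderr, "Assertion failed: %s (%s:%d)\n",  \
                         #cond, __FILE__, __LINE__);                \
            std::abort();                                           \
        }                                                           \
    } while (0)
#endif

// Hypothetical example of "sanity checking" game state before a computation.
int DivideGold(int gold, int partyMembers) {
    GAME_ASSERT(partyMembers > 0);  // validate state before computing
    return gold / partyMembers;
}
```

Note the subtle hazard this pattern creates: because the disabled branch never evaluates the condition, code compiled with and without assertions is genuinely different, which is part of why a dev/live build divergence can hide bugs the way the article describes.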

The bug was easily fixed by upgrading the build server, but in the end we decided to leave assertions enabled even for live builds. The anticipated cost-savings in CPU utilization (or more correctly, the anticipated savings from being able to purchase fewer computers in the future) were lost due to the programming effort required to identify the bug, so we felt it better to avoid similar issues in future.

Lesson learned: everyone, programmers and build servers alike, should be running the same version of the tools!

Your computer is broken

After my experience reporting a non-bug to the folks at Microsoft, I was notably more shy about suggesting that bugs might be caused by anything other than the code I or one of my teammates wrote.

During the development of Guild Wars (GW) I had occasion to review many bug reports sent in from players’ computers. As GW players may remember, in the (hopefully unlikely) event that the game crashed it would offer to send the bug report back to our “lab” for analysis. When we received those bug reports we triaged to determine who should handle each report, but of course bugs come in all manner of shapes and sizes and some don’t have a clear owner, so several of us would take turns at fixing these bugs.

Periodically we’d come across bugs that defied belief and we’d be left scratching our heads. While it wasn’t impossible for the bugs to occur, and we could construct hypothetically plausible explanations that didn’t involve redefining the space-time continuum, they just “shouldn’t” have occurred. It was possible they could be memory corruption or thread race issues, but given the information we had it just seemed unlikely.

Mike O’Brien, one of the co-founders and a crack programmer, eventually came up with the idea that they were related to computer hardware failures rather than programming failures. More importantly he had the bright idea for how to test that hypothesis, which is the mark of an excellent scientist.

He wrote a module (“OsStress”) which would allocate a block of memory, perform calculations in that memory block, and then compare the results of the calculation to a table of known answers. He encoded this stress-test into the main game loop so that the computer would perform this verification step about 30-50 times per second.
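The internals of OsStress aren't public, so the following is only the general shape of the idea: fill a freshly allocated block with a known pattern, checksum it, and compare against an independently computed answer. On healthy hardware the comparison never fails.

```cpp
#include <cstdint>
#include <vector>

// One iteration of a hardware sanity check: write a known pattern into a
// memory block, sum it, and compare against an independently recomputed
// total. A mismatch means memory or the CPU corrupted something in between.
bool StressTestPasses(std::size_t blockSize) {
    std::vector<uint32_t> block(blockSize);
    for (std::size_t i = 0; i < blockSize; ++i)
        block[i] = static_cast<uint32_t>(i * 2654435761u);  // arbitrary pattern

    uint64_t sum = 0;
    for (uint32_t v : block)
        sum += v;

    uint64_t expected = 0;
    for (std::size_t i = 0; i < blockSize; ++i)
        expected += static_cast<uint32_t>(i * 2654435761u);

    return sum == expected;
}
```

As discussed in the comments below, the real module presumably also used blocks large enough (or access patterns designed) to defeat the L1/L2 caches, so the test actually exercised main memory rather than just the cache.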

On a properly functioning computer this stress test should never fail, but surprisingly we discovered that on about 1% of the computers being used to play Guild Wars it did fail! One percent might not sound like a big deal, but when one million gamers play the game on any given day that means 10,000 of them would have at least one crash. At that rate, our programming team could spend weeks researching just a single day’s worth of bug reports!

When the stress test failed Guild Wars would alert the user by closing the game and launching a web browser to a Hardware Failure page which detailed the several common causes that we discovered over time:

  • Memory failure: in the early days of the IBM PC, when hardware failures were more common, computers used to have “RAM parity bits” so that in the event a portion of the memory failed the computer hardware would be able to detect the problem and halt computation, but parity RAM fell out of favor in the early ’90s. Some computers use “Error Correcting Code” (ECC) memory, but because of the additional cost it is more commonly found on servers rather than desktop computers. Related articles: Google: Computer memory flakier than expected and doctoral student unravels ‘tin whisker’ mystery.
  • Overclocking: while less common these days, many gamers used to buy lower clock rate — and hence less expensive — CPUs for their computers, and would then increase the clock frequency to improve performance. Overclocking a CPU from 1.8 GHz to 1.9 GHz might work for one particular chip but not another. I’ve overclocked computers myself without experiencing an increase in crash-rate, but some users ratchet up the clock frequency so high as to cause spectacular crashes as the signals bouncing around inside the CPU don’t show up at the right time or place.
  • Inadequate power supply: many gamers purchase new computers every few years, but purchase new graphics cards more frequently. Graphics cards are an inexpensive system upgrade which generate remarkable improvements in game graphics quality. During the era when Guild Wars was released many of these newer graphics cards had substantially higher power needs than their predecessors, and in some cases a computer power supply was unable to provide enough power when the computer was “under load”, as happens when playing games.
  • Overheating: Computers don’t much like to be hot and malfunction more frequently in those conditions, which is why computer datacenters are usually cooled to 68-72F (20-22C). Computer games try to maximize video frame-rate to create better visual fidelity; that increase in frame-rate can cause computer temperatures to spike beyond the tolerable range, causing game crashes.

In college I had an external hard-drive on my Mac that would frequently malfunction during spring and summer when it got too hot. I purchased a six-foot SCSI cable that was long enough to reach from my desk to the mini-fridge (nicknamed Julio), and kept the hard-drive in the fridge year round. No further problems!

Once the Guild Wars tech support team was alerted to the overheating issue they had success fixing many otherwise intractable crash bugs. When they received certain types of crash reports they encouraged players to create more air flow by relocating furniture, adding external fans, or just blowing out the accumulated dust that builds up over years, and that solved many problems.

While implementing the computer stress test solution seems beyond the call of duty it had a huge payoff: we were able to identify computers that were generating bogus bug reports and ignore their crashes. When millions of people play a game in any given week, even a low defect rate can result in more bug reports than the programming team can field. By focusing our efforts on the bugs that were actually our fault the programming team was able to spend time creating features that players wanted instead of triaging unfixable bugs.

Ever more bugs

I don’t think that we’ll ever reach a stage where computer programs don’t have bugs — user expectations are rising faster than the technical abilities of programmers. The Warcraft 1 code base was approximately 200,000 lines of code (including in-house tools), whereas Guild Wars 1 eventually grew to 6.5 million lines of code (including tools). Even if it’s possible to write fewer bugs per line of code, the vast increase in the number of lines of code means it is difficult to reduce the total bug count. But we’ll keep trying.

To close out this post I wanted to share one of my favorite tongue-in-cheek quotes from Bob Fitch, who I worked with back in my Blizzard days. He posited that “all programs can be optimized, and all programs have bugs; therefore all programs can be optimized to one line that doesn’t work.” And that’s why we have bugs.

About Patrick Wyatt

As a game developer with more than 22 years in the industry I have helped build small companies into big ones (VP of Blizzard, Founder of ArenaNet, COO of En Masse Entertainment); led the design and development efforts for best-selling game series (Warcraft, Diablo, Starcraft, Guild Wars); written code for virtually every aspect of game development (networking, graphics, AI, pathing, sound, tools, installers, servers, databases, ecommerce, analytics, crypto, dev-ops, etc.); designed many aspects of the games I've shipped; run platform services teams (datacenter operations, customer support, billing/accounts, security, analytics); and developed state-of-the-art technologies required to compete in the AAA+ game publishing business.

  • Marcin Jaczewski

Another great article! This OsStress is a great idea, but what computation does it do?
Can it be anything, or is it something specific to catch cache failures?

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Since I don’t work at ArenaNet any more I can’t peek at the source code,
      but I expect the code does some cache-busting so that the computations
      occur mostly in main memory instead of the L1 or L2 cache.

    • http://twitter.com/fritzy fritzy

      Here’s another great (recent) article on this problem by the guy who writes Redis (the memory store database) http://antirez.com/news/43 and includes a section on avoiding the CPU cache.

  • Marius Gedminas

    Will you tell us more about the details of the AI code in StarCraft?

  • http://twitter.com/LongSteve Steve Longhurst

    I must say, this is probably the best programming blog article I’ve read all year, I really enjoyed it. Particularly the OsStress module, genius. There’s an old phrase that comes to mind, “select isn’t broken”. Google it for similar stories.

  • wtpayne

    Nice article. Great to hear stories from the trenches.

  • Aaron Opfer

Is it possible that people who were using game hacks could have generated some of those bug reports? Buggy hacks will crash games too. And if the devs look at the bug reports and see insanity in the call stack (i.e., “why is our SelectUnit function being called directly by an unknown module?”) then I imagine lots of head-scratching and beard-stroking.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Oh sure, I expect so; game hacks can cause some crashes. But some of the bugs were downright spooky:

      xor eax, eax ; clear register
      mov esi, [something]
      mov al, [esi]
      <<>>
      WTF!

      Now this could be a thread-race condition or memory corruption of the stack, but we didn’t think so after a lot of research.

      • Emjayen

        It’s possible it was an actual processor bug; I vaguely recall a few bugs relating to usage of the string instructions in Intel’s x86 implementation. A race condition seems most likely; I’ve been left scratching my head while looking at absurd register values during debugging only to find that the exception handler was preempted and some thread trashed everything.

      • Morten Ofstad

        In this case the first thing I would check was if the return address of a function had been corrupted and we had ended up returning into the middle of another function… It’s clear the xor instruction hasn’t been executed so there must be something funky going on with the control flow, right?

      • http://www.codeofhonor.com/blog Patrick Wyatt

        You’re right that it could have been a corrupted return address. We receive stack-traces and partial memory dumps in crash reports and could validate that the stack looked good.

        If you look at several hundred bugs with crazy problems like this you eventually find several with enough data to indicate that stack corruption does not appear to be the case.

        We didn’t go looking for OS/hardware problems but after months of looking into these issues Mike’s testable hypothesis and subsequent testing proved it was.

        It’s funny but customers who contacted support just wouldn’t believe us at first — “blow out the dust, are you kidding?” But it worked!

      • Kevin T

        My last job included a lot of assembly, kernel debugging & crash dump work. As crazy as some of the stuff we had to look at could get, we never found a MS bug in my time there. It was generally memory corruption, race conditions, etc. that were causing the issue. I did find quite a few bugs in other companies’ application code & drivers though! It’s kind of neat to email someone what their probable source code bug was, starting only from a binary image. (You cast a COM object received from an external interface to your internal implementation type, didn’t you? Fess up!)

        One interesting skill that came out of my time there was that I could start debugging my own blue screens at home. I once found an “impossible” dump state in the graphics driver stack — the register value was one or two bits off from what had been calculated a few instructions before. Cleaning the dust out of the machine was the correct bugfix.

  • Ryan

Did you drill holes in your mini-fridge to get the external HDD in there?

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Actually didn’t have to drill holes; the rubber door seal compresses enough so that none of the bits got stuck passing through the SCSI cable.

      • Roy

        Hah! “Bits.” I get it!

  • thecodist

    Games for some reason generate the most interesting bugs. Working at an MMO company was probably the hardest code I’ve ever worked on. I covered one that took me a week to fix: http://thecodist.com/article/fixing_a_nasty_physically_modeled_engine_bug_in_an_fps_game

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Great story; thanks for sharing!

  • KevBru

Your first example is confusing or incomplete. The return value of UnitIsHarvester() certainly can change mid-function based on the state of the object, so there is no way the compiler could infer what you’re implying. Even const wouldn’t help. Imagine a case mid-function where UnitBecomesHarvester() is called.

    • http://www.facebook.com/NaibStilgar Stilgar Naib

      In this case it was not a compiler bug. It was a very simple bug he missed because he was tired.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      The code snippet wasn’t clear enough, sorry. The UnitIsHarvester function always returns the same value for a given unit. That is, an SCV is *always* a harvester.

      • http://www.facebook.com/SuperMegaLodewyk Lodewyk Duminy

        It was pretty clear, the function name was clear and concise :)

        I had a similar problem earlier this week. After 30 minutes of stepping through the code and not finding the problem, a co-worker peeked over my shoulder and pointed it out.

        I was embarrassed! You make a very good point – crunch time isn’t effective, not only because programmers get tired, but because they get tired while they are under a lot of pressure.

      • http://twitter.com/frosted Chuck G

That’s one of the best ways I find of solving problems myself. I use what I call a “cardboard programmer”. Once I explain what I am doing and the problem, the (usually trivial) problem jumps out at me. If no one is around I try to go through the exercise on my own, but having another pair of eyes on the code is a good thing, especially when they ask questions.

      • Kevin T

        I heard once from a coworker that at a prior job of his, someone often kept a dog in the office, that became known as the “code dog”. Describing your problem out loud to the code dog would often be enough to help solve it.

      • http://www.codeofhonor.com/blog Patrick Wyatt

        Yes — exactly! That’s “rubber ducking” – http://en.wikipedia.org/wiki/Rubber_duck_debugging

      • http://www.codeofhonor.com/blog Patrick Wyatt

        Having a second set of eyes (or ears) for a problem can be soooo helpful!

      • Walt Sellers

        Very often, finding anything depends on how observant you can be (ie “powers of observation”.) And that ability diminishes rapidly as you get tired.

  • http://twitter.com/hobbified hobbified

    #11907 Looking for a compiler bug is the strategy of LAST resort. LAST resort.

    – MJD’s Good Advice and Maxims for Programmers

  • Jovan

    Do you remember how the fog of war was implemented?

    • http://www.codeofhonor.com/blog Patrick Wyatt

      More or less; it wasn’t that complicated:

      1. Every time a unit moves, use a circular bitmask to mark the adjacent tiles around the unit in the visibility map. Tricky part: perform line of sight calculations based on the terrain map, which included flags for altitude (low/high) and “can’t see through”. This was more or less ray-casting, and was written in assembly language for speed.

2. Use a filter function to smooth the visibility map to achieve less jagged edges.

      3. Use Gouraud shading to smoothly render the fog of war on top of the terrain map.

      4. Periodically mark the entire visibility map invisible to close up areas where no units are stationed.
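Steps 1 and 2 above can be sketched in a few lines. To be clear, the map size, sight radius, and box filter here are illustrative only; the original was hand-tuned assembly and used a precomputed circular bitmask rather than a per-tile distance test:

```cpp
#include <array>

constexpr int MapW = 8, MapH = 8;
using VisMap = std::array<std::array<int, MapW>, MapH>;

// Step 1: stamp a circular visibility region around a unit's tile.
// (StarCraft used a precomputed bitmask; the distance test here is
// just the simplest way to express the same shape.)
void MarkVisible(VisMap &vis, int ux, int uy, int radius) {
    for (int y = 0; y < MapH; ++y)
        for (int x = 0; x < MapW; ++x)
            if ((x - ux) * (x - ux) + (y - uy) * (y - uy) <= radius * radius)
                vis[y][x] = 255;  // fully visible
}

// Step 2: box-filter the visibility map so fog edges fade smoothly
// instead of stair-stepping along tile boundaries.
VisMap Smooth(const VisMap &vis) {
    VisMap out{};
    for (int y = 0; y < MapH; ++y)
        for (int x = 0; x < MapW; ++x) {
            int sum = 0, n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = x + dx, sy = y + dy;
                    if (sx >= 0 && sx < MapW && sy >= 0 && sy < MapH) {
                        sum += vis[sy][sx];
                        ++n;
                    }
                }
            out[y][x] = sum / n;
        }
    return out;
}
```

Step 3 then interpolates the smoothed per-tile values across each tile's pixels (Gouraud shading), which is what produces the final soft gradient on screen.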

      • nick k

        ah yes of course — simple

      • http://www.facebook.com/brian.fitzgerald.923171 Brian Fitzgerald

        What I’m sad about is that the first version was REALLY FREAKING COOL but had to be cut back to something more sane for CPU budget reasons. If I’m remembering correctly, that is.

      • http://www.codeofhonor.com/blog Patrick Wyatt

        The first version of fog of war which I alluded to in the article had finer resolution but looked awful. For static screenshots it looked good, but when units moved they caused the fog-of-war outline to “shiver”. You should go back in source control and try to build that version to see how it looked :)

        Also, that code was slow even without the line-of-sight and terrain-height computations; it could never have been optimized enough to include those features.

        To make my version near as pretty as the static screenshots of the initial version I had to do visibility-map smoothing with texture filtering *and* Gouraud shading.

      • Jovan

        Fantastic response Patrick, thanks a lot :)

        From what I remember, the fog of war is only recalculated every Nth frame isn’t it?

        I imagine doing the full on calculations every frame would be a performance issue, and possibly explains why the fog of war lags behind the units by a few seconds in elder RTS games — just a hunch.

        What’s your take on using an LUT of precomputed shading ramps[1] as opposed to calculating the shading on the fly?

        [1] http://www.appsizematters.com/2010/07/how-to-implement-a-fog-of-war-part-2-smooth/

      • http://www.codeofhonor.com/blog Patrick Wyatt

        My recollection is that every time a unit changes its tile location (tiles are 32×32 pixels) it updates the visibility map, which means that the update cost is minimal since, in any given game loop, only a few units cross a tile border. Periodically, when the whole map is wiped back to black, all units update the visibility map.

        I read the link you provided regarding using precalculated lookup tables (LUT). It looks *much* more computationally intensive than what StarCraft does because he uses a finer resolution for his tables.

        In the original StarCraft there are up to 1600 game units active so efficiently marking and rendering the visibility map (on a Pentium-class computer of the day) was critical to having a good frame-rate while leaving enough CPU left over for AI, path-finding and rendering the other layers.

        StarCraft marks visibility only for each 32×32 tile. The combination of texture filtering to smush the visibility values makes for a smoother tile-visibility map, then Gouraud shading interpolates those values on a per-pixel basis. This creates a graphically different look — StarCraft gets a more uniformly smooth shading, whereas the precomputed LUT solution has more rounded shadows.

  • Steven Hauwsz

    I recall devastating, game-wrecking exploits that were widely used in both multiplayer Warcraft II and Starcraft I which often took months for Blizzard to fix, if at all. Has Blizzard or the rest of the game industry learned from and responded to this phenomenon by releasing more bulletproof products at launch?

    • http://www.codeofhonor.com/blog Patrick Wyatt

In regards to fixing exploits quickly, Blizzard had patch paralysis. StarCraft 1.08, which included game-recording, was finished a month before I left Blizzard in February 2000 but wasn’t released for many months afterwards.

      My co-founders and I aimed to address this problem by engineering the development culture at ArenaNet to focus on iteration. Starting in around April 2001 we pushed live builds to our external alpha testers (eventually numbering thousands of folks) every day. Over the four years until launch we pushed on average 20 builds per weekday.

      When you can iterate that fast it becomes easy to fix exploits, but only a few companies put in the effort to build that type of development pipeline.

      Part of the issue is that, for MMO projects, the dev team only gets one “at bat” every five years, so the learning cycle is much slower than for other online games.

  • Ben Tilly

There is a quote attributed to Ken Arnold that I think predates Bob Fitch’s version that you presented: “Every program has at least one bug and can be shortened by at least one instruction — from which, by induction, it is evident that every program can be reduced to one instruction that does not work.”

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Huh, hadn’t heard that one. I wonder if Bob came up with his thought independently or dredged up Ken’s quote from some subconscious remembrance.

  • http://twitter.com/cthonctic Cthonctic

    Awesome article, very entertaining and enlightening.

  • Andy Lee

    What a fantastic read. Thank you for writing it.

  • http://www.facebook.com/laurens.rodriguez.oscanoa Laurens Rodriguez

    Open sourcing Starcraft would be epic, please please please! :D

    • http://twitter.com/agmcleod Aaron McLeod

      It would for sure. But obviously that would be up to blizzard, not him ;)

    • Cleroth Sun

      It’ll be open source… when it no longer serves any purpose.

  • Gabriel Friedmann

    Today my story was one of bug-hunting a feature failure only to end up finding a compiler bug. I was doing some pretty gnarly things with Mono.Cecil to edit .NET intermediate language (similar to Java bytecode). Bonus part of the story: the lame case was so abstract that I was still able to use Mono.Cecil to edit the bug out of itself so I could recompile my initial target without the bug manifesting.

    Also, the hyperlink in this article to “overclocking” is broken.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Thanks Gabe; link fixed!

  • mm0zct

    I must confess it took me far too long to realize that your first bug was that the code was NOT being reached. I usually put something in that case to flag it if we get into an “unreachable” state, so it didn’t occur to me that you expected that line to be reachable. The dangers of sleep-deprived programming.

    It is similar in a way to the if( foo =! 0 ) bug that I took a good hour or two to find once though. My brain refused to read that as anything other than the (foo != 0) that was intended.
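    For the curious, the reason that typo is so nasty is that `=!` still compiles: it assigns the logical negation of the right-hand side rather than comparing. A minimal demonstration (illustrative variable names):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        int foo = 5;

        /* Intended: if (foo != 0)  -- a comparison.
           Typed:    if (foo =! 0)  -- assigns !0 (which is 1) to foo,
           so the condition is always true and foo is clobbered. */
        if (foo =! 0)
            printf("branch always taken\n");

        assert(foo == 1);  /* foo was overwritten, not compared */
        return 0;
    }
    ```

    Most modern compilers will warn about an assignment used as a condition, which is one more reason to build with warnings enabled.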

  • http://www.facebook.com/people/Dan-Kaminsky/515164691 Dan Kaminsky

    This is fantastic. So few people have any context for large scale software development. Thank you so much for your time spent writing!

  • http://www.ericfleischman.com/ Eric Fleischman

    You haven’t lived until you talk to guys @ Intel and start the conversation with “what i’m seeing doesn’t make sense, I think it’s a bug in your layer, but I can’t prove that.” I feel your pain. Spending a few years debugging random problems @ Microsoft was a ton of fun.

  • paperino

    I found a compiler bug myself in the glorious Borland C++ for Windows 3.1. Under certain conditions, the optimizer would double generate the code for the postincrement operator. The hardest part was trying to figure it out, since debugging or adding instrumentation, would alter the conditions around the optimizer…..

  • Wim Rijnders

    “…the “no crunch” philosophy was a cornerstone of our development effort…. Work, go home at a reasonable time, come back fresh!”

    So true. Wish I could stop forgetting it so often!

  • M

    Wonderful post again, I really enjoy reading them.
    However, I miss the sense, from previous posts, of an article written by a programmer for programmers. Does the audience that requires terms like “push” to be double-quoted actually want to read such code-centric stories?

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Finding the right level of technical detail in my articles is the biggest challenge I have in telling these stories.

      Should I tell stories that all Warcraft/StarCraft/Guild Wars players can read, or aim for the technical folks?

      Who *is* my audience? I have no idea. So I write the article and then — during the editing phase — add bits and pieces to explain the jargon, hence “push” and “what is a compiler”.

      • Jovan

        Really digging the technical side of the articles though! There simply aren’t enough journals out there about how game development was done during the 90s.

  • Dorjan

    You sir, are a legend and a decent writer to boot. Jealous :)

  • Emjayen

    It’s a shame you don’t work at ArenaNet anymore. There was a bug in Guild Wars 2 causing a fatal crash that took them a good 2 months and several patches to fix even though I submitted multiple reports of increasing clarity on how to fix the very simple problem (insufficient working set size manifested as an “out of memory” error)

  • Michel

    I don’t understand what was so hard about the first bug. Of course I’m just a small time hobby programmer and your tools and workflows will be way more complicated, but if you had just added a (conditional) breakpoint or a “cout” there, you would have immediately seen that the code was not reached. Then you could start setting breakpoints further up to figure out which parts were reached, and you would have rapidly solved the bug. I’m assuming you have oversimplified the example and it was a bit more complicated than that?

    • http://www.facebook.com/SuperMegaLodewyk Lodewyk Duminy

      It would most likely have been more complicated than that, and there was code between the 2 points, making it slightly harder to actually see the problem.

      I think the problem was also that he was tired at the time, and as a result not focusing properly.

      This rarely happens where I work, because we have a test-driven development environment. Test cases would have alerted us to the bug and made it easier to fix. The obvious downside is that development is a lot slower, making it an undesirable way to produce games, I guess.

    • http://www.facebook.com/NaibStilgar Stilgar Naib

      The whole point is that he was tired. Also sometimes even if you are not tired you get stuck in some code and just can’t see the obvious. If I get stuck especially on “impossible” bugs for more than 2 hours I always call a coworker to take a look.

      On a side note, it turns out that verbally explaining the problem helps you solve it. It is often the case that you look at a problem for days, then you try to explain it to a coworker using actual words and magically you know the solution. Rumor has it that the great Alan Turing had a teddy bear and explained problems with his work to it just to verbalize the problem and get “unstuck”.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Correct; the example was much more complicated than I’ve shown.

      1. It was a spaghetti function that had been hacked on by every programmer on the project.
      2. I had thousands of bugs (not *my* bugs, mostly) to triage and tried to fix them as fast as possible. I’m sure spending more time would have helped.
      3. When you think you’re two months from launch you’ll make unbelievable sacrifices to finish the game — we all worked ridiculous hours. But we were two months from launch for a year, and that took a massive toll on our bodies and intellects.

      To give you an idea about what it was like, I remember walking into James Phinney’s (lead designer of StarCraft) office to ask him a question late at night. He said “wait a second”, leaned over and barfed in a trash can from exhaustion and sickness, then answered my question and went back to work. At the time I didn’t think much of it, but in retrospect it was *CRAZY* that we were doing that.

  • http://www.facebook.com/sharkinu Sparky Sharkinu

    “We wanted to make sure that the ArenaNet and NCsoft staff didn’t have
    access to cheat functions because we wanted to create a level playing
    field for all players.”

    So this ( http://wiki.guildwars.com/wiki/BAMPH! ) was never usable on the live servers?

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Correct.

  • http://twitter.com/Code_Analysis Andrey Karpov

    The compiler is to blame for everything – http://www.viva64.com/en/b/0161/ :-)

  • Saurabh

    >>> I had failed to see that it was impossible for the code to work properly. It’s not possible for a unit to be neither “a harvester” nor “not a harvester”

    That might have been true for your code. But can the compiler assume such stuff? Where is the guarantee that none of the functions called in “the many lines of code” can modify the value of “unit”? Even if unit is a variable on the stack, what if the caller set a global pointer to the address of unit and one of those functions modified it? In my opinion compiler writers make far too many aggressive optimizations that really do not help.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      To clear up the UnitIsHarvester function, it probably looked something like this:

      bool UnitIsHarvester (Unit * u) {
          return u->type == SCV || u->type == DRONE || u->type == PROBE;
      }

      • CdrJameson

        …but it might be complex enough to stop a static analyzer like Lint automatically picking up this kind of bug.

        const bool IsHarvester = UnitIsHarvester();

        if (IsHarvester) return A;

        if (!IsHarvester) return B;

        return C;

        …Would generate an ‘unreachable code’ error for the ‘return C’ line.

        if (UnitIsHarvester()) return A;

        if (!UnitIsHarvester()) return B;

        return C;

        …may not, as the compiler/analyzer may not know whether any of the lines between the calls change the return value of the function.

        Fortunately, these days we all use static analyzers to find this kind of bug. Don’t we?
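        A self-contained version of the cached-result pattern above (type and function names are illustrative, not StarCraft’s actual code):

        ```c
        #include <assert.h>

        typedef enum { SCV, DRONE, PROBE, MARINE } UnitType;
        typedef struct { UnitType type; } Unit;

        static int UnitIsHarvester(const Unit *u) {
            return u->type == SCV || u->type == DRONE || u->type == PROBE;
        }

        /* Caching the result in a const local means the value provably cannot
           change between the two tests, so an analyzer can flag the final
           return as unreachable -- unlike calling the function twice. */
        static int Classify(const Unit *u) {
            const int isHarvester = UnitIsHarvester(u);
            if (isHarvester)  return 1;  /* A */
            if (!isHarvester) return 2;  /* B */
            return 3;                    /* C: unreachable */
        }

        int main(void) {
            Unit scv = { SCV }, marine = { MARINE };
            assert(Classify(&scv) == 1);
            assert(Classify(&marine) == 2);
            return 0;
        }
        ```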

  • http://twitter.com/MrPatrickWright Patrick Wright

    Great post!!

  • Terje Mathisen

    Nice article!

    You’ve probably heard (from John Cash?) about the HW problem id Software had when developing Quake:

    An intermittent glitch where a single pixel would flash, for a single frame.

    After lots & lots of debugging Mike Abrash found out that they had been sold an overclocked Pentium (90->100 MHz) and the fp unit would sometimes not quite finish all fp operations. :-(

    Personally I was involved with the Pentium FDIV debacle, making the first public post about the bug and then writing most of the sw workaround code.

    I still think it was quite neat that we could make faulty hardware generate exact results, at a very small (1% or less for most programs) slowdown cost.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Hadn’t heard the Quake overclocking story; thanks for sharing.

      In regard to the FDIV bug in the Pentium, one thing that surprises me about computer chips is how many “errata” (AKA bugs) exist. While FDIV was the most widely publicized, there are many others that software developers (particularly compiler and OS vendors) have to work around!

      • Terje Mathisen

        Oh yes, “errata” are very common, but there is definitely a huge step between those that can be (and often have to be) handled by the OS, with no impact at all on any user-mode application, and those (like FDIV) where there is no way for the OS to trap a (possibly) faulty operation and do a fixup.

        The FOOF bug was in the former category; I know that the Linux kernel guys found a fix for it that caused effectively zero overhead for all programs.

        The most common form of OS-level errata are those caused by race conditions when updating TLBs, page mappings or other OS-level structures.

        Even pretty severe bugs of this type can be allowed to stay in production chips as long as there is a documented OS-level workaround.

        The userland bugs, however, are far more serious, and much more likely to cause CPU recalls. Compiler workarounds are only useful on architectures with very little actual binary code already in use.

        I.e. Linux on a brand new platform can live well for quite a while if a gcc fix and recompile suffices, while Windows will crash & burn in the same scenario.

        Similarly, any Android smart phone can use the Dalvik virtual machine to work around many hw glitches.

  • Zavie

    That idea of checking some results to detect faulty hardware is brilliant. I’ll keep it in my toolbox. Thanks for the very interesting read.

  • mwkaufmann

    And this is the reason why I love working in Software Development. It’s because of people like Patrick. There is hardly another field where people are so modest and honest about their errors after they have accomplished so much.

  • Martin

    >> It’s not possible for a unit to be neither “a harvester” nor “not a harvester”.

    Not at the same time, but in the many lines of code between the two checks the state of the unit object might have changed, or the unit variable might have been reused to point to a new object.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Those cases would be classified as bugs in StarCraft; units couldn’t convert types on the fly.

  • Joshua Burns

    Absolutely excellent article thanks for taking the time to write this! It’s interesting to have an inside view of the development process to one of my favorite games of all time (Star Craft).

  • Nagling Considered Harmful

    Then there’s the bug that was a feature decades ago: Nagle’s Algorithm. Many online games don’t turn off Nagling despite it increasing latency. For interactive stuff, throughput should be secondary to lower latency. Nagling belongs in the past – it attempts to fix application/protocol level bugs at the wrong layer. On modern networks Nagling combined with selective acknowledgement may have caused more network performance problems than they have solved.

    • http://www.codeofhonor.com/blog Patrick Wyatt

      Oh, Nagle’s Algorithm is still important, but you’re right that it should be disabled for game-traffic.

      For Guild Wars we left Nagling on for file-patching and server-to-server traffic, but turned it off for client->server and server->client traffic. There was some special case code I barely remember where the game server turned on Nagling under some circumstances (high latency, I believe), and it improved throughput by reducing packets-in-flight and hence per-message-overhead, for modem players.
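      Turning Nagling off is a per-socket, one-line affair with TCP_NODELAY; a minimal sketch using POSIX sockets (not the actual Guild Wars networking code):

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>

      int main(void) {
          int s = socket(AF_INET, SOCK_STREAM, 0);
          assert(s >= 0);

          /* TCP_NODELAY=1 disables Nagle's algorithm: small writes go out
             immediately instead of being coalesced, trading bandwidth
             efficiency for lower latency -- the right call for game traffic. */
          int on = 1;
          int rc = setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
          assert(rc == 0);

          /* verify the option actually took effect */
          int value = 0;
          socklen_t len = sizeof value;
          rc = getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &value, &len);
          assert(rc == 0 && value != 0);

          printf("Nagle disabled: %d\n", value != 0);
          close(s);
          return 0;
      }
      ```

      Re-enabling Nagling later (for bulk transfers, say) is the same call with the flag set back to 0.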

      • Nagling Considered Harmful

        Sure that’s right? If there’s high latency make latency worse? I guess that explains my trans-Pacific GW1 ping! I can understand turning it on for bulk-transfers, but seems bad for anything that’s latency sensitive, whether server-server, or modem comms, unless bandwidth is extremely constrained. Not sure how many 56k GW modem players there were/are, but having a semi-random 200+ milliseconds added on top of a 200ms link doesn’t make mesmers happy ;). In my opinion if messages should be grouped together it’s a “bug” that should be fixed at the application layer not the network/transport level. So whose bug is this anyway? :)

      • http://www.codeofhonor.com/blog Patrick Wyatt

        I wish I had access to the code so I could tell you more; I wrote low-level async socket code and mid-level application protocol, but did not write the code that enables/disables Nagling — that part was in the application layer.

        You’re right that enabling Nagling would add semi-random 200+ millisecond gaps between sends, which would increase latency. The default was that Nagling was off for server->client code, but would occasionally be turned on by the application layer code. In my previous post I said it was turned on for “high latency”, but a better guess is that the code was trying to detect congestion and turning on Nagling to reduce the number of packets, at the expense of increasing sending time (because messages get queued by the OS until the Nagling delay expires). Wish I could tell you more but that’s about all I can dredge up from code that written back in 2004 by someone else!

        However … Guild Wars actually plays really well on modems and low-bandwidth connections. Our two “torture tests” were three players in Australia sharing a 56K modem and playing on our Los Angeles servers, and eight players in India sharing a 110K DSL line playing on those same Los Angeles servers, and in both cases the game worked well.

      • Nagling Considered Harmful

        Thanks very much for your response. Yes Guild Wars does work on low-bandwidth connections. I’m just unhappy with the random additional 200 ms latency I seem to be getting – I have a 10Mbps connection that’s 200ms away from the GW servers, but I get 200-400+ ping. I do understand that 7 years ago a lot more gamers had 56k modems. But I hope that newer games won’t do this sort of thing, or they’d turn Nagling off server side and make Nagling configurable on the client side (since for most low bandwidth clients upload bandwidth is less than download). An extra 40 bytes 60 times a second (60fps) is only an additional 19.2Kbps. That would likely be a bad case scenario too.

  • rktsci

    In my programming career, I found a compiler bug in the IBM PL/I compiler for the 370 series mainframe in subscripted labels. (Yes, you could do a “goto label(x);”. I was messing around with the compiler on a summer job.)
    I also found a very subtle bug in Perkin-Elmer’s OS/32, while working on a device driver for a custom I/O board. The PE hardware had a DMA controller you could hand off data transfers to via a well-developed API. If a DMA transfer was going from a block of memory that had been loaded into the same physical address as its virtual address, the subroutine assumed that the transfer was being done on behalf of the OS, not a user program, and the transfer was done without virtual address translation. So it would transfer to the wrong location in the machine and cause a crash. I found this one by putting a unique data pattern into the source location, doing the transfer, getting a dump of all 16 meg of the system memory, finding the data pattern, having an ah-ha moment, and talking to PE. Our workaround was to tell the linker to put the data blocks outside the physical address limits of the machine, up at the top of the virtual address space.

  • http://www.facebook.com/crysalim Chris Riccobono

    This is a fantastic article, not only because of the content but the references you cite. In particular, the documents on templates and language complexity are awesome (especially to a novice/intermediate programmer like me)

    Being on the other side of the fence has shown me how weird troubleshooting hardware problems can be. There’s one bug I have run into 3 or 4 times in particular that is absolutely beyond me: if I alt-tab too many times in Windows 7, especially while playing World of Warcraft, my computer can freeze.

    This ends up not being a normal freeze – the computer can’t even boot correctly afterwards! The specific error says “BOOTMGR is missing, ctrl del alt to restart”, and can only be rectified with a repair from an install dvd.

    It can be something with my computer in general, with all games, with my habit of alt tabbing too much and too often, or something else completely. I hope one day to actually figure it out though.. haha.

  • Guilherme Gibertoni

    Great post. I’ve stumbled onto bugs that to this day I don’t know what was going on.
    The best solution for me is to explain my code to someone and eventually rewrite it, module by module, observing the effects.
    I prefer “losing” time rewriting the code exactly as it is to staring at lines up and down and not figuring out the problem.

  • http://twitter.com/LyndonArmitage Lyndon Armitage

    Really good article! Thanks for sharing your insight!

  • Gunther

    TL;DR.

  • http://www.doc.ic.ac.uk/~lwy08 Lee Wei Yeong

    Any idea where the OsStress module is hosted? github?

    • http://www.codeofhonor.com/blog Patrick Wyatt

      The source code is part of the Guild Wars code base owned by ArenaNet, and is not public, I’m afraid.

  • Paul

    In addition to the technical insight I gained from your articles, I’ve had some good laughs. Thanks for sharing, Patrick!

  • CiC

    A related blog about how redis detects memory errors:
    http://antirez.com/news/43

  • xboi209

    I feel like I can relate a bit to this article. I do some coding in JavaScript and some CSS, and when people don’t update their web browsers (GRR IE6!!!) they get different results, even though whenever I test with my updated web browser I get perfectly good results :D

  • Justin

    I can’t wait for the story on how pro SC players effected your work. You made it sound as if it had a profound impact in you how crazy those players were at the game.

    • Cleroth Sun

      Affected*

  • http://twitter.com/soy_yuma Alejandro Cámara

    It would have been terrific if you had gathered statistics on the number of bug reports in summer vs. winter (in the Northern Hemisphere) to see the impact of heat on hardware.

  • Mai Hanafy

    thanks ..

  • KamronBennett

    Loved the article. As a beginner you sometimes think you made some silly newbie mistake when in fact it is something else, but this has motivated me to keep going, because it is in fact possible for the machine to fail and not always me. Loved it!