Game design, game programming and more

Debugging running server applications

So you’ve written an awesome online game that works perfectly in the test environment, but when real users are playing, the game server doesn’t work properly. Now what?!?

I was reading an article by Mike Perham called Debugging with Thread Dumps and wanted to share a related technique from the development of Guild Wars that might help you solve some of those hard-to-find bugs that only show up in the LIVE game environment.

For those who haven’t played Guild Wars, it’s what’s known as an “instance-based” massively-multiplayer online role-playing game, or MMORPG. In Guild Wars if you and your friends want to fight a dragon, there isn’t actually one waiting patiently in its lair for you to show up and fight. Instead, the game server creates a server game-instance right then and there, imbuing the dragon with hitpoints and a treasure-filled cavern out of thin air.

Something is broken

Well one day in Guild Wars, after the launch of the Nightfall campaign, players in one of those aforementioned game instances were having trouble: their game was crawling along at a glacial pace so it felt like their characters were running through molasses. Unfortunately this game instance was not an eight-player party mission, it was a “town”, with hundreds of players using the spot as a meeting point and jumping-off location for playing other dungeons and missions.

Brett Vickers (at that time Lead Programmer) and Mark Young (at that time Content Team Lead) — both excellent programmers — had heard about the problem and started looking into the issue. They used traditional triage methods: investigate the most likely cause of the problem to determine whether evidence supports that cause; if not, move on to the next most probable scenario.

So they checked the server CPU and memory load, which were within normal levels. While hundreds of game instances were running on the server at that time, the load wasn’t any higher than on other servers in the same datacenter. Hmmmm.

They joined other games running on the same physical server hardware, and in particular other instances of the same town map running on that server and on other servers. Aside: In Guild Wars, there can be many instances of the same map. If 3000 players want to join Ascalon City, which can hold up to 100 players per town instance, then the game will create (3000/100=) 30 town instances. None of these other town instances had slowdown problems. What next?
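The instance arithmetic in the aside can be sketched in a few lines of Python. This is only an illustration: the function name `instances_needed` is made up here, and the real server presumably creates instances on demand as players arrive rather than computing a count up front.

```python
def instances_needed(players: int, capacity: int) -> int:
    """Return how many map instances are needed to seat every player."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Ceiling division: a partly filled instance still has to exist.
    return (players + capacity - 1) // capacity


print(instances_needed(3000, 100))  # 30, matching the example above
print(instances_needed(3001, 100))  # 31, one extra player needs a new instance
```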

They reviewed code checkins since the last build, which notably hadn’t had this particular problem. Nothing leaped out and bit them. And so they continued searching, until they were stumped.

Given enough eyeballs…

One of the best solutions when blocked on a bug is to bounce ideas off other programmers. Various folks got called in, and at some point I was asked to lend a sympathetic ear. While I was one of the co-founders of ArenaNet, one of my several roles in the company was Server Team Lead, which meant I spent much of my day writing code — we were a lean shop and the founders all wrote plenty of code.

I wasn’t working on the game content directly, but did write the low-level, multi-threaded, asynchronous library that handled receiving network messages from players and other backend services and posting those messages up into the synchronous, single-threaded game logic.

As an aside, we designed the game server libraries specifically so that each game instance ran single-threaded so that the game content programming team wouldn’t have to fool with writing threaded code, which is time-consuming to develop because of synchronization worries regarding races and deadlocks. Removing these types of obstacles for content team programmers allowed them to get on with the business of writing game code, and to be much more prolific in creating the game features that players desperately want. It’s what enabled the studio to generate 3.5 million lines of code from 2000-2005, and another 3 million lines of code for Guild Wars from 2005-2007 when GWEN (Guild Wars: Eye of the North) shipped.
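To make that split concrete, here is a minimal Python sketch of the general pattern: several network threads hand messages to a queue, and one game thread drains it. It is not the ArenaNet library (which is not shown in this post), and every name in it (`GameInstance`, `post`, `tick`, `network_worker`, `run_demo`) is invented for illustration.

```python
import queue
import threading


class GameInstance:
    """Game state that is only ever touched by the thread calling tick()."""

    def __init__(self):
        self.inbox = queue.Queue()  # thread-safe hand-off point
        self.chat_log = []          # plain list: single owner, no lock needed

    def post(self, player, message):
        """Called from any network thread; never touches game state."""
        self.inbox.put((player, message))

    def tick(self):
        """Called from the game thread; drains the inbox, returns the count."""
        handled = 0
        while True:
            try:
                player, message = self.inbox.get_nowait()
            except queue.Empty:
                return handled
            self.chat_log.append(f"{player}: {message}")
            handled += 1


def network_worker(instance, player, count):
    """Stand-in for a socket reader thread."""
    for i in range(count):
        instance.post(player, f"message {i}")


def run_demo(players=4, messages_each=50):
    """Four 'network' threads post concurrently; one thread runs game logic."""
    instance = GameInstance()
    workers = [
        threading.Thread(
            target=network_worker, args=(instance, f"player{n}", messages_each)
        )
        for n in range(players)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return instance.tick()  # all game-state mutation happens here
```

The point of the design is visible in `tick()`: the code that mutates game state contains no locks at all, because the queue is the only object shared between threads.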

The biggest difficulty we faced was that it wasn’t a good idea to attach a debugger to the running server process and start poking around in memory. That would cause several thousand players on the server to be quite annoyed as their game instances lagged out. And many times when using a debugger it’s not possible to find the source of the problem anyway. We needed something … smarter.

Debugging tools on tap

One of the challenges that programmers often face when writing multi-threaded code is that of deadlocks. A deadlock is a situation where, as Wikipedia so succinctly puts it, “two or more competing actions are each waiting for the other to finish, and thus neither ever does”.

[Image: Deadlock] What a programming deadlock looks like in real life

So having experienced deadlocks myself (yeah — I write bugs too) I had already developed a solution to help uncover this type of problem. Thank you David Jefferson for CS 111!

I had written a game-crash handler based on some great exception-handling code written by Matt Pietrek for MSJ Magazine — virtually everyone who’s ever written a crash-handler for Windows apps has referred to this code. When a game crash occurred my handler would send programmers an email detailing where the problem occurred and other relevant information like register and memory contents, including most importantly a full stack-walk showing the call-chain of functions leading up to the crash.

I generalized this code so that the reporter would display a stack-walk for all running threads instead of just the one that crashed.
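The original handler was Windows code built on structured exception handling, which is not reproduced here. As a rough analogue in Python, the sketch below walks the stack of every live thread in the process. The function name `dump_all_threads` is hypothetical. `sys._current_frames()` is a real CPython function, but it is documented as being for internal and specialized purposes, so treat this as a debugging aid rather than a stable API.

```python
import sys
import threading
import traceback


def dump_all_threads() -> str:
    """Return a text report with one stack walk per live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    # sys._current_frames() maps thread id -> that thread's current frame.
    for thread_id, frame in sys._current_frames().items():
        lines.append(f"Thread {names.get(thread_id, '?')} ({thread_id}):")
        # format_stack walks from the outermost caller down to the given frame.
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
        lines.append("")
    return "\n".join(lines)
```

Calling `print(dump_all_threads())` from a console command or a signal handler gives one snapshot of what every thread is doing at that moment.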

Then I wrote some deadlock-detection code. The idea here is not to perform perfect deadlock detection, which is hard. Instead I created a background thread that periodically checks to ensure that each thread is regularly reporting its aliveness so we know the thread isn’t stuck.

Check out the deadlock detection code here
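Separately from the author's linked code, here is a minimal Python sketch of the heartbeat idea described above. It assumes each worker thread calls `heartbeat()` from its main loop. The class name `Watchdog`, its methods, and the timeout policy are all assumptions made for illustration. The optional `now` parameter exists only so the logic can be exercised without real waiting.

```python
import threading
import time


class Watchdog:
    """Flags threads that have not reported in within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}            # thread name -> last heartbeat time
        self._lock = threading.Lock()

    def heartbeat(self, name=None, now=None):
        """Worker threads call this regularly to say 'I am still alive'."""
        name = name or threading.current_thread().name
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._last_seen[name] = stamp

    def stalled(self, now=None):
        """Return the sorted names of threads whose heartbeat is overdue."""
        now = time.monotonic() if now is None else now
        with self._lock:
            return sorted(
                name
                for name, seen in self._last_seen.items()
                if now - seen > self.timeout
            )

    def run(self, interval, on_stall, stop):
        """Body of the background thread: check periodically until stopped."""
        while not stop.wait(interval):
            names = self.stalled()
            if names:
                on_stall(names)  # e.g. dump every thread's stack and alert
```

A server would start it with something like `threading.Thread(target=dog.run, args=(1.0, report, stop_event), daemon=True).start()`, where `report` is whatever callback produces the stack dump. As the post says, this does not prove a deadlock exists; it only notices that a thread has gone quiet, which is cheaper and catches infinite loops as well.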

Finding the bug

For testing purposes I had hooked up the deadlock-reporter to a server-console command, and this was the key to solving the Guild Wars bug. I ran a telnet session to the broken game server and ran the “/deadlock” command many times in succession. Each command dumped thread state so we knew what each thread was working on at the moment the dump was captured. By aggregating stack-traces from these reports we could analyze which functions showed up most frequently and pinpoint the code that was hogging the CPU time.
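Running the dump command repeatedly and tallying the results by hand amounts to a crude sampling profiler. The Python sketch below automates that tally under some simplifying assumptions: it keys on the bare function name (a real tool would include file and line to avoid collisions), and it relies on CPython's `sys._current_frames()`. The name `sample_hot_functions` is hypothetical.

```python
import collections
import sys
import threading
import time


def sample_hot_functions(samples, interval):
    """Snapshot all thread stacks repeatedly; count how often each function appears."""
    counts = collections.Counter()
    me = threading.get_ident()
    for _ in range(samples):
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue  # the sampling thread itself is not interesting
            seen = set()
            while frame is not None:  # walk from innermost frame to outermost
                seen.add(frame.f_code.co_name)
                frame = frame.f_back
            counts.update(seen)  # each function counted once per thread snapshot
        time.sleep(interval)
    return counts
```

`sample_hot_functions(50, 0.1).most_common(10)` then lists the functions that were on a stack in the most snapshots. A function that shows up in nearly every sample is either where the time is going or a caller of it.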

Mark and Brett were then able to review all relevant code changes to understand the nature of the defect. If I recall correctly, there was a map script that listened for players entering and leaving the map, and stored their data in an array.

When players left the game their data should have been deleted from the array, but the code was buggy, causing the array to grow without bound as players joined and left the game.

This bug would have eventually resulted in an out-of-memory error, but game servers have lots of memory and the game server was designed to be memory efficient, so it would have been a long time before we found the problem.

In the meantime array searches were causing the game to spend so much time looping through several million entries that the game felt laggy. We had found the bug!
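The actual map script is not shown, and the description above is from memory, so the following Python sketch is purely illustrative of this class of bug. The names `LeakyRoster`, `FixedRoster`, and `churn` are invented, and the real fix may have looked quite different.

```python
class LeakyRoster:
    """Hypothetical buggy version: departures never shrink the list."""

    def __init__(self):
        self.entries = []

    def on_enter(self, player_id):
        self.entries.append(player_id)

    def on_leave(self, player_id):
        pass  # BUG: the entry should be removed here

    def contains(self, player_id):
        # Linear scan: cost grows with every player who has ever visited.
        return player_id in self.entries


class FixedRoster:
    """Same interface, but departures are removed and lookups are hashed."""

    def __init__(self):
        self.present = set()

    def on_enter(self, player_id):
        self.present.add(player_id)

    def on_leave(self, player_id):
        self.present.discard(player_id)

    def contains(self, player_id):
        return player_id in self.present


def churn(roster, visitors):
    """Simulate `visitors` players each entering and then leaving the map."""
    for player_id in range(visitors):
        roster.on_enter(player_id)
        roster.on_leave(player_id)
```

After `churn(roster, 1_000_000)` the leaky version holds a million stale entries and scans them on every lookup, while the fixed version is empty. A busy town with constant player turnover is exactly the workload that makes such a list grow fastest, which is consistent with the problem showing up there and not in eight-player missions.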

A quick change to the code, a recompile of the game on the build server, and several minutes later the problem was fixed.

One of the great advantages of the Guild Wars development environment we created was the ability to check in code, then build and deploy it to millions of end-users across the globe with a single command, all in the span of a couple of minutes. Iteration is the key to building great games: it’s okay to make mistakes, but make them fast and fix them fast.

Incidentally, I should mention the bug wasn’t written by Mark or Brett; the guilty party shall remain nameless; let’s face it, we’ve all written bugs :)

Conclusion

So if you’re writing server code, perhaps you’ll be able to use this solution or Mike Perham’s to help diagnose trouble spots in your project.

P.S. Mike Perham is the guy who wrote Sidekiq. It’s a background message-processing library I’ve used in a Rails project that serves the same purpose as Resque or DelayedJob, but does it better. Thank you Mike!

About Patrick Wyatt

As a game developer with more than 22 years in the industry I have helped build small companies into big ones (VP of Blizzard, Founder of ArenaNet, COO of En Masse Entertainment); led the design and development efforts for best-selling game series (Warcraft, Diablo, Starcraft, Guild Wars); written code for virtually every aspect of game development (networking, graphics, AI, pathing, sound, tools, installers, servers, databases, ecommerce, analytics, crypto, dev-ops, etc.); designed many aspects of the games I've shipped; run platform services teams (datacenter operations, customer support, billing/accounts, security, analytics); and developed state-of-the-art technologies required to compete in the AAA+ game publishing business.

Comments

  1. > Iteration is the key to building great games

    Exactly! I think, even more generally, iteration is the key to building great software.

    Very enjoyable blog!

  2. What tasks can be given to a network engineer at a game company that develops MMORPGs?

    • PatrickWyatt says

      By network engineer I assume that you mean network-switching engineer (datacenter operations), as opposed to a programmer who does network coding.

      Network engineers manage designing the datacenter (colocation) network architecture, specifying and purchasing equipment, bandwidth provisioning, deploying switch/router/firewall gear configurations, network security, caching, load-balancing, monitoring, troubleshooting, auditing, postmortems, penetration testing, and lots more.

      It’s exciting work, particularly because online games with virtual worlds are under constant attack by criminals who desire to break in because game-gold is worth lots of real-world money.

  3. Would you have found the same data with a sampling profiler like VTune (on Windows) or oprofile (on Linux)?

    • I haven’t used either tool so I can’t comment, but if those tools would do the job it’s an excellent idea — saves writing code!

      An operational issue is: since VTune wasn’t installed on the server, would installing it require a reboot?
