Game design, game programming and more

Using transaction rate-limiting to improve service reliability

I develop and publish multiplayer games for a living, and have discovered some useful solutions for running reliable online services. The one I’m writing about today is how to implement reasonable usage limits so that services are less likely to be abused by hackers.

Y’see, hackers find ways to manipulate games by simulating the behavior of human players, but in ways that exploit bugs and misfeatures; for example, by performing tasks faster than humans could ever hope to do. One example is a “speed hack”, where a player is able to perform actions like swinging a sword more rapidly than should be possible according to the rules of the game. If you’ve ever seen a monster go from full health to instant death in a flurry of blows you’ve likely seen a speed-hack.

Now honestly a speed hack doesn’t sound like such a bad thing; before I was a game programmer I speed-hacked games myself, and thought it was more fun than playing! And I also had to “slow-hack” several older games (including one I developed — Warcraft) to make them playable on fast computers, because the programmers (me included) had forgotten that computers keep getting faster every year.

But then there are professional cyber-criminals, who steal accounts, use speed-hacks to generate lots of game-gold, and sell it for real money to other players, something known as Real Money Trading, or RMT for short. And with such hacks it becomes impossible for honest players to keep up. So what’s to be done?

The code I’ve shared below implements rate-limiting useful for preventing many forms of speed hacking. I’ve used similar code for login rate-limiting, to prevent hackers from brute-force cracking account passwords. It can be used to moderate online chat so that one person can’t “flood” a channel with messages. It’s great for transaction rate-limiting to ensure that no one person can overwhelm a server with requests. And in fact I’ve used it to successfully mitigate distributed denial of service (DDOS) attacks, which I’ll detail in a future article.

Here’s how to use the rate-limiter:

// Using these values a player can attempt to login once
// every 30 seconds, but with as many as ten login attempts
// in a burst. While this sounds like a lot many players
// forget their passwords and need a number of attempts to
// remember it, which I discovered by analyzing log files.
// They should try LastPass, which is an awesome solution
// to this problem.
const unsigned LOGIN_COST_MS      = 30 * 1000;
const unsigned MAX_LOGIN_COST_MS  = 10 * LOGIN_COST_MS;

ErrorCode CPlayer::PlayerLogin () {
    if (!m_rateLimiter.AddTime(LOGIN_COST_MS, MAX_LOGIN_COST_MS))
        return ERROR_LOGIN_RATE_LIMIT;
    ... login code here
}

In any event, here’s the code, which is surprisingly trivial considering how powerful it is. The algorithm is a modified form of the “Leaky Bucket Algorithm as a Meter“, but uses the passage of time instead of incrementing a counter to perform its magic.

RateLimiter.h definitions:

class CRateLimiter {
public:
    CRateLimiter ();
    bool AddTime (unsigned costMs, unsigned maxCostMs);

private:
    unsigned m_timeMs;
};

RateLimiter.cpp implementation:

CRateLimiter::CRateLimiter ()
:   m_timeMs(PlatformTimeMs())
{}

bool CRateLimiter::AddTime (unsigned costMs, unsigned maxCostMs) {
    ASSERT(costMs < maxCostMs);

    // Reset rate-limiter time value if it has expired
    // - handles integer overflow safely
    unsigned currTimeMs = PlatformTimeMs();
    if ((int) (currTimeMs - m_timeMs) > 0)
        m_timeMs = currTimeMs;

    // Has the user accrued too much time-cost?
    // - handles integer overflow safely
    unsigned newTimeMs = m_timeMs + costMs;
    if ((int) (newTimeMs - currTimeMs) >= (int) maxCostMs)
        return false;

    m_timeMs = newTimeMs;
    return true;
}

And somewhere you’ll have to define the PlatformTimeMs function:

unsigned PlatformTimeMs () {
#if defined(_WINDOWS_)
    return GetTickCount();
#else
    #error Your implementation here
   // something like clock_gettime(CLOCK_MONOTONIC, ...) for Unix/Linux
#endif
}

I hope you’ll find this code useful for your project, and would enjoy hearing about the novel purposes you find for it.

About Patrick Wyatt

As a game developer with more than 22 years in the industry I have helped build small companies into big ones (VP of Blizzard, Founder of ArenaNet, COO of En Masse Entertainment); lead the design and development efforts for best-selling game series (Warcraft, Diablo, Starcraft, Guild Wars); written code for virtually every aspect of game development (networking, graphics, AI, pathing, sound, tools, installers, servers, databases, ecommerce, analytics, crypto, dev-ops, etc.); designed many aspects of the games I've shipped; run platform services teams (datacenter operations, customer support, billing/accounts, security, analytics); and developed state-of-the-art technologies required to compete in the AAA+ game publishing business.

Comments

  1. We’ve been using this method successfully for quiet some time now. The leaky bucket counter seems to get forgotten a lot when it comes to rate-limiting.

    You also should mention, though, that one should never ever tie the CRateLimiter instance to something that the hacker has exclusive control of. More specifically, you should rate-limit the password trials per account instead of per IP (think: 4 bytes per IPv6-address of state _you_ have to maintain)…

    • I’m glad to hear you’re getting good results with it.

      I think the right approach for rate-limiting is to do both per-IP and per-account rate limiting, though they both have weaknesses.

      The weakness of account-based rate-limiting is that a hacker can perform a denial of service attack against a player so that player cannot login. Ouch!

      And you’ve mentioned the weakness for per-IP limiting: with so many IPv6 addresses available to a hacker from even one computer, it’s trivial to make a server run out of memory tracking IPv6 addresses!

      For all the servers I’ve written, the latter problem has not been an issue because the servers only offer IPv4 endpoints. While a hacker with a botnet can cause some pain, even with 100K bots the memory overhead for rate-limiting is so small as to present no issues (100K address * 4 bytes/address => practically nothing for a modern server).

      I have often wondered what to do when servers are required to offer IPv6, but my feeling is that — for the foreseeable future, end-user clients will have to utilize NAT more often before servers have to switch to IPv6.

      • Another problem with rate-limiting only by account is that an attacker can still try to brute-force multiple accounts from a single IP. So, you’re right: you still need to rate-limit on a more global criterion like IP(-block)s.

        This, however, can be problematic if multiple users share a single public IP like in an internet café or smartphone users that are using the same gateway.

        We’re not offering IPv6 yet either. But like you, I’m not sure yet how to handle it when we have to.

  2. Hey Pat, thanks for posting this (also, hi!). I’m a bit confused by something, though. It seems this algorithm is different from Leaky Bucket because your max-cost burst limit is also your cooldown period. In other words, if somebody spams 10 logins with your constants above, then it will take 300 seconds before they can login again, whereas with the constant “drip” from Leaky Bucket, they’d be able to try a single login again in 30 seconds. Am I missing something?

    • Hola Chris; good to hear from you again.

      You’re right: if you sent 10 login requests in 1 second you’d have to wait another 299 seconds before you could attempt another login.

      Another way to look at it is that the algorithm allows a long-term sustained rate of of 1 login attempt per 30 seconds, but it allows enough variability that someone could attempt up to 10 in one brief interval, as many folks do when they try to guess which of their several mostest favoritest passwords they actually used for the site.

      Incidentally, these days I’d do the whole thing a bit differently. Since it’s necessary to run a whole slew of servers for scalability and fault-tolerance, I’d actually use something like a Redis cluster to store the CRateLimiter values, and use a Redis Lua script to perform the same calculation as I’ve done here in C. By setting the Redis key-expiration time, the CRateLimiter values fall out of memory when they’re no longer needed.

      • Yeah, making it scalable is an important thing, but for my current usage I just need a simple single-threaded rate limiter like this. Hopefully I’ll need to make it scalable at some point. :)

        My point about the max was just that it acts differently than the ‘real’ leaky bucket algorithm, because there isn’t actually a way to define the steady state rate in the same way. With leaky bucket, you can immediately log in once more after the advertised timeout even if you were rate limited by the bucket filling, whereas here the timeout actually depends on how big your burst was, so it’d be hard for a user to know when they should try again (or for the login page to even say what it was). The advantage here is you are only storing a single int and also not having to tick the counter…it seems like to implement the real leaky bucket you either need to tick it or store two ints, one to track when the last call was so you can leak the right amount out of the buffer on the next call. But, it seems like the real behavior is slightly preferable since the timeout is predictable.

        Anyway, it took me a while to figure out why I was confused by your code after reading the leaky bucket wikipedia page, so hopefully these comments help people from The Future! You might want to make a note in the text near the wikipedia link or something.

        Thanks again!

      • Ah, I see your point. Fortunately it is possible to know the time until trying another login: 30 seconds. If I login 10 times in the first second, I have to wait 300 seconds until the bucket is empty (so I could try another 10 logins), but only 30 seconds until I can try the one more login. It really is the leaky bucket algorithm!

      • Ah, I see it now, I was more confused than I thought! I was missing that the currTimeMS in the second if-statement is eating away at the m_timeMS as time passes, which is the “leak”. Thanks, very nice!

Speak Your Mind

*