
Rate Limiting 101. How I'd Design a Rate Limiter — A System Design Breakdown

If you've ever wondered how apps like Twitter, GitHub, ChatGPT or Claude stop you from hammering their APIs with thousands of requests, here's the answer.

[Figure: Rate Limiting 101. How rate limiters filter out requests to prevent API abuse]

Do you even use AI if you haven't been rate limited? If you've ever wondered how apps like Twitter, GitHub, ChatGPT or Claude stop you from hammering their APIs with thousands of requests, the answer is a rate limiter.

I recently went deep on how to design one from scratch, and here's everything you need to know.

What Even Is a Rate Limiter?

A rate limiter does exactly what the name says — it limits how many requests a client can make to your service in a given time window. Hit the limit, and further requests get rejected until the window resets.

It protects your servers from being overwhelmed, whether by a buggy client in a loop, a malicious actor trying to take you down, or just one heavy user eating up resources that should be shared across everyone.

Where Does the Rate Limiter Even Live?

Before thinking about algorithms, you need to answer a more fundamental question — where do you put this thing?

You have three options:

  • (bad) Embed it directly inside your application code: every service has to implement and maintain its own rate limiting logic. And if you run multiple servers, each one keeps its own separate counters, so the limit is never enforced consistently across them.
  • (worse) Run it as a separate standalone service: every single request now has to make an extra network hop to another service before it can do anything useful, which adds latency. It's also one more service to maintain and scale.
  • Place it at the API Gateway: the API Gateway already sits in front of all your services as the single entry point for every request. Putting rate limiting there gives you centralized control without adding another service to the request path.

[Figure: High-level design (HLD) for placing the rate limiter on an API gateway]
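
To make the gateway placement concrete, here is a minimal sketch of a limit check that runs before any service code does. The `Response` type, the `rate_limited` decorator, and the toy `allow_two` limiter are all hypothetical names made up for illustration; a real gateway would plug in one of the algorithms covered in this post. The one standard piece is the HTTP status: 429 Too Many Requests, optionally with a `Retry-After` header.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    """Hypothetical response type, just enough for the sketch."""
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""


def rate_limited(allow: Callable[[str], bool], retry_after_s: int = 1):
    """Wrap a handler so the limit check happens at the entry point."""
    def decorator(handler: Callable[[str, str], Response]):
        def wrapper(client_id: str, path: str) -> Response:
            if not allow(client_id):
                # Reject before the request reaches any backend service.
                headers = {"Retry-After": str(retry_after_s)}
                return Response(429, headers, "rate limit exceeded")
            return handler(client_id, path)
        return wrapper
    return decorator


# Toy limiter for the demo only: each client gets 2 requests, ever.
_seen: Dict[str, int] = {}


def allow_two(client_id: str) -> bool:
    _seen[client_id] = _seen.get(client_id, 0) + 1
    return _seen[client_id] <= 2


@rate_limited(allow_two)
def route(client_id: str, path: str) -> Response:
    return Response(200, body=f"forwarded {path} to upstream")


statuses = [route("alice", "/v1/items").status for _ in range(3)]
print(statuses)  # [200, 200, 429]
```

The handler never sees the third request, which is the point of doing the check at the gateway.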

Picking the Right Algorithm

There are four main algorithms. Each makes a different tradeoff between correctness, memory usage, and how it handles bursty traffic.

Fixed Window

It's the simplest. You divide time into fixed windows, say one minute each, and count the requests in each window against a limit, say 100 requests per minute.

The problem is a well-known edge case: a client can send 100 requests at 11:59:59 and another 100 at 12:00:00. That's 200 requests in two seconds while technically never exceeding the per-minute limit. Easy to exploit, so you want to avoid it.

[Figure: Visual representation of the Fixed Window algorithm]

Here's the code:

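Python
import time


class FixedWindow:
    """Allow at most max_requests per window of window_size seconds."""

    def __init__(self, window_size, max_requests):
        self.window_size = window_size    # window length in seconds
        self.max_requests = max_requests  # requests allowed per window
        self.requests = 0                 # requests seen in the current window
        self.window_start = time.time()

    def allow_request(self):
        now = time.time()
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_size:
            self.requests = 0
            self.window_start = now

        if self.requests < self.max_requests:
            self.requests += 1
            return True
        return False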
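
To see the boundary problem in action, here is a small sketch that replays the burst described earlier. It repeats the same fixed window logic but takes an injectable clock so time can be controlled; `FakeClock` and the `clock` parameter are additions for the demo, not part of the class shown here.

```python
class FakeClock:
    """Controllable clock so the demo does not have to sleep."""
    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now


class FixedWindow:
    """Same logic as the fixed window limiter, with an injectable clock."""
    def __init__(self, window_size, max_requests, clock):
        self.window_size = window_size
        self.max_requests = max_requests
        self.clock = clock
        self.requests = 0
        self.window_start = clock()

    def allow_request(self):
        now = self.clock()
        if now - self.window_start >= self.window_size:
            self.requests = 0
            self.window_start = now
        if self.requests < self.max_requests:
            self.requests += 1
            return True
        return False


clock = FakeClock(0.0)
limiter = FixedWindow(window_size=60, max_requests=100, clock=clock)

clock.now = 59.0  # last second of the first window
first = sum(limiter.allow_request() for _ in range(100))

clock.now = 60.0  # first second of the next window
second = sum(limiter.allow_request() for _ in range(100))

print(first + second)  # 200
```

All 200 requests are accepted within about one second of each other, even though the limit is 100 per minute.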

Sliding Window

This fixes the problem with fixed window by tracking a rolling time window instead of fixed buckets. If the window is the last 60 seconds, it's always the last 60 seconds — no boundaries to exploit. But there's a cost: you need to store a timestamp for every single request for every single user. At any real scale, that becomes a lot of memory.

[Figure: Visual representation of the Sliding Window algorithm]

Here's the code:

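Python
import time
from collections import deque


class SlidingWindow:
    """Allow at most max_requests in any rolling window_size seconds."""

    def __init__(self, window_size, max_requests):
        self.window_size = window_size    # window length in seconds
        self.max_requests = max_requests  # requests allowed per rolling window
        self.requests = deque()           # timestamps of accepted requests, oldest first

    def allow_request(self):
        now = time.time()
        # Drop timestamps that have fallen out of the rolling window.
        while self.requests and self.requests[0] <= now - self.window_size:
            self.requests.popleft()

        if len(self.requests) < self.max_requests:
            self.requests.append(now)
            return True
        return False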
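
Here is a sketch of the same boundary burst replayed against the sliding window logic, again with a hypothetical `FakeClock` and `clock` parameter added so time can be controlled. It also shows where the memory cost comes from: one stored timestamp per accepted request.

```python
from collections import deque


class FakeClock:
    """Controllable clock so the demo does not have to sleep."""
    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now


class SlidingWindow:
    """Same logic as the sliding window limiter, with an injectable clock."""
    def __init__(self, window_size, max_requests, clock):
        self.window_size = window_size
        self.max_requests = max_requests
        self.clock = clock
        self.requests = deque()

    def allow_request(self):
        now = self.clock()
        while self.requests and self.requests[0] <= now - self.window_size:
            self.requests.popleft()
        if len(self.requests) < self.max_requests:
            self.requests.append(now)
            return True
        return False


clock = FakeClock(0.0)
limiter = SlidingWindow(window_size=60, max_requests=100, clock=clock)

clock.now = 59.0
first = sum(limiter.allow_request() for _ in range(100))

clock.now = 60.0  # one second later: the earlier 100 are still in the window
second = sum(limiter.allow_request() for _ in range(100))

stored = len(limiter.requests)  # timestamps held for this one client

clock.now = 119.0  # the burst at t=59 has now aged out
later = limiter.allow_request()

print(first, second, stored, later)  # 100 0 100 True
```

As a rough, assumed estimate of the memory cost: 100 timestamps at 8 bytes each is 800 bytes per client, so one million active clients would need about 800 MB for the raw timestamps alone, before any container overhead.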

Token Bucket

This is one of the most widely used algorithms, and for good reason. Each user gets a bucket with a maximum capacity of tokens. Tokens are added at a fixed rate, say 10 per second, and each request costs one token. If the bucket is empty, the request is denied. If a user hasn't made requests for a while, their bucket fills up (never beyond its capacity) and they can burst a bit. It's memory efficient, since you only store a token count and a last-refill timestamp per user, it handles bursts naturally, and it maps well to how real users actually behave. This is your default choice.

[Figure: Visual representation of the Token Bucket algorithm]

Here's the code:

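Python
import time


class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request costs one."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # start full so a new user can burst
        self.last_refill = time.time()

    def allow_request(self):
        now = time.time()
        # Lazily add the tokens earned since the last call, capped at capacity.
        self.tokens += (now - self.last_refill) * self.rate
        self.tokens = min(self.tokens, self.capacity)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False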
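
A short sketch of the burst-then-refill behavior, using the same token bucket logic with a hypothetical `FakeClock` and `clock` parameter added for the demo. The numbers (10 tokens per second, capacity 20) are arbitrary.

```python
class FakeClock:
    """Controllable clock so the demo does not have to sleep."""
    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now


class TokenBucket:
    """Same logic as the token bucket limiter, with an injectable clock."""
    def __init__(self, rate, capacity, clock):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow_request(self):
        now = self.clock()
        self.tokens += (now - self.last_refill) * self.rate
        self.tokens = min(self.tokens, self.capacity)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


clock = FakeClock(0.0)
bucket = TokenBucket(rate=10, capacity=20, clock=clock)

# A full bucket absorbs a burst of up to `capacity` requests at once.
burst = sum(bucket.allow_request() for _ in range(30))

# Half a second later, 0.5 s * 10 tokens/s = 5 tokens have been earned back.
clock.now = 0.5
refilled = sum(bucket.allow_request() for _ in range(30))

print(burst, refilled)  # 20 5
```

The capacity bounds the burst size and the rate bounds the sustained throughput, which are the two knobs you tune per tier.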

Leaky Bucket

There's one more left: Leaky Bucket. It's almost the inverse of Token Bucket. Requests go into a queue and are processed at a constant rate, like water leaking out of a bucket at a steady drip. If the queue is full, new requests are dropped. Burst traffic gets smoothed out rather than absorbed. This is the right choice when you need consistent, predictable output, for example when you're protecting a downstream service that genuinely can't handle spikes regardless of where they come from.
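
Since Leaky Bucket is the only algorithm here without code, this is a minimal sketch of the queue-based variant, under a few assumptions: the `submit`/`drain` method names, the `FakeClock`, and the idea that a worker calls `drain()` periodically to pick up the requests that are due are all choices made for illustration.

```python
from collections import deque


class FakeClock:
    """Controllable clock so the demo does not have to sleep."""
    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now


class LeakyBucket:
    def __init__(self, leak_rate, capacity, clock):
        self.leak_rate = leak_rate  # requests released per second
        self.capacity = capacity    # maximum queued requests
        self.clock = clock
        self.queue = deque()
        self.last_leak = clock()

    def submit(self, request):
        """Queue a request; return False if the bucket is full."""
        if not self.queue:
            # An idle bucket earns no credit, so output stays smooth.
            self.last_leak = self.clock()
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests that are due for processing right now."""
        due = int((self.clock() - self.last_leak) * self.leak_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            # Advance by whole leak intervals so fractional time carries over.
            self.last_leak += due / self.leak_rate
        return released


clock = FakeClock(0.0)
bucket = LeakyBucket(leak_rate=2, capacity=5, clock=clock)

# A burst of 8 arrives at once: 5 are queued, 3 overflow and are rejected.
accepted = [bucket.submit(i) for i in range(8)]

clock.now = 1.0
first = bucket.drain()   # 1 s at 2 requests/s releases two requests
clock.now = 1.25
second = bucket.drain()  # only 0.25 s elapsed, nothing is due yet
clock.now = 1.5
third = bucket.drain()   # one more leak interval has passed

print(accepted.count(True), first, second, third)  # 5 [0, 1] [] [2]
```

However large the incoming burst, the downstream service sees at most two requests per second.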

Storing State

Now that you've picked an algorithm, you need somewhere to store each user's current token count (or request timestamps for sliding window). This needs to be fast, shared across all your API gateway instances, and durable enough that restarting a server doesn't reset everyone's limits.

A distributed cache like Redis is the standard answer. It's in-memory, so reads and writes are fast, and it supports atomic operations, which matter when you have multiple gateway instances updating the same user's count simultaneously.
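
To show what "atomic" can look like in practice, here is a hedged sketch of the token bucket check as a Redis Lua script, which Redis runs as a single atomic unit. It assumes a redis-py style client exposing `eval(script, numkeys, *keys_and_args)`; the key name `ratelimit:<user_id>`, the hash fields `tokens` and `ts`, and the `allow_request` helper are made up for illustration. It also assumes the gateways' clocks are reasonably in sync, since the timestamp comes from the caller.

```python
import time

# Runs atomically inside Redis: read state, refill, spend a token, write back.
TOKEN_BUCKET_LUA = """
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity  -- new users start full
local ts = tonumber(state[2]) or now

tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Idle buckets expire once they would have refilled anyway.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) * 2)
return allowed
"""


def allow_request(client, user_id, rate, capacity, now=None):
    """Return True if the user may proceed. `client` is assumed redis-py-like."""
    now = time.time() if now is None else now
    key = f"ratelimit:{user_id}"  # hypothetical key naming scheme
    return client.eval(TOKEN_BUCKET_LUA, 1, key, rate, capacity, now) == 1


# Usage, assuming redis-py is installed and a Redis server is reachable:
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   allow_request(client, "user-42", rate=10, capacity=20)
```

Because the whole read-modify-write happens inside one script, two gateways checking the same user at the same moment cannot both spend the last token.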

At scale, a single Redis instance becomes a bottleneck. The way you address this is with multiple shards combined with consistent hashing. The idea is simple: hash each user's ID to determine which shard their data lives on. This spreads load evenly and ensures a given user's requests always go to the same shard — which means you never have to coordinate across shards for a single rate limit check.
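
Here is a minimal sketch of a consistent hash ring, to make the sharding idea concrete. The `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration. Note that Redis Cluster itself maps keys to fixed hash slots rather than a ring like this one, so treat this as a sketch of the general technique, not of Redis internals.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def shard_for(self, user_id):
        """Walk clockwise from the user's hash to the next shard point."""
        idx = bisect.bisect_right(self._points, self._hash(user_id))
        return self._ring[idx % len(self._ring)][1]


users = [f"user-{i}" for i in range(1000)]
before = HashRing(["redis-0", "redis-1", "redis-2"])
after = HashRing(["redis-0", "redis-1", "redis-2", "redis-3"])

# The same user always lands on the same shard.
same = before.shard_for("user-42") == before.shard_for("user-42")

# Adding a shard only moves the users that the new shard takes over.
moved = [u for u in users if before.shard_for(u) != after.shard_for(u)]
print(same, f"{len(moved)} of {len(users)} users moved")
```

With plain `hash(user_id) % num_shards`, adding a shard would remap most users; with a ring, only the new shard's share moves, which here should be roughly a quarter of them.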

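For fault tolerance, you run each shard in a master-replica setup. If a master node goes down, Redis Cluster automatically promotes one of its replicas, so you get high availability without manual intervention. Reads can go to replicas, but Redis replication is asynchronous, so a replica may briefly lag behind its master. Since a rate limit check both reads and updates the count, it's safer to run the check itself against the master.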

[Figure: High-level design (HLD) with Redis Cluster introduced to store each user's bucket data]

Storing the Rules

Your rate limiter needs to know the rules — things like "free tier users get 100 requests per minute" or "premium users get 1000." These rules need to be changeable without redeploying anything, which means they need to live somewhere external that your API gateway can read from.

There are two solid approaches.

The simpler one is storing rules in a regular database like Postgres and having your API gateways poll for updates every 30 seconds or so. There's an obvious tradeoff: if you change a rule, it might take up to 30 seconds for all gateways to pick it up. For the vast majority of use cases, this delay is completely fine — rate limiting rules don't usually need to change in real time.
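
The polling approach can be sketched as a small in-memory cache that a background thread refreshes. Everything here is an assumption for illustration: the `RuleCache` class, the `fetch_rules` callable (which in a real gateway would run a query against the rules table), and the fallback default of 100.

```python
import threading
from typing import Callable, Dict


class RuleCache:
    def __init__(self, fetch_rules: Callable[[], Dict[str, int]],
                 interval_s: float = 30.0):
        self._fetch = fetch_rules      # hypothetical: loads rules from the database
        self._interval = interval_s    # how stale a rule is allowed to get
        self._rules: Dict[str, int] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def refresh(self) -> None:
        try:
            fresh = self._fetch()
        except Exception:
            return  # database unreachable: keep serving the last known rules
        with self._lock:
            self._rules = fresh

    def limit_for(self, tier: str, default: int = 100) -> int:
        with self._lock:
            return self._rules.get(tier, default)

    def start(self) -> threading.Thread:
        """Poll in the background until stop() is called."""
        def loop():
            while not self._stop.is_set():
                self.refresh()
                self._stop.wait(self._interval)
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()


# Demo with a dict standing in for the database, refreshing by hand.
db = {"free": 100, "premium": 1000}
cache = RuleCache(lambda: dict(db))
cache.refresh()

db["free"] = 50                   # someone edits the rule
stale = cache.limit_for("free")   # gateway still sees the old value
cache.refresh()                   # next poll picks it up
fresh = cache.limit_for("free")

print(stale, fresh)  # 100 50
```

Rate limit checks read from memory and never wait on the database, which is why the up-to-30-second staleness is usually an acceptable price.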

The more sophisticated option is using a configuration management service like Apache ZooKeeper. Instead of polling, each gateway sets a watch on the rules, and ZooKeeper notifies it as soon as a rule changes. If you genuinely need near-instant propagation (say, you're responding to an ongoing abuse incident and need rules enforced immediately), this is the right tool. But it comes with real operational overhead: another service to deploy, monitor, and maintain. Don't reach for it unless you have a concrete reason.
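
For comparison with polling, here is a rough sketch of the push model. It assumes the kazoo client library for ZooKeeper, whose `DataWatch` calls a function with the znode's data and stat whenever the node changes. The `RuleStore` class, the `/ratelimit/rules` path, and storing the rules as JSON are all assumptions for illustration. The demo at the bottom simulates a push by calling the callback directly, so it runs without a ZooKeeper server.

```python
import json
from typing import Dict, Optional


class RuleStore:
    def __init__(self):
        self.rules: Dict[str, int] = {}

    def on_change(self, data: Optional[bytes], stat, event=None) -> None:
        """Callback for a data watch: receives the znode's new contents."""
        if data is None:
            return  # znode missing or deleted: keep the last known rules
        # Build the new dict first, then swap it in with one assignment.
        self.rules = json.loads(data.decode("utf-8"))


def watch_rules(zk, store: RuleStore, path: str = "/ratelimit/rules") -> None:
    """Register the callback. `zk` is assumed to be a started KazooClient."""
    zk.DataWatch(path, store.on_change)


# Usage, assuming kazoo is installed and ZooKeeper is reachable:
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="127.0.0.1:2181")
#   zk.start()
#   watch_rules(zk, store)

# Simulated push, no server needed:
store = RuleStore()
store.on_change(b'{"free": 100, "premium": 1000}', None)
store.on_change(b'{"free": 10, "premium": 1000}', None)  # emergency clampdown
print(store.rules["free"])  # 10
```

The gateway code is barely longer than the polling version; the extra cost is in running and monitoring the ZooKeeper ensemble itself.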

The Bottom Line

A well-designed rate limiter doesn't require exotic technology. The defaults — API Gateway placement, Token Bucket algorithm, Redis for storage, Postgres with polling for rules — will take you very far. The more complex options like Sliding Window, ZooKeeper, or elaborate Redis cluster topologies are worth knowing, but only worth using when you have a specific problem they solve.

Most of system design is knowing when the simple thing is enough. Rate limiting is a good example.

That's it for this one. If you enjoyed reading this, follow me on X and LinkedIn for more content like this.

BTW, I'm looking for work. You can DM me on X or email me at moyezrabbani.work@gmail.com if you would like to work with me.

Here's my portfolio if you'd like to know more about me. Thank you for your time. See you in the next one.