Rate Limiting: The Boring Feature That Saves You at 3 AM

4 min read
rate-limiting api backend go

Rate limiting algorithms, implementation tradeoffs, and practical lessons from building limiters for high-traffic APIs at a real-time messaging company.

Quick take

Token bucket for most things, sliding window counter when you need precision, and always fail open on user-facing endpoints. The algorithm matters less than the identity model and the client experience.

Rate limiting is one of those features that gets zero credit when it works and all the blame when it doesn’t. I’ve been working at a real-time messaging company where the API handles millions of requests per second across a global network. Getting rate limiting wrong at that scale doesn’t just degrade performance – it can cascade into a full outage or, worse, silently let abuse through while punishing legitimate customers.

The hard part isn’t choosing an algorithm. It’s getting the identity, scope, and failure behavior right.

What You’re Actually Protecting

A rate limiter serves four purposes, and you need to be clear about which ones matter most for each endpoint:

Availability. Prevent any single client from exhausting shared resources. This is the primary job.

Fairness. Stop one noisy tenant from crowding out everyone else. At that scale, this was non-negotiable – one customer’s batch job shouldn’t degrade another customer’s real-time chat.

Cost control. Expensive operations like message fan-out or history queries cost real money per call. Without limits, a buggy client loop can generate a surprising invoice for everyone involved.

Security. Slow down brute force, credential stuffing, and automated abuse on auth endpoints.

Picking an Algorithm

Fixed Window

Count requests per time bucket. Dead simple. The problem: a client can send a burst at the end of one window and another at the start of the next, effectively doubling the rate for a brief period. Fine for internal services. Not great for public APIs where you’re making guarantees.

Sliding Window Counter

Weight the current and previous windows to smooth out boundary bursts. Much better accuracy than fixed window, and the memory overhead is minimal – two counters per key instead of one. This is what I recommend when you need precision without paying for a full log.

Token Bucket

Tokens refill at a steady rate. Each request consumes one. Bursts are allowed up to the bucket size, but the long-term average stays controlled. Easy to explain to customers: “You get 100 requests per second with a burst allowance of 200.” For real-time messaging APIs, token bucket made the most sense because the traffic is inherently bursty.

Sliding Window Log

Track every request timestamp in a sorted set. Most precise. Also most expensive. Reserve this for low-volume, high-sensitivity endpoints like authentication or payment.

The Choices That Actually Matter

Identity

Rate limits are only as good as the key behind them. API key or user ID is the right default. IP-based limits are a blunt instrument – shared NATs, VPNs, and corporate proxies mean a single IP can represent thousands of users. In that environment, we key on the subscribe/publish key pair because that maps directly to a customer’s usage tier.

Scope and Tiers

Not every endpoint needs the same policy. Reads are cheap. Writes are expensive. Auth endpoints need aggressive limits regardless of tier.

For paid tiers, the shape of the limit matters more than the raw number. A customer might need 1000 requests per second in a burst but only 10,000 per minute sustained. Separate per-second and per-minute constraints handle this naturally.

Client Experience

This is where most teams cut corners. A 429 response without context is hostile. Provide headers:

  • X-RateLimit-Limit: the cap
  • X-RateLimit-Remaining: how much is left
  • X-RateLimit-Reset: when the window resets (Unix timestamp)
  • Retry-After: how long to wait before trying again

Clients that respect these headers will back off correctly. Clients that don’t will hammer you, but at least your limiter is doing its job.

Distributed Consistency

In a multi-region setup, you need shared counters. Redis with atomic INCR and EXPIRE is the standard choice. But what happens when Redis is down?

Fail open on user-facing endpoints. A brief window without rate limiting is better than a hard outage. Fail closed on security-sensitive endpoints – auth, password reset, API key generation. Document these decisions explicitly so the on-call engineer doesn’t have to guess at 3 AM.

Local Fallback

Keep a conservative in-process limiter as a fallback. It won’t be globally accurate, but it will prevent a single node from being overwhelmed if the central store disappears. Short-lived, conservative, and better than nothing.

The Baseline

If you’re starting from zero:

  1. Token bucket for most endpoints, sliding window log for auth
  2. Key on API key or user ID, never just IP
  3. Return proper rate limit headers on every response
  4. Monitor hit rates by customer tier and endpoint
  5. Fail open for reads, fail closed for auth
  6. Review limits quarterly as traffic patterns shift

Rate limiting is a product feature as much as an infrastructure control. The algorithm is the easy part. The policy – who gets what, when, and what happens at the boundary – is where the real work lives.