API Rate Limiting: What Actually Works


Algorithms, headers, and deployment patterns for rate limiting APIs -- drawn from building financial data services at the fintech startup.

Quick take

Rate limiting is the difference between a healthy API and a 3am incident. Pick token bucket, use Redis, return proper headers, and stop pretending IP-based limits are enough.


At the fintech startup we serve financial news and data through APIs that get hammered by everything from institutional trading bots to someone’s weekend scraping project. I learned rate limiting the hard way: our search endpoint went down because a single client was firing 4,000 requests per minute with no backoff. No rate limiter in place. Just raw, unprotected endpoints and a very bad Monday morning.

After that, I spent a solid two weeks building out our rate limiting infrastructure. Here’s what I learned.

Why You Need It Yesterday

Rate limiting isn’t a nice-to-have. It’s the control plane for your API. Without it:

  • One buggy client retry loop takes out your entire service
  • A scraper with no sleep interval saturates your database connections
  • Your expensive endpoints (search, aggregation, report generation) eat all the compute, starving the cheap ones
  • Your cloud bill explodes because downstream dependencies charge per call

At the fintech startup, our financial data endpoints had wildly different costs. A simple quote lookup was cheap. A full news aggregation with NLP sentiment scoring? That could take seconds and significant compute. Treating them the same was a mistake we made early on.

The Decisions That Matter

Before you pick an algorithm, answer these four questions.

What’s the unit of work?

Requests per minute is the obvious one. But it’s not always right. If your endpoints vary wildly in cost – like ours did – you want cost-based budgets. A sentiment analysis call might cost 10 tokens. A simple metadata fetch costs 1. Flat request counts hide this.
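One way to make cost-based budgets concrete is a per-endpoint cost table that the limiter consults before charging tokens. A minimal sketch in Go; the paths and token costs here are illustrative, not our actual numbers:

```go
package main

// Hypothetical per-endpoint token costs. In practice the real
// numbers come from profiling what each endpoint actually spends
// in compute and downstream calls.
var endpointCost = map[string]int{
	"/v1/quote":     1,  // cheap metadata fetch
	"/v1/news":      3,  // moderate: database aggregation
	"/v1/sentiment": 10, // expensive: full NLP pipeline
}

// costOf returns the token cost for a path, defaulting to 1 for
// anything not in the table so new endpoints are never free.
func costOf(path string) int {
	if c, ok := endpointCost[path]; ok {
		return c
	}
	return 1
}
```

The default-to-1 fallback matters: an endpoint you forgot to classify should still count against the budget.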

Who are you actually limiting?

Your options:

  • IP address – Easy. Also terrible for anything serious. NAT gateways, corporate proxies, shared WiFi. We had a financial institution where 200 different users appeared as one IP. Useless.
  • API key / access token – Much better. This is what you want for authenticated traffic.
  • Organization / tenant – Best for B2B SaaS. We used org-level limits at the fintech startup because a single company might have multiple API keys across different services.

Pick the most granular identifier your auth model supports. You can always aggregate up.

How much burst do you allow?

Financial data is bursty by nature. Market opens, everyone hits the API at once. We needed to allow short bursts without letting sustained abuse through. This is where algorithm choice matters most.

Global or per-endpoint?

Start with a global limit. Then add per-endpoint limits where you know the cost is uneven. We had a global cap of 600 requests/minute but dropped the search endpoint down to 60/minute because it was 10x more expensive than everything else.

Algorithms: The Real Tradeoffs

Fixed Window

Count requests in a hard time window. 1,000 per hour, reset on the hour.

Dead simple. Also has a nasty edge case: a client can burn 1,000 requests at 12:59 and another 1,000 at 13:00. You just allowed 2,000 in two minutes. For financial APIs where burst = real money, this matters.
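The boundary problem is easy to see in code. Here is a minimal in-memory fixed-window counter (a sketch; production versions keep this state in Redis), followed by the exact burst scenario described above:

```go
package main

// fixedWindow is a minimal fixed-window counter: requests are
// bucketed by which window they fall into, and the count resets
// the moment a new window starts.
type fixedWindow struct {
	limit      int
	windowSecs int64
	windowID   int64
	count      int
}

// allow reports whether a request at unix time `now` fits under
// the limit for the window containing `now`.
func (f *fixedWindow) allow(now int64) bool {
	id := now / f.windowSecs
	if id != f.windowID {
		f.windowID = id // new window: counter resets to zero
		f.count = 0
	}
	if f.count >= f.limit {
		return false
	}
	f.count++
	return true
}
```

With a limit of 2 per 60-second window, a client can spend its full budget at second 59 and again at second 60: four requests in two seconds, double the nominal rate, and the limiter never objects.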

Sliding Window Counter

Uses the current and previous window counts with a weighted average. Good enough for most cases. This is what a lot of distributed systems default to because it’s cheap to compute and reasonably accurate.
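The weighted average is simple: scale the previous window's count by how much of it still overlaps the sliding window, then add the current count. A sketch of just the estimate:

```go
package main

// slidingEstimate approximates the request count over the last
// full window. `elapsed` is the fraction (0..1) of the current
// window that has passed, so (1 - elapsed) of the previous
// window still overlaps the sliding window.
func slidingEstimate(prevCount, currCount int, elapsed float64) float64 {
	return float64(prevCount)*(1.0-elapsed) + float64(currCount)
}
```

For example, 25% into the current window with 80 requests in the previous window and 30 so far in this one, the estimate is 80 × 0.75 + 30 = 90. It assumes requests were spread evenly across the previous window, which is the source of its (usually acceptable) inaccuracy.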

Token Bucket

Tokens refill at a steady rate. Each request burns one or more tokens. Bucket has a max capacity that controls burst size.

This is what we ended up using at the fintech startup. Here’s why: it naturally handles burst (the bucket fills up during quiet periods) while enforcing a hard average rate. And it maps perfectly to cost-based limiting – an expensive endpoint just costs more tokens.

A simplified Redis implementation:

-- Token bucket in Redis via Lua script
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local cost = tonumber(ARGV[3])         -- tokens this request costs
local now = tonumber(ARGV[4])

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + (elapsed * refill_rate))

local allowed = 0
if new_tokens >= cost then
    new_tokens = new_tokens - cost
    allowed = 1
end

-- Persist the refilled state either way, and expire idle
-- buckets after roughly two full refill cycles
redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)

return {allowed, new_tokens}  -- 1 = allowed, plus remaining tokens

The Lua script is atomic in Redis. No race conditions. No distributed locks. It just works.

Leaky Bucket

Processes requests at a constant rate, queues or rejects the rest. Great for smoothing traffic, but clients hate it when they can’t burst at all. We tried it briefly and got complaints from trading firms whose workflows were inherently bursty. Switched to token bucket within a week.
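To see why bursty clients suffer, here is a sketch of the leaky bucket's "meter" variant, where the water level drains at a fixed rate and a request is rejected if adding it would overflow. With capacity 1 it degenerates to a strict constant rate (the parameters below are illustrative):

```go
package main

// leakyBucket (meter variant): each request adds one unit of
// "water"; the level drains at leakRate units per second. A
// request that would overflow the capacity is rejected.
type leakyBucket struct {
	capacity float64
	leakRate float64 // units drained per second
	level    float64
	lastLeak int64 // unix seconds of the last drain
}

func (b *leakyBucket) allow(now int64) bool {
	// Drain based on elapsed time, clamping at empty.
	elapsed := float64(now - b.lastLeak)
	b.level -= elapsed * b.leakRate
	if b.level < 0 {
		b.level = 0
	}
	b.lastLeak = now
	if b.level+1 > b.capacity {
		return false
	}
	b.level++
	return true
}
```

Compare this with the token bucket: there, quiet periods bank tokens that can be spent in a burst. Here, quiet periods just drain the bucket back to empty, and the client still can't exceed the steady rate even momentarily.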

Implementation: Where and How

Enforce at two layers

At the API gateway, put a coarse global limit. This catches the obvious abuse before it hits your application servers. Nginx, Kong, whatever you use – it can do basic rate limiting.

Inside your application, add the smart limits. Per-endpoint costs, per-org budgets, tiered plans. This is where the token bucket with Redis lives.

// Middleware pseudocode
func RateLimitMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        orgID := extractOrgID(r)
        cost := endpointCost(r.URL.Path)

        allowed, remaining := tokenBucket.Allow(orgID, cost)

        w.Header().Set("X-RateLimit-Limit", "600")
        w.Header().Set("X-RateLimit-Remaining", strconv.Itoa(remaining))
        w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(resetTime(), 10))

        if !allowed {
            w.Header().Set("Retry-After", "30")
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }

        next.ServeHTTP(w, r)
    })
}

Always return headers

Every response, not just 429s. Clients need to know where they stand. Minimum set:

X-RateLimit-Limit: 600
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1539628800

When you reject, give them something useful:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{"error": "rate_limit_exceeded", "retry_after": 30}

We included retry_after in the JSON body too because some HTTP client libraries make it annoyingly hard to read response headers on error responses.

Redis for distributed counters

Single-node in-memory counters are fine for prototyping. In production with multiple app servers, you need shared state. Redis with EVAL (Lua scripts), INCR, and EXPIRE gives you atomic operations with sub-millisecond latency. We ran a single Redis instance dedicated to rate limiting – separate from our caching layer – and it handled the load without breaking a sweat.

Tiered limits map to product tiers

Free tier: 100 requests/minute, no access to premium endpoints. Pro: 600/minute, full access. Enterprise: custom limits, dedicated support.

This is a product decision as much as a technical one. Your rate limits are your pricing tiers. Make them explicit in your docs and in your headers.
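In code, this can be as simple as a tier table the limiter reads at request time. A sketch with the tiers described above (the struct and field names are illustrative):

```go
package main

// tierConfig maps a product tier to limiter settings. In a real
// system this would come from the billing database, not a
// hard-coded map.
type tierConfig struct {
	requestsPerMin int
	premiumAccess  bool // may this tier call premium endpoints?
}

var tiers = map[string]tierConfig{
	"free": {requestsPerMin: 100, premiumAccess: false},
	"pro":  {requestsPerMin: 600, premiumAccess: true},
	// Enterprise limits are negotiated per contract and loaded
	// dynamically rather than hard-coded here.
}
```

The point is that the same table drives the rate limiter, the docs, and the pricing page, so they can't drift apart.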

Fail Open or Fail Closed?

When your Redis instance goes down (and it will, eventually), what happens? You have two choices:

Fail open: let all traffic through. Preserves availability but you’re flying blind. One bad client can take you out.

Fail closed: reject everything. Safe but brutal. Legitimate users get 429s for no reason.

We chose fail open with aggressive alerting. The reasoning: a brief period without rate limiting was less damaging than blocking all our paying customers. But we had circuit breakers on the expensive downstream calls as a secondary safety net. Pick explicitly. Don’t let this be a surprise during an incident.
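Fail-open is cleanest as a wrapper around the limiter, so the policy lives in one place. A sketch, assuming a hypothetical `limiter` function type standing in for the Redis-backed token bucket:

```go
package main

import "errors"

// limiter abstracts the Redis-backed check; err is non-nil when
// the backing store is unreachable. (Illustrative type, not a
// real library API.)
type limiter func(key string, cost int) (allowed bool, err error)

// failOpen wraps a limiter so that backend errors admit the
// request instead of rejecting it. In production you would also
// increment an error metric here and alert aggressively, since
// every admitted-on-error request means you're flying blind.
func failOpen(l limiter) limiter {
	return func(key string, cost int) (bool, error) {
		allowed, err := l(key, cost)
		if err != nil {
			return true, nil // Redis down: let traffic through
		}
		return allowed, nil
	}
}

var errRedisDown = errors.New("redis unreachable")
```

Flipping the policy to fail-closed is a one-line change in the wrapper, which is exactly why it should be an explicit decision rather than an accident of whatever your error handling happens to do.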

The Checklist

If you’re implementing rate limiting from scratch, here’s the short version:

  • Use API key or org ID as the limiting key, not IP
  • Token bucket for the algorithm unless you have a specific reason not to
  • Redis with Lua scripts for distributed counters
  • Return X-RateLimit-* headers on every response
  • Return Retry-After on 429s, both in header and body
  • Per-endpoint limits where cost varies significantly
  • Log every throttle event, alert on sudden spikes
  • Decide fail-open vs fail-closed before you need to

Build this before your traffic forces you to. Retrofitting rate limiting during an outage is one of the worst experiences in backend engineering.