Rate limiting sits at the gateway layer for a reason: it's the one place where you can enforce policy across every upstream service without touching application code. But "rate limiting" is not a single thing. The algorithm you pick has direct consequences for how traffic degrades under load, how fair the policy is across concurrent consumers, and how complex the implementation becomes at the Redis or in-memory store level. Token bucket, leaky bucket, and fixed window each make different tradeoffs. Confusing them, or picking one for the wrong traffic shape, leads to either phantom 429s during legitimate bursts or no protection at all against sustained abuse.
Fixed Window: The Simplest Model and Its Boundary Condition
Fixed window counting is the easiest algorithm to implement: maintain a counter keyed on (consumer_id, window_start), increment on each request, and reject when the counter exceeds the limit. Windows are typically aligned to clock boundaries — top of the minute, top of the hour — which makes the math straightforward and monitoring obvious.
The problem is the boundary spike. Imagine a limit of 100 requests per minute. A consumer sends 100 requests at 23:59:59 and another 100 at 00:00:01. Each window stays within its limit, but the upstream service receives 200 requests in two seconds. This isn't hypothetical: it's exactly the traffic pattern you see from misconfigured retry logic, where clients back off until the window resets and then flood immediately.
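A minimal in-memory sketch of the counter and its boundary behavior, under stated assumptions: the class name `FixedWindowLimiter` is hypothetical, time is passed in explicitly so the example is deterministic, and old counters are never evicted (a real gateway would expire them).

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed-window counter keyed on (consumer_id, window_start)."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # never evicted in this sketch

    def allow(self, consumer_id: str, now: float) -> bool:
        # Align the window to clock boundaries (e.g. top of the minute).
        window_start = int(now // self.window_seconds) * self.window_seconds
        key = (consumer_id, window_start)
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True


# Boundary spike: 100 requests one second before a window boundary,
# 100 more one second after it.
limiter = FixedWindowLimiter(limit=100, window_seconds=60)
boundary = 86_400.0  # any multiple of 60 is a window boundary
before = sum(limiter.allow("partner-a", boundary - 1.0) for _ in range(100))
after = sum(limiter.allow("partner-a", boundary + 1.0) for _ in range(100))
print(before + after)  # 200 accepted within two seconds
```

Both batches are accepted because each lands in its own window, which is the failure mode the limit was supposed to prevent.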
Fixed window works acceptably when your traffic is steady and your limit is conservative relative to upstream capacity. It's also the right choice when auditability matters: every 429 maps cleanly to a window, and partners can calculate their own headroom without asking you. For internal developer platforms where consumers are trusted and the risk is accidental overload rather than adversarial abuse, fixed window is often sufficient.
Token Bucket: Burst-Tolerant and Intuitive
Token bucket is the algorithm most developers mean when they say "rate limiting that handles bursts." The model: each consumer has a bucket with a maximum capacity of B tokens. The bucket refills at rate r tokens per second (or per interval). Each request consumes one token. When the bucket is empty, requests are rejected.
The key property is that a consumer can accumulate tokens up to capacity B during idle periods and spend them in a burst. A bucket with capacity 60 and refill rate 1/second can handle a 60-request burst at any moment it is full, then sustain 1 req/sec thereafter. This matches real partner traffic patterns: a nightly reconciliation job hits your invoicing API with 50 requests in a few seconds, then goes quiet for 24 hours. Fixed window would reject 20 of those 50 requests on a 30 req/min limit; token bucket lets the whole burst through because the consumer had been accumulating tokens overnight.
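The refill-and-spend model above can be sketched in a few lines. This is an illustrative single-consumer version, not a production limiter: the class name `TokenBucket` is hypothetical, the bucket is assumed to start full, and time is passed in explicitly so the behavior is reproducible.

```python
class TokenBucket:
    """Single-consumer token bucket with lazy refill on each request."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # B: maximum tokens held
        self.refill_rate = refill_rate  # r: tokens added per second
        self.tokens = capacity          # assumption: bucket starts full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        # Credit tokens earned since the last request, clamped to capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=60, refill_rate=1.0)
burst = sum(bucket.allow(now=0.0) for _ in range(61))
print(burst)                  # 60: the 61st request finds the bucket empty
print(bucket.allow(now=1.0))  # True: one token has refilled after a second
```

Refilling lazily on each request avoids a background timer per consumer; the state is just two numbers.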
Implementation at the gateway requires atomic read-modify-write on bucket state per consumer per request. With Redis, the canonical implementation uses a Lua script to read the current token count and last refill timestamp, compute tokens added since the last request, clamp to capacity, deduct one, and write back, all atomically. That is a single round trip per request: cheap if your gateway is co-located with Redis, but added latency on every request path if not. In practice, gateways with millions of active keys often use local in-memory token buckets with a write-back interval, accepting slightly stale counts in exchange for single-digit microsecond overhead versus sub-millisecond Redis round trips.
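One way the Lua-script approach might look, as a hedged sketch rather than a reference implementation. The key naming (`ratelimit:<consumer_id>`), the hash fields `tokens` and `ts`, the expiry policy, and the helper `make_limiter` are all assumptions for illustration. `client` is assumed to be a redis-py client, whose `register_script` method returns a callable script object. The timestamp comes from the gateway, so gateway clocks are assumed to be reasonably synchronized.

```python
import time

# Runs atomically inside Redis. Assumed layout: a hash per consumer with
# fields "tokens" and "ts" (last refill time, in seconds).
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1])
local ts = tonumber(state[2])
if tokens == nil or ts == nil then
  tokens = capacity  -- first request from this consumer: start full
  ts = now
end
tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
-- Assumed cleanup policy: drop idle buckets after two full refill periods.
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) * 2)
return allowed
"""


def make_limiter(client, capacity: int, rate: float):
    """Return allow(consumer_id) backed by the script above."""
    script = client.register_script(TOKEN_BUCKET_LUA)

    def allow(consumer_id: str) -> bool:
        key = f"ratelimit:{consumer_id}"  # hypothetical key naming
        return script(keys=[key], args=[capacity, rate, time.time()]) == 1

    return allow
```

Usage would be along the lines of `allow = make_limiter(redis.Redis(), capacity=60, rate=1.0)` followed by `allow("partner-a")` on each request.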
Token bucket is the right default for partner-facing APIs where traffic is bursty by nature and you care about developer experience. The 429 response should include a Retry-After header or an X-RateLimit-Reset timestamp so partners know exactly when their bucket will have capacity again.
Leaky Bucket: Smoothing Over Downstream Services That Can't Handle Spikes
Leaky bucket inverts the framing: instead of asking "does this consumer have quota remaining?", it asks "can we process this request at a sustainable rate right now?" Requests enter a FIFO queue (the bucket) and are processed at a fixed drain rate. If the queue is full, the new request is dropped immediately — or, in some implementations, the caller blocks until space is available.
The effect is strict output smoothing. A downstream service that can handle 100 req/sec but starts timing out at 200 req/sec benefits from a leaky bucket in front of it: even if a consumer sends 1,000 requests in a second, only 100/sec flow through to the service. The rest queue or are shed at the gateway.
The tradeoff is latency. Requests that enter the queue experience queuing delay proportional to how full the bucket is. For synchronous APIs where callers expect p99 latency under 200ms, a leaky bucket that introduces 400ms of queuing on a moderately loaded gateway is worse than rejecting the request immediately. For batch-processing APIs where callers are submitting jobs and checking status asynchronously, queuing is fine.
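To make the queue-and-drain model and its latency cost concrete, here is a small simulation. It is a sketch under assumptions: the class name `LeakyBucket` is hypothetical, the "shed when full" variant is used rather than blocking, and time is passed in explicitly instead of running a real drain loop.

```python
from collections import deque


class LeakyBucket:
    """Bounded FIFO queue drained at a fixed rate; sheds when full."""

    def __init__(self, queue_size: int, drain_rate: float):
        self.queue_size = queue_size
        self.drain_rate = drain_rate  # requests forwarded per second
        self.queue = deque()
        self.last_drain = 0.0

    def _drain(self, now: float) -> None:
        # Forward as many whole requests as the elapsed time allows.
        leaked = int((now - self.last_drain) * self.drain_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_drain += leaked / self.drain_rate

    def offer(self, request_id: str, now: float):
        """Return the estimated queuing delay in seconds, or None if shed."""
        self._drain(now)
        if len(self.queue) >= self.queue_size:
            return None
        delay = len(self.queue) / self.drain_rate
        self.queue.append(request_id)
        return delay


# 1,000 requests arrive in the same instant at a 100 req/sec drain rate.
bucket = LeakyBucket(queue_size=40, drain_rate=100.0)
results = [bucket.offer(f"req-{i}", now=0.0) for i in range(1000)]
delays = [d for d in results if d is not None]
print(len(delays), max(delays))  # 40 accepted, worst-case wait 0.39 s
```

The queue size sets the latency ceiling directly: depth divided by drain rate. A queue of 40 at 100 req/sec means accepted requests can wait close to 400 ms, so the queue bound should be derived from the latency budget callers can tolerate.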
We're not saying leaky bucket is the wrong algorithm; we're saying it solves a different problem than token bucket. Token bucket protects a consumer's quota headroom; leaky bucket protects a downstream service's throughput ceiling. They're not interchangeable, and the gap is most visible under sudden traffic spikes.
Sliding Window Log and Sliding Window Counter: The Practical Middle Ground
Two variants close the boundary-spike gap in fixed window without going to full token bucket complexity. The sliding window log stores a timestamped log of requests per consumer and counts entries within the rolling window on each request. It's exact but memory-intensive — at 1,000 req/min, you're storing 1,000 timestamps per active consumer, which gets expensive at scale.
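A sketch of the log variant for a single consumer, assuming a hypothetical class name `SlidingWindowLog` and explicit timestamps. Only accepted requests are logged here, which is one of several reasonable choices.

```python
from collections import deque


class SlidingWindowLog:
    """Exact rolling-window limiter: one stored timestamp per accepted request."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the rolling window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True


# Same boundary pattern that defeats fixed window: 100 at t=59s, 100 at t=61s.
limiter = SlidingWindowLog(limit=100, window_seconds=60.0)
first = sum(limiter.allow(59.0) for _ in range(100))
second = sum(limiter.allow(61.0) for _ in range(100))
print(first, second)  # 100 0: the second batch is rejected in full
```

The memory cost is visible in the structure: the deque holds up to `limit` timestamps per consumer.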
The sliding window counter is a practical approximation: track counts for the current and previous fixed-window buckets, then estimate the count in the rolling window as prev_count × (1 - elapsed_fraction) + current_count, where elapsed_fraction is how far the current window has progressed. The estimate assumes the previous window's requests were evenly spread, so it carries a small approximation error (typically a few percent, more if the previous window's traffic was bunched near its end), but it is constant-memory and fast. Most gateway implementations that advertise "sliding window" rate limiting are using this approximation, not a true log.
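The weighted estimate can be sketched as follows for a single consumer. The class name `SlidingWindowCounter` and the explicit `now` argument are assumptions for illustration; the formula is the one given in the text.

```python
class SlidingWindowCounter:
    """Approximate rolling window using two fixed-window counters."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.prev_count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # Roll over; if more than one window passed, the previous is empty.
            adjacent = window == self.current_window + 1
            self.prev_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_window = window
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        estimate = self.prev_count * (1 - elapsed_fraction) + self.current_count
        if estimate >= self.limit:
            return False
        self.current_count += 1
        return True


# 100 requests at t=59s, then 100 more attempted at t=61s.
limiter = SlidingWindowCounter(limit=100, window_seconds=60)
first = sum(limiter.allow(59.0) for _ in range(100))
second = sum(limiter.allow(61.0) for _ in range(100))
print(first, second)  # 100 2
```

At t=61s the previous window is weighted at 59/60, giving an estimate of about 98.3, so two requests slip through before the estimate crosses 100. A fixed window would have admitted all 100.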
Per-Endpoint vs. Per-Consumer vs. Global Limits
The algorithm question is separate from the policy question of what you're limiting. A common mistake is to implement a single global limit per API key when partners have very different traffic profiles across endpoints. Consider a payments platform with a POST /v1/charges endpoint (expensive, touches the payment processor, limit 30/min) and a GET /v1/transactions/{id} endpoint (read-only, cacheable, limit 500/min). Applying a single 100/min key-level limit using token bucket means a batch reconciliation job pulling transaction history will eat into the same bucket as charge creation — and the first 429 the partner hits is on a charge, which is much worse than a 429 on a read.
Per-endpoint rate limits, configured alongside the endpoint definition, give you the granularity to match the limit to the actual cost. A gateway that stores rate limit configuration co-located with the API spec makes this discoverable: a partner looking at the reference docs for POST /v1/charges sees the limit right there, not buried in a separate policy document.
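A per-endpoint policy table might look like the sketch below. The limits for the two endpoints come from the payments example; the table layout, the fallback value, and the helper names `limit_for` and `bucket_key` are hypothetical.

```python
# Per-endpoint limits in requests per minute, keyed on (method, path template).
ENDPOINT_LIMITS = {
    ("POST", "/v1/charges"): 30,
    ("GET", "/v1/transactions/{id}"): 500,
}
DEFAULT_LIMIT_PER_MINUTE = 100  # assumed fallback for unlisted endpoints


def limit_for(method: str, path_template: str) -> int:
    """Look up the limit for a route, falling back to the default."""
    return ENDPOINT_LIMITS.get((method, path_template), DEFAULT_LIMIT_PER_MINUTE)


def bucket_key(api_key: str, method: str, path_template: str) -> str:
    """One limiter state per (consumer, endpoint), not one per consumer."""
    return f"{api_key}:{method}:{path_template}"


print(limit_for("POST", "/v1/charges"))                       # 30
print(bucket_key("key_123", "GET", "/v1/transactions/{id}"))  # key_123:GET:/v1/transactions/{id}
```

Keying on the path template rather than the concrete URL keeps every transaction lookup in one bucket instead of creating a bucket per transaction ID.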
Consider a growing B2B SaaS platform that ran a single 600 req/min token bucket across all endpoints for its partner API. During a routine data sync, a partner's overnight job would hit the limit and trigger retries, which compounded into a thundering herd against the POST /webhooks/retry endpoint specifically. Splitting into per-endpoint limits (50/min on write endpoints, 600/min on reads) eliminated the retry cascade without cutting the call volume available to reads.
Choosing Based on Traffic Shape, Not Convention
The decision framework comes down to three questions. First: does your consumer traffic arrive in predictable bursts, or is it meant to be smooth? Bursts favor token bucket. Second: does your downstream service have a hard throughput ceiling that needs protecting, or just a per-consumer fairness constraint? A hard ceiling favors leaky bucket. Third: is auditability more important than precision at window boundaries? If yes, use fixed window, with the limit set conservatively enough that a doubled burst across a window boundary still sits within upstream capacity.
None of these algorithms runs at zero cost. Token bucket requires atomic state per consumer per request; leaky bucket requires queue management; sliding window counter requires two counter reads and a floating-point multiply. For a gateway routing 5M req/day across 500 active API keys, the overhead is negligible. At 500M req/day across 50,000 keys with hot-path latency requirements under 5ms, the algorithm choice — and where you store the state — becomes a meaningful architectural decision.
Whatever you choose, the 429 response needs to be useful: include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset (Unix timestamp, not relative seconds). Partners who can read their own headroom in real time build much better backoff logic than partners who are guessing from exponential backoff tables.
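A small helper for building those headers, as a sketch: the function name `rate_limit_headers` and its parameters are hypothetical. It also emits Retry-After in delta-seconds form, so clients that only understand the standard header still get a usable hint.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_at: float, now: float) -> dict:
    """Headers for a 429 response; reset_at and now are Unix timestamps."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(math.ceil(reset_at)),          # absolute Unix time
        "Retry-After": str(max(0, math.ceil(reset_at - now))),  # relative seconds
    }


headers = rate_limit_headers(
    limit=100, remaining=0, reset_at=1_700_000_060.0, now=1_700_000_042.5
)
print(headers["X-RateLimit-Reset"], headers["Retry-After"])  # 1700000060 18
```

Rounding both values up errs on the side of telling the client to wait slightly longer, which avoids a retry that lands just before capacity is available.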