Designing Throttles: Rate Limiting in Practice
Estimated reading time: 21-28 minutes | ~5,400 words
Why Rate Limiting Exists
Every request burns scarce resources: CPU, memory, cache, database connections, downstream quotas, and bandwidth. When demand exceeds capacity, latency rises, queues grow, and failures cascade. Rate limiting is the guardrail that keeps load inside a safe operating envelope and protects SLOs.
This post compares algorithms and walks through distributed architectures worth shipping.
TL;DR
- Rate limiting protects capacity; start from the bottleneck, not a guess.
- Pick the right dimension: tenant, user, endpoint, or downstream.
- Token bucket is the default; sliding windows are precise; leaky buckets smooth bursts.
- Use cost-based budgets when requests vary in complexity.
- For scale, use local fast paths and global budgets; accept small error for big wins.
- Expose limits to clients and test with realistic burst patterns.
Edge Gateway Decision Tree
If you can’t afford a per-request remote hop: local limiter + leased tokens (or hybrid).
If you need tight global correctness: centralized store or sharded limiter service (accept hop).
If you need burst tolerance: token bucket (default).
If you need precise rolling windows: sliding window log/counter.
Pick Your Limiter
Need burst tolerance? -> Token bucket (default)
Need strict rolling-window precision? -> Sliding window log/counter
Need smoothing (queueing or pacing acceptable)? -> Leaky bucket / queue
Need constant memory and high precision? -> GCRA (see intuition below)
Distributed and very high QPS? -> Local + leased tokens / hybrid
Running Example
A multi-tenant API with read and write endpoints runs through the post. The full design appears later, but the requirements are simple: tenants have a shared budget, users have a smaller budget, and writes cost more than reads.
Table of Contents
- TL;DR
- Edge Gateway Decision Tree
- Pick Your Limiter
- Running Example
- What Is a Rate?
- Throttle vs Rate Limiter
- Goals and Non-Goals
- Capacity Modeling: From SLO to Rate
- Choosing the Right Dimension
- Identity and Abuse at the Edge
- Composing Limits (AND / OR)
- Core Algorithms
- Tuning Burst and Cost
- Complexity-Based Rate Limiting (Cost Budgets)
- Adaptive Rate Limiting (Dynamic Budgets)
- Rate Limiting in Async and Event-Driven Systems
- Where to Enforce Limits
- Single-Node Implementation Details
- Distributed Rate Limiting Architectures
- Multi-Region and Hierarchical Limits
- Failure Modes and Edge Cases
- Observability and Client UX
- When Not to Rate Limit
- Rate Limits vs Quotas in Practice
- Client-Side Rate Limiting
- Testing Strategies
- A Concrete System Design Example
- Checklist
- Sources and Standards
- Conclusion
What Is a Rate?
A rate is a budget of work per unit time. It only makes sense if you define three things:
- Work unit: request, byte, DB query, CPU millis, or weighted cost
- Time unit: per second, per minute, per hour
- Scope: per user, per API key, per org, per IP, per endpoint, per region
Unless noted, examples use RPS. RPM is used in the running example to mirror product limits.
Most systems express limits as tokens per second. Each request burns tokens based on cost. If enough tokens exist, the request proceeds. If not, it is rejected or delayed.
Rate limiting vs related controls
These definitions keep the terms precise:
- Rate limiting: work per unit time (throughput)
- Concurrency limiting: max in-flight requests
- Quota limiting: total work per long window (daily/monthly)
- Circuit breaking: stop calling unhealthy dependencies
- Backpressure: slow producers when queues fill
In practice, several of them work best together.
Throttle vs Rate Limiter
Here is the distinction used in this post:
- Rate limiter = the budget check. It decides allow or deny for a key in a window.
- Throttle = the enforcement behavior. It slows, shapes, or paces traffic, often by delaying instead of rejecting.
A token bucket can do both. In this post, throttle means the broader toolbox: pacing, queuing, smoothing, and adaptive backpressure. Rate limiter means the budget check.
UX policy matrix:
- Reject with 429 (interactive APIs)
- Delay or pace (background jobs, batch exporters)
- Queue with a max latency budget (if tail latency can be bounded)
Goals and Non-Goals
Goals
A good limiter should:
- Protect availability by keeping load under capacity
- Enforce fairness so one client cannot starve others
- Allow small bursts for better UX
- Stay predictable so clients can plan usage
- Scale horizontally with minimal coordination
Non-Goals
- Perfect precision is not required. Small error is fine if it is safe.
- Global consistency at all times is optional. Precision can be traded for scale.
Capacity Modeling: From SLO to Rate
Derive limits from capacity, not guesswork. Treat the system as a pipeline and find the bottleneck.
[Ingress] -> [App] -> [Cache] -> [DB] -> [External API]
20k 10k 40k 4k 2k (RPS)
Bottleneck = 2k RPS
Then reserve headroom to absorb spikes.
Little’s Law as a sanity check
Use Little’s Law to cross-check throughput, latency, and concurrency:
L = λ * W
L = average in-flight requests (concurrency)
λ = throughput (requests/sec)
W = average time in system (sec)
If a system can safely handle 200 concurrent requests and average latency is 50 ms:
max_rps = 200 / 0.05 = 4000
safe_rps = 0.7 * 4000 = 2800 (30% headroom)
Little’s Law applies to long-run averages in steady-state systems, which makes it ideal for capacity planning and sanity checks.
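That arithmetic translates directly into a helper. A minimal sketch (the `headroom` parameter name is illustrative, not from the worked example):

```python
def safe_rps(max_concurrency: float, avg_latency_s: float, headroom: float = 0.3) -> float:
    """Derive a safe request rate from Little's Law: lambda = L / W, minus headroom."""
    max_rps = max_concurrency / avg_latency_s
    return (1.0 - headroom) * max_rps

print(safe_rps(200, 0.05))  # ~2800 RPS with 30% headroom
```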
Split the budget deliberately
Do not split limits evenly unless equal priority is the goal. A simple tier split:
Global: 2800 RPS
- Premium tier: 60%
- Standard tier: 30%
- Burst/unknown: 10%
This maps budgets to product tiers and keeps priorities explicit.
Choosing the Right Dimension
The hard part is deciding what to limit.
Identity dimensions
- API key (strong)
- User ID (good)
- Org / workspace / tenant (best for SaaS)
- IP address (weak; NAT and proxies blur identity)
Resource dimensions
- Per endpoint (write endpoints cost more than read)
- Per feature (upload vs search)
- Per downstream (protect a database or external API)
Weighted requests
Not all requests cost the same. Charge tokens proportional to cost.
Example:
base_cost = 1
extra_cost = floor(rows_scanned / 1000)
request_cost = base_cost + extra_cost
This keeps the limiter aligned with real resource usage.
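The cost formula above in sketch form (in a real system the row count would come from the query planner or measured work):

```python
def request_cost(rows_scanned: int) -> int:
    """base_cost = 1, plus 1 extra token per 1,000 rows scanned (floored)."""
    return 1 + rows_scanned // 1000

print(request_cost(500))   # 1 token: base cost only
print(request_cost(2500))  # 3 tokens: 1 base + 2 extra
```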
Identity and Abuse at the Edge
At the gateway, identity is layered. A practical order of authority:
- API key or OAuth client (most reliable)
- JWT subject (user)
- Device or session ID
- IP (weak, but useful for abuse)
Expect evasion: IP rotation, credential stuffing, and distributed low-rate attacks. Rate limiting helps, but bot mitigation and WAF rules are often needed for high-volume abuse.
Canonicalization matters. Normalize paths, strip irrelevant query params, and apply consistent casing before constructing per-endpoint keys. This prevents key explosions and bypasses.
Composing Limits (AND / OR)
Most systems enforce multiple limits at once. Treat them as predicates.
AND composition (most common)
A request must pass all checks.
Request -> [Global limit] -> [Tenant limit] -> [User limit] -> Allow
| fail | fail | fail
Reject Reject Reject
This enforces fairness at every layer.
OR composition (rare, but useful)
A request is allowed if it passes any of the checks. This is useful for fallback budgets.
Request -> [Primary budget] --pass--> Allow
\-> [Overflow budget] --pass--> Allow
| fail
Reject
Policy ordering
Check the cheapest limits first. If an IP-based limit rejects most abusive traffic, place it ahead of any database-backed check.
Example ordering: IP - API key - tenant - endpoint.
Concrete example
A common SaaS pattern uses tenant AND user AND IP limits:
Allow if:
tenant < 1000 RPM
AND user < 60 RPM
AND IP < 300 RPM
This keeps noisy users in check, prevents abuse from a single IP, and still protects the tenant budget.
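The rule reduces to a conjunction over per-scope counters. A sketch, with a plain dict standing in for whatever limiter backs each scope:

```python
LIMITS_RPM = {"tenant": 1000, "user": 60, "ip": 300}

def allow(counts: dict) -> bool:
    """Allow only if every scope is under its per-minute budget (AND composition)."""
    return all(counts[scope] < limit for scope, limit in LIMITS_RPM.items())

print(allow({"tenant": 999, "user": 10, "ip": 50}))  # True
print(allow({"tenant": 999, "user": 60, "ip": 50}))  # False: user budget exhausted
```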
Core Algorithms
These are the core algorithms used most often.
Code samples use Python-like pseudocode unless noted. The Redis Lua snippet is an atomic datastore primitive.
1) Fixed Window Counter
Idea: Count requests in a fixed window (e.g., per minute). Reject when the count exceeds the limit.
State: count, window_start
Pros: Simple, fast, low memory
Cons: Bursts at window boundaries can exceed the limit by nearly 2x
Limit = 5 per 60s
Time (s): 58 59 | 60 61
Requests: * * | * *
* * | * *
* | *
Window 1: [0..59] -> 5 allowed
Window 2: [60..119] -> 5 allowed
Result: 10 requests in ~2s pass at the boundary.
Running example: Fixed windows on tenant 1000 RPM allow boundary bursts; use only if that behavior is acceptable.
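A minimal fixed-window sketch (names are illustrative; time is passed in explicitly so the boundary behavior is easy to test):

```python
class FixedWindowCounter:
    """Allow at most `limit` requests per aligned fixed window of `window` seconds."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.count = 0
        self.window_start = None

    def allow(self, now: float) -> bool:
        start = now - (now % self.window)   # align to the window boundary
        if self.window_start != start:      # entered a new window: reset
            self.window_start, self.count = start, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With limit=5 and window=60, five requests at t=58s pass, a sixth is rejected, and five more pass immediately at t=60s: exactly the boundary burst shown above.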
2) Sliding Window Log
Idea: Store timestamps of recent requests and count how many fall inside the last window.
State: ordered list of timestamps
Pros: Precise
Cons: High memory and CPU at large scale
Window = last 10s, now = 23s
Keep timestamps in (13s, 23s]
Time: 8 12 | 13 17 21 22 23
Timestamps: x x | ^ ^ ^ ^ |
Running example: Use this for per-user 60 RPM when strict rolling-window precision is required, and the cardinality is manageable.
3) Sliding Window Counter (Hybrid)
Idea: Use two adjacent fixed windows and interpolate counts based on how far into the current window you are.
State: count_prev, count_curr, window_start
Pros: Good approximation, low memory
Cons: Slight error near boundaries
estimated = count_curr + count_prev * (1 - fraction_into_window)
allow if estimated <= limit
Visual (two windows with interpolation):
Previous window Current window
[0 ......... 60s] | [60s ......... 120s]
count_prev = 8 count_curr = 4
^
now = 90s -> fraction = 0.5
estimated = 4 + 8 * (1 - 0.5) = 8
Running example: A good fit for per-user 60 RPM when precision is needed but memory must stay small.
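The interpolation above in sketch form (the `<=` check mirrors the formula; the estimate is computed before charging the current request):

```python
class SlidingWindowCounter:
    """Approximate a rolling window from two adjacent fixed windows."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.count_prev, self.count_curr = 0, 0
        self.window_start = 0.0

    def allow(self, now: float) -> bool:
        elapsed = now - self.window_start
        if elapsed >= self.window:          # roll forward; drop stale history
            self.count_prev = self.count_curr if elapsed < 2 * self.window else 0
            self.count_curr = 0
            self.window_start = now - (now % self.window)
        fraction = (now - self.window_start) / self.window
        estimated = self.count_curr + self.count_prev * (1 - fraction)
        if estimated <= self.limit:
            self.count_curr += 1
            return True
        return False
```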
4) Token Bucket
Idea: A bucket fills at a steady rate. Each request consumes tokens. If enough tokens exist, allow.
State: tokens, last_refill_time
Pros: Allows bursts, easy to reason about
Cons: Requires time math
Token-bucket models define a relationship between rate, burst size, and time; they are widely used to describe rate control in networking.
Visual (time series):
Capacity = 10, refill = 1 token/sec
Time (s): 0 1 2 3 4 5
Tokens: 10 10 10 5 6 7
Events: burst(5) +1 +1
Running example: Use for tenant 1000 RPM with a burst of 1-5 seconds; charge writes at cost 5.
Pseudocode (one common implementation):
now = time()
elapsed = now - last_refill
bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_rate)
last_refill = now
if bucket.tokens >= cost:
bucket.tokens -= cost
allow()
else:
reject()
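The same logic as a small class. The injectable clock is an assumption for testability; `time.monotonic` avoids wall-clock jumps:

```python
import time

class TokenBucket:
    """Refill-on-read token bucket: capacity caps the burst, rate is tokens/sec."""
    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity, self.rate, self.clock = capacity, rate, clock
        self.tokens = capacity
        self.last_refill = clock()

    def consume(self, cost: float = 1.0) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity=10 and rate=1, a burst of 10 passes immediately; afterwards, sustained throughput settles at 1 request/sec.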
5) Leaky Bucket
Idea: Requests enter a queue that drains at a fixed rate. If the queue is full, reject.
State: queue size or backlog
Pros: Smooths bursty traffic into a steady flow
Cons: Adds latency if you queue instead of reject
Leaky buckets are commonly used for traffic shaping, smoothing output as it leaves an interface.
Visual (fixed drain):
Incoming: **** ******** **
Queue: 6 6 6 6 6 6 6 6 (bounded)
Outgoing: * * * * * * *
Token bucket vs leaky bucket (same input, different output):
Input: **** ******** **
Token bucket: **** ***** *** ** (allows bursts until empty)
Leaky bucket: * * * * * * * (smooths into steady flow)
Numeric example (queue depth):
Drain rate = 2 req/sec, queue size = 6
Input burst = 6 at t=0
t=0: queue=6
t=1: queue=4
t=2: queue=2
t=3: queue=0
Running example: Use for background exports or webhook fan-out where pacing is better than rejection.
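The numeric example can be sketched as a bounded backlog that drains continuously (names are illustrative; a real implementation would also deliver the queued work):

```python
class LeakyBucket:
    """Bounded backlog draining at a fixed rate; arrivals are rejected when full."""
    def __init__(self, drain_rate: float, queue_size: int):
        self.drain_rate, self.queue_size = drain_rate, queue_size
        self.backlog, self.last = 0.0, 0.0

    def _drain(self, now: float) -> None:
        self.backlog = max(0.0, self.backlog - (now - self.last) * self.drain_rate)
        self.last = now

    def offer(self, now: float) -> bool:
        self._drain(now)
        if self.backlog + 1 <= self.queue_size:
            self.backlog += 1
            return True
        return False

    def depth(self, now: float) -> float:
        self._drain(now)
        return self.backlog
```

A burst of 6 at t=0 fills the queue; at 2 req/sec it drains to 4, 2, and 0 over the next three seconds, matching the table above.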
6) GCRA (Generic Cell Rate Algorithm)
Idea: Track a theoretical arrival time (TAT). A request is allowed if it arrives after its allowed time, with some tolerance for burst.
State: tat (theoretical arrival time)
Pros: Precise, constant memory
Cons: Harder to reason about
GCRA is a leaky-bucket-type algorithm used in ATM networks for policing and shaping traffic.
Core concept (simplified):
if now >= tat - allowed_burst:
allow; tat = max(tat, now) + interval
else:
reject
Mental model: Each request schedules the next allowed time. The allowed burst is a time credit. This is closely related to a token bucket where interval = 1 / rate and allowed_burst encodes bucket size.
Running example: Use for per-user 60 RPM when constant memory and precise pacing matter.
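A sketch of the virtual-scheduling form: one stored timestamp per key, with the burst expressed as a time credit.

```python
class GCRA:
    """Allow a request if it arrives after TAT minus the burst credit."""
    def __init__(self, rate: float, burst: int):
        self.interval = 1.0 / rate                        # seconds per request
        self.allowed_burst = (burst - 1) * self.interval  # time credit for bursts
        self.tat = 0.0                                    # theoretical arrival time

    def allow(self, now: float) -> bool:
        if now >= self.tat - self.allowed_burst:
            self.tat = max(self.tat, now) + self.interval
            return True
        return False
```

At rate=1 and burst=3, three requests pass at t=0, the fourth is rejected, and one more becomes conforming each second: token-bucket behavior from a single float of state.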
Comparison Table
| Algorithm | Memory | Precision | Burst Handling | Implementation |
|---|---|---|---|---|
| Fixed window | O(1) | Low (boundary bursts) | Poor | Very simple |
| Sliding log | O(n) | High | Good | Expensive |
| Sliding counter | O(1) | Medium | Good | Simple |
| Token bucket | O(1) | Medium | Excellent | Simple |
| Leaky bucket | O(1) | Medium | Smooth output | Simple |
| GCRA | O(1) | High | Excellent | Moderate |
Tuning Burst and Cost
The algorithm is only half the story. Most tuning happens here.
Set burst size from UX
Users expect short bursts: page load waterfalls, retries, batch jobs. Let those through, but cap them.
Rule of thumb:
burst = 1-5 seconds of normal rate
If the normal limit is 100 RPS, allow 100-500 tokens of burst.
Use weighted costs
Assign weights to match real cost. For example:
- Read = 1 token
- Write = 5 tokens
- Search = 10 tokens
This keeps the limiter aligned with actual load.
Handle retries explicitly
If clients retry aggressively, lower the rate or enforce a retry backoff policy. Otherwise, retries amplify the load you were trying to control.
Retry storm amplification (example):
1,000 clients x 1 request = 1,000 requests
3 retries per client = 3,000 extra requests
Total load = 4,000 requests (4x)
Use idempotency keys for write endpoints so retries do not duplicate side effects.
Cost attribution for shared resources
When a request touches multiple services, decide who pays. Two common models:
- Caller pays: the API gateway charges the full cost up front based on expected downstream work.
- Per-service limits: each service charges its own budget and can reject independently.
Caller-pays is simpler for clients; per-service limits are safer for shared dependencies. In practice, mix both: charge a baseline at the edge, and let hot downstreams add their own limits.
Complexity-Based Rate Limiting (Cost Budgets)
If request cost varies a lot, stop counting requests. Count work.
The idea is simple: score each request, then enforce a budget on the sum of scores per window and per key. This is ideal for GraphQL, search, report generation, and batch endpoints.
Implement it in three steps:
- Score each request in the origin based on expected cost.
- Emit the score to the edge (response header or metadata).
- Accumulate scores per key and window; block or throttle when the sum exceeds the budget.
Some APIs accept client-declared cost hints (e.g., X-Expected-Cost) to pre-charge budgets. If used, validate server-side to prevent abuse.
Diagram:
[Client] -> [Edge] -> [Origin]
| compute cost
v
response header x-score
[Edge] sums score per key; if total > budget -> block/throttle
One concrete implementation is Cloudflare’s complexity-based rate limiting. It meters requests by a complexity score rather than by request count and expects the origin to return a response header with an integer score. The score must be in the valid range (1 to 1,000,000). Counters track total score per key and only advance when a valid score is present. The rule defines a score budget per period, a period length, and the response header name.
This pattern also appears in GraphQL query-cost scoring and API gateway plugins. The mechanics are the same: score, budget, enforce.
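The score-then-budget loop at the edge can be sketched like this. The 1 to 1,000,000 validity range follows the description above; the reset-on-period behavior is a simplifying assumption:

```python
from collections import defaultdict

class CostBudget:
    """Sum per-key complexity scores per fixed period; block once over budget."""
    def __init__(self, budget: int, period: float):
        self.budget, self.period = budget, period
        self.totals = defaultdict(int)
        self.period_start = 0.0

    def charge(self, key: str, score: int, now: float) -> bool:
        if not 1 <= score <= 1_000_000:     # invalid score: counter does not advance
            return True
        if now - self.period_start >= self.period:
            self.totals.clear()             # new period
            self.period_start = now - (now % self.period)
        self.totals[key] += score
        return self.totals[key] <= self.budget
```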
Adaptive Rate Limiting (Dynamic Budgets)
Static limits are easy to reason about, but systems are rarely static. Adaptive rate limiting adjusts limits based on live signals such as latency, error rate, or saturation.
Two common patterns:
- AIMD (Additive Increase, Multiplicative Decrease)
- Increase the limit slowly when the system is healthy.
- Decrease quickly when latency or errors spike.
- Concurrency-based control
- Keep a target concurrency and adjust allowed rate to hold that line.
Simple feedback loop:
if p95_latency > target or error_rate > budget:
limit = limit * 0.7 # fast decrease
else:
limit = limit + k # slow increase
Adaptive limiters turn traffic control into a control system. They require stable signals, smoothing, and guardrails to avoid oscillation.
A practical example is Netflix’s concurrency-limits, which adjusts limits based on observed latency.
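The feedback loop above as a tiny controller (gains are illustrative; real deployments smooth the input signal and clamp the output):

```python
class AIMDLimit:
    """Additive-increase, multiplicative-decrease limit controller."""
    def __init__(self, limit: float, min_limit: float = 1.0,
                 step: float = 10.0, backoff: float = 0.7):
        self.limit, self.min_limit = limit, min_limit
        self.step, self.backoff = step, backoff

    def update(self, healthy: bool) -> float:
        if healthy:
            self.limit += self.step  # slow additive increase
        else:                        # fast multiplicative decrease, floored
            self.limit = max(self.min_limit, self.limit * self.backoff)
        return self.limit
```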
Rate Limiting in Async and Event-Driven Systems
Request/response is only one shape. For queues, streams, and webhooks, rate limiting often lives at the producer and at the consumer.
Patterns that work well:
- Token bucket at the producer: pace emits into the queue.
- Consumer concurrency limits: bound the number of in-flight messages.
- Queue depth triggers: reduce intake when backlog grows.
- Per-subscriber budgets: avoid one consumer starving others.
In event-driven systems, a limiter is often tied to backlog rather than time. The goal is the same: keep the system inside a stable envelope.
Where to Enforce Limits
Limits can be enforced at multiple layers. Each layer trades accuracy for locality and cost.
Common placements:
- Client-side: best UX, but untrusted
- Edge / gateway: great for global protection
- Service entry: accurate for application semantics
- Dependency level: protects databases and external APIs
A common default: a coarse limit at the edge and fine-grained limits in the service.
[Client] -> [Edge/Gateway] -> [Service] -> [DB/Downstream]
| | | |
soft limit coarse precise dependency
Database-level admission control
Databases have their own limits. Common techniques include:
- Connection pooling with hard caps (e.g., pgBouncer-style)
- Query timeouts to bound worst-case work
- Admission control at the query layer (e.g., only N heavy queries at once)
These are not replacements for rate limiting, but they are the last line of defense for shared state.
Single-Node Implementation Details
On a single node, limiters are fast and simple, but you still need concurrency control and memory discipline.
1) Ring buffer for sliding windows
Buckets for last 60 seconds:
[ t-59 ][ t-58 ] ... [ t-1 ][ t ]
At each request:
- Compute bucket_index = now % 60
- If the bucket is stale, reset it
- Increment the counter
- Sum buckets for the window
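A sketch of that ring buffer, tagging each bucket with the second it last counted so stale slots reset lazily:

```python
class RingWindow:
    """Sixty one-second buckets approximating a rolling one-minute window."""
    def __init__(self, limit: int, slots: int = 60):
        self.limit, self.slots = limit, slots
        self.counts = [0] * slots
        self.stamps = [-1] * slots  # which second each bucket last belonged to

    def allow(self, now: int) -> bool:
        i = now % self.slots
        if self.stamps[i] != now:   # bucket is stale (from a previous minute)
            self.counts[i], self.stamps[i] = 0, now
        total = sum(c for c, s in zip(self.counts, self.stamps)
                    if now - s < self.slots)
        if total < self.limit:
            self.counts[i] += 1
            return True
        return False
```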
2) Control memory and cardinality
High-cardinality keys (user IDs, IPs) can explode memory. Mitigate with:
- TTL eviction for idle keys
- LRU caches
- Avoid combinatorial dimensions (per-user + per-IP + per-endpoint) unless needed
3) Atomicity
Use atomics or fine-grained locks. In many languages, a per-key mutex is enough.
Distributed Rate Limiting Architectures
Once a system scales beyond one node, the key decision is how to share state.
| Architecture | Extra hop per request | Consistency | Typical use |
|---|---|---|---|
| Centralized Redis | Yes | Strong-ish | Simple, moderate QPS |
| Sharded limiter service | Yes | Per-key strong | High QPS, controlled network |
| Leased tokens | No | Approximate | Very high QPS, edge-ish |
| Hybrid | Sometimes | Bounded error | Common “best of both” |
Correctness and bounded error
- Local-only: can exceed the global tenant rate by up to the sum of per-PoP bursts.
- Leased tokens: exceedance is bounded by unexpired leases (worst case = tokens already handed out but not yet spent).
- Hybrid: bounded by local bucket size plus slow-path refill frequency.
Option A: Centralized datastore (Redis or similar)
Each request performs an atomic increment in a shared store.
Pros: Simple, consistent
Cons: Latency and cost at high QPS
[Edge] -> [API Node] -> [Redis]
(INCR + TTL)
Redis Lua example (token bucket, atomic):
-- KEYS[1] = bucket key
-- ARGV[1] = now (ms), ARGV[2] = refill_rate (tokens/ms)
-- ARGV[3] = capacity, ARGV[4] = cost
local key = KEYS[1]
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local cap = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or cap
local ts = tonumber(data[2]) or now
local delta = math.max(0, now - ts)
local new_tokens = math.min(cap, tokens + delta * rate)
local allowed = new_tokens >= cost
if allowed then new_tokens = new_tokens - cost end
redis.call("HMSET", key, "tokens", new_tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil(cap / rate))
return { allowed and 1 or 0, new_tokens }
This script uses a single atomic update and a TTL so idle keys expire. Use a consistent time source and tune the refill rate to match the chosen time unit. If HMGET returns nils because a key expired between checks, the defaults (cap and now) are applied; the first request may see a full bucket. To tighten this, initialize with SETNX or store defaults alongside a version key.
Back-of-the-envelope latency impact
Remote limiter calls add latency. As a ballpark example, a single synchronous Redis connection with redis-benchmark -t ping -c 1 often lands around ~0.2 ms median on fast in-region networks. That implies roughly 5,000 ops/sec on a single connection (1s / 0.2ms). As a limiter, that overhead is paid on every request; at 10k RPS you need parallel connections or a local fast path to avoid bottlenecking on the limiter itself. See the AWS benchmark notes for methodology, and measure in your own environment.
Numbers vary by hardware, network, and command mix, but the lesson holds: when the remote check cost starts to dominate the request budget, use local caches, token leasing, or co-locate limiters with the service.
Option B: Dedicated limiter service (sharded)
Shard limit keys by consistent hashing. Each limiter node keeps in-memory state.
Pros: Scales horizontally, low latency
Cons: Extra hop and service to run
+------------------+
[API Node] ->| Limiter Router |
+------------------+
| | |
v v v
[L1] [L2] [L3] (sharded in-memory)
Option C: Local limiters with leased tokens
Each node holds a local bucket. Periodically, it leases tokens from a global pool.
Pros: No per-request remote calls
Cons: Approximate, needs coordination
[Global Budget]
|
| lease every N seconds
v
[Node A bucket] [Node B bucket] [Node C bucket]
Leasing mechanics (practical knobs)
- Lease size: fixed (simple) or proportional to recent demand (more adaptive).
- Renewal cadence: 1-5s with jitter to avoid stampedes.
- Quiet-to-hot transitions: keep a small warm reserve so a region can absorb a sudden spike.
- Coordinator outage: serve with diminishing local caps (fail-open) or stop at lease exhaustion (fail-closed).
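A minimal sketch of the lease flow, using single-process stand-ins; a real pool would sit behind an RPC with lease expiry and the knobs above:

```python
class GlobalPool:
    """Global budget that grants token leases to nodes."""
    def __init__(self, budget: float):
        self.remaining = budget

    def lease(self, want: float) -> float:
        granted = min(want, self.remaining)
        self.remaining -= granted
        return granted

class LocalLimiter:
    """Node-local bucket refilled by periodic leases, not per-request calls."""
    def __init__(self, pool: GlobalPool, lease_size: float):
        self.pool, self.lease_size = pool, lease_size
        self.tokens = 0.0

    def renew(self) -> None:  # run every few seconds, with jitter
        self.tokens += self.pool.lease(self.lease_size)

    def allow(self, cost: float = 1.0) -> bool:
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Two nodes leasing 60 tokens each from a 100-token pool receive 60 and 40: the global budget is never oversubscribed, and the error is bounded by unspent leases.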
Option D: Hybrid (local fast path, global slow path)
Use a local limiter first. If it is low, ask a global store for a refill.
Pros: High throughput with bounded error
Cons: More complex
Control plane vs data plane
Separate policy from enforcement so limits can change without redeploying.
[Policy Store] ---> [Limiter Config Push]
| |
v v
(limits, tiers) [Data-plane limiters]
Multi-Region and Hierarchical Limits
Global consistency across regions is expensive. Use hierarchy instead.
Global Limit (100k RPS)
|
+----+----+
| |
Region A Region B
60k 40k
| |
+--+--+ +--+--+
| | | |
Node1 Node2 Node3 Node4
Enforce locally on short timescales and rebalance globally on longer timescales.
If regions cannot reach a global coordinator, fall back to local budgets. Choose fail-open (serve with local caps) or fail-closed (block) based on risk.
Failure Modes and Edge Cases
Limiters fail in subtle ways. Design for these up front.
1) Clock skew
Time-based algorithms need consistent time. Use monotonic clocks on a single node. Across nodes, assume skew and design for approximation or centralized time.
2) Hot keys
A single key can dominate load. Mitigate with:
- Per-key sharding (hash key + salt)
- Local caching
- Elevated limits for trusted clients
3) Thundering herd on reset
If all clients retry at reset time, you get a spike. Add jitter to client retries or prefer token buckets.
4) DDoS vs legitimate bursts
Not all spikes are abuse. Distinguish between:
- Legitimate bursts (deploys, batch jobs, page loads)
- Abuse or attacks (high volume with low value)
Common signals include auth state, endpoint mix, error rates, and geo/IP reputation. Progressive penalties (delay, then lower limit, then block) and challenge flows (rate-limit pages, CAPTCHAs) help avoid hurting valid users.
5) Graceful degradation
When limits hit, dropping traffic is not the only option. Other patterns:
- Serve cached or stale data
- Return partial results
- Prioritize critical requests (VIP or internal)
- Queue and drain with a leaky bucket
- Use separate buckets or queues per priority class (P0, P1, P2)
6) Fail-open vs fail-closed
If the limiter fails:
- Fail-open keeps traffic flowing but risks overload
- Fail-closed protects systems but may cause outages
Choose based on downstream fragility and SLOs.
Production gotchas
- Monotonic time only: wall-clock jumps (NTP) can refill buckets incorrectly.
- Cardinality explosion: only add a dimension if it materially reduces abuse and the combined cardinality is bounded. Rule of thumb: avoid user x endpoint x IP unless the endpoint is high-risk and the extra dimension cuts abuse by at least 10x.
- Cost gaming: if cost comes from client or response headers, cap and validate server-side, and sample for audits.
Observability and Client UX
Treat the limiter as part of the product surface, not just an internal guardrail.
Status codes
- 429 Too Many Requests indicates the client has sent too many requests in a given amount of time (RFC 6585: https://www.rfc-editor.org/rfc/rfc6585).
- RFC 6585 notes that a 429 response may include Retry-After to tell the client when to try again.
Retry guidance
- Retry-After accepts either an HTTP-date or delay-seconds and tells the client how long to wait (RFC 9110: https://www.rfc-editor.org/rfc/rfc9110).
Recommended response contract
HTTP/1.1 429 Too Many Requests
Retry-After: 24
RateLimit-Policy: "default";q=100;w=60
RateLimit: "default";r=12;t=24
Content-Type: application/json
{"error":"rate_limited","message":"Too many requests","retry_after":24}
When multiple limits fail
Return the most actionable limit to the caller, usually the tightest scope (user over tenant, tenant over IP). Include a stable error code and the server-defined retry delay, and advise clients to add jitter.
Response headers (current standards)
The IETF HTTPAPI working group defines RateLimit and RateLimit-Policy response fields in an active Internet-Draft (work in progress, not yet an RFC): https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
The draft uses structured fields; a RateLimit item includes parameters such as remaining quota (r) and time to reset (t).
A minimal example (syntax simplified for readability):
RateLimit-Policy: "default";q=100;w=60
RateLimit: "default";r=12;t=24
Some APIs still use RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset; meanings vary across implementations, so document them clearly.
Metrics to track
- Allowed requests per key
- Rejected requests per key
- 429 rate by endpoint
- Tokens remaining distribution
- P50/P95 limiter latency
When Not to Rate Limit
Rate limits are not always the right tool. Avoid strict limits in these cases:
- Internal services with backpressure already enforced by queues or concurrency caps
- Trusted batch jobs where throughput is controlled upstream
- Admin or break-glass endpoints used during incidents (use allowlists instead)
- Health checks and liveness probes that must stay fast and reliable
In these cases, use backpressure, circuit breakers, or admission control instead of per-request rate limits.
Rate Limits vs Quotas in Practice
Rate limits control short windows; quotas control long windows. In production, both are common:
- Rate limit: 100 RPM per user (token bucket or sliding window)
- Quota: 1,000,000 requests per month (fixed window with long TTL)
Enforce both by checking the rate limiter first, then the quota counter. This keeps interactive UX snappy while still enforcing long-term budgets.
Client-Side Rate Limiting
Client-side limiters improve UX and reduce wasted retries. Common libraries include:
- Bottleneck (JavaScript) - https://github.com/SGrondin/bottleneck
- ratelimit / limits (Python) - https://pypi.org/project/ratelimit/ and https://limits.readthedocs.io/
- Guava RateLimiter (Java) - https://github.com/google/guava/wiki/RateLimiterExplained
Client-side limits are advisory, not authoritative. The server remains the source of truth.
Use client-side limits when you control the client and want to avoid wasted RTTs on 429s.
Testing Strategies
Rate limiters are easy to get wrong. Tests should include:
- Load tests with realistic burst patterns
- Time-mocking to validate window boundaries and refill logic
- Chaos tests for Redis outages or clock skew
- Golden traces to confirm fairness across keys
- Shadow mode to log would-block decisions before enforcing
Concrete example:
def test_token_bucket_refill(mock_time):
bucket = TokenBucket(capacity=10, rate=1) # 1 token/sec
bucket.consume(5)
mock_time.advance(3) # 3 tokens refill
assert bucket.tokens == 8
A Concrete System Design Example
Here is a concrete design for a multi-tenant API with read and write endpoints.
Requirements
- Each tenant gets 1,000 requests per minute (RPM)
- Each user gets 60 RPM
- Writes cost 5 tokens, reads cost 1 token
- The system runs in two regions
- Low latency; small errors are acceptable
Step 1: Choose algorithms
- Token bucket for tenant limits (burst-friendly)
- Sliding window counter for per-user limits (precise enough)
Step 2: Choose placement
- Enforce tenant limits at the edge gateway
- Enforce user limits inside the service
Step 3: Architecture
[Client]
|
v
[Edge Gateway]
| tenant token bucket
v
[API Service]
| user sliding window
v
[Downstream]
Step 4: Multi-region split
Lease a share of the tenant budget to each region and rebalance every few seconds.
Global Tenant Budget (1000 RPM)
|
+-- Region A: 600 RPM bucket
+-- Region B: 400 RPM bucket
Step 5: Cost model
read -> cost 1 token
write -> cost 5 tokens
Step 6: Pseudocode (simplified)
# edge gateway (tenant bucket)
def allow_tenant(tenant_id, cost):
bucket = local_bucket(tenant_id)
bucket.refill()
if bucket.tokens >= cost:
bucket.tokens -= cost
return True
return False
# service (user window)
def allow_user(user_id):
win = sliding_window(user_id)
return win.estimate() < 60
# AND composition for a full request
def allow_request(tenant_id, user_id, cost):
return allow_tenant(tenant_id, cost) and allow_user(user_id)
This design is fast, scales horizontally, and covers most production APIs.
Checklist
- Define the work unit and cost model
- Pick the dimension (user, tenant, IP, endpoint)
- Choose the algorithm (token bucket, sliding window, etc.)
- Decide placement (edge, service, dependency)
- Choose state strategy (local, centralized, sharded, hybrid)
- Plan for failure modes (fail-open vs fail-closed)
- Add headers and metrics for visibility
- Decide on adaptive vs static limits
- Add quota enforcement for long windows
- Test with bursts, retries, and outages
- Validate against the running example requirements
Sources and Standards
If you want to go deeper, these are good starting points:
- RFC 6585: HTTP 429 Too Many Requests - https://www.rfc-editor.org/rfc/rfc6585
- RFC 9110: Retry-After header field - https://www.rfc-editor.org/rfc/rfc9110
- IETF Internet-Draft: RateLimit and RateLimit-Policy header fields - https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
- RFC 3290: Token bucket and leaky bucket discussion (Diffserv) - https://www.rfc-editor.org/rfc/rfc3290
- RFC 2697: Token bucket metering (Diffserv) - https://www.rfc-editor.org/rfc/rfc2697
- RFC 5681: TCP Congestion Control (AIMD background) - https://www.rfc-editor.org/rfc/rfc5681
- AWS Redis benchmark notes - https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/
- Cloudflare complexity-based rate limiting - https://developers.cloudflare.com/waf/rate-limiting-rules/request-rate/#complexity-based-rate-limiting
- Netflix concurrency-limits - https://github.com/Netflix/concurrency-limits
Conclusion
Rate limiting is not a punishment. It is a promise about what a system can safely handle. Start from capacity, pick the right dimensions, and choose the simplest algorithm that meets the SLOs. Make it visible to clients, and the system scales without surprises.
The running multi-tenant API example ties it together: tenant and user budgets, cost weights, and a multi-region split meet the goals without over-engineering.