Designing Throttles: Rate Limiting in Practice
Estimated reading time: 21-28 minutes | ~5,400 words
Why Rate Limiting Exists
Every request burns scarce resources: CPU, memory, cache, database connections, downstream quotas, and bandwidth. When demand exceeds capacity, latency rises, queues grow, and failures cascade. Rate limiting is the guardrail that keeps load inside a safe operating envelope and protects SLOs.
This post compares algorithms and walks through distributed architectures worth shipping.
TL;DR
- Rate limiting protects capacity; start from the bottleneck, not a guess.
- Pick the right dimension: tenant, user, endpoint, or downstream.
- Token bucket is the default; sliding windows are precise; leaky buckets smooth bursts.
- Use cost-based budgets when requests vary in complexity.
- For scale, use local fast paths and global budgets; accept small error for big wins.
- Expose limits to clients and test with realistic burst patterns.
Edge Gateway Decision Tree
If you can’t afford a per-request remote hop: local limiter + leased tokens (or hybrid).
If you need tight global correctness: centralized store or sharded limiter service (accept hop).
If you need burst tolerance: token bucket (default).
If you need precise rolling windows: sliding window log/counter.
Pick Your Limiter
Need burst tolerance? -> Token bucket (default)
Need strict rolling-window precision? -> Sliding window log/counter
Need smoothing (queueing or pacing acceptable)? -> Leaky bucket / queue
Need constant memory and high precision? -> GCRA (see intuition below)
Distributed and very high QPS? -> Local + leased tokens / hybrid
Running Example
A multi-tenant API with read and write endpoints runs through the post. The full design appears later, but the requirements are simple: tenants have a shared budget, users have a smaller budget, and writes cost more than reads.
Table of Contents
- TL;DR
- Edge Gateway Decision Tree
- Pick Your Limiter
- Running Example
- What Is a Rate?
- Throttle vs Rate Limiter
- Goals and Non-Goals
- Capacity Modeling: From SLO to Rate
- Choosing the Right Dimension
- Identity and Abuse at the Edge
- Composing Limits (AND / OR)
- Core Algorithms
- Tuning Burst and Cost
- Complexity-Based Rate Limiting (Cost Budgets)
- Adaptive Rate Limiting (Dynamic Budgets)
- Rate Limiting in Async and Event-Driven Systems
- Where to Enforce Limits
- Single-Node Implementation Details
- Distributed Rate Limiting Architectures
- Multi-Region and Hierarchical Limits
- Failure Modes and Edge Cases
- Observability and Client UX
- When Not to Rate Limit
- Rate Limits vs Quotas in Practice
- Client-Side Rate Limiting
- Testing Strategies
- A Concrete System Design Example
- Checklist
- Sources and Standards
- Conclusion
What Is a Rate?
A rate is a budget of work per unit time. It only makes sense if you define three things:
- Work unit: request, byte, DB query, CPU millis, or weighted cost
- Time unit: per second, per minute, per hour
- Scope: per user, per API key, per org, per IP, per endpoint, per region
Unless noted, examples use RPS. RPM is used in the running example to mirror product limits.
Most systems express limits as tokens per second. Each request burns tokens based on cost. If enough tokens exist, the request proceeds. If not, it is rejected or delayed.
Rate limiting vs related controls
These definitions keep the terms precise:
- Rate limiting: work per unit time (throughput)
- Concurrency limiting: max in-flight requests
- Quota limiting: total work per long window (daily/monthly)
- Circuit breaking: stop calling unhealthy dependencies
- Backpressure: slow producers when queues fill
In practice, several of them work best together.
Throttle vs Rate Limiter
Here is the distinction used in this post:
- Rate limiter = the budget check. It decides allow or deny for a key in a window.
- Throttle = the enforcement behavior. It slows, shapes, or paces traffic, often by delaying instead of rejecting.
A token bucket can do both. In this post, throttle means the broader toolbox: pacing, queuing, smoothing, and adaptive backpressure. Rate limiter means the budget check.
UX policy matrix:
- Reject with 429 (interactive APIs)
- Delay or pace (background jobs, batch exporters)
- Queue with a max latency budget (if tail latency can be bounded)
Goals and Non-Goals
Goals
A good limiter should:
- Protect availability by keeping load under capacity
- Enforce fairness so one client cannot starve others
- Allow small bursts for better UX
- Stay predictable so clients can plan usage
- Scale horizontally with minimal coordination
Non-Goals
- Perfect precision is not required. Small error is fine if it is safe.
- Global consistency at all times is optional. Precision can be traded for scale.
Capacity Modeling: From SLO to Rate
Derive limits from capacity, not guesswork. Treat the system as a pipeline and find the bottleneck.
[Ingress] -> [App] -> [Cache] -> [DB] -> [External API]
20k 10k 40k 4k 2k (RPS)
Bottleneck = 2k RPS
Then reserve headroom to absorb spikes.
Little’s Law as a sanity check
Use Little’s Law to cross-check throughput, latency, and concurrency:
L = λ * W
L = average in-flight requests (concurrency)
λ = throughput (requests/sec)
W = average time in system (sec)
If a system can safely handle 200 concurrent requests and average latency is 50 ms:
max_rps = 200 / 0.05 = 4000
safe_rps = 0.7 * 4000 = 2800 (30% headroom)
Little’s Law applies to long-run averages in steady-state systems, which makes it ideal for capacity planning and sanity checks.
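That arithmetic translates directly into a helper. A minimal sketch (the `headroom` parameter name is illustrative, not from the worked example):

```python
def safe_rps(max_concurrency: float, avg_latency_s: float, headroom: float = 0.3) -> float:
    """Derive a safe request rate from Little's Law: lambda = L / W, minus headroom."""
    max_rps = max_concurrency / avg_latency_s
    return (1.0 - headroom) * max_rps

print(safe_rps(200, 0.05))  # ~2800 RPS with 30% headroom
```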
Split the budget deliberately
Do not split limits evenly unless equal priority is the goal. A simple tier split:
Global: 2800 RPS
- Premium tier: 60%
- Standard tier: 30%
- Burst/unknown: 10%
This maps budgets to product tiers and keeps priorities explicit.
Choosing the Right Dimension
The hard part is deciding what to limit.
Identity dimensions
- API key (strong)
- User ID (good)
- Org / workspace / tenant (best for SaaS)
- IP address (weak; NAT and proxies blur identity)
Resource dimensions
- Per endpoint (write endpoints cost more than read)
- Per feature (upload vs search)
- Per downstream (protect a database or external API)
Weighted requests
Not all requests cost the same. Charge tokens proportional to cost.
Example:
base_cost = 1
extra_cost = floor(rows_scanned / 1000)
request_cost = base_cost + extra_cost
This keeps the limiter aligned with real resource usage.
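The cost formula above in sketch form (in a real system the row count would come from the query planner or measured work):

```python
def request_cost(rows_scanned: int) -> int:
    """base_cost = 1, plus 1 extra token per 1,000 rows scanned (floored)."""
    return 1 + rows_scanned // 1000

print(request_cost(500))   # 1 token: base cost only
print(request_cost(2500))  # 3 tokens: 1 base + 2 extra
```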
Identity and Abuse at the Edge
At the gateway, identity is layered. A practical order of authority:
- API key or OAuth client (most reliable)
- JWT subject (user)
- Device or session ID
- IP (weak, but useful for abuse)
Expect evasion: IP rotation, credential stuffing, and distributed low-rate attacks. Rate limiting helps, but bot mitigation and WAF rules are often needed for high-volume abuse.
Canonicalization matters. Normalize paths, strip irrelevant query params, and apply consistent casing before constructing per-endpoint keys. This prevents key explosions and bypasses.
Composing Limits (AND / OR)
Most systems enforce multiple limits at once. Treat them as predicates.
AND composition (most common)
A request must pass all checks.
Request -> [Global limit] -> [Tenant limit] -> [User limit] -> Allow
| fail | fail | fail
Reject Reject Reject
This enforces fairness at every layer.
OR composition (rare, but useful)
A request is allowed if it passes any of the checks. This is useful for fallback budgets.
Request -> [Primary budget] --pass--> Allow
\-> [Overflow budget] --pass--> Allow
| fail
Reject
Policy ordering
Check the cheapest limits first. If an IP-based limit rejects most abusive traffic, place it ahead of any database-backed check.
Example ordering: IP - API key - tenant - endpoint.
Concrete example
A common SaaS pattern uses tenant AND user AND IP limits:
Allow if:
tenant < 1000 RPM
AND user < 60 RPM
AND IP < 300 RPM
This keeps noisy users in check, prevents abuse from a single IP, and still protects the tenant budget.
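The rule reduces to a conjunction over per-scope counters. A sketch, with a plain dict standing in for whatever limiter backs each scope:

```python
LIMITS_RPM = {"tenant": 1000, "user": 60, "ip": 300}

def allow(counts: dict) -> bool:
    """Allow only if every scope is under its per-minute budget (AND composition)."""
    return all(counts[scope] < limit for scope, limit in LIMITS_RPM.items())

print(allow({"tenant": 999, "user": 10, "ip": 50}))  # True
print(allow({"tenant": 999, "user": 60, "ip": 50}))  # False: user budget exhausted
```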
Core Algorithms
These are the core algorithms used most often.
Code samples use Python-like pseudocode unless noted. The Redis Lua snippet is an atomic datastore primitive.
1) Fixed Window Counter
Idea: Count requests in a fixed window (e.g., per minute). Reject when the count exceeds the limit.
State: count, window_start
Pros: Simple, fast, low memory
Cons: Bursts at window boundaries can exceed the limit by nearly 2x
Limit = 5 per 60s
Time (s): 58 59 | 60 61
Requests: * * | * *
* * | * *
* | *
Window 1: [0..59] -> 5 allowed
Window 2: [60..119] -> 5 allowed
Result: 10 requests in ~2s pass at the boundary.
Running example: Fixed windows on tenant 1000 RPM allow boundary bursts; use only if that behavior is acceptable.
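A minimal fixed-window sketch (names are illustrative; time is passed in explicitly so the boundary behavior is easy to test):

```python
class FixedWindowCounter:
    """Allow at most `limit` requests per aligned fixed window of `window` seconds."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.count = 0
        self.window_start = None

    def allow(self, now: float) -> bool:
        start = now - (now % self.window)   # align to the window boundary
        if self.window_start != start:      # entered a new window: reset
            self.window_start, self.count = start, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With limit=5 and window=60, five requests at t=58s pass, a sixth is rejected, and five more pass immediately at t=60s: exactly the boundary burst shown above.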
2) Sliding Window Log
Idea: Store timestamps of recent requests and count how many fall inside the last window.
State: ordered list of timestamps
Pros: Precise
Cons: High memory and CPU at large scale
Window = last 10s, now = 23s
Keep timestamps in (13s, 23s]
Time: 8 12 | 13 17 21 22 23
Timestamps: x x | ^ ^ ^ ^ |
Running example: Use this for per-user 60 RPM when strict rolling-window precision is required, and the cardinality is manageable.
3) Sliding Window Counter (Hybrid)
Idea: Use two adjacent fixed windows and interpolate counts based on how far into the current window you are.
State: count_prev, count_curr, window_start
Pros: Good approximation, low memory
Cons: Slight error near boundaries
estimated = count_curr + count_prev * (1 - fraction_into_window)
allow if estimated <= limit
Visual (two windows with interpolation):
Previous window Current window
[0 ......... 60s] | [60s ......... 120s]
count_prev = 8 count_curr = 4
^
now = 90s -> fraction = 0.5
estimated = 4 + 8 * (1 - 0.5) = 8
Running example: A good fit for per-user 60 RPM when precision is needed but memory must stay small.
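The interpolation above in sketch form (the `<=` check mirrors the formula; the estimate is computed before charging the current request):

```python
class SlidingWindowCounter:
    """Approximate a rolling window from two adjacent fixed windows."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.count_prev, self.count_curr = 0, 0
        self.window_start = 0.0

    def allow(self, now: float) -> bool:
        elapsed = now - self.window_start
        if elapsed >= self.window:          # roll forward; drop stale history
            self.count_prev = self.count_curr if elapsed < 2 * self.window else 0
            self.count_curr = 0
            self.window_start = now - (now % self.window)
        fraction = (now - self.window_start) / self.window
        estimated = self.count_curr + self.count_prev * (1 - fraction)
        if estimated <= self.limit:
            self.count_curr += 1
            return True
        return False
```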
4) Token Bucket
Idea: A bucket fills at a steady rate. Each request consumes tokens. If enough tokens exist, allow.
State: tokens, last_refill_time
Pros: Allows bursts, easy to reason about
Cons: Requires time math
Token-bucket models define a relationship between rate, burst size, and time; they are widely used to describe rate control in networking.
Visual (time series):
Capacity = 10, refill = 1 token/sec
Time (s): 0 1 2 3 4 5
Tokens: 10 10 10 5 6 7
Events: burst(5) +1 +1
Running example: Use for tenant 1000 RPM with a burst of 1-5 seconds; charge writes at cost 5.
Pseudocode (one common implementation):
now = time()
elapsed = now - last_refill
bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_rate)
last_refill = now
if bucket.tokens >= cost:
bucket.tokens -= cost
allow()
else:
reject()
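The same logic as a small class. The injectable clock is an assumption for testability; `time.monotonic` avoids wall-clock jumps:

```python
import time

class TokenBucket:
    """Refill-on-read token bucket: capacity caps the burst, rate is tokens/sec."""
    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity, self.rate, self.clock = capacity, rate, clock
        self.tokens = capacity
        self.last_refill = clock()

    def consume(self, cost: float = 1.0) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity=10 and rate=1, a burst of 10 passes immediately; afterwards, sustained throughput settles at 1 request/sec.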
5) Leaky Bucket
Idea: Requests enter a queue that drains at a fixed rate. If the queue is full, reject.
State: queue size or backlog
Pros: Smooths bursty traffic into a steady flow
Cons: Adds latency if you queue instead of reject
Leaky buckets are commonly used for traffic shaping, smoothing output as it leaves an interface.
Visual (fixed drain):
Incoming: **** ******** **
Queue: 6 6 6 6 6 6 6 6 (bounded)
Outgoing: * * * * * * *
Token bucket vs leaky bucket (same input, different output):
Input: **** ******** **
Token bucket: **** ***** *** ** (allows bursts until empty)
Leaky bucket: * * * * * * * (smooths into steady flow)
Numeric example (queue depth):
Drain rate = 2 req/sec, queue size = 6
Input burst = 6 at t=0
t=0: queue=6
t=1: queue=4
t=2: queue=2
t=3: queue=0
Running example: Use for background exports or webhook fan-out where pacing is better than rejection.
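The numeric example can be sketched as a bounded backlog that drains continuously (names are illustrative; a real implementation would also deliver the queued work):

```python
class LeakyBucket:
    """Bounded backlog draining at a fixed rate; arrivals are rejected when full."""
    def __init__(self, drain_rate: float, queue_size: int):
        self.drain_rate, self.queue_size = drain_rate, queue_size
        self.backlog, self.last = 0.0, 0.0

    def _drain(self, now: float) -> None:
        self.backlog = max(0.0, self.backlog - (now - self.last) * self.drain_rate)
        self.last = now

    def offer(self, now: float) -> bool:
        self._drain(now)
        if self.backlog + 1 <= self.queue_size:
            self.backlog += 1
            return True
        return False

    def depth(self, now: float) -> float:
        self._drain(now)
        return self.backlog
```

A burst of 6 at t=0 fills the queue; at 2 req/sec it drains to 4, 2, and 0 over the next three seconds, matching the table above.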
6) GCRA (Generic Cell Rate Algorithm)
Idea: Track a theoretical arrival time (TAT). A request is allowed if it arrives after its allowed time, with some tolerance for burst.
State: tat (theoretical arrival time)
Pros: Precise, constant memory
Cons: Harder to reason about
GCRA is a leaky-bucket-type algorithm used in ATM networks for policing and shaping traffic.
Core concept (simplified):
if now >= tat - allowed_burst:
allow; tat = max(tat, now) + interval
else:
reject
Mental model: Each request schedules the next allowed time. The allowed burst is a time credit. This is closely related to a token bucket where interval = 1 / rate and allowed_burst encodes bucket size.
Running example: Use for per-user 60 RPM when constant memory and precise pacing matter.
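A sketch of the virtual-scheduling form: one stored timestamp per key, with the burst expressed as a time credit.

```python
class GCRA:
    """Allow a request if it arrives after TAT minus the burst credit."""
    def __init__(self, rate: float, burst: int):
        self.interval = 1.0 / rate                        # seconds per request
        self.allowed_burst = (burst - 1) * self.interval  # time credit for bursts
        self.tat = 0.0                                    # theoretical arrival time

    def allow(self, now: float) -> bool:
        if now >= self.tat - self.allowed_burst:
            self.tat = max(self.tat, now) + self.interval
            return True
        return False
```

At rate=1 and burst=3, three requests pass at t=0, the fourth is rejected, and one more becomes conforming each second: token-bucket behavior from a single float of state.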
Comparison Table
| Algorithm | Memory | Precision | Burst Handling | Implementation |
|---|---|---|---|---|
| Fixed window | O(1) | Low (boundary bursts) | Poor | Very simple |
| Sliding log | O(n) | High | Good | Expensive |
| Sliding counter | O(1) | Medium | Good | Simple |
| Token bucket | O(1) | Medium | Excellent | Simple |
| Leaky bucket | O(1) | Medium | Smooth output | Simple |
| GCRA | O(1) | High | Excellent | Moderate |
Tuning Burst and Cost
The algorithm is only half the story. Most tuning happens here.
Set burst size from UX
Users expect short bursts: page load waterfalls, retries, batch jobs. Let those through, but cap them.
Rule of thumb:
burst = 1-5 seconds of normal rate
If the normal limit is 100 RPS, allow 100-500 tokens of burst.
Use weighted costs
Assign weights to match real cost. For example:
- Read = 1 token
- Write = 5 tokens
- Search = 10 tokens
This keeps the limiter aligned with actual load.
Handle retries explicitly
If clients retry aggressively, lower the rate or enforce a retry backoff policy. Otherwise, retries amplify the load you were trying to control.
Retry storm amplification (example):
1,000 clients x 1 request = 1,000 requests
3 retries per client = 3,000 extra requests
Total load = 4,000 requests (4x)
Use idempotency keys for write endpoints so retries do not duplicate side effects.
Cost attribution for shared resources
When a request touches multiple services, decide who pays. Two common models:
- Caller pays: the API gateway charges the full cost up front based on expected downstream work.
- Per-service limits: each service charges its own budget and can reject independently.
Caller-pays is simpler for clients; per-service limits are safer for shared dependencies. In practice, mix both: charge a baseline at the edge, and let hot downstreams add their own limits.
Complexity-Based Rate Limiting (Cost Budgets)
If request cost varies a lot, stop counting requests. Count work.
The idea is simple: score each request, then enforce a budget on the sum of scores per window and per key. This is ideal for GraphQL, search, report generation, and batch endpoints.
Implement it in three steps:
- Score each request in the origin based on expected cost.
- Emit the score to the edge (response header or metadata).
- Accumulate scores per key and window; block or throttle when the sum exceeds the budget.
Some APIs accept client-declared cost hints (e.g., X-Expected-Cost) to pre-charge budgets. If used, validate server-side to prevent abuse.
Diagram:
[Client] -> [Edge] -> [Origin]
| compute cost
v
response header x-score
[Edge] sums score per key; if total > budget -> block/throttle
One concrete implementation is Cloudflare’s complexity-based rate limiting. It meters requests by a complexity score rather than by request count and expects the origin to return a response header with an integer score. The score must be in the valid range (1 to 1,000,000). Counters track total score per key and only advance when a valid score is present. The rule defines a score budget per period, a period length, and the response header name.
This pattern also appears in GraphQL query-cost scoring and API gateway plugins. The mechanics are the same: score, budget, enforce.
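The score-then-budget loop at the edge can be sketched like this. The 1 to 1,000,000 validity range follows the description above; the reset-on-period behavior is a simplifying assumption:

```python
from collections import defaultdict

class CostBudget:
    """Sum per-key complexity scores per fixed period; block once over budget."""
    def __init__(self, budget: int, period: float):
        self.budget, self.period = budget, period
        self.totals = defaultdict(int)
        self.period_start = 0.0

    def charge(self, key: str, score: int, now: float) -> bool:
        if not 1 <= score <= 1_000_000:     # invalid score: counter does not advance
            return True
        if now - self.period_start >= self.period:
            self.totals.clear()             # new period
            self.period_start = now - (now % self.period)
        self.totals[key] += score
        return self.totals[key] <= self.budget
```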
Adaptive Rate Limiting (Dynamic Budgets)
Static limits are easy to reason about, but systems are rarely static. Adaptive rate limiting adjusts limits based on live signals such as latency, error rate, or saturation.
Two common patterns:
- AIMD (Additive Increase, Multiplicative Decrease)
- Increase the limit slowly when the system is healthy.
- Decrease quickly when latency or errors spike.
- Concurrency-based control
- Keep a target concurrency and adjust allowed rate to hold that line.
Simple feedback loop:
if p95_latency > target or error_rate > budget:
limit = limit * 0.7 # fast decrease
else:
limit = limit + k # slow increase
Adaptive limiters turn traffic control into a control system. They require stable signals, smoothing, and guardrails to avoid oscillation.
A practical example is Netflix’s concurrency-limits, which adjusts limits based on observed latency.
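The feedback loop above as a tiny controller (gains are illustrative; real deployments smooth the input signal and clamp the output):

```python
class AIMDLimit:
    """Additive-increase, multiplicative-decrease limit controller."""
    def __init__(self, limit: float, min_limit: float = 1.0,
                 step: float = 10.0, backoff: float = 0.7):
        self.limit, self.min_limit = limit, min_limit
        self.step, self.backoff = step, backoff

    def update(self, healthy: bool) -> float:
        if healthy:
            self.limit += self.step  # slow additive increase
        else:                        # fast multiplicative decrease, floored
            self.limit = max(self.min_limit, self.limit * self.backoff)
        return self.limit
```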
Rate Limiting in Async and Event-Driven Systems
Request/response is only one shape. For queues, streams, and webhooks, rate limiting often lives at the producer and at the consumer.
Patterns that work well:
- Token bucket at the producer: pace emits into the queue.
- Consumer concurrency limits: bound the number of in-flight messages.
- Queue depth triggers: reduce intake when backlog grows.
- Per-subscriber budgets: avoid one consumer starving others.
In event-driven systems, a limiter is often tied to backlog rather than time. The goal is the same: keep the system inside a stable envelope.
Where to Enforce Limits
Limits can be enforced at multiple layers. Each layer trades accuracy for locality and cost.
Common placements:
- Client-side: best UX, but untrusted
- Edge / gateway: great for global protection
- Service entry: accurate for application semantics
- Dependency level: protects databases and external APIs
A common default: a coarse limit at the edge and fine-grained limits in the service.
[Client] -> [Edge/Gateway] -> [Service] -> [DB/Downstream]
| | | |
soft limit coarse precise dependency
Database-level admission control
Databases have their own limits. Common techniques include:
- Connection pooling with hard caps (e.g., pgBouncer-style)
- Query timeouts to bound worst-case work
- Admission control at the query layer (e.g., only N heavy queries at once)
These are not replacements for rate limiting, but they are the last line of defense for shared state.
Single-Node Implementation Details
On a single node, limiters are fast and simple, but you still need concurrency control and memory discipline.
1) Ring buffer for sliding windows
Buckets for last 60 seconds:
[ t-59 ][ t-58 ] ... [ t-1 ][ t ]
At each request:
- Compute bucket_index = now % 60
- If the bucket is stale, reset it
- Increment the counter
- Sum buckets for the window
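A sketch of that ring buffer, tagging each bucket with the second it last counted so stale slots reset lazily:

```python
class RingWindow:
    """Sixty one-second buckets approximating a rolling one-minute window."""
    def __init__(self, limit: int, slots: int = 60):
        self.limit, self.slots = limit, slots
        self.counts = [0] * slots
        self.stamps = [-1] * slots  # which second each bucket last belonged to

    def allow(self, now: int) -> bool:
        i = now % self.slots
        if self.stamps[i] != now:   # bucket is stale (from a previous minute)
            self.counts[i], self.stamps[i] = 0, now
        total = sum(c for c, s in zip(self.counts, self.stamps)
                    if now - s < self.slots)
        if total < self.limit:
            self.counts[i] += 1
            return True
        return False
```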
2) Control memory and cardinality
High-cardinality keys (user IDs, IPs) can explode memory. Mitigate with:
- TTL eviction for idle keys
- LRU caches
- Avoid combinatorial dimensions (per-user + per-IP + per-endpoint) unless needed
3) Atomicity
Use atomics or fine-grained locks. In many languages, a per-key mutex is enough.
Distributed Rate Limiting Architectures
Once a system scales beyond one node, the key decision is how to share state.
| Architecture | Extra hop per request | Consistency | Typical use |
|---|---|---|---|
| Centralized Redis | Yes | Strong-ish | Simple, moderate QPS |
| Sharded limiter service | Yes | Per-key strong | High QPS, controlled network |
| Leased tokens | No | Approximate | Very high QPS, edge-ish |
| Hybrid | Sometimes | Bounded error | Common “best of both” |
Correctness and bounded error
- Local-only: can exceed the global tenant rate by up to the sum of per-PoP bursts.
- Leased tokens: exceedance is bounded by unexpired leases (worst case = tokens already handed out but not yet spent).
- Hybrid: bounded by local bucket size plus slow-path refill frequency.
Option A: Centralized datastore (Redis or similar)
Each request performs an atomic increment in a shared store.
Pros: Simple, consistent
Cons: Latency and cost at high QPS
[Edge] -> [API Node] -> [Redis]
(INCR + TTL)
Redis Lua example (token bucket, atomic):
-- KEYS[1] = bucket key
-- ARGV[1] = now (ms), ARGV[2] = refill_rate (tokens/ms)
-- ARGV[3] = capacity, ARGV[4] = cost
local key = KEYS[1]
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local cap = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or cap
local ts = tonumber(data[2]) or now
local delta = math.max(0, now - ts)
local new_tokens = math.min(cap, tokens + delta * rate)
local allowed = new_tokens >= cost
if allowed then new_tokens = new_tokens - cost end
redis.call("HMSET", key, "tokens", new_tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil(cap / rate))
return { allowed and 1 or 0, new_tokens }
This script uses a single atomic update and a TTL so idle keys expire. Use a consistent time source and tune the refill rate to match the chosen time unit. If HMGET returns nils because a key expired between checks, the defaults (cap and now) are applied; the first request may see a full bucket. To tighten this, initialize with SETNX or store defaults alongside a version key.
Back-of-the-envelope latency impact
Remote limiter calls add latency. As a ballpark example, a single synchronous Redis connection with redis-benchmark -t ping -c 1 often lands around ~0.2 ms median on fast in-region networks. That implies roughly 5,000 ops/sec on a single connection (1s / 0.2ms). As a limiter, that overhead is paid on every request; at 10k RPS you need parallel connections or a local fast path to avoid bottlenecking on the limiter itself. See the AWS benchmark notes for methodology, and measure in your own environment.
Numbers vary by hardware, network, and command mix, but the lesson holds: when the remote check cost starts to dominate the request budget, use local caches, token leasing, or co-locate limiters with the service.
Option B: Dedicated limiter service (sharded)
Shard limit keys by consistent hashing. Each limiter node keeps in-memory state.
Pros: Scales horizontally, low latency
Cons: Extra hop and service to run
+------------------+
[API Node] ->| Limiter Router |
+------------------+
| | |
v v v
[L1] [L2] [L3] (sharded in-memory)
Option C: Local limiters with leased tokens
Each node holds a local bucket. Periodically, it leases tokens from a global pool.
Pros: No per-request remote calls
Cons: Approximate, needs coordination
[Global Budget]
|
| lease every N seconds
v
[Node A bucket] [Node B bucket] [Node C bucket]
Leasing mechanics (practical knobs)
- Lease size: fixed (simple) or proportional to recent demand (more adaptive).
- Renewal cadence: 1-5s with jitter to avoid stampedes.
- Quiet-to-hot transitions: keep a small warm reserve so a region can absorb a sudden spike.
- Coordinator outage: serve with diminishing local caps (fail-open) or stop at lease exhaustion (fail-closed).
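A minimal sketch of the lease flow, using single-process stand-ins; a real pool would sit behind an RPC with lease expiry and the knobs above:

```python
class GlobalPool:
    """Global budget that grants token leases to nodes."""
    def __init__(self, budget: float):
        self.remaining = budget

    def lease(self, want: float) -> float:
        granted = min(want, self.remaining)
        self.remaining -= granted
        return granted

class LocalLimiter:
    """Node-local bucket refilled by periodic leases, not per-request calls."""
    def __init__(self, pool: GlobalPool, lease_size: float):
        self.pool, self.lease_size = pool, lease_size
        self.tokens = 0.0

    def renew(self) -> None:  # run every few seconds, with jitter
        self.tokens += self.pool.lease(self.lease_size)

    def allow(self, cost: float = 1.0) -> bool:
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Two nodes leasing 60 tokens each from a 100-token pool receive 60 and 40: the global budget is never oversubscribed, and the error is bounded by unspent leases.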
Option D: Hybrid (local fast path, global slow path)
Use a local limiter first. If it is low, ask a global store for a refill.
Pros: High throughput with bounded error
Cons: More complex
Control plane vs data plane
Separate policy from enforcement so limits can change without redeploying.
[Policy Store] ---> [Limiter Config Push]
| |
v v
(limits, tiers) [Data-plane limiters]
Multi-Region and Hierarchical Limits
Global consistency across regions is expensive. Use hierarchy instead.
Global Limit (100k RPS)
|
+----+----+
| |
Region A Region B
60k 40k
| |
+--+--+ +--+--+
| | | |
Node1 Node2 Node3 Node4
Enforce locally on short timescales and rebalance globally on longer timescales.
If regions cannot reach a global coordinator, fall back to local budgets. Choose fail-open (serve with local caps) or fail-closed (block) based on risk.
Failure Modes and Edge Cases
Limiters fail in subtle ways. Design for these up front.
1) Clock skew
Time-based algorithms need consistent time. Use monotonic clocks on a single node. Across nodes, assume skew and design for approximation or centralized time.
2) Hot keys
A single key can dominate load. Mitigate with:
- Per-key sharding (hash key + salt)
- Local caching
- Elevated limits for trusted clients
3) Thundering herd on reset
If all clients retry at reset time, you get a spike. Add jitter to client retries or prefer token buckets.
4) DDoS vs legitimate bursts
Not all spikes are abuse. Distinguish between:
- Legitimate bursts (deploys, batch jobs, page loads)
- Abuse or attacks (high volume with low value)
Common signals include auth state, endpoint mix, error rates, and geo/IP reputation. Progressive penalties (delay, then lower limit, then block) and challenge flows (rate-limit pages, CAPTCHAs) help avoid hurting valid users.
5) Graceful degradation
When limits hit, dropping traffic is not the only option. Other patterns:
- Serve cached or stale data
- Return partial results
- Prioritize critical requests (VIP or internal)
- Queue and drain with a leaky bucket
- Use separate buckets or queues per priority class (P0, P1, P2)
6) Fail-open vs fail-closed
If the limiter fails:
- Fail-open keeps traffic flowing but risks overload
- Fail-closed protects systems but may cause outages
Choose based on downstream fragility and SLOs.
Production gotchas
- Monotonic time only: wall-clock jumps (NTP) can refill buckets incorrectly.
- Cardinality explosion: only add a dimension if it materially reduces abuse and the combined cardinality is bounded. Rule of thumb: avoid user x endpoint x IP unless the endpoint is high-risk and the extra dimension cuts abuse by at least 10x.
- Cost gaming: if cost comes from client or response headers, cap and validate server-side, and sample for audits.
Observability and Client UX
Treat the limiter as part of the product surface, not just an internal guardrail.
Status codes
- 429 Too Many Requests indicates the client has sent too many requests in a given amount of time (RFC 6585: https://www.rfc-editor.org/rfc/rfc6585).
- RFC 6585 notes that a 429 response may include Retry-After to tell the client when to try again.
Retry guidance
- Retry-After accepts either an HTTP-date or delay-seconds and tells the client how long to wait (RFC 9110: https://www.rfc-editor.org/rfc/rfc9110).
Recommended response contract
HTTP/1.1 429 Too Many Requests
Retry-After: 24
RateLimit-Policy: "default";q=100;w=60
RateLimit: "default";r=12;t=24
Content-Type: application/json
{"error":"rate_limited","message":"Too many requests","retry_after":24}
When multiple limits fail
Return the most actionable limit to the caller, usually the tightest scope (user over tenant, tenant over IP). Include a stable error code and the server-defined retry delay, and advise clients to add jitter.
Response headers (current standards)
The IETF HTTPAPI working group defines RateLimit and RateLimit-Policy response fields in an active Internet-Draft (work in progress, not yet an RFC): https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
The draft uses structured fields; a RateLimit item includes parameters such as remaining quota (r) and time to reset (t).
A minimal example (syntax simplified for readability):
RateLimit-Policy: "default";q=100;w=60
RateLimit: "default";r=12;t=24
Some APIs still use RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset; meanings vary across implementations, so document them clearly.
Metrics to track
- Allowed requests per key
- Rejected requests per key
- 429 rate by endpoint
- Tokens remaining distribution
- P50/P95 limiter latency
When Not to Rate Limit
Rate limits are not always the right tool. Avoid strict limits in these cases:
- Internal services with backpressure already enforced by queues or concurrency caps
- Trusted batch jobs where throughput is controlled upstream
- Admin or break-glass endpoints used during incidents (use allowlists instead)
- Health checks and liveness probes that must stay fast and reliable
In these cases, use backpressure, circuit breakers, or admission control instead of per-request rate limits.
Rate Limits vs Quotas in Practice
Rate limits control short windows; quotas control long windows. In production, both are common:
- Rate limit: 100 RPM per user (token bucket or sliding window)
- Quota: 1,000,000 requests per month (fixed window with long TTL)
Enforce both by checking the rate limiter first, then the quota counter. This keeps interactive UX snappy while still enforcing long-term budgets.
Client-Side Rate Limiting
Client-side limiters improve UX and reduce wasted retries. Common libraries include:
- Bottleneck (JavaScript) - https://github.com/SGrondin/bottleneck
- ratelimit / limits (Python) - https://pypi.org/project/ratelimit/ and https://limits.readthedocs.io/
- Guava RateLimiter (Java) - https://github.com/google/guava/wiki/RateLimiterExplained
Client-side limits are advisory, not authoritative. The server remains the source of truth.
Use client-side limits when you control the client and want to avoid wasted RTTs on 429s.
Testing Strategies
Rate limiters are easy to get wrong. Tests should include:
- Load tests with realistic burst patterns
- Time-mocking to validate window boundaries and refill logic
- Chaos tests for Redis outages or clock skew
- Golden traces to confirm fairness across keys
- Shadow mode to log would-block decisions before enforcing
Concrete example:
def test_token_bucket_refill(mock_time):
bucket = TokenBucket(capacity=10, rate=1) # 1 token/sec
bucket.consume(5)
mock_time.advance(3) # 3 tokens refill
assert bucket.tokens == 8
A Concrete System Design Example
Here is a concrete design for a multi-tenant API with read and write endpoints.
Requirements
- Each tenant gets 1,000 requests per minute (RPM)
- Each user gets 60 RPM
- Writes cost 5 tokens, reads cost 1 token
- The system runs in two regions
- Low latency; small errors are acceptable
Step 1: Choose algorithms
- Token bucket for tenant limits (burst-friendly)
- Sliding window counter for per-user limits (precise enough)
Step 2: Choose placement
- Enforce tenant limits at the edge gateway
- Enforce user limits inside the service
Step 3: Architecture
[Client]
|
v
[Edge Gateway]
| tenant token bucket
v
[API Service]
| user sliding window
v
[Downstream]
Step 4: Multi-region split
Lease a share of the tenant budget to each region and rebalance every few seconds.
Global Tenant Budget (1000 RPM)
|
+-- Region A: 600 RPM bucket
+-- Region B: 400 RPM bucket
Step 5: Cost model
read -> cost 1 token
write -> cost 5 tokens
Step 6: Pseudocode (simplified)
# edge gateway (tenant bucket)
def allow_tenant(tenant_id, cost):
bucket = local_bucket(tenant_id)
bucket.refill()
if bucket.tokens >= cost:
bucket.tokens -= cost
return True
return False
# service (user window)
def allow_user(user_id):
win = sliding_window(user_id)
return win.estimate() < 60
# AND composition for a full request
def allow_request(tenant_id, user_id, cost):
return allow_tenant(tenant_id, cost) and allow_user(user_id)
This design is fast, scales horizontally, and covers most production APIs.
Checklist
- Define the work unit and cost model
- Pick the dimension (user, tenant, IP, endpoint)
- Choose the algorithm (token bucket, sliding window, etc.)
- Decide placement (edge, service, dependency)
- Choose state strategy (local, centralized, sharded, hybrid)
- Plan for failure modes (fail-open vs fail-closed)
- Add headers and metrics for visibility
- Decide on adaptive vs static limits
- Add quota enforcement for long windows
- Test with bursts, retries, and outages
- Validate against the running example requirements
Sources and Standards
If you want to go deeper, these are good starting points:
- RFC 6585: HTTP 429 Too Many Requests - https://www.rfc-editor.org/rfc/rfc6585
- RFC 9110: Retry-After header field - https://www.rfc-editor.org/rfc/rfc9110
- IETF Internet-Draft: RateLimit and RateLimit-Policy header fields - https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/
- RFC 3290: Token bucket and leaky bucket discussion (Diffserv) - https://www.rfc-editor.org/rfc/rfc3290
- RFC 2697: Token bucket metering (Diffserv) - https://www.rfc-editor.org/rfc/rfc2697
- RFC 5681: TCP Congestion Control (AIMD background) - https://www.rfc-editor.org/rfc/rfc5681
- AWS Redis benchmark notes - https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/
- Cloudflare complexity-based rate limiting - https://developers.cloudflare.com/waf/rate-limiting-rules/request-rate/#complexity-based-rate-limiting
- Netflix concurrency-limits - https://github.com/Netflix/concurrency-limits
Conclusion
Rate limiting is not a punishment. It is a promise about what a system can safely handle. Start from capacity, pick the right dimensions, and choose the simplest algorithm that meets the SLOs. Make it visible to clients, and the system scales without surprises.
The running multi-tenant API example ties it together: tenant and user budgets, cost weights, and a multi-region split meet the goals without over-engineering.