One Goroutine Beats 64 (Until It Doesn't): Shared State, Queues, and Backpressure in Go
Estimated reading time: 24-30 minutes | ~4,800 words
TL;DR
- A sync.Mutex is hard to beat for simple shared-state protection. Uncontended lock/unlock costs 5-12 ns; a buffered channel send+recv costs 50-75 ns. For straightforward maps, start with a mutex.
- The single-owner goroutine pattern (one goroutine owns all mutable state; others talk to it via channels) eliminates lock convoys and enables smart batching, but it serializes all work through one core and adds channel overhead.
- sync.RWMutex is a trap at scale. RLock does an atomic.AddInt32 on a shared counter, which invalidates that cache line on every core. At 64 goroutines on 64 cores, it is 8x slower than single-threaded.
- Unbounded (or large) channel buffers are the single-owner pattern's kill switch. By Little's Law (L = lambda * W), queue depth at a constant arrival rate grows in direct proportion to service time, and GC pressure from thousands of queued objects turns that growth into a feedback loop.
- Backpressure saves the pattern. A bounded channel with select/default rejection keeps tail latency stable under overload while the mutex and large-buffer owner stores degrade.
- 58% of Go's blocking concurrency bugs come from channels, not mutexes. The proverb "share memory by communicating" is a design tool, not a performance tool. Pick the primitive that matches your failure mode, not a slogan.
Table of Contents
- Act 1: The Mutex Baseline
- Act 2: The Single-Owner Pattern
- Act 3: “Until It Doesn’t” — Queues and Failure
- Act 4: Backpressure Saves It
- When to Use What
Act 1: The Mutex Baseline
I wanted to understand when Go’s concurrency proverb actually helps. “Don’t communicate by sharing memory; share memory by communicating” sounds great in a Rob Pike talk, but I kept reaching for sync.Mutex in real code and wondering if I was missing something. So I built the same session store three ways and measured.
The session store is deliberately simple: a map[string]Session that supports Get, Put, Delete, and Len. Each operation does a tunable amount of CPU work (xorshift iterations) to simulate real business logic. This is not a counter benchmark. Counters are misleading because the synchronization overhead dwarfs any useful work, which makes every primitive look bad.
Here is the workload function that every store variant uses:
```go
func doWork(input uint64, iterations int) uint64 {
	x := input
	for i := 0; i < iterations; i++ {
		x ^= x << 13
		x ^= x >> 7
		x ^= x << 17
	}
	return x
}
```
The mutex store is the obvious starting point. One sync.Mutex guards the entire map:
```go
type MutexStore struct {
	mu       sync.Mutex
	sessions map[string]Session
	ttl      time.Duration
	stop     chan struct{}
	done     chan struct{}
}

func (s *MutexStore) Get(key string) (Session, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	sess, ok := s.sessions[key]
	if !ok {
		return Session{}, false
	}
	if time.Now().After(sess.ExpiresAt) {
		delete(s.sessions, key)
		return Session{}, false
	}
	return sess, true
}

func (s *MutexStore) Put(key string, session Session) {
	if session.ExpiresAt.IsZero() {
		session.ExpiresAt = time.Now().Add(s.ttl)
	}
	if session.CreatedAt.IsZero() {
		session.CreatedAt = time.Now()
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[key] = session
}
```
Nothing clever. Lock, do the thing, unlock. A background goroutine sweeps expired sessions on a ticker, which also takes the lock. This is the realistic part: in production, you rarely have a single acquisition pattern. Background maintenance, health checks, and metrics collection all compete for the same mutex.
What happens under contention
At low goroutine counts, sync.Mutex is remarkably cheap. The fast path is a single CompareAndSwap(0, mutexLocked) on an int32, costing 5-12 ns uncontended. The entire state machine fits in one int32: bit 0 is “locked,” bit 1 is “woken,” bit 2 is “starving,” and bits 3+ encode the waiter count. When there is no contention, Lock() is one atomic operation and Unlock() is an Add(-mutexLocked). That is it.
But contention changes the picture. When a goroutine cannot acquire the lock immediately, it enters lockSlow(). First it tries spinning: up to 4 iterations of procyield (architecture-specific CPU PAUSE/YIELD instructions), but only if the machine is multicore, GOMAXPROCS > 1, and the local run queue is empty. If spinning fails, the goroutine parks via the runtime semaphore and joins a FIFO wait queue.
Go 1.9 introduced starvation mode: when a waiter has been blocked for more than 1 ms (starvationThresholdNs = 1e6), the mutex switches from normal mode (where newly arriving goroutines compete with woken waiters, and new arrivals have the advantage because they are already on-CPU and cache-hot) to starvation mode (where ownership is handed directly to the longest-waiting goroutine). This prevents indefinite starvation, but the handoff itself is expensive because the receiving goroutine must be unparked and rescheduled.
Go issue #33747 documents what happens at scale. On a 96-core machine with a 350 us task and 1% lock hold time, scaling was nearly linear up to 28 workers. At 46 workers, throughput collapsed to 12x (from an expected 46x) because starvation-mode handoffs were taking 60-70 us each. The reporter found that inserting runtime.Gosched() restored linear scaling, but that is not a general fix.
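To make the state machine concrete, here is a small sketch that decodes a mutex state word. The constant values mirror sync/mutex.go; the decode helper itself is illustrative, not a standard library API:

```go
package main

import "fmt"

// These constants mirror the layout in sync/mutex.go: one int32 holds the
// lock bit, the woken bit, the starving bit, and the parked-waiter count.
const (
	mutexLocked      = 1 << 0 // bit 0: mutex is held
	mutexWoken       = 1 << 1 // bit 1: a waiter has already been woken
	mutexStarving    = 1 << 2 // bit 2: starvation mode is active
	mutexWaiterShift = 3      // bits 3+: count of parked waiters
)

// decode is a hypothetical helper for illustration only.
func decode(state int32) (locked, woken, starving bool, waiters int32) {
	return state&mutexLocked != 0,
		state&mutexWoken != 0,
		state&mutexStarving != 0,
		state >> mutexWaiterShift
}

func main() {
	// A held mutex in starvation mode with 5 goroutines parked behind it.
	locked, _, starving, waiters := decode(mutexLocked | mutexStarving | 5<<mutexWaiterShift)
	fmt.Println(locked, starving, waiters) // true true 5
}
```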
RWMutex: the intuitive trap
“But my workload is read-heavy, so I should use sync.RWMutex.” I thought this too. The problem is that RLock() calls atomic.AddInt32(&rw.readerCount, 1), which invalidates that cache line on every core that holds a copy. With 64 goroutines on 64 cores, Bryan Mills measured 517 ns per RLock/RUnlock, 8.3x slower than single-threaded (72.5 ns). Pure cache coherence traffic, no useful work. That issue is still open in 2026.
Mutex benchmark results
All benchmarks below are medians of 5 runs on an Apple M3 Pro (darwin/arm64, Go 1.25). Sessions are deep-copied on Get and Put to prevent aliasing, which adds one allocation per operation.
At 1 worker, the mutex store hits ~3.1 million ops/sec (323 ns/op) with a p50 of 167 ns and p99 of 750 ns. There is essentially zero contention, so every operation is just an uncontended CAS plus a map lookup.
At 8 workers, throughput drops to ~2.0 million ops/sec (500 ns/op), and tail latency is already climbing: p99 reaches 56 us and p999 hits 155 us. The semaphore wait queue is starting to matter.
At 32 workers, throughput holds at ~2.0 million ops/sec (505 ns/op). The p99 jumps to 269 us and p999 to 621 us. Starvation-mode handoffs are now a regular event.
At 64 workers, throughput holds at ~2.0 million ops/sec (499 ns/op), and tail latency explodes: p50 stays reasonable at 500 ns, but p99 is 562 us and p999 reaches 1,199 us (1.2 ms). That is a 2,399x gap between p50 and p999.
| Workers | ns/op | p50 (ns) | p99 (ns) | p999 (ns) | allocs/op |
|---|---|---|---|---|---|
| 1 | 323 | 167 | 750 | 3,583 | 4 |
| 8 | 500 | 500 | 56,084 | 155,125 | 4 |
| 16 | 505 | 500 | 119,458 | 280,083 | 4 |
| 32 | 505 | 500 | 269,042 | 620,917 | 4 |
| 64 | 499 | 500 | 562,042 | 1,199,334 | 4 |
The pattern is clear: raw throughput degrades moderately (323 to 499 ns/op, a 54% increase), but tail latency grows by three orders of magnitude. Goroutines spend more time parked in the semaphore wait queue than doing actual work. All operations incur 4 allocations per op due to session deep-copying.
Act 2: The Single-Owner Pattern
The alternative is to stop sharing the map entirely. One goroutine owns all mutable state. Everyone else communicates with it by sending messages over a channel. The owner processes messages one at a time, in a tight loop.
Go’s concurrency proverb has intellectual roots in Tony Hoare’s CSP (Communicating Sequential Processes) paper from 1978. Rob Pike worked with CSP in Newsqueak, Alef, and Limbo before bringing it to Go. The idea also maps to the Actor Model (Carl Hewitt, 1973): each actor has private state, processes messages one at a time, and communicates via async message passing. Erlang made actors mainstream. CSP and actors are not semantically identical (Go channels are shared, anonymous primitives; actor mailboxes are per-actor queues with different delivery and lifecycle semantics), but the ownership insight is the same.
Martin Thompson formalized this as the Single Writer Principle: “For any item of data, or resource, that item of data should be owned by a single execution context for all mutations.” Reads are fine from multiple threads (CPUs broadcast read-only copies via cache coherency). The problem is multiple writers, which cause MESI cache-line bouncing. A single writer eliminates that entirely.
The pattern shows up everywhere in production systems:
- Redis runs command execution on a single thread, hitting 500K ops/sec on one core. It is typically network-bound before CPU-bound.
- LMAX Disruptor processes 6 million orders per second on a single thread, with sub-microsecond latency. Their paper reports 25 million messages/sec and mean latency 3 orders of magnitude lower than queue-based approaches.
- etcd’s Raft implementation uses a single goroutine event loop. The raft.raft struct is explicitly not thread-safe; a raft.Node wrapper serializes all access through one goroutine running for { select { ... } }.
- ScyllaDB uses a shard-per-core architecture (via Seastar), and Comcast reported shrinking from 962 Cassandra nodes to 78 ScyllaDB nodes while serving 15 million households and 2.4 billion RESTful calls per day.
The common thread: all of these systems chose to serialize mutations through a single execution context, and all of them achieved remarkable throughput precisely because they eliminated cross-core coordination on the write path.
The OwnerStore
Here is the core of the owner-based session store. Callers send a request struct over a buffered channel. The owner goroutine drains and processes them:
```go
type request struct {
	op      opCode
	key     string
	session Session
	resp    chan response
}

type OwnerStore struct {
	reqCh    chan request
	stop     chan struct{}
	done     chan struct{}
	respPool sync.Pool
}
```
The owner loop is the heart of the design:
```go
func (s *OwnerStore) ownerLoop(ttl, sweepInterval time.Duration) {
	defer close(s.done)
	sessions := make(map[string]Session)
	ticker := time.NewTicker(sweepInterval)
	defer ticker.Stop()
	for {
		select {
		case <-s.stop:
			s.drain(sessions, ttl)
			return
		case now := <-ticker.C:
			for k, sess := range sessions {
				if now.After(sess.ExpiresAt) {
					delete(sessions, k)
				}
			}
		case req := <-s.reqCh:
			s.handleRequest(sessions, ttl, req)
			// Smart batching: drain buffered requests without blocking.
			s.batchDrain(sessions, ttl)
		}
	}
}
```
After handling one request, the owner calls batchDrain, which processes every request already sitting in the channel buffer without blocking:
```go
func (s *OwnerStore) batchDrain(sessions map[string]Session, ttl time.Duration) {
	n := len(s.reqCh)
	for i := 0; i < n; i++ {
		req := <-s.reqCh
		s.handleRequest(sessions, ttl, req)
	}
}
```
This is smart batching (also from Thompson). Under load, the channel buffer accumulates requests while the owner processes the current one. When it finishes, it drains the entire batch in a tight loop, amortizing the cost of the select statement and keeping the map hot in L1 cache. The len(s.reqCh) call is cheap (it reads hchan.qcount, which the owner goroutine just updated) and gives us a snapshot of the current buffer depth.
Note the sync.Pool for response channels. Without pooling, every request would allocate a new chan response on the heap. The pool recycles them, which keeps GC pressure low under sustained load. Each response channel is buffered with capacity 1 so the owner can send without blocking.
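The pool helpers are not shown above, so here is a minimal reconstruction under the design just described: capacity-1 channels, recycled via sync.Pool. The names getRespChan/putRespChan match the snippets in this article, but the bodies are my sketch, not verbatim source:

```go
package main

import (
	"fmt"
	"sync"
)

type response struct {
	found bool
}

// pool is a stand-in for the store's respPool field; the helper bodies are
// a reconstruction, not the article's actual source.
type pool struct {
	respPool sync.Pool
}

func newPool() *pool {
	return &pool{respPool: sync.Pool{
		// Capacity 1 lets the owner send a response without blocking.
		New: func() any { return make(chan response, 1) },
	}}
}

func (p *pool) getRespChan() chan response { return p.respPool.Get().(chan response) }

func (p *pool) putRespChan(ch chan response) {
	select {
	case <-ch: // drop any stale response so the next user sees an empty channel
	default:
	}
	p.respPool.Put(ch)
}

func main() {
	p := newPool()
	ch := p.getRespChan()
	ch <- response{found: true} // owner side: buffered send never blocks
	r := <-ch                   // caller side
	p.putRespChan(ch)
	fmt.Println(r.found) // true
}
```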
Callers look like normal synchronous method calls:
```go
func (s *OwnerStore) Get(key string) (Session, bool) {
	ch := s.getRespChan()
	s.reqCh <- request{op: opGet, key: key, resp: ch}
	r := <-ch
	s.putRespChan(ch)
	return r.session, r.found
}
```
The honest truth about channels
I should be upfront about something: channels use mutexes internally. Every hchan has a runtime.mutex (runtime/chan.go), which means sends and receives serialize through a lock. The runtime mutex is lighter than sync.Mutex (no starvation mode, just futex + spin), but it is still a lock. For pure state protection, a mutex avoids that extra hop.
dm03514’s benchmark shows mutexes outperforming monitor goroutines by roughly 75x across 0-10,000 goroutines for a simple counter. The Go community consensus is clear: use mutexes for shared state, channels for communication.
So why bother with single-owner? Three reasons:
- No lock convoys. Under a mutex, a slow holder delays every waiter. Under single-owner, the owner processes requests sequentially, but no goroutine holds the “lock” while doing expensive work outside the critical section.
- Smart batching. You cannot batch mutex acquisitions. With a channel, the owner naturally processes multiple requests per wakeup under load.
- Architectural clarity. The owner goroutine is the only code that touches the map. There is no “did I forget to hold the lock?” class of bugs. Bryan Mills made this case in his GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”.
Owner benchmark results
At 1 worker, the owner store manages ~0.84 million ops/sec (1,192 ns/op), with a p50 of 833 ns and p99 of 5,583 ns. That is 3.7x slower than the mutex store at the same concurrency level. The overhead comes from two channel operations per request (send + receive) plus the select dispatch in the owner loop.
At 8 workers, throughput is ~0.58 million ops/sec (1,726 ns/op). In this workload, channel handoff overhead dominates before batching can amortize enough of the cost. The p50 climbs to 5,459 ns because requests now spend time queued in the channel buffer before the owner picks them up.
At 32 workers, throughput recovers to ~1.0 million ops/sec (979 ns/op). Batching helps at higher fan-in, but p50 is now 19,333 ns (19 us), reflecting deeper queue depths.
At 64 workers, the owner store delivers ~1.13 million ops/sec (885 ns/op). The p50 is 41,625 ns (42 us), p99 is 338,750 ns (339 us), and p999 is 628,708 ns (629 us).
The comparison with the mutex store at 64 workers is revealing. The mutex is much faster on throughput: ~2.0 million ops/sec vs ~1.13 million. But the owner’s p999 of 629 us is still 1.9x better than the mutex’s 1,199 us. The owner pattern trades raw throughput for better tail latency at high contention.
| Workers | ns/op | p50 (ns) | p99 (ns) | p999 (ns) | allocs/op |
|---|---|---|---|---|---|
| 1 | 1,192 | 833 | 5,583 | 9,667 | 4 |
| 8 | 1,726 | 5,459 | 145,167 | 274,500 | 4 |
| 16 | 1,177 | 7,875 | 270,208 | 494,333 | 4 |
| 32 | 979 | 19,333 | 314,166 | 614,667 | 4 |
| 64 | 885 | 41,625 | 338,750 | 628,708 | 4 |
The channel overhead is the price you pay. The question is whether the architectural benefits (no lock convoy, smart batching, single-writer clarity) justify it for your workload.
Act 3: “Until It Doesn’t” — Queues and Failure
Here is where the story turns. The single-owner pattern has a failure mode that mutexes do not: the channel is a queue, and queues under overload are how systems die.
Little’s Law will find you
Little’s Law is one of the most useful results in queueing theory: L = lambda * W. The average number of items in a system (L) equals the arrival rate (lambda) times the average time each item spends in the system (W). It holds regardless of arrival distribution, service distribution, or service order. You cannot negotiate with it.
Here is what happens when downstream latency increases for a service handling 10,000 requests/sec:
| Downstream latency | In-flight (L) | Queue wait | Total latency |
|---|---|---|---|
| 1 ms (normal) | 10 | ~0 | 1 ms |
| 10 ms (slow) | 100 | 10 ms | 20 ms |
| 100 ms (degraded) | 1,000 | 100 ms | 200 ms |
| 1 s (failing) | 10,000 | 1 s | 2 s |
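The table's arithmetic is just Little's Law applied at a fixed arrival rate; a few lines of Go reproduce it:

```go
package main

import "fmt"

// L = lambda * W at a constant arrival rate: in-flight count scales
// linearly with time spent in the system.
func main() {
	lambda := 10000.0 // requests/sec
	for _, w := range []float64{0.001, 0.01, 0.1, 1.0} { // seconds in system
		fmt.Printf("W=%4.0fms -> L=%6.0f in flight\n", w*1000, lambda*w)
	}
}
```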
At 10,000 in-flight requests, each request struct sitting in a Go channel buffer is a heap-allocated object the GC must scan. The GC pressure from those objects adds 50-200 ms to service time. That increases W, which increases L, which increases GC pressure. This is the death spiral: queue growth causes latency growth, which causes more queue growth, which causes more GC pressure. It ends with OOM or a restart.
In Go specifically, this is compounded by how channels interact with garbage collection. A buffered channel holds references to every queued element, and those references keep the entire object graph reachable. For our session store, each request holds a Session with a map[string]string, so 10,000 queued requests means 10,000 live maps the GC must trace. The channel buffer itself is a contiguous allocation (makechan allocates hchan + buffer in one shot for non-pointer elements), but the objects it references are scattered across the heap.
This is not hypothetical
Cloudflare’s 2023 incident involved TCP receive buffers growing without bound under specific kernel behavior, degrading sessions across their fleet. Wazuh v4.13.0 hit memory exhaustion from an unbounded control message queue. Both were queues without backpressure.
Jim Gettys identified the same problem in networking and called it bufferbloat (2011): oversized network buffers are “dark buffers,” undetectable under normal load, catastrophic when they fill. A 256-packet ring buffer at 1 Mbps adds 3 seconds of latency. The excess buffering defeats TCP’s congestion avoidance, which needs packet drops as a signal. The analogy to Go channels is direct: a large channel buffer hides overload until it is too late.
Head-of-line blocking: the owner IS the convoy
With a mutex, a slow request holds the lock longer, but other goroutines can still make progress on operations that do not need the lock. With a single-owner, every request funnels through one goroutine. If request #5 takes 50 ms (disk I/O, external API call), requests #6 through #500 wait in line. The owner goroutine becomes the convoy.
This is not just theoretical. Consider a workload where 1% of requests take 10 ms and 99% take 100 us. Under single-owner, the p99 latency can exceed 10 ms purely from queueing behind slow requests. Under a mutex, the slow request still holds the lock for 10 ms, and other goroutines contending for that lock will wait, but goroutines working on unrelated keys or doing work outside the critical section can still make progress. And with a mutex, if your slow path does not need the lock (say it is an external API call), you can release the lock before the slow work and reacquire it after. With a single owner, the goroutine is the lock. There is no releasing it.
JT Olio put it sharply in 2016: “In all my Go code, I can count on one hand the number of times channels were really the best choice.” And James Iry observed that actors are fundamentally about shared mutable state and locking; they just move the locking into the mailbox.
The ASPLOS 2019 finding
This is the number that should give any Go channel enthusiast pause. Tu et al. studied 171 concurrency bugs across Docker, Kubernetes, etcd, gRPC, CockroachDB, and BoltDB. Their finding: 58% of blocking bugs came from message passing (channels), not shared memory (mutexes). Their conclusion: “It is as easy to make concurrency bugs with message passing as with shared memory.”
Channels do not eliminate concurrency bugs. They trade one class of bugs (data races, forgotten locks) for another (deadlocks from full channels, goroutine leaks from unconsumed messages, head-of-line blocking). The Go team itself rejected unbounded channels; Ian Lance Taylor called them “unworthy to be added to the language.”
And there is a deeper runtime issue. Go issue #57070 documents channel contention collapse: on a 48-core machine with ~300K goroutines/sec, 33 of 40 CPU samples showed threads blocked in runtime.futex on a single channel’s hchan.lock. The channel that was supposed to eliminate contention became the contention point.
The Google SRE perspective
Chapter 22 of the Google SRE Book puts it bluntly: “If a user’s web search is slow because an RPC has been queued for 10 seconds, there’s a good chance the user has given up.” They recommend LIFO or CoDel-style queue management over plain FIFO, and they advocate failing early and cheaply over queuing indefinitely.
Act 4: Backpressure Saves It
The single-owner pattern is not broken. It just needs a pressure valve. The fix is a bounded channel combined with non-blocking sends that reject requests when the queue is full.
The BackpressureStore
The BackpressureStore wraps the same single-owner loop with one critical change: callers use select with a default case to detect a full channel and return ErrOverloaded instead of blocking:
```go
type BackpressureStore struct {
	reqCh    chan request
	stop     chan struct{}
	done     chan struct{}
	ttl      time.Duration
	respPool sync.Pool
}

func (s *BackpressureStore) Get(key string) (Session, bool, error) {
	ch := s.getRespChan()
	req := request{op: opGet, key: key, resp: ch}
	select {
	case s.reqCh <- req:
		r := <-ch
		s.putRespChan(ch)
		return r.session, r.found, nil
	default:
		s.putRespChan(ch)
		return Session{}, false, ErrOverloaded
	}
}

func (s *BackpressureStore) Put(key string, session Session) error {
	ch := s.getRespChan()
	sess := session
	if sess.ExpiresAt.IsZero() {
		sess.ExpiresAt = time.Now().Add(s.ttl)
	}
	if sess.CreatedAt.IsZero() {
		sess.CreatedAt = time.Now()
	}
	req := request{op: opPut, key: key, session: sess, resp: ch}
	select {
	case s.reqCh <- req:
		<-ch
		s.putRespChan(ch)
		return nil
	default:
		s.putRespChan(ch)
		return ErrOverloaded
	}
}
```
The select/default pattern is the key. When the channel is full, the default branch fires immediately (no runtime lock acquisition, no goroutine parking), and the caller gets an error instead of joining a growing queue. The owner loop itself is identical to OwnerStore, with smart batching and all. The only difference is on the caller side.
This is the pattern the Go team had in mind when they rejected adding unbounded channels to the language. Bounded channels force you to make an explicit choice about what happens at capacity. That choice, whether you block, drop, or reject, is a design decision that should be visible in code, not hidden behind an ever-growing buffer.
Backpressure strategies
There are several ways to handle a full queue. The right choice depends on what you are willing to sacrifice:
| Strategy | Preserves | Sacrifices | Best for |
|---|---|---|---|
| Bounded-block | All data | Latency | Internal pipelines, worker pools |
| Drop-oldest | Freshness, latency | Old data | Metrics, telemetry, real-time feeds |
| Reject/shed | Latency | Some requests | User-facing APIs, gateways |
| CoDel | Auto-tuned balance | Stale requests | Request queues with variable load |
CoDel (Controlled Delay, RFC 8289) is worth knowing about. It tracks how long items spend in the queue and starts dropping when sojourn time exceeds 5 ms for more than 100 ms. The drop rate follows a control law: next_drop = now + interval/sqrt(count). It is what Linux uses for network packet queues, and the same idea applies to application-level request queues.
For the BackpressureStore, I chose reject/shed because it is the simplest to reason about and the most appropriate for a session store: if the store cannot serve you right now, you should retry or fall back, not wait in a growing line.
Drop-oldest in Go
For completeness, here is the drop-oldest pattern in Go. One caveat: the drain-then-send sequence below races if multiple goroutines run it, so it is only safe with a single forwarding goroutine.
```go
select {
case ch <- value:
	// sent
default:
	<-ch        // drain oldest
	ch <- value // send new
}
```
This is useful for metrics and telemetry channels where freshness matters more than completeness. Fred Hebert’s “Handling Overload” documents this pattern extensively in the Erlang world, where unbounded mailboxes once caused 300K+ message backlogs and 120 GB memory consumption before restarts. The Erlang community built pobox to solve it.
Benchmark: under overload
At 64 workers under normal load, the backpressure store delivers ~1.6 million ops/sec (622 ns/op). That is slower than the mutex store (~2.0 million ops/sec) but faster than the owner store (~1.13 million ops/sec). The select fast path adds modest overhead when the channel has capacity.
The difference shows up under sustained load. Here is a comparison across selected worker counts:
| Workers | Mutex ns/op | Owner ns/op | BP ns/op | Mutex p999 | Owner p999 | BP p999 |
|---|---|---|---|---|---|---|
| 1 | 323 | 1,192 | 1,109 | 3.6 us | 9.7 us | 8.9 us |
| 8 | 500 | 1,726 | 977 | 155 us | 275 us | 94.8 us |
| 32 | 505 | 979 | 707 | 621 us | 615 us | 309 us |
| 64 | 499 | 885 | 622 | 1,199 us | 629 us | 362 us |
At 64 workers, the backpressure store’s p999 of 362 us is 3.3x better than the mutex’s 1,199 us and 1.7x better than the owner store’s 629 us. The bounded channel with select/default rejection caps queue growth and helps keep tail latency lower at high concurrency. The mutex degrades gradually with lock contention, and the large-buffer owner degrades as queue depth grows. With a truly unbounded buffer, the degradation would compound through GC pressure on the queued objects.
The rejected requests are not wasted work. They are saved work. Every request you reject at the door is a request that does not consume memory in the channel buffer, does not add to the owner’s processing queue, does not increase GC pressure, and does not inflate the latency of every request behind it. Dean and Barroso’s “The Tail at Scale” demonstrated that when each of 100 servers has a p99 of 10 ms, an all-100 fanout sees a p99 of 140 ms. Keeping individual node p99 low through load shedding is how you keep system-level tail latency manageable.
Choosing the channel buffer size
The buffer size is a tuning knob. Too small and you reject requests that the owner could have handled (false positives). Too large and you are back to the unbounded queue problem. I think a good starting point is: buffer size = expected throughput * acceptable queue latency. If the owner processes 100K requests/sec and you want at most 1 ms of queue wait, a buffer of 100 is reasonable. You can also use adaptive approaches: monitor the buffer fill rate and adjust the capacity at runtime, though that adds complexity.
The complete answer
The complete answer is: single-owner + bounded channel + load shedding. The single-owner gives you architectural clarity and smart batching. The bounded channel gives you a queue with a known maximum depth. Load shedding gives you predictable latency under overload.
When to Use What
Here is the decision framework I landed on after building and benchmarking all three variants:
| Scenario | Recommended | Why |
|---|---|---|
| Low contention, simple ops | sync.Mutex | 5-12 ns uncontended. Hard to beat. |
| High contention, complex state, need batching | Single-owner goroutine | Smart batching amortizes overhead. No lock convoy. |
| Production service under variable load | Single-owner + backpressure | Bounded queue + rejection keeps p99 stable. |
| Read-heavy, few writes | sync.RWMutex (with caveats) | Works well at low core counts. Degrades badly past ~16 cores. |
| Simple counters/flags | sync/atomic | Single instruction. No contention mechanism at all. |
| CPU-bound work that must parallelize | Sharded owners or sync.Mutex | A single owner caps at one core. Amdahl’s Law still applies. |
Some honest limitations of the single-owner pattern that I did not solve:
CPU-bound ceiling. The owner goroutine runs on one core. If your bottleneck is CPU and your state is partitionable, you need sharded owners (one per shard, each on its own core). This is exactly what ScyllaDB’s shard-per-core and Redpanda’s thread-per-core architectures do.
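A minimal sketch of sharded owners, assuming a hash-based key-to-shard mapping; the types and names here are illustrative, not from the article's repository:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type shardReq struct {
	key  string
	val  string
	resp chan string
}

// Sharded fans requests out to N owner goroutines; a key always hashes to
// the same shard, preserving the single-writer property per shard.
type Sharded struct {
	shards []chan shardReq
}

func NewSharded(n int) *Sharded {
	s := &Sharded{shards: make([]chan shardReq, n)}
	for i := range s.shards {
		ch := make(chan shardReq, 64)
		s.shards[i] = ch
		go func() { // one owner goroutine per shard, each with a private map
			m := make(map[string]string)
			for req := range ch {
				if req.val != "" {
					m[req.key] = req.val
				}
				req.resp <- m[req.key]
			}
		}()
	}
	return s
}

func (s *Sharded) shardFor(key string) chan shardReq {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[int(h.Sum32())%len(s.shards)]
}

func (s *Sharded) Put(key, val string) string {
	resp := make(chan string, 1)
	s.shardFor(key) <- shardReq{key: key, val: val, resp: resp}
	return <-resp
}

func main() {
	s := NewSharded(4)
	fmt.Println(s.Put("alice", "session-1")) // session-1
}
```

Cross-shard operations are deliberately absent here; that is the hard part, as the next paragraph explains.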
Cross-shard coordination. If a request needs state from two shards owned by two different goroutines, you are back to coordination. The channel approach requires multiple round-trips and risks deadlock if shard A sends to shard B while shard B sends to shard A on full channels. The mutex approach (lock both in consistent order) is often simpler for in-process coordination, though deadlock-prone if ordering is violated.
Channel contention at extreme fan-in. Go issue #57070 is still open. If hundreds of thousands of goroutines send to one channel, the hchan.lock becomes a bottleneck. At that scale, you need sharded channels or a different architecture entirely.
The full source code for all three store implementations, benchmarks, and the workload simulator is available at github.com/devesh-shetty/one-goroutine-beats-64.
The proverb says “share memory by communicating.” That is good advice for system design. But Go’s channels are built on shared memory (a mutex inside every hchan), and 58% of Go’s blocking bugs come from channels. The real lesson is not “channels good, mutexes bad.” It is: understand your contention profile, bound your queues, and shed load before the system does it for you.