← Home

Object Storage as a Queue: When the Wrong Tool Is Right

Estimated reading time: 22-27 minutes | ~4,400 words

I built a distributed message queue on S3 in ~200 lines of Go to understand why companies like turbopuffer and WarpStream are building queues and streaming systems on top of object storage instead of using purpose-built message brokers. The short answer: two S3 features, strong consistency (2020) and conditional writes (late 2024), turned object storage into a viable coordination primitive. This post walks through the implementation from first principles, then compares it to what turbopuffer and others have built in production.


TL;DR

  • A queue needs only three things: durable writes, ordered reads, and a claim mechanism. S3 now provides the primitives for all three, but expiry is DIY. The full implementation is ~200 lines of Go.
  • S3 gained strong read-after-write consistency in December 2020 and conditional writes (If-None-Match, If-Match) in November 2024. These two changes made S3 a viable coordination primitive.
  • turbopuffer replaced a sharded job queue with a single JSON file on object storage, achieving 10x lower tail latency.
  • At scale, batching makes S3-as-queue cost-competitive with SQS. The trade-off is latency: S3 Standard small object latencies are 100-200ms vs SQS’s ~20 ms in AWS’s same-region example.
  • WarpStream (acquired by Confluent), Neon, and Delta Lake all build coordination layers on top of object storage. This isn’t a hack; it’s a trend.
  • Object storage as a queue works best for batch-oriented, cost-sensitive workloads where sub-100ms latency isn’t critical.

Table of Contents

  1. TL;DR
  2. What Makes a Queue a Queue?
  3. S3: The Primitives That Changed Everything
  4. Building It
  5. Running It
  6. How turbopuffer Takes It Further
  7. The Economics: S3 vs SQS vs Kafka
  8. Who Else Is Doing This?
  9. When This Works (and When It Doesn’t)
  10. What I Didn’t Build
  11. What I Think This Means

What Makes a Queue a Queue?

I think most engineers over-specify what a queue actually needs to be. We reach for Kafka, SQS, or Redis Streams by default, but if you strip a queue down to first principles, the requirements are surprisingly minimal.

A queue needs three properties:

  1. Durable writes. A producer puts a message somewhere, and it stays there even if the producer crashes.
  2. Ordered reads. A consumer can read messages in some predictable order (FIFO, priority, or at least not random).
  3. A claim mechanism with expiry. When multiple consumers exist, a message should be processed by one of them, not all of them. And if that consumer dies, the claim must eventually expire so another consumer can retry. This is the “visibility timeout” or “lease” concept.

That’s it. Everything else (exactly-once semantics, dead letter queues, consumer groups, partition rebalancing) is built on top of these three primitives. They matter, but they aren’t the foundation.

Here’s the thing: S3 now gives you the raw material for all three, though expiry is still on you. Not as a native queue API, but as primitives you can compose into one.

Property            S3 Primitive
─────────────────   ─────────────────────────────────────
Durable writes      PUT object (11 nines of durability)
Ordered reads       LIST with lexicographical ordering*
Claim mechanism     Conditional PUT (If-None-Match, Nov 2024)

* ListObjectsV2 returns keys in lexicographical order by key name for general purpose buckets. Directory buckets (S3 Express One Zone) do not guarantee ordering. This is not FIFO, but if you use fixed-width timestamp-prefixed keys (e.g., 2026-02-17T12:00:00.000000000Z/job-001), lexicographic order approximates chronological order closely enough for most queue workloads.

This wasn’t always true. Before December 2020, S3 was eventually consistent, which meant a LIST after a PUT might not return the new object. Before November 2024, there was no conditional write, which meant two consumers could both claim the same message. Those two changes, strong consistency and conditional writes, are what turned S3 from “interesting thought experiment” into “people are actually shipping this.”


S3: The Primitives That Changed Everything

Strong read-after-write consistency (December 2020)

On December 1, 2020 (as covered in depth by Gergely Orosz’s deep dive into how S3 is built), AWS announced that S3 now delivers strong read-after-write consistency for all operations (GET, PUT, LIST) in all regions, at no extra cost, with no performance trade-off. Forrest Brazeal, an AWS Serverless Hero at the time, summarized it well: “S3 is now strongly consistent. No config changes, no caveats, it just is.”

Before this, S3 offered eventual consistency. If you wrote an object and immediately listed the bucket, the new object might not appear. This was a real problem. Amazon EMR had to build an entire consistency layer called EMRFS Consistent View to work around it. Hadoop developers built S3Guard. Both became unnecessary overnight.

For queues, this matters because a producer can PUT an object and a consumer can immediately LIST the bucket and see it. No polling delay, no stale reads. The semantics you need for a queue’s read path just work.

Conditional writes (November 2024)

In November 2024, AWS expanded conditional write support for S3. You can now pass If-None-Match (write only if the key doesn’t exist) or If-Match (write only if the ETag matches) headers with PutObject and CompleteMultipartUpload. If the precondition fails, S3 returns 412 Precondition Failed (or 409 Conflict in concurrent delete edge cases).

If-None-Match is the one that matters for queues. It’s a compare-and-swap (CAS) primitive, the exact mechanism you need for a claim/lock operation:

Consumer A: PUT s3://queue/jobs/job-42/lock  (If-None-Match: *)  → 200 OK
Consumer B: PUT s3://queue/jobs/job-42/lock  (If-None-Match: *)  → 412 Precondition Failed

Consumer A gets the job. Consumer B doesn’t. No external lock service, no DynamoDB, no Redis. Just S3.
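The race resolves the same way every time, which is easy to see even without S3: Go's sync.Map.LoadOrStore is an in-memory put-if-absent, so it makes a faithful stand-in for the If-None-Match claim in a quick simulation (my own sketch, not part of the queue implementation below):

```go
package main

import (
	"fmt"
	"sync"
)

// claimAll races n goroutines to claim one key via LoadOrStore, an in-memory
// put-if-absent analogous to PUT with If-None-Match: *. It returns the
// number of claimers that succeeded.
func claimAll(n int) int {
	var claims sync.Map
	var wg sync.WaitGroup
	winners := make(chan struct{}, n)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Succeeds only if no one has stored this key yet, like S3
			// returning 200 OK vs 412 Precondition Failed.
			if _, loaded := claims.LoadOrStore("jobs/job-42/lock", struct{}{}); !loaded {
				winners <- struct{}{}
			}
		}()
	}
	wg.Wait()
	close(winners)

	count := 0
	for range winners {
		count++
	}
	return count
}

func main() {
	fmt.Println("winners out of 10 racers:", claimAll(10)) // always 1
}
```

However many claimers race, exactly one wins; the rest learn they lost and move on, which is the entire coordination protocol.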

S3 now provides the put-if-absent primitive natively, a capability that systems like Delta Lake previously had to borrow from DynamoDB (and, per its official S3 multi-cluster guidance, still do).

What these two primitives unlock together

With strong consistency + conditional writes, S3 supports this workflow:

Producer                           Consumer
────────                           ────────
PUT job-{id}.json                  LIST s3://queue/pending/
  (job payload)                      → [job-001, job-002, job-003]

                                   PUT s3://queue/claimed/job-001
                                     (If-None-Match: *)
                                     → 200 OK (claimed)

                                   GET s3://queue/pending/job-001
                                     (read payload, do work)

                                   DELETE s3://queue/pending/job-001
                                   PUT s3://queue/done/job-001

This is a queue. A weird one, built from files and HTTP headers, but a queue nonetheless. S3 Standard is designed for 11 nines of durability. The consistency is strong. The claim mechanism is atomic.


Building It

I built a queue on S3 in Go to see how the primitives feel in practice. The full source is ~200 lines, uses only the AWS SDK for Go v2, and runs against MinIO locally.

Messages flow through three S3 key prefixes, one per lifecycle stage:

{prefix}/pending/   -> waiting to be processed
{prefix}/claimed/   -> a consumer has taken ownership
{prefix}/done/      -> successfully processed

The queue struct

The queue is just an S3 client, a bucket name, and a key prefix. The prefix lets multiple independent queues coexist in a single bucket:

type Queue struct {
	client *s3.Client
	bucket string
	prefix string
}

type Message struct {
	ID      string // unique identifier (e.g. "job-001")
	Key     string // full S3 object key under pending/
	Payload []byte // the message body
}

Enqueue: timestamp-prefixed keys

Every enqueued message gets a key like queue/pending/2026-02-17T12:00:00.000000000Z-job-001. We use a fixed-width timestamp format instead of Go’s time.RFC3339Nano because RFC 3339 nano drops trailing zeros, which breaks lexicographic sorting: "...00Z" sorts after "...00.1Z" because 'Z' (0x5A) > '.' (0x2E). The fixed-width format always emits exactly 9 fractional digits, so lexicographic order on S3 gives approximate chronological order. It is only approximate because clock skew across multiple producers can still cause out-of-order delivery. No sequence numbers, no coordination, just wall-clock time:

// timestampFormat is a fixed-width time format that always produces exactly
// 9 fractional digits. Go's time.RFC3339Nano drops trailing zeros, which
// breaks lexicographic sorting: "...00Z" sorts after "...00.1Z" because
// 'Z' (0x5A) > '.' (0x2E). Fixed-width avoids this.
const timestampFormat = "2006-01-02T15:04:05.000000000Z07:00"

func (q *Queue) Enqueue(ctx context.Context, id string, payload []byte) error {
	key := fmt.Sprintf("%s/pending/%s-%s", q.prefix, time.Now().UTC().Format(timestampFormat), id)

	_, err := q.client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(q.bucket),
		Key:    aws.String(key),
		Body:   bytes.NewReader(payload),
	})
	return err
}

Each PutObject is durable the moment it returns. S3 Standard is designed for 11 nines of durability, so there’s no WAL, no fsync discipline, no crash recovery logic. Compare that to the write-ahead log I built for my storage engine, where durability required careful CRC checksums and fsync on every write.

List: discovering pending messages

Listing is a single ListObjectsV2 call with the pending/ prefix. S3 returns keys in lexicographic order for general purpose buckets, so messages come back in best-effort chronological order (oldest-first when clocks are reasonably in sync):

func (q *Queue) List(ctx context.Context) ([]Message, error) {
	prefix := fmt.Sprintf("%s/pending/", q.prefix)

	resp, err := q.client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
		Bucket: aws.String(q.bucket),
		Prefix: aws.String(prefix),
	})
	if err != nil {
		return nil, fmt.Errorf("listing pending messages: %w", err)
	}

	messages := make([]Message, 0, len(resp.Contents))
	for _, obj := range resp.Contents {
		key := aws.ToString(obj.Key)
		id := extractID(key)
		messages = append(messages, Message{ID: id, Key: key})
	}
	return messages, nil
}

Because S3 is now strongly consistent, a message that was just enqueued will appear in the next LIST call. No stale reads, no eventual consistency lag.
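The snippets call an extractID helper that isn't shown. Here's one minimal sketch consistent with the key scheme above: pending keys carry a fixed-width UTC timestamp ending in "Z-" before the ID, while claimed/ and done/ keys end in the bare ID.

```go
package main

import (
	"fmt"
	"strings"
)

// extractID recovers the message ID from a key like
// "queue/pending/2026-02-17T12:00:00.000000000Z-job-001" or
// "queue/claimed/job-001". A sketch: since Enqueue always formats in UTC,
// the timestamp ends in "Z" and the ID follows the "Z-" separator.
func extractID(key string) string {
	base := key[strings.LastIndex(key, "/")+1:]
	if i := strings.Index(base, "Z-"); i >= 0 {
		return base[i+2:]
	}
	return base // claimed/done keys store the bare ID
}

func main() {
	fmt.Println(extractID("queue/pending/2026-02-17T12:00:00.000000000Z-job-001")) // job-001
	fmt.Println(extractID("queue/claimed/job-001"))                                // job-001
}
```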

Claim: the conditional write

This is the core of the queue’s consistency guarantee. To claim a message, a consumer writes a marker to claimed/{id} with IfNoneMatch: "*". S3 rejects the write if the key already exists:

func (q *Queue) Claim(ctx context.Context, messageKey string, workerID string) error {
	messageID := extractID(messageKey)
	claimKey := fmt.Sprintf("%s/claimed/%s", q.prefix, messageID)

	_, err := q.client.PutObject(ctx, &s3.PutObjectInput{
		Bucket:      aws.String(q.bucket),
		Key:         aws.String(claimKey),
		Body:        bytes.NewReader([]byte(workerID)),
		IfNoneMatch: aws.String("*"),
	})
	if err != nil {
		var apiErr smithy.APIError
		if errors.As(err, &apiErr) {
			code := apiErr.ErrorCode()
			if code == "PreconditionFailed" || code == "ConditionalRequestConflict" {
				return ErrAlreadyClaimed
			}
		}
		return fmt.Errorf("claiming message %s: %w", messageID, err)
	}
	return nil
}

The IfNoneMatch: "*" line is doing all the work. When two consumers race to claim the same message, exactly one gets a 200 OK and the other gets a 412 Precondition Failed. No locks, no leases, no coordination service.

The error handling checks for both PreconditionFailed and ConditionalRequestConflict using the smithy error interface, since different S3-compatible implementations may return different error codes.

Ack: completing the lifecycle

After processing, the consumer deletes the message from pending/ and writes a marker to done/. The claimed/ marker is left in place as a tombstone so another consumer can’t re-claim a completed message:

func (q *Queue) Ack(ctx context.Context, messageKey string) error {
	messageID := extractID(messageKey)

	// Delete the original message from pending/.
	_, err := q.client.DeleteObject(ctx, &s3.DeleteObjectInput{
		Bucket: aws.String(q.bucket),
		Key:    aws.String(messageKey),
	})
	if err != nil {
		return fmt.Errorf("deleting pending message %s: %w", messageID, err)
	}

	// Write a done marker so we can audit what was processed.
	doneKey := fmt.Sprintf("%s/done/%s", q.prefix, messageID)
	_, err = q.client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(q.bucket),
		Key:    aws.String(doneKey),
		Body:   bytes.NewReader([]byte("done")),
	})
	if err != nil {
		return fmt.Errorf("writing done marker for %s: %w", messageID, err)
	}

	return nil
}

Why keep the tombstone? Because another consumer may already have a list of pending keys from an earlier ListObjectsV2 call. If Ack deletes claimed/job-001, that other consumer can successfully PutObject with If-None-Match: * and re-claim a message that’s already been processed. Keeping claimed/ in place blocks that race.

Two API calls for one Ack, which is the main downside of this approach. The done/ marker is optional but useful for auditing. In production you might skip it and rely on the absence from pending/ as the completion signal. The tombstone in claimed/ would need TTL or periodic cleanup.

Dequeue: the consumer loop

Dequeue ties it all together. List pending messages, try to claim the first one, skip if someone else got it:

func (q *Queue) Dequeue(ctx context.Context, workerID string) (*Message, error) {
	messages, err := q.List(ctx)
	if err != nil {
		return nil, err
	}

	for _, msg := range messages {
		err := q.Claim(ctx, msg.Key, workerID)
		if errors.Is(err, ErrAlreadyClaimed) {
			continue // someone else got it, try next
		}
		if err != nil {
			return nil, err
		}

		// We own this message. Fetch the payload.
		resp, err := q.client.GetObject(ctx, &s3.GetObjectInput{
			Bucket: aws.String(q.bucket),
			Key:    aws.String(msg.Key),
		})
		if err != nil {
			return nil, fmt.Errorf("fetching payload for %s: %w", msg.ID, err)
		}
		defer resp.Body.Close()

		payload, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, fmt.Errorf("reading payload for %s: %w", msg.ID, err)
		}

		msg.Payload = payload
		return &msg, nil
	}

	return nil, nil // queue is empty
}

Returns (nil, nil) when the queue is empty. That’s not an error, it just means there’s nothing to do right now.
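A consumer built on this convention just polls and backs off when it sees (nil, nil). A sketch of such a loop, written against a plain function type so it runs standalone (in real code dequeue would be q.Dequeue and handle would hold your job logic):

```go
package main

import (
	"fmt"
	"time"
)

// runWorker drains a queue that follows Dequeue's convention: a nil message
// with a nil error means "empty, nothing to do". It sleeps between empty
// polls and gives up after maxEmptyPolls consecutive empty results,
// returning the number of messages processed.
func runWorker(dequeue func() (*string, error), handle func(string), idle time.Duration, maxEmptyPolls int) int {
	processed, empty := 0, 0
	for empty < maxEmptyPolls {
		msg, err := dequeue()
		if err != nil {
			break // real code would log and retry with backoff
		}
		if msg == nil {
			empty++ // queue empty: back off instead of hot-looping
			time.Sleep(idle)
			continue
		}
		empty = 0
		handle(*msg)
		processed++
	}
	return processed
}

func main() {
	// Stub dequeue standing in for q.Dequeue: yields three jobs, then empty.
	jobs, i := []string{"job-001", "job-002", "job-003"}, 0
	dequeue := func() (*string, error) {
		if i >= len(jobs) {
			return nil, nil
		}
		j := jobs[i]
		i++
		return &j, nil
	}

	n := runWorker(dequeue, func(id string) { fmt.Println("processed", id) }, time.Millisecond, 2)
	fmt.Println("total:", n) // total: 3
}
```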


Running It

Start a local MinIO instance:

docker run -p 9000:9000 -p 9001:9001 minio/minio server /data --console-address ":9001"

Then run the demo, which enqueues 5 jobs and spawns 3 concurrent consumers:

git clone https://github.com/devesh-shetty/s3-queue.git
cd s3-queue
go run ./cmd/demo
=== S3 Queue Demo ===

Enqueued: job-001 (payload: "index document batch 1")
Enqueued: job-002 (payload: "index document batch 2")
Enqueued: job-003 (payload: "index document batch 3")
Enqueued: job-004 (payload: "resize images for catalog")
Enqueued: job-005 (payload: "send notification emails")

Starting 3 consumers...

[consumer-3] Claimed job-001 (payload: "index document batch 1")
[consumer-1] Claimed job-002 (payload: "index document batch 2")
[consumer-2] Claimed job-003 (payload: "index document batch 3")
[consumer-3] Acked  job-001
[consumer-1] Acked  job-002
[consumer-1] Claimed job-004 (payload: "resize images for catalog")
[consumer-2] Acked  job-003
[consumer-2] Claimed job-005 (payload: "send notification emails")
[consumer-1] Acked  job-004
[consumer-2] Acked  job-005

All 5 messages processed.
Done: [job-001, job-002, job-003, job-004, job-005]

Five messages, three consumers, zero double-processing. The conditional write does all the coordination.


How turbopuffer Takes It Further

The implementation above works, but it’s a teaching tool. turbopuffer built something more sophisticated for production. They’re a vector database that runs entirely on object storage, and in a February 2026 blog post, they described replacing their sharded indexing job queue with a single JSON file on S3-compatible storage. The results: 10x lower tail latency and dramatically simpler operations.

Their design evolved through four iterations beyond where our simple implementation stops.

Step 1: A single queue.json file

The simplest possible approach: one JSON file containing the entire queue state. Producers append jobs, consumers mark them in-progress. Every mutation reads the file, modifies it, and writes it back using an ETag-based CAS operation (If-Match on the object’s version, as opposed to the If-None-Match claim markers in our implementation above).

{
  "jobs": [
    {"id": "a1", "state": "pending",     "payload": "..."},
    {"id": "a2", "state": "in_progress", "payload": "..."},
    {"id": "a3", "state": "complete",    "payload": "..."}
  ]
}

This works, but there’s a problem: in turbopuffer’s environment, CAS writes to object storage take on the order of 200ms, and only one writer can succeed per CAS cycle. If two writers try simultaneously, one gets a conflict and has to retry. Throughput is capped at roughly a few operations per second.

Step 2: Group commit

Instead of writing on every operation, turbopuffer buffers incoming requests in memory and batches them into a single CAS write. This decouples the request rate from the write rate. If 50 requests arrive during a 200ms write cycle, they all get committed in one CAS operation.

The bottleneck shifts from per-write latency to bandwidth. For a JSON file containing job metadata, bandwidth is never the constraint.
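The mechanics are simple enough to model in a few lines. This toy (my own sketch, not turbopuffer's code) drains whatever accumulated while the previous write was in flight into a single batch, counting one "write" per cycle:

```go
package main

import "fmt"

// drainInBatches commits everything buffered in pending, taking the whole
// buffer as one batch per write cycle. Returns total jobs committed and the
// number of writes; the "write" here is just a counter standing in for a
// CAS PUT of queue.json.
func drainInBatches(pending chan string) (jobs, writes int) {
	for len(pending) > 0 {
		// One CAS cycle: absorb everything that arrived so far.
		for n := len(pending); n > 0; n-- {
			<-pending
			jobs++
		}
		writes++
	}
	return jobs, writes
}

func main() {
	pending := make(chan string, 64)
	for i := 1; i <= 50; i++ {
		pending <- fmt.Sprintf("job-%03d", i)
	}

	jobs, writes := drainInBatches(pending)
	fmt.Printf("%d jobs committed in %d write(s)\n", jobs, writes)
	// 50 jobs committed in 1 write(s)
}
```

Fifty requests, one CAS write: the request rate and the write rate are now fully decoupled.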

Step 3: A stateless broker

Instead of having every client contend over the CAS write to queue.json, turbopuffer introduced a single broker process. All clients talk to the broker over normal RPCs. The broker handles all object storage interactions (reads, writes, CAS operations) and serializes access to the queue file.

  Clients (producers + consumers)
          |
          v
   ┌────────────┐
   │   Broker   │  ← single process, stateless
   └─────┬──────┘
         │
         v
   ┌────────────┐
   │ queue.json │  ← on object storage
   └────────────┘

The broker is stateless in the sense that all durable state lives in queue.json. If the broker crashes, a new one starts, reads queue.json, and picks up where it left off. Clients discover the broker’s address from the queue file itself.

Step 4: Heartbeats and failure detection

The final piece: workers send periodic heartbeat timestamps for each claimed job. If a heartbeat goes stale, the broker reassigns the job. This handles the case where a worker dies mid-processing: the job eventually gets reclaimed and retried, giving at-least-once delivery semantics.

The CAS primitive provides correctness even during broker transitions. If two brokers temporarily overlap (e.g., during a restart), the CAS ensures only one broker’s write succeeds per cycle.

The result

turbopuffer’s blog post reports 10x lower tail latency versus their prior sharded queue implementation. The post includes a chart showing a dramatic drop in median queue time after deployment. All from replacing a sharded queue system with a JSON file.

I think the key insight from turbopuffer’s approach is this: object storage offers few but powerful primitives. Once you learn how they behave, you can build distributed systems with what’s already there.


The Economics: S3 vs SQS vs Kafka

The cost argument is what makes this pattern interesting beyond the elegance. Let me work through the numbers for a concrete workload: 1 million messages per day, each 1 KB.

S3 Standard (us-east-1)

All prices below are from the S3 pricing page for us-east-1 (first 50 TB tier). At 1M messages/day, that’s ~30M messages/month:

  • PUT requests: 30M PUTs x $0.005/1,000 = $150.00/month
  • GET requests: 30M GETs x $0.0004/1,000 = $12.00/month
  • LIST requests: ~1,000 LIST calls/day x 30 x $0.005/1,000 = $0.15/month
  • Storage: ~1 GB (messages retained ~1 day) x $0.023/GB = $0.023/month
  • Total: ~$162/month

That’s expensive. The PUT cost dominates. In practice, you’d batch messages into fewer, larger objects to bring this down dramatically. If you batch 100 messages per object, you’re at 300K PUTs/month, or ~$1.50/month for PUTs alone.

S3 Express One Zone lowers PUT costs to $0.00113/1,000 (for requests up to 512 KB) (after the April 2025 price reduction), but storage rises to $0.11/GB.
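The batching arithmetic is worth making concrete. A back-of-envelope calculation using the request prices above (prices hard-coded as assumptions; check the current pricing page before relying on them):

```go
package main

import "fmt"

// requestCost estimates monthly S3 request spend in dollars, using the
// us-east-1 list prices quoted above: PUT $0.005 per 1,000 requests
// ($5/million) and GET $0.0004 per 1,000 ($0.40/million).
func requestCost(puts, gets int64) float64 {
	return float64(puts)/1e6*5.0 + float64(gets)/1e6*0.40
}

func main() {
	const messagesPerMonth = 30_000_000 // 1M/day x 30

	fmt.Printf("unbatched: $%.2f/month\n", requestCost(messagesPerMonth, messagesPerMonth))
	// unbatched: $162.00/month

	// Batching 100 messages per object cuts PUTs and GETs 100x.
	fmt.Printf("batched:   $%.2f/month\n", requestCost(messagesPerMonth/100, messagesPerMonth/100))
	// batched:   $1.62/month
}
```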

SQS Standard

SQS pricing is $0.40 per million requests for the first 100 billion/month tier (us-east-1). At 1M messages/day:

  • Requests: 30M sends + 30M receives + 30M deletes = 90M requests/month
  • Cost: 90M x $0.40/1M = $36.00/month (first 1M free)
  • With batching (10 messages per API call): roughly $3.60/month

Self-managed Kafka (3-broker cluster on AWS)

  • EC2 instances: 3 × m5.large = 3 × ~$70/month = $210/month
  • EBS storage: 3 × 100 GB gp3 = 3 × $8 = $24/month
  • Total: ~$234/month (before inter-AZ networking, which WarpStream estimates is 80%+ of real Kafka costs)

Amazon MSK Serverless

  • Cluster hours: ~$0.75/hr × 730 hrs = $547.50/month
  • Plus per-partition and per-storage charges

Comparison table

Property               S3-as-Queue                      SQS Standard                       Kafka (self-managed)
────────────────────   ──────────────────────────────   ────────────────────────────────   ─────────────────────────
Cost (1M msg/day,      ~$1.50/month                     ~$3.60/month                       ~$234/month
  batched)
Latency                100-200 ms typical               ~20 ms (AWS same-region example)   low ms (varies by config)
Max message size       5 GB single / 50 TB multipart    1 MiB                              1 MB (default)
Retention              Unlimited                        14 days max                        Configurable
Durability             11 nines                         Replicated across 3 AZs            Replication factor
Ordering               Lexicographic key (general       Best-effort (FIFO extra)           Per-partition
                         purpose buckets)
Operational overhead   None                             None                               High
Consumer groups        DIY                              Built-in                           Built-in

With batching, S3-as-queue and SQS land in similar territory for per-message cost. The S3 pattern wins on different axes: message sizes up to 50 TB, unlimited retention, and zero operational overhead beyond what you already have for S3.

The real cost advantage of S3-as-queue shows up when messages are large (SQS caps at 1 MiB), retention is long (SQS caps at 14 days), or you need the data to live in a format that other systems can directly access (Parquet files, JSON blobs, etc.).


Who Else Is Doing This?

turbopuffer isn’t alone. A growing number of systems treat object storage as a coordination layer, not just a storage layer.

WarpStream: Kafka on S3

WarpStream (acquired by Confluent in 2024) builds a Kafka-compatible streaming platform where every byte of data goes to S3 instead of local disks. Because WarpStream writes to S3 instead of replicating data across availability zones, it eliminates the inter-AZ data transfer costs that dominate Kafka bills at scale (AWS charges $0.05/GB for cross-AZ traffic at retail prices).

WarpStream’s agents are completely stateless. Any agent can serve any topic, any partition, any consumer group. They batch writes into S3 objects every 250ms by default, decoupling the number of S3 API calls from the number of partitions. This avoids the naive trap where per-partition-per-flush writes would cost $130/month per partition in PUT requests alone.

For lower latency, WarpStream supports S3 Express One Zone, achieving 4x lower end-to-end latency. They even tier data: ingest to Express One Zone for speed, then compact into S3 Standard for cost. It’s the object storage equivalent of an LSM tree’s level hierarchy, something I explored in my storage engine post.

Neon: Postgres on S3

Neon separates PostgreSQL’s compute from its storage, using object storage as the ultimate source of truth. Compute nodes are stateless: they can be created, resized, or destroyed without risking data loss.

The write path is interesting: WAL records flow from compute to “safekeepers” (a Paxos-replicated durability layer), then to pageservers, and ultimately to S3. The pageservers act as an LRU cache on SSDs, keeping hot data close to compute while cold data lives cheaply on S3. This is what enables Neon’s scale-to-zero capability. No data lives on the compute, so there’s nothing to lose when it shuts down.

Delta Lake: ACID transactions on S3

Delta Lake builds ACID transactions on top of S3 using a file-based transaction log in the _delta_log directory. Each commit creates a new JSON file (000000.json, 000001.json, …) containing the set of data files to add or remove.

The commit protocol uses optimistic concurrency control: multiple writers prepare changes independently, then race to create the next sequentially-numbered commit file. On ADLS and GCS, this uses native put-if-absent semantics. On S3, Delta Lake historically needed a DynamoDB-backed LogStore to provide mutual exclusion. S3’s 2024 conditional writes provide the put-if-absent semantic natively, but as of this writing, Delta Lake’s official S3 multi-cluster guidance still requires DynamoDB. I suspect that dependency will become optional eventually, but it hasn’t happened yet.
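The numbered-commit race is small enough to simulate. This in-memory sketch (my own toy, not Delta Lake code) uses a mutex-guarded map as a stand-in for S3's put-if-absent:

```go
package main

import (
	"fmt"
	"sync"
)

// commitLog models Delta-style commit numbering. putIfAbsent is an
// in-memory stand-in for a conditional PUT (If-None-Match: *); the map
// plays the role of the _delta_log directory.
type commitLog struct {
	mu    sync.Mutex
	files map[string]string
}

func (l *commitLog) putIfAbsent(key, val string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.files[key]; ok {
		return false // precondition failed: someone owns this version
	}
	l.files[key] = val
	return true
}

// commit races to create the next sequentially numbered commit file,
// advancing past versions that other writers already claimed.
func (l *commitLog) commit(writer string) string {
	for v := 0; ; v++ {
		key := fmt.Sprintf("%06d.json", v)
		if l.putIfAbsent(key, writer) {
			return key
		}
		// Conflict: in real Delta, the writer re-reads the log, rebases
		// its changes, and retries at the next version.
	}
}

func main() {
	log := &commitLog{files: map[string]string{}}
	fmt.Println(log.commit("writer-a")) // 000000.json
	fmt.Println(log.commit("writer-b")) // 000001.json
	fmt.Println(log.commit("writer-a")) // 000002.json
}
```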

The pattern

All four systems share the same architectural bet: object storage is reliable enough, consistent enough, and cheap enough to be the foundation, not just the archive.

Traditional:    App → Queue → Workers → Database → S3 (archive)

Emerging:       App → S3 (source of truth) ← Workers

                Coordination via CAS / conditional writes

When This Works (and When It Doesn’t)

I want to be honest about the trade-offs, because I think it’s easy to get excited about the elegance and overlook the constraints.

It works well when:

  • Latency tolerance is >100ms. AWS documents that applications can achieve 100-200ms latencies for small objects on S3 Standard. If your workload is batch indexing, ETL, or background processing, this is fine. If you need sub-10ms, it’s not.
  • Messages are large. SQS caps at 1 MiB. Kafka defaults to 1 MB. S3 handles individual objects up to 5 GB and multipart uploads up to 50 TB. If your “messages” are Parquet files, model checkpoints, or video segments, S3 is the natural home.
  • Retention is long or unbounded. SQS deletes messages after 14 days max. Kafka retention requires provisioned storage. S3 just keeps things until you delete them, at $0.023/GB/month. For audit logs, compliance data, or replay-friendly architectures, unlimited retention is a genuine advantage.
  • You want your queue data to be directly queryable. S3 objects can be read by Athena, Spark, DuckDB, or any tool that speaks S3. Your queue is also your data lake. With SQS or Kafka, getting data out for analysis requires a separate pipeline.
  • Operational simplicity matters. There are no brokers to patch, no partitions to rebalance, no ZooKeeper to babysit. S3 is managed infrastructure you already pay for.

It doesn’t work well when:

  • You need sub-10ms latency. Even S3 Express One Zone, which AWS describes as delivering “consistent single-digit millisecond data access,” doesn’t match Kafka’s in-process latency. For real-time event streaming, trading systems, or interactive workloads, this pattern adds unacceptable overhead.
  • You need exactly-once processing. S3-as-queue naturally gives you at-least-once delivery. Building exactly-once on top requires idempotency keys or external deduplication. SQS FIFO queues provide exactly-once natively (at the cost of throughput).
  • Consumer group semantics are critical. Kafka’s consumer groups, offset management, and rebalancing protocol are battle-tested. Reimplementing them on S3 is a substantial engineering effort. If you need sophisticated multi-consumer coordination, use a purpose-built system.
  • You need high fan-out. If every message goes to many consumers (pub/sub pattern), S3’s LIST-and-claim approach becomes expensive. SNS+SQS or Kafka topics with multiple consumer groups handle fan-out natively.
  • Message throughput is extremely high. S3 supports 5,500 GET and 3,500 PUT requests per second per prefix. You can shard across prefixes to scale further, but at some point you’re reinventing a distributed queue system. If you need millions of messages per second, use Kafka.
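Prefix sharding itself is a one-liner. A hypothetical helper (not part of the implementation above) that hashes message IDs across n prefixes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardPrefix hashes a message ID onto one of n queue prefixes, giving each
// prefix its own 3,500 PUT/s and 5,500 GET/s request budget. Hypothetical
// helper for illustration only.
func shardPrefix(id string, n uint32) string {
	h := fnv.New32a()
	h.Write([]byte(id))
	return fmt.Sprintf("queue-%02d/pending/%s", h.Sum32()%n, id)
}

func main() {
	for _, id := range []string{"job-001", "job-002", "job-003"} {
		fmt.Println(shardPrefix(id, 8))
	}
}
```

Consumers would then LIST each shard prefix in turn (or in parallel), which is exactly the point where you start reinventing partitions.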

What I Didn’t Build

This is a teaching implementation. It works, but it’s missing everything you’d need for production:

  1. Lease expiry / heartbeats. In this toy, claimed/ markers are permanent tombstones: useful after success, but if a consumer crashes mid-processing, its message is stuck forever. turbopuffer solves this with heartbeat timestamps; you’d need a reaper process or TTL-based reclaim.
  2. Batch enqueue. Writing one S3 object per message is expensive at scale ($0.005 per 1,000 PUTs). Production systems batch many messages into a single object.
  3. Pagination. ListObjectsV2 returns up to 1,000 keys per call. For large queues you’d need to paginate with ContinuationToken.
  4. Delivery guarantees. The Ack function makes two separate API calls (delete pending, write done). If the process crashes between them, you get a ghost message. An atomic move would require a different approach.
  5. Visibility timeout. SQS lets a consumer hold a message for a configurable period before it becomes visible again. Our implementation has no equivalent; you’d need to build it with TTL-based claim keys.
  6. Dead letter queue. Messages that repeatedly fail processing should be moved aside, not retried forever.

turbopuffer’s blog post solves several of these with their broker + group commit + heartbeat design. The gap between what I built and what they shipped is a good illustration of the difference between a proof of concept and a production system.


What I Think This Means

I think we’re seeing a broader shift where object storage is becoming the default substrate for distributed systems, not just the archival tier. The argument goes something like this:

  1. Object storage is the cheapest durable storage available (~$0.023/GB/month for 11 nines of durability).
  2. As of 2020, it’s strongly consistent.
  3. As of 2024, it supports conditional writes.
  4. It requires zero operational overhead.
  5. It scales without configuration.

Given those properties, the question becomes: why wouldn’t you build on it?

The answer is latency. S3 Standard adds 100-200ms of latency to every small object operation. For many workloads, that’s a dealbreaker. But for batch processing, data pipelines, indexing jobs, and background tasks (a large fraction of all compute work) it’s perfectly acceptable.

I think turbopuffer’s observation captures it well: object storage offers few, but powerful, primitives. CAS writes, strong consistency, immutable objects, and lexicographic listing. You don’t get the rich API surface of Kafka or the managed convenience of SQS. But you get something more fundamental: a durable, consistent, virtually infinite storage layer that you can compose into whatever you need.

The systems that bet on this pattern (WarpStream, Neon, Delta Lake, turbopuffer) are all thriving. I suspect we’ll see more of it.