Building a Database on S3 in 2026: Revisiting a 2008 Paper

Estimated reading time: 7-10 minutes | ~1,600 words

TL;DR

  • The 2008 paper “Building a Database on S3” built a whole consistency protocol around S3’s limitations, and most of that complexity vanishes once you have strong consistency and conditional writes.
  • A copy-on-write B+ tree plus a single manifest CAS is enough for multi-page atomicity; the WAL becomes an intent log for cleanup, not the commit point.
  • Multi-writer OCC at page granularity works well for this shape of storage, as long as every page you read is part of your validation set.
  • Page-per-object is a deliberate teaching simplification; real systems bundle pages into extents to amortize S3 request costs.
  • I built a working Go prototype that runs against S3Mock and exposes CreateTable, Insert, and Get on top of object storage.

Why Revisit “Building a Database on S3”

I rebuilt the 2008 SIGMOD paper “Building a Database on S3” using 2026-era S3 primitives, and the difference is striking: most of the paper’s protocol complexity simply disappears.

The original paper had to fight eventual consistency, a lack of conditional writes, and the inability to do multi-object atomicity on S3. That forced them into an elaborate SQS commit protocol, checkpointing daemons, and multiple consistency tiers.

Today, S3 gives you strong read-after-write consistency and native If-None-Match / If-Match semantics. That doesn’t solve cross-object atomicity, but it removes the need for SQS and most of the client-side monotonicity hacks.

So the interesting question becomes: if I delete the 2008-era scaffolding, what does a modern “database on S3” actually look like?

2008 vs 2026: What Changed

Here is the commit path side-by-side:

2008 paper | 2026 prototype
Eventual consistency + client-side monotonicity hacks | Strong read-after-write consistency + conditional writes
SQS-based commit protocol | Manifest CAS with If-Match
Multiple consistency tiers + checkpoint daemons | Single manifest + WAL intent cleanup
Overwrite-heavy metadata updates | Immutable pages + new manifest versions

The core shape is the same (B+ tree pages on object storage), but the coordination layer collapses once S3 gives you conditional writes. The SQS machinery disappears, and the manifest CAS becomes the only commit point.


The Modernized Architecture

Here is the entire commit protocol in one diagram:

read manifest (ETag)
     |
     v
write WAL intent (page keys)
     |
     v
write new pages (CoW)
     |
     v
manifest CAS (If-Match)

The manifest is the only authoritative pointer to the database state. A successful CAS is the commit point; the WAL just tells you which pages to clean up if a writer crashes halfway through.

This is the same shape you see across modern S3-native systems (Neon, turbopuffer, Chroma’s wal3): a set of immutable objects plus a tiny mutable manifest, coordinated with conditional writes. The details vary, but the pattern is consistent.


Storage Layout

I kept the storage model deliberately simple: one page = one S3 object. That keeps the copy-on-write protocol obvious and debugging easy, but it is not how production systems work.

Pages are 64KB with a slotted-page layout inspired by BoltDB and SQLite. A 64KB page gives you a branching factor around 4,000, which means most trees are height 2 or 3.
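The shallow-tree claim is easy to sanity-check. A minimal sketch (the `height` helper is mine; the fanout of ~4,000 is the post's figure, and real fanout depends on key sizes):

```go
package main

import (
	"fmt"
	"math"
)

// height returns the number of levels a B+ tree needs to index n keys
// with the given fanout (the root counts as level 1).
func height(n, fanout float64) int {
	if n <= fanout {
		return 1 // a single leaf holds everything
	}
	return 1 + int(math.Ceil(math.Log(n/fanout)/math.Log(fanout)))
}

func main() {
	fmt.Println(height(1e6, 4000)) // ~1M rows fit in a height-2 tree
	fmt.Println(height(1e9, 4000)) // ~1B rows still only need height 3
}
```

With fanout 4,000, two levels already index about 16 million rows, which is why most point lookups cost only two or three page reads.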

Page layout (leaf):

[header][leaf elements...][free space][key/value data]

The header includes the page ID, a checksum, the page type, and (for leaves) a right-sibling pointer for fast range scans. Leaf elements store offsets into the data area, so the page stays fixed-size even with variable-length keys and values.
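To make the offset indirection concrete, here is a stripped-down sketch of a slotted leaf: fixed-size elements grow from the header while key/value bytes are packed from the tail. Field widths and the `putLeaf`/`getLeaf` helpers are illustrative, not the prototype's exact encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageSize     = 64 * 1024
	headerSize   = 32
	leafElemSize = 16
)

// putLeaf copies key/value bytes to the tail of the page and records a
// fixed-size element (offset, key length, value length) in the slot array.
// It returns the new tail position for the next insertion.
func putLeaf(page []byte, slot, tail int, key, val []byte) int {
	tail -= len(key) + len(val)
	copy(page[tail:], key)
	copy(page[tail+len(key):], val)
	elem := page[headerSize+slot*leafElemSize:]
	binary.LittleEndian.PutUint32(elem[0:4], uint32(tail))
	binary.LittleEndian.PutUint32(elem[4:8], uint32(len(key)))
	binary.LittleEndian.PutUint32(elem[8:12], uint32(len(val)))
	return tail
}

// getLeaf resolves a slot back to its key/value bytes via the offsets.
func getLeaf(page []byte, slot int) (key, val []byte) {
	elem := page[headerSize+slot*leafElemSize:]
	off := binary.LittleEndian.Uint32(elem[0:4])
	klen := binary.LittleEndian.Uint32(elem[4:8])
	vlen := binary.LittleEndian.Uint32(elem[8:12])
	return page[off : off+klen], page[off+klen : off+klen+vlen]
}

func main() {
	page := make([]byte, pageSize)
	tail := putLeaf(page, 0, pageSize, []byte("user:1"), []byte("Linus"))
	putLeaf(page, 1, tail, []byte("user:2"), []byte("Ada"))
	k, v := getLeaf(page, 1)
	fmt.Printf("%s=%s\n", k, v) // user:2=Ada
}
```

Because the slot array is fixed-size, binary search over slots never touches the variable-length data area until the final compare.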

Rows are stored in leaf pages. The primary key is the B+ tree key; the row payload is the value. When a row is too large, the value becomes an overflow pointer and the actual bytes live in chained overflow pages.


Commit Protocol (The Heart of It)

Every transaction follows the same structure:

  1. Read the current manifest and remember its ETag.
  2. Read pages as needed (tracking a read-set).
  3. Write a WAL intent entry listing newly created page objects.
  4. Write new CoW pages to S3.
  5. CAS the manifest with If-Match: <etag>.
  6. Delete the WAL intent entry (best-effort).

The WAL is not the commit point. It only exists to clean up orphaned pages if a writer dies after writing pages but before winning the manifest CAS.

That single manifest CAS is the entire atomicity story. Either the manifest update succeeds (commit) or it fails (retry or abort), and the tree itself never mutates in-place.


Multi-Writer OCC

I used classic Kung–Robinson OCC with page-level read-set and write-set validation. A writer that loses the manifest CAS re-reads the latest manifest and checks whether any of the pages it read or plans to write have changed.

If every page in the read-set and write-set is untouched in the new manifest, the writer rebases its transaction onto the new root, replays its operations, and retries the CAS. If any of those pages changed, the transaction aborts rather than retrying with stale assumptions, and its orphaned pages are cleaned up via the WAL intent log.

The granularity here is a page, not a row. It’s coarse compared to a row-level MVCC engine, but it’s good enough for a page-based storage engine and keeps the implementation readable.
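The validation step itself is a few lines once pages carry version tokens. A sketch, assuming (as in the prototype's PageVersions map) that the manifest maps page IDs to ETags; the `validate` function name is mine:

```go
package main

import "fmt"

// validate implements page-granularity OCC validation: every page the
// transaction read or intends to replace must have the same ETag in the
// latest manifest as it did in the transaction's snapshot.
func validate(snapshot, latest map[uint64]string, readSet, writeSet []uint64) bool {
	for _, set := range [][]uint64{readSet, writeSet} {
		for _, id := range set {
			if latest[id] != snapshot[id] {
				return false // page was rewritten or deleted under us
			}
		}
	}
	return true
}

func main() {
	snap := map[uint64]string{1: "e1", 2: "e2"}
	latest := map[uint64]string{1: "e1", 2: "e9"} // page 2 was rewritten

	fmt.Println(validate(snap, latest, []uint64{1}, nil))    // clean: rebase and retry
	fmt.Println(validate(snap, latest, []uint64{1, 2}, nil)) // page 2 changed: abort
}
```

A deleted page also fails validation here, since a missing key yields the empty string rather than the snapshot's ETag.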


Implementation Walkthrough

I wrote the prototype in Go and kept it small enough to read end-to-end. The full source is in github.com/devesh-shetty/s3db-s3mock.

Manifest

The manifest is a single JSON object stored at {prefix}/manifest.json (the default prefix is db). It holds the catalog (table definitions), the root page IDs, and a map of live page IDs to their current ETag tokens. This single-object design is a teaching simplification; a production system would shard the manifest or use layered metadata as the page count grows.

// manifest.go

type Manifest struct {
	Version      uint64              `json:"version"`
	Catalog      map[string]TableDef `json:"catalog"`
	PageVersions map[uint64]string   `json:"page_versions"`
	Recent       []CommitMeta        `json:"recent_commits"`
}

On every commit, I update the catalog (if needed), add the new page IDs to PageVersions, and delete the old ones that were replaced by CoW.

Page Encoding

Pages are fixed-size 64KB buffers with a header and a slotted-array layout. This is a direct nod to BoltDB’s page encoding, but adapted to S3-friendly page sizes.

// page.go

const (
	PageSize       = 64 * 1024
	pageHeaderSize = 32
	leafElemSize   = 16
)

func (p *Page) Encode() ([]byte, error) {
	buf := make([]byte, PageSize)
	// header, elements, and data packing...
	checksum := xxhash.Sum64(buf)
	binary.LittleEndian.PutUint64(buf[8:16], checksum)
	return buf, nil
}

The checksum lets me detect corruption early, and it doubles as a cheap sanity check when debugging S3Mock traces.

Insert Path (CoW)

Every insert walks the tree, copies the pages along the path, and writes new immutable page objects. If a node overflows, it splits and bubbles the new max keys up to the parent.

// btree.go

func (tx *Txn) btreeInsert(root uint64, key []byte, value []byte, tombstone bool) (uint64, error) {
	if root == 0 {
		leaf := NewLeafPage(tx.allocPageID())
		leaf.LeafCells = []LeafCell{{Key: key, Value: value}}
		tx.stagePage(leaf)
		return leaf.ID, nil
	}
	res, err := tx.insertInto(root, key, value, tombstone)
	if err != nil {
		return 0, err
	}
	if res.right == nil {
		return res.left.id, nil
	}
	// create a new root on split
	rootPage := NewBranchPage(tx.allocPageID())
	rootPage.BranchCells = []BranchCell{
		{Key: res.left.maxKey, Child: res.left.id},
		{Key: res.right.maxKey, Child: res.right.id},
	}
	tx.stagePage(rootPage)
	return rootPage.ID, nil
}

The important detail is that nothing mutates in place. Every page touched by the insert becomes a new object on S3.

Commit (Manifest CAS)

Here is the core of the commit sequence. The only mutable object is the manifest, and its CAS is the commit point.

// txn.go

func (tx *Txn) commitOnce(ctx context.Context) error {
	walKey := ""
	pageKeys := tx.stagedPageKeys()
	if len(pageKeys) > 0 {
		entry := WALEntry{
			TxID:            newTxID(),
			ManifestVersion: tx.snapshot.Version,
			PageKeys:        pageKeys,
			StartedAt:       time.Now().UTC().Format(time.RFC3339Nano),
			HeartbeatAt:     time.Now().UTC().Format(time.RFC3339Nano),
		}
		var err error
		walKey, err = tx.db.writeWALEntry(ctx, entry)
		if err != nil {
			return err
		}
	}

	for id, page := range tx.stagedPages {
		buf, err := page.Encode()
		if err != nil {
			return err
		}
		key := manifestPageKey(tx.db.cfg.Prefix, id)
		resp, err := tx.db.s3.PutObject(ctx, &s3.PutObjectInput{
			Bucket:      aws.String(tx.db.cfg.Bucket),
			Key:         aws.String(key),
			Body:        bytes.NewReader(buf),
			IfNoneMatch: aws.String("*"),
		})
		if err != nil {
			return err
		}
		tx.manifest.PageVersions[id] = aws.ToString(resp.ETag)
	}

	for id := range tx.replacedPages {
		delete(tx.manifest.PageVersions, id)
	}

	tx.manifest.Version++
	tx.manifest.addCommit(tx.readSetKeys(), tx.writeSetKeys())

	_, err := tx.db.putManifest(ctx, tx.manifest, tx.manifestETag) // If-Match
	if err != nil {
		return err
	}

	if walKey != "" {
		_ = tx.db.deleteWALEntry(ctx, walKey)
	}
	return nil
}

The WAL entry is deleted best-effort after the CAS succeeds. If it sticks around, the periodic sweep waits for the heartbeat grace period to expire, then cross-checks live manifests before deleting any orphaned page.


Running It Locally with S3Mock

I run this against S3Mock to keep the loop tight. If you have Podman, the easiest way is:

git clone https://github.com/devesh-shetty/s3db-s3mock
cd s3db-s3mock
make up
make demo

If you don’t have Podman, you can run the executable JAR directly (requires Java 17+):

curl -L -o /tmp/s3mock.jar \
  https://repo1.maven.org/maven2/com/adobe/testing/s3mock/4.11.0/s3mock-4.11.0-exec.jar

COM_ADOBE_TESTING_S3MOCK_STORE_INITIAL_BUCKETS=s3db \
  /opt/homebrew/opt/openjdk/bin/java -jar /tmp/s3mock.jar

Here is the actual output from the demo on my machine:

user 2 -> id=2 name=Linus active=true
user 99 -> not found
wal cleanup: 0 entries, 0 pages removed

What This Still Isn’t

This is a teaching implementation. It is intentionally missing everything you’d need for a production database: buffer pool tuning, garbage collection of replaced pages, snapshot pinning for long-lived readers, page bundling into extents, backups, and a SQL layer.

The architecture is the interesting part. Once you can reason about manifests, conditional writes, and copy-on-write B+ trees, the next steps are all engineering (and a lot of it).


Final Notes

Rebuilding the paper in 2026 made one thing clear to me: the hard part is no longer single-object consistency or monotonic reads. S3 gives those to you for free. What you still need to build is the atomicity layer: the manifest CAS, the read-set validation, the crash cleanup.

The hard part is engineering a storage engine that lives comfortably on object storage: request costs, latency, and object bundling dominate the design once consistency is free.

That’s why I like this exercise. It forces you to see the real constraints, not the historical ones.