
Building a Storage Engine: WAL, LSM Trees, and SSTables from Scratch

Estimated reading time: 19-24 minutes | ~3,800 words

I built a key-value storage engine from scratch in Go to understand how databases like LevelDB and RocksDB actually work under the hood. The whole thing is ~1,000 lines, uses only the Go standard library, and you can clone it, run it, and crash-test it on your own machine.

This post walks through every layer: the write-ahead log, the memtable, the SSTable format, bloom filters, and compaction. Real code, real terminal output. By the end, you’ll have a storage engine that survives an unclean shutdown and recovers every committed write.


TL;DR

  • A write-ahead log (WAL) with CRC32 checksums and fsync per write provides best-effort durability. On macOS, Go’s file.Sync() uses F_FULLFSYNC, which requests a flush of device write caches (though Apple notes some devices may ignore the request).
  • The memtable is a sorted in-memory buffer; when it fills up, it flushes to an immutable SSTable on disk.
  • SSTables store sorted key-value pairs with an index block for binary search and a bloom filter to skip unnecessary disk reads.
  • Bloom filters deliver ~1% false positive rates with 10 bits per key and 7 hash functions. We measured 0.99% in our tests.
  • Compaction merges SSTables via k-way merge sort, deduplicates keys, and removes tombstones.
  • Crash recovery works: 500/500 keys recovered after os.Exit(1) with no Close() call.

Table of Contents

  1. TL;DR
  2. Why Build a Storage Engine?
  3. The Architecture
  4. The Write-Ahead Log
  5. The Memtable
  6. The SSTable
  7. The Bloom Filter
  8. Compaction
  9. Putting It Together
  10. Crash Recovery: The Whole Point
  11. Running It Yourself
  12. What I Didn’t Build
  13. What I Learned

Why Build a Storage Engine?

The log-structured merge-tree (LSM tree) is arguably the most important data structure in modern databases. Patrick O’Neil’s 1996 paper introduced the idea, Google’s Bigtable paper in 2006 proved it worked at planet scale, and Jeff Dean and Sanjay Ghemawat open-sourced LevelDB in 2011 to give the rest of us a reference implementation.

Today, LSM trees power Cassandra, CockroachDB (via Pebble), TiKV, ScyllaDB, and InfluxDB. Facebook forked LevelDB into RocksDB in 2012, adding features like column families, bloom filters per SSTable, and universal compaction.

The core insight is simple: traditional B-tree databases optimize for reads by keeping data sorted on disk, but every write requires random I/O to find the right page and update it in place. LSM trees flip the trade-off: writes are always sequential (append to a log, insert into memory), which is fast on every storage medium. The cost is paid on reads, which may need to check multiple files. Compaction amortizes that cost over time.

I wanted to understand this trade-off by building it. Not by reading about it, but by writing every byte to disk myself and verifying I could crash the process and get my data back.


The Architecture

Here’s the full system at a glance:

                  Write Path                          Read Path
                  ==========                          =========

Put(k, v) ──┐                              Get(k) ──┐
             │                                       │
             v                                       v
      ┌─────────────┐                        ┌─────────────┐
      │ WAL append  │                        │  Memtable   │──found──> return
      │ + fsync     │                        │  lookup     │
      └──────┬──────┘                        └──────┬──────┘
             │                                       │ not found
             v                                       v
      ┌─────────────┐                  ┌──────────────────────────┐
      │  Memtable   │                  │  SSTables (newest first) │
      │  insert     │                  │                          │
      └──────┬──────┘                  │  For each SSTable:       │
             │                         │   1. Bloom filter check  │
             v                         │   2. Index binary search │
      ┌─────────────┐                  │   3. Read data entry     │
      │  Size check │                  └──────────────────────────┘
      │  > 4MB?     │
      └──────┬──────┘
         yes │
             v
      ┌─────────────┐
      │  Flush to   │
      │  SSTable    │
      └──────┬──────┘
             │
             v
      ┌─────────────┐
      │ Compaction  │
      │ (if needed) │
      └─────────────┘

Write path: Every write first appends to the WAL (and fsyncs), then inserts into the memtable. When the memtable exceeds 4MB, it flushes to a sorted SSTable on disk, the WAL resets, and compaction runs if there are too many SSTables.

Read path: Check the memtable first (most recent data). If not found, check SSTables from newest to oldest. For each SSTable, the bloom filter rejects keys that definitely aren’t present, and binary search on the index finds keys that are.


The Write-Ahead Log

The WAL exists for exactly one reason: durability. Before a write touches the in-memory memtable, it must be on disk in the WAL. If the process crashes, the WAL is replayed to reconstruct the memtable.

Entry format

Each WAL record is self-describing. You can read it without knowing anything about the rest of the file:

+----------+----------+--------+----------+---------+----------+---------+
| Size(4B) | CRC(4B)  | Op(1B) | KLen(4B) | Key     | VLen(4B) | Value   |
+----------+----------+--------+----------+---------+----------+---------+

The Size field tells you how many bytes follow. The CRC-32 checksum covers everything after the CRC field: op type, key length, key, value length, value. If the checksum doesn’t match during replay, the record is corrupt (probably a partial write from a crash) and we stop replaying.

Here’s the core of the WAL append, 30 lines that guarantee durability:

func (w *WAL) Append(entry WALEntry) error {
	// Build the payload: op + key_len + key + val_len + val
	payloadSize := 1 + 4 + len(entry.Key) + 4 + len(entry.Value)
	payload := make([]byte, payloadSize)

	off := 0
	payload[off] = byte(entry.Op)
	off++
	binary.LittleEndian.PutUint32(payload[off:], uint32(len(entry.Key)))
	off += 4
	copy(payload[off:], entry.Key)
	off += len(entry.Key)
	binary.LittleEndian.PutUint32(payload[off:], uint32(len(entry.Value)))
	off += 4
	copy(payload[off:], entry.Value)

	// Compute CRC over the payload
	checksum := crc32.ChecksumIEEE(payload)

	// Build the full record: length + CRC + payload
	record := make([]byte, 4+4+payloadSize)
	binary.LittleEndian.PutUint32(record[0:4], uint32(payloadSize))
	binary.LittleEndian.PutUint32(record[4:8], checksum)
	copy(record[8:], payload)

	n, err := w.file.Write(record)
	if err != nil {
		return fmt.Errorf("wal write: %w", err)
	}
	if n != len(record) {
		return fmt.Errorf("wal short write: wrote %d of %d bytes", n, len(record))
	}
	// fsync ensures durability. On macOS this uses F_FULLFSYNC
	if err := w.file.Sync(); err != nil {
		return fmt.Errorf("wal sync: %w", err)
	}
	return nil
}

The key line is w.file.Sync(). On macOS, Go’s os.File.Sync() calls fcntl(fd, F_FULLFSYNC), which asks the device to flush its write cache to stable storage. If F_FULLFSYNC isn’t supported (e.g. on network mounts), Go falls back to fsync(). Apple documents that some devices may ignore the flush request, so this is best-effort rather than an absolute guarantee. Still, it’s the strongest durability primitive available on macOS, and it’s why writes are slow (~244 ops/sec in our benchmarks) but crash-safe.

Replay and corruption tolerance

On startup, Replay reads WAL entries sequentially, verifying each checksum. The moment a CRC mismatch or partial read occurs, it stops:

func Replay(path string) ([]WALEntry, error) {
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("wal replay open: %w", err)
	}
	defer f.Close()

	var entries []WALEntry
	header := make([]byte, 8) // length + CRC

	for {
		if _, err := io.ReadFull(f, header); err != nil {
			break // EOF or partial header - done
		}
		length := binary.LittleEndian.Uint32(header[0:4])
		storedCRC := binary.LittleEndian.Uint32(header[4:8])

		if length > 64*1024*1024 {
			break // sanity check
		}

		payload := make([]byte, length)
		if _, err := io.ReadFull(f, payload); err != nil {
			break // partial payload - entry wasn't fully written
		}

		if crc32.ChecksumIEEE(payload) != storedCRC {
			break // corrupted entry - stop here
		}

		entry, err := decodePayload(payload)
		if err != nil {
			break
		}
		entries = append(entries, entry)
	}
	return entries, nil
}

This is safe because the WAL is append-only and sequential. A corrupt record means the process crashed mid-write. Everything after it is suspect. Everything before it was fsync’d and is trustworthy.


The Memtable

The memtable is a sorted in-memory buffer. I used the simplest possible approach: a sorted slice with sort.Search for binary search lookups and insertion-sort for writes.

Production engines like LevelDB and RocksDB use skip lists here because they offer O(log n) inserts without the memory-shuffling cost of slice insertion. But for a tutorial, a sorted slice is clearer and the performance is fine. Our reads hit ~2.9 million ops/sec from the memtable.

The key design detail: deletions insert a tombstone rather than removing the entry. This matters because older SSTables on disk might still contain the key. The tombstone in the memtable (or a later SSTable) shadows those older values.

func (m *Memtable) Put(key string, value []byte) {
	idx := m.search(key)

	if idx < len(m.entries) && m.entries[idx].key == key {
		// Update existing entry
		m.size -= len(m.entries[idx].value)
		m.entries[idx].value = value
		m.entries[idx].tombstone = false
		m.size += len(value)
		return
	}

	// Insert at the correct sorted position
	entry := memEntry{key: key, value: value}
	m.entries = append(m.entries, memEntry{})
	copy(m.entries[idx+1:], m.entries[idx:])
	m.entries[idx] = entry
	m.size += len(key) + len(value) + 1
}

When the memtable exceeds 4MB (the same default LevelDB uses), it’s flushed to an SSTable. The Entries() method returns the already-sorted slice directly, so no extra sort step is needed.
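Delete and Get aren't shown above. Here are sketches consistent with Put — the field names come from the snippet, though the real source may differ slightly:

```go
package main

import (
	"fmt"
	"sort"
)

// Minimal stand-ins matching the snippet's field names.
type memEntry struct {
	key       string
	value     []byte
	tombstone bool
}

type Memtable struct {
	entries []memEntry
	size    int
}

// search returns the insertion index for key (first entry >= key).
func (m *Memtable) search(key string) int {
	return sort.Search(len(m.entries), func(i int) bool {
		return m.entries[i].key >= key
	})
}

// Delete inserts a tombstone rather than removing the entry, so the
// deletion shadows any older value still sitting in an SSTable.
func (m *Memtable) Delete(key string) {
	idx := m.search(key)
	if idx < len(m.entries) && m.entries[idx].key == key {
		m.size -= len(m.entries[idx].value)
		m.entries[idx].value = nil
		m.entries[idx].tombstone = true
		return
	}
	m.entries = append(m.entries, memEntry{})
	copy(m.entries[idx+1:], m.entries[idx:])
	m.entries[idx] = memEntry{key: key, tombstone: true}
	m.size += len(key) + 1
}

// Get reports found=true for tombstones too (with a nil value), so the
// caller can stop searching older SSTables.
func (m *Memtable) Get(key string) ([]byte, bool) {
	idx := m.search(key)
	if idx >= len(m.entries) || m.entries[idx].key != key {
		return nil, false
	}
	if m.entries[idx].tombstone {
		return nil, true
	}
	return m.entries[idx].value, true
}

func main() {
	m := &Memtable{}
	m.Delete("a") // delete a key that was never in the memtable
	v, found := m.Get("a")
	fmt.Println(found, v == nil) // found via tombstone, nil value
}
```

Deleting a key that was never written still inserts a tombstone: the key might exist in an SSTable on disk, and the tombstone must shadow it.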


The SSTable

SSTables (Sorted String Tables) are immutable, sorted files on disk. Once written, they never change. Updates and deletes create new SSTables, and compaction merges old ones. Ilya Grigorik’s 2012 post has a good explanation of the LevelDB SSTable format; ours is simpler.

On-disk format

+------------------+
| Data Block       |  sorted key-value entries
|   Key1 -> Val1   |
|   Key2 -> Val2   |
|   ...            |
+------------------+
| Index Block      |  key -> offset pairs for binary search
|   Key1 -> 0      |
|   Key2 -> 284    |
|   ...            |
+------------------+
| Bloom Filter     |  serialized bloom filter bytes
+------------------+
| Footer (28B)     |
|   IndexOffset(8) |
|   IndexCount(4)  |
|   BloomOffset(8) |
|   BloomSize(4)   |
|   Magic(4)       |
+------------------+

The footer lives at the end of the file. A single seek to the last 28 bytes tells you where everything else is. The magic number (0x4C534D54, “LSMT” in ASCII) validates the file is a legitimate SSTable. This is how we detect incomplete SSTables after a crash during flush.

On startup, loadSSTables tries to open every .sst file in the data directory. If a file fails to open (bad magic, truncated footer, or any other corruption), it gets deleted. This is safe because the WAL still holds all the data from that flush. WAL replay reconstructs the memtable, and the next flush writes a clean SSTable.

The writer builds the bloom filter as it writes entries, then appends the index block, the serialized bloom filter, and the footer:

func WriteSSTable(path string, entries []SSTableEntry) error {
	f, err := os.Create(path)
	if err != nil {
		return fmt.Errorf("sstable create: %w", err)
	}
	defer f.Close()

	bloom := NewBloomFilter(len(entries), 0.01)
	for _, e := range entries {
		bloom.Add([]byte(e.Key))
	}

	var index []indexEntry
	offset := int64(0)

	for _, e := range entries {
		index = append(index, indexEntry{Key: e.Key, Offset: offset})
		// ... write key_len + key + value_len + value + tombstone byte
		offset += int64(entrySize)
	}

	// Write index, bloom filter, and footer
	// ... (see full source for details)

	return f.Sync()
}

I’ve trimmed the encoding details for readability. The full source has the complete byte-level implementation.

Reading an SSTable

The reader opens the file, reads the footer, loads the index and bloom filter into memory, and serves point lookups via bloom filter check → binary search → disk read:

func (r *SSTableReader) Get(key string) ([]byte, bool, bool) {
	// Fast path: check bloom filter first
	if !r.bloom.MayContain([]byte(key)) {
		return nil, false, false
	}

	// Binary search the in-memory index
	idx := sort.Search(len(r.index), func(i int) bool {
		return r.index[i].Key >= key
	})
	if idx >= len(r.index) || r.index[idx].Key != key {
		return nil, false, false // bloom filter false positive
	}

	return r.readEntry(r.index[idx].Offset)
}

The bloom filter check is the key optimization. Without it, every SSTable Get requires a binary search of the index. With it, we skip SSTables that definitely don’t contain the key, which is most of them.


The Bloom Filter

A bloom filter is a probabilistic data structure that answers “is this key possibly in this set?” False positives are possible (it might say “maybe” when the key isn’t present), but false negatives are not (if it says “no”, the key is definitely absent).

The math

Given n expected items and a desired false positive rate p:

Bit array size:     m = -n * ln(p) / (ln2)²
Hash functions:     k = (m/n) * ln2

At 10 bits per key with 7 hash functions, you get ~1% false positive rate. Eli Bendersky’s post has a clean explanation of the derivation.
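To sanity-check those formulas, here's the arithmetic for n = 1,000 keys at p = 0.01 (helper name is mine):

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams applies the two sizing formulas above:
//   m = -n * ln(p) / (ln 2)^2    (bit array size)
//   k = (m / n) * ln 2           (hash function count)
func bloomParams(n int, p float64) (m int, k int) {
	ln2 := math.Ln2
	mf := -float64(n) * math.Log(p) / (ln2 * ln2)
	m = int(math.Ceil(mf))
	k = int(math.Round(float64(m) / float64(n) * ln2))
	return m, k
}

func main() {
	m, k := bloomParams(1000, 0.01)
	fmt.Printf("m = %d bits (%.1f bits/key), k = %d hashes\n",
		m, float64(m)/1000, k)
}
```

This lands at roughly 9.6 bits per key and k = 7, which is where the "10 bits per key, 7 hash functions" rule of thumb comes from.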

Double hashing

Instead of computing k independent hash functions, we use the Kirsch-Mitzenmacher optimization: split a single 64-bit FNV-1a hash into two 32-bit halves, then derive k positions as h(i) = h1 + i * h2 mod m:

func (bf *BloomFilter) hash(key []byte) (uint32, uint32) {
	h := fnv.New64a()
	h.Write(key)
	sum := h.Sum64()
	h1 := uint32(sum)       // lower 32 bits
	h2 := uint32(sum >> 32) // upper 32 bits
	if h2 == 0 {
		h2 = 1 // avoid degenerate case
	}
	return h1, h2
}

func (bf *BloomFilter) MayContain(key []byte) bool {
	h1, h2 := bf.hash(key)
	for i := uint32(0); i < bf.numHash; i++ {
		pos := (h1 + i*h2) % bf.numBits
		if bf.bits[pos/8]&(1<<(pos%8)) == 0 {
			return false
		}
	}
	return true
}

Our tests measured a false positive rate of 0.99%, right on the 1% target:

=== RUN   TestBloomFilterFalsePositiveRate
    db_test.go:219: Bloom filter false positive rate: 0.9920% (target: 1%)
--- PASS: TestBloomFilterFalsePositiveRate (0.01s)

The bloom filter is serialized into each SSTable (numBits + numHash + bit array) and deserialized when the SSTable is opened. This adds a few hundred bytes per SSTable but saves potentially thousands of unnecessary disk reads.


Compaction

Without compaction, every flush adds another SSTable. Reads get slower because they have to check more files. Compaction merges multiple SSTables into one, deduplicating keys and removing tombstones.

I implemented the simplest possible strategy: full-rewrite compaction. When 4 or more level-0 SSTables accumulate, I merge all of them into a single new output file and delete every old one. There are no levels, no tiers. It’s a brute-force approach, but it keeps the code simple and means I never have to worry about tombstones shadowing keys in unmerged files. Production engines like RocksDB default to leveled compaction, which has better read amplification but higher write amplification.

The merge is a k-way merge of sorted inputs. The SSTables are passed to the merge newest-first, and kWayMerge resolves duplicate keys by picking the entry from the lowest index (the newest SSTable):

func kWayMerge(sets [][]SSTableEntry) []SSTableEntry {
	positions := make([]int, len(sets))
	var result []SSTableEntry

	for {
		minKey := ""
		minSet := -1
		for i, pos := range positions {
			if pos >= len(sets[i]) {
				continue
			}
			key := sets[i][pos].Key
			if minSet == -1 || key < minKey {
				minKey = key
				minSet = i
			}
		}
		if minSet == -1 {
			break
		}

		var winner SSTableEntry
		winnerIdx := -1
		for i, pos := range positions {
			if pos >= len(sets[i]) {
				continue
			}
			if sets[i][pos].Key == minKey {
				if winnerIdx == -1 || i < winnerIdx {
					winner = sets[i][pos]
					winnerIdx = i
				}
				positions[i]++
			}
		}
		result = append(result, winner)
	}
	return result
}

After the merge, tombstones are removed. This is only safe because our compaction merges all SSTables into one output file and deletes every old file. If we were merging a subset, a tombstone removal could resurface a deleted key from an older, unmerged SSTable. Production engines track this carefully with version sets and reference counts; ours takes the simple path of always merging everything at once.
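The tombstone sweep after the merge is a one-pass filter. A sketch, with the SSTableEntry fields assumed from the other snippets:

```go
package main

import "fmt"

type SSTableEntry struct {
	Key       string
	Value     []byte
	Tombstone bool
}

// dropTombstones removes deletion markers after a full merge. Safe only
// because the merge covered every SSTable — no older, unmerged file can
// resurrect the deleted key afterwards.
func dropTombstones(entries []SSTableEntry) []SSTableEntry {
	out := entries[:0] // reuse the backing array; input is consumed
	for _, e := range entries {
		if !e.Tombstone {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	merged := []SSTableEntry{
		{Key: "a", Value: []byte("1")},
		{Key: "b", Tombstone: true}, // deleted key: drop it for good
		{Key: "c", Value: []byte("3")},
	}
	for _, e := range dropTombstones(merged) {
		fmt.Println(e.Key)
	}
}
```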


Putting It Together

The DB struct ties everything together. Open loads existing SSTables, replays the WAL for crash recovery, then opens a fresh WAL for new writes:

func Open(dir string) (*DB, error) {
	if err := os.MkdirAll(dir, 0755); err != nil {
		return nil, fmt.Errorf("db mkdir: %w", err)
	}

	db := &DB{
		dir:     dir,
		mem:     NewMemtable(DefaultMemtableSize),
		nextSeq: 1,
	}

	// Load existing SSTables
	if err := db.loadSSTables(); err != nil {
		return nil, fmt.Errorf("db load sstables: %w", err)
	}

	// Replay WAL into memtable for crash recovery
	walPath := filepath.Join(dir, "wal")
	entries, err := Replay(walPath)
	if err != nil {
		return nil, fmt.Errorf("db replay wal: %w", err)
	}
	for _, e := range entries {
		switch e.Op {
		case OpPut:
			db.mem.Put(string(e.Key), e.Value)
		case OpDelete:
			db.mem.Delete(string(e.Key))
		}
	}

	wal, err := OpenWAL(walPath)
	if err != nil {
		return nil, fmt.Errorf("db open wal: %w", err)
	}
	db.wal = wal
	return db, nil
}

The write path is two steps: WAL append (durable), then memtable insert (fast). If the memtable is full, flush to SSTable:

func (db *DB) Put(key string, value []byte) error {
	if err := db.wal.Append(WALEntry{Op: OpPut, Key: []byte(key), Value: value}); err != nil {
		return err
	}
	db.mem.Put(key, value)
	if db.mem.IsFull() {
		return db.flush()
	}
	return nil
}

The read path checks memtable first, then SSTables newest-to-oldest. A tombstone in the memtable or a newer SSTable shadows older values:

func (db *DB) Get(key string) ([]byte, error) {
	if val, found := db.mem.Get(key); found {
		if val == nil {
			return nil, ErrKeyNotFound // tombstone
		}
		return val, nil
	}

	for _, sst := range db.sstables {
		val, tombstone, found := sst.Get(key)
		if found {
			if tombstone {
				return nil, ErrKeyNotFound
			}
			return val, nil
		}
	}

	return nil, ErrKeyNotFound
}

That’s the entire engine. ~1,000 lines of Go, no dependencies outside the standard library.


Crash Recovery: The Whole Point

This is the payoff. The whole reason for the WAL, the CRC checksums, the fsync-per-write discipline. It all comes down to this: can we crash the process and get our data back?

I wrote a crash test that runs as a two-phase program. Phase 1 writes 500 key-value pairs then calls os.Exit(1) with no Close(), no cleanup, no graceful shutdown. Phase 2 reopens the database and checks if all 500 keys survived:

// Phase 1: write and crash
for i := 0; i < 500; i++ {
	key := fmt.Sprintf("crash-key-%06d", i)
	val := fmt.Sprintf("crash-val-%06d", i)
	if err := db.Put(key, []byte(val)); err != nil {
		os.Exit(2)
	}
}
fmt.Println("Wrote 500 keys. Crashing NOW (os.Exit(1), no Close)...")
os.Exit(1) // no db.Close()!

Here’s the actual output:

$ go run ./cmd/crashtest/
=== LSM Crash Recovery Test ===
Database directory: /var/folders/mf/tpmn_6611m3g0mzwyr_w0qpm0000gn/T/lsm-crash-2469007956

Phase 1: Writing 500 keys then crashing (no clean shutdown)...
Wrote 500 keys. Crashing NOW (os.Exit(1), no Close)...
exit status 1
  Child exited with: exit status 1 (expected)

Phase 2: Reopening database and verifying all 500 keys...
  Recovered: 500/500
  Missing:   0/500
  Corrupt:   0/500

PASS: All 500 keys recovered after crash!

500 out of 500. Every committed write survived because each one was fsync’d to the WAL before the Put returned. On recovery, Open replays the WAL into a fresh memtable, and the engine picks up where it left off.

This is the same durability model that LevelDB and RocksDB use when sync is enabled. The WAL is the source of truth until its entries are safely flushed to SSTables.


Running It Yourself

The code uses only the Go standard library. No go get required. Clone and run:

git clone https://github.com/devesh-shetty/lsm-engine.git
cd lsm-engine
go build ./...
go test ./... -v

All 14 tests should pass:

=== RUN   TestPutGet
--- PASS: TestPutGet (0.01s)
=== RUN   TestGetMissing
--- PASS: TestGetMissing (0.00s)
=== RUN   TestDelete
--- PASS: TestDelete (0.02s)
=== RUN   TestOverwrite
--- PASS: TestOverwrite (0.02s)
=== RUN   TestPersistence
--- PASS: TestPersistence (0.46s)
=== RUN   TestCrashRecovery
--- PASS: TestCrashRecovery (0.02s)
=== RUN   TestWALCorruptedTail
--- PASS: TestWALCorruptedTail (0.01s)
=== RUN   TestBloomFilterFalsePositiveRate
    db_test.go:219: Bloom filter false positive rate: 0.9920% (target: 1%)
--- PASS: TestBloomFilterFalsePositiveRate (0.01s)
=== RUN   TestBloomSerialize
--- PASS: TestBloomSerialize (0.00s)
=== RUN   TestSSTableRoundtrip
--- PASS: TestSSTableRoundtrip (0.01s)
=== RUN   TestCompaction
--- PASS: TestCompaction (0.02s)
=== RUN   TestL0L1Compaction
--- PASS: TestL0L1Compaction (1.36s)
=== RUN   TestCrashDuringFlush
--- PASS: TestCrashDuringFlush (0.01s)
=== RUN   TestLargeWorkload
--- PASS: TestLargeWorkload (62.99s)
PASS
ok  	github.com/devesh-shetty/lsm-engine	65.664s

Run the demo to see write/read/delete performance and crash recovery:

go run ./cmd/demo/
Writing 1000 key-value pairs...
  Done in 4.250478042s

Reading 1000 keys and verifying...
  Done in 338.875µs (0 errors)

Deleting 500 keys (even-numbered)...
  Done in 2.043538375s

Verifying deletes...
  Deleted keys confirmed gone: 500/500
  Present keys confirmed:      500/500

Database Stats:
  SSTables:         0
  WAL size:         70500 bytes
  Memtable entries: 1000
  Memtable size:    26000 bytes

Timing Summary:
  Write 1000 keys:  4.250478042s (235 ops/sec)
  Read 1000 keys:   338.875µs (2950941 ops/sec)
  Delete 500 keys:  2.043538375s (245 ops/sec)

Closing and reopening database to test recovery...
  Keys surviving reopen: 500/500

Run the crash test:

go run ./cmd/crashtest/

Benchmarks

On an Apple M3 Pro:

Operation            Latency       Throughput         Allocs/op
-----------------    ----------    ----------------   ---------
Sequential Writes    ~4.1 ms/op    ~244 ops/sec       8
Random Reads         ~9.8 µs/op    ~102,000 ops/sec   3
Mixed (50/50 R/W)    ~2.2 ms/op    ~454 ops/sec       6

Write latency is dominated by fsync on every WAL append. This is the cost of crash safety. Production engines batch writes and sync periodically to amortize this, a topic for another post.


What I Didn’t Build

This is a toy engine. I think it’s useful for learning, but it’s missing everything you’d need for production. Here’s what RocksDB builds on top of the same foundations:

  1. Leveled compaction - I do a full rewrite every time compaction triggers, merging everything into one file. Leveled compaction organizes SSTables into levels with exponentially increasing size targets (10MB, 100MB, 1GB…) for better read amplification and lower space overhead.
  2. Compression - Snappy, LZ4, or zstd compression of data blocks. Typically 2-4x space savings.
  3. Block cache - LRU cache of frequently accessed SSTable blocks in memory.
  4. Concurrent compaction - background goroutine compacting without blocking reads and writes.
  5. MANIFEST / version tracking - atomic tracking of which SSTables are live. Our approach of scanning the directory is fragile.
  6. Prefix compression - adjacent keys in a data block often share prefixes; delta encoding saves space.
  7. Rate limiting - throttle compaction I/O to avoid starving foreground reads. I wrote about rate limiting approaches in a previous post.
  8. Write batching - group multiple writes into a single WAL entry and a single fsync. This is how production engines get from 244 ops/sec to hundreds of thousands.
  9. Snapshots / MVCC - consistent reads at a point in time.
  10. Column families - multiple logical databases sharing one physical engine.

I also skipped directory fsync after creating SSTable files. In a production engine, you’d fsync the parent directory to ensure the file’s directory entry is durable. Without this, a crash could lose the file even though its data was synced. I mention this because it’s a commonly missed step and a real source of data loss bugs.


What I Learned

fsync is the bottleneck, and everything else is designed around it. Our 244 writes/sec is entirely fsync-limited. The in-memory operations (memtable insert, bloom filter check) take microseconds. The disk sync takes milliseconds. Every production optimization (write batching, group commit, WAL recycling) exists to reduce the number of fsyncs per write.

The LSM design is elegant because it separates the durability concern from the indexing concern. The WAL handles durability (sequential writes, fsync). The memtable and SSTables handle indexing (sorted structures, binary search, bloom filters). These two concerns are completely decoupled, which is why you can swap out any component without touching the others.

Crash recovery is surprisingly simple when the WAL is correct. I expected recovery to be the hardest part. It wasn’t. The CRC checksums and fsync discipline do all the work. Replay is under 30 lines. The complexity in production engines comes from concurrent operations and partial flushes, not from the recovery logic itself.

Bloom filters are underappreciated. A few hundred bytes per SSTable eliminates the vast majority of unnecessary disk reads. The math is clean, the implementation is simple, and the payoff is enormous. I think every systems programmer should implement one at least once.

Building this gave me a much better mental model for database internals. When I see RocksDB tuning knobs like write_buffer_size, level0_file_num_compaction_trigger, or bloom_locality, I now understand what they’re actually controlling. That’s the real value of building from scratch.