Building a Storage Engine: WAL, LSM Trees, and SSTables from Scratch
Estimated reading time: 19-24 minutes | ~3,800 words
I built a key-value storage engine from scratch in Go to understand how databases like LevelDB and RocksDB actually work under the hood. The whole thing is ~1,000 lines, uses only the Go standard library, and you can clone it, run it, and crash-test it on your own machine.
This post walks through every layer: the write-ahead log, the memtable, the SSTable format, bloom filters, and compaction. Real code, real terminal output. By the end, you’ll have a storage engine that survives an unclean shutdown and recovers every committed write.
TL;DR
- A write-ahead log (WAL) with CRC32 checksums and fsync per write provides best-effort durability. On macOS, Go's file.Sync() uses F_FULLFSYNC, which requests a flush of device write caches (though Apple notes some devices may ignore the request).
- The memtable is a sorted in-memory buffer; when it fills up, it flushes to an immutable SSTable on disk.
- SSTables store sorted key-value pairs with an index block for binary search and a bloom filter to skip unnecessary disk reads.
- Bloom filters deliver ~1% false positive rates with 10 bits per key and 7 hash functions. We measured 0.99% in our tests.
- Compaction merges SSTables via k-way merge sort, deduplicates keys, and removes tombstones.
- Crash recovery works: 500/500 keys recovered after os.Exit(1) with no Close() call.
Table of Contents
- TL;DR
- Why Build a Storage Engine?
- The Architecture
- The Write-Ahead Log
- The Memtable
- The SSTable
- The Bloom Filter
- Compaction
- Putting It Together
- Crash Recovery: The Whole Point
- Running It Yourself
- What I Didn’t Build
- What I Learned
Why Build a Storage Engine?
The log-structured merge-tree (LSM tree) is arguably the most important data structure in modern databases. Patrick O’Neil’s 1996 paper introduced the idea, Google’s Bigtable paper in 2006 proved it worked at planet scale, and Jeff Dean and Sanjay Ghemawat open-sourced LevelDB in 2011 to give the rest of us a reference implementation.
Today, LSM trees power Cassandra, CockroachDB (via Pebble), TiKV, ScyllaDB, and InfluxDB. Facebook forked LevelDB into RocksDB in 2012, adding features like column families, bloom filters per SSTable, and universal compaction.
The core insight is simple: traditional B-tree databases optimize for reads by keeping data sorted on disk, but every write requires random I/O to find the right page and update it in place. LSM trees flip the trade-off: writes are always sequential (append to a log, insert into memory), which is fast on every storage medium. The cost is paid on reads, which may need to check multiple files. Compaction amortizes that cost over time.
I wanted to understand this trade-off by building it. Not by reading about it, but by writing every byte to disk myself and verifying I could crash the process and get my data back.
The Architecture
Here’s the full system at a glance:
Write Path Read Path
========== =========
Put(k, v) ──┐ Get(k) ──┐
│ │
v v
┌─────────────┐ ┌─────────────┐
│ WAL append │ │ Memtable │──found──> return
│ + fsync │ │ lookup │
└──────┬──────┘ └──────┬──────┘
│ │ not found
v v
┌─────────────┐ ┌──────────────────────────┐
│ Memtable │ │ SSTables (newest first) │
│ insert │ │ │
└──────┬──────┘ │ For each SSTable: │
│ │ 1. Bloom filter check │
v │ 2. Index binary search │
┌─────────────┐ │ 3. Read data entry │
│ Size check │ └──────────────────────────┘
│ > 4MB? │
└──────┬──────┘
yes │
v
┌─────────────┐
│ Flush to │
│ SSTable │
└──────┬──────┘
│
v
┌─────────────┐
│ Compaction │
│ (if needed) │
└─────────────┘
Write path: Every write first appends to the WAL (and fsyncs), then inserts into the memtable. When the memtable exceeds 4MB, it flushes to a sorted SSTable on disk, the WAL resets, and compaction runs if there are too many SSTables.
Read path: Check the memtable first (most recent data). If not found, check SSTables from newest to oldest. For each SSTable, the bloom filter rejects keys that definitely aren’t present, and binary search on the index finds keys that are.
The Write-Ahead Log
The WAL exists for exactly one reason: durability. Before a write touches the in-memory memtable, it must be on disk in the WAL. If the process crashes, the WAL is replayed to reconstruct the memtable.
Entry format
Each WAL record is self-describing. You can read it without knowing anything about the rest of the file:
+----------+----------+--------+----------+---------+----------+---------+
| Size(4B) | CRC(4B) | Op(1B) | KLen(4B) | Key | VLen(4B) | Value |
+----------+----------+--------+----------+---------+----------+---------+
The Size field tells you how many bytes follow. The CRC-32 checksum covers everything after the CRC field: op type, key length, key, value length, value. If the checksum doesn’t match during replay, the record is corrupt (probably a partial write from a crash) and we stop replaying.
Here’s the core of the WAL append, roughly 30 lines that do the durability work:
func (w *WAL) Append(entry WALEntry) error {
// Build the payload: op + key_len + key + val_len + val
payloadSize := 1 + 4 + len(entry.Key) + 4 + len(entry.Value)
payload := make([]byte, payloadSize)
off := 0
payload[off] = byte(entry.Op)
off++
binary.LittleEndian.PutUint32(payload[off:], uint32(len(entry.Key)))
off += 4
copy(payload[off:], entry.Key)
off += len(entry.Key)
binary.LittleEndian.PutUint32(payload[off:], uint32(len(entry.Value)))
off += 4
copy(payload[off:], entry.Value)
// Compute CRC over the payload
checksum := crc32.ChecksumIEEE(payload)
// Build the full record: length + CRC + payload
record := make([]byte, 4+4+payloadSize)
binary.LittleEndian.PutUint32(record[0:4], uint32(payloadSize))
binary.LittleEndian.PutUint32(record[4:8], checksum)
copy(record[8:], payload)
n, err := w.file.Write(record)
if err != nil {
return fmt.Errorf("wal write: %w", err)
}
if n != len(record) {
return fmt.Errorf("wal short write: wrote %d of %d bytes", n, len(record))
}
// fsync ensures durability. On macOS this uses F_FULLFSYNC
if err := w.file.Sync(); err != nil {
return fmt.Errorf("wal sync: %w", err)
}
return nil
}
The key line is w.file.Sync(). On macOS, Go’s os.File.Sync() calls fcntl(fd, F_FULLFSYNC), which asks the device to flush its write cache to stable storage. If F_FULLFSYNC isn’t supported (e.g. on network mounts), Go falls back to fsync(). Apple documents that some devices may ignore the flush request, so this is best-effort rather than an absolute guarantee. Still, it’s the strongest durability primitive available on macOS, and it’s why writes are slow (~244 ops/sec in our benchmarks) but crash-safe.
Replay and corruption tolerance
On startup, Replay reads WAL entries sequentially, verifying each checksum. The moment a CRC mismatch or partial read occurs, it stops:
func Replay(path string) ([]WALEntry, error) {
f, err := os.Open(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("wal replay open: %w", err)
}
defer f.Close()
var entries []WALEntry
header := make([]byte, 8) // length + CRC
for {
if _, err := io.ReadFull(f, header); err != nil {
break // EOF or partial header - done
}
length := binary.LittleEndian.Uint32(header[0:4])
storedCRC := binary.LittleEndian.Uint32(header[4:8])
if length > 64*1024*1024 {
break // sanity check
}
payload := make([]byte, length)
if _, err := io.ReadFull(f, payload); err != nil {
break // partial payload - entry wasn't fully written
}
if crc32.ChecksumIEEE(payload) != storedCRC {
break // corrupted entry - stop here
}
entry, err := decodePayload(payload)
if err != nil {
break
}
entries = append(entries, entry)
}
return entries, nil
}
This is safe because the WAL is append-only and sequential. A corrupt record means the process crashed mid-write. Everything after it is suspect. Everything before it was fsync’d and is trustworthy.
The Memtable
The memtable is a sorted in-memory buffer. I used the simplest possible approach: a sorted slice with sort.Search for binary search lookups and insertion-sort for writes.
Production engines like LevelDB and RocksDB use skip lists here because they offer O(log n) inserts without the memory-shuffling cost of slice insertion. But for a tutorial, a sorted slice is clearer and the performance is fine. Our reads hit ~2.9 million ops/sec from the memtable.
The key design detail: deletions insert a tombstone rather than removing the entry. This matters because older SSTables on disk might still contain the key. The tombstone in the memtable (or a later SSTable) shadows those older values.
func (m *Memtable) Put(key string, value []byte) {
idx := m.search(key)
if idx < len(m.entries) && m.entries[idx].key == key {
// Update existing entry
m.size -= len(m.entries[idx].value)
m.entries[idx].value = value
m.entries[idx].tombstone = false
m.size += len(value)
return
}
// Insert at the correct sorted position
entry := memEntry{key: key, value: value}
m.entries = append(m.entries, memEntry{})
copy(m.entries[idx+1:], m.entries[idx:])
m.entries[idx] = entry
m.size += len(key) + len(value) + 1
}
When the memtable exceeds 4MB (the same default LevelDB uses), it’s flushed to an SSTable. The Entries() method returns the already-sorted slice directly, so no extra sort step is needed.
The SSTable
SSTables (Sorted String Tables) are immutable, sorted files on disk. Once written, they never change. Updates and deletes create new SSTables, and compaction merges old ones. Ilya Grigorik’s 2012 post has a good explanation of the LevelDB SSTable format; ours is simpler.
On-disk format
+------------------+
| Data Block | sorted key-value entries
| Key1 -> Val1 |
| Key2 -> Val2 |
| ... |
+------------------+
| Index Block | key -> offset pairs for binary search
| Key1 -> 0 |
| Key2 -> 284 |
| ... |
+------------------+
| Bloom Filter | serialized bloom filter bytes
+------------------+
| Footer (28B) |
| IndexOffset(8) |
| IndexCount(4) |
| BloomOffset(8) |
| BloomSize(4) |
| Magic(4) |
+------------------+
The footer lives at the end of the file. A single seek to the last 28 bytes tells you where everything else is. The magic number (0x4C534D54, “LSMT” in ASCII) validates the file is a legitimate SSTable. This is how we detect incomplete SSTables after a crash during flush.
On startup, loadSSTables tries to open every .sst file in the data directory. If a file fails to open (bad magic, truncated footer, or any other corruption), it gets deleted. This is safe because the WAL still holds all the data from that flush. WAL replay reconstructs the memtable, and the next flush writes a clean SSTable.
The writer builds the bloom filter as it writes entries, then appends the index block, the serialized bloom filter, and the footer:
func WriteSSTable(path string, entries []SSTableEntry) error {
f, err := os.Create(path)
if err != nil {
return fmt.Errorf("sstable create: %w", err)
}
defer f.Close()
bloom := NewBloomFilter(len(entries), 0.01)
for _, e := range entries {
bloom.Add([]byte(e.Key))
}
var index []indexEntry
offset := int64(0)
for _, e := range entries {
index = append(index, indexEntry{Key: e.Key, Offset: offset})
// ... write key_len + key + value_len + value + tombstone byte
offset += int64(entrySize)
}
// Write index, bloom filter, and footer
// ... (see full source for details)
return f.Sync()
}
I’ve trimmed the encoding details for readability. The full source has the complete byte-level implementation.
Reading an SSTable
The reader opens the file, reads the footer, loads the index and bloom filter into memory, and serves point lookups via bloom filter check → binary search → disk read:
func (r *SSTableReader) Get(key string) ([]byte, bool, bool) {
// Fast path: check bloom filter first
if !r.bloom.MayContain([]byte(key)) {
return nil, false, false
}
// Binary search the in-memory index
idx := sort.Search(len(r.index), func(i int) bool {
return r.index[i].Key >= key
})
if idx >= len(r.index) || r.index[idx].Key != key {
return nil, false, false // bloom filter false positive
}
return r.readEntry(r.index[idx].Offset)
}
The bloom filter check is the key optimization. Without it, every SSTable Get requires a binary search of the index. With it, we skip SSTables that definitely don’t contain the key, which is most of them.
The Bloom Filter
A bloom filter is a probabilistic data structure that answers “is this key possibly in this set?” False positives are possible (it might say “maybe” when the key isn’t present), but false negatives are not (if it says “no”, the key is definitely absent).
The math
Given n expected items and a desired false positive rate p:
Bit array size: m = -n * ln(p) / (ln2)²
Hash functions: k = (m/n) * ln2
At 10 bits per key with 7 hash functions, you get ~1% false positive rate. Eli Bendersky’s post has a clean explanation of the derivation.
Double hashing
Instead of computing k independent hash functions, we use the Kirsch-Mitzenmacher optimization: split a single 64-bit FNV-1a hash into two 32-bit halves, then derive k positions as h(i) = h1 + i * h2 mod m:
func (bf *BloomFilter) hash(key []byte) (uint32, uint32) {
h := fnv.New64a()
h.Write(key)
sum := h.Sum64()
h1 := uint32(sum) // lower 32 bits
h2 := uint32(sum >> 32) // upper 32 bits
if h2 == 0 {
h2 = 1 // avoid degenerate case
}
return h1, h2
}
func (bf *BloomFilter) MayContain(key []byte) bool {
h1, h2 := bf.hash(key)
for i := uint32(0); i < bf.numHash; i++ {
pos := (h1 + i*h2) % bf.numBits
if bf.bits[pos/8]&(1<<(pos%8)) == 0 {
return false
}
}
return true
}
Our tests measured a false positive rate of 0.99%, right on the 1% target:
=== RUN TestBloomFilterFalsePositiveRate
db_test.go:219: Bloom filter false positive rate: 0.9920% (target: 1%)
--- PASS: TestBloomFilterFalsePositiveRate (0.01s)
The bloom filter is serialized into each SSTable (numBits + numHash + bit array) and deserialized when the SSTable is opened. This adds a few hundred bytes per SSTable but saves potentially thousands of unnecessary disk reads.
Compaction
Without compaction, every flush adds another SSTable. Reads get slower because they have to check more files. Compaction merges multiple SSTables into one, deduplicating keys and removing tombstones.
I implemented the simplest possible strategy: full-rewrite compaction. When 4 or more level-0 SSTables accumulate, I merge all of them into a single new output file and delete every old one. There are no levels, no tiers. It’s a brute-force approach, but it keeps the code simple and means I never have to worry about tombstones shadowing keys in unmerged files. Production engines like RocksDB default to leveled compaction which has better read amplification but higher write amplification.
The merge is a k-way merge of sorted inputs. The SSTables are passed to the merge newest-first, and kWayMerge resolves duplicate keys by picking the entry from the lowest index (the newest SSTable):
func kWayMerge(sets [][]SSTableEntry) []SSTableEntry {
positions := make([]int, len(sets))
var result []SSTableEntry
for {
minKey := ""
minSet := -1
for i, pos := range positions {
if pos >= len(sets[i]) {
continue
}
key := sets[i][pos].Key
if minSet == -1 || key < minKey {
minKey = key
minSet = i
}
}
if minSet == -1 {
break
}
var winner SSTableEntry
winnerIdx := -1
for i, pos := range positions {
if pos >= len(sets[i]) {
continue
}
if sets[i][pos].Key == minKey {
if winnerIdx == -1 || i < winnerIdx {
winner = sets[i][pos]
winnerIdx = i
}
positions[i]++
}
}
result = append(result, winner)
}
return result
}
After the merge, tombstones are removed. This is only safe because our compaction merges all SSTables into one output file and deletes every old file. If we were merging a subset, a tombstone removal could resurface a deleted key from an older, unmerged SSTable. Production engines track this carefully with version sets and reference counts; ours takes the simple path of always merging everything at once.
Putting It Together
The DB struct ties everything together. Open loads existing SSTables, replays the WAL for crash recovery, then opens a fresh WAL for new writes:
func Open(dir string) (*DB, error) {
if err := os.MkdirAll(dir, 0755); err != nil {
return nil, fmt.Errorf("db mkdir: %w", err)
}
db := &DB{
dir: dir,
mem: NewMemtable(DefaultMemtableSize),
nextSeq: 1,
}
// Load existing SSTables
if err := db.loadSSTables(); err != nil {
return nil, fmt.Errorf("db load sstables: %w", err)
}
// Replay WAL into memtable for crash recovery
walPath := filepath.Join(dir, "wal")
entries, err := Replay(walPath)
if err != nil {
return nil, fmt.Errorf("db replay wal: %w", err)
}
for _, e := range entries {
switch e.Op {
case OpPut:
db.mem.Put(string(e.Key), e.Value)
case OpDelete:
db.mem.Delete(string(e.Key))
}
}
wal, err := OpenWAL(walPath)
if err != nil {
return nil, fmt.Errorf("db open wal: %w", err)
}
db.wal = wal
return db, nil
}
The write path is two steps: WAL append (durable), then memtable insert (fast). If the memtable is full, flush to SSTable:
func (db *DB) Put(key string, value []byte) error {
if err := db.wal.Append(WALEntry{Op: OpPut, Key: []byte(key), Value: value}); err != nil {
return err
}
db.mem.Put(key, value)
if db.mem.IsFull() {
return db.flush()
}
return nil
}
The read path checks memtable first, then SSTables newest-to-oldest. A tombstone in the memtable or a newer SSTable shadows older values:
func (db *DB) Get(key string) ([]byte, error) {
if val, found := db.mem.Get(key); found {
if val == nil {
return nil, ErrKeyNotFound // tombstone
}
return val, nil
}
for _, sst := range db.sstables {
val, tombstone, found := sst.Get(key)
if found {
if tombstone {
return nil, ErrKeyNotFound
}
return val, nil
}
}
return nil, ErrKeyNotFound
}
That’s the entire engine. ~1,000 lines of Go, no dependencies outside the standard library.
Crash Recovery: The Whole Point
This is the payoff. The whole reason for the WAL, the CRC checksums, the fsync-per-write discipline. It all comes down to this: can we crash the process and get our data back?
I wrote a crash test that runs as a two-phase program. Phase 1 writes 500 key-value pairs then calls os.Exit(1) with no Close(), no cleanup, no graceful shutdown. Phase 2 reopens the database and checks if all 500 keys survived:
// Phase 1: write and crash
for i := 0; i < 500; i++ {
key := fmt.Sprintf("crash-key-%06d", i)
val := fmt.Sprintf("crash-val-%06d", i)
if err := db.Put(key, []byte(val)); err != nil {
os.Exit(2)
}
}
fmt.Println("Wrote 500 keys. Crashing NOW (os.Exit(1), no Close)...")
os.Exit(1) // no db.Close()!
Here’s the actual output:
$ go run ./cmd/crashtest/
=== LSM Crash Recovery Test ===
Database directory: /var/folders/mf/tpmn_6611m3g0mzwyr_w0qpm0000gn/T/lsm-crash-2469007956
Phase 1: Writing 500 keys then crashing (no clean shutdown)...
Wrote 500 keys. Crashing NOW (os.Exit(1), no Close)...
exit status 1
Child exited with: exit status 1 (expected)
Phase 2: Reopening database and verifying all 500 keys...
Recovered: 500/500
Missing: 0/500
Corrupt: 0/500
PASS: All 500 keys recovered after crash!
500 out of 500. Every committed write survived because each one was fsync’d to the WAL before the Put returned. On recovery, Open replays the WAL into a fresh memtable, and the engine picks up where it left off.
This is the same durability model that LevelDB and RocksDB use when sync is enabled. The WAL is the source of truth until its entries are safely flushed to SSTables.
Running It Yourself
The code uses only the Go standard library. No go get required. Clone and run:
git clone https://github.com/devesh-shetty/lsm-engine.git
cd lsm-engine
go build ./...
go test ./... -v
All 14 tests should pass:
=== RUN TestPutGet
--- PASS: TestPutGet (0.01s)
=== RUN TestGetMissing
--- PASS: TestGetMissing (0.00s)
=== RUN TestDelete
--- PASS: TestDelete (0.02s)
=== RUN TestOverwrite
--- PASS: TestOverwrite (0.02s)
=== RUN TestPersistence
--- PASS: TestPersistence (0.46s)
=== RUN TestCrashRecovery
--- PASS: TestCrashRecovery (0.02s)
=== RUN TestWALCorruptedTail
--- PASS: TestWALCorruptedTail (0.01s)
=== RUN TestBloomFilterFalsePositiveRate
db_test.go:219: Bloom filter false positive rate: 0.9920% (target: 1%)
--- PASS: TestBloomFilterFalsePositiveRate (0.01s)
=== RUN TestBloomSerialize
--- PASS: TestBloomSerialize (0.00s)
=== RUN TestSSTableRoundtrip
--- PASS: TestSSTableRoundtrip (0.01s)
=== RUN TestCompaction
--- PASS: TestCompaction (0.02s)
=== RUN TestL0L1Compaction
--- PASS: TestL0L1Compaction (1.36s)
=== RUN TestCrashDuringFlush
--- PASS: TestCrashDuringFlush (0.01s)
=== RUN TestLargeWorkload
--- PASS: TestLargeWorkload (62.99s)
PASS
ok github.com/devesh-shetty/lsm-engine 65.664s
Run the demo to see write/read/delete performance and crash recovery:
go run ./cmd/demo/
Writing 1000 key-value pairs...
Done in 4.250478042s
Reading 1000 keys and verifying...
Done in 338.875µs (0 errors)
Deleting 500 keys (even-numbered)...
Done in 2.043538375s
Verifying deletes...
Deleted keys confirmed gone: 500/500
Present keys confirmed: 500/500
Database Stats:
SSTables: 0
WAL size: 70500 bytes
Memtable entries: 1000
Memtable size: 26000 bytes
Timing Summary:
Write 1000 keys: 4.250478042s (235 ops/sec)
Read 1000 keys: 338.875µs (2950941 ops/sec)
Delete 500 keys: 2.043538375s (245 ops/sec)
Closing and reopening database to test recovery...
Keys surviving reopen: 500/500
Run the crash test:
go run ./cmd/crashtest/
Benchmarks
On an Apple M3 Pro:
| Operation | Latency | Throughput | Allocs/op |
|---|---|---|---|
| Sequential Writes | ~4.1 ms/op | ~244 ops/sec | 8 |
| Random Reads | ~9.8 µs/op | ~102,000 ops/sec | 3 |
| Mixed (50/50 R/W) | ~2.2 ms/op | ~454 ops/sec | 6 |
Write latency is dominated by fsync on every WAL append. This is the cost of crash safety. Production engines batch writes and sync periodically to amortize this, a topic for another post.
What I Didn’t Build
This is a toy engine. I think it’s useful for learning, but it’s missing everything you’d need for production. Here’s what RocksDB builds on top of the same foundations:
- Leveled compaction - I do a full rewrite every time compaction triggers, merging everything into one file. Leveled compaction organizes SSTables into levels with exponentially increasing size targets (10MB, 100MB, 1GB…) for better read amplification and lower space overhead.
- Compression - Snappy, LZ4, or zstd compression of data blocks. Typically 2-4x space savings.
- Block cache - LRU cache of frequently accessed SSTable blocks in memory.
- Concurrent compaction - background goroutine compacting without blocking reads and writes.
- MANIFEST / version tracking - atomic tracking of which SSTables are live. Our approach of scanning the directory is fragile.
- Prefix compression - adjacent keys in a data block often share prefixes; delta encoding saves space.
- Rate limiting - throttle compaction I/O to avoid starving foreground reads. I wrote about rate limiting approaches in a previous post.
- Write batching - group multiple writes into a single WAL entry and a single fsync. This is how production engines get from 244 ops/sec to hundreds of thousands.
- Snapshots / MVCC - consistent reads at a point in time.
- Column families - multiple logical databases sharing one physical engine.
I also skipped directory fsync after creating SSTable files. In a production engine, you’d fsync the parent directory to ensure the file’s directory entry is durable. Without this, a crash could lose the file even though its data was synced. I mention this because it’s a commonly missed step and a real source of data loss bugs.
What I Learned
fsync is the bottleneck, and everything else is designed around it. Our 244 writes/sec is entirely fsync-limited. The in-memory operations (memtable insert, bloom filter check) take microseconds. The disk sync takes milliseconds. Every production optimization (write batching, group commit, WAL recycling) exists to reduce the number of fsyncs per write.
The LSM design is elegant because it separates the durability concern from the indexing concern. The WAL handles durability (sequential writes, fsync). The memtable and SSTables handle indexing (sorted structures, binary search, bloom filters). These two concerns are completely decoupled, which is why you can swap out any component without touching the others.
Crash recovery is surprisingly simple when the WAL is correct. I expected recovery to be the hardest part. It wasn’t. The CRC checksums and fsync discipline do all the work. Replay is under 30 lines. The complexity in production engines comes from concurrent operations and partial flushes, not from the recovery logic itself.
Bloom filters are underappreciated. A few hundred bytes per SSTable eliminates the vast majority of unnecessary disk reads. The math is clean, the implementation is simple, and the payoff is enormous. I think every systems programmer should implement one at least once.
Building this gave me a much better mental model for database internals. When I see RocksDB tuning knobs like write_buffer_size, level0_file_num_compaction_trigger, or bloom_locality, I now understand what they’re actually controlling. That’s the real value of building from scratch.