OLTP – Phase 9 Write-Ahead Log (WAL)

Everything we’ve built so far has a fatal flaw: if the power goes out, data is lost. The buffer pool keeps dirty pages in memory and flushes them to disk lazily. A crash before that flush means those changes vanish.

The write-ahead log solves this. Before changing any data page, we write a description of the change to a separate, sequential log file. If the database crashes, we replay the log on restart and reconstruct the lost changes. This guarantees durability — the “D” in ACID.

The key insight: writing to a sequential log file is cheap. It’s one append to one file, always at the end. Writing random data pages is expensive — they’re scattered across multiple files. So we make the cheap write first (WAL), and let the expensive writes (data pages) happen whenever convenient. If we crash before the data pages are written, the WAL has everything we need to recover.

In PostgreSQL, this is xlog.c, xlogrecord.h, and xlogrecovery.c.

Here is the roadmap for the phases to come:

  • Phase 9: Write-ahead log
  • Phase 10: Transactions, concurrency, REPL, and server

Full Source Code

The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.

The WAL Protocol

Every data modification follows this sequence:

1
2
3
4
5
6
7
8
9
1. Modify the page in the buffer pool (in memory only)
2. Write a WAL record describing the change
3. Set the page's LSN to the WAL record's LSN
4. Unpin the page as dirty

   ... time passes, buffer pool eventually needs to flush ...

5. Before writing any dirty page to disk, flush the WAL first
6. Write the dirty page to disk

Step 5 is the write-ahead rule: the WAL record must be durable on disk BEFORE the data page it describes. This guarantees that if a page is on disk, its WAL record is too. And if a page is NOT on disk, the WAL record still lets us reconstruct it.

Steps 1-4 are fast (in-memory + sequential append). Steps 5-6 happen lazily, batching many changes into fewer disk writes.


LSN — Log Sequence Number

Every WAL record gets a monotonically increasing LSN:

1
2
3
4
5
WAL file:
  [LSN=1: INSERT users page=0 slot=0]
  [LSN=2: INSERT users page=0 slot=1]
  [LSN=3: DELETE users page=0 slot=0]
  [LSN=4: INSERT users page=0 slot=2]

Every page stores the LSN of the last WAL record applied to it:

1
2
Page 0: LSN=4  (last change was at LSN 4)
Page 1: LSN=0  (never modified)

The page LSN is how recovery knows what’s already applied. If a page’s LSN is 3 and a WAL record has LSN 2, that change is already on the page — skip it.


WAL Record Format

Each record captures one change:

1
2
3
4
5
6
7
8
9
10
type LogRecord struct {
    LSN       LSN      // unique position in the WAL
    TxnID     uint64   // which transaction (0 for now — no txn support yet)
    Type      LogType  // INSERT, DELETE, UPDATE, COMMIT, ABORT, CHECKPOINT
    TableName string   // which table was modified
    PageID    uint32   // which page in that table
    SlotIndex uint16   // which slot on the page
    OldData   []byte   // previous tuple (for undo / DELETE)
    NewData   []byte   // new tuple (for redo / INSERT)
}

Wire format on disk (binary, little-endian):

1
2
3
4
5
6
[4 bytes: payload length]
[8: LSN] [8: TxnID] [1: Type]
[2: table name length] [table name]
[4: PageID] [2: SlotIndex]
[4: OldData length] [OldData]
[4: NewData length] [NewData]

The length prefix lets us read records sequentially without knowing their size in advance.


Page Header Change

The page header grew from 6 to 14 bytes to include the LSN:

1
2
3
4
5
6
7
8
9
10
Before (6 bytes):
  [0:2]  FreeSpaceStart
  [2:4]  FreeSpaceEnd
  [4:6]  ItemCount

After (14 bytes):
  [0:8]   PageLSN          ← new: tracks which WAL records are applied
  [8:10]  FreeSpaceStart
  [10:12] FreeSpaceEnd
  [12:14] ItemCount

This costs 8 bytes per page (4082 usable bytes instead of 4090), but enables idempotent recovery — we never apply the same change twice.


Recovery — What Happens on Restart

When the database opens, it reads the WAL file and replays any records whose changes weren’t flushed to data pages before the crash:

1
2
3
4
5
6
7
8
9
10
11
12
For each WAL record (in LSN order):
    1. Find the table
    2. Fetch the page from disk
    3. If page LSN >= record LSN → skip (already applied)
    4. If page LSN <  record LSN → apply the change:
         INSERT: insert the tuple data
         DELETE: mark the slot as deleted
         UPDATE: delete old slot, insert new data
    5. Set page LSN = record LSN
    6. Mark page dirty

After all records: flush all dirty pages to disk, truncate the WAL.

Because pages store their LSN, recovery is idempotent — running it twice produces the same result. This handles the case where some pages were flushed before the crash and others weren’t.


A Concrete Crash Scenario

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Timeline:
  1. INSERT alice → WAL record LSN=1, page 0 modified in memory
  2. INSERT bob   → WAL record LSN=2, page 0 modified in memory
  3. INSERT carol → WAL record LSN=3, page 0 modified in memory
  4. WAL.Flush()  → WAL records 1-3 are on disk
  5. ⚡ CRASH — dirty page 0 was never written to disk

State on disk after crash:
  WAL file:  [LSN=1: INSERT alice] [LSN=2: INSERT bob] [LSN=3: INSERT carol]
  Page 0:    empty (LSN=0)

Recovery:
  Record 1: page LSN=0 < 1 → insert alice → page LSN=1
  Record 2: page LSN=1 < 2 → insert bob   → page LSN=2
  Record 3: page LSN=2 < 3 → insert carol → page LSN=3
  Flush page 0 to disk.
  Truncate WAL.

Result: all three rows recovered. ✓

What if page 0 was flushed between steps 2 and 3?

1
2
3
4
5
6
7
8
9
10
11
State on disk:
  WAL file:  [LSN=1: INSERT alice] [LSN=2: INSERT bob] [LSN=3: INSERT carol]
  Page 0:    alice + bob (LSN=2)

Recovery:
  Record 1: page LSN=2 ≥ 1 → skip
  Record 2: page LSN=2 ≥ 2 → skip
  Record 3: page LSN=2 < 3 → insert carol → page LSN=3
  Flush page 0 to disk.

Result: carol recovered, alice and bob already on disk. ✓

Write-Ahead Rule in the Buffer Pool

The buffer pool enforces the WAL-ahead rule in three places:

  1. FlushPage — before writing a single dirty page
  2. FlushAll — before writing all dirty pages
  3. Eviction — when the LRU victim is dirty and must be written to make room

Each calls walFlusher.Flush() before writing any page to disk. This ensures no page reaches disk before its WAL records.

1
2
3
4
5
// In BufferPool.FlushPage:
if bp.walFlusher != nil {
    bp.walFlusher.Flush()  // WAL must hit disk first
}
hf.WritePage(pid.PageNum, &frame.Page)  // then the data page

The WALFlusher interface avoids a circular import — the storage package doesn’t need to know about the wal package.


Integration with Table Operations

Each DML operation writes a WAL record after modifying the in-memory page:

INSERT:

1
serialize values → find/allocate page → insert tuple → WAL append → set page LSN → unpin dirty

DELETE:

1
fetch page → read old data (for WAL) → mark slot deleted → WAL append → set page LSN → unpin dirty

UPDATE (implemented as DELETE + INSERT):

1
2
DELETE old row → WAL record for delete
INSERT new row → WAL record for insert

If WAL is nil (not configured), the operations work exactly as before — backward compatible.


Why Sequential Writes are Fast

A spinning disk can write sequentially at ~200 MB/s but random I/O drops to ~1 MB/s (seek time dominates). Even SSDs benefit: sequential writes are 2-4x faster than random due to write amplification in the flash translation layer.

The WAL exploits this: one sequential append vs. scattered page writes across many files.

1
2
3
Without WAL:  page 37 → seek → write   page 102 → seek → write   page 5 → seek → write
With WAL:     WAL append → append → append  (sequential, fast)
              page flushes happen later in batches  (still random, but amortized)

This is why the WAL can be flushed on every commit without killing performance — a sequential fsync of a few KB is fast.


What This Looks Like in Action

From the tests:

1
2
3
4
5
=== WAL Contents ===
  LSN=1  INSERT    table=users  page=0  slot=0  new=12 bytes  old=0 bytes
  LSN=2  INSERT    table=users  page=0  slot=1  new=10 bytes  old=0 bytes
  LSN=3  DELETE    table=users  page=0  slot=0  new=0 bytes  old=12 bytes
  Total: 3 records

Crash recovery test:

1
2
3
4
  1. INSERT 3 rows → WAL has 3 INSERT records
  2. Crash simulation → WAL flushed, data pages NOT flushed
  3. Reopen → recovery replays 3 inserts
  4. SELECT * → all 3 rows present ✓

Limitations

This implementation has several simplifications compared to PostgreSQL:

  • No transactions yet: every operation is auto-committed. Without BEGIN/COMMIT, there’s no atomicity for multi-step operations (UPDATE = DELETE + INSERT could be partial after a crash).
  • No undo: we only redo (replay forward). A real database also needs undo for rolling back aborted transactions. OldData is stored in WAL records in preparation for Phase 10.
  • No checkpointing: we replay the entire WAL on recovery. PostgreSQL writes periodic checkpoints to bound recovery time.
  • Full-record WAL: each record stores the full tuple. PostgreSQL can store just the diff (delta) for smaller records.
  • No WAL archiving: old WAL files aren’t saved for point-in-time recovery or replication.
  • Single WAL file: PostgreSQL uses a series of 16MB WAL segment files, recycling old ones.

PostgreSQL Comparison

PostgreSQL’s WAL is much more sophisticated:

  • WAL segments: 16MB files, cycled and recycled, with continuous archiving
  • Full-page images: first modification after a checkpoint writes the entire page to WAL (prevents torn-page problems)
  • Resource managers: each subsystem (heap, btree, etc.) registers its own WAL record types and redo functions
  • Streaming replication: WAL records are shipped to standby servers for high availability
  • Point-in-time recovery (PITR): replay WAL up to a specific timestamp to restore to any point
  • Asynchronous commit: optionally skip fsync for higher throughput at the cost of recent durability

Our implementation covers the core idea: log before modify, replay on crash.


Files Created/Modified in Phase 9

File Status Purpose
wal/record.go NEW LogRecord, LSN, LogType, serialization
wal/wal.go NEW WALManager: Append, Flush, ReadAll, Truncate
wal/recovery.go NEW Recover: replay WAL records against data pages
wal/wal_test.go NEW 10 tests: roundtrip, append, crash recovery, idempotency
storage/page.go MODIFIED Added 8-byte LSN to page header (6→14 bytes)
storage/page_test.go MODIFIED Updated for new header size
storage/bufferpool.go MODIFIED WALFlusher interface, flush WAL before writing pages
catalog/table.go MODIFIED WAL field, writes WAL records in Insert/Delete, PageAccess interface
catalog/catalog.go MODIFIED WAL lifecycle, recovery on startup, CrashSimulation

What’s Next

Phase 10 adds transactions (BEGIN/COMMIT/ROLLBACK), a lock manager for concurrency control, and a REPL + TCP server. With the WAL in place, transactions can use it for atomicity: COMMIT writes a commit record and flushes the WAL; ROLLBACK uses the OldData in WAL records to undo changes. The “A” in ACID builds directly on the “D” we just implemented.

Observations

  • The important tradeoff here is that the buffer pool wants to flush dirty pages lazily for performance, while durability requires a reliable record of every change.
  • The table layer appends WAL records for inserts and deletes, while updates are represented as the combination of delete plus insert.
  • The buffer pool is responsible for enforcing the critical ordering rule: flush the WAL to disk before flushing any dirty data page that depends on it.
  • Recovery is conceptually straightforward even if the mechanics are detailed: read the WAL, load the affected page, and replay the change if that page has not yet seen it.