Jun 13, 2026

OLTP – Phase 8 B-Tree Index

Without an index, every query scans every row. WHERE id = 42 on a million-row table reads all million rows to find one. That’s O(n) — slow.

A B+ tree index is a sorted data structure that maps key values (like id) to the physical location of rows (page number + slot). It enables O(log n) lookups — finding a row in a million-row table reads 3-4 pages instead of thousands.

This is the same reason a book index exists: instead of reading every page to find “B-tree,” you check the index at the back, which tells you the page number directly.

In PostgreSQL, this is nbtinsert.c, nbtsearch.c, and nbtree.h. B-tree is PostgreSQL’s default index type.

Side note: The “B” in B-tree is not “binary”. A binary tree has 2 children per node. A B-tree has hundreds — that’s what makes it shallow and disk-friendly.

Here is the roadmap for the phases to come:

Phase 8: B-tree index
Phase 9: Write-ahead log
Phase 10: Transactions, concurrency, REPL, and server

Full Source Code

The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.

B+ Tree Structure

A B+ tree has two types of nodes:

Leaf nodes hold the actual data — (key, RID) pairs. The key is the actual column value being indexed (e.g., if you index on id, the keys are 1, 2, 3, …). The RID is the row’s physical location:

Leaf: [key=10, RID(page=0,slot=3)] [key=20, RID(page=0,slot=7)] [key=30, RID(page=1,slot=0)]
       ──────────────────────────── ──────────────────────────── ────────────────────────────
                              next leaf ──►

Internal nodes hold keys and pointers to child nodes for navigation:

Internal: [child0] [key=50] [child1] [key=100] [child2]
             │                  │                   │
             ▼                  ▼                   ▼
        keys < 50         50 ≤ keys < 100     keys ≥ 100

Key properties:

All data lives in the leaves. Internal nodes are just for routing.
Leaf nodes are linked together (next pointers) for efficient range scans.
The tree is always balanced — every leaf is at the same depth.
Each node is one page (4096 bytes), so each disk read loads a whole node.

RID — Row ID

type RID struct {
    PageID    uint32
    SlotIndex uint16
}

This is the physical address of a row in a table. The index stores these as values, mapping keys to locations. To fetch a row by RID, the executor reads the page and gets the tuple at that slot.

Note: The RID doesn’t need to identify which file — each index belongs to a specific table, and each table is a single heap file. The page number is always relative to that one file.

How many keys fit in a node?

Each node is one page (4096 bytes). For int32 keys:

Leaf entry: 4 (key) + 4 (pageID) + 2 (slot) = 10 bytes
Max leaf entries: (4096 - 7 header) / 10 = 408 entries
Internal entry: 4 (key) + 4 (child pointer) = 8 bytes
Max internal entries: (4096 - 7 - 4) / 8 = 510 entries

Note: We only support indexing fixed-size types (INT, BIGINT, BOOL) for simplicity. VARCHAR keys would require variable-length entries, making the node layout much more complex.

With 408 keys per leaf, 1000 rows need only 3 leaf pages. With 510 children per internal node, one root page can point to all of them. Total height: 2 (root + leaves). That means finding any key among 1000 takes exactly 2 page reads.

For a million rows: 2,451 leaf pages, ~5 internal nodes, height 3. Three page reads to find any row.

How the tree grows — a concrete example

Imagine max 3 keys per leaf and max 3 keys per internal node.

Height 1 — everything fits in one leaf (the root):

[1, 2, 3]

Insert 4 — leaf overflows, splits, a new root is created → height 2:

        [3]              ← new root (internal)
       /   \
  [1, 2]   [3, 4]       ← leaves, linked by next pointer

After more inserts (5, 6, 7, 8):

        [3 | 5 | 7]                    ← root has 3 keys, 4 children
       /   |    |   \
  [1,2] [3,4] [5,6] [7,8]             ← 4 leaves

Insert 9 — leaf [7,8] overflows → splits → push key up → root overflows → root splits → height 3:

              [5]                       ← new root
             /   \
        [3]       [7 | 8]              ← two internal nodes
       /   \     /  |   \
  [1,2] [3,4] [5,6] [7] [8,9]         ← 5 leaves

The tree only grows taller when the root itself splits. Leaves never change level. Keys within each leaf are always sorted, and leaves are linked left to right for range scans.

Querying this tree — WHERE id = 4:

Read root [5] → 4 < 5 → go left
Read internal [3] → 4 ≥ 3 → go right child
Read leaf [3, 4] → found 4 → return RID

3 page reads instead of scanning every row.

Operations

Search — O(log n)

Navigate from root to leaf, then scan the leaf for matching keys:

Search(key=42):
  1. Read root (internal): 42 < 50 → go to child0
  2. Read child0 (leaf): scan keys... found 42 → return RID

Insert — O(log n)

Find the right leaf, insert the key in sorted order. If the leaf overflows, split it:

Insert(key=25) into full leaf [10, 20, 30, 40]:
Insert: [10, 20, 25, 30, 40] — too many!
Split:  Left=[10, 20]  Right=[25, 30, 40]
Push key 25 up to parent

When an internal node overflows, it splits the same way — the middle key is pushed up to its parent. If the root splits, a new root is created and the tree grows one level taller.

RangeScan — O(log n + k)

Find the leaf containing low, then follow the next-leaf pointers until past high:

RangeScan(10, 50):
Navigate to leaf containing 10
Scan: 10 ✓, 20 ✓, 30 ✓, 40 ✓, 50 ✓
Follow next pointer to next leaf
Scan: 55 ✗ — stop

The linked leaves make range scans efficient — no need to go back up to the root.

Node Serialization

Each node is serialized into a 4096-byte page:

Header (7 bytes):
  [isLeaf: 1 byte] [numKeys: 2 bytes] [nextLeaf: 4 bytes]

Leaf data:
  [key0][pageID0][slot0] [key1][pageID1][slot1] ...

Internal data:
  [child0] [key0] [child1] [key1] [child2] ...

The node is read from disk into a BTreeNode struct, operated on in memory, then serialized back to the page and written to disk. Each index has its own HeapFile.

Executor Integration

CREATE INDEX

CREATE INDEX idx_id ON users(id);

Parser produces CreateIndexStmt{IndexName: "idx_id", TableName: "users", Column: "id"}
Executor tells the catalog to create the index
Catalog creates a new B-tree, scans the table, inserts every row’s key and RID

Index-aware SELECT

SELECT * FROM users WHERE id = 42;

Before running a sequential scan, the executor checks if the WHERE clause is a simple equality (column = literal) on an indexed column. If so:

Look up the index in the catalog
Call btree.Search(key) — returns the RIDs
Fetch each row by RID using table.GetRowByRID(pageID, slot)
Skip the sequential scan entirely

If there’s no matching index (or the WHERE is too complex), it falls back to the sequential scan as before.

What this looks like in action

From the tests:

=== B-Tree Index (20 keys) ===
  Key type:         INT
  Max leaf keys:    408
  Max internal keys:510
  Height:           1
  Root page:        0
  Search(0): [{0 0}]
  Search(10): [{0 10}]
  Search(19): [{0 19}]
  RangeScan(5, 15): 11 results

With 20 keys, the entire index fits in a single leaf node (height 1). With 1000 keys, it grows to height 2.

Integration test:

CREATE TABLE users (id INT, name VARCHAR(50), age INT);
INSERT INTO users VALUES (1, 'alice', 30);
INSERT INTO users VALUES (2, 'bob', 25);
INSERT INTO users VALUES (3, 'carol', 35);
CREATE INDEX idx_id ON users(id);
SELECT * FROM users WHERE id = 2;  -- uses index, reads 2 pages instead of all

Performance: Index vs Sequential Scan

Rows	SeqScan pages	Index pages	Speedup
100	1	1-2	~same
1,000	~3	2	1.5x
100,000	~250	3	~80x
1,000,000	~2,500	3-4	~700x

For small tables, a sequential scan is fine (or even faster due to less overhead). For large tables, the index is dramatically faster for point lookups.

Limitations

This implementation has several simplifications compared to PostgreSQL:

No index maintenance: INSERT/UPDATE/DELETE after CREATE INDEX don’t update the index. The index becomes stale.
Integer keys only: varchar and other variable-length keys aren’t supported for serialization.
No concurrency: no locking on B-tree pages during concurrent access.
No WAL integration: B-tree modifications aren’t logged for crash recovery.
Simple optimizer: only recognizes WHERE column = literal for index use. Doesn’t handle ranges, compound conditions, or flipped operands (WHERE 42 = id).

PostgreSQL comparison

PostgreSQL’s B-tree implementation is much more complex:

Lehman-Yao algorithm: concurrent access without holding locks from root to leaf
Deduplication: multiple RIDs for the same key are compressed
INCLUDE columns: non-key columns stored in the index for index-only scans
Partial indexes: CREATE INDEX ... WHERE active = true indexes only matching rows
Multi-column indexes: CREATE INDEX ON users(last_name, first_name)
WAL-logged: every B-tree modification is crash-safe

Our implementation covers the core idea: a balanced tree that turns O(n) scans into O(log n) lookups.

Files created/modified in Phase 8

File	Status	Purpose
`index/btree.go`	NEW	BTreeIndex: Insert, Search, RangeScan, Height
`index/btree_node.go`	NEW	BTreeNode serialization to/from pages
`index/btree_test.go`	NEW	8 tests: insert 1000, range scan, duplicates, height
`tuple/tuple.go`	MODIFIED	Added CompareValues (moved from executor)
`executor/expression.go`	MODIFIED	Uses tuple.CompareValues
`parser/token.go`	MODIFIED	Added INDEX, ON keywords
`parser/ast.go`	MODIFIED	Added CreateIndexStmt
`parser/parser.go`	MODIFIED	Handles CREATE INDEX
`catalog/catalog.go`	MODIFIED	Tracks indexes, CreateIndex, GetIndexForColumn
`catalog/table.go`	MODIFIED	Added GetRowByRID
`executor/executor.go`	MODIFIED	CREATE INDEX execution, index-aware SELECT

What’s next

Phase 9 (Write-Ahead Log) adds crash recovery. Before modifying any page, the change is written to a sequential log file. If the database crashes, the log is replayed on restart to restore the data to a consistent state. This guarantees durability — the “D” in ACID.

Observations

The point of the index is to store where to look, rather than forcing the database to inspect every page and every row.
In this implementation, the leaf nodes carry the indexed value together with the page and slot of the target row. The heap file is implicit because there is one heap file per table.
Inserts traverse from the root to the appropriate leaf, and when a split happens the structural change propagates back upward. Reads also traverse from top to bottom, but stop once the matching leaf path is found.
What I find useful here is how it connects back to earlier phases: the index narrows the search to a page and slot, and the buffer pool then makes those hot pages cheap to revisit.
Indexes belong in the catalog for the same reason tables do: the catalog is the place that tracks database objects and how to resolve them.