OLTP – Phase 8 B-Tree Index
Without an index, every query scans every row. WHERE id = 42 on a million-row table reads all million rows to find one. That’s O(n) — slow.
A B+ tree index is a sorted data structure that maps key values (like id) to the physical location of rows (page number + slot). It enables O(log n) lookups — finding a row in a million-row table reads 3-4 pages instead of thousands.
This is the same reason a book index exists: instead of reading every page to find “B-tree,” you check the index at the back, which tells you the page number directly.
In PostgreSQL, this is nbtinsert.c, nbtsearch.c, and nbtree.h. B-tree is PostgreSQL’s default index type.
Side note: The “B” in B-tree is not “binary”. A binary tree has 2 children per node. A B-tree has hundreds — that’s what makes it shallow and disk-friendly.
Here is the roadmap for the phases to come:
- Phase 8: B-tree index
- Phase 9: Write-ahead log
- Phase 10: Transactions, concurrency, REPL, and server
Full Source Code
The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.
B+ Tree Structure
A B+ tree has two types of nodes:
Leaf nodes hold the actual data — (key, RID) pairs. The key is the actual column value being indexed (e.g., if you index on id, the keys are 1, 2, 3, …). The RID is the row’s physical location:
1
2
3
Leaf: [key=10, RID(page=0,slot=3)] [key=20, RID(page=0,slot=7)] [key=30, RID(page=1,slot=0)]
──────────────────────────── ──────────────────────────── ────────────────────────────
next leaf ──►
Internal nodes hold keys and pointers to child nodes for navigation:
1
2
3
4
Internal: [child0] [key=50] [child1] [key=100] [child2]
│ │ │
▼ ▼ ▼
keys < 50 50 ≤ keys < 100 keys ≥ 100
Key properties:
- All data lives in the leaves. Internal nodes are just for routing.
- Leaf nodes are linked together (next pointers) for efficient range scans.
- The tree is always balanced — every leaf is at the same depth.
- Each node is one page (4096 bytes), so each disk read loads a whole node.
RID — Row ID
1
2
3
4
type RID struct {
PageID uint32
SlotIndex uint16
}
This is the physical address of a row in a table. The index stores these as values, mapping keys to locations. To fetch a row by RID, the executor reads the page and gets the tuple at that slot.
Note: The RID doesn’t need to identify which file — each index belongs to a specific table, and each table is a single heap file. The page number is always relative to that one file.
How many keys fit in a node?
Each node is one page (4096 bytes). For int32 keys:
- Leaf entry: 4 (key) + 4 (pageID) + 2 (slot) = 10 bytes
- Max leaf entries: (4096 - 7 header) / 10 = 408 entries
- Internal entry: 4 (key) + 4 (child pointer) = 8 bytes
- Max internal entries: (4096 - 7 - 4) / 8 = 510 entries
Note: We only support indexing fixed-size types (INT, BIGINT, BOOL) for simplicity. VARCHAR keys would require variable-length entries, making the node layout much more complex.
With 408 keys per leaf, 1000 rows need only 3 leaf pages. With 510 children per internal node, one root page can point to all of them. Total height: 2 (root + leaves). That means finding any key among 1000 takes exactly 2 page reads.
For a million rows: 2,451 leaf pages, ~5 internal nodes, height 3. Three page reads to find any row.
How the tree grows — a concrete example
Imagine max 3 keys per leaf and max 3 keys per internal node.
Height 1 — everything fits in one leaf (the root):
1
[1, 2, 3]
Insert 4 — leaf overflows, splits, a new root is created → height 2:
1
2
3
[3] ← new root (internal)
/ \
[1, 2] [3, 4] ← leaves, linked by next pointer
After more inserts (5, 6, 7, 8):
1
2
3
[3 | 5 | 7] ← root has 3 keys, 4 children
/ | | \
[1,2] [3,4] [5,6] [7,8] ← 4 leaves
Insert 9 — leaf [7,8] overflows → splits → push key up → root overflows → root splits → height 3:
1
2
3
4
5
[5] ← new root
/ \
[3] [7 | 8] ← two internal nodes
/ \ / | \
[1,2] [3,4] [5,6] [7] [8,9] ← 5 leaves
The tree only grows taller when the root itself splits. Leaves never change level. Keys within each leaf are always sorted, and leaves are linked left to right for range scans.
Querying this tree — WHERE id = 4:
- Read root
[5]→ 4 < 5 → go left - Read internal
[3]→ 4 ≥ 3 → go right child - Read leaf
[3, 4]→ found 4 → return RID
3 page reads instead of scanning every row.
Operations
Search — O(log n)
Navigate from root to leaf, then scan the leaf for matching keys:
1
2
3
Search(key=42):
1. Read root (internal): 42 < 50 → go to child0
2. Read child0 (leaf): scan keys... found 42 → return RID
Insert — O(log n)
Find the right leaf, insert the key in sorted order. If the leaf overflows, split it:
1
2
3
4
Insert(key=25) into full leaf [10, 20, 30, 40]:
1. Insert: [10, 20, 25, 30, 40] — too many!
2. Split: Left=[10, 20] Right=[25, 30, 40]
3. Push key 25 up to parent
When an internal node overflows, it splits the same way — the middle key is pushed up to its parent. If the root splits, a new root is created and the tree grows one level taller.
RangeScan — O(log n + k)
Find the leaf containing low, then follow the next-leaf pointers until past high:
1
2
3
4
5
RangeScan(10, 50):
1. Navigate to leaf containing 10
2. Scan: 10 ✓, 20 ✓, 30 ✓, 40 ✓, 50 ✓
3. Follow next pointer to next leaf
4. Scan: 55 ✗ — stop
The linked leaves make range scans efficient — no need to go back up to the root.
Node Serialization
Each node is serialized into a 4096-byte page:
1
2
3
4
5
6
7
8
Header (7 bytes):
[isLeaf: 1 byte] [numKeys: 2 bytes] [nextLeaf: 4 bytes]
Leaf data:
[key0][pageID0][slot0] [key1][pageID1][slot1] ...
Internal data:
[child0] [key0] [child1] [key1] [child2] ...
The node is read from disk into a BTreeNode struct, operated on in memory, then serialized back to the page and written to disk. Each index has its own HeapFile.
Executor Integration
CREATE INDEX
1
CREATE INDEX idx_id ON users(id);
- Parser produces
CreateIndexStmt{IndexName: "idx_id", TableName: "users", Column: "id"} - Executor tells the catalog to create the index
- Catalog creates a new B-tree, scans the table, inserts every row’s key and RID
Index-aware SELECT
1
SELECT * FROM users WHERE id = 42;
Before running a sequential scan, the executor checks if the WHERE clause is a simple equality (column = literal) on an indexed column. If so:
- Look up the index in the catalog
- Call
btree.Search(key)— returns the RIDs - Fetch each row by RID using
table.GetRowByRID(pageID, slot) - Skip the sequential scan entirely
If there’s no matching index (or the WHERE is too complex), it falls back to the sequential scan as before.
What this looks like in action
From the tests:
1
2
3
4
5
6
7
8
9
10
=== B-Tree Index (20 keys) ===
Key type: INT
Max leaf keys: 408
Max internal keys:510
Height: 1
Root page: 0
Search(0): [{0 0}]
Search(10): [{0 10}]
Search(19): [{0 19}]
RangeScan(5, 15): 11 results
With 20 keys, the entire index fits in a single leaf node (height 1). With 1000 keys, it grows to height 2.
Integration test:
1
2
3
4
5
6
CREATE TABLE users (id INT, name VARCHAR(50), age INT);
INSERT INTO users VALUES (1, 'alice', 30);
INSERT INTO users VALUES (2, 'bob', 25);
INSERT INTO users VALUES (3, 'carol', 35);
CREATE INDEX idx_id ON users(id);
SELECT * FROM users WHERE id = 2; -- uses index, reads 2 pages instead of all
Performance: Index vs Sequential Scan
| Rows | SeqScan pages | Index pages | Speedup |
|---|---|---|---|
| 100 | 1 | 1-2 | ~same |
| 1,000 | ~3 | 2 | 1.5x |
| 100,000 | ~250 | 3 | ~80x |
| 1,000,000 | ~2,500 | 3-4 | ~700x |
For small tables, a sequential scan is fine (or even faster due to less overhead). For large tables, the index is dramatically faster for point lookups.
Limitations
This implementation has several simplifications compared to PostgreSQL:
- No index maintenance: INSERT/UPDATE/DELETE after CREATE INDEX don’t update the index. The index becomes stale.
- Integer keys only: varchar and other variable-length keys aren’t supported for serialization.
- No concurrency: no locking on B-tree pages during concurrent access.
- No WAL integration: B-tree modifications aren’t logged for crash recovery.
- Simple optimizer: only recognizes
WHERE column = literalfor index use. Doesn’t handle ranges, compound conditions, or flipped operands (WHERE 42 = id).
PostgreSQL comparison
PostgreSQL’s B-tree implementation is much more complex:
- Lehman-Yao algorithm: concurrent access without holding locks from root to leaf
- Deduplication: multiple RIDs for the same key are compressed
- INCLUDE columns: non-key columns stored in the index for index-only scans
- Partial indexes:
CREATE INDEX ... WHERE active = trueindexes only matching rows - Multi-column indexes:
CREATE INDEX ON users(last_name, first_name) - WAL-logged: every B-tree modification is crash-safe
Our implementation covers the core idea: a balanced tree that turns O(n) scans into O(log n) lookups.
Files created/modified in Phase 8
| File | Status | Purpose |
|---|---|---|
index/btree.go |
NEW | BTreeIndex: Insert, Search, RangeScan, Height |
index/btree_node.go |
NEW | BTreeNode serialization to/from pages |
index/btree_test.go |
NEW | 8 tests: insert 1000, range scan, duplicates, height |
tuple/tuple.go |
MODIFIED | Added CompareValues (moved from executor) |
executor/expression.go |
MODIFIED | Uses tuple.CompareValues |
parser/token.go |
MODIFIED | Added INDEX, ON keywords |
parser/ast.go |
MODIFIED | Added CreateIndexStmt |
parser/parser.go |
MODIFIED | Handles CREATE INDEX |
catalog/catalog.go |
MODIFIED | Tracks indexes, CreateIndex, GetIndexForColumn |
catalog/table.go |
MODIFIED | Added GetRowByRID |
executor/executor.go |
MODIFIED | CREATE INDEX execution, index-aware SELECT |
What’s next
Phase 9 (Write-Ahead Log) adds crash recovery. Before modifying any page, the change is written to a sequential log file. If the database crashes, the log is replayed on restart to restore the data to a consistent state. This guarantees durability — the “D” in ACID.
Observations
- The point of the index is to store where to look, rather than forcing the database to inspect every page and every row.
- In this implementation, the leaf nodes carry the indexed value together with the page and slot of the target row. The heap file is implicit because there is one heap file per table.
- Inserts traverse from the root to the appropriate leaf, and when a split happens the structural change propagates back upward. Reads also traverse from top to bottom, but stop once the matching leaf path is found.
- What I find useful here is how it connects back to earlier phases: the index narrows the search to a page and slot, and the buffer pool then makes those hot pages cheap to revisit.
- Indexes belong in the catalog for the same reason tables do: the catalog is the place that tracks database objects and how to resolve them.