OLTP – Phase 1: Pages and Heap Files
Every database needs to store data on disk and read it back. But disks don’t work like memory — you can’t read a single byte efficiently. Disk I/O works in blocks. The operating system reads and writes in chunks (typically 4KB). Databases lean into this by organizing all data into fixed-size pages that match the OS block size.
A heap file is the simplest way to organize pages — just stack them end-to-end in a file. “Heap” means unordered: rows go wherever there’s room.
In PostgreSQL, every table is backed by a heap file stored at base/<dboid>/<relfilenode>. The code lives in src/backend/storage/smgr/md.c (storage manager) and src/backend/storage/page/bufpage.c (page layout).
In this series, the goal is to build a small database engine step by step and understand how each layer fits with the next one. We start from raw bytes on disk, move up through row serialization and table management, then add query execution, indexing, durability, and finally transactions and interfaces.
Here is the roadmap for the phases to come:
- Phase 1: Pages and heap files
- Phase 2: Tuple and record format
- Phase 3: Table and catalog
- Phase 4: Sequential scan and filtering
- Phase 5: Buffer pool manager
- Phase 6: SQL parser
- Phase 7: Query executor
- Phase 8: B-tree index
- Phase 9: Write-ahead log
- Phase 10: Transactions, concurrency, REPL, and server
Full Source Code
The code referenced in this post can be found at https://gitlab.com/kimserey.lam/db-learn.
The Page (storage/page.go)
What is it?
A Page is a fixed 4096-byte array. It’s the smallest unit the database reads from or writes to disk. Everything — rows, indexes, metadata — lives inside pages.
The Slotted Page Layout
┌──────────────────────────────────────────────────┐
│ PageHeader (6 bytes) │
│ bytes [0:2] FreeSpaceStart │
│ bytes [2:4] FreeSpaceEnd │
│ bytes [4:6] ItemCount │
├──────────────────────────────────────────────────┤
│ ItemPointer[0] │ ItemPointer[1] │ ... ──► │
│ (4 bytes each: 2 byte offset + 2 byte length) │
├──────────────────────────────────────────────────┤
│ free space │
├──────────────────────────────────────────────────┤
│ ◄── ... │ Tuple[1] data │ Tuple[0] data │
└──────────────────────────────────────────────────┘
Item pointers grow forward from the top (after the header). Tuple data grows backward from the bottom of the page. Free space is the gap in the middle. When the two sides meet, the page is full.
Why this design?
Tuples have different sizes. If you just packed them sequentially, deleting a tuple in the middle would leave a gap, and you’d need to shift everything. The slotted page solves this:
- Each tuple gets a slot number (0, 1, 2, …) that never changes.
- The item pointer for that slot stores the tuple’s actual byte offset and length.
- If you want to reorganize the tuple data (compaction), you just update the pointers — the slot numbers stay the same.
- External references (like index entries) can use (pageID, slotNumber) and it remains valid even after compaction.
Concrete example
After inserting “alice”, “bob”, “carol”:
Slot[0]: offset=4091, length=5, data="alice"
Slot[1]: offset=4088, length=3, data="bob"
Slot[2]: offset=4083, length=5, data="carol"
[Header 0:6] [Ptrs 6:18] [Free 18:4083] [Tuples 4083:4096]
- “alice” (5 bytes) was placed at byte 4091 (= 4096 - 5).
- “bob” (3 bytes) was placed at byte 4088 (= 4091 - 3).
- “carol” (5 bytes) was placed at byte 4083 (= 4088 - 5).
- Each item pointer costs 4 bytes, so 3 pointers = 12 bytes, occupying bytes 6 through 17.
- Free space: bytes 18 through 4082 = 4065 bytes remaining.
How each operation works
InsertTuple(data):
- Check that `len(data) + 4` (the pointer) fits in the free space.
- Subtract `len(data)` from FreeSpaceEnd — this is where the tuple goes. Copy the bytes there.
- Write a new ItemPointer at FreeSpaceStart with the tuple’s offset and length.
- Advance FreeSpaceStart by 4 bytes and update ItemCount.
GetTuple(slotIndex):
- Check `slotIndex < ItemCount`.
- Read the ItemPointer at position `6 + slotIndex * 4`.
- If offset=0 and length=0, the slot is deleted.
- Otherwise, copy `length` bytes from position `offset` and return them.
DeleteTuple(slotIndex):
- Set the slot’s ItemPointer offset and length both to 0.
- The tuple bytes are still physically in the page (wasted space).
- Real databases compact pages to reclaim this space — we skip that for simplicity.
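Putting the three operations together, here is a minimal, self-contained sketch of the slotted page in Go. It reuses the names from this post (Page, InsertTuple, GetTuple, DeleteTuple) but is an illustration of the mechanics, not the repo's actual storage/page.go:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	PageSize   = 4096
	HeaderSize = 6 // FreeSpaceStart (2) + FreeSpaceEnd (2) + ItemCount (2)
	PtrSize    = 4 // 2-byte offset + 2-byte length per item pointer
)

type Page struct{ Data [PageSize]byte }

// NewPage initializes the header: pointers start right after the
// header, tuple data starts at the very end of the page.
func NewPage() *Page {
	p := &Page{}
	binary.LittleEndian.PutUint16(p.Data[0:], HeaderSize) // FreeSpaceStart
	binary.LittleEndian.PutUint16(p.Data[2:], PageSize)   // FreeSpaceEnd
	return p
}

func (p *Page) freeStart() uint16 { return binary.LittleEndian.Uint16(p.Data[0:]) }
func (p *Page) freeEnd() uint16   { return binary.LittleEndian.Uint16(p.Data[2:]) }
func (p *Page) itemCount() uint16 { return binary.LittleEndian.Uint16(p.Data[4:]) }

// InsertTuple copies data into the end of free space, appends an
// item pointer, and returns the slot index.
func (p *Page) InsertTuple(data []byte) (uint16, error) {
	need := uint16(len(data)) + PtrSize
	if p.freeEnd()-p.freeStart() < need {
		return 0, errors.New("page full")
	}
	off := p.freeEnd() - uint16(len(data))
	copy(p.Data[off:], data)
	ptr := p.freeStart()
	binary.LittleEndian.PutUint16(p.Data[ptr:], off)
	binary.LittleEndian.PutUint16(p.Data[ptr+2:], uint16(len(data)))
	binary.LittleEndian.PutUint16(p.Data[0:], ptr+PtrSize) // advance FreeSpaceStart
	binary.LittleEndian.PutUint16(p.Data[2:], off)         // shrink FreeSpaceEnd
	slot := p.itemCount()
	binary.LittleEndian.PutUint16(p.Data[4:], slot+1)
	return slot, nil
}

// GetTuple returns a copy of the tuple bytes, isolating the caller
// from the page's internal memory.
func (p *Page) GetTuple(slot uint16) ([]byte, error) {
	if slot >= p.itemCount() {
		return nil, errors.New("slot out of range")
	}
	ptr := HeaderSize + slot*PtrSize
	off := binary.LittleEndian.Uint16(p.Data[ptr:])
	length := binary.LittleEndian.Uint16(p.Data[ptr+2:])
	if off == 0 && length == 0 {
		return nil, errors.New("tuple deleted")
	}
	out := make([]byte, length)
	copy(out, p.Data[off:off+length])
	return out, nil
}

// DeleteTuple zeroes the slot's pointer; the bytes stay in the page.
func (p *Page) DeleteTuple(slot uint16) {
	ptr := HeaderSize + slot*PtrSize
	binary.LittleEndian.PutUint16(p.Data[ptr:], 0)
	binary.LittleEndian.PutUint16(p.Data[ptr+2:], 0)
}

func main() {
	p := NewPage()
	a, _ := p.InsertTuple([]byte("alice"))
	b, _ := p.InsertTuple([]byte("bob"))
	p.DeleteTuple(b)
	got, _ := p.GetTuple(a)
	fmt.Println(a, b, string(got)) // 0 1 alice
}
```

Note that delete never moves bytes: slot 0 still resolves to "alice" through its pointer even after slot 1 is gone.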
Why the slotted design helps with deletion
Without pointers, if you stored tuples packed sequentially:
[alice] [bob] [carol]
Deleting “bob” leaves a hole. You’d have to shift “carol” left to close it, and anything pointing to carol’s old position (like an index entry) would break.
With our slotted design, deletion just zeroes out a pointer:
Pointers (left): Tuples (right):
Ptr[0] → alice [carol] [bob's ghost] [alice]
Ptr[1] → 0, 0 DELETED
Ptr[2] → carol
No data moves. No references break. “bob” is still physically there wasting space, but slot 0 still finds “alice” and slot 2 still finds “carol” through their pointers.
What happens when you keep inserting after deletes
New inserts do NOT reuse dead space. They keep consuming free space from both sides:
Insert "alice" → Ptr[0], tuple at end
Insert "bob"   → Ptr[1], tuple at end
Insert "carol" → Ptr[2], tuple at end
Delete "bob"   → Ptr[1] = 0,0 (bytes still there)
Insert "dave"  → Ptr[3], new tuple at end (bob's space is NOT reused)
Pointers: Tuples:
Ptr[0] → alice [dave] [carol] [bob's ghost] [alice]
Ptr[1] → DELETED wasted space
Ptr[2] → carol
Ptr[3] → dave
Eventually pointers and tuples meet in the middle → page is full, even though dead space exists. At that point:
- Compact the page — squeeze out dead bytes, move tuples together, update pointers. Free space reappears. We don’t build this.
- Allocate a new page — add another 4KB page to the heap file. This is what we do.
PostgreSQL reclaims dead space during VACUUM — its background cleanup process that compacts pages and removes dead rows left behind by deletes and updates.
Why GetTuple returns a copy
If we returned a slice pointing directly into Page.Data, the caller could accidentally overwrite page contents. Returning a copy (make([]byte, length) + copy) isolates the caller from the page’s internal memory.
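A tiny demonstration of the aliasing problem the copy avoids (the slice names here are made up for the demo):

```go
package main

import "fmt"

// A slice aliasing the page's backing array lets the caller mutate
// page contents; a copy does not.
func main() {
	page := make([]byte, 16)
	copy(page[11:], "alice")

	alias := page[11:16] // points directly into the page
	alias[0] = 'X'
	fmt.Println(string(page[11:16])) // Xlice — the page was corrupted

	safe := make([]byte, 5)
	copy(safe, page[11:16])
	safe[0] = 'Y' // mutates only the copy
	fmt.Println(string(page[11:16])) // Xlice — page unchanged
}
```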
Capacity
A fresh page has 4090 bytes of free space (4096 minus the 6-byte header). Each tuple costs 4 + len(data) bytes. For 10-byte tuples, that’s 14 bytes each, so a page fits 4090/14 = 292 tuples. For a 1000-byte row, it fits 4 per page (4 × 1004 = 4016 bytes).
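The capacity arithmetic can be checked directly:

```go
package main

import "fmt"

// Back-of-the-envelope page capacity, using the layout constants
// from this phase (4096-byte page, 6-byte header, 4-byte pointer).
func main() {
	const pageSize, headerSize, ptrSize = 4096, 6, 4
	free := pageSize - headerSize
	fmt.Println(free)                    // 4090 usable bytes on a fresh page
	fmt.Println(free / (ptrSize + 10))   // 292 ten-byte tuples
	fmt.Println(free / (ptrSize + 1000)) // 4 thousand-byte tuples
}
```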
PostgreSQL comparison
PostgreSQL’s PageHeaderData (in src/include/storage/bufpage.h) has more fields than ours:
- `pd_lsn` — Log Sequence Number (for WAL; we’ll add this in Phase 9)
- `pd_checksum` — data integrity check
- `pd_flags` — page state flags
- `pd_lower` — our FreeSpaceStart
- `pd_upper` — our FreeSpaceEnd
- `pd_special` — pointer to special space (used by indexes)
- `pd_pagesize_version` — page size and layout version
Our simplified header (FreeSpaceStart, FreeSpaceEnd, ItemCount) captures the essential mechanics.
The Heap File (storage/heapfile.go)
What is it?
A HeapFile is a file on disk that contains a sequence of pages, laid out back-to-back with no gaps and no file header.
File on disk:
[Page 0: bytes 0-4095] [Page 1: bytes 4096-8191] [Page 2: bytes 8192-12287] ...
Page N lives at byte offset N * 4096. The total number of pages is fileSize / 4096.
Why no file header?
Simplicity. The only metadata we need is “how many pages”, and we can derive that from the file size. Real databases add file headers for versioning, checksums, etc., but for learning, the math `fileSize / PageSize` is all we need.
How each operation works
AllocatePage():
- Compute the new page ID: `fileSize / 4096`.
- Create an initialized Page (with proper header values).
- Write 4096 bytes at the end of the file.
- The file is now one page longer.

ReadPage(pageID):
- Check `pageID < PageCount()` — can’t read beyond the file.
- Compute the byte offset: `pageID * 4096`.
- Use `file.ReadAt()` to read exactly 4096 bytes at that offset.
- Return the bytes as a `*Page`.

WritePage(pageID, page):
- Check `pageID < PageCount()` — can’t write beyond the file.
- Compute the byte offset: `pageID * 4096`.
- Use `file.WriteAt()` to overwrite 4096 bytes at that offset.
Why ReadAt/WriteAt instead of Seek+Read?
ReadAt and WriteAt are atomic with respect to the file offset — they don’t move the file cursor. This means multiple goroutines can read different pages simultaneously without interfering. Regular Seek + Read would require a mutex because one goroutine could seek while another reads.
Persistence
The key test: insert data, close the file, reopen it, data is still there. This is what makes a database a database — data survives process restarts. The bytes are physically on disk (or at least in the OS’s write cache, which Sync()/fsync flushes to actual hardware).
PostgreSQL comparison
PostgreSQL’s storage manager (md.c) is more complex:
- Files are split into 1GB segments (OS file size limits).
- It maintains a file descriptor cache (can’t have thousands of files open).
- There’s a “free space map” that tracks which pages have room for new tuples.
- A “visibility map” tracks which pages have been fully vacuumed.
Our HeapFile is the simplified core: a file of pages, read/write by page ID.
How Page and HeapFile work together
Application
│
▼
HeapFile.ReadPage(2)
│
├── Seeks to byte 8192 in the file
├── Reads 4096 bytes
└── Returns a *Page
│
▼
Page.GetTuple(0)
│
├── Reads ItemPointer at slot 0
├── Finds offset=4091, length=5
├── Copies bytes from [4091:4096]
└── Returns []byte("alice")
The HeapFile handles which page (disk I/O), the Page handles which tuple within that page (in-memory byte manipulation). This separation is important because later (Phase 5), the Buffer Pool will sit between the application and the HeapFile, caching pages in memory so we don’t hit disk every time.
Key Go patterns used
encoding/binary.LittleEndian
All multi-byte numbers (uint16) are stored in little-endian byte order. This matches x86/ARM CPUs. PutUint16 writes 2 bytes, Uint16 reads 2 bytes. This is how the database serializes its internal structures to raw bytes.
Uint16 always reads exactly 2 bytes from the start of the slice you give it:
p.Data[0:] // slice of 4096 bytes → reads p.Data[0] and p.Data[1]
p.Data[2:] // slice of 4094 bytes → reads p.Data[2] and p.Data[3]
p.Data[100:] // slice of 3996 bytes → reads p.Data[100] and p.Data[101]
The slice can be any length — Uint16 only looks at the first 2 bytes, ignoring the rest. Similarly, Uint32 always reads 4, Uint64 always reads 8. The open-ended slice ([n:]) is just Go’s way of saying “start at byte n” — the function knows how many bytes it needs.
PutUint16 does the reverse — writes a number as 2 bytes:
binary.LittleEndian.PutUint16(p.Data[0:], 4091)
// 4091 = 0x0FFB
// p.Data[0] = 0xFB (low byte first — that's what "little-endian" means)
// p.Data[1] = 0x0F (high byte second)
[PageSize]byte (fixed-size array, not slice)
A Page is exactly 4096 bytes, always. Using a fixed array instead of a slice means no accidental resizing, and the size is a compile-time guarantee. When we write it to disk with WriteAt(page.Data[:], offset), the [:] converts the array to a slice.
t.TempDir() in tests
Go’s testing package creates a temporary directory that’s automatically cleaned up when the test ends. This means our heap file tests create real files on disk but leave no mess behind.
*Page (pointer receiver)
All Page methods use pointer receivers (func (p *Page)) because:
- Page is 4096 bytes — copying it on every method call would be wasteful.
- Methods like
InsertTuplemodify the page in place.
Files created in Phase 1
| File | Purpose | Lines |
|---|---|---|
| `storage/page.go` | Slotted page: insert/get/delete tuples by slot index | ~140 |
| `storage/page_test.go` | 11 tests covering insert, read, delete, capacity, edge cases | ~200 |
| `storage/heapfile.go` | File of pages on disk: read/write/allocate pages | ~100 |
| `storage/heapfile_test.go` | 9 tests covering create, persist, multi-page, file size | ~170 |
What’s next
Phase 2 (Tuple Format) defines what the raw bytes inside a page slot actually mean — how to serialize a row with typed columns (int, varchar, bool) and NULL values into the []byte that InsertTuple accepts.
Observations
- The first important idea is that bytes are not managed one by one on disk. They are grouped into 4KB pages, and the page is the unit the database reasons about.
- A heap file is then just the disk-level container for those pages. In this project, one table maps to one heap file, and one heap file contains many pages.
- What made this phase click for me is that accessing a table does not mean loading the whole file. The database loads the pages it needs into RAM, works on them there, and writes them back through the storage layer.