OLTP – Phase 5 Buffer Pool Manager
Every time the database reads a row, it calls HeapFile.ReadPage() which seeks to a position in the file, reads 4096 bytes from disk, and copies them into memory. When a sequential scan reads page 0, then page 1, then page 0 again — it reads page 0 from disk twice. The OS has a page cache that might help, but the database can’t control or rely on it.
Phase 5 adds a buffer pool: a fixed-size array of page-sized slots in memory. When you need a page, the pool checks if it’s already cached. If so, you get the in-memory copy instantly. If not, it reads from disk and caches it for next time.
This is the single biggest performance optimization in any database. Everything else — indexes, query optimization, parallel scans — builds on top of having fast page access.
In PostgreSQL, this is bufmgr.c (buffer manager). Every PostgreSQL process accesses pages through the shared buffer pool. The default is 128MB (32,768 pages), configurable via shared_buffers.
Here is the roadmap for the phases to come:
- Phase 5: Buffer pool manager
- Phase 6: SQL parser
- Phase 7: Query executor
- Phase 8: B-tree index
- Phase 9: Write-ahead log
- Phase 10: Transactions, concurrency, REPL, and server
Full Source Code
The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.
How a Buffer Pool Works
The pool is a fixed-size array of frames. Each frame can hold one page. The pool acts as a middleman between the rest of the database and the disk:
1
2
3
4
5
6
Before (Phase 1-4):
Table → HeapFile.ReadPage() → disk
After (Phase 5):
Table → BufferPool.FetchPage() → already cached? return it
→ not cached? read from disk, cache, return it
Three rules govern the pool:
-
Pin before use. When you fetch a page, it’s “pinned” — the pool won’t evict it while you’re using it. You must unpin when done.
-
Mark dirty on write. If you modify a page, you tell the pool by unpinning with
isDirty=true. The pool knows it needs to write this page back to disk eventually. -
Evict LRU when full. When all frames are occupied and you need a new page, the least recently used unpinned page gets evicted. If it’s dirty, it’s flushed to disk first.
The Data Structures
PageID — which page, which file
1
2
3
4
type PageID struct {
FileID uint32
PageNum uint32
}
Each table has its own heap file. FileID identifies the file (assigned when the file is registered with the pool), and PageNum is the page’s position within that file. Together they uniquely identify any page in the system.
BufferFrame — one slot in the pool
1
2
3
4
5
6
7
type BufferFrame struct {
Page Page
ID PageID
PinCount int
IsDirty bool
valid bool
}
Each frame holds:
Page— the actual 4096-byte page data (not a pointer — the page lives directly in the frame)ID— which page this frame is cachingPinCount— how many callers are currently using this page (0 = eligible for eviction)IsDirty— whether the page has been modified since last written to diskvalid— whether this frame is occupied (starts false for empty frames)
BufferPool — the cache
1
2
3
4
5
6
7
8
type BufferPool struct {
frames []BufferFrame
pageTable map[PageID]int // PageID → frame index
files map[uint32]*HeapFile // FileID → HeapFile
lruList *list.List // LRU eviction order
lruMap map[int]*list.Element // frame index → LRU list node
nextFileID uint32
}
The key data structures inside the pool:
frames— fixed-size array of frames. This IS the cache. A pool with 1024 frames caches up to 1024 pages (4MB).pageTable— hash map for O(1) lookup: “is this page cached, and if so, which frame?”files— registered heap files, so the pool can read/write pages on cache misseslruList+lruMap— tracks eviction order (explained below)
The Operations
FetchPage — “give me this page”
1
func (bp *BufferPool) FetchPage(pid PageID) (*Page, error)
Two paths:
Cache hit (page is already in the pool):
- Look up
pidinpageTable→ get frame index - Increment
PinCount - Remove from LRU list (pinned pages can’t be evicted)
- Return
&frame.Page
Cache miss (page not in the pool):
- Find a free frame (or evict an LRU victim)
- Read the page from disk via
HeapFile.ReadPage() - Copy it into the frame, set
PinCount = 1 - Add to
pageTable - Return
&frame.Page
The returned pointer points directly into the frame. If you call FetchPage twice for the same page, you get the same pointer. This means modifications through that pointer are visible to all callers.
UnpinPage — “I’m done with this page”
1
func (bp *BufferPool) UnpinPage(pid PageID, isDirty bool) error
- Decrement
PinCount - If
isDirty, setIsDirty = true(sticky — once dirty, stays dirty until flushed) - If
PinCountreaches 0, add to LRU list (now eligible for eviction)
FlushPage — “write this to disk now”
1
func (bp *BufferPool) FlushPage(pid PageID) error
Writes the page to disk via HeapFile.WritePage() and clears the dirty flag. The page stays in the pool — it’s not evicted.
NewPage — “allocate a new page”
1
func (bp *BufferPool) NewPage(fileID uint32) (PageID, *Page, error)
- Call
HeapFile.AllocatePage()to extend the file on disk - Find a free frame (or evict)
- Initialize the page in the frame
- Pin it (PinCount = 1)
- Return the PageID and page pointer
LRU Eviction
When the pool is full and every frame is occupied, a new page needs space. The pool evicts the least recently used unpinned page.
The LRU list is a doubly-linked list (Go’s container/list):
1
2
Front (least recent) ←→ ... ←→ Back (most recent)
page 3 page 7
- When a page is unpinned (PinCount goes to 0): add it to the back of the list
- When a page is fetched (cache hit): remove it from the list (it’s pinned now)
- When eviction is needed: take from the front of the list (least recently used)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Pool with 3 frames:
1. Fetch page A → read from disk, pin frames: [A*]
2. Fetch page B → read from disk, pin frames: [A*, B*]
3. Fetch page C → read from disk, pin frames: [A*, B*, C*]
4. Unpin A LRU: [A]
5. Unpin B LRU: [A, B]
6. Unpin C LRU: [A, B, C]
7. Fetch B again → cache hit, pin, remove from LRU LRU: [A, C]
8. Unpin B LRU: [A, C, B]
9. Fetch page D → pool full!
→ evict A (front of LRU, flush if dirty)
→ read D from disk LRU: [C, B]
* = pinned
If all frames are pinned (LRU list is empty), FetchPage returns ErrBufferFull. This is the caller’s fault — they’re holding too many pages at once.
The Refactoring
Before Phase 5, Table.InsertRow() looked like this:
1
2
3
page, err := t.HeapFile.ReadPage(lastPageID) // disk read
page.InsertTuple(data) // modify in memory
t.HeapFile.WritePage(lastPageID, page) // disk write
Every insert did a disk read AND a disk write. If you insert 100 rows into the same page, that’s 100 reads and 100 writes.
After Phase 5:
1
2
3
page, err := t.Pool.FetchPage(lastPID) // cache hit (usually)
page.InsertTuple(data) // modify in-pool page
t.Pool.UnpinPage(lastPID, true) // mark dirty, flush later
The first insert reads from disk. The next 99 are cache hits — no disk I/O at all. The dirty page gets written to disk once, when it’s evicted or when the pool is closed.
The Table struct changed from:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// Before
type Table struct {
Name string
Schema *tuple.Schema
HeapFile *storage.HeapFile
}
// After
type Table struct {
Name string
Schema *tuple.Schema
Pool *storage.BufferPool
FileID uint32
}
The table no longer touches the HeapFile directly. The pool is the single gateway to disk.
The Catalog creates the pool internally:
1
2
3
4
5
6
7
8
func OpenCatalog(dataDir string) (*Catalog, error) {
cat := &Catalog{
tables: make(map[string]*Table),
dataDir: dataDir,
pool: storage.NewBufferPool(defaultPoolSize),
}
// ...
}
When Close() is called, the pool flushes all dirty pages to disk, then closes the heap files. This ensures data survives across restarts.
How Phase 5 connects to previous phases
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Phase 5: BufferPool (NEW — sits between Table and HeapFile)
│
├── FetchPage(pid) → cache hit? return from frames[]
│ → cache miss? HeapFile.ReadPage() ← Phase 1
│
├── UnpinPage(pid, dirty) → track in LRU list
│
├── FlushPage(pid) → HeapFile.WritePage() ← Phase 1
│
└── NewPage(fileID) → HeapFile.AllocatePage() ← Phase 1
Phase 4: SeqScanNode{Table, Filter, Columns}
│
└── Table.ScanAll() → TableIterator
│
└── Next() → Pool.FetchPage() ← Phase 5 (NEW)
└── filter with Expression.Eval() ← Phase 4
└── Deserialize() ← Phase 2
The SeqScan code didn’t change at all. It still calls Table.ScanAll(), which returns a TableIterator. The iterator now uses the pool internally instead of reading the HeapFile directly.
What this looks like in action
From the buffer pool test dump:
1
2
3
4
5
6
7
8
9
10
=== Buffer Pool State ===
Pool size: 5 frames
Pages cached: 3
LRU list size: 2 (unpinned pages)
Frame 0: file=0 page=0 DIRTY unpinned data="row-0"
Frame 1: file=0 page=1 clean pinned(1) data="row-1"
Frame 2: file=0 page=2 DIRTY unpinned data="row-2"
Frame 3: [empty]
Frame 4: [empty]
Three pages cached, two empty frames available. Page 1 is pinned (someone is using it). Pages 0 and 2 are dirty and unpinned — they can be evicted, but their modifications need to be flushed first.
Why this matters
Without a buffer pool, a sequential scan of a 1000-page table does 1000 disk reads. With a pool of 1024 frames, the first scan does 1000 disk reads (cold cache), but the second scan does zero (all cached).
More importantly, writes become dramatically faster. Without the pool, every InsertRow does a read + write. With the pool, inserts that land on the same page (which is most inserts, since pages hold many rows) are all in-memory modifications. The page gets written to disk once — when it’s evicted or flushed.
Real databases tune their buffer pool size carefully. PostgreSQL defaults to 128MB but production systems often set it to 25% of RAM. A database with 64GB of RAM might have a 16GB buffer pool — millions of pages cached.
PostgreSQL comparison
PostgreSQL’s buffer manager (bufmgr.c) is far more complex:
- Clock sweep instead of LRU: PostgreSQL uses a clock-based approximation of LRU that’s cheaper to maintain under high concurrency
- Shared memory: the buffer pool lives in shared memory so all backend processes share the same cache
- Buffer descriptors: separate from the buffer pages, with spinlocks for concurrent access
- Pin counting with wait queues: processes can wait for a pin to become available
- Ring buffer strategy: sequential scans use a small ring buffer to avoid evicting the entire cache
- Background writer: a separate process that flushes dirty pages gradually, avoiding write storms
Our system uses a straightforward LRU list with no concurrency. But the core idea is the same: cache pages in memory, track dirty pages, evict when full.
Go patterns used
container/list for the LRU
1
2
3
4
5
6
import "container/list"
bp.lruList = list.New()
elem := bp.lruList.PushBack(frameIdx) // add to back (most recent)
bp.lruList.Remove(elem) // remove from anywhere in O(1)
front := bp.lruList.Front() // get least recently used
Go’s container/list is a doubly-linked list in the standard library. It gives O(1) insert, remove, and access to both ends — exactly what an LRU needs.
The lruMap (frame index → list element) lets us remove a specific element in O(1) without searching the list. Without this map, removing a page from the LRU on a cache hit would require O(n) traversal.
Pointer into struct field
1
return &frame.Page, nil
FetchPage returns a pointer to the Page field inside the frame struct. The page data lives inside the frame — there’s no separate allocation. Multiple callers that fetch the same page all get the same pointer, so modifications are visible to everyone.
Files created/modified in Phase 5
| File | Status | Purpose |
|---|---|---|
storage/bufferpool.go |
NEW | BufferPool, BufferFrame, PageID, LRU eviction |
storage/bufferpool_test.go |
NEW | 13 tests: caching, eviction, flush, pinning |
catalog/table.go |
MODIFIED | Table uses Pool+FileID instead of HeapFile |
catalog/catalog.go |
MODIFIED | Catalog creates internal BufferPool |
catalog/catalog_test.go |
MODIFIED | Replaced HeapFile.PageCount() with Table.PageCount() |
What’s next
Phase 6 (SQL Parser) adds a tokenizer and recursive-descent parser that turns SQL strings like SELECT name FROM users WHERE age > 25 into an AST (Abstract Syntax Tree). This is where the database starts understanding SQL text instead of requiring manually constructed Go structs.
Observations
- The buffer pool is where performance starts to matter because it reduces how often the database has to go back to disk.
- The naming is helpful: a buffer is temporary in-memory data, and the pool is a fixed set of reusable slots for those buffers.
- Each frame holds a page, and pinning versus unpinning is what prevents the system from evicting a page that is currently in use.
- The design is a combination of a frame array, a page-to-frame lookup map, a heap file registry, and LRU bookkeeping. Together they turn page access into cache access whenever possible.
- Another important shift is architectural: tables and the catalog stop dealing with heap files directly. They go through the buffer pool, which decides when to flush back to disk.