OLTP – Phase 5 Buffer Pool Manager

Every time the database reads a row, it calls HeapFile.ReadPage() which seeks to a position in the file, reads 4096 bytes from disk, and copies them into memory. When a sequential scan reads page 0, then page 1, then page 0 again — it reads page 0 from disk twice. The OS has a page cache that might help, but the database can’t control or rely on it.

Phase 5 adds a buffer pool: a fixed-size array of page-sized slots in memory. When you need a page, the pool checks if it’s already cached. If so, you get the in-memory copy instantly. If not, it reads from disk and caches it for next time.

This is the single biggest performance optimization in any database. Everything else — indexes, query optimization, parallel scans — builds on top of having fast page access.

In PostgreSQL, this is bufmgr.c (buffer manager). Every PostgreSQL process accesses pages through the shared buffer pool. The default is 128MB (32,768 pages), configurable via shared_buffers.

Here is the roadmap for the phases to come:

  • Phase 5: Buffer pool manager
  • Phase 6: SQL parser
  • Phase 7: Query executor
  • Phase 8: B-tree index
  • Phase 9: Write-ahead log
  • Phase 10: Transactions, concurrency, REPL, and server

Full Source Code

The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.

How a Buffer Pool Works

The pool is a fixed-size array of frames. Each frame can hold one page. The pool acts as a middleman between the rest of the database and the disk:

1
2
3
4
5
6
Before (Phase 1-4):
  Table → HeapFile.ReadPage() → disk

After (Phase 5):
  Table → BufferPool.FetchPage() → already cached? return it
                                  → not cached? read from disk, cache, return it

Three rules govern the pool:

  1. Pin before use. When you fetch a page, it’s “pinned” — the pool won’t evict it while you’re using it. You must unpin when done.

  2. Mark dirty on write. If you modify a page, you tell the pool by unpinning with isDirty=true. The pool knows it needs to write this page back to disk eventually.

  3. Evict LRU when full. When all frames are occupied and you need a new page, the least recently used unpinned page gets evicted. If it’s dirty, it’s flushed to disk first.


The Data Structures

PageID — which page, which file

1
2
3
4
type PageID struct {
    FileID  uint32
    PageNum uint32
}

Each table has its own heap file. FileID identifies the file (assigned when the file is registered with the pool), and PageNum is the page’s position within that file. Together they uniquely identify any page in the system.

BufferFrame — one slot in the pool

1
2
3
4
5
6
7
type BufferFrame struct {
    Page     Page
    ID       PageID
    PinCount int
    IsDirty  bool
    valid    bool
}

Each frame holds:

  • Page — the actual 4096-byte page data (not a pointer — the page lives directly in the frame)
  • ID — which page this frame is caching
  • PinCount — how many callers are currently using this page (0 = eligible for eviction)
  • IsDirty — whether the page has been modified since last written to disk
  • valid — whether this frame is occupied (starts false for empty frames)

BufferPool — the cache

1
2
3
4
5
6
7
8
type BufferPool struct {
    frames     []BufferFrame
    pageTable  map[PageID]int        // PageID → frame index
    files      map[uint32]*HeapFile  // FileID → HeapFile
    lruList    *list.List            // LRU eviction order
    lruMap     map[int]*list.Element // frame index → LRU list node
    nextFileID uint32
}

The key data structures inside the pool:

  • frames — fixed-size array of frames. This IS the cache. A pool with 1024 frames caches up to 1024 pages (4MB).
  • pageTable — hash map for O(1) lookup: “is this page cached, and if so, which frame?”
  • files — registered heap files, so the pool can read/write pages on cache misses
  • lruList + lruMap — tracks eviction order (explained below)

The Operations

FetchPage — “give me this page”

1
func (bp *BufferPool) FetchPage(pid PageID) (*Page, error)

Two paths:

Cache hit (page is already in the pool):

  1. Look up pid in pageTable → get frame index
  2. Increment PinCount
  3. Remove from LRU list (pinned pages can’t be evicted)
  4. Return &frame.Page

Cache miss (page not in the pool):

  1. Find a free frame (or evict an LRU victim)
  2. Read the page from disk via HeapFile.ReadPage()
  3. Copy it into the frame, set PinCount = 1
  4. Add to pageTable
  5. Return &frame.Page

The returned pointer points directly into the frame. If you call FetchPage twice for the same page, you get the same pointer. This means modifications through that pointer are visible to all callers.

UnpinPage — “I’m done with this page”

1
func (bp *BufferPool) UnpinPage(pid PageID, isDirty bool) error
  1. Decrement PinCount
  2. If isDirty, set IsDirty = true (sticky — once dirty, stays dirty until flushed)
  3. If PinCount reaches 0, add to LRU list (now eligible for eviction)

FlushPage — “write this to disk now”

1
func (bp *BufferPool) FlushPage(pid PageID) error

Writes the page to disk via HeapFile.WritePage() and clears the dirty flag. The page stays in the pool — it’s not evicted.

NewPage — “allocate a new page”

1
func (bp *BufferPool) NewPage(fileID uint32) (PageID, *Page, error)
  1. Call HeapFile.AllocatePage() to extend the file on disk
  2. Find a free frame (or evict)
  3. Initialize the page in the frame
  4. Pin it (PinCount = 1)
  5. Return the PageID and page pointer

LRU Eviction

When the pool is full and every frame is occupied, a new page needs space. The pool evicts the least recently used unpinned page.

The LRU list is a doubly-linked list (Go’s container/list):

1
2
Front (least recent) ←→ ... ←→ Back (most recent)
  page 3                           page 7
  • When a page is unpinned (PinCount goes to 0): add it to the back of the list
  • When a page is fetched (cache hit): remove it from the list (it’s pinned now)
  • When eviction is needed: take from the front of the list (least recently used)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Pool with 3 frames:

  1. Fetch page A → read from disk, pin     frames: [A*]
  2. Fetch page B → read from disk, pin     frames: [A*, B*]
  3. Fetch page C → read from disk, pin     frames: [A*, B*, C*]
  4. Unpin A                                 LRU: [A]
  5. Unpin B                                 LRU: [A, B]
  6. Unpin C                                 LRU: [A, B, C]
  7. Fetch B again → cache hit, pin, remove from LRU    LRU: [A, C]
  8. Unpin B                                 LRU: [A, C, B]
  9. Fetch page D → pool full!
     → evict A (front of LRU, flush if dirty)
     → read D from disk                     LRU: [C, B]

  * = pinned

If all frames are pinned (LRU list is empty), FetchPage returns ErrBufferFull. This is the caller’s fault — they’re holding too many pages at once.


The Refactoring

Before Phase 5, Table.InsertRow() looked like this:

1
2
3
page, err := t.HeapFile.ReadPage(lastPageID)   // disk read
page.InsertTuple(data)                          // modify in memory
t.HeapFile.WritePage(lastPageID, page)           // disk write

Every insert did a disk read AND a disk write. If you insert 100 rows into the same page, that’s 100 reads and 100 writes.

After Phase 5:

1
2
3
page, err := t.Pool.FetchPage(lastPID)          // cache hit (usually)
page.InsertTuple(data)                           // modify in-pool page
t.Pool.UnpinPage(lastPID, true)                  // mark dirty, flush later

The first insert reads from disk. The next 99 are cache hits — no disk I/O at all. The dirty page gets written to disk once, when it’s evicted or when the pool is closed.

The Table struct changed from:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
// Before
type Table struct {
    Name     string
    Schema   *tuple.Schema
    HeapFile *storage.HeapFile
}

// After
type Table struct {
    Name   string
    Schema *tuple.Schema
    Pool   *storage.BufferPool
    FileID uint32
}

The table no longer touches the HeapFile directly. The pool is the single gateway to disk.

The Catalog creates the pool internally:

1
2
3
4
5
6
7
8
func OpenCatalog(dataDir string) (*Catalog, error) {
    cat := &Catalog{
        tables:  make(map[string]*Table),
        dataDir: dataDir,
        pool:    storage.NewBufferPool(defaultPoolSize),
    }
    // ...
}

When Close() is called, the pool flushes all dirty pages to disk, then closes the heap files. This ensures data survives across restarts.


How Phase 5 connects to previous phases

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Phase 5: BufferPool (NEW — sits between Table and HeapFile)
    │
    ├── FetchPage(pid) → cache hit? return from frames[]
    │                   → cache miss? HeapFile.ReadPage()       ← Phase 1
    │
    ├── UnpinPage(pid, dirty) → track in LRU list
    │
    ├── FlushPage(pid) → HeapFile.WritePage()                   ← Phase 1
    │
    └── NewPage(fileID) → HeapFile.AllocatePage()               ← Phase 1

Phase 4: SeqScanNode{Table, Filter, Columns}
    │
    └── Table.ScanAll() → TableIterator
            │
            └── Next() → Pool.FetchPage()                      ← Phase 5 (NEW)
                  └── filter with Expression.Eval()             ← Phase 4
                        └── Deserialize()                       ← Phase 2

The SeqScan code didn’t change at all. It still calls Table.ScanAll(), which returns a TableIterator. The iterator now uses the pool internally instead of reading the HeapFile directly.


What this looks like in action

From the buffer pool test dump:

1
2
3
4
5
6
7
8
9
10
=== Buffer Pool State ===
  Pool size:     5 frames
  Pages cached:  3
  LRU list size: 2 (unpinned pages)

  Frame 0: file=0 page=0  DIRTY  unpinned  data="row-0"
  Frame 1: file=0 page=1  clean  pinned(1) data="row-1"
  Frame 2: file=0 page=2  DIRTY  unpinned  data="row-2"
  Frame 3: [empty]
  Frame 4: [empty]

Three pages cached, two empty frames available. Page 1 is pinned (someone is using it). Pages 0 and 2 are dirty and unpinned — they can be evicted, but their modifications need to be flushed first.


Why this matters

Without a buffer pool, a sequential scan of a 1000-page table does 1000 disk reads. With a pool of 1024 frames, the first scan does 1000 disk reads (cold cache), but the second scan does zero (all cached).

More importantly, writes become dramatically faster. Without the pool, every InsertRow does a read + write. With the pool, inserts that land on the same page (which is most inserts, since pages hold many rows) are all in-memory modifications. The page gets written to disk once — when it’s evicted or flushed.

Real databases tune their buffer pool size carefully. PostgreSQL defaults to 128MB but production systems often set it to 25% of RAM. A database with 64GB of RAM might have a 16GB buffer pool — millions of pages cached.


PostgreSQL comparison

PostgreSQL’s buffer manager (bufmgr.c) is far more complex:

  • Clock sweep instead of LRU: PostgreSQL uses a clock-based approximation of LRU that’s cheaper to maintain under high concurrency
  • Shared memory: the buffer pool lives in shared memory so all backend processes share the same cache
  • Buffer descriptors: separate from the buffer pages, with spinlocks for concurrent access
  • Pin counting with wait queues: processes can wait for a pin to become available
  • Ring buffer strategy: sequential scans use a small ring buffer to avoid evicting the entire cache
  • Background writer: a separate process that flushes dirty pages gradually, avoiding write storms

Our system uses a straightforward LRU list with no concurrency. But the core idea is the same: cache pages in memory, track dirty pages, evict when full.


Go patterns used

container/list for the LRU

1
2
3
4
5
6
import "container/list"

bp.lruList = list.New()
elem := bp.lruList.PushBack(frameIdx)  // add to back (most recent)
bp.lruList.Remove(elem)                 // remove from anywhere in O(1)
front := bp.lruList.Front()             // get least recently used

Go’s container/list is a doubly-linked list in the standard library. It gives O(1) insert, remove, and access to both ends — exactly what an LRU needs.

The lruMap (frame index → list element) lets us remove a specific element in O(1) without searching the list. Without this map, removing a page from the LRU on a cache hit would require O(n) traversal.

Pointer into struct field

1
return &frame.Page, nil

FetchPage returns a pointer to the Page field inside the frame struct. The page data lives inside the frame — there’s no separate allocation. Multiple callers that fetch the same page all get the same pointer, so modifications are visible to everyone.


Files created/modified in Phase 5

File Status Purpose
storage/bufferpool.go NEW BufferPool, BufferFrame, PageID, LRU eviction
storage/bufferpool_test.go NEW 13 tests: caching, eviction, flush, pinning
catalog/table.go MODIFIED Table uses Pool+FileID instead of HeapFile
catalog/catalog.go MODIFIED Catalog creates internal BufferPool
catalog/catalog_test.go MODIFIED Replaced HeapFile.PageCount() with Table.PageCount()

What’s next

Phase 6 (SQL Parser) adds a tokenizer and recursive-descent parser that turns SQL strings like SELECT name FROM users WHERE age > 25 into an AST (Abstract Syntax Tree). This is where the database starts understanding SQL text instead of requiring manually constructed Go structs.

Observations

  • The buffer pool is where performance starts to matter because it reduces how often the database has to go back to disk.
  • The naming is helpful: a buffer is temporary in-memory data, and the pool is a fixed set of reusable slots for those buffers.
  • Each frame holds a page, and pinning versus unpinning is what prevents the system from evicting a page that is currently in use.
  • The design is a combination of a frame array, a page-to-frame lookup map, a heap file registry, and LRU bookkeeping. Together they turn page access into cache access whenever possible.
  • Another important shift is architectural: tables and the catalog stop dealing with heap files directly. They go through the buffer pool, which decides when to flush back to disk.