OLAP – Phase 2 Columnar Storage

Phase 1 gave us in-memory vectors and data chunks. But a database needs to persist data to disk and read it back. The challenge is designing a disk format that preserves the benefits of columnar layout: read only the columns you need, skip entire sections when the data can’t match your query.

This phase builds two structures: RowGroups (horizontal partitions of a table, each holding up to 122,880 rows) and ColumnSegments (one column’s serialized data within a row group, annotated with min/max zone maps).

In DuckDB, these are RowGroup (src/storage/table/row_group.cpp) and ColumnSegment (src/storage/table/column_segment.cpp). The default row group size of 122,880 (60 x 2048 vectors) is defined in row_group.hpp.

Here is the roadmap for the phases to come:

  • Phase 2: Columnar storage (row groups and column segments)
  • Phase 3: Compression
  • Phase 4: Table, catalog, and bulk loading
  • Phase 5: Vectorized expressions and scan/filter/project
  • Phase 6: Hash aggregation
  • Phase 7: Hash join
  • Phase 8: SQL parser
  • Phase 9: Query planner and optimizer
  • Phase 10: Sorting, parallel execution, REPL, and server

Full Source Code

The code referenced in this post can be found in https://gitlab.com/kimserey.lam/olap-learn.

Why 122,880 Rows?

A row group batches 60 vector-sized chunks:

1
const RowGroupSize = 60 * vector.VectorSize // 122,880

Why 60? It’s a trade-off:

  • Too small (e.g., 2048 rows): too many segments, zone maps are useless because the value ranges overlap everywhere
  • Too large (e.g., 10 million rows): reading one row group loads too much data, nullifying the benefit of skipping
  • 122,880: large enough that zone maps are meaningful (a sorted column will have tight min/max ranges per group), small enough that skipping a group saves real I/O

Column Segments and Zone Maps

A ColumnSegment holds all values for one column within one row group:

1
2
3
4
5
6
7
8
9
type ColumnSegment struct {
    Type        vector.LogicalType
    Count       uint32
    NullCount   uint32
    Compression CompressionType
    ZoneMap     ZoneMap
    Data        []byte
    Nulls       []byte
}

The ZoneMap stores min/max statistics for the segment’s values:

1
2
3
4
5
6
type ZoneMap struct {
    HasStats bool
    Min      any
    Max      any
    HasNulls bool
}

Zone maps are the OLAP equivalent of a B-tree index, but free — they’re computed once during writes and checked during reads. For WHERE year = 2024, the scan operator checks each row group’s zone map before decompressing any data:

  • min=2020, max=2022skip (2024 can’t be in this range)
  • min=2023, max=2025maybe (decompress and check row by row)
  • min=2024, max=2024all match (every row satisfies the predicate)

The zone map supports five comparison checks:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
func (z *ZoneMap) CheckEquals(val any) ZoneMapResult {
    if !z.HasStats {
        return ZoneMapMaybe
    }
    cmpMin := compareValues(val, z.Min)
    cmpMax := compareValues(val, z.Max)
    if cmpMin < 0 || cmpMax > 0 {
        return ZoneMapSkip
    }
    if cmpMin == 0 && cmpMax == 0 {
        return ZoneMapAllMatch
    }
    return ZoneMapMaybe
}

And similarly for CheckGreaterThan, CheckLessThan, CheckGreaterThanOrEqual, CheckLessThanOrEqual.

Computing Zone Maps

During serialization, we scan all valid values to find min and max:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
func computeMinMax(v *vector.Vector) (hasStats bool, min, max any) {
    count := v.Count()
    if count == 0 {
        return false, nil, nil
    }
    firstValid := -1
    for i := 0; i < count; i++ {
        if v.IsValid(i) {
            firstValid = i
            break
        }
    }
    if firstValid < 0 {
        return false, nil, nil
    }
    // type-switch to find min/max across valid values
    // ...
}

Building a Segment from a Vector

SegmentFromVector serializes an in-memory Vector into a ColumnSegment:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
func SegmentFromVector(v *vector.Vector) *ColumnSegment {
    seg := &ColumnSegment{
        Type:        v.Type(),
        Count:       uint32(v.Count()),
        Compression: CompressionNone,
    }
    seg.NullCount = uint32(v.NullCount())
    seg.ZoneMap.HasNulls = seg.NullCount > 0
    seg.Nulls = serializeNullBitmask(v)
    rawData := serializeData(v)
    seg.ZoneMap.HasStats, seg.ZoneMap.Min, seg.ZoneMap.Max = computeMinMax(v)

    compression, compressedData := AutoSelectCompression(rawData, int(seg.Count), seg.Type)
    seg.Compression = compression
    seg.Data = compressedData
    return seg
}

The null bitmask is serialized as one bit per value (1=valid, 0=null):

1
2
3
4
5
6
7
8
9
10
11
func serializeNullBitmask(v *vector.Vector) []byte {
    count := v.Count()
    nbytes := (count + 7) / 8
    buf := make([]byte, nbytes)
    for i := 0; i < count; i++ {
        if v.IsValid(i) {
            buf[i/8] |= 1 << uint(i%8)
        }
    }
    return buf
}

Row Groups

A RowGroup groups column segments together:

1
2
3
4
type RowGroup struct {
    RowCount uint32
    Columns  []*ColumnSegment
}

Building a row group from DataChunks merges all chunks’ columns into single vectors, then converts each to a segment:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
func NewRowGroupFromChunks(chunks []*vector.DataChunk) *RowGroup {
    colCount := chunks[0].ColumnCount()
    totalRows := 0
    for _, c := range chunks {
        totalRows += c.Count()
    }

    merged := make([]*vector.Vector, colCount)
    for col := 0; col < colCount; col++ {
        typ := chunks[0].Column(col).Type()
        v := vector.NewFlatVector(typ)
        for _, c := range chunks {
            src := c.Column(col)
            for row := 0; row < c.Count(); row++ {
                // append each value from src to v
            }
        }
        merged[col] = v
    }

    segments := make([]*ColumnSegment, colCount)
    for i, v := range merged {
        segments[i] = SegmentFromVector(v)
    }
    return &RowGroup{RowCount: uint32(totalRows), Columns: segments}
}

Reading Back with Column Projection

When deserializing, we only decompress the columns the query needs:

1
2
3
4
5
6
7
8
9
10
11
12
13
func (rg *RowGroup) ToDataChunk(columnIndices []int) *vector.DataChunk {
    types := make([]vector.LogicalType, len(columnIndices))
    for i, colIdx := range columnIndices {
        types[i] = rg.Columns[colIdx].Type
    }
    chunk := vector.NewDataChunk(types)
    for i, colIdx := range columnIndices {
        v := rg.Columns[colIdx].ToVector()
        chunk.SetColumn(i, v)
    }
    chunk.SetCount(int(rg.RowCount))
    return chunk
}

If a query only needs 2 of 10 columns, only those 2 segments are decompressed.


File Format

The on-disk format is a single binary file per table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
File Header (16 bytes):
  magic: "QUAK" (4 bytes)
  version: 1 (4 bytes)
  column count (4 bytes)
  row group count (4 bytes)

RowGroup 0:
  row count (4 bytes)
  ColumnSegment 0: header(18 bytes) + zone map + null bitmask + data
  ColumnSegment 1: header + zone map + null bitmask + data
  ...

RowGroup 1:
  ...

The segment header is 18 bytes:

1
2
3
4
5
6
7
8
9
10
11
func (seg *ColumnSegment) WriteTo(w io.Writer) error {
    // type(1) + count(4) + nullcount(4) + compression(1) + datalen(4) + nulllen(4) = 18
    header := make([]byte, 18)
    header[0] = byte(seg.Type)
    binary.LittleEndian.PutUint32(header[1:5], seg.Count)
    binary.LittleEndian.PutUint32(header[5:9], seg.NullCount)
    header[9] = byte(seg.Compression)
    binary.LittleEndian.PutUint32(header[10:14], uint32(len(seg.Data)))
    binary.LittleEndian.PutUint32(header[14:18], uint32(len(seg.Nulls)))
    // write header, zone map, nulls, data
}

Selective Reading

For queries that don’t need all columns, ReadRowGroupsSelective reads segment headers to determine byte offsets, then seeks past unneeded columns:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
func ReadRowGroupsSelective(r io.ReadSeeker, columnIndices []int) ([]*RowGroup, error) {
    // read file header
    needed := make(map[int]bool, len(columnIndices))
    for _, idx := range columnIndices {
        needed[idx] = true
    }

    for i := 0; i < rgCount; i++ {
        for col := 0; col < colCount; col++ {
            if needed[col] {
                seg, _ := ReadSegmentFrom(r)
                segments[segIdx] = seg
            } else {
                SkipSegmentFrom(r)  // seek past without decompressing
            }
        }
    }
}

SkipSegmentFrom reads just the 18-byte header to learn the data length, then seeks past the data bytes. This means reading 2 of 10 columns only touches ~20% of the file.


Summary

The row group + column segment model gives us three properties that define OLAP storage:

  1. Column projection: only read the columns the query needs
  2. Zone map pruning: skip entire row groups when min/max statistics prove no match is possible
  3. Compression boundary: each segment can be compressed independently (Phase 3)

The file format is simple but effective — no index structures, no page management, no buffer pool. Just sequential column segments with headers that enable skipping.