OLAP – Phase 2 Columnar Storage
Phase 1 gave us in-memory vectors and data chunks. But a database needs to persist data to disk and read it back. The challenge is designing a disk format that preserves the benefits of columnar layout: read only the columns you need, skip entire sections when the data can’t match your query.
This phase builds two structures: RowGroups (horizontal partitions of a table, each holding up to 122,880 rows) and ColumnSegments (one column’s serialized data within a row group, annotated with min/max zone maps).
In DuckDB, these are RowGroup (src/storage/table/row_group.cpp) and ColumnSegment (src/storage/table/column_segment.cpp). The default row group size of 122,880 (60 x 2048 vectors) is defined in row_group.hpp.
Here is the roadmap for the phases to come:
- Phase 2: Columnar storage (row groups and column segments)
- Phase 3: Compression
- Phase 4: Table, catalog, and bulk loading
- Phase 5: Vectorized expressions and scan/filter/project
- Phase 6: Hash aggregation
- Phase 7: Hash join
- Phase 8: SQL parser
- Phase 9: Query planner and optimizer
- Phase 10: Sorting, parallel execution, REPL, and server
Full Source Code
The code referenced in this post can be found in https://gitlab.com/kimserey.lam/olap-learn.
Why 122,880 Rows?
A row group batches 60 vector-sized chunks:
1
const RowGroupSize = 60 * vector.VectorSize // 122,880
Why 60? It’s a trade-off:
- Too small (e.g., 2048 rows): too many segments, zone maps are useless because the value ranges overlap everywhere
- Too large (e.g., 10 million rows): reading one row group loads too much data, nullifying the benefit of skipping
- 122,880: large enough that zone maps are meaningful (a sorted column will have tight min/max ranges per group), small enough that skipping a group saves real I/O
Column Segments and Zone Maps
A ColumnSegment holds all values for one column within one row group:
1
2
3
4
5
6
7
8
9
type ColumnSegment struct {
Type vector.LogicalType
Count uint32
NullCount uint32
Compression CompressionType
ZoneMap ZoneMap
Data []byte
Nulls []byte
}
The ZoneMap stores min/max statistics for the segment’s values:
1
2
3
4
5
6
type ZoneMap struct {
HasStats bool
Min any
Max any
HasNulls bool
}
Zone maps are the OLAP equivalent of a B-tree index, but free — they’re computed once during writes and checked during reads. For WHERE year = 2024, the scan operator checks each row group’s zone map before decompressing any data:
min=2020, max=2022→ skip (2024 can’t be in this range)min=2023, max=2025→ maybe (decompress and check row by row)min=2024, max=2024→ all match (every row satisfies the predicate)
The zone map supports five comparison checks:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
func (z *ZoneMap) CheckEquals(val any) ZoneMapResult {
if !z.HasStats {
return ZoneMapMaybe
}
cmpMin := compareValues(val, z.Min)
cmpMax := compareValues(val, z.Max)
if cmpMin < 0 || cmpMax > 0 {
return ZoneMapSkip
}
if cmpMin == 0 && cmpMax == 0 {
return ZoneMapAllMatch
}
return ZoneMapMaybe
}
And similarly for CheckGreaterThan, CheckLessThan, CheckGreaterThanOrEqual, CheckLessThanOrEqual.
Computing Zone Maps
During serialization, we scan all valid values to find min and max:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
func computeMinMax(v *vector.Vector) (hasStats bool, min, max any) {
count := v.Count()
if count == 0 {
return false, nil, nil
}
firstValid := -1
for i := 0; i < count; i++ {
if v.IsValid(i) {
firstValid = i
break
}
}
if firstValid < 0 {
return false, nil, nil
}
// type-switch to find min/max across valid values
// ...
}
Building a Segment from a Vector
SegmentFromVector serializes an in-memory Vector into a ColumnSegment:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
func SegmentFromVector(v *vector.Vector) *ColumnSegment {
seg := &ColumnSegment{
Type: v.Type(),
Count: uint32(v.Count()),
Compression: CompressionNone,
}
seg.NullCount = uint32(v.NullCount())
seg.ZoneMap.HasNulls = seg.NullCount > 0
seg.Nulls = serializeNullBitmask(v)
rawData := serializeData(v)
seg.ZoneMap.HasStats, seg.ZoneMap.Min, seg.ZoneMap.Max = computeMinMax(v)
compression, compressedData := AutoSelectCompression(rawData, int(seg.Count), seg.Type)
seg.Compression = compression
seg.Data = compressedData
return seg
}
The null bitmask is serialized as one bit per value (1=valid, 0=null):
1
2
3
4
5
6
7
8
9
10
11
func serializeNullBitmask(v *vector.Vector) []byte {
count := v.Count()
nbytes := (count + 7) / 8
buf := make([]byte, nbytes)
for i := 0; i < count; i++ {
if v.IsValid(i) {
buf[i/8] |= 1 << uint(i%8)
}
}
return buf
}
Row Groups
A RowGroup groups column segments together:
1
2
3
4
type RowGroup struct {
RowCount uint32
Columns []*ColumnSegment
}
Building a row group from DataChunks merges all chunks’ columns into single vectors, then converts each to a segment:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
func NewRowGroupFromChunks(chunks []*vector.DataChunk) *RowGroup {
colCount := chunks[0].ColumnCount()
totalRows := 0
for _, c := range chunks {
totalRows += c.Count()
}
merged := make([]*vector.Vector, colCount)
for col := 0; col < colCount; col++ {
typ := chunks[0].Column(col).Type()
v := vector.NewFlatVector(typ)
for _, c := range chunks {
src := c.Column(col)
for row := 0; row < c.Count(); row++ {
// append each value from src to v
}
}
merged[col] = v
}
segments := make([]*ColumnSegment, colCount)
for i, v := range merged {
segments[i] = SegmentFromVector(v)
}
return &RowGroup{RowCount: uint32(totalRows), Columns: segments}
}
Reading Back with Column Projection
When deserializing, we only decompress the columns the query needs:
1
2
3
4
5
6
7
8
9
10
11
12
13
func (rg *RowGroup) ToDataChunk(columnIndices []int) *vector.DataChunk {
types := make([]vector.LogicalType, len(columnIndices))
for i, colIdx := range columnIndices {
types[i] = rg.Columns[colIdx].Type
}
chunk := vector.NewDataChunk(types)
for i, colIdx := range columnIndices {
v := rg.Columns[colIdx].ToVector()
chunk.SetColumn(i, v)
}
chunk.SetCount(int(rg.RowCount))
return chunk
}
If a query only needs 2 of 10 columns, only those 2 segments are decompressed.
File Format
The on-disk format is a single binary file per table:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
File Header (16 bytes):
magic: "QUAK" (4 bytes)
version: 1 (4 bytes)
column count (4 bytes)
row group count (4 bytes)
RowGroup 0:
row count (4 bytes)
ColumnSegment 0: header(18 bytes) + zone map + null bitmask + data
ColumnSegment 1: header + zone map + null bitmask + data
...
RowGroup 1:
...
The segment header is 18 bytes:
1
2
3
4
5
6
7
8
9
10
11
func (seg *ColumnSegment) WriteTo(w io.Writer) error {
// type(1) + count(4) + nullcount(4) + compression(1) + datalen(4) + nulllen(4) = 18
header := make([]byte, 18)
header[0] = byte(seg.Type)
binary.LittleEndian.PutUint32(header[1:5], seg.Count)
binary.LittleEndian.PutUint32(header[5:9], seg.NullCount)
header[9] = byte(seg.Compression)
binary.LittleEndian.PutUint32(header[10:14], uint32(len(seg.Data)))
binary.LittleEndian.PutUint32(header[14:18], uint32(len(seg.Nulls)))
// write header, zone map, nulls, data
}
Selective Reading
For queries that don’t need all columns, ReadRowGroupsSelective reads segment headers to determine byte offsets, then seeks past unneeded columns:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
func ReadRowGroupsSelective(r io.ReadSeeker, columnIndices []int) ([]*RowGroup, error) {
// read file header
needed := make(map[int]bool, len(columnIndices))
for _, idx := range columnIndices {
needed[idx] = true
}
for i := 0; i < rgCount; i++ {
for col := 0; col < colCount; col++ {
if needed[col] {
seg, _ := ReadSegmentFrom(r)
segments[segIdx] = seg
} else {
SkipSegmentFrom(r) // seek past without decompressing
}
}
}
}
SkipSegmentFrom reads just the 18-byte header to learn the data length, then seeks past the data bytes. This means reading 2 of 10 columns only touches ~20% of the file.
Summary
The row group + column segment model gives us three properties that define OLAP storage:
- Column projection: only read the columns the query needs
- Zone map pruning: skip entire row groups when min/max statistics prove no match is possible
- Compression boundary: each segment can be compressed independently (Phase 3)
The file format is simple but effective — no index structures, no page management, no buffer pool. Just sequential column segments with headers that enable skipping.