OLAP – Phase 1 Vectors and DataChunks
The fundamental difference between OLTP and OLAP databases is how they store and process data. An OLTP database like PostgreSQL stores rows — when you insert a customer record, all columns (name, email, balance) sit together on one page. An OLAP database like DuckDB stores columns — all names sit together, all emails sit together, all balances sit together. This layout is what makes analytical queries fast: SELECT SUM(balance) FROM customers only touches the balance column, ignoring everything else.
This phase builds the two foundational data structures of a columnar engine: Vectors (typed columnar arrays) and DataChunks (batches of column vectors).
In DuckDB, these are Vector (src/include/duckdb/common/types/vector.hpp) and DataChunk (src/include/duckdb/common/types/data_chunk.hpp). The vector size constant STANDARD_VECTOR_SIZE = 2048 is defined in vector_size.hpp.
Here is the roadmap for the phases to come:
- Phase 1: Vectors and DataChunks
- Phase 2: Columnar storage (row groups and column segments)
- Phase 3: Compression
- Phase 4: Table, catalog, and bulk loading
- Phase 5: Vectorized expressions and scan/filter/project
- Phase 6: Hash aggregation
- Phase 7: Hash join
- Phase 8: SQL parser
- Phase 9: Query planner and optimizer
- Phase 10: Sorting, parallel execution, REPL, and server
Full Source Code
The code referenced in this post can be found in https://gitlab.com/kimserey.lam/olap-learn.
Why 2048?
Before looking at the code, it’s worth understanding why OLAP engines don’t just process one row at a time (like the Volcano iterator model in PostgreSQL) or all rows at once.
- One row at a time: too much per-row overhead — function call per row, branch mispredictions, no SIMD
- All rows at once: the full column might be millions of values — won’t fit in CPU cache
- 2048 rows at a time: a batch of 2048 Int64 values is 16KB, which fits comfortably in L1 cache (typically 32-64KB). This “vectorized” approach gets the cache benefits of columnar layout with bounded memory
This batch size is the constant that everything else is built around:
1
2
const VectorSize = 2048
const ValidityMaskSize = VectorSize / 64 // 32 uint64s
Logical Types
Every vector has a type. We support five:
1
2
3
4
5
6
7
8
9
type LogicalType uint8
const (
Boolean LogicalType = iota
Int32
Int64
Float64
Varchar
)
Fixed-width types (Boolean, Int32, Int64, Float64) store values in contiguous arrays — []int32, []int64, etc. Variable-width types (Varchar) use []string. The type also determines which compression codecs can apply (a decision made in Phase 3).
The Vector
A Vector is a typed array of up to 2048 values with a validity bitmask that tracks NULLs:
1
2
3
4
5
6
7
type Vector struct {
typ LogicalType
count int
constant bool
data any // []int32, []string, etc.
validity [ValidityMaskSize]uint64 // 32 × uint64 = 2048 bits
}
Validity Bitmask
Instead of using a *int32 pointer (8 bytes overhead per value) or a []bool (1 byte per value), we use a bitmask: one bit per value, packed into uint64 words. For 2048 values, that’s 32 words = 256 bytes — 8x smaller than a []bool.
Bit 1 means “valid” (non-null), bit 0 means “null”:
1
2
3
4
5
6
7
func (v *Vector) SetNull(i int) {
v.validity[i/64] &^= 1 << uint(i%64)
}
func (v *Vector) IsValid(i int) bool {
return (v.validity[i/64]>>uint(i%64))&1 == 1
}
The AllValid() method checks if any values are null by scanning the bitmask words. For a vector with no nulls, this is a fast loop over 32 uint64s:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
func (v *Vector) AllValid() bool {
if v.count == 0 {
return true
}
fullWords := v.count / 64
for i := 0; i < fullWords; i++ {
if v.validity[i] != ^uint64(0) {
return false
}
}
remainder := v.count % 64
if remainder > 0 {
mask := (uint64(1) << uint(remainder)) - 1
if v.validity[fullWords]&mask != mask {
return false
}
}
return true
}
Flat vs Constant Vectors
A flat vector stores N independent values — one per row. This is what you get from scanning a column:
1
2
3
4
5
6
func NewFlatVector(typ LogicalType) *Vector {
v := &Vector{typ: typ}
v.initData(VectorSize)
v.initValidityAllValid()
return v
}
A constant vector stores a single value that applies to all rows. This is used for literal constants in expressions — WHERE amount > 100 creates a constant vector holding 100:
1
2
3
4
5
6
func NewConstantVector(typ LogicalType) *Vector {
v := &Vector{typ: typ, constant: true}
v.initData(1) // only 1 slot needed
v.validity[0] = 1
return v
}
The constant flag tells the execution engine to read index 0 regardless of which row is being processed.
Raw Data Access
For vectorized operations (processing all 2048 values in a tight loop), we need direct access to the underlying typed slice — not one-at-a-time getters:
1
2
3
4
5
func (v *Vector) Int32Data() []int32 { return v.data.([]int32) }
func (v *Vector) Int64Data() []int64 { return v.data.([]int64) }
func (v *Vector) Float64Data() []float64 { return v.data.([]float64) }
func (v *Vector) BoolData() []bool { return v.data.([]bool) }
func (v *Vector) VarcharData() []string { return v.data.([]string) }
This is the key to vectorized execution: instead of calling GetInt32(i) in a loop (which does bounds checking and null checking per call), the execution engine grabs the raw slice and iterates directly:
1
2
3
4
data := v.Int32Data()
for i := 0; i < v.Count(); i++ {
data[i] = data[i] * 2 // tight loop, no function calls
}
The DataChunk
A DataChunk is a collection of column vectors representing a batch of rows. It’s the unit that flows through the execution pipeline — every operator receives a DataChunk and produces a DataChunk:
1
2
3
4
type DataChunk struct {
columns []*Vector
count int
}
Conceptually:
1
2
3
4
5
DataChunk (3 columns, 5 rows):
Column 0 (Int32): [1, 2, 3, 4, 5 ]
Column 1 (Varchar): ["alice","bob", "carol","dave","eve"]
Column 2 (Float64): [100.0, 200.0, 150.0, 300.0, 50.0]
count: 5
Creating a DataChunk from column types:
1
2
3
4
5
6
7
func NewDataChunk(types []LogicalType) *DataChunk {
columns := make([]*Vector, len(types))
for i, t := range types {
columns[i] = NewFlatVector(t)
}
return &DataChunk{columns: columns}
}
Appending Rows
While the internal format is columnar, data often arrives row-by-row (from CSV files, INSERT statements, etc.). The Append method transposes a row into column vectors:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
func (c *DataChunk) Append(values []any) error {
if len(values) != len(c.columns) {
return ErrColumnCountMismatch
}
for i, col := range c.columns {
val := values[i]
if val == nil {
if err := col.AppendNull(); err != nil {
return err
}
continue
}
switch col.Type() {
case Int32:
if err := col.AppendInt32(val.(int32)); err != nil {
return err
}
// ... other types
}
}
c.count++
return nil
}
The SelectionVector
The SelectionVector is an array of indices that selects a subset of rows from a vector without copying data:
1
2
3
4
type SelectionVector struct {
indices []uint16
count int
}
Why uint16? Vector indices range from 0 to 2047 — a uint16 (max 65535) is sufficient and saves memory.
When a filter evaluates to [true, false, true, false, true], instead of copying the three matching rows into a new vector, we create a SelectionVector [0, 2, 4]. All subsequent operators use this selection to skip non-matching rows across all columns simultaneously:
1
2
3
Original: [10, 20, 30, 40, 50, 60, 70, 80]
Selection: [1, 3, 5, 7]
Result: vector[1]=20, vector[3]=40, vector[5]=60, vector[7]=80
This is a zero-copy optimization — the data stays in place, and we only shuffle a small array of indices. For a filter that eliminates 90% of rows, we avoid copying 90% of the data across all columns.
Summary
These three primitives — Vector, DataChunk, and SelectionVector — form the foundation of everything that follows. Every row group (Phase 2) decompresses into Vectors. Every operator (Phase 5) consumes and produces DataChunks. Every filter (Phase 5) outputs a SelectionVector instead of copying data. The 2048-value batch size is the constant that makes all of this cache-friendly.