OLAP – Phase 1 Vectors and DataChunks

The fundamental difference between OLTP and OLAP databases is how they store and process data. An OLTP database like PostgreSQL stores rows — when you insert a customer record, all columns (name, email, balance) sit together on one page. An OLAP database like DuckDB stores columns — all names sit together, all emails sit together, all balances sit together. This layout is what makes analytical queries fast: SELECT SUM(balance) FROM customers only touches the balance column, ignoring everything else.

This phase builds the two foundational data structures of a columnar engine: Vectors (typed columnar arrays) and DataChunks (batches of column vectors).

In DuckDB, these are Vector (src/include/duckdb/common/types/vector.hpp) and DataChunk (src/include/duckdb/common/types/data_chunk.hpp). The vector size constant STANDARD_VECTOR_SIZE = 2048 is defined in vector_size.hpp.

Here is the roadmap for the phases to come:

  • Phase 1: Vectors and DataChunks
  • Phase 2: Columnar storage (row groups and column segments)
  • Phase 3: Compression
  • Phase 4: Table, catalog, and bulk loading
  • Phase 5: Vectorized expressions and scan/filter/project
  • Phase 6: Hash aggregation
  • Phase 7: Hash join
  • Phase 8: SQL parser
  • Phase 9: Query planner and optimizer
  • Phase 10: Sorting, parallel execution, REPL, and server

Full Source Code

The code referenced in this post can be found in https://gitlab.com/kimserey.lam/olap-learn.

Why 2048?

Before looking at the code, it’s worth understanding why OLAP engines don’t just process one row at a time (like the Volcano iterator model in PostgreSQL) or all rows at once.

  • One row at a time: too much per-row overhead — function call per row, branch mispredictions, no SIMD
  • All rows at once: the full column might be millions of values — won’t fit in CPU cache
  • 2048 rows at a time: a batch of 2048 Int64 values is 16KB, which fits comfortably in L1 cache (typically 32-64KB). This “vectorized” approach gets the cache benefits of columnar layout with bounded memory

This batch size is the constant that everything else is built around:

1
2
const VectorSize = 2048
const ValidityMaskSize = VectorSize / 64 // 32 uint64s

Logical Types

Every vector has a type. We support five:

1
2
3
4
5
6
7
8
9
type LogicalType uint8

const (
    Boolean LogicalType = iota
    Int32
    Int64
    Float64
    Varchar
)

Fixed-width types (Boolean, Int32, Int64, Float64) store values in contiguous arrays — []int32, []int64, etc. Variable-width types (Varchar) use []string. The type also determines which compression codecs can apply (a decision made in Phase 3).

The Vector

A Vector is a typed array of up to 2048 values with a validity bitmask that tracks NULLs:

1
2
3
4
5
6
7
type Vector struct {
    typ      LogicalType
    count    int
    constant bool
    data     any                        // []int32, []string, etc.
    validity [ValidityMaskSize]uint64   // 32 × uint64 = 2048 bits
}

Validity Bitmask

Instead of using a *int32 pointer (8 bytes overhead per value) or a []bool (1 byte per value), we use a bitmask: one bit per value, packed into uint64 words. For 2048 values, that’s 32 words = 256 bytes — 8x smaller than a []bool.

Bit 1 means “valid” (non-null), bit 0 means “null”:

1
2
3
4
5
6
7
func (v *Vector) SetNull(i int) {
    v.validity[i/64] &^= 1 << uint(i%64)
}

func (v *Vector) IsValid(i int) bool {
    return (v.validity[i/64]>>uint(i%64))&1 == 1
}

The AllValid() method checks if any values are null by scanning the bitmask words. For a vector with no nulls, this is a fast loop over 32 uint64s:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
func (v *Vector) AllValid() bool {
    if v.count == 0 {
        return true
    }
    fullWords := v.count / 64
    for i := 0; i < fullWords; i++ {
        if v.validity[i] != ^uint64(0) {
            return false
        }
    }
    remainder := v.count % 64
    if remainder > 0 {
        mask := (uint64(1) << uint(remainder)) - 1
        if v.validity[fullWords]&mask != mask {
            return false
        }
    }
    return true
}

Flat vs Constant Vectors

A flat vector stores N independent values — one per row. This is what you get from scanning a column:

1
2
3
4
5
6
func NewFlatVector(typ LogicalType) *Vector {
    v := &Vector{typ: typ}
    v.initData(VectorSize)
    v.initValidityAllValid()
    return v
}

A constant vector stores a single value that applies to all rows. This is used for literal constants in expressions — WHERE amount > 100 creates a constant vector holding 100:

1
2
3
4
5
6
func NewConstantVector(typ LogicalType) *Vector {
    v := &Vector{typ: typ, constant: true}
    v.initData(1)     // only 1 slot needed
    v.validity[0] = 1
    return v
}

The constant flag tells the execution engine to read index 0 regardless of which row is being processed.

Raw Data Access

For vectorized operations (processing all 2048 values in a tight loop), we need direct access to the underlying typed slice — not one-at-a-time getters:

1
2
3
4
5
func (v *Vector) Int32Data() []int32   { return v.data.([]int32) }
func (v *Vector) Int64Data() []int64   { return v.data.([]int64) }
func (v *Vector) Float64Data() []float64 { return v.data.([]float64) }
func (v *Vector) BoolData() []bool     { return v.data.([]bool) }
func (v *Vector) VarcharData() []string { return v.data.([]string) }

This is the key to vectorized execution: instead of calling GetInt32(i) in a loop (which does bounds checking and null checking per call), the execution engine grabs the raw slice and iterates directly:

1
2
3
4
data := v.Int32Data()
for i := 0; i < v.Count(); i++ {
    data[i] = data[i] * 2  // tight loop, no function calls
}

The DataChunk

A DataChunk is a collection of column vectors representing a batch of rows. It’s the unit that flows through the execution pipeline — every operator receives a DataChunk and produces a DataChunk:

1
2
3
4
type DataChunk struct {
    columns []*Vector
    count   int
}

Conceptually:

1
2
3
4
5
DataChunk (3 columns, 5 rows):
  Column 0 (Int32):   [1,      2,     3,      4,     5    ]
  Column 1 (Varchar): ["alice","bob", "carol","dave","eve"]
  Column 2 (Float64): [100.0,  200.0, 150.0,  300.0, 50.0]
  count: 5

Creating a DataChunk from column types:

1
2
3
4
5
6
7
func NewDataChunk(types []LogicalType) *DataChunk {
    columns := make([]*Vector, len(types))
    for i, t := range types {
        columns[i] = NewFlatVector(t)
    }
    return &DataChunk{columns: columns}
}

Appending Rows

While the internal format is columnar, data often arrives row-by-row (from CSV files, INSERT statements, etc.). The Append method transposes a row into column vectors:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
func (c *DataChunk) Append(values []any) error {
    if len(values) != len(c.columns) {
        return ErrColumnCountMismatch
    }
    for i, col := range c.columns {
        val := values[i]
        if val == nil {
            if err := col.AppendNull(); err != nil {
                return err
            }
            continue
        }
        switch col.Type() {
        case Int32:
            if err := col.AppendInt32(val.(int32)); err != nil {
                return err
            }
        // ... other types
        }
    }
    c.count++
    return nil
}

The SelectionVector

The SelectionVector is an array of indices that selects a subset of rows from a vector without copying data:

1
2
3
4
type SelectionVector struct {
    indices []uint16
    count   int
}

Why uint16? Vector indices range from 0 to 2047 — a uint16 (max 65535) is sufficient and saves memory.

When a filter evaluates to [true, false, true, false, true], instead of copying the three matching rows into a new vector, we create a SelectionVector [0, 2, 4]. All subsequent operators use this selection to skip non-matching rows across all columns simultaneously:

1
2
3
Original:  [10, 20, 30, 40, 50, 60, 70, 80]
Selection: [1, 3, 5, 7]
Result:    vector[1]=20, vector[3]=40, vector[5]=60, vector[7]=80

This is a zero-copy optimization — the data stays in place, and we only shuffle a small array of indices. For a filter that eliminates 90% of rows, we avoid copying 90% of the data across all columns.


Summary

These three primitives — Vector, DataChunk, and SelectionVector — form the foundation of everything that follows. Every row group (Phase 2) decompresses into Vectors. Every operator (Phase 5) consumes and produces DataChunks. Every filter (Phase 5) outputs a SelectionVector instead of copying data. The 2048-value batch size is the constant that makes all of this cache-friendly.