OLTP – Phase 6 SQL Parser

Until now, every query is built manually in Go — constructing executor nodes and expression structs by hand. That works for testing, but a real database needs to understand SQL text. Phase 6 adds a parser that turns a SQL string like SELECT name, age FROM users WHERE age > 25 into a tree of Go structs (an AST) that the executor can work with. This phase is pure syntax — no storage, no execution. It takes a string and produces a data structure.

For example, instead of building nodes manually:

1
2
3
4
5
6
7
8
9
scan := &SeqScanNode{
    Table: tbl,
    Filter: CompareExpr{
        Left:  ColumnRef{Name: "age"},
        Right: Literal{Value: tuple.IntValue(25)},
        Op:    OpGt,
    },
    Columns: []int{1, 2},
}

We can now parse:

1
SELECT name, age FROM users WHERE age > 25;

In PostgreSQL, this is gram.y (grammar), scan.l (lexer), and parsenodes.h (AST nodes).

Here is the roadmap for the phases to come:

  • Phase 6: SQL parser
  • Phase 7: Query executor
  • Phase 8: B-tree index
  • Phase 9: Write-ahead log
  • Phase 10: Transactions, concurrency, REPL, and server

Full Source Code

The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.

The Pipeline

Parsing happens in two stages:

1
SQL string → [Lexer] → tokens → [Parser] → AST

Lexer (tokenizer): Breaks the raw text into a sequence of tokens. Each token is a meaningful unit: a keyword, a number, a string, an operator, a symbol.

1
2
3
"SELECT name FROM users WHERE age > 25;"

→ [SELECT] [IDENT:"name"] [FROM] [IDENT:"users"] [WHERE] [IDENT:"age"] [GT:">"] [INT_LIT:"25"] [SEMICOLON] [EOF]

Parser: Takes the token sequence and builds a tree structure (AST) that represents the meaning of the SQL statement.

1
2
3
4
5
6
7
8
9
SelectStmt {
    Columns: [ColumnRef{"name"}]
    From:    "users"
    Where:   BinaryExpr {
        Left:  ColumnRef{"age"}
        Op:    ">"
        Right: IntLit{25}
    }
}

The Lexer (parser/lexer.go)

The lexer reads one character at a time and classifies what it sees:

1
2
3
4
5
type Lexer struct {
    input string
    pos   int
    line  int
}

NextToken() looks at the current character and decides:

  • Whitespace (space, tab, newline): skip, but count newlines for error messages
  • Letter or underscore: read an identifier (age, users), then check if it’s a keyword (SELECT, FROM)
  • Digit: read an integer literal (25, 100)
  • Single quote: read a string literal ('alice')
  • Operator characters: =, !=, <>, <, >, <=, >=
  • Symbols: ,, (, ), ;, *

Keywords vs identifiers

When the lexer reads letters, it doesn’t immediately know if select is a keyword or a column name. It reads the whole word, then checks a keyword map:

1
2
3
4
5
6
var keywords = map[string]TokenType{
    "SELECT":  TOKEN_SELECT,
    "FROM":    TOKEN_FROM,
    "WHERE":   TOKEN_WHERE,
    // ...
}

The lookup is case-insensitive — select, SELECT, and Select all produce TOKEN_SELECT. If the word isn’t a keyword, it’s an identifier (TOKEN_IDENT).

Two-character operators

Some operators are two characters: !=, <>, <=, >=. The lexer handles these by peeking at the next character:

1
2
3
4
5
6
7
8
case '<':
    if next == '=' {
        return LTEQ   // <=
    }
    if next == '>' {
        return NEQ    // <>  (SQL's other way to write !=)
    }
    return LT         // <

The AST (parser/ast.go)

The AST (Abstract Syntax Tree) represents the structure of a SQL statement as Go structs. Two interfaces define the tree:

1
2
3
4
5
6
7
type Statement interface {
    stmtNode()
}

type Expression interface {
    exprNode()
}

Statement types

Each SQL command is a different struct:

1
2
3
4
5
CreateTableStmt    CREATE TABLE users (id INT, name VARCHAR(50));
InsertStmt         INSERT INTO users VALUES (1, 'alice', true);
SelectStmt         SELECT name FROM users WHERE age > 25;
UpdateStmt         UPDATE users SET name = 'bob' WHERE id = 1;
DeleteStmt         DELETE FROM users WHERE id = 1;

Expression types

Expressions appear inside statements — in WHERE clauses, SELECT lists, INSERT values, and SET assignments.

Type Example SQL
ColumnRef {Name: "age"} age
IntLit {Value: 25} 25
StringLit {Value: "alice"} 'alice'
BoolLit {Value: true} true
NullLit {} NULL
StarExpr {} *
BinaryExpr {Left, Op: ">", Right} age > 25
NotExpr {Expr} NOT active

The AST is a tree

Just like the expression trees from Phase 4, the AST forms a tree through composition:

1
2
3
4
5
6
7
8
9
10
11
12
SELECT name, age FROM users WHERE active = true AND age >= 18

SelectStmt
├── Columns: [ColumnRef{"name"}, ColumnRef{"age"}]
├── From: "users"
└── Where: BinaryExpr (AND)
           ├── Left:  BinaryExpr (=)
           │          ├── Left:  ColumnRef{"active"}
           │          └── Right: BoolLit{true}
           └── Right: BinaryExpr (>=)
                      ├── Left:  ColumnRef{"age"}
                      └── Right: IntLit{18}

The Parser (parser/parser.go)

Recursive descent

The parser uses recursive descent — each grammar rule is a function, and functions call each other. This is the simplest parsing technique and the one most hand-written parsers use.

The entry point dispatches based on the first keyword:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
func (p *Parser) parseStatement() (Statement, error) {
    switch p.current.Type {
    case TOKEN_SELECT:
        return p.parseSelect()
    case TOKEN_INSERT:
        return p.parseInsert()
    case TOKEN_CREATE:
        return p.parseCreateTable()
    case TOKEN_UPDATE:
        return p.parseUpdate()
    case TOKEN_DELETE:
        return p.parseDelete()
    }
}

Each parse function reads tokens left to right, building up the AST node. parseSelect reads the column list, expects FROM, reads the table name, and optionally parses a WHERE clause.

The current / advance / expect pattern

The parser keeps track of one token at a time:

1
2
3
4
type Parser struct {
    lexer   *Lexer
    current Token     // the token we're looking at right now
}
  • advance() — consume the current token and load the next one
  • expect(TOKEN_FROM) — assert the current token is FROM, consume it, return its value. If it’s not FROM, return an error with a useful message

This pattern appears everywhere:

1
2
3
4
5
6
7
8
9
// Parse: FROM users
if _, err := p.expect(TOKEN_FROM); err != nil {
    return nil, err  // "line 1: expected FROM, got IDENT ("age")"
}
tok, err := p.expect(TOKEN_IDENT)
if err != nil {
    return nil, err  // "line 1: expected IDENT, got SEMICOLON (";")"
}
stmt.From = tok.Literal  // "users"

Operator precedence

When parsing WHERE a = 1 OR b = 2 AND c = 3, the parser needs to know that AND binds tighter than OR. The result should be a = 1 OR (b = 2 AND c = 3), not (a = 1 OR b = 2) AND c = 3.

This is handled by splitting expression parsing into layers, one per precedence level:

1
2
3
4
5
parseExpression  →  parseOrExpr       (lowest precedence)
                      → parseAndExpr
                          → parseNotExpr
                              → parseComparison
                                  → parsePrimary  (highest precedence)

Each layer handles one operator and delegates to the next (higher precedence) layer:

1
2
3
4
5
6
7
8
9
func (p *Parser) parseOrExpr() (Expression, error) {
    left, err := p.parseAndExpr()  // parse higher precedence first
    for p.current.Type == TOKEN_OR {
        p.advance()
        right, err := p.parseAndExpr()
        left = &BinaryExpr{Left: left, Op: "OR", Right: right}
    }
    return left, nil
}

If there’s no OR, it just returns what parseAndExpr returned. If there is an OR, it wraps both sides in a BinaryExpr. Because parseAndExpr is called first, AND operations are grouped before OR — that’s how precedence works.

Parentheses override precedence

(a = 1 OR b = 2) AND c = 3 forces OR to bind first. The parser handles this in parsePrimary:

1
2
3
4
5
case TOKEN_LPAREN:
    p.advance()
    expr, err := p.parseExpression()  // full expression parsing inside parens
    p.expect(TOKEN_RPAREN)
    return expr, nil

Inside parentheses, parseExpression starts over from the lowest precedence, so OR gets parsed as a complete expression before returning to the AND outside.


How Phase 6 connects to previous phases

Phase 6 is independent — it doesn’t import any other package from this project. It’s pure syntax.

1
2
3
4
Phase 6: SQL text → Lexer → Tokens → Parser → AST

Phase 7 will connect them:
  AST (Phase 6) → Executor → SeqScan (Phase 4) → BufferPool (Phase 5) → HeapFile (Phase 1)

The parser’s ColumnRef and the executor’s ColumnRef are similar but separate types. Phase 7 will convert between them — translating the AST into an execution plan.


What this looks like in action

From the test dump:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
SQL: SELECT name, age FROM users WHERE active = true AND age >= 18;
  Type: SELECT
  Columns: name, age
  From: users
  Where: ((active = true) AND (age >= 18))

SQL: INSERT INTO users VALUES (2, 'bob', 25, false), (3, 'carol', 35, true);
  Type: INSERT
  Table: users
  Rows: 2
    Row 0: 2, 'bob', 25, false
    Row 1: 3, 'carol', 35, true

SQL: UPDATE users SET name = 'bob', active = false WHERE id = 1;
  Type: UPDATE
  Table: users
    SET name = 'bob'
    SET active = false
  Where: (id = 1)

The parser transforms flat SQL text into structured trees that the executor can walk.


Error handling

Parse errors include the line number and what was expected:

1
2
3
4
5
6
7
8
Parse("SELECT name users;")
→ "line 1: expected FROM, got IDENT ("users")"

Parse("SELECT name FROM;")
→ "line 1: expected IDENT, got SEMICOLON (";")"

Parse("SELECT\nname\nFROM;")
→ "line 3: expected IDENT, got SEMICOLON (";")"

The line numbers come from the lexer, which tracks newlines as it scans.


PostgreSQL comparison

PostgreSQL’s parser is far more complex:

  • Bison grammar (gram.y): ~15,000 lines of grammar rules, auto-generating the parser. We use hand-written recursive descent (~250 lines).
  • Hundreds of statement types: ALTER, GRANT, EXPLAIN, WITH (CTEs), window functions, subqueries, JOINs. We support 5 statement types.
  • Type coercion: PostgreSQL resolves types during parsing ('5'::integer). We defer type handling to the executor.
  • Query rewriting: PostgreSQL transforms the parse tree before optimization (views, rules). We pass the AST directly to execution.

Our parser handles the core SQL grammar. The architecture (lexer → tokens → parser → AST) is the same.


Go patterns used

Interface with marker method

1
2
3
type Statement interface {
    stmtNode()
}

The stmtNode() method is empty — it exists only to restrict which types can be a Statement. Without it, any type would satisfy interface{}. This is Go’s pattern for “sealed” interfaces — only types that explicitly add the marker method belong.

Type switch on interface

1
2
3
4
5
6
switch s := stmt.(type) {
case *CreateTableStmt:
    // s is *CreateTableStmt here
case *SelectStmt:
    // s is *SelectStmt here
}

The parser returns Statement (an interface). The caller uses a type switch to discover which concrete statement was parsed and access its fields. Same pattern used in compareValues in Phase 4.

Error propagation

1
2
3
4
tok, err := p.expect(TOKEN_FROM)
if err != nil {
    return nil, err
}

Every parse function returns (result, error). Errors propagate up the call stack until they reach Parse(), which returns them to the caller. This is Go’s standard error handling — no exceptions, just return values.


Files created in Phase 6

File Purpose Lines
parser/token.go TokenType enum, Token struct, keyword map ~140
parser/lexer.go Lexer: breaks SQL text into tokens ~120
parser/ast.go AST node types: statements and expressions ~100
parser/parser.go Recursive descent parser: tokens → AST ~280
parser/parser_test.go 26 tests: lexer, parser, errors, dump ~420

What’s next

Phase 7 (Query Executor) connects the parser to the storage engine. It takes the AST from Parse(), resolves table names via the Catalog, converts AST expressions to executor expressions, and runs the query. After Phase 7, you have a working database: SQL string in, results out.

Observations

  • This phase separates reading SQL from executing SQL, which is a major conceptual step.
  • The lexer and tokenizer are the same idea here: read the query character by character and emit a stream of tokens such as keywords, operators, identifiers, and literal values.
  • The parser then consumes those tokens and builds an abstract syntax tree made of statements and expressions.
  • The precedence-driven structure is one of the nicest parts of the implementation. Each function handles one level such as OR, AND, NOT, comparison, and primary expressions.
  • Structural parts like the table name are attached directly to the statement, while dynamic parts like the WHERE clause or select list become expression trees.