OLTP – Phase 6 SQL Parser
Until now, every query is built manually in Go — constructing executor nodes and expression structs by hand. That works for testing, but a real database needs to understand SQL text. Phase 6 adds a parser that turns a SQL string like SELECT name, age FROM users WHERE age > 25 into a tree of Go structs (an AST) that the executor can work with. This phase is pure syntax — no storage, no execution. It takes a string and produces a data structure.
For example, instead of building nodes manually:
1
2
3
4
5
6
7
8
9
scan := &SeqScanNode{
Table: tbl,
Filter: CompareExpr{
Left: ColumnRef{Name: "age"},
Right: Literal{Value: tuple.IntValue(25)},
Op: OpGt,
},
Columns: []int{1, 2},
}
We can now parse:
1
SELECT name, age FROM users WHERE age > 25;
In PostgreSQL, this is gram.y (grammar), scan.l (lexer), and parsenodes.h (AST nodes).
Here is the roadmap for the phases to come:
- Phase 6: SQL parser
- Phase 7: Query executor
- Phase 8: B-tree index
- Phase 9: Write-ahead log
- Phase 10: Transactions, concurrency, REPL, and server
Full Source Code
The code referenced in this post can be found in https://gitlab.com/kimserey.lam/db-learn.
The Pipeline
Parsing happens in two stages:
1
SQL string → [Lexer] → tokens → [Parser] → AST
Lexer (tokenizer): Breaks the raw text into a sequence of tokens. Each token is a meaningful unit: a keyword, a number, a string, an operator, a symbol.
1
2
3
"SELECT name FROM users WHERE age > 25;"
→ [SELECT] [IDENT:"name"] [FROM] [IDENT:"users"] [WHERE] [IDENT:"age"] [GT:">"] [INT_LIT:"25"] [SEMICOLON] [EOF]
Parser: Takes the token sequence and builds a tree structure (AST) that represents the meaning of the SQL statement.
1
2
3
4
5
6
7
8
9
SelectStmt {
Columns: [ColumnRef{"name"}]
From: "users"
Where: BinaryExpr {
Left: ColumnRef{"age"}
Op: ">"
Right: IntLit{25}
}
}
The Lexer (parser/lexer.go)
The lexer reads one character at a time and classifies what it sees:
1
2
3
4
5
type Lexer struct {
input string
pos int
line int
}
NextToken() looks at the current character and decides:
- Whitespace (space, tab, newline): skip, but count newlines for error messages
- Letter or underscore: read an identifier (
age,users), then check if it’s a keyword (SELECT,FROM) - Digit: read an integer literal (
25,100) - Single quote: read a string literal (
'alice') - Operator characters:
=,!=,<>,<,>,<=,>= - Symbols:
,,(,),;,*
Keywords vs identifiers
When the lexer reads letters, it doesn’t immediately know if select is a keyword or a column name. It reads the whole word, then checks a keyword map:
1
2
3
4
5
6
var keywords = map[string]TokenType{
"SELECT": TOKEN_SELECT,
"FROM": TOKEN_FROM,
"WHERE": TOKEN_WHERE,
// ...
}
The lookup is case-insensitive — select, SELECT, and Select all produce TOKEN_SELECT. If the word isn’t a keyword, it’s an identifier (TOKEN_IDENT).
Two-character operators
Some operators are two characters: !=, <>, <=, >=. The lexer handles these by peeking at the next character:
1
2
3
4
5
6
7
8
case '<':
if next == '=' {
return LTEQ // <=
}
if next == '>' {
return NEQ // <> (SQL's other way to write !=)
}
return LT // <
The AST (parser/ast.go)
The AST (Abstract Syntax Tree) represents the structure of a SQL statement as Go structs. Two interfaces define the tree:
1
2
3
4
5
6
7
type Statement interface {
stmtNode()
}
type Expression interface {
exprNode()
}
Statement types
Each SQL command is a different struct:
1
2
3
4
5
CreateTableStmt → CREATE TABLE users (id INT, name VARCHAR(50));
InsertStmt → INSERT INTO users VALUES (1, 'alice', true);
SelectStmt → SELECT name FROM users WHERE age > 25;
UpdateStmt → UPDATE users SET name = 'bob' WHERE id = 1;
DeleteStmt → DELETE FROM users WHERE id = 1;
Expression types
Expressions appear inside statements — in WHERE clauses, SELECT lists, INSERT values, and SET assignments.
| Type | Example | SQL |
|---|---|---|
ColumnRef |
{Name: "age"} |
age |
IntLit |
{Value: 25} |
25 |
StringLit |
{Value: "alice"} |
'alice' |
BoolLit |
{Value: true} |
true |
NullLit |
{} |
NULL |
StarExpr |
{} |
* |
BinaryExpr |
{Left, Op: ">", Right} |
age > 25 |
NotExpr |
{Expr} |
NOT active |
The AST is a tree
Just like the expression trees from Phase 4, the AST forms a tree through composition:
1
2
3
4
5
6
7
8
9
10
11
12
SELECT name, age FROM users WHERE active = true AND age >= 18
SelectStmt
├── Columns: [ColumnRef{"name"}, ColumnRef{"age"}]
├── From: "users"
└── Where: BinaryExpr (AND)
├── Left: BinaryExpr (=)
│ ├── Left: ColumnRef{"active"}
│ └── Right: BoolLit{true}
└── Right: BinaryExpr (>=)
├── Left: ColumnRef{"age"}
└── Right: IntLit{18}
The Parser (parser/parser.go)
Recursive descent
The parser uses recursive descent — each grammar rule is a function, and functions call each other. This is the simplest parsing technique and the one most hand-written parsers use.
The entry point dispatches based on the first keyword:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
func (p *Parser) parseStatement() (Statement, error) {
switch p.current.Type {
case TOKEN_SELECT:
return p.parseSelect()
case TOKEN_INSERT:
return p.parseInsert()
case TOKEN_CREATE:
return p.parseCreateTable()
case TOKEN_UPDATE:
return p.parseUpdate()
case TOKEN_DELETE:
return p.parseDelete()
}
}
Each parse function reads tokens left to right, building up the AST node. parseSelect reads the column list, expects FROM, reads the table name, and optionally parses a WHERE clause.
The current / advance / expect pattern
The parser keeps track of one token at a time:
1
2
3
4
type Parser struct {
lexer *Lexer
current Token // the token we're looking at right now
}
advance()— consume the current token and load the next oneexpect(TOKEN_FROM)— assert the current token is FROM, consume it, return its value. If it’s not FROM, return an error with a useful message
This pattern appears everywhere:
1
2
3
4
5
6
7
8
9
// Parse: FROM users
if _, err := p.expect(TOKEN_FROM); err != nil {
return nil, err // "line 1: expected FROM, got IDENT ("age")"
}
tok, err := p.expect(TOKEN_IDENT)
if err != nil {
return nil, err // "line 1: expected IDENT, got SEMICOLON (";")"
}
stmt.From = tok.Literal // "users"
Operator precedence
When parsing WHERE a = 1 OR b = 2 AND c = 3, the parser needs to know that AND binds tighter than OR. The result should be a = 1 OR (b = 2 AND c = 3), not (a = 1 OR b = 2) AND c = 3.
This is handled by splitting expression parsing into layers, one per precedence level:
1
2
3
4
5
parseExpression → parseOrExpr (lowest precedence)
→ parseAndExpr
→ parseNotExpr
→ parseComparison
→ parsePrimary (highest precedence)
Each layer handles one operator and delegates to the next (higher precedence) layer:
1
2
3
4
5
6
7
8
9
func (p *Parser) parseOrExpr() (Expression, error) {
left, err := p.parseAndExpr() // parse higher precedence first
for p.current.Type == TOKEN_OR {
p.advance()
right, err := p.parseAndExpr()
left = &BinaryExpr{Left: left, Op: "OR", Right: right}
}
return left, nil
}
If there’s no OR, it just returns what parseAndExpr returned. If there is an OR, it wraps both sides in a BinaryExpr. Because parseAndExpr is called first, AND operations are grouped before OR — that’s how precedence works.
Parentheses override precedence
(a = 1 OR b = 2) AND c = 3 forces OR to bind first. The parser handles this in parsePrimary:
1
2
3
4
5
case TOKEN_LPAREN:
p.advance()
expr, err := p.parseExpression() // full expression parsing inside parens
p.expect(TOKEN_RPAREN)
return expr, nil
Inside parentheses, parseExpression starts over from the lowest precedence, so OR gets parsed as a complete expression before returning to the AND outside.
How Phase 6 connects to previous phases
Phase 6 is independent — it doesn’t import any other package from this project. It’s pure syntax.
1
2
3
4
Phase 6: SQL text → Lexer → Tokens → Parser → AST
Phase 7 will connect them:
AST (Phase 6) → Executor → SeqScan (Phase 4) → BufferPool (Phase 5) → HeapFile (Phase 1)
The parser’s ColumnRef and the executor’s ColumnRef are similar but separate types. Phase 7 will convert between them — translating the AST into an execution plan.
What this looks like in action
From the test dump:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
SQL: SELECT name, age FROM users WHERE active = true AND age >= 18;
Type: SELECT
Columns: name, age
From: users
Where: ((active = true) AND (age >= 18))
SQL: INSERT INTO users VALUES (2, 'bob', 25, false), (3, 'carol', 35, true);
Type: INSERT
Table: users
Rows: 2
Row 0: 2, 'bob', 25, false
Row 1: 3, 'carol', 35, true
SQL: UPDATE users SET name = 'bob', active = false WHERE id = 1;
Type: UPDATE
Table: users
SET name = 'bob'
SET active = false
Where: (id = 1)
The parser transforms flat SQL text into structured trees that the executor can walk.
Error handling
Parse errors include the line number and what was expected:
1
2
3
4
5
6
7
8
Parse("SELECT name users;")
→ "line 1: expected FROM, got IDENT ("users")"
Parse("SELECT name FROM;")
→ "line 1: expected IDENT, got SEMICOLON (";")"
Parse("SELECT\nname\nFROM;")
→ "line 3: expected IDENT, got SEMICOLON (";")"
The line numbers come from the lexer, which tracks newlines as it scans.
PostgreSQL comparison
PostgreSQL’s parser is far more complex:
- Bison grammar (
gram.y): ~15,000 lines of grammar rules, auto-generating the parser. We use hand-written recursive descent (~250 lines). - Hundreds of statement types: ALTER, GRANT, EXPLAIN, WITH (CTEs), window functions, subqueries, JOINs. We support 5 statement types.
- Type coercion: PostgreSQL resolves types during parsing (
'5'::integer). We defer type handling to the executor. - Query rewriting: PostgreSQL transforms the parse tree before optimization (views, rules). We pass the AST directly to execution.
Our parser handles the core SQL grammar. The architecture (lexer → tokens → parser → AST) is the same.
Go patterns used
Interface with marker method
1
2
3
type Statement interface {
stmtNode()
}
The stmtNode() method is empty — it exists only to restrict which types can be a Statement. Without it, any type would satisfy interface{}. This is Go’s pattern for “sealed” interfaces — only types that explicitly add the marker method belong.
Type switch on interface
1
2
3
4
5
6
switch s := stmt.(type) {
case *CreateTableStmt:
// s is *CreateTableStmt here
case *SelectStmt:
// s is *SelectStmt here
}
The parser returns Statement (an interface). The caller uses a type switch to discover which concrete statement was parsed and access its fields. Same pattern used in compareValues in Phase 4.
Error propagation
1
2
3
4
tok, err := p.expect(TOKEN_FROM)
if err != nil {
return nil, err
}
Every parse function returns (result, error). Errors propagate up the call stack until they reach Parse(), which returns them to the caller. This is Go’s standard error handling — no exceptions, just return values.
Files created in Phase 6
| File | Purpose | Lines |
|---|---|---|
parser/token.go |
TokenType enum, Token struct, keyword map | ~140 |
parser/lexer.go |
Lexer: breaks SQL text into tokens | ~120 |
parser/ast.go |
AST node types: statements and expressions | ~100 |
parser/parser.go |
Recursive descent parser: tokens → AST | ~280 |
parser/parser_test.go |
26 tests: lexer, parser, errors, dump | ~420 |
What’s next
Phase 7 (Query Executor) connects the parser to the storage engine. It takes the AST from Parse(), resolves table names via the Catalog, converts AST expressions to executor expressions, and runs the query. After Phase 7, you have a working database: SQL string in, results out.
Observations
- This phase separates reading SQL from executing SQL, which is a major conceptual step.
- The lexer and tokenizer are the same idea here: read the query character by character and emit a stream of tokens such as keywords, operators, identifiers, and literal values.
- The parser then consumes those tokens and builds an abstract syntax tree made of statements and expressions.
- The precedence-driven structure is one of the nicest parts of the implementation. Each function handles one level such as
OR,AND,NOT, comparison, and primary expressions. - Structural parts like the table name are attached directly to the statement, while dynamic parts like the
WHEREclause or select list become expression trees.