Skip to content

SQL

SQL is another way into the same operator pipeline. The parser compiles SQL to a QueryDescriptor, which builds the same pull-based operator chain as the DataFrame API.

import { QueryMode } from "querymode/local"
const qm = QueryMode.local()
const results = await qm
.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC")
.collect()

.sql() returns a DataFrame, so you can chain DataFrame methods after it:

const results = await qm
.sql("SELECT * FROM events WHERE created_at > '2026-01-01'")
.filter("country", "eq", "US")
.sort("amount", "desc")
.limit(50)
.collect()
SELECT *
SELECT col1, col2
SELECT col1 AS alias
SELECT DISTINCT col1, col2
SELECT COUNT(*), SUM(amount), AVG(score)
WHERE age > 25
WHERE status = 'active' AND amount >= 100
WHERE dept = 'eng' OR age > 30
WHERE name LIKE '%Alice%'
WHERE id IN (1, 2, 3)
WHERE id NOT IN (4, 5)
WHERE amount BETWEEN 100 AND 500
WHERE amount NOT BETWEEN 100 AND 500
WHERE email IS NULL
WHERE email IS NOT NULL
WHERE NOT (status = 'deleted')
GROUP BY region
GROUP BY region, category
HAVING SUM(amount) > 1000
ORDER BY amount DESC
ORDER BY region ASC, amount DESC
LIMIT 100
LIMIT 100 OFFSET 50
SELECT salary / 1000 AS salary_k
SELECT first_name || ' ' || last_name AS full_name
SELECT CASE WHEN age > 30 THEN 'senior' ELSE 'junior' END AS level
SELECT CAST(age AS text) AS age_str

COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT col), STDDEV, VARIANCE, MEDIAN, PERCENTILE(col, p).

SELECT name, ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn FROM orders
SELECT name, LAG(amount, 1) OVER (ORDER BY created_at) AS prev_amount FROM orders

ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, and rolling SUM/AVG/MIN/MAX/COUNT with frame specification.

Frame specification supports ROWS and RANGE with UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, UNBOUNDED FOLLOWING:

-- Rolling 3-row average
SELECT name, AVG(amount) OVER (
ORDER BY created_at ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS rolling_avg FROM orders
-- Running total
SELECT name, SUM(amount) OVER (
PARTITION BY region ORDER BY created_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total FROM orders
WITH active_users AS (
SELECT * FROM users WHERE status = 'active'
)
SELECT * FROM active_users WHERE age > 25

Simple CTEs (no aggregation, no sort+limit) are inlined — their filters merge with the outer query for full pushdown. Complex CTEs are materialized as intermediate results.

SELECT * FROM orders_2025 UNION ALL SELECT * FROM orders_2026 -- keeps duplicates
SELECT * FROM orders_2025 UNION SELECT * FROM orders_2026 -- deduplicates
SELECT region FROM orders INTERSECT SELECT region FROM users
SELECT id FROM orders EXCEPT SELECT id FROM refunds
SELECT * FROM orders JOIN users ON orders.user_id = users.id
SELECT * FROM orders INNER JOIN users ON orders.user_id = users.id
SELECT * FROM orders LEFT JOIN users ON orders.user_id = users.id
SELECT * FROM orders RIGHT JOIN users ON orders.user_id = users.id
SELECT * FROM orders FULL JOIN users ON orders.user_id = users.id
SELECT * FROM a CROSS JOIN b
SELECT * FROM orders NATURAL JOIN users
SELECT * FROM orders JOIN users USING (id)
SELECT * FROM images WHERE embedding NEAR [0.1, 0.2, 0.3] TOPK 10

The NEAR operator performs vector similarity search on the specified column. TOPK limits results to the K nearest neighbors. Uses IVF-PQ index when available, falls back to flat SIMD scan.

QueryMode follows SQL three-valued logic. Expressions involving NULL propagate NULL rather than returning true or false:

-- NULL comparisons → NULL (row excluded from results)
SELECT * FROM users WHERE age = NULL -- no rows (use IS NULL instead)
SELECT * FROM users WHERE NULL > 5 -- no rows
-- AND with NULL
SELECT * FROM t WHERE NULL AND true -- NULL (excluded)
SELECT * FROM t WHERE NULL AND false -- false (excluded)
-- OR with NULL
SELECT * FROM t WHERE NULL OR true -- true (included)
SELECT * FROM t WHERE NULL OR false -- NULL (excluded)
-- NOT IN with NULL elements
SELECT * FROM t WHERE id NOT IN (1, 2, NULL) -- NULL for all rows (per SQL standard)
-- BETWEEN with NULL
SELECT * FROM t WHERE NULL BETWEEN 1 AND 10 -- NULL (excluded)
-- IS NULL / IS NOT NULL (never return NULL)
SELECT * FROM t WHERE email IS NULL -- works correctly

Aggregates also follow SQL NULL rules:

ExpressionResult
SUM(col) where all values are NULLNULL
COUNT(col) where all values are NULL0
COUNT(*)counts all rows (ignores NULLs)
MIN(col) / MAX(col) on empty groupNULL
AVG(col) where all values are NULLNULL

Comparisons between different types follow these rules:

Left typeRight typeBehavior
numbernumberDirect comparison
bigintnumbernumber promoted to bigint (when integer)
stringnumberNumeric comparison if string is numeric, else string comparison
anyNULLResult is NULL

CAST converts between types explicitly:

SELECT CAST(age AS text) AS age_str -- number → string
SELECT CAST('42' AS int) AS age -- string → number
SELECT CAST(id AS bigint) AS big_id -- number → bigint

From highest to lowest:

  1. Parentheses ()
  2. Unary NOT, -
  3. Multiplication *, Division /
  4. Addition +, Subtraction -, Concatenation ||
  5. Comparison =, !=, <>, <, >, <=, >=
  6. IS NULL, IS NOT NULL, BETWEEN, IN, LIKE
  7. AND
  8. OR
-- AND binds tighter than OR
WHERE a = 1 OR b = 2 AND c = 3
-- is parsed as: WHERE a = 1 OR (b = 2 AND c = 3)
-- Line comment (ignored)
SELECT * FROM users /* block comment */ WHERE age > 25
SQL string → lexer → parser → AST → compiler → QueryDescriptor → operator pipeline

The SQL frontend is a recursive descent parser written in TypeScript. It produces an AST that the compiler maps to the same QueryDescriptor the DataFrame API uses. Features that can be expressed as FilterOp[] (AND conditions with eq/gt/lt/in/not_in/between/not_between/like/not_like/is_null/is_not_null) are pushed down to the inner executor. Simple OR expressions are decomposed into filterGroups for pushdown. Features that can’t be pushed down (complex OR expressions, HAVING, multi-column ORDER BY, CASE/CAST/arithmetic) are handled by SqlWrappingExecutor, which applies them on the result rows.

If you need the parsed AST or compiled descriptor directly:

import { parseSql, sqlToDescriptor } from "querymode"
// Parse SQL to AST
const ast = parseSql("SELECT * FROM users WHERE age > 25")
// Convert SQL directly to QueryDescriptor
const descriptor = sqlToDescriptor("SELECT * FROM users WHERE age > 25")