Performance

Every query flows through these stages. Each is an optimization opportunity:

1. Partition pruning → skip entire files by partition key (O(1) catalog lookup)
2. Fragment-level skip → skip files by column min/max across all pages (canSkipFragment)
3. Page-level skip → skip pages within a file by per-page stats (canSkipPage)
4. WASM SIMD scan → decode + filter in one pass (no Row[] intermediate)
5. Columnar merge → k-way merge on typed arrays (no Row[] until exit)
6. Row materialization → only at final response boundary

The most expensive step is always I/O (R2 reads); every other stage exists to reduce how many of those reads the query has to make.

If your data is partitioned (e.g., by region or date), the partition catalog provides O(1) fragment lookup:

// Hive-style partition — each region is a separate fragment
await qm.table("events").append(rows, { partitionBy: "region" })
// This query only scans the "us" fragment, skips all others
await qm.table("events").filter("region", "eq", "us").collect()

Partition pruning is the single biggest performance lever. On a 1TB dataset with 100 partitions, an eq filter on the partition key reads ~10GB instead of 1TB.

Supported filter ops for partition pruning: eq, in, neq, not_in. Range filters (gt, between) fall back to min/max pruning.
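
For example, an in filter on the partition key still resolves through the catalog, while a range filter falls back to min/max stats. A minimal sketch; the value shapes for in and between (an array and a two-element range) are assumptions, since this page only shows eq and gt:

// `in` on the partition key prunes via the catalog: only the "us" and "eu" fragments are scanned
await qm.table("events").filter("region", "in", ["us", "eu"]).collect()
// Range filter on the partition key: no catalog lookup, but fragment/page min/max pruning still applies
await qm.table("events").filter("region", "between", ["a", "m"]).collect()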

Operators that accumulate state accept a memory budget (bytes). When exceeded, they spill to R2.

| Operator | Default budget | What it accumulates |
| --- | --- | --- |
| ExternalSortOperator | 256 MB | All rows until sorted |
| HashJoinOperator | 256 MB | Build-side hash table |
| AggregateOperator | unbounded | Group states (usually small) |
| DistinctOperator | unbounded | Seen-values hash set |

Sizing guidance:

  • 256 MB (default for local mode) works for most queries — covers ~20M rows of numeric data or ~5M string rows
  • 32 MB (default for edge mode) — QueryDO and FragmentDO use EDGE_MEMORY_BUDGET to leave headroom for page buffers and WASM memory within the DO’s 256 MB limit
  • Local mode (Node/Bun) has no practical limit — set budget to available RAM
// Explicit budget — 64 MB for sort, R2 spill when exceeded
const spill = new R2SpillBackend(env.DATA_BUCKET, "__spill/q-123")
const sorted = new ExternalSortOperator(source, "amount", true, 0, 64 * 1024 * 1024, spill)

Sort: TopK vs ExternalSort vs InMemorySort

| Scenario | Best operator | Why |
| --- | --- | --- |
| Top 10 results | TopKOperator | Heap-based, O(n log k) — never sorts the full dataset |
| Full sort, < 1M rows | InMemorySortOperator | No spill overhead |
| Full sort, > 1M rows | ExternalSortOperator | Spills to R2 when budget exceeded |

The DataFrame API picks automatically: .sort().limit(k) uses TopK when k is small relative to total rows.
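
A minimal sketch of that shortcut; the .sort() signature (column name plus direction) is an assumption, since this page does not show it:

// limit(10) is small relative to the table, so the planner can use TopKOperator
// instead of fully sorting the dataset
const top10 = await qm.table("events")
  .sort("amount", "desc")
  .limit(10)
  .collect()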

| Scenario | Strategy |
| --- | --- |
| Lookup join (small right side) | HashJoinOperator — build side fits in memory |
| Large-large join | HashJoinOperator with R2 spill — Grace hash partitioning |
| Semi-join filter | SubqueryInOperator — collects distinct values, pushes as IN filter |

| Scenario | What happens |
| --- | --- |
| No GROUP BY | Single accumulator per aggregate — O(1) memory |
| GROUP BY with few groups (< 10K) | Hash map of accumulators — fast |
| GROUP BY with many groups (> 100K) | Memory grows with cardinality — consider pre-filtering (see the sketch below) |
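
When the grouping key has high cardinality, filtering first keeps the accumulator map small. A hedged sketch: the .groupBy()/.agg() calls are hypothetical illustrations, since this page only documents .filter(), .sort(), .where(), .collect(), and .explain():

// Pre-filter so fewer rows (and fewer distinct user_id values) reach the aggregate
await qm.table("events")
  .filter("ts", "gte", "2024-01-01") // pushable filter: applied at scan time
  .groupBy("user_id")                // hypothetical API, for illustration only
  .agg("amount", "sum")              // hypothetical API, for illustration only
  .collect()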

Before reading any page data, canSkipFragment aggregates min/max/nullCount across all pages in a fragment and checks if the entire fragment can be eliminated. This reuses the same canSkipPage logic but on fragment-wide stats — one check to skip potentially thousands of pages.

Fragment columns: [{min: 100, max: 500}, {min: 600, max: 900}] → aggregated: min=100, max=900
Filter: amount > 1000
→ Skip entire fragment (no R2 reads at all)
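
A conceptual sketch of that check (not the library's actual code): per-page stats are folded into fragment-wide stats, then the same comparison used for page skip is applied once. Names and shapes here are illustrative only:

// Illustrative only: the fragment-wide max is the max of all page maxes
interface PageStats { min: number; max: number; nullCount: number }

function canSkipFragmentForGt(pages: PageStats[], threshold: number): boolean {
  const fragmentMax = Math.max(...pages.map((p) => p.max))
  // For an `amount > threshold` filter, skip when no page can exceed the threshold
  return fragmentMax <= threshold
}

// Pages [{max: 500}, {max: 900}] with filter amount > 1000: fragmentMax = 900, so skip
canSkipFragmentForGt(
  [{ min: 100, max: 500, nullCount: 0 }, { min: 600, max: 900, nullCount: 0 }],
  1000,
)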

Fragment-level skip is automatic and costs nothing — it runs before any R2 I/O. For datasets with many small fragments (e.g., append-heavy workloads), this is often more effective than page-level skip because it eliminates entire R2 reads rather than individual pages within a read.

In explain output, fragmentsSkipped counts fragments eliminated by both partition pruning and fragment-level skip combined.

Each Lance page stores min/max stats per column. The scan layer checks these before reading page data:

Page stats: min=100, max=500
Filter: amount > 1000
→ Skip entire page (no R2 read)

This is automatic — no configuration needed. Supported filter ops:

| Op | Skip condition |
| --- | --- |
| eq | Value outside [min, max] range |
| neq | Uniform page (min = max) equals filter value |
| gt, gte, lt, lte | Entire range on wrong side of threshold |
| in | All IN values outside [min, max] range |
| not_in | Uniform page (min = max) appears in NOT IN list |
| between | Page range and filter range don’t overlap |
| not_between | Entire page range inside the excluded range |
| is_null | Page has zero null values (nullCount = 0) |
| is_not_null | Page is entirely null (nullCount = rowCount) |
| like | Fixed prefix doesn’t overlap page’s string [min, max] range |
| not_like | Uniform page (min = max) matches the LIKE pattern |

String columns have min/max stats (lexicographic). like patterns with a fixed prefix (e.g., 'abc%') can skip pages where the string range doesn’t overlap the prefix. Patterns starting with a wildcard (e.g., '%xyz') cannot be pruned. not_like can skip uniform pages where the single value matches the pattern (all rows would be excluded).
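
For example (the sku column is hypothetical, and passing the pattern as the filter value is an assumption):

// 'abc%' has a fixed prefix: pages whose string [min, max] range cannot contain it are skipped
await qm.table("events").filter("sku", "like", "abc%").collect()
// '%xyz' starts with a wildcard: nothing can be pruned, every surviving page is scanned
await qm.table("events").filter("sku", "like", "%xyz").collect()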

OR filters (filterGroups) use different pruning logic than AND filters:

  • AND filters: a page is skipped if any single AND filter eliminates it
  • OR groups: a page is skipped only if every OR group eliminates it
// AND: the page is skipped if either filter alone eliminates it
.filter("amount", "gt", 500).filter("region", "eq", "us")
// OR: both branches must independently eliminate the page before it is skipped
.where("amount > 1000 OR region = 'us'")

Page skip decisions are made uniformly across all columns — the same pages are skipped for every column to avoid row misalignment between columns.

Filters flow through multiple layers, each eliminating data earlier (and cheaper) than the next:

DataFrame API / SQL WHERE
↓ compile to FilterOp[]
Partition catalog → skip entire fragments by partition key (O(1))
canSkipFragment → skip fragments by column min/max across all pages
canSkipPageMultiCol → skip pages by per-page min/max/nullCount stats
WASM SIMD scan → decode + filter in one pass (no Row[] intermediate)
JS fallback (if needed) → matchesFilter on remaining rows

All 14 filter ops are fully pushable. The DataFrame API’s .filter() and the SQL compiler’s WHERE clause both produce FilterOp[] — the same array that drives partition pruning, fragment skip, page skip, and WASM scan. No filter op (eq, neq, gt, gte, lt, lte, in, not_in, between, not_between, like, not_like, is_null, is_not_null) requires post-scan evaluation.

SQL expressions that are not FilterOp — such as CASE, CAST, and arithmetic in WHERE (e.g., WHERE price * qty > 100) — cannot be expressed as a FilterOp and go through SqlWrappingExecutor after the scan. These still benefit from any pushable filters in the same query: WHERE region = 'us' AND price * qty > 100 pushes region = 'us' to scan and evaluates the arithmetic on the filtered result.
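
A sketch of that split (the orders table and its columns are hypothetical; chaining .where() directly off the table is assumed to work the same way as .filter()):

// region = 'us' compiles to a FilterOp and is pushed down to the scan;
// price * qty > 100 is evaluated by SqlWrappingExecutor on the already-filtered rows
await qm.table("orders").where("region = 'us' AND price * qty > 100").collect()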

OR filters (filterGroups) are fully pushable too. Each OR branch is an independent FilterOp[] group. Page skip requires ALL groups to independently eliminate a page before skipping it.

The scan layer pushes filter+decode to WASM SIMD when possible:

| Column type | WASM SIMD | Notes |
| --- | --- | --- |
| int32, int64, float32, float64 | Yes | 4-wide or 2-wide SIMD |
| bool | Yes | Packed bit operations |
| utf8 (LIKE) | Yes | filterStringLike export |
| utf8 (eq/in) | JS fallback | String comparison not SIMD-friendly |
| fixed_size_list (vectors) | Yes | Cosine/L2/dot in WASM |

WASM execution is ~3x faster than JS for numeric filters at scale. The JS fallback is correct but slower.

The columnar pipeline avoids Row[] materialization through the entire merge layer:

FragmentDO → QMCB ArrayBuffer (structured clone over RPC)
QueryDO → columnarKWayMerge or concatColumnarBatches (typed arrays)
Worker exit → columnarBatchToRows (once, at response boundary)

This eliminates the biggest cost in distributed queries: serializing millions of rows to JSON between DOs. QMCB transfers are ~10x faster than JSON for numeric data.

See Columnar Format for the wire format details.

Use .explain() to see what a query will do without executing it:

const plan = await qm.table("events")
.filter("region", "eq", "us")
.filter("amount", "gt", 100)
.explain()

Key fields in ExplainResult:

| Field | What it tells you |
| --- | --- |
| totalRows | Total rows in the table |
| estimatedRows | Rows remaining after pruning |
| fragments / fragmentsSkipped | Fragments eliminated by partition pruning + fragment-level skip |
| pagesTotal / pagesSkipped | How many pages min/max pruning eliminated |
| estimatedBytes / estimatedR2Reads | Actual I/O cost (bytes and coalesced R2 reads) |
| filters[].pushable | Whether each filter is pushed to the scan layer |
| fanOut | Whether the query will dispatch to Fragment DOs |
| partitionCatalog | Partition column and cardinality (if catalog exists) |

Reading the numbers: If pagesSkipped is close to pagesTotal, your filter is highly selective and the query will be fast. If fragmentsSkipped is close to fragments, partition pruning is working. If estimatedR2Reads is small (< 10), the query touches very little data.

Formatted output: Use formatExplain(plan) for a human-readable summary suitable for logging.
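
Putting both together with the plan from the example above (field names come from the ExplainResult table; the import for formatExplain is omitted because its module path isn’t shown on this page):

// Quick programmatic checks on the plan
const selectivity = plan.pagesSkipped / plan.pagesTotal
if (selectivity > 0.9) console.log("highly selective: most pages pruned before any R2 read")
if (plan.estimatedR2Reads < 10) console.log("query touches very little data")
// Human-readable summary for logs
console.log(formatExplain(plan))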

CI runs head-to-head benchmarks against DuckDB on every push. Typical results at 1M-5M rows:

| Operation | QueryMode (workerd) | DuckDB (native) |
| --- | --- | --- |
| Filter (numeric) | ~200ms | ~100ms |
| GROUP BY + sum | ~300ms | ~150ms |
| Hash join (1M × 500K) | ~1300ms | ~400ms |
| External sort (5M) | ~25s | ~8s |
| Window (row_number, 5M) | ~800ms | ~200ms |

QueryMode runs in WASM on a single core. DuckDB runs native with SIMD. The ~2-3x gap is the WASM overhead — acceptable for edge deployment where the alternative is no local compute at all.

For spill operations (sorts and joins that exceed the memory budget), QueryMode spills to R2 (network I/O) where DuckDB spills to local disk, and the gap widens to ~3-5x. This is the price of serverless, offset by the fact that you’re running at the edge rather than in a central data center.