Performance

Every query flows through these stages. Each is an optimization opportunity:

1. Partition pruning → skip entire files by partition key (O(1) catalog lookup)
2. Fragment-level skip → skip files by column min/max across all pages (canSkipFragment)
3. Page-level skip → skip pages within a file by per-page stats (canSkipPage)
4. WASM SIMD scan → decode + filter in one pass (no Row[] intermediate)
5. Columnar merge → k-way merge on typed arrays (no Row[] until exit)
6. Row materialization → only at final response boundary

The most expensive step is always I/O (R2 reads); every other stage exists to reduce how many of those reads the query has to make.

If your data is partitioned (e.g., by region or date), the partition catalog provides O(1) fragment lookup:

// Hive-style partition — each region is a separate fragment
await qm.table("events").append(rows, { partitionBy: "region" })
// This query only scans the "us" fragment, skips all others
await qm.table("events").filter("region", "eq", "us").collect()

Partition pruning is the single biggest performance lever. On a 1TB dataset with 100 partitions, an eq filter on the partition key reads ~10GB instead of 1TB.

Supported filter ops for partition pruning: eq, in, neq, not_in. Range filters (gt, between) fall back to min/max pruning.
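
For example, an in filter on the partition key still resolves through the catalog, while a range filter falls back to min/max stats. A minimal sketch; the value shapes for in and between (an array and a two-element range) are assumptions, since this page only shows eq and gt:

// `in` on the partition key prunes via the catalog: only the "us" and "eu" fragments are scanned
await qm.table("events").filter("region", "in", ["us", "eu"]).collect()
// Range filter on the partition key: no catalog lookup, but fragment/page min/max pruning still applies
await qm.table("events").filter("region", "between", ["a", "m"]).collect()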

Operators that accumulate state accept a memory budget (bytes). When exceeded, they spill to R2.

| Operator | Default budget | What it accumulates |
| --- | --- | --- |
| ExternalSortOperator | 256 MB | All rows until sorted |
| HashJoinOperator | 256 MB | Build-side hash table |
| AggregateOperator | unbounded | Group states (usually small) |
| DistinctOperator | unbounded | Seen-values hash set |

Sizing guidance:

  • 256 MB (default for local mode) works for most queries — covers ~20M rows of numeric data or ~5M string rows
  • 32 MB (default for edge mode) — QueryDO and FragmentDO use EDGE_MEMORY_BUDGET to leave headroom for page buffers and WASM memory within the DO’s 256 MB limit
  • Local mode (Node/Bun) has no practical limit — set budget to available RAM
// Explicit budget — 64 MB for sort, R2 spill when exceeded
const spill = new R2SpillBackend(env.DATA_BUCKET, "__spill/q-123")
const sorted = new ExternalSortOperator(source, "amount", true, 0, 64 * 1024 * 1024, spill)

Sort: TopK vs ExternalSort vs InMemorySort

| Scenario | Best operator | Why |
| --- | --- | --- |
| Top 10 results | TopKOperator | Heap-based, O(n log k) — never sorts the full dataset |
| Full sort, < 1M rows | InMemorySortOperator | No spill overhead |
| Full sort, > 1M rows | ExternalSortOperator | Spills to R2 when budget exceeded |

The DataFrame API picks automatically: .sort().limit(k) uses TopK when k is small relative to total rows.
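
A minimal sketch of that shortcut; the .sort() signature (column name plus direction) is an assumption, since this page does not show it:

// limit(10) is small relative to the table, so the planner can use TopKOperator
// instead of fully sorting the dataset
const top10 = await qm.table("events")
  .sort("amount", "desc")
  .limit(10)
  .collect()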

| Scenario | Strategy |
| --- | --- |
| Lookup join (small right side) | HashJoinOperator — build side fits in memory |
| Large-large join | HashJoinOperator with R2 spill — Grace hash partitioning |
| Semi-join filter | SubqueryInOperator — collects distinct values, pushes as IN filter |

| Scenario | What happens |
| --- | --- |
| No GROUP BY | Single accumulator per aggregate — O(1) memory |
| GROUP BY with few groups (< 10K) | Hash map of accumulators — fast |
| GROUP BY with many groups (> 100K) | Memory grows with cardinality — consider pre-filtering (see the sketch below) |
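
When the grouping key has high cardinality, filtering first keeps the accumulator map small. A hedged sketch: the .groupBy()/.agg() calls are hypothetical illustrations, since this page only documents .filter(), .sort(), .where(), .collect(), and .explain():

// Pre-filter so fewer rows (and fewer distinct user_id values) reach the aggregate
await qm.table("events")
  .filter("ts", "gte", "2024-01-01") // pushable filter: applied at scan time
  .groupBy("user_id")                // hypothetical API, for illustration only
  .agg("amount", "sum")              // hypothetical API, for illustration only
  .collect()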

Before reading any page data, canSkipFragment aggregates min/max/nullCount across all pages in a fragment and checks if the entire fragment can be eliminated. This reuses the same canSkipPage logic but on fragment-wide stats — one check to skip potentially thousands of pages.

Fragment columns: [{min: 100, max: 500}, {min: 600, max: 900}] → aggregated: min=100, max=900
Filter: amount > 1000
→ Skip entire fragment (no R2 reads at all)
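
A conceptual sketch of that check (not the library's actual code): per-page stats are folded into fragment-wide stats, then the same comparison used for page skip is applied once. Names and shapes here are illustrative only:

// Illustrative only: the fragment-wide max is the max of all page maxes
interface PageStats { min: number; max: number; nullCount: number }

function canSkipFragmentForGt(pages: PageStats[], threshold: number): boolean {
  const fragmentMax = Math.max(...pages.map((p) => p.max))
  // For an `amount > threshold` filter, skip when no page can exceed the threshold
  return fragmentMax <= threshold
}

// Pages [{max: 500}, {max: 900}] with filter amount > 1000: fragmentMax = 900, so skip
canSkipFragmentForGt(
  [{ min: 100, max: 500, nullCount: 0 }, { min: 600, max: 900, nullCount: 0 }],
  1000,
)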

Fragment-level skip is automatic and costs nothing — it runs before any R2 I/O. For datasets with many small fragments (e.g., append-heavy workloads), this is often more effective than page-level skip because it eliminates entire R2 reads rather than individual pages within a read.

In explain output, fragmentsSkipped counts fragments eliminated by both partition pruning and fragment-level skip combined.

Each Lance page stores min/max stats per column. The scan layer checks these before reading page data:

Page stats: min=100, max=500
Filter: amount > 1000
→ Skip entire page (no R2 read)

This is automatic — no configuration needed. Supported filter ops:

| Op | Skip condition |
| --- | --- |
| eq | Value outside [min, max] range |
| neq | Uniform page (min = max) equals filter value |
| gt, gte, lt, lte | Entire range on wrong side of threshold |
| in | All IN values outside [min, max] range |
| not_in | Uniform page (min = max) appears in NOT IN list |
| between | Page range and filter range don’t overlap |
| not_between | Entire page range inside the excluded range |
| is_null | Page has zero null values (nullCount = 0) |
| is_not_null | Page is entirely null (nullCount = rowCount) |
| like | Fixed prefix doesn’t overlap page’s string [min, max] range |
| not_like | Uniform page (min = max) matches the LIKE pattern |

String columns have min/max stats (lexicographic). like patterns with a fixed prefix (e.g., 'abc%') can skip pages where the string range doesn’t overlap the prefix. Patterns starting with a wildcard (e.g., '%xyz') cannot be pruned. not_like can skip uniform pages where the single value matches the pattern (all rows would be excluded).
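
For example (the sku column is hypothetical, and passing the pattern as the filter value is an assumption):

// 'abc%' has a fixed prefix: pages whose string [min, max] range cannot contain it are skipped
await qm.table("events").filter("sku", "like", "abc%").collect()
// '%xyz' starts with a wildcard: nothing can be pruned, every surviving page is scanned
await qm.table("events").filter("sku", "like", "%xyz").collect()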

OR filters (filterGroups) use different pruning logic than AND filters:

  • AND filters: a page is skipped if any single AND filter eliminates it
  • OR groups: a page is skipped only if every OR group eliminates it
// AND: the page is skipped if either filter alone eliminates it
.filter("amount", "gt", 500).filter("region", "eq", "us")
// OR: both branches must independently eliminate the page before it is skipped
.where("amount > 1000 OR region = 'us'")

Page skip decisions are made uniformly across all columns — the same pages are skipped for every column to avoid row misalignment between columns.

Filters flow through multiple layers, each eliminating data earlier (and cheaper) than the next:

DataFrame API / SQL WHERE
↓ compile to FilterOp[]
Partition catalog → skip entire fragments by partition key (O(1))
canSkipFragment → skip fragments by column min/max across all pages
canSkipPageMultiCol → skip pages by per-page min/max/nullCount stats
WASM SIMD scan → decode + filter in one pass (no Row[] intermediate)
JS fallback (if needed) → matchesFilter on remaining rows

All 14 filter ops are fully pushable. The DataFrame API’s .filter() and the SQL compiler’s WHERE clause both produce FilterOp[] — the same array that drives partition pruning, fragment skip, page skip, and WASM scan. No filter op (eq, neq, gt, gte, lt, lte, in, not_in, between, not_between, like, not_like, is_null, is_not_null) requires post-scan evaluation.

SQL expressions that are not FilterOp — such as CASE, CAST, and arithmetic in WHERE (e.g., WHERE price * qty > 100) — cannot be expressed as a FilterOp and go through SqlWrappingExecutor after the scan. These still benefit from any pushable filters in the same query: WHERE region = 'us' AND price * qty > 100 pushes region = 'us' to scan and evaluates the arithmetic on the filtered result.
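
A sketch of that split (the orders table and its columns are hypothetical; chaining .where() directly off the table is assumed to work the same way as .filter()):

// region = 'us' compiles to a FilterOp and is pushed down to the scan;
// price * qty > 100 is evaluated by SqlWrappingExecutor on the already-filtered rows
await qm.table("orders").where("region = 'us' AND price * qty > 100").collect()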

OR filters (filterGroups) are fully pushable too. Each OR branch is an independent FilterOp[] group. Page skip requires ALL groups to independently eliminate a page before skipping it.

The scan layer pushes filter+decode to WASM SIMD when possible:

| Column type | WASM SIMD | Notes |
| --- | --- | --- |
| int32, int64, float32, float64 | Yes | 4-wide or 2-wide SIMD |
| bool | Yes | Packed bit operations |
| utf8 (LIKE) | Yes | filterStringLike export |
| utf8 (eq/in) | JS fallback | String comparison not SIMD-friendly |
| fixed_size_list (vectors) | Yes | Cosine/L2/dot in WASM |

WASM execution is ~3x faster than JS for numeric filters at scale. The JS fallback is correct but slower.

The columnar pipeline avoids Row[] materialization through the entire merge layer:

FragmentDO → QMCB ArrayBuffer (structured clone over RPC)
QueryDO → columnarKWayMerge or concatColumnarBatches (typed arrays)
Worker exit → columnarBatchToRows (once, at response boundary)

This eliminates the biggest cost in distributed queries: serializing millions of rows to JSON between DOs. QMCB transfers are ~10x faster than JSON for numeric data.

See Columnar Format for the wire format details.

Use .explain() to see what a query will do without executing it:

const plan = await qm.table("events")
.filter("region", "eq", "us")
.filter("amount", "gt", 100)
.explain()

Key fields in ExplainResult:

| Field | What it tells you |
| --- | --- |
| totalRows | Total rows in the table |
| estimatedRows | Rows remaining after pruning |
| fragments / fragmentsSkipped | Fragments eliminated by partition pruning + fragment-level skip |
| pagesTotal / pagesSkipped | How many pages min/max pruning eliminated |
| estimatedBytes / estimatedR2Reads | Actual I/O cost (bytes and coalesced R2 reads) |
| filters[].pushable | Whether each filter is pushed to the scan layer |
| fanOut | Whether the query will dispatch to Fragment DOs |
| partitionCatalog | Partition column and cardinality (if catalog exists) |

Reading the numbers: If pagesSkipped is close to pagesTotal, your filter is highly selective and the query will be fast. If fragmentsSkipped is close to fragments, partition pruning is working. If estimatedR2Reads is small (< 10), the query touches very little data.

Formatted output: Use formatExplain(plan) for a human-readable summary suitable for logging.
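
Putting both together with the plan from the example above (field names come from the ExplainResult table; the import for formatExplain is omitted because its module path isn’t shown on this page):

// Quick programmatic checks on the plan
const selectivity = plan.pagesSkipped / plan.pagesTotal
if (selectivity > 0.9) console.log("highly selective: most pages pruned before any R2 read")
if (plan.estimatedR2Reads < 10) console.log("query touches very little data")
// Human-readable summary for logs
console.log(formatExplain(plan))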

CI runs head-to-head benchmarks against DuckDB on every push. Typical results at 1M-5M rows:

| Operation | QueryMode (workerd) | DuckDB (native) |
| --- | --- | --- |
| Filter (numeric) | ~200ms | ~100ms |
| GROUP BY + sum | ~300ms | ~150ms |
| Hash join (1M × 500K) | ~1300ms | ~400ms |
| External sort (5M) | ~25s | ~8s |
| Window (row_number, 5M) | ~800ms | ~200ms |

QueryMode runs in WASM on a single core. DuckDB runs native with SIMD. The ~2-3x gap is the WASM overhead — acceptable for edge deployment where the alternative is no local compute at all.

For spill operations (sorts and joins that exceed the memory budget), QueryMode spills to R2 (network I/O) where DuckDB spills to local disk, and the gap widens to ~3-5x. This is the price of serverless, offset by the fact that you’re running at the edge rather than in a central data center.