Architecture

Clients       Entry Points                 Durable Objects
────────      ────────────                 ───────────────
Browser                                    MasterDO    (single-writer)
CLI       ──► Worker (CF Worker)      ──►  QueryDO     (per-region)
DataFrame     LocalExecutor                FragmentDO  (fan-out pool)
              MaterializedExecutor         WorkerDO    (distributed tasks)

Operator Pipeline           Core Engine
─────────────────           ───────────
ScanOperator + prefetch     WASM SIMD (Zig)
FilterOperator              Decode + bitmap
TopKOperator                Coalesce ranges
WasmAggregateOperator       autoCoalesceGap

Format Decoders           Storage
───────────────           ───────
Parquet (Thrift)          R2
Lance v2 (protobuf)       Disk
Iceberg (JSON meta)       SpillBackend
CSV / JSON / Arrow

MasterDO — Owns table metadata. Handles writes via CAS-based manifest coordination. Broadcasts footer invalidations to Query DOs.

QueryDO — One per datacenter region. Caches table footers in memory (~4KB each) with VIP eviction policy. Routes queries to Fragment DOs for parallel scan.

FragmentDO — One per fragment; scales with data, no hard cap. Each scans its assigned fragments in parallel. Handles column reads, page decode, filter pushdown, and partial aggregation. Idle DOs cost nothing (they hibernate). Slot names are deterministic (frag-{region}-slot-{N}) so they get reused across queries with warm footer caches.

WorkerDO — Generic worker for distributed operations that exceed single-DO memory. Handles hash join partition processing, sort partitions, and distinct deduplication across R2-spilled data. Managed by WorkerPool, which coordinates tree-shaped fan-out (direct for ≤50 partitions, two-level for ≤2500, three-level beyond).

Query flow:

  1. Request arrives at Worker → routes to regional QueryDO
  2. QueryDO checks footer cache → if miss, fetches from R2
  3. QueryDO prunes fragments (partition catalog → min/max stats) and partitions survivors across FragmentDO pool
  4. Each FragmentDO runs the operator pipeline:
    • Page-level skip (min/max stats)
    • Coalesced R2 range reads
    • Prefetch next page while decoding current
    • WASM SIMD decode + filter
  5. QueryDO merges partial results via columnar k-way merge (or hierarchical reduction for large fan-outs)
  6. Response returned as JSON or streaming columnar format
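
The columnar k-way merge in step 5 can be sketched as follows. This is an illustrative scalar version, not QueryMode's implementation — a production merge would operate on columnar batches and use a min-heap rather than a linear scan of cursors:

```typescript
// Merge k sorted partial results (one per FragmentDO) into one sorted stream.
function kWayMerge(partials: number[][]): number[] {
  const cursors = partials.map(() => 0); // one read cursor per input
  const out: number[] = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < partials.length; i++) {
      if (cursors[i] >= partials[i].length) continue; // input exhausted
      if (best === -1 || partials[i][cursors[i]] < partials[best][cursors[best]]) best = i;
    }
    if (best === -1) break; // all inputs exhausted
    out.push(partials[best][cursors[best]++]);
  }
  return out;
}

// e.g. three partial results from three FragmentDOs:
// kWayMerge([[1, 4], [2, 3], [0, 5]]) → [0, 1, 2, 3, 4, 5]
```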

Data flows between DOs as QMCB (QueryMode Columnar Binary) — a zero-copy columnar wire format transferred via structured clone over Worker RPC. This eliminates JSON serialization of millions of rows between DOs. See Columnar Format for the wire format details.

Mental model: biological cells, not a brain


Every Fragment DO runs the same WASM binary — like cells sharing the same DNA. No cell knows the whole organism. No central brain directs them. They activate on signal, do their work, and go dormant.

| Biology | QueryMode |
| --- | --- |
| DNA (shared code) | Same WASM binary on every DO |
| Cell activates on signal | DO wakes on RPC, hibernates when idle |
| Each cell processes local inputs | Each DO scans its own fragment |
| Cells don’t coordinate with each other | Fragment DOs are isolated |
| Body collects and integrates | QueryDO reduces via k-way merge |
| More tissue → more cells | More fragments → more DOs |
| Dormant cells cost nothing | Idle DOs hibernate (zero cost) |

The code is the DNA of these cells — every Fragment DO carries the same WASM engine, responds to the same signals (scan, filter, aggregate), and produces the same output format. Scale comes from having more cells, not smarter ones.

The fan-out decision is workload-aware: small scans run locally in the QueryDO (no RPC overhead), large scans dispatch to Fragment DOs for parallel execution. No configuration — the engine decides based on estimated rows after pruning.

When fragment count exceeds 50, a single QueryDO can’t hold all partial results in memory. Instead of merging everything in one place, QueryMode adds reducer tiers:

Leaf DOs (scan) → Reducer DOs (merge groups of 25) → QueryDO (final merge)

Each tier shrinks the result count by 25×. The math:

| Fragments | Reducer tiers | QueryDO merges |
| --- | --- | --- |
| 25 | 0 (flat) | 25 results |
| 50 | 1 | 2 results |
| 625 | 2 | ~1 result |
| 15,625 | 3 | ~1 result |

This keeps QueryDO memory bounded at any data scale. The reducer DOs are regular Fragment DOs reused for merge — same pool, same hibernation behavior.
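
The arithmetic in the table above can be sketched as a one-liner — each tier divides the surviving result count by the group size of 25 (function names are illustrative):

```typescript
const GROUP = 25; // each reducer merges groups of 25

// How many partial results reach the QueryDO after `tiers` reducer tiers.
function resultsAfterTiers(fragments: number, tiers: number): number {
  let n = fragments;
  for (let i = 0; i < tiers; i++) n = Math.ceil(n / GROUP);
  return n;
}

// resultsAfterTiers(25, 0)     → 25
// resultsAfterTiers(50, 1)     → 2
// resultsAfterTiers(625, 2)    → 1
// resultsAfterTiers(15_625, 3) → 1
```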

The querymode.wasm binary is compiled from Zig source (wasm/src/):

  • Column decode — int32, int64, float64, utf8, bool, binary
  • SIMD aggregates — Vec2i64 for int64 sum/min/max, Vec4f64 for float64
  • SQL execution — register columns, execute queries, return rows
  • Vector search — flat SIMD distance computation, IVF-PQ index support
  • Fragment writing — append rows to Lance format

The WASM module is loaded as a CompiledWasm rule in the Worker.
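
A wrangler module rule of roughly this shape loads `.wasm` files as CompiledWasm modules; the glob and `fallthrough` flag here are assumptions, not QueryMode's actual config:

```toml
# wrangler.toml — illustrative CompiledWasm module rule
[[rules]]
type = "CompiledWasm"
globs = ["**/*.wasm"]
fallthrough = true
```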

Column registration uses batched calls to minimize WASM boundary crossings. Instead of writing the table name string into WASM memory once per column, registerColumns() writes it once and reuses the pointer across all columns:

Before (per-column): 4N + 3 WASM calls — writeString×2 + alloc + register for each of N columns, plus resetHeap + executeQuery + clearTable
After (batched): 3N + 4 WASM calls — writeString×1 + alloc + register per column, plus one shared table-name write + resetHeap + executeQuery + clearTable

For a 10-column query this saves 9 WASM calls per page batch (4×10+3 = 43 vs 3×10+4 = 34). The absolute overhead is small (~50μs vs ~10ms R2 I/O), but the batched API is also cleaner — callers pass all columns at once rather than looping individually.
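
The batching idea can be sketched as below. All names here (`WasmApi`, `registerColumns`, and the method signatures) are illustrative stand-ins, not QueryMode's real API — the point is that the table name is written into WASM memory once and its pointer reused for every column:

```typescript
interface WasmApi {
  writeString(s: string): number; // copy a string into WASM memory, return its pointer
  alloc(len: number): number;     // reserve space in WASM linear memory
  registerColumn(tablePtr: number, namePtr: number, dataPtr: number): void;
}

function registerColumns(
  wasm: WasmApi,
  table: string,
  cols: { name: string; data: Uint8Array }[],
): void {
  const tablePtr = wasm.writeString(table); // written once, shared across columns
  for (const col of cols) {
    const namePtr = wasm.writeString(col.name);
    const dataPtr = wasm.alloc(col.data.length);
    wasm.registerColumn(tablePtr, namePtr, dataPtr);
  }
}
```

For N columns this issues N+1 writeString calls instead of 2N, which is where the per-batch saving comes from.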

SharedArrayBuffer is not available in Cloudflare Workers (Spectre mitigations), so zero-copy data exchange between JS and WASM is not possible. All column data is copied into WASM linear memory via alloc + Uint8Array.set(). At ~0.005ms per 64KB page copy vs ~10ms R2 I/O latency, copy overhead accounts for less than 0.05% of total query time — the WASM boundary is not the bottleneck.
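
The copy path can be sketched with a bump-allocator stand-in for the WASM module's alloc export (names are illustrative):

```typescript
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64KB page
let heapTop = 0;

// Stand-in for the WASM module's exported alloc.
function alloc(len: number): number {
  const ptr = heapTop;
  heapTop += len;
  return ptr;
}

// Copy column bytes into WASM linear memory — no zero-copy path exists
// without SharedArrayBuffer, so every page is copied via Uint8Array.set().
function copyIn(data: Uint8Array): number {
  const ptr = alloc(data.length);
  new Uint8Array(memory.buffer, ptr, data.length).set(data);
  return ptr;
}

const page = new Uint8Array([1, 2, 3, 4]);
const ptr = copyIn(page);
```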

Filters are pushed as deep as possible — from the DataFrame/SQL layer down to the WASM scan:

DataFrame .filter("age", "gt", 25)
→ QueryDescriptor.filters: [{ column: "age", op: "gt", value: 25 }]
→ Partition catalog prune (skip fragments by partition key)
→ Fragment-level skip (skip fragments where age max < 25 across all pages)
→ Page-level skip (skip pages where page max < 25)
→ WASM SIMD filter (process matching pages in one pass)

All 14 ops push down: eq, neq, gt, gte, lt, lte, in, not_in, between, not_between, like, not_like, is_null, is_not_null.

OR conditions decompose into filterGroups: FilterOp[][] — each inner array is AND-connected, groups are OR-connected:

// WHERE (region = 'us' AND age > 25) OR (region = 'eu' AND age > 30)
// becomes:
filterGroups: [
  [{ column: "region", op: "eq", value: "us" }, { column: "age", op: "gt", value: 25 }],
  [{ column: "region", op: "eq", value: "eu" }, { column: "age", op: "gt", value: 30 }],
]

Each group is evaluated independently; results are unioned. This preserves pushdown within each group.

Filters follow SQL three-valued logic:

  • NULL = 5 → NULL (not true, row excluded)
  • NULL AND true → NULL (row excluded)
  • NULL OR true → true (row included)
  • NOT IN (1, 2, NULL) → NULL for all rows (per SQL standard)
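
A minimal sketch of three-valued logic, with `null` modeling SQL NULL (illustrative, not QueryMode's evaluator):

```typescript
type Tri = boolean | null; // true, false, or unknown (NULL)

function triAnd(a: Tri, b: Tri): Tri {
  if (a === false || b === false) return false; // false dominates AND
  if (a === null || b === null) return null;
  return true;
}

function triOr(a: Tri, b: Tri): Tri {
  if (a === true || b === true) return true; // true dominates OR
  if (a === null || b === null) return null;
  return false;
}

// A row survives a filter only when the predicate is exactly true.
const include = (t: Tri): boolean => t === true;

// triAnd(null, true) → null  (row excluded)
// triOr(null, true)  → true  (row included)
```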

Table footers (~4KB) are cached in QueryDO memory. The VIP eviction policy protects frequently-accessed tables from being evicted by cold one-off accesses.

The ScanOperator overlaps I/O with compute:

Time →
Fetch  page 0  ████
Decode page 0      ████
Fetch  page 1      ████  (overlapped)
Decode page 1          ████
Fetch  page 2          ████
...

Prefetches the next page while decoding the current page, keeping the pipeline saturated.
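
The overlap pattern can be sketched as an async generator — kick off the fetch for page i+1 before decoding page i, so R2 latency and decode CPU time overlap instead of adding up. Function names and signatures here are illustrative:

```typescript
async function* scanPages<T>(
  fetchPage: (i: number) => Promise<Uint8Array>,
  decode: (bytes: Uint8Array) => T,
  numPages: number,
): AsyncGenerator<T> {
  if (numPages === 0) return;
  let inFlight = fetchPage(0); // start the first fetch immediately
  for (let i = 0; i < numPages; i++) {
    const bytes = await inFlight;
    if (i + 1 < numPages) inFlight = fetchPage(i + 1); // prefetch next page
    yield decode(bytes); // decode runs while the next fetch is in flight
  }
}
```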

Operators that accumulate state (sort, join) accept a memory budget. When exceeded:

  • HashJoinOperator — Grace hash partitioning, spills partitions to R2
  • ExternalSortOperator — writes sorted runs to R2, k-way merges

Same SpillBackend interface for R2 (edge) and filesystem (local).
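
A hypothetical shape for that interface, with an in-memory stand-in; the real interface and method names may differ:

```typescript
interface SpillBackend {
  write(key: string, data: Uint8Array): Promise<void>;
  read(key: string): Promise<Uint8Array | undefined>;
  delete(key: string): Promise<void>;
}

// In-memory backend, handy for local tests; the R2 and filesystem backends
// would implement the same interface.
class MemorySpillBackend implements SpillBackend {
  private store = new Map<string, Uint8Array>();
  async write(key: string, data: Uint8Array) { this.store.set(key, data); }
  async read(key: string) { return this.store.get(key); }
  async delete(key: string) { this.store.delete(key); }
}
```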

R2 has per-bucket rate limits. At PB scale, a single bucket becomes a bottleneck. QueryMode distributes data across up to 4 R2 buckets (DATA_BUCKET + DATA_BUCKET_1/2/3) using FNV-1a hash routing on the R2 key prefix. All DOs (Master, Query, Fragment) and the Worker use the shared resolveBucket() utility — reads and writes are automatically routed to the correct shard.

# wrangler.toml — bind extra buckets
[[r2_buckets]]
binding = "DATA_BUCKET_1"
bucket_name = "data-shard-1"
[[r2_buckets]]
binding = "DATA_BUCKET_2"
bucket_name = "data-shard-2"
[[r2_buckets]]
binding = "DATA_BUCKET_3"
bucket_name = "data-shard-3"
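
The routing can be sketched as below. This is a standard 32-bit FNV-1a over the key prefix, modulo the shard count — the real resolveBucket() may hash a different portion of the key or use other constants:

```typescript
// 32-bit FNV-1a hash.
function fnv1a32(s: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by FNV prime, keep uint32
  }
  return h >>> 0;
}

const BUCKETS = ["DATA_BUCKET", "DATA_BUCKET_1", "DATA_BUCKET_2", "DATA_BUCKET_3"];

// Deterministically route an R2 key prefix to one of the bucket bindings.
function resolveBucket(keyPrefix: string): string {
  return BUCKETS[fnv1a32(keyPrefix) % BUCKETS.length];
}
```

Because the hash is deterministic, every DO and the Worker route the same key to the same shard without coordination.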

For datasets with millions of fragments, evaluating min/max stats on every fragment is too slow. The partition catalog provides O(1) fragment lookup: given a filter on the partition column, return only the fragment IDs that could contain matching rows.

Query: WHERE region = 'us'
Without catalog: scan all 1M fragment stats → 200ms
With catalog: lookup "us" → [frag-1, frag-5, frag-12] → <1ms

Catalogs are built automatically (best-column auto-detection) or explicitly via partitionBy during ingest. They support eq, in, neq, and not_in filters. Catalogs persist to DO durable storage and restore on wake.
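
A toy version of the catalog's lookup side (illustrative class and method names): a map from partition-column value to the fragment IDs that contain it, giving O(1) pruning for eq and a small union for in.

```typescript
class PartitionCatalog {
  private index = new Map<string, string[]>();

  add(value: string, fragmentId: string): void {
    const list = this.index.get(value) ?? [];
    list.push(fragmentId);
    this.index.set(value, list);
  }

  // eq filter: only fragments that could contain the value.
  lookupEq(value: string): string[] {
    return this.index.get(value) ?? [];
  }

  // in filter: union of the per-value lists, deduplicated.
  lookupIn(values: string[]): string[] {
    return [...new Set(values.flatMap((v) => this.lookupEq(v)))];
  }
}
```

neq/not_in would invert this: return every fragment except those mapped exclusively to the excluded values.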

  1. Partition catalog prune — O(1) lookup eliminates fragments that can’t match the partition filter
  2. Min/max stat prune — remaining fragments are checked against per-column page stats

This two-phase approach means a query on a PB-scale dataset with a selective partition filter touches only the relevant fragments — no full scan of metadata.

await qm.table("events").append(rows, {
  partitionBy: "region", // each region value gets its own fragment
})

Rows are grouped by partition column value and written to separate fragments. This creates clean partition boundaries that the catalog can exploit for O(1) pruning.
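
The grouping step can be sketched as a simple bucket-by-value pass (function name is illustrative, not the ingest code):

```typescript
// Bucket rows by partition-column value: one fragment per distinct value.
function groupByPartition(
  rows: Record<string, unknown>[],
  column: string,
): Map<unknown, Record<string, unknown>[]> {
  const fragments = new Map<unknown, Record<string, unknown>[]>();
  for (const row of rows) {
    const key = row[column];
    const frag = fragments.get(key) ?? [];
    frag.push(row);
    fragments.set(key, frag);
  }
  return fragments;
}

// groupByPartition([{region:"us"}, {region:"eu"}, {region:"us"}], "region")
// → Map { "us" → 2 rows, "eu" → 1 row }
```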

LocalExecutor reads files from disk or HTTP with the same operator pipeline. No Durable Objects, no R2 — just direct file I/O.

import { QueryMode } from "querymode/local"
const qm = QueryMode.local()

MaterializedExecutor holds data in-memory for fromJSON() and fromCSV().

pnpm install # install dependencies
pnpm build # build WASM + TypeScript
pnpm test # workerd tests + Node tests
pnpm dev # local dev with wrangler
# WASM only (requires zig)
pnpm wasm

Tests run in two runtimes:

| Runtime | What | Count |
| --- | --- | --- |
| workerd (real CF Workers) | Operators, DOs, decode, format parsing | |
| Node | DuckDB conformance (1M-5M rows), fixture files, SQL, infra | |
| Total | All runtimes | 755+ tests |

Conformance tests validate every operator against DuckDB at scale. CI benchmarks compare QueryMode vs DuckDB on every push.