Architecture
System overview
```
Clients          Entry Points                Durable Objects
────────         ────────────                ────────────────
Browser                                      MasterDO (single-writer)
CLI        ──►   Worker (CF Worker)    ──►   │
DataFrame        LocalExecutor               QueryDO (per-region)
                 MaterializedExecutor        │
                                             FragmentDO (fan-out pool)
                                             │
                                             WorkerDO (distributed tasks)

Operator Pipeline            Core Engine
─────────────────            ───────────
ScanOperator + prefetch      WASM SIMD (Zig)
FilterOperator               Decode + bitmap
TopKOperator                 Coalesce ranges
WasmAggregateOperator        autoCoalesceGap

Format Decoders          Storage
───────────────          ───────
Parquet (Thrift)         R2
Lance v2 (protobuf)      Disk
Iceberg (JSON meta)      SpillBackend
CSV / JSON / Arrow
```
Durable Objects
MasterDO (single-writer)
Owns table metadata. Handles writes via CAS-based manifest coordination. Broadcasts footer invalidations to Query DOs.
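A minimal sketch of the coordination idea, assuming a hypothetical manifest with a version counter kept in Durable Object storage (the field names and the version counter are assumptions, not the actual manifest format):

```ts
// Hypothetical sketch only — not the real manifest layout.
interface Manifest {
  version: number      // bumped on every successful commit
  fragments: string[]  // fragment IDs that make up the table
}

// Apply a write only if the caller saw the latest manifest version (compare-and-swap).
async function commitManifest(
  storage: DurableObjectStorage,
  expectedVersion: number,
  update: (m: Manifest) => Manifest,
): Promise<boolean> {
  const current = (await storage.get<Manifest>("manifest")) ?? { version: 0, fragments: [] }
  if (current.version !== expectedVersion) return false // stale writer: caller re-reads and retries
  const next = { ...update(current), version: current.version + 1 }
  await storage.put("manifest", next)
  return true
}
```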
QueryDO (per-region)
One per datacenter region. Caches table footers in memory (~4KB each) with VIP eviction policy. Routes queries to Fragment DOs for parallel scan.
FragmentDO (fan-out pool)
One per fragment — scales with data, no hard cap. Each scans its assigned fragments in parallel. Handles column reads, page decode, filter pushdown, and partial aggregation. Idle DOs cost nothing (they hibernate). Slot names are deterministic (frag-{region}-slot-{N}) so they get reused across queries with warm footer caches.
WorkerDO (distributed tasks)
Generic worker for distributed operations that exceed single-DO memory. Handles hash join partition processing, sort partitions, and distinct deduplication across R2-spilled data. Managed by WorkerPool which coordinates tree-shaped fan-out (direct for ≤50 partitions, two-level for ≤2500, three-level beyond).
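A minimal sketch of how that depth choice might look, using the thresholds quoted above (the function itself is illustrative; 2500 = 50²):

```ts
// Illustrative only: pick fan-out depth from the partition count.
function fanOutLevels(partitions: number): 1 | 2 | 3 {
  if (partitions <= 50) return 1    // direct: one WorkerDO per partition
  if (partitions <= 2500) return 2  // two-level: ≤50 coordinators × ≤50 workers
  return 3                          // three-level beyond 2500 partitions
}
```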
Query flow
Section titled “Query flow”- Request arrives at Worker → routes to regional QueryDO
- QueryDO checks footer cache → if miss, fetches from R2
- QueryDO prunes fragments (partition catalog → min/max stats) and partitions survivors across FragmentDO pool
- Each FragmentDO runs the operator pipeline:
- Page-level skip (min/max stats)
- Coalesced R2 range reads
- Prefetch next page while decoding current
- WASM SIMD decode + filter
- QueryDO merges partial results via columnar k-way merge (or hierarchical reduction for large fan-outs)
- Response returned as JSON or streaming columnar format
Data flows between DOs as QMCB (QueryMode Columnar Binary) — a zero-copy columnar wire format transferred via structured clone over Worker RPC. This eliminates JSON serialization of millions of rows between DOs. See Columnar Format for the wire format details.
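The wire layout is specified on the Columnar Format page; as a rough mental model (field names below are illustrative, not the actual QMCB schema), a batch moves between DOs as typed arrays that structured clone can transfer without re-encoding:

```ts
// Illustrative shape only — see the Columnar Format page for the real QMCB layout.
interface ColumnarBatch {
  columns: string[]                                      // column names, in schema order
  types: ("int64" | "float64" | "utf8")[]                // logical type per column
  values: (BigInt64Array | Float64Array | Uint8Array)[]  // one typed buffer per column
  nulls: (Uint8Array | null)[]                           // optional validity bitmaps
  rowCount: number
}
```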
Mental model: biological cells, not a brain
Every Fragment DO runs the same WASM binary — like cells sharing the same DNA. No cell knows the whole organism. No central brain directs them. They activate on signal, do their work, and go dormant.
| Biology | QueryMode |
|---|---|
| DNA (shared code) | Same WASM binary on every DO |
| Cell activates on signal | DO wakes on RPC, hibernates when idle |
| Each cell processes local inputs | Each DO scans its own fragment |
| Cells don’t coordinate with each other | Fragment DOs are isolated |
| Body collects and integrates | QueryDO reduces via k-way merge |
| More tissue → more cells | More fragments → more DOs |
| Dormant cells cost nothing | Idle DOs hibernate (zero cost) |
The code is the DNA of these cells — every Fragment DO carries the same WASM engine, responds to the same signals (scan, filter, aggregate), and produces the same output format. Scale comes from having more cells, not smarter ones.
The fan-out decision is workload-aware: small scans run locally in the QueryDO (no RPC overhead), large scans dispatch to Fragment DOs for parallel execution. No configuration — the engine decides based on estimated rows after pruning.
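A sketch of that decision under an assumed row threshold (the real cutoff is not documented here and is purely illustrative):

```ts
// Illustrative sketch: run small scans in-process, fan out large ones.
// LOCAL_ROW_LIMIT is a made-up number, not the engine's actual threshold.
const LOCAL_ROW_LIMIT = 100_000

function planExecution(estimatedRowsAfterPruning: number): "local" | "fan-out" {
  return estimatedRowsAfterPruning <= LOCAL_ROW_LIMIT ? "local" : "fan-out"
}
```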
Hierarchical reduction
When fragment count exceeds 50, a single QueryDO can’t hold all partial results in memory. Instead of merging everything in one place, QueryMode adds reducer tiers:
```
Leaf DOs (scan) → Reducer DOs (merge groups of 25) → QueryDO (final merge)
```
Each tier shrinks the result count by 25×. The math:
| Fragments | Reducer tiers | QueryDO merges |
|---|---|---|
| 25 | 0 (flat) | 25 results |
| 50 | 1 | 2 results |
| 625 | 2 | ~1 result |
| 15,625 | 3 | ~1 result |
This keeps QueryDO memory bounded at any data scale. The reducer DOs are regular Fragment DOs reused for merge — same pool, same hibernation behavior.
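A minimal sketch of the reduction shape, where mergeGroup stands in for an RPC to a reducer DO (the real planner differs in details, but the grouping-by-25 structure is the same):

```ts
// Sketch: merge partial results in groups of `fanIn`, adding a tier whenever
// one DO would otherwise have to merge too many results at once.
async function hierarchicalMerge<T>(
  partials: T[],
  mergeGroup: (group: T[]) => Promise<T>, // in practice, an RPC to a reducer DO
  fanIn = 25,
): Promise<T> {
  let level = partials
  while (level.length > fanIn) {
    const next: Promise<T>[] = []
    for (let i = 0; i < level.length; i += fanIn) {
      next.push(mergeGroup(level.slice(i, i + fanIn))) // one reducer per group of 25
    }
    level = await Promise.all(next)
  }
  return mergeGroup(level) // final merge at the QueryDO
}
```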
WASM engine (Zig)
The querymode.wasm binary is compiled from Zig source (wasm/src/):
- Column decode — int32, int64, float64, utf8, bool, binary
- SIMD aggregates — Vec2i64 for int64 sum/min/max, Vec4f64 for float64
- SQL execution — register columns, execute queries, return rows
- Vector search — flat SIMD distance computation, IVF-PQ index support
- Fragment writing — append rows to Lance format
The WASM module is loaded as a CompiledWasm rule in the Worker.
Batched WASM registration
Column registration uses batched calls to minimize WASM boundary crossings. Instead of writing the table name string into WASM memory once per column, registerColumns() writes it once and reuses the pointer across all columns:
```
Before (per-column): 4N + 3 WASM calls
  (writeString×2 + alloc + register per column, + resetHeap + executeQuery + clearTable)

After (batched):     3N + 4 WASM calls
  (writeString×1 + alloc + register per column, + shared table name + resetHeap + executeQuery + clearTable)
```
For a 10-column query this saves roughly 10 WASM calls per page batch. The absolute overhead is small (~50μs vs ~10ms R2 I/O) but the batched API is also cleaner — callers pass all columns at once rather than looping individually.
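A sketch of the batched path — the call names come from the list above, but their exact signatures are assumptions:

```ts
// Sketch only; the real export signatures are internal to querymode and may differ.
function registerColumns(
  wasm: {
    memory: WebAssembly.Memory
    writeString(s: string): number // copy a JS string into WASM memory, return its pointer
    alloc(bytes: number): number   // reserve space in WASM linear memory
    register(tablePtr: number, namePtr: number, dataPtr: number, len: number): void
  },
  table: string,
  columns: { name: string; data: Uint8Array }[],
): void {
  const tablePtr = wasm.writeString(table)     // table name written once, pointer reused
  for (const col of columns) {
    const namePtr = wasm.writeString(col.name) // one writeString per column (was two)
    const dataPtr = wasm.alloc(col.data.length)
    new Uint8Array(wasm.memory.buffer, dataPtr, col.data.length).set(col.data) // alloc + set, as described below
    wasm.register(tablePtr, namePtr, dataPtr, col.data.length)
  }
}
```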
SharedArrayBuffer is not available in Cloudflare Workers (Spectre mitigations), so zero-copy data exchange between JS and WASM is not possible. All column data is copied into WASM linear memory via alloc + Uint8Array.set(). At ~0.005ms per 64KB page copy vs ~10ms R2 I/O latency, copy overhead accounts for less than 0.05% of total query time — the WASM boundary is not the bottleneck.
Filter pushdown
Filters are pushed as deep as possible — from the DataFrame/SQL layer down to the WASM scan:
```
DataFrame .filter("age", "gt", 25)
  → QueryDescriptor.filters: [{ column: "age", op: "gt", value: 25 }]
  → Partition catalog prune (skip fragments by partition key)
  → Fragment-level skip (skip fragments where age max < 25 across all pages)
  → Page-level skip (skip pages where page max < 25)
  → WASM SIMD filter (process matching pages in one pass)
```
Filter operations
All 14 ops push down: eq, neq, gt, gte, lt, lte, in, not_in, between, not_between, like, not_like, is_null, is_not_null.
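In TypeScript terms, a single pushed-down filter can be pictured like this — the field shape follows the QueryDescriptor example above; the union type name is illustrative:

```ts
type FilterOperator =
  | "eq" | "neq" | "gt" | "gte" | "lt" | "lte"
  | "in" | "not_in" | "between" | "not_between"
  | "like" | "not_like" | "is_null" | "is_not_null"

// Shape as in the QueryDescriptor example above; value is omitted for null checks
// and is an array for in / not_in / between / not_between.
interface FilterOp {
  column: string
  op: FilterOperator
  value?: unknown
}
```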
OR filters
OR conditions decompose into filterGroups: FilterOp[][] — each inner array is AND-connected, groups are OR-connected:
```ts
// WHERE (region = 'us' AND age > 25) OR (region = 'eu' AND age > 30)
// becomes:
filterGroups: [
  [{ column: "region", op: "eq", value: "us" }, { column: "age", op: "gt", value: 25 }],
  [{ column: "region", op: "eq", value: "eu" }, { column: "age", op: "gt", value: 30 }],
]
```
Each group is evaluated independently; results are unioned. This preserves pushdown within each group.
NULL semantics
Filters follow SQL three-valued logic:
- NULL = 5 → NULL (not true, row excluded)
- NULL AND true → NULL (row excluded)
- NULL OR true → true (row included)
- NOT IN (1, 2, NULL) → never true for any row (per SQL standard), so every row is excluded
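A tiny sketch of how that three-valued logic composes, with null standing in for SQL NULL:

```ts
type Tri = boolean | null // null = SQL NULL / unknown

// NULL compared to anything is unknown.
const eq = (a: unknown, b: unknown): Tri => (a === null || b === null ? null : a === b)

// AND is false if either side is false, unknown if either side is unknown.
const and = (a: Tri, b: Tri): Tri =>
  a === false || b === false ? false : a === null || b === null ? null : true

// OR is true if either side is true, unknown if either side is unknown.
const or = (a: Tri, b: Tri): Tri =>
  a === true || b === true ? true : a === null || b === null ? null : false

// A row survives a filter only when the predicate is definitely true.
const passes = (t: Tri): boolean => t === true
```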
Footer caching
Table footers (~4KB) are cached in QueryDO memory. The VIP eviction policy protects frequently-accessed tables from being evicted by cold one-off accesses.
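The exact policy is internal to QueryDO; one plausible shape (purely illustrative, not the actual implementation) is a two-segment cache where footers that are hit repeatedly move to a protected segment, so a burst of one-off lookups can only evict from the probationary side:

```ts
// Illustrative two-segment cache — not the real VIP eviction code.
class VipCache<V> {
  private probation = new Map<string, V>() // evicted first, in insertion order
  private vip = new Map<string, V>()       // protected: entries with repeated hits
  constructor(private maxProbation = 64, private maxVip = 64) {}

  get(key: string): V | undefined {
    if (this.vip.has(key)) return this.vip.get(key)
    const v = this.probation.get(key)
    if (v !== undefined) {
      // Second hit: promote to the protected segment.
      this.probation.delete(key)
      this.setBounded(this.vip, key, v, this.maxVip)
    }
    return v
  }

  set(key: string, value: V): void {
    if (this.vip.has(key)) this.vip.set(key, value)
    else this.setBounded(this.probation, key, value, this.maxProbation)
  }

  private setBounded(m: Map<string, V>, key: string, value: V, max: number): void {
    if (!m.has(key) && m.size >= max) {
      const oldest = m.keys().next().value // Maps iterate in insertion order
      if (oldest !== undefined) m.delete(oldest)
    }
    m.set(key, value)
  }
}
```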
Prefetch pipeline
The ScanOperator overlaps I/O with compute:
```
Time →
Fetch page 0    ████
Decode page 0       ████
Fetch page 1        ████   (overlapped)
Decode page 1           ████
Fetch page 2            ████
...
```
Prefetches the next page while decoding the current page, keeping the pipeline saturated.
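The overlap amounts to starting the next fetch before awaiting the current decode. A sketch, with fetchPage and decodePage standing in for the operator's R2 read and WASM decode:

```ts
// Sketch of the prefetch loop: the fetch for page i+1 is in flight while page i decodes.
async function scanPages(
  pageCount: number,
  fetchPage: (i: number) => Promise<ArrayBuffer>,
  decodePage: (buf: ArrayBuffer) => Promise<void>,
): Promise<void> {
  let inFlight = pageCount > 0 ? fetchPage(0) : null
  for (let i = 0; i < pageCount; i++) {
    const current = await inFlight!
    inFlight = i + 1 < pageCount ? fetchPage(i + 1) : null // start the next fetch first…
    await decodePage(current)                              // …then decode the current page
  }
}
```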
Spill to R2
Operators that accumulate state (sort, join) accept a memory budget. When exceeded:
- HashJoinOperator — Grace hash partitioning, spills partitions to R2
- ExternalSortOperator — writes sorted runs to R2, k-way merges
Same SpillBackend interface for R2 (edge) and filesystem (local).
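The method names below are assumptions — an illustrative shape of what such an interface needs to cover, not the actual SpillBackend API:

```ts
// Assumed shape — illustrative only.
interface SpillBackend {
  write(key: string, data: Uint8Array): Promise<void> // persist a spilled partition or sorted run
  read(key: string): Promise<Uint8Array>              // read it back for the merge phase
  delete(key: string): Promise<void>                  // clean up after the query completes
}

// An R2-backed implementation would wrap the bucket's put/get/delete,
// while the local implementation uses the filesystem.
```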
PB-scale infrastructure
Multi-bucket sharding
R2 has per-bucket rate limits. At PB scale, a single bucket becomes a bottleneck. QueryMode distributes data across up to 4 R2 buckets (DATA_BUCKET + DATA_BUCKET_1/2/3) using FNV-1a hash routing on the R2 key prefix. All DOs (Master, Query, Fragment) and the Worker use the shared resolveBucket() utility — reads and writes are automatically routed to the correct shard.
```toml
# wrangler.toml — bind extra buckets
[[r2_buckets]]
binding = "DATA_BUCKET_1"
bucket_name = "data-shard-1"

[[r2_buckets]]
binding = "DATA_BUCKET_2"
bucket_name = "data-shard-2"

[[r2_buckets]]
binding = "DATA_BUCKET_3"
bucket_name = "data-shard-3"
```
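resolveBucket() itself lives in the codebase; the sketch below shows the general idea under the assumption that an FNV-1a hash of the key prefix is reduced modulo the number of bound buckets:

```ts
// Illustrative FNV-1a routing — the real resolveBucket() may differ in details.
function fnv1a(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)          // key prefixes are ASCII, so char codes work as bytes
    h = Math.imul(h, 0x01000193)
  }
  return h >>> 0
}

interface ShardedEnv {
  DATA_BUCKET: R2Bucket
  DATA_BUCKET_1?: R2Bucket
  DATA_BUCKET_2?: R2Bucket
  DATA_BUCKET_3?: R2Bucket
}

function resolveBucketSketch(env: ShardedEnv, keyPrefix: string): R2Bucket {
  const shards = [env.DATA_BUCKET, env.DATA_BUCKET_1, env.DATA_BUCKET_2, env.DATA_BUCKET_3]
    .filter((b): b is R2Bucket => b !== undefined)
  return shards[fnv1a(keyPrefix) % shards.length]
}
```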
Partition catalog
For datasets with millions of fragments, evaluating min/max stats on every fragment is too slow. The partition catalog provides O(1) fragment lookup: given a filter on the partition column, return only the fragment IDs that could contain matching rows.
```
Query: WHERE region = 'us'

Without catalog: scan all 1M fragment stats → 200ms
With catalog:    lookup "us" → [frag-1, frag-5, frag-12] → <1ms
```
Catalogs are built automatically (best-column auto-detection) or explicitly via partitionBy during ingest. They support eq, in, neq, and not_in filters. Catalogs persist to DO durable storage and restore on wake.
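Conceptually the catalog is a map from partition value to fragment IDs. An illustrative sketch (not the persisted format):

```ts
// Illustrative sketch of the catalog's lookup structure.
class PartitionCatalog {
  // partition value → fragment IDs containing rows with that value
  private index = new Map<string, string[]>()

  add(value: string, fragmentId: string): void {
    const ids = this.index.get(value) ?? []
    ids.push(fragmentId)
    this.index.set(value, ids)
  }

  // eq / in filters: return only the fragments that could match — O(1) per value.
  // neq / not_in would return the complement of this set.
  lookup(values: string[]): string[] {
    return values.flatMap((v) => this.index.get(v) ?? [])
  }
}
```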
Two-phase fragment pruning
- Partition catalog prune — O(1) lookup eliminates fragments that can’t match the partition filter
- Min/max stat prune — remaining fragments are checked against per-column page stats
This two-phase approach means a query on a PB-scale dataset with a selective partition filter touches only the relevant fragments — no full scan of metadata.
Partition-aware ingest
```ts
await qm.table("events").append(rows, {
  partitionBy: "region", // each region value gets its own fragment
})
```
Rows are grouped by partition column value and written to separate fragments. This creates clean partition boundaries that the catalog can exploit for O(1) pruning.
Local mode
LocalExecutor reads files from disk or HTTP with the same operator pipeline. No Durable Objects, no R2 — just direct file I/O.
```ts
import { QueryMode } from "querymode/local"

const qm = QueryMode.local()
```
MaterializedExecutor holds data in-memory for fromJSON() and fromCSV().
```sh
pnpm install   # install dependencies
pnpm build     # build WASM + TypeScript
pnpm test      # workerd tests + Node tests
pnpm dev       # local dev with wrangler

# WASM only (requires zig)
pnpm wasm
```
Testing
Tests run in two runtimes:
| Runtime | What | Count |
|---|---|---|
| workerd (real CF Workers) | Operators, DOs, decode, format parsing | — |
| Node | DuckDB conformance (1M-5M rows), fixture files, SQL, infra | — |
| Total | All runtimes | 755+ tests |
Conformance tests validate every operator against DuckDB at scale. CI benchmarks compare QueryMode vs DuckDB on every push.