Architecture
System overview
```
Clients          Entry Points                Durable Objects
────────         ────────────                ────────────────
Browser                                      MasterDO (single-writer)
CLI        ──►   Worker (CF Worker)    ──►   │
DataFrame        LocalExecutor               QueryDO (per-region)
                 MaterializedExecutor        │
                                             FragmentDO (fan-out pool)
                                             │
                                             WorkerDO (distributed tasks)

Operator Pipeline            Core Engine
─────────────────            ───────────
ScanOperator + prefetch      WASM SIMD (Zig)
FilterOperator               Decode + bitmap
TopKOperator                 Coalesce ranges
WasmAggregateOperator        autoCoalesceGap

Format Decoders          Storage
───────────────          ───────
Parquet (Thrift)         R2
Lance v2 (protobuf)      Disk
Iceberg (JSON meta)      SpillBackend
CSV / JSON / Arrow
```
Durable Objects
MasterDO (single-writer)
Owns table metadata. Handles writes via CAS-based manifest coordination. Broadcasts footer invalidations to Query DOs.
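A minimal sketch of the coordination idea, assuming a hypothetical manifest with a version counter kept in Durable Object storage (the field names and the version counter are assumptions, not the actual manifest format):

```ts
// Hypothetical sketch only — not the real manifest layout.
interface Manifest {
  version: number      // bumped on every successful commit
  fragments: string[]  // fragment IDs that make up the table
}

// Apply a write only if the caller saw the latest manifest version (compare-and-swap).
async function commitManifest(
  storage: DurableObjectStorage,
  expectedVersion: number,
  update: (m: Manifest) => Manifest,
): Promise<boolean> {
  const current = (await storage.get<Manifest>("manifest")) ?? { version: 0, fragments: [] }
  if (current.version !== expectedVersion) return false // stale writer: caller re-reads and retries
  const next = { ...update(current), version: current.version + 1 }
  await storage.put("manifest", next)
  return true
}
```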
QueryDO (per-region)
One per datacenter region. Caches table footers in memory (~4KB each) with VIP eviction policy. Routes queries to Fragment DOs for parallel scan.
FragmentDO (fan-out pool)
One per fragment — scales with data, no hard cap. Each scans its assigned fragments in parallel. Handles column reads, page decode, filter pushdown, and partial aggregation. Idle DOs cost nothing (they hibernate). Slot names are deterministic (frag-{region}-slot-{N}) so they get reused across queries with warm footer caches.
WorkerDO (distributed tasks)
Generic worker for distributed operations that exceed single-DO memory. Handles hash join partition processing, sort partitions, and distinct deduplication across R2-spilled data. Managed by WorkerPool which coordinates tree-shaped fan-out (direct for ≤50 partitions, two-level for ≤2500, three-level beyond).
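A minimal sketch of how that depth choice might look, using the thresholds quoted above (the function itself is illustrative; 2500 = 50²):

```ts
// Illustrative only: pick fan-out depth from the partition count.
function fanOutLevels(partitions: number): 1 | 2 | 3 {
  if (partitions <= 50) return 1    // direct: one WorkerDO per partition
  if (partitions <= 2500) return 2  // two-level: ≤50 coordinators × ≤50 workers
  return 3                          // three-level beyond 2500 partitions
}
```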
Query flow
Section titled “Query flow”- Request arrives at Worker → routes to regional QueryDO
- QueryDO checks footer cache → if miss, fetches from R2
- QueryDO prunes fragments (partition catalog → min/max stats) and partitions survivors across FragmentDO pool
- Each FragmentDO runs the operator pipeline:
- Page-level skip (min/max stats)
- Coalesced R2 range reads
- Prefetch next page while decoding current
- WASM SIMD decode + filter
- QueryDO merges partial results via columnar k-way merge (or hierarchical reduction for large fan-outs)
- Response returned as JSON or streaming columnar format
Data flows between DOs as QMCB (QueryMode Columnar Binary) — a zero-copy columnar wire format transferred via structured clone over Worker RPC. This eliminates JSON serialization of millions of rows between DOs. See Columnar Format for the wire format details.
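The wire layout is specified on the Columnar Format page; as a rough mental model (field names below are illustrative, not the actual QMCB schema), a batch moves between DOs as typed arrays that structured clone can transfer without re-encoding:

```ts
// Illustrative shape only — see the Columnar Format page for the real QMCB layout.
interface ColumnarBatch {
  columns: string[]                                      // column names, in schema order
  types: ("int64" | "float64" | "utf8")[]                // logical type per column
  values: (BigInt64Array | Float64Array | Uint8Array)[]  // one typed buffer per column
  nulls: (Uint8Array | null)[]                           // optional validity bitmaps
  rowCount: number
}
```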
Mental model: biological cells, not a brain
Every Fragment DO runs the same WASM binary — like cells sharing the same DNA. No cell knows the whole organism. No central brain directs them. They activate on signal, do their work, and go dormant.
| Biology | QueryMode |
|---|---|
| DNA (shared code) | Same WASM binary on every DO |
| Cell activates on signal | DO wakes on RPC, hibernates when idle |
| Each cell processes local inputs | Each DO scans its own fragment |
| Cells don’t coordinate with each other | Fragment DOs are isolated |
| Body collects and integrates | QueryDO reduces via k-way merge |
| More tissue → more cells | More fragments → more DOs |
| Dormant cells cost nothing | Idle DOs hibernate (zero cost) |
The code is the DNA of these cells — every Fragment DO carries the same WASM engine, responds to the same signals (scan, filter, aggregate), and produces the same output format. Scale comes from having more cells, not smarter ones.
The fan-out decision is workload-aware: small scans run locally in the QueryDO (no RPC overhead), large scans dispatch to Fragment DOs for parallel execution. No configuration — the engine decides based on estimated rows after pruning.
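A sketch of that decision under an assumed row threshold (the real cutoff is not documented here and is purely illustrative):

```ts
// Illustrative sketch: run small scans in-process, fan out large ones.
// LOCAL_ROW_LIMIT is a made-up number, not the engine's actual threshold.
const LOCAL_ROW_LIMIT = 100_000

function planExecution(estimatedRowsAfterPruning: number): "local" | "fan-out" {
  return estimatedRowsAfterPruning <= LOCAL_ROW_LIMIT ? "local" : "fan-out"
}
```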
Hierarchical reduction
When fragment count exceeds 50, a single QueryDO can’t hold all partial results in memory. Instead of merging everything in one place, QueryMode adds reducer tiers:
```
Leaf DOs (scan) → Reducer DOs (merge groups of 25) → QueryDO (final merge)
```
Each tier shrinks the result count by 25×. The math:
| Fragments | Reducer tiers | QueryDO merges |
|---|---|---|
| 25 | 0 (flat) | 25 results |
| 50 | 1 | 2 results |
| 625 | 2 | ~1 result |
| 15,625 | 3 | ~1 result |
This keeps QueryDO memory bounded at any data scale. The reducer DOs are regular Fragment DOs reused for merge — same pool, same hibernation behavior.
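A minimal sketch of the reduction shape, where mergeGroup stands in for an RPC to a reducer DO (the real planner differs in details, but the grouping-by-25 structure is the same):

```ts
// Sketch: merge partial results in groups of `fanIn`, adding a tier whenever
// one DO would otherwise have to merge too many results at once.
async function hierarchicalMerge<T>(
  partials: T[],
  mergeGroup: (group: T[]) => Promise<T>, // in practice, an RPC to a reducer DO
  fanIn = 25,
): Promise<T> {
  let level = partials
  while (level.length > fanIn) {
    const next: Promise<T>[] = []
    for (let i = 0; i < level.length; i += fanIn) {
      next.push(mergeGroup(level.slice(i, i + fanIn))) // one reducer per group of 25
    }
    level = await Promise.all(next)
  }
  return mergeGroup(level) // final merge at the QueryDO
}
```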
WASM engine (Zig)
The querymode.wasm binary is compiled from Zig source (wasm/src/):
- Column decode — int32, int64, float64, utf8, bool, binary
- SIMD aggregates — Vec2i64 for int64 sum/min/max, Vec4f64 for float64
- SQL execution — register columns, execute queries, return rows
- Vector search — flat SIMD distance computation, IVF-PQ index support
- Fragment writing — append rows to Lance format
The WASM module is loaded as a CompiledWasm rule in the Worker.
Batched WASM registration
Column registration uses batched calls to minimize WASM boundary crossings. Instead of writing the table name string into WASM memory once per column, registerColumns() writes it once and reuses the pointer across all columns:
```
Before (per-column): 4N + 3 WASM calls
  (writeString×2 + alloc + register per column, + resetHeap + executeQuery + clearTable)

After (batched):     3N + 4 WASM calls
  (writeString×1 + alloc + register per column, + shared table name + resetHeap + executeQuery + clearTable)
```
For a 10-column query this saves roughly 10 WASM calls per page batch. The absolute overhead is small (~50μs vs ~10ms R2 I/O) but the batched API is also cleaner — callers pass all columns at once rather than looping individually.
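A sketch of the batched path — the call names come from the list above, but their exact signatures are assumptions:

```ts
// Sketch only; the real export signatures are internal to querymode and may differ.
function registerColumns(
  wasm: {
    memory: WebAssembly.Memory
    writeString(s: string): number // copy a JS string into WASM memory, return its pointer
    alloc(bytes: number): number   // reserve space in WASM linear memory
    register(tablePtr: number, namePtr: number, dataPtr: number, len: number): void
  },
  table: string,
  columns: { name: string; data: Uint8Array }[],
): void {
  const tablePtr = wasm.writeString(table)     // table name written once, pointer reused
  for (const col of columns) {
    const namePtr = wasm.writeString(col.name) // one writeString per column (was two)
    const dataPtr = wasm.alloc(col.data.length)
    new Uint8Array(wasm.memory.buffer, dataPtr, col.data.length).set(col.data) // alloc + set, as described below
    wasm.register(tablePtr, namePtr, dataPtr, col.data.length)
  }
}
```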
SharedArrayBuffer is not available in Cloudflare Workers (Spectre mitigations), so zero-copy data exchange between JS and WASM is not possible. All column data is copied into WASM linear memory via alloc + Uint8Array.set(). At ~0.005ms per 64KB page copy vs ~10ms R2 I/O latency, copy overhead accounts for less than 0.05% of total query time — the WASM boundary is not the bottleneck.
Filter pushdown
Filters are pushed as deep as possible — from the DataFrame/SQL layer down to the WASM scan:
```
DataFrame .filter("age", "gt", 25)
  → QueryDescriptor.filters: [{ column: "age", op: "gt", value: 25 }]
  → Partition catalog prune (skip fragments by partition key)
  → Fragment-level skip (skip fragments where age max < 25 across all pages)
  → Page-level skip (skip pages where page max < 25)
  → WASM SIMD filter (process matching pages in one pass)
```
Filter operations
All 14 ops push down: eq, neq, gt, gte, lt, lte, in, not_in, between, not_between, like, not_like, is_null, is_not_null.
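In TypeScript terms, a single pushed-down filter can be pictured like this — the field shape follows the QueryDescriptor example above; the union type name is illustrative:

```ts
type FilterOperator =
  | "eq" | "neq" | "gt" | "gte" | "lt" | "lte"
  | "in" | "not_in" | "between" | "not_between"
  | "like" | "not_like" | "is_null" | "is_not_null"

// Shape as in the QueryDescriptor example above; value is omitted for null checks
// and is an array for in / not_in / between / not_between.
interface FilterOp {
  column: string
  op: FilterOperator
  value?: unknown
}
```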
OR filters
OR conditions decompose into filterGroups: FilterOp[][] — each inner array is AND-connected, groups are OR-connected:
```ts
// WHERE (region = 'us' AND age > 25) OR (region = 'eu' AND age > 30)
// becomes:
filterGroups: [
  [{ column: "region", op: "eq", value: "us" }, { column: "age", op: "gt", value: 25 }],
  [{ column: "region", op: "eq", value: "eu" }, { column: "age", op: "gt", value: 30 }],
]
```
Each group is evaluated independently; results are unioned. This preserves pushdown within each group.
NULL semantics
Filters follow SQL three-valued logic:
- NULL = 5 → NULL (not true, row excluded)
- NULL AND true → NULL (row excluded)
- NULL OR true → true (row included)
- NOT IN (1, 2, NULL) → never true for any row (per SQL standard), so every row is excluded
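A tiny sketch of how that three-valued logic composes, with null standing in for SQL NULL:

```ts
type Tri = boolean | null // null = SQL NULL / unknown

// NULL compared to anything is unknown.
const eq = (a: unknown, b: unknown): Tri => (a === null || b === null ? null : a === b)

// AND is false if either side is false, unknown if either side is unknown.
const and = (a: Tri, b: Tri): Tri =>
  a === false || b === false ? false : a === null || b === null ? null : true

// OR is true if either side is true, unknown if either side is unknown.
const or = (a: Tri, b: Tri): Tri =>
  a === true || b === true ? true : a === null || b === null ? null : false

// A row survives a filter only when the predicate is definitely true.
const passes = (t: Tri): boolean => t === true
```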
Footer caching
Table footers (~4KB) are cached in QueryDO memory. The VIP eviction policy protects frequently-accessed tables from being evicted by cold one-off accesses.
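The exact policy is internal to QueryDO; one plausible shape (purely illustrative, not the actual implementation) is a two-segment cache where footers that are hit repeatedly move to a protected segment, so a burst of one-off lookups can only evict from the probationary side:

```ts
// Illustrative two-segment cache — not the real VIP eviction code.
class VipCache<V> {
  private probation = new Map<string, V>() // evicted first, in insertion order
  private vip = new Map<string, V>()       // protected: entries with repeated hits
  constructor(private maxProbation = 64, private maxVip = 64) {}

  get(key: string): V | undefined {
    if (this.vip.has(key)) return this.vip.get(key)
    const v = this.probation.get(key)
    if (v !== undefined) {
      // Second hit: promote to the protected segment.
      this.probation.delete(key)
      this.setBounded(this.vip, key, v, this.maxVip)
    }
    return v
  }

  set(key: string, value: V): void {
    if (this.vip.has(key)) this.vip.set(key, value)
    else this.setBounded(this.probation, key, value, this.maxProbation)
  }

  private setBounded(m: Map<string, V>, key: string, value: V, max: number): void {
    if (!m.has(key) && m.size >= max) {
      const oldest = m.keys().next().value // Maps iterate in insertion order
      if (oldest !== undefined) m.delete(oldest)
    }
    m.set(key, value)
  }
}
```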
Prefetch pipeline
The ScanOperator overlaps I/O with compute:
```
Time →
Fetch page 0    ████
Decode page 0       ████
Fetch page 1        ████   (overlapped)
Decode page 1           ████
Fetch page 2            ████
...
```
Prefetches the next page while decoding the current page, keeping the pipeline saturated.
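The overlap amounts to starting the next fetch before awaiting the current decode. A sketch, with fetchPage and decodePage standing in for the operator's R2 read and WASM decode:

```ts
// Sketch of the prefetch loop: the fetch for page i+1 is in flight while page i decodes.
async function scanPages(
  pageCount: number,
  fetchPage: (i: number) => Promise<ArrayBuffer>,
  decodePage: (buf: ArrayBuffer) => Promise<void>,
): Promise<void> {
  let inFlight = pageCount > 0 ? fetchPage(0) : null
  for (let i = 0; i < pageCount; i++) {
    const current = await inFlight!
    inFlight = i + 1 < pageCount ? fetchPage(i + 1) : null // start the next fetch first…
    await decodePage(current)                              // …then decode the current page
  }
}
```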
Spill to R2
Operators that accumulate state (sort, join) accept a memory budget. When exceeded:
- HashJoinOperator — Grace hash partitioning, spills partitions to R2
- ExternalSortOperator — writes sorted runs to R2, k-way merges
Same SpillBackend interface for R2 (edge) and filesystem (local).
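The method names below are assumptions — an illustrative shape of what such an interface needs to cover, not the actual SpillBackend API:

```ts
// Assumed shape — illustrative only.
interface SpillBackend {
  write(key: string, data: Uint8Array): Promise<void> // persist a spilled partition or sorted run
  read(key: string): Promise<Uint8Array>              // read it back for the merge phase
  delete(key: string): Promise<void>                  // clean up after the query completes
}

// An R2-backed implementation would wrap the bucket's put/get/delete,
// while the local implementation uses the filesystem.
```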
PB-scale infrastructure
Multi-bucket sharding
R2 has per-bucket rate limits. At PB scale, a single bucket becomes a bottleneck. QueryMode distributes data across up to 4 R2 buckets (DATA_BUCKET + DATA_BUCKET_1/2/3) using FNV-1a hash routing on the R2 key prefix. All DOs (Master, Query, Fragment) and the Worker use the shared resolveBucket() utility — reads and writes are automatically routed to the correct shard.
```toml
# wrangler.toml — bind extra buckets
[[r2_buckets]]
binding = "DATA_BUCKET_1"
bucket_name = "data-shard-1"

[[r2_buckets]]
binding = "DATA_BUCKET_2"
bucket_name = "data-shard-2"

[[r2_buckets]]
binding = "DATA_BUCKET_3"
bucket_name = "data-shard-3"
```
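resolveBucket() itself lives in the codebase; the sketch below shows the general idea under the assumption that an FNV-1a hash of the key prefix is reduced modulo the number of bound buckets:

```ts
// Illustrative FNV-1a routing — the real resolveBucket() may differ in details.
function fnv1a(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)          // key prefixes are ASCII, so char codes work as bytes
    h = Math.imul(h, 0x01000193)
  }
  return h >>> 0
}

interface ShardedEnv {
  DATA_BUCKET: R2Bucket
  DATA_BUCKET_1?: R2Bucket
  DATA_BUCKET_2?: R2Bucket
  DATA_BUCKET_3?: R2Bucket
}

function resolveBucketSketch(env: ShardedEnv, keyPrefix: string): R2Bucket {
  const shards = [env.DATA_BUCKET, env.DATA_BUCKET_1, env.DATA_BUCKET_2, env.DATA_BUCKET_3]
    .filter((b): b is R2Bucket => b !== undefined)
  return shards[fnv1a(keyPrefix) % shards.length]
}
```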
Partition catalog
For datasets with millions of fragments, evaluating min/max stats on every fragment is too slow. The partition catalog provides O(1) fragment lookup: given a filter on the partition column, return only the fragment IDs that could contain matching rows.
```
Query: WHERE region = 'us'

Without catalog: scan all 1M fragment stats → 200ms
With catalog:    lookup "us" → [frag-1, frag-5, frag-12] → <1ms
```
Catalogs are built automatically (best-column auto-detection) or explicitly via partitionBy during ingest. They support eq, in, neq, and not_in filters. Catalogs persist to DO durable storage and restore on wake.
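Conceptually the catalog is a map from partition value to fragment IDs. An illustrative sketch (not the persisted format):

```ts
// Illustrative sketch of the catalog's lookup structure.
class PartitionCatalog {
  // partition value → fragment IDs containing rows with that value
  private index = new Map<string, string[]>()

  add(value: string, fragmentId: string): void {
    const ids = this.index.get(value) ?? []
    ids.push(fragmentId)
    this.index.set(value, ids)
  }

  // eq / in filters: return only the fragments that could match — O(1) per value.
  // neq / not_in would return the complement of this set.
  lookup(values: string[]): string[] {
    return values.flatMap((v) => this.index.get(v) ?? [])
  }
}
```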
Two-phase fragment pruning
- Partition catalog prune — O(1) lookup eliminates fragments that can’t match the partition filter
- Min/max stat prune — remaining fragments are checked against per-column page stats
This two-phase approach means a query on a PB-scale dataset with a selective partition filter touches only the relevant fragments — no full scan of metadata.
Partition-aware ingest
```ts
await qm.table("events").append(rows, {
  partitionBy: "region", // each region value gets its own fragment
})
```
Rows are grouped by partition column value and written to separate fragments. This creates clean partition boundaries that the catalog can exploit for O(1) pruning.
Local mode
LocalExecutor reads files from disk or HTTP with the same operator pipeline. No Durable Objects, no R2 — just direct file I/O.
```ts
import { QueryMode } from "querymode/local"

const qm = QueryMode.local()
```
MaterializedExecutor holds data in-memory for fromJSON() and fromCSV().
```sh
pnpm install   # install dependencies
pnpm build     # build WASM + TypeScript
pnpm test      # workerd tests + Node tests
pnpm dev       # local dev with wrangler

# WASM only (requires zig)
pnpm wasm
```
Testing
Tests run in two runtimes:
| Runtime | What | Count |
|---|---|---|
| workerd (real CF Workers) | Operators, DOs, decode, format parsing | — |
| Node | DuckDB conformance (1M-5M rows), fixture files, SQL, infra | — |
| Total | All runtimes | 755+ tests |
Conformance tests validate every operator against DuckDB at scale. CI benchmarks compare QueryMode vs DuckDB on every push.