Formats
QueryMode reads columnar and row-oriented formats through the same API. Parquet and Lance files are detected automatically from their magic bytes; CSV and JSON files are detected by extension and content.
Supported formats
| Format | Type | Page skip | Compression | Status |
|---|---|---|---|---|
| Parquet | Columnar | Min/max stats | Snappy, ZSTD, GZIP, LZ4_RAW | Full support |
| Lance v1/v2 | Columnar | Min/max stats | None (raw pages) | Full support |
| Iceberg | Table format | Via Parquet | Via Parquet | Metadata + Parquet data |
| CSV | Row | No | No | Via fromCSV() |
| JSON / NDJSON | Row | No | No | Via fromJSON() or file path |
| Arrow IPC | Columnar | No | No | File or in-memory |
Parquet
Full Thrift metadata parser. Reads row groups, column chunks, and page-level statistics.

    const qm = QueryMode.local()
    const result = await qm
      .table("./data/events.parquet")
      .filter("id", "gt", 50000) // skips row groups where max(id) <= 50000
      .collect()

Supported types
| Parquet type | QueryMode type | Notes |
|---|---|---|
| INT32 | int32 | |
| INT64 | int64 | BigInt in JS |
| INT32 (INT_8) | int8 | Via logical type annotation |
| INT32 (INT_16) | int16 | Via logical type annotation |
| INT32 (UINT_8) | uint8 | Via logical type annotation |
| INT32 (UINT_16) | uint16 | Via logical type annotation |
| INT32 (UINT_32) | uint32 | Via logical type annotation |
| INT64 (UINT_64) | uint64 | BigInt in JS |
| FLOAT | float32 | |
| DOUBLE | float64 | |
| FIXED_LEN_BYTE_ARRAY (FLOAT16) | float16 | IEEE 754 half-precision |
| BOOLEAN | bool | |
| BYTE_ARRAY (UTF8) | utf8 | |
| BYTE_ARRAY | binary | |
| FIXED_LEN_BYTE_ARRAY | binary | UUIDs, decimals |
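The table above can be read as a two-level lookup: an annotated combination wins if one matches, otherwise the physical type decides. The sketch below encodes exactly the rows of the table. The names `mapParquetType`, `ANNOTATED`, and `PLAIN` are hypothetical and are not part of QueryMode's API, and falling back to the physical type for an unlisted annotation is an assumption.

```typescript
// Hypothetical helper, not QueryMode's API: mirrors the type table row by row.
type PhysicalType =
  | "BOOLEAN" | "INT32" | "INT64" | "FLOAT" | "DOUBLE"
  | "BYTE_ARRAY" | "FIXED_LEN_BYTE_ARRAY";

type Annotation =
  | "INT_8" | "INT_16" | "UINT_8" | "UINT_16" | "UINT_32" | "UINT_64"
  | "UTF8" | "FLOAT16";

type QueryModeType =
  | "bool" | "int8" | "int16" | "int32" | "int64"
  | "uint8" | "uint16" | "uint32" | "uint64"
  | "float16" | "float32" | "float64" | "utf8" | "binary";

// Annotated rows from the table, keyed as "PHYSICAL:ANNOTATION".
const ANNOTATED: Record<string, QueryModeType> = {
  "INT32:INT_8": "int8",
  "INT32:INT_16": "int16",
  "INT32:UINT_8": "uint8",
  "INT32:UINT_16": "uint16",
  "INT32:UINT_32": "uint32",
  "INT64:UINT_64": "uint64",
  "FIXED_LEN_BYTE_ARRAY:FLOAT16": "float16",
  "BYTE_ARRAY:UTF8": "utf8",
};

// Rows without an annotation.
const PLAIN: Record<PhysicalType, QueryModeType> = {
  BOOLEAN: "bool",
  INT32: "int32",
  INT64: "int64",
  FLOAT: "float32",
  DOUBLE: "float64",
  BYTE_ARRAY: "binary",
  FIXED_LEN_BYTE_ARRAY: "binary",
};

function mapParquetType(physical: PhysicalType, annotation?: Annotation): QueryModeType {
  if (annotation !== undefined) {
    const hit = ANNOTATED[physical + ":" + annotation];
    if (hit !== undefined) return hit;
  }
  // Assumption: an annotation that is not in the table falls back to the physical type.
  return PLAIN[physical];
}
```

For example, `mapParquetType("INT32", "UINT_16")` yields `uint16`, while a bare `FIXED_LEN_BYTE_ARRAY` yields `binary`.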
Compression
Snappy decompression is pure TypeScript. ZSTD, GZIP, and LZ4_RAW use the WASM engine. Note: only raw LZ4 blocks (LZ4_RAW) are supported; the Hadoop-framed LZ4 codec is not.
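As a rough illustration of that split, the sketch below routes a column chunk's codec id to the engine that would decompress it. The numeric ids follow the `CompressionCodec` enum in the Parquet Thrift definition. The function name `engineForCodec` is hypothetical, and treating UNCOMPRESSED as supported and LZO and BROTLI as unsupported is an assumption based on the codecs listed in the formats table.

```typescript
// Codec ids as defined by the CompressionCodec enum in parquet.thrift.
const CODEC_NAMES: Record<number, string> = {
  0: "UNCOMPRESSED",
  1: "SNAPPY",
  2: "GZIP",
  3: "LZO",
  4: "BROTLI",
  5: "LZ4",
  6: "ZSTD",
  7: "LZ4_RAW",
};

type Engine = "none" | "typescript" | "wasm";

// Hypothetical dispatcher: picks the decompression engine for a codec id.
function engineForCodec(codecId: number): Engine {
  const name = CODEC_NAMES[codecId];
  switch (name) {
    case "UNCOMPRESSED":
      return "none"; // assumption: raw pages need no decompression step
    case "SNAPPY":
      return "typescript"; // pure TypeScript decoder
    case "GZIP":
    case "ZSTD":
    case "LZ4_RAW":
      return "wasm"; // handled by the WASM engine
    case "LZ4":
      throw new Error("Hadoop-framed LZ4 is not supported; write the file with LZ4_RAW instead");
    default:
      // Assumption: anything not listed in the formats table is rejected.
      throw new Error("Unsupported Parquet codec id: " + codecId);
  }
}
```

Failing loudly on codec id 5 makes the LZ4 versus LZ4_RAW distinction visible at read time instead of producing corrupt pages.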
Lance v1 / v2
Native Lance format reader supporting both v1 and v2 layouts. Parses the 40-byte footer, column metadata protobuf, and page data.

    const result = await qm
      .table("./data/embeddings.lance")
      .vector("embedding", queryVec, 10)
      .collect()

Lance features
- Footer parsing (major/minor version, column count, metadata offsets)
- Column metadata from protobuf (names, types, page offsets, null counts)
- Manifest parsing (fragments, schema, version history)
- Null bitmap decode with fast paths (0xFF all-valid, 0x00 all-null)
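The null bitmap fast paths in the last item can be sketched as follows. This is a minimal illustration, not QueryMode's decoder: the name `decodeValidity` is hypothetical, and it assumes one bit per row with the least significant bit first, where a set bit means the row is valid.

```typescript
// Hypothetical helper: expands a validity bitmap into one boolean per row.
// Assumes LSB-first bit order within each byte (set bit = valid row).
function decodeValidity(bitmap: Uint8Array, rowCount: number): boolean[] {
  if (bitmap.length * 8 < rowCount) {
    throw new Error("bitmap is too short for rowCount");
  }
  const valid: boolean[] = [];
  for (let base = 0; base < rowCount; base += 8) {
    const byte = bitmap[base >> 3] ?? 0;
    const count = Math.min(8, rowCount - base); // the last byte may be partial
    if (byte === 0xff || byte === 0x00) {
      // Fast paths: the whole byte is all-valid or all-null,
      // so no per-bit shifting or masking is needed.
      const value = byte === 0xff;
      for (let i = 0; i < count; i++) valid.push(value);
      continue;
    }
    // Slow path: test each bit individually.
    for (let bit = 0; bit < count; bit++) {
      valid.push(((byte >> bit) & 1) === 1);
    }
  }
  return valid;
}
```

In columns with few or no nulls, most bytes are 0xFF, so nearly every group of eight rows takes the fast path.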
Iceberg
Reads Iceberg metadata JSON, extracts Parquet file paths from manifests, then reads the data files as Parquet.

    const result = await qm
      .table("./warehouse/db/events")
      .filter("event_type", "eq", "purchase")
      .collect()

Supports Iceberg v1 and v2 metadata, including type mapping from the Iceberg schema to QueryMode types.
CSV, JSON, and NDJSON
In-memory materialization for small datasets:

    // From a CSV string
    const fromCsv = await QueryMode.fromCSV(csvString, "my_table")

    // From a JSON array
    const fromJson = QueryMode.fromJSON(jsonArray, "my_table")

    // From a file: format auto-detected by extension and content
    const jsonRows = await qm.table("./data/events.json").collect()     // JSON array
    const ndjsonRows = await qm.table("./data/events.ndjson").collect() // newline-delimited JSON
    const csvRows = await qm.table("./data/events.csv").collect()       // CSV/TSV/PSV (auto-detect delimiter)

CSV parsing auto-detects the delimiter (comma, tab, pipe, or semicolon) and infers column types from the data. JSON input may be either an array ([{...}, {...}]) or NDJSON ({...}\n{...}); the reader tells them apart by the first non-whitespace byte.
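To make the two detection rules concrete, here is a minimal sketch of each. Both function names are hypothetical. The JSON check follows the first-non-whitespace rule described above. The delimiter check uses a simple heuristic that is an assumption on my part: it counts each candidate in the header line and ignores quoting.

```typescript
type JsonFlavor = "array" | "ndjson";

// "[" as the first non-whitespace character means a JSON array;
// "{" means newline-delimited objects.
function detectJsonFlavor(text: string): JsonFlavor {
  const first = text.replace(/^\s+/, "").charAt(0);
  if (first === "[") return "array";
  if (first === "{") return "ndjson";
  throw new Error("Unexpected leading character: " + JSON.stringify(first));
}

const CANDIDATES = [",", "\t", "|", ";"] as const;
type Delimiter = (typeof CANDIDATES)[number];

// Assumed heuristic: the candidate that occurs most often in the header line wins.
// Quoted fields containing delimiters are not handled in this sketch.
function detectDelimiter(sample: string): Delimiter {
  const header = sample.split(/\r?\n/, 1)[0] ?? "";
  let best: Delimiter = ","; // default when no candidate appears
  let bestCount = 0;
  for (const candidate of CANDIDATES) {
    const count = header.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}
```

A production reader would likely sample several lines and respect quoting, but the header-only count is enough to show the idea.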
All of these paths materialize the full dataset in memory. Use Parquet or Lance for large datasets.
Format detection
QueryMode detects format from the file’s magic bytes:
| Format | Magic | Location |
|---|---|---|
| Parquet | PAR1 | Last 4 bytes |
| Lance | LANC | Last 4 bytes |
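A check against the table above needs only the tail of the file. The sketch below is illustrative: `sniffFormat` is a hypothetical name, and it assumes the caller has already read the last few bytes of the file into a `Uint8Array`.

```typescript
type SniffedFormat = "parquet" | "lance" | "unknown";

// Hypothetical helper: compares the last 4 bytes of the file with the known magic strings.
function sniffFormat(tail: Uint8Array): SniffedFormat {
  if (tail.length < 4) return "unknown";
  let magic = "";
  for (let i = tail.length - 4; i < tail.length; i++) {
    magic += String.fromCharCode(tail[i] ?? 0);
  }
  if (magic === "PAR1") return "parquet";
  if (magic === "LANC") return "lance";
  return "unknown";
}
```

Because both magic strings sit in the last 4 bytes, a single small range read at the end of the file is enough to classify it.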
For Iceberg, the table path is resolved to find metadata/v*.metadata.json.
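One plausible way to resolve that glob is to pick the file with the highest version number. The sketch below assumes exactly that, which the text above does not confirm; `latestMetadataFile` is a hypothetical name, and the input is a plain list of file names from the table's metadata directory.

```typescript
// Hypothetical helper: returns the "vN.metadata.json" entry with the largest N,
// or undefined when no file name matches the pattern.
function latestMetadataFile(fileNames: string[]): string | undefined {
  let best: string | undefined;
  let bestVersion = -1;
  for (const name of fileNames) {
    const match = /^v(\d+)\.metadata\.json$/.exec(name);
    if (match === null) continue; // skip manifests, snapshots, and other files
    const version = parseInt(match[1] ?? "", 10);
    if (version > bestVersion) {
      bestVersion = version;
      best = name;
    }
  }
  return best;
}
```

Comparing the parsed number rather than the string matters: a lexical sort would rank `v9.metadata.json` above `v10.metadata.json`.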
Page-level skip
For columnar formats (Parquet, Lance), each page stores min/max statistics. Filters are evaluated against these stats before any data is read:

    Page stats: min=100, max=500
    Filter:     id > 600
    Result:     SKIP (entire page never fetched from R2)

This means a query like filter("id", "gt", 990000) on a 1M-row table reads only the last few pages, provided id values increase with row order so that each page covers a narrow range.
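The skip decision itself is a comparison between the filter value and the page's [min, max] interval. The sketch below shows one way to write it. `canSkipPage` and `PageStats` are hypothetical names; the "gt" and "eq" operators appear in the examples on this page, while "gte", "lt", and "lte" are assumed here for completeness.

```typescript
type FilterOp = "eq" | "gt" | "gte" | "lt" | "lte";

interface PageStats {
  min: number;
  max: number;
}

// Hypothetical helper: true when no value in [min, max] can satisfy the predicate,
// so the page can be skipped without fetching it.
function canSkipPage(stats: PageStats, op: FilterOp, value: number): boolean {
  switch (op) {
    case "eq":
      return value < stats.min || value > stats.max;
    case "gt":
      return stats.max <= value; // every row is <= value
    case "gte":
      return stats.max < value;
    case "lt":
      return stats.min >= value; // every row is >= value
    case "lte":
      return stats.min > value;
    default:
      return false; // unknown operator: be conservative and read the page
  }
}

// Returns the indices of the pages that still have to be read.
function pagesToRead(pages: PageStats[], op: FilterOp, value: number): number[] {
  const keep: number[] = [];
  pages.forEach((stats, index) => {
    if (!canSkipPage(stats, op, value)) keep.push(index);
  });
  return keep;
}
```

With the stats from the example above, `canSkipPage({ min: 100, max: 500 }, "gt", 600)` is true. Note that a page that cannot be skipped may still contain non-matching rows, so the filter must be applied again after decoding.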
Coalesced range reads
When multiple column pages are nearby in the file, their byte ranges are merged into fewer R2/disk requests:

    Before: 5 separate reads (200B gaps between them)
    After:  1 merged read (includes gap bytes, but saves 4 round-trips)

The merge threshold is computed dynamically from the median inter-page gap (autoCoalesceGap).
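The merging step can be sketched in a few lines. This is an illustration under stated assumptions, not the actual autoCoalesceGap logic: `medianGap` and `coalesceRanges` are hypothetical names, the median itself is used directly as the threshold, and for an even number of gaps the upper median is taken.

```typescript
interface ByteRange {
  offset: number;
  length: number;
}

function sortByOffset(ranges: ByteRange[]): ByteRange[] {
  return ranges.slice().sort((a, b) => a.offset - b.offset);
}

// Median of the gaps between consecutive ranges (upper median for even counts).
function medianGap(ranges: ByteRange[]): number {
  const gaps: number[] = [];
  let prevEnd: number | null = null;
  for (const r of sortByOffset(ranges)) {
    const end = r.offset + r.length;
    if (prevEnd !== null) gaps.push(Math.max(0, r.offset - prevEnd));
    prevEnd = prevEnd === null ? end : Math.max(prevEnd, end);
  }
  gaps.sort((a, b) => a - b);
  return gaps[Math.floor(gaps.length / 2)] ?? 0;
}

// Merges ranges whose gap to the previous range is at most maxGap bytes.
function coalesceRanges(ranges: ByteRange[], maxGap: number): ByteRange[] {
  const merged: ByteRange[] = [];
  let current: ByteRange | null = null;
  for (const r of sortByOffset(ranges)) {
    if (current !== null && r.offset - (current.offset + current.length) <= maxGap) {
      // Close enough: extend the current read to cover this range and the gap before it.
      const end = Math.max(current.offset + current.length, r.offset + r.length);
      current = { offset: current.offset, length: end - current.offset };
    } else {
      if (current !== null) merged.push(current);
      current = { offset: r.offset, length: r.length };
    }
  }
  if (current !== null) merged.push(current);
  return merged;
}
```

For five 1000-byte pages separated by 200-byte gaps, `medianGap` returns 200 and `coalesceRanges(pages, 200)` produces a single 5800-byte read. A distant sixth page would stay a separate request because its gap far exceeds the median.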