
Formats

QueryMode reads columnar and row-oriented formats with the same API. Format detection is automatic based on file magic bytes.

| Format | Type | Page skip | Compression | Status |
|---|---|---|---|---|
| Parquet | Columnar | Min/max stats | Snappy, ZSTD, GZIP, LZ4_RAW | Full support |
| Lance v1/v2 | Columnar | Min/max stats | None (raw pages) | Full support |
| Iceberg | Table format | Via Parquet | Via Parquet | Metadata + Parquet data |
| CSV | Row | No | No | Via fromCSV() |
| JSON / NDJSON | Row | No | No | Via fromJSON() or file path |
| Arrow IPC | Columnar | No | No | File or in-memory |

Parquet support includes a full Thrift metadata parser that reads row groups, column chunks, and page-level statistics.

```ts
const qm = QueryMode.local()
const result = await qm
  .table("./data/events.parquet")
  .filter("id", "gt", 50000) // skips row groups where max(id) <= 50000
  .collect()
```
| Parquet type | QueryMode type | Notes |
|---|---|---|
| INT32 | int32 | |
| INT64 | int64 | BigInt in JS |
| INT32 (INT_8) | int8 | Via logical type annotation |
| INT32 (INT_16) | int16 | Via logical type annotation |
| INT32 (UINT_8) | uint8 | Via logical type annotation |
| INT32 (UINT_16) | uint16 | Via logical type annotation |
| INT32 (UINT_32) | uint32 | Via logical type annotation |
| INT64 (UINT_64) | uint64 | BigInt in JS |
| FLOAT | float32 | |
| DOUBLE | float64 | |
| FIXED_LEN_BYTE_ARRAY (FLOAT16) | float16 | IEEE 754 half-precision |
| BOOLEAN | bool | |
| BYTE_ARRAY (UTF8) | utf8 | |
| BYTE_ARRAY | binary | |
| FIXED_LEN_BYTE_ARRAY | binary | UUIDs, decimals |
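FLOAT16 values are stored as two bytes and widened to a JS number on read. A minimal sketch of IEEE 754 half-precision decoding (illustrative only, not QueryMode's internal decoder):

```ts
// Decode a 16-bit IEEE 754 half-precision value into a JS number.
// Sketch for illustration; the real reader may use a lookup table.
function decodeFloat16(bits: number): number {
  const sign = (bits & 0x8000) ? -1 : 1
  const exponent = (bits >> 10) & 0x1f
  const mantissa = bits & 0x3ff
  if (exponent === 0) {
    // Subnormal: no implicit leading 1, effective exponent is -14.
    return sign * mantissa * 2 ** -24
  }
  if (exponent === 0x1f) {
    // All-ones exponent encodes Infinity (mantissa 0) or NaN.
    return mantissa ? NaN : sign * Infinity
  }
  // Normal: implicit leading 1, exponent biased by 15.
  return sign * (1 + mantissa / 1024) * 2 ** (exponent - 15)
}
```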

Snappy decompression is pure TypeScript. ZSTD, GZIP, and LZ4_RAW use the WASM engine. Note: only LZ4 raw blocks are supported — Hadoop-framed LZ4 (LZ4 codec) is not supported.

The native Lance format reader supports both v1 and v2 layouts, parsing the 40-byte footer, the column metadata protobuf, and page data.

```ts
const result = await qm
  .table("./data/embeddings.lance")
  .vector("embedding", queryVec, 10)
  .collect()
```
  • Footer parsing (major/minor version, column count, metadata offsets)
  • Column metadata from protobuf (names, types, page offsets, null counts)
  • Manifest parsing (fragments, schema, version history)
  • Null bitmap decode with fast paths (0xFF all-valid, 0x00 all-null)
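The null-bitmap fast paths can be sketched as follows. This is an illustrative implementation of the idea (one validity bit per row, LSB-first, with whole-byte shortcuts), not QueryMode's actual code:

```ts
// Expand a null bitmap into a per-row validity array, short-circuiting
// the common all-valid (0xFF) and all-null (0x00) bytes.
function decodeNullBitmap(bitmap: Uint8Array, rowCount: number): boolean[] {
  const valid = new Array<boolean>(rowCount)
  for (let byteIdx = 0; byteIdx < bitmap.length; byteIdx++) {
    const byte = bitmap[byteIdx]
    const base = byteIdx * 8
    const limit = Math.min(8, rowCount - base)
    if (byte === 0xff) {
      // Fast path: all 8 rows valid, no per-bit work.
      for (let i = 0; i < limit; i++) valid[base + i] = true
    } else if (byte === 0x00) {
      // Fast path: all 8 rows null.
      for (let i = 0; i < limit; i++) valid[base + i] = false
    } else {
      for (let i = 0; i < limit; i++) valid[base + i] = ((byte >> i) & 1) === 1
    }
  }
  return valid
}
```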

The Iceberg reader parses the table's metadata JSON, extracts Parquet file paths from the manifests, then reads those files as Parquet.

```ts
const result = await qm
  .table("./warehouse/db/events")
  .filter("event_type", "eq", "purchase")
  .collect()
```

It supports Iceberg v1 and v2 metadata and maps types from the Iceberg schema to QueryMode types.

In-memory materialization for small datasets:

```ts
// From a CSV string
const fromCsv = await QueryMode.fromCSV(csvString, "my_table")

// From a JSON array
const fromJson = QueryMode.fromJSON(jsonArray, "my_table")

// From a file — auto-detected by extension and content
const fromJsonFile = await qm.table("./data/events.json").collect()   // JSON array
const fromNdjsonFile = await qm.table("./data/events.ndjson").collect() // newline-delimited JSON
const fromCsvFile = await qm.table("./data/events.csv").collect()     // CSV/TSV/PSV (auto-detect delimiter)
```

CSV auto-detects the delimiter (comma, tab, pipe, semicolon) and infers column types from the data. JSON supports both `[{...}, {...}]` arrays and NDJSON (`{...}\n{...}`), distinguished by the first non-whitespace byte.
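One simple way to auto-detect the delimiter is to count candidate characters on the header line and pick the winner. A simplified heuristic sketch (QueryMode's actual detector may weigh more evidence, e.g. consistency across rows):

```ts
// Pick the most likely delimiter by counting each candidate on the
// first line; defaults to comma when nothing else appears.
function detectDelimiter(firstLine: string): string {
  const candidates = [",", "\t", "|", ";"]
  let best = ","
  let bestCount = 0
  for (const d of candidates) {
    const count = firstLine.split(d).length - 1
    if (count > bestCount) {
      best = d
      bestCount = count
    }
  }
  return best
}
```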

These materialize all data in memory. Use Parquet or Lance for large datasets.

QueryMode detects format from the file’s magic bytes:

| Format | Magic | Location |
|---|---|---|
| Parquet | PAR1 | Last 4 bytes |
| Lance | LANC | Last 4 bytes |

For Iceberg, the table path is resolved to find metadata/v*.metadata.json.
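The detection rule for file-based formats reduces to a trailing-bytes check. A minimal sketch under the magic values in the table above (function name is illustrative):

```ts
// Classify a file by its trailing 4-byte magic, per the table above.
function detectFormat(bytes: Uint8Array): "parquet" | "lance" | "unknown" {
  if (bytes.length < 4) return "unknown"
  const tail = new TextDecoder().decode(bytes.slice(bytes.length - 4))
  if (tail === "PAR1") return "parquet"
  if (tail === "LANC") return "lance"
  return "unknown"
}
```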

For columnar formats (Parquet, Lance), each page stores min/max statistics. Filters are evaluated against these stats before any data is read:

```
Page stats: min=100, max=500
Filter: id > 600
Result: SKIP — entire page never fetched from R2
```

This means queries like filter("id", "gt", 990000) on a 1M-row table only read the last few pages.
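The pruning decision above can be expressed as a small predicate over page stats. An illustrative sketch (the operator names match the `filter()` API shown earlier; the helper itself is not from the QueryMode source):

```ts
interface PageStats { min: number; max: number }

// Return true when the stats prove no row in the page can match,
// so the page can be skipped without being fetched.
function canSkipPage(stats: PageStats, op: "gt" | "lt" | "eq", value: number): boolean {
  switch (op) {
    case "gt": return stats.max <= value               // nothing exceeds value
    case "lt": return stats.min >= value               // nothing is below value
    case "eq": return value < stats.min || value > stats.max
  }
}
```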

When multiple column pages are nearby in the file, their byte ranges are merged into fewer R2/disk requests:

```
Before: 5 separate reads (200B gaps between them)
After: 1 merged read (includes gap bytes, but saves 4 round-trips)
```

The merge threshold is computed dynamically from the median inter-page gap (autoCoalesceGap).
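The coalescing idea can be sketched as follows: sort the ranges, take the median inter-range gap as the threshold (mirroring `autoCoalesceGap`), and fuse any neighbors whose gap is at or below it. Illustrative only; the real implementation may clamp or tune the threshold differently:

```ts
interface ByteRange { start: number; end: number } // end is exclusive

// Median gap between consecutive sorted ranges; 0 when fewer than 2 ranges.
function medianGap(sorted: ByteRange[]): number {
  const gaps: number[] = []
  for (let i = 1; i < sorted.length; i++) {
    gaps.push(sorted[i].start - sorted[i - 1].end)
  }
  gaps.sort((a, b) => a - b)
  return gaps.length ? gaps[Math.floor(gaps.length / 2)] : 0
}

// Merge ranges whose gap is at or below the median gap, trading a few
// wasted gap bytes for fewer round-trips.
function coalesce(ranges: ByteRange[]): ByteRange[] {
  if (ranges.length === 0) return []
  const sorted = [...ranges].sort((a, b) => a.start - b.start)
  const threshold = medianGap(sorted)
  const merged: ByteRange[] = [{ ...sorted[0] }]
  for (const r of sorted.slice(1)) {
    const last = merged[merged.length - 1]
    if (r.start - last.end <= threshold) {
      last.end = Math.max(last.end, r.end) // fuse: the gap bytes get read too
    } else {
      merged.push({ ...r })
    }
  }
  return merged
}
```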