DataFrame API

The DataFrame is a lazy query builder. All methods return a new DataFrame (immutable). Nothing executes until a terminal method is called.
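The immutable, lazy pattern can be sketched in a few lines of TypeScript (illustrative names only; `SketchFrame` and its internals are assumptions, not the library's implementation):

```typescript
// Minimal sketch of an immutable lazy builder: every call copies the
// recorded plan, and no data is touched until a terminal method runs.
type Op = { kind: "filter"; column: string; op: string; value: unknown }

class SketchFrame {
  private constructor(readonly plan: readonly Op[]) {}
  static table(): SketchFrame { return new SketchFrame([]) }

  // Returns a NEW frame; `this` is never mutated.
  filter(column: string, op: string, value: unknown): SketchFrame {
    return new SketchFrame([...this.plan, { kind: "filter", column, op, value }])
  }

  // Terminal method: only here does the plan execute against rows.
  collect(rows: Record<string, unknown>[]): Record<string, unknown>[] {
    return this.plan.reduce(
      (acc, step) => acc.filter((r) => step.op === "gt"
        ? (r[step.column] as number) > (step.value as number)
        : r[step.column] === step.value),
      rows,
    )
  }
}
```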

df.filter("age", "gt", 25)
df.filter("status", "eq", "active")
df.filter("id", "in", [1, 2, 3])
df.filter("name", "like", "%Alice%")
df.filter("amount", "between", [100, 500])
df.whereNotNull("email")
df.whereNull("deleted_at")
df.whereOr(
  [{ column: "region", op: "eq", value: "us" }],
  [{ column: "region", op: "eq", value: "eu" }],
)
| Operator | Description |
| --- | --- |
| eq | Equal |
| neq | Not equal |
| gt | Greater than |
| gte | Greater than or equal |
| lt | Less than |
| lte | Less than or equal |
| in | Membership in array |
| not_in | Not in array |
| between | Within range [low, high] |
| not_between | Outside range |
| like | SQL LIKE pattern (% = any, _ = single char) |
| not_like | Negated LIKE |
| is_null | Value is null |
| is_not_null | Value is not null |
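A sketch of how these operators evaluate against a single value (the `matches` helper and the LIKE-to-regex translation are illustrative assumptions, not the engine's code):

```typescript
// SQL LIKE semantics: % matches any run of characters, _ matches one.
function likeToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  return new RegExp("^" + escaped.replace(/%/g, ".*").replace(/_/g, ".") + "$")
}

// Evaluate one filter operator against a row value.
function matches(value: unknown, op: string, operand?: unknown): boolean {
  switch (op) {
    case "eq": return value === operand
    case "neq": return value !== operand
    case "gt": return (value as number) > (operand as number)
    case "gte": return (value as number) >= (operand as number)
    case "lt": return (value as number) < (operand as number)
    case "lte": return (value as number) <= (operand as number)
    case "in": return (operand as unknown[]).includes(value)
    case "not_in": return !(operand as unknown[]).includes(value)
    case "between": {
      const [lo, hi] = operand as [number, number]
      return (value as number) >= lo && (value as number) <= hi
    }
    case "not_between": {
      const [lo, hi] = operand as [number, number]
      return (value as number) < lo || (value as number) > hi
    }
    case "like": return likeToRegExp(operand as string).test(String(value))
    case "not_like": return !likeToRegExp(operand as string).test(String(value))
    case "is_null": return value == null
    case "is_not_null": return value != null
    default: throw new Error(`unknown operator: ${op}`)
  }
}
```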
df.select("id", "name", "email")
df.drop("internal_id", "debug_flag") // exclude columns
df.rename({ user_id: "userId", created_at: "createdAt" })
df.sort("amount", "desc")
df.orderBy("name", "asc") // alias for sort()
df.limit(100)
df.offset(50)
df.limit(100).offset(50) // pagination
df.groupBy("region")
  .aggregate("sum", "amount", "total")
  .aggregate("count", "id", "cnt")
  .aggregate("avg", "score", "avg_score")

Supported functions: sum, avg, min, max, count, count_distinct, stddev, variance, median, percentile.
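In plain TypeScript, the grouped sum/count chain above amounts to something like this (a sketch, not the engine's implementation; column names mirror the example):

```typescript
// Sketch: groupBy("region") with sum(amount) -> total and count(id) -> cnt.
type Row = Record<string, number | string>

function groupAggregate(rows: Row[], key: string): Row[] {
  const groups = new Map<string, { total: number; cnt: number }>()
  for (const row of rows) {
    const k = String(row[key])
    const g = groups.get(k) ?? { total: 0, cnt: 0 }
    g.total += row.amount as number
    g.cnt += 1
    groups.set(k, g)
  }
  return [...groups.entries()].map(([region, g]) => ({ region, total: g.total, cnt: g.cnt }))
}
```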

const orders = qm.table("orders")
const users = qm.table("users")
orders.join(users, { left: "user_id", right: "id" }, "inner")
orders.join(users, { left: "user_id", right: "id" }, "left")

Join types: inner, left, right, full, cross.

Hash joins spill to R2 when the memory budget is exceeded (Grace hash partitioning).
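The idea behind Grace hash partitioning, in miniature: split both inputs into buckets by join-key hash, then build and probe one bucket pair at a time, so only one bucket ever needs to be in memory. This sketch is illustrative (in the real engine, oversized buckets would spill to R2 rather than stay in arrays):

```typescript
// Simple string hash for bucketing rows by join key.
function bucketOf(key: unknown, n: number): number {
  const s = String(key)
  let h = 0
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) | 0
  return Math.abs(h) % n
}

function graceJoin(
  left: Record<string, unknown>[], leftKey: string,
  right: Record<string, unknown>[], rightKey: string,
  nBuckets = 4,
): Record<string, unknown>[] {
  // Phase 1: partition both inputs by join-key hash.
  const lb: Record<string, unknown>[][] = Array.from({ length: nBuckets }, () => [])
  const rb: Record<string, unknown>[][] = Array.from({ length: nBuckets }, () => [])
  for (const row of left) lb[bucketOf(row[leftKey], nBuckets)].push(row)
  for (const row of right) rb[bucketOf(row[rightKey], nBuckets)].push(row)

  // Phase 2: join each bucket pair independently (build right, probe left).
  const out: Record<string, unknown>[] = []
  for (let i = 0; i < nBuckets; i++) {
    const table = new Map<unknown, Record<string, unknown>[]>()
    for (const row of rb[i]) {
      let arr = table.get(row[rightKey])
      if (!arr) { arr = []; table.set(row[rightKey], arr) }
      arr.push(row)
    }
    for (const l of lb[i]) {
      for (const r of table.get(l[leftKey]) ?? []) out.push({ ...r, ...l })
    }
  }
  return out
}
```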

df.window({
  fn: "row_number",
  partitionBy: ["region"],
  orderBy: [{ column: "amount", direction: "desc" }],
  alias: "rank",
})

Supported: row_number, rank, dense_rank, lag, lead, rolling sum/avg/min/max/count with frame specification.

Rolling aggregates with frame:

df.window({
  fn: "avg",
  column: "amount",
  partitionBy: ["region"],
  orderBy: [{ column: "created_at", direction: "asc" }],
  alias: "rolling_avg",
  frame: { type: "rows", start: -2, end: 0 }, // 3-row trailing window
})

Frame bounds: numeric offsets (negative = preceding, positive = following), "unbounded", or "current".
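A rows frame of `{ start: -2, end: 0 }` means: for each row, aggregate from two rows before it through the current row, clamped at the partition edges. A sketch of that computation (illustrative, not the engine's code):

```typescript
// Rolling average over a rows frame [start, end] relative to each row,
// assuming `values` is already ordered within its partition.
function rollingAvg(values: number[], start: number, end: number): number[] {
  return values.map((_, i) => {
    const lo = Math.max(0, i + start)                 // clamp at partition start
    const hi = Math.min(values.length - 1, i + end)   // clamp at partition end
    const win = values.slice(lo, hi + 1)
    return win.reduce((a, b) => a + b, 0) / win.length
  })
}
```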

df.computed("total", (row) => (row.price as number) * (row.qty as number))
df1.union(df2) // UNION (deduplicates)
df1.union(df2, true) // UNION ALL (keeps duplicates)
df1.intersect(df2)
df1.except(df2)
df.where("age", "gt", 25) // alias for filter()
df.whereIn("status", ["active", "trial"]) // shorthand for in
df.whereNotIn("role", ["banned"]) // shorthand for not_in
df.whereBetween("score", 80, 100) // shorthand for between
df.whereNotBetween("age", 0, 17) // shorthand for not_between
df.whereLike("name", "%Alice%") // shorthand for like
df.whereNotLike("email", "%spam%") // shorthand for not_like
df.filterIn("region", subquery) // IN subquery (semi-join)
// Cursor-based pagination (efficient for large offsets)
df.sort("id", "asc").after(lastSeenId).limit(50)
df.version(3) // query a specific Lance dataset version
df.cache({ ttl: 60_000 }) // cache results for 60 seconds
df.distinct("region", "category")
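Why `after(lastSeenId)` beats a large `offset`: keyset pagination replaces "scan and discard N rows" with a filter on the sort key, so each page seeks directly past the cursor. Conceptually (a sketch, not the library's implementation):

```typescript
// Keyset (cursor) pagination: filter past the cursor on the sort key,
// then take one page. No rows before the cursor are materialized.
function keysetPage(rows: { id: number }[], after: number, limit: number): { id: number }[] {
  return rows
    .filter((r) => r.id > after)      // seek past the cursor
    .sort((a, b) => a.id - b.id)      // sort("id", "asc")
    .slice(0, limit)                  // limit(n)
}
```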

These execute the query and return results:

| Method | Returns | Description |
| --- | --- | --- |
| collect() | QueryResult | Execute and return all matching rows |
| exec() | QueryResult | Alias for collect() |
| first() | Row \| null | Return first matching row |
| count() | number | Count matching rows |
| exists() | boolean | Check if any rows match |
| explain() | ExplainResult | Show execution plan without running |
| describe() | Table schema | Column names, types, row counts |
| head(n) | Row[] | First N rows (shorthand for limit + collect) |
| shape() | { rows, columns } | Row and column count |
| dtypes() | Record<string, string> | Column name → type map |
| sample(n) | Row[] | Random sample of N rows |
| valueCounts(col) | { value, count }[] | Frequency counts sorted desc |
| toJSON(opts?) | string | Export as JSON string |
| toCSV(opts?) | string | Export as CSV string |
| show(n?) | void | Print first N rows to console |
| lazy() | LazyResultHandle | Page-at-a-time iteration without full materialization |
| stream(batchSize) | AsyncGenerator<Row[]> | Async generator over result batches |
| cursor({ batchSize }) | AsyncIterable<Row[]> | Async iterable over batches |
| materialize() | DataFrame | Execute and re-wrap as in-memory DataFrame |
| materializeAs(name, opts?) | DataFrame | Execute, write to named table, return DataFrame for next stage |
| append(rows, opts?) | AppendResult | Write rows (CAS-coordinated) |
| dropTable() | DropResult | Delete table and all fragments |
| execStream() | ReadableStream<Row> | Streaming columnar binary (remote executor) |
| toOperator() | Operator | Escape hatch: get raw operator pipeline |
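The batch-oriented terminals (`stream`, `cursor`) yield rows in fixed-size chunks rather than one big array. The shape of that contract can be sketched with a plain async generator (illustrative; `batches` and `consume` are not library functions):

```typescript
// Yield rows in fixed-size batches, the shape stream(batchSize) exposes.
async function* batches<T>(rows: T[], batchSize: number): AsyncGenerator<T[]> {
  for (let i = 0; i < rows.length; i += batchSize) {
    yield rows.slice(i, i + batchSize)
  }
}

// Consumers use for-await and never hold the full result set at once.
async function consume<T>(rows: T[], batchSize: number): Promise<number> {
  let n = 0
  for await (const batch of batches(rows, batchSize)) n += batch.length
  return n
}
```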

Familiar methods for interactive data work:

// Shape and schema
const { rows, columns } = await df.shape() // { rows: 50000, columns: 8 }
const types = await df.dtypes() // { id: "int64", name: "utf8", ... }
// Preview
await df.show(10) // console.table() first 10 rows
const sample = await df.sample(5) // random 5 rows
// Frequency analysis
const counts = await df.valueCounts("region") // [{ value: "us", count: 12000 }, ...]
// Clean and transform
df.fillNull("score", 0) // replace nulls
df.cast("amount", "number") // type conversion
// Export
const json = await df.toJSON({ pretty: true })
const csv = await df.toCSV({ delimiter: "\t" })

These are all sugar over existing methods — shape() calls describe(), valueCounts() uses groupBy().aggregate(), fillNull() and cast() use computed().
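For instance, `valueCounts(col)` is just group-by-column, count, sort descending. A standalone sketch of that equivalence (not the library's code):

```typescript
// valueCounts(col) as group -> count -> sort desc, matching the sugar above.
function valueCounts(
  rows: Record<string, unknown>[],
  col: string,
): { value: unknown; count: number }[] {
  const counts = new Map<unknown, number>()
  for (const row of rows) counts.set(row[col], (counts.get(row[col]) ?? 0) + 1)
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
}
```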

These methods inspect the query without executing it:

| Method | Returns | Description |
| --- | --- | --- |
| toCode(opts?) | string | Convert query to fluent builder TypeScript code |
| toDescriptor() | QueryDescriptor | Serializable query plan (JSON) |

Convert any query back to readable, copy-pasteable fluent builder TypeScript. Useful for logging, debugging, and LLM context compression (~50x smaller than raw QueryDescriptor JSON).

const df = qm
  .table("orders")
  .filter("category", "eq", "Electronics")
  .filter("amount", "gte", 100)
  .select("id", "amount", "region")
  .sort("amount", "desc")
  .limit(20)
console.log(df.toCode())
// const result = await qm
//   .table("orders")
//   .filter("category", "eq", "Electronics")
//   .filter("amount", "gte", 100)
//   .select("id", "amount", "region")
//   .sort("amount", "desc")
//   .limit(20)
//   .collect()

Customize the output variable name and table function:

df.toCode({ variableName: "data", tableFn: "db" })
// const data = await db
//   .table("orders")
//   ...
// Or use the static method on a raw descriptor:
import { descriptorToCode } from "querymode"
descriptorToCode(someDescriptor)

Handles all features: filters (all 14 ops with shorthand methods), aggregates, groupBy, joins, window functions, set operations, vector search, distinct, sort, limit/offset, version, cache, computed columns, pipe stages.

Add or drop columns on an existing dataset:

// Add a column with a default value (applied to all existing rows)
await qm.table("users").addColumn("tier", "utf8", "free")
// Add a nullable column (default: null)
await qm.table("users").addColumn("notes", "utf8")
// Drop a column
await qm.table("users").dropColumn("legacy_field")

Both return a SchemaEvolutionResult:

{
  table: "users",
  operation: "add_column", // or "drop_column"
  column: "tier",
  columnsAfter: ["id", "name", "email", "tier"]
}

Schema changes apply to the in-memory dataset (MaterializedExecutor). On edge mode, schema evolution coordinates through MasterDO.
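The add-column-with-default behavior can be modeled in a few lines: the default is backfilled onto every existing row, and the result reports the schema after the change (a sketch with hypothetical `Table`/`addColumn` shapes, not the library's internals):

```typescript
// Sketch: adding a column backfills a default onto existing rows and
// returns a result shaped like SchemaEvolutionResult above.
type Table = { name: string; columns: string[]; rows: Record<string, unknown>[] }

function addColumn(table: Table, column: string, defaultValue: unknown = null) {
  table.columns.push(column)
  for (const row of table.rows) row[column] = defaultValue // backfill existing rows
  return {
    table: table.name,
    operation: "add_column" as const,
    column,
    columnsAfter: [...table.columns],
  }
}
```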

For long-running queries, track progress with a callback:

const result = await df.collect({
  onProgress: (info) => {
    console.log(`${info.rowsCollected} rows, ${info.bytesRead} bytes`)
  },
})
// Append rows
await qm.table("events").append([
  { id: 1, type: "click", created_at: "2026-03-06" },
])
// Partitioned write — each region value gets its own fragment
// Enables O(1) fragment lookup at query time via partition catalog
await qm.table("events").append(rows, {
  partitionBy: "region",
})
// Append to specific R2 path with metadata
await qm.table("enriched").append(rows, {
  path: "pipelines/job-abc/enriched.lance/",
  metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
})
// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()
// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()

See Write Path for details on CAS coordination, read-after-write consistency, and catalog integration.
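The spirit of CAS (compare-and-swap) coordination: a writer reads the current manifest version, prepares its new fragment, then commits only if the version is still what it read; otherwise it lost the race and must retry. This is a sketch of the protocol shape only (the real path runs through MasterDO and R2, with types that are assumptions here):

```typescript
// Sketch of a CAS commit over a dataset manifest: the append lands only
// if no concurrent writer advanced the version in between.
type Manifest = { version: number; fragments: string[] }

function casCommit(
  current: Manifest,
  expectedVersion: number,
  fragment: string,
): Manifest | null {
  if (current.version !== expectedVersion) return null // lost the race: caller retries
  return { version: current.version + 1, fragments: [...current.fragments, fragment] }
}
```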

const similar = await qm
  .table("images")
  .vector("embedding", queryVec, 10)
  .exec()

Search is IVF-PQ accelerated when an index is present and falls back to flat SIMD search otherwise.
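The flat fallback is a brute-force nearest-neighbor scan: compute the distance from the query to every vector and keep the k smallest. A minimal sketch (L2 distance, no SIMD; `flatSearch` is a hypothetical name):

```typescript
// Brute-force top-k by squared L2 distance. An IVF-PQ index would
// prune this scan to a few partitions of compressed codes.
function flatSearch(vectors: number[][], query: number[], k: number): number[] {
  const dist = (v: number[]) => v.reduce((s, x, i) => s + (x - query[i]) ** 2, 0)
  return vectors
    .map((v, idx) => ({ idx, d: dist(v) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k)
    .map((r) => r.idx)
}
```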