
DataFrame API

The DataFrame is a lazy query builder. Every method returns a new DataFrame (frames are immutable), and nothing executes until a terminal method is called.

df.filter("age", "gt", 25)
df.filter("status", "eq", "active")
df.filter("id", "in", [1, 2, 3])
df.filter("name", "like", "%alice%")
df.filter("amount", "between", [100, 500])
df.whereNotNull("email")
df.whereNull("deleted_at")
df.whereOr(
  [{ column: "region", op: "eq", value: "us" }],
  [{ column: "region", op: "eq", value: "eu" }],
)
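The laziness and immutability described above can be sketched in plain TypeScript. The `LazyFrame` class below is purely illustrative (not the library's internals): each method appends a pending operation to a fresh instance, and nothing runs until the terminal `collect()`.

```typescript
type Row = Record<string, unknown>;
type Op = (rows: Row[]) => Row[];

// Minimal sketch of a lazy, immutable query builder: each method returns
// a NEW frame carrying one more pending operation; nothing executes until
// the terminal collect() call folds the operations over the source rows.
class LazyFrame {
  constructor(private source: Row[], private ops: Op[] = []) {}

  private with(op: Op): LazyFrame {
    // Immutability: never mutate this.ops, always build a fresh instance.
    return new LazyFrame(this.source, [...this.ops, op]);
  }

  filter(column: string, op: "gt" | "eq", value: unknown): LazyFrame {
    return this.with((rows) =>
      rows.filter((r) =>
        op === "gt" ? (r[column] as number) > (value as number) : r[column] === value,
      ),
    );
  }

  limit(n: number): LazyFrame {
    return this.with((rows) => rows.slice(0, n));
  }

  collect(): Row[] {
    // Terminal method: only here do the queued operations actually run.
    return this.ops.reduce((rows, op) => op(rows), this.source);
  }
}
```

Because every call returns a fresh frame, a base frame can be reused to derive multiple independent queries without interference.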
| Operator | Description |
| --- | --- |
| `eq` | Equal |
| `neq` | Not equal |
| `gt` | Greater than |
| `gte` | Greater than or equal |
| `lt` | Less than |
| `lte` | Less than or equal |
| `in` | Membership in array |
| `not_in` | Not in array |
| `between` | Within range `[low, high]` |
| `not_between` | Outside range |
| `like` | SQL LIKE pattern (`%` = any, `_` = single char) |
| `not_like` | Negated LIKE |
| `is_null` | Value is null |
| `is_not_null` | Value is not null |
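For reference, the operator semantics can be expressed as a single predicate function. This is an illustrative sketch of the semantics in the table above, not the engine's evaluator; in particular, the LIKE translation here assumes case-sensitive matching.

```typescript
type Operand = unknown;

// SQL LIKE → RegExp: % matches any run of characters, _ matches one.
// Everything else is escaped so it matches literally.
function likeToRegex(pattern: string): RegExp {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/%/g, ".*").replace(/_/g, ".")}$`);
}

// Evaluate one filter operator against a single value (illustrative).
function matches(op: string, value: Operand, arg?: Operand): boolean {
  switch (op) {
    case "eq": return value === arg;
    case "neq": return value !== arg;
    case "gt": return (value as number) > (arg as number);
    case "gte": return (value as number) >= (arg as number);
    case "lt": return (value as number) < (arg as number);
    case "lte": return (value as number) <= (arg as number);
    case "in": return (arg as Operand[]).includes(value);
    case "not_in": return !(arg as Operand[]).includes(value);
    case "between": {
      const [lo, hi] = arg as [number, number];
      return (value as number) >= lo && (value as number) <= hi;
    }
    case "not_between": {
      const [lo, hi] = arg as [number, number];
      return (value as number) < lo || (value as number) > hi;
    }
    case "like": return likeToRegex(arg as string).test(value as string);
    case "not_like": return !likeToRegex(arg as string).test(value as string);
    case "is_null": return value === null || value === undefined;
    case "is_not_null": return value !== null && value !== undefined;
    default: throw new Error(`unknown operator: ${op}`);
  }
}
```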
df.select("id", "name", "email")
df.drop("internal_id", "debug_flag") // exclude columns
df.rename({ user_id: "userId", created_at: "createdAt" })
df.sort("amount", "desc")
df.orderBy("name", "asc") // alias for sort()
df.limit(100)
df.offset(50)
df.limit(100).offset(50) // pagination
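`limit()` and `offset()` compose into pagination: offset skips rows, then limit caps the page size. Over an in-memory array the equivalent computation is just a slice (illustrative helper, not part of the API):

```typescript
// What limit(n).offset(k) computes over an ordered result set:
// skip `offset` rows, then take at most `limit` rows.
function paginate<T>(rows: T[], limit: number, offset: number): T[] {
  return rows.slice(offset, offset + limit);
}
```

Page `n` (0-based) of size `s` is therefore `limit(s).offset(n * s)`.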
df.groupBy("region")
  .aggregate("sum", "amount", "total")
  .aggregate("count", "id", "cnt")
  .aggregate("avg", "score", "avg_score")

Supported functions: sum, avg, min, max, count, count_distinct, stddev, variance, median, percentile.
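As an illustration of the groupBy/aggregate pattern, here is what `groupBy("region").aggregate("sum", "amount", "total")` computes over in-memory rows (a sketch, not the engine's implementation):

```typescript
type Row = Record<string, unknown>;

// Group rows by `key`, sum `col` within each group, and emit one output
// row per group with the sum under `alias`.
function groupSum(rows: Row[], key: string, col: string, alias: string): Row[] {
  const totals = new Map<unknown, number>();
  for (const r of rows) {
    totals.set(r[key], (totals.get(r[key]) ?? 0) + (r[col] as number));
  }
  return [...totals].map(([k, v]) => ({ [key]: k, [alias]: v }));
}
```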

const orders = qm.table("orders")
const users = qm.table("users")
orders.join(users, { left: "user_id", right: "id" }, "inner")
orders.join(users, { left: "user_id", right: "id" }, "left")

Join types: inner, left, right, full, cross.

Hash joins spill to R2 when the memory budget is exceeded (Grace hash partitioning).
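A minimal sketch of Grace hash partitioning, with in-memory partitions standing in for R2 spill files; the hash function, partition count, and names here are illustrative, not the engine's code. Because equal keys hash to the same partition on both sides, each partition pair can be joined independently, one at a time, so only one partition's build table ever needs to fit in memory.

```typescript
type Row = Record<string, unknown>;

// Toy string hash; the real engine's hash function will differ.
function hashKey(v: unknown): number {
  const s = String(v);
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) | 0;
  return h >>> 0;
}

// Split rows into k partitions by hash(joinKey) % k.
function partition(rows: Row[], key: string, k: number): Row[][] {
  const parts: Row[][] = Array.from({ length: k }, () => []);
  for (const r of rows) parts[hashKey(r[key]) % k].push(r);
  return parts;
}

function graceHashJoin(left: Row[], right: Row[], lKey: string, rKey: string, k = 4): Row[] {
  const lParts = partition(left, lKey, k);
  const rParts = partition(right, rKey, k);
  const out: Row[] = [];
  for (let i = 0; i < k; i++) {
    // In the real engine a partition pair may be spilled to R2 and
    // re-loaded here; once loaded it is an ordinary in-memory hash join.
    const table = new Map<unknown, Row[]>();
    for (const r of rParts[i]) {
      const bucket = table.get(r[rKey]) ?? [];
      bucket.push(r);
      table.set(r[rKey], bucket);
    }
    for (const l of lParts[i]) {
      for (const r of table.get(l[lKey]) ?? []) out.push({ ...r, ...l });
    }
  }
  return out;
}
```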

df.window({
  fn: "row_number",
  partitionBy: ["region"],
  orderBy: [{ column: "amount", direction: "desc" }],
  alias: "rank",
})

Supported: row_number, rank, dense_rank, lag, lead, and rolling sum/avg/min/max/count with frame specification.
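The `row_number` window call above can be sketched over in-memory rows (illustrative only; the engine evaluates windows inside its query plan): partition by one column, sort each partition descending by another, and number rows 1..n within each partition.

```typescript
type Row = Record<string, unknown>;

// row_number() over (partition by `partitionBy` order by `orderBy` desc) as `alias`
function rowNumber(rows: Row[], partitionBy: string, orderBy: string, alias: string): Row[] {
  const parts = new Map<unknown, Row[]>();
  for (const r of rows) {
    const bucket = parts.get(r[partitionBy]) ?? [];
    bucket.push(r);
    parts.set(r[partitionBy], bucket);
  }
  const out: Row[] = [];
  for (const bucket of parts.values()) {
    bucket
      .slice() // don't mutate the partition buckets
      .sort((a, b) => (b[orderBy] as number) - (a[orderBy] as number)) // desc
      .forEach((r, i) => out.push({ ...r, [alias]: i + 1 }));
  }
  return out;
}
```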

df.computed("total", (row) => (row.price as number) * (row.qty as number))
df1.union(df2) // UNION (deduplicates)
df1.union(df2, true) // UNION ALL (keeps duplicates)
df1.intersect(df2)
df1.except(df2)
df.distinct("region", "category")
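As a sketch of the set-operation semantics, `union` deduplicates whole rows unless the UNION ALL flag is passed. Keying rows by their serialized form models this (illustrative; it assumes rows share a consistent key order, which the real engine does not need to):

```typescript
type Row = Record<string, unknown>;

// df1.union(df2) dedupes; df1.union(df2, true) is UNION ALL.
function union(a: Row[], b: Row[], all = false): Row[] {
  const combined = [...a, ...b];
  if (all) return combined; // UNION ALL keeps duplicates
  const seen = new Set<string>();
  return combined.filter((r) => {
    const key = JSON.stringify(r); // assumes stable key order across rows
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```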

These execute the query and return results:

| Method | Returns | Description |
| --- | --- | --- |
| `collect()` | `QueryResult` | Execute and return all matching rows |
| `exec()` | `QueryResult` | Alias for `collect()` |
| `first()` | `Row \| null` | Return first matching row |
| `count()` | `number` | Count matching rows |
| `exists()` | `boolean` | Check if any rows match |
| `explain()` | `ExplainResult` | Show execution plan without running |
| `describe()` | Table schema | Column names, types, row counts |
| `head(n)` | `Row[]` | First N rows (shorthand for limit + collect) |
| `shape()` | `{ rows, columns }` | Row and column count |
| `dtypes()` | `Record<string, string>` | Column name → type map |
| `sample(n)` | `Row[]` | Random sample of N rows |
| `valueCounts(col)` | `{ value, count }[]` | Frequency counts sorted desc |
| `toJSON(opts?)` | `string` | Export as JSON string |
| `toCSV(opts?)` | `string` | Export as CSV string |
| `show(n?)` | `void` | Print first N rows to console |
| `append(rows, opts?)` | `AppendResult` | Write rows (CAS-coordinated) |
| `dropTable()` | `DropResult` | Delete table and all fragments |

Familiar methods for interactive data work:

// Shape and schema
const { rows, columns } = await df.shape() // { rows: 50000, columns: 8 }
const types = await df.dtypes() // { id: "int64", name: "utf8", ... }
// Preview
await df.show(10) // console.table() first 10 rows
const sample = await df.sample(5) // random 5 rows
// Frequency analysis
const counts = await df.valueCounts("region") // [{ value: "us", count: 12000 }, ...]
// Clean and transform
df.fillNull("score", 0) // replace nulls
df.cast("amount", "number") // type conversion
// Export
const json = await df.toJSON({ pretty: true })
const csv = await df.toCSV({ delimiter: "\t" })

These are all sugar over existing methods — shape() calls explain(), valueCounts() uses groupBy().aggregate(), fillNull() and cast() use computed().
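For instance, `valueCounts()` is equivalent to a group-by count sorted descending by frequency. A plain-TypeScript model of that equivalence (illustrative, not the library's code):

```typescript
type Row = Record<string, unknown>;

// valueCounts(col) ≡ groupBy(col).aggregate("count", ...) sorted desc.
function valueCounts(rows: Row[], col: string): { value: unknown; count: number }[] {
  const counts = new Map<unknown, number>();
  for (const r of rows) counts.set(r[col], (counts.get(r[col]) ?? 0) + 1);
  return [...counts]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count);
}
```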

For long-running queries, track progress with a callback:

const result = await df.collect({
  onProgress: (info) => {
    console.log(`${info.rowsCollected} rows, ${info.bytesRead} bytes`)
  },
})
// Append rows
await qm.table("events").append([
  { id: 1, type: "click", created_at: "2026-03-06" },
])
// Append to specific R2 path with metadata
await qm.table("enriched").append(rows, {
  path: "pipelines/job-abc/enriched.lance/",
  metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
})
// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()

See Write Path for details on CAS coordination, read-after-write consistency, and catalog integration.

const similar = await qm
  .table("images")
  .vector("embedding", queryVec, 10)
  .exec()

Vector search is IVF-PQ accelerated when an index is present, and falls back to flat SIMD search otherwise.
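The flat fallback is a brute-force scan over every vector. A sketch of its ranking semantics using cosine similarity (the actual engine uses SIMD kernels and the index's configured distance metric; the names here are illustrative):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force top-k: score every vector, sort by similarity, keep k ids.
// IVF-PQ changes how candidates are found, not the ranking semantics.
function flatSearch(
  vectors: { id: number; embedding: number[] }[],
  query: number[],
  k: number,
): number[] {
  return vectors
    .map((v) => ({ id: v.id, score: cosine(v.embedding, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((v) => v.id);
}
```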