DataFrame API

The DataFrame is a lazy query builder. All methods return a new DataFrame (immutable). Nothing executes until a terminal method is called.
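The immutable, lazy pattern can be sketched in a few lines of TypeScript (illustrative names only; `SketchFrame` and its internals are assumptions, not the library's implementation):

```typescript
// Minimal sketch of an immutable lazy builder: every call copies the
// recorded plan, and no data is touched until a terminal method runs.
type Op = { kind: "filter"; column: string; op: string; value: unknown }

class SketchFrame {
  private constructor(readonly plan: readonly Op[]) {}
  static table(): SketchFrame { return new SketchFrame([]) }

  // Returns a NEW frame; `this` is never mutated.
  filter(column: string, op: string, value: unknown): SketchFrame {
    return new SketchFrame([...this.plan, { kind: "filter", column, op, value }])
  }

  // Terminal method: only here does the plan execute against rows.
  collect(rows: Record<string, unknown>[]): Record<string, unknown>[] {
    return this.plan.reduce(
      (acc, step) => acc.filter((r) => step.op === "gt"
        ? (r[step.column] as number) > (step.value as number)
        : r[step.column] === step.value),
      rows,
    )
  }
}
```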

df.filter("age", "gt", 25)
df.filter("status", "eq", "active")
df.filter("id", "in", [1, 2, 3])
df.filter("name", "like", "%Alice%")
df.filter("amount", "between", [100, 500])
df.whereNotNull("email")
df.whereNull("deleted_at")
df.whereOr(
  [{ column: "region", op: "eq", value: "us" }],
  [{ column: "region", op: "eq", value: "eu" }],
)
| Operator | Description |
| --- | --- |
| eq | Equal |
| neq | Not equal |
| gt | Greater than |
| gte | Greater than or equal |
| lt | Less than |
| lte | Less than or equal |
| in | Membership in array |
| not_in | Not in array |
| between | Within range [low, high] |
| not_between | Outside range |
| like | SQL LIKE pattern (% = any, _ = single char) |
| not_like | Negated LIKE |
| is_null | Value is null |
| is_not_null | Value is not null |
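A sketch of how these operators evaluate against a single value (the `matches` helper and the LIKE-to-regex translation are illustrative assumptions, not the engine's code):

```typescript
// SQL LIKE semantics: % matches any run of characters, _ matches one.
function likeToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  return new RegExp("^" + escaped.replace(/%/g, ".*").replace(/_/g, ".") + "$")
}

// Evaluate one filter operator against a row value.
function matches(value: unknown, op: string, operand?: unknown): boolean {
  switch (op) {
    case "eq": return value === operand
    case "neq": return value !== operand
    case "gt": return (value as number) > (operand as number)
    case "gte": return (value as number) >= (operand as number)
    case "lt": return (value as number) < (operand as number)
    case "lte": return (value as number) <= (operand as number)
    case "in": return (operand as unknown[]).includes(value)
    case "not_in": return !(operand as unknown[]).includes(value)
    case "between": {
      const [lo, hi] = operand as [number, number]
      return (value as number) >= lo && (value as number) <= hi
    }
    case "not_between": {
      const [lo, hi] = operand as [number, number]
      return (value as number) < lo || (value as number) > hi
    }
    case "like": return likeToRegExp(operand as string).test(String(value))
    case "not_like": return !likeToRegExp(operand as string).test(String(value))
    case "is_null": return value == null
    case "is_not_null": return value != null
    default: throw new Error(`unknown operator: ${op}`)
  }
}
```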
df.select("id", "name", "email")
df.drop("internal_id", "debug_flag") // exclude columns
df.rename({ user_id: "userId", created_at: "createdAt" })
df.sort("amount", "desc")
df.orderBy("name", "asc") // alias for sort()
df.limit(100)
df.offset(50)
df.limit(100).offset(50) // pagination
df.groupBy("region")
  .aggregate("sum", "amount", "total")
  .aggregate("count", "id", "cnt")
  .aggregate("avg", "score", "avg_score")

Supported functions: sum, avg, min, max, count, count_distinct, stddev, variance, median, percentile.
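In plain TypeScript, the grouped sum/count chain above amounts to something like this (a sketch, not the engine's implementation; column names mirror the example):

```typescript
// Sketch: groupBy("region") with sum(amount) -> total and count(id) -> cnt.
type Row = Record<string, number | string>

function groupAggregate(rows: Row[], key: string): Row[] {
  const groups = new Map<string, { total: number; cnt: number }>()
  for (const row of rows) {
    const k = String(row[key])
    const g = groups.get(k) ?? { total: 0, cnt: 0 }
    g.total += row.amount as number
    g.cnt += 1
    groups.set(k, g)
  }
  return [...groups.entries()].map(([region, g]) => ({ region, total: g.total, cnt: g.cnt }))
}
```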

const orders = qm.table("orders")
const users = qm.table("users")
orders.join(users, { left: "user_id", right: "id" }, "inner")
orders.join(users, { left: "user_id", right: "id" }, "left")

Join types: inner, left, right, full, cross.

Hash joins spill to R2 when the memory budget is exceeded (Grace hash partitioning).
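The idea behind Grace hash partitioning, in miniature: split both inputs into buckets by join-key hash, then build and probe one bucket pair at a time, so only one bucket ever needs to be in memory. This sketch is illustrative (in the real engine, oversized buckets would spill to R2 rather than stay in arrays):

```typescript
// Simple string hash for bucketing rows by join key.
function bucketOf(key: unknown, n: number): number {
  const s = String(key)
  let h = 0
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) | 0
  return Math.abs(h) % n
}

function graceJoin(
  left: Record<string, unknown>[], leftKey: string,
  right: Record<string, unknown>[], rightKey: string,
  nBuckets = 4,
): Record<string, unknown>[] {
  // Phase 1: partition both inputs by join-key hash.
  const lb: Record<string, unknown>[][] = Array.from({ length: nBuckets }, () => [])
  const rb: Record<string, unknown>[][] = Array.from({ length: nBuckets }, () => [])
  for (const row of left) lb[bucketOf(row[leftKey], nBuckets)].push(row)
  for (const row of right) rb[bucketOf(row[rightKey], nBuckets)].push(row)

  // Phase 2: join each bucket pair independently (build right, probe left).
  const out: Record<string, unknown>[] = []
  for (let i = 0; i < nBuckets; i++) {
    const table = new Map<unknown, Record<string, unknown>[]>()
    for (const row of rb[i]) {
      let arr = table.get(row[rightKey])
      if (!arr) { arr = []; table.set(row[rightKey], arr) }
      arr.push(row)
    }
    for (const l of lb[i]) {
      for (const r of table.get(l[leftKey]) ?? []) out.push({ ...r, ...l })
    }
  }
  return out
}
```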

df.window({
  fn: "row_number",
  partitionBy: ["region"],
  orderBy: [{ column: "amount", direction: "desc" }],
  alias: "rank",
})

Supported: row_number, rank, dense_rank, lag, lead, rolling sum/avg/min/max/count with frame specification.

Rolling aggregates with frame:

df.window({
  fn: "avg",
  column: "amount",
  partitionBy: ["region"],
  orderBy: [{ column: "created_at", direction: "asc" }],
  alias: "rolling_avg",
  frame: { type: "rows", start: -2, end: 0 }, // 3-row trailing window
})

Frame bounds: numeric offsets (negative = preceding, positive = following), "unbounded", or "current".
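A rows frame of `{ start: -2, end: 0 }` means: for each row, aggregate from two rows before it through the current row, clamped at the partition edges. A sketch of that computation (illustrative, not the engine's code):

```typescript
// Rolling average over a rows frame [start, end] relative to each row,
// assuming `values` is already ordered within its partition.
function rollingAvg(values: number[], start: number, end: number): number[] {
  return values.map((_, i) => {
    const lo = Math.max(0, i + start)                 // clamp at partition start
    const hi = Math.min(values.length - 1, i + end)   // clamp at partition end
    const win = values.slice(lo, hi + 1)
    return win.reduce((a, b) => a + b, 0) / win.length
  })
}
```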

df.computed("total", (row) => (row.price as number) * (row.qty as number))
df1.union(df2) // UNION (deduplicates)
df1.union(df2, true) // UNION ALL (keeps duplicates)
df1.intersect(df2)
df1.except(df2)
df.where("age", "gt", 25) // alias for filter()
df.whereIn("status", ["active", "trial"]) // shorthand for in
df.whereNotIn("role", ["banned"]) // shorthand for not_in
df.whereBetween("score", 80, 100) // shorthand for between
df.whereNotBetween("age", 0, 17) // shorthand for not_between
df.whereLike("name", "%Alice%") // shorthand for like
df.whereNotLike("email", "%spam%") // shorthand for not_like
df.filterIn("region", subquery) // IN subquery (semi-join)
// Cursor-based pagination (efficient for large offsets)
df.sort("id", "asc").after(lastSeenId).limit(50)
df.version(3) // query a specific Lance dataset version
df.cache({ ttl: 60_000 }) // cache results for 60 seconds
df.distinct("region", "category")
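Why `after(lastSeenId)` beats a large `offset`: keyset pagination replaces "scan and discard N rows" with a filter on the sort key, so each page seeks directly past the cursor. Conceptually (a sketch, not the library's implementation):

```typescript
// Keyset (cursor) pagination: filter past the cursor on the sort key,
// then take one page. No rows before the cursor are materialized.
function keysetPage(rows: { id: number }[], after: number, limit: number): { id: number }[] {
  return rows
    .filter((r) => r.id > after)      // seek past the cursor
    .sort((a, b) => a.id - b.id)      // sort("id", "asc")
    .slice(0, limit)                  // limit(n)
}
```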

These execute the query and return results:

| Method | Returns | Description |
| --- | --- | --- |
| collect() | QueryResult | Execute and return all matching rows |
| exec() | QueryResult | Alias for collect() |
| first() | Row \| null | Return first matching row |
| count() | number | Count matching rows |
| exists() | boolean | Check if any rows match |
| explain() | ExplainResult | Show execution plan without running |
| describe() | Table schema | Column names, types, row counts |
| head(n) | Row[] | First N rows (shorthand for limit + collect) |
| shape() | { rows, columns } | Row and column count |
| dtypes() | Record<string, string> | Column name → type map |
| sample(n) | Row[] | Random sample of N rows |
| valueCounts(col) | { value, count }[] | Frequency counts sorted desc |
| toJSON(opts?) | string | Export as JSON string |
| toCSV(opts?) | string | Export as CSV string |
| show(n?) | void | Print first N rows to console |
| lazy() | LazyResultHandle | Page-at-a-time iteration without full materialization |
| stream(batchSize) | AsyncGenerator<Row[]> | Async generator over result batches |
| cursor({ batchSize }) | AsyncIterable<Row[]> | Async iterable over batches |
| materialize() | DataFrame | Execute and re-wrap as in-memory DataFrame |
| materializeAs(name, opts?) | DataFrame | Execute, write to named table, return DataFrame for next stage |
| append(rows, opts?) | AppendResult | Write rows (CAS-coordinated) |
| dropTable() | DropResult | Delete table and all fragments |
| execStream() | ReadableStream<Row> | Streaming columnar binary (remote executor) |
| toOperator() | Operator | Escape hatch: get raw operator pipeline |
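The batch-oriented terminals (`stream`, `cursor`) yield rows in fixed-size chunks rather than one big array. The shape of that contract can be sketched with a plain async generator (illustrative; `batches` and `consume` are not library functions):

```typescript
// Yield rows in fixed-size batches, the shape stream(batchSize) exposes.
async function* batches<T>(rows: T[], batchSize: number): AsyncGenerator<T[]> {
  for (let i = 0; i < rows.length; i += batchSize) {
    yield rows.slice(i, i + batchSize)
  }
}

// Consumers use for-await and never hold the full result set at once.
async function consume<T>(rows: T[], batchSize: number): Promise<number> {
  let n = 0
  for await (const batch of batches(rows, batchSize)) n += batch.length
  return n
}
```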

Familiar methods for interactive data work:

// Shape and schema
const { rows, columns } = await df.shape() // { rows: 50000, columns: 8 }
const types = await df.dtypes() // { id: "int64", name: "utf8", ... }
// Preview
await df.show(10) // console.table() first 10 rows
const sample = await df.sample(5) // random 5 rows
// Frequency analysis
const counts = await df.valueCounts("region") // [{ value: "us", count: 12000 }, ...]
// Clean and transform
df.fillNull("score", 0) // replace nulls
df.cast("amount", "number") // type conversion
// Export
const json = await df.toJSON({ pretty: true })
const csv = await df.toCSV({ delimiter: "\t" })

These are all sugar over existing methods — shape() calls describe(), valueCounts() uses groupBy().aggregate(), fillNull() and cast() use computed().
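For instance, `valueCounts(col)` is just group-by-column, count, sort descending. A standalone sketch of that equivalence (not the library's code):

```typescript
// valueCounts(col) as group -> count -> sort desc, matching the sugar above.
function valueCounts(
  rows: Record<string, unknown>[],
  col: string,
): { value: unknown; count: number }[] {
  const counts = new Map<unknown, number>()
  for (const row of rows) counts.set(row[col], (counts.get(row[col]) ?? 0) + 1)
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
}
```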

These methods inspect the query without executing it:

| Method | Returns | Description |
| --- | --- | --- |
| toCode(opts?) | string | Convert query to fluent builder TypeScript code |
| toDescriptor() | QueryDescriptor | Serializable query plan (JSON) |

Convert any query back to readable, copy-pasteable fluent builder TypeScript. Useful for logging, debugging, and LLM context compression (~50x smaller than raw QueryDescriptor JSON).

const df = qm
  .table("orders")
  .filter("category", "eq", "Electronics")
  .filter("amount", "gte", 100)
  .select("id", "amount", "region")
  .sort("amount", "desc")
  .limit(20)
console.log(df.toCode())
// const result = await qm
//   .table("orders")
//   .filter("category", "eq", "Electronics")
//   .filter("amount", "gte", 100)
//   .select("id", "amount", "region")
//   .sort("amount", "desc")
//   .limit(20)
//   .collect()

Customize the output variable name and table function:

df.toCode({ variableName: "data", tableFn: "db" })
// const data = await db
//   .table("orders")
//   ...
// Or use the static method on a raw descriptor:
import { descriptorToCode } from "querymode"
descriptorToCode(someDescriptor)

Handles all features: filters (all 14 ops with shorthand methods), aggregates, groupBy, joins, window functions, set operations, vector search, distinct, sort, limit/offset, version, cache, computed columns, pipe stages.

Add or drop columns on an existing dataset:

// Add a column with a default value (applied to all existing rows)
await qm.table("users").addColumn("tier", "utf8", "free")
// Add a nullable column (default: null)
await qm.table("users").addColumn("notes", "utf8")
// Drop a column
await qm.table("users").dropColumn("legacy_field")

Both return a SchemaEvolutionResult:

{
  table: "users",
  operation: "add_column", // or "drop_column"
  column: "tier",
  columnsAfter: ["id", "name", "email", "tier"]
}

Schema changes apply to the in-memory dataset (MaterializedExecutor). On edge mode, schema evolution coordinates through MasterDO.
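The add-column-with-default behavior can be modeled in a few lines: the default is backfilled onto every existing row, and the result reports the schema after the change (a sketch with hypothetical `Table`/`addColumn` shapes, not the library's internals):

```typescript
// Sketch: adding a column backfills a default onto existing rows and
// returns a result shaped like SchemaEvolutionResult above.
type Table = { name: string; columns: string[]; rows: Record<string, unknown>[] }

function addColumn(table: Table, column: string, defaultValue: unknown = null) {
  table.columns.push(column)
  for (const row of table.rows) row[column] = defaultValue // backfill existing rows
  return {
    table: table.name,
    operation: "add_column" as const,
    column,
    columnsAfter: [...table.columns],
  }
}
```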

For long-running queries, track progress with a callback:

const result = await df.collect({
  onProgress: (info) => {
    console.log(`${info.rowsCollected} rows, ${info.bytesRead} bytes`)
  },
})
// Append rows
await qm.table("events").append([
  { id: 1, type: "click", created_at: "2026-03-06" },
])
// Partitioned write — each region value gets its own fragment
// Enables O(1) fragment lookup at query time via partition catalog
await qm.table("events").append(rows, {
  partitionBy: "region",
})
// Append to specific R2 path with metadata
await qm.table("enriched").append(rows, {
  path: "pipelines/job-abc/enriched.lance/",
  metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
})
// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()
// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()

See Write Path for details on CAS coordination, read-after-write consistency, and catalog integration.
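The spirit of CAS (compare-and-swap) coordination: a writer reads the current manifest version, prepares its new fragment, then commits only if the version is still what it read; otherwise it lost the race and must retry. This is a sketch of the protocol shape only (the real path runs through MasterDO and R2, with types that are assumptions here):

```typescript
// Sketch of a CAS commit over a dataset manifest: the append lands only
// if no concurrent writer advanced the version in between.
type Manifest = { version: number; fragments: string[] }

function casCommit(
  current: Manifest,
  expectedVersion: number,
  fragment: string,
): Manifest | null {
  if (current.version !== expectedVersion) return null // lost the race: caller retries
  return { version: current.version + 1, fragments: [...current.fragments, fragment] }
}
```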

const similar = await qm
  .table("images")
  .vector("embedding", queryVec, 10)
  .exec()

Search is IVF-PQ accelerated when an index is present and falls back to flat SIMD search otherwise.
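The flat fallback is a brute-force nearest-neighbor scan: compute the distance from the query to every vector and keep the k smallest. A minimal sketch (L2 distance, no SIMD; `flatSearch` is a hypothetical name):

```typescript
// Brute-force top-k by squared L2 distance. An IVF-PQ index would
// prune this scan to a few partitions of compressed codes.
function flatSearch(vectors: number[][], query: number[], k: number): number[] {
  const dist = (v: number[]) => v.reduce((s, x, i) => s + (x - query[i]) ** 2, 0)
  return vectors
    .map((v, idx) => ({ idx, d: dist(v) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k)
    .map((r) => r.idx)
}
```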