
DataFrame API

The DataFrame is a lazy query builder. Every method returns a new DataFrame (frames are immutable), and nothing executes until a terminal method is called.

df.filter("age", "gt", 25)
df.filter("status", "eq", "active")
df.filter("id", "in", [1, 2, 3])
df.filter("name", "like", "%alice%")
df.filter("amount", "between", [100, 500])
df.whereNotNull("email")
df.whereNull("deleted_at")
df.whereOr(
  [{ column: "region", op: "eq", value: "us" }],
  [{ column: "region", op: "eq", value: "eu" }],
)
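The laziness and immutability described above can be sketched in plain TypeScript. The `LazyFrame` class below is purely illustrative (not the library's internals): each method appends a pending operation to a fresh instance, and nothing runs until the terminal `collect()`.

```typescript
type Row = Record<string, unknown>;
type Op = (rows: Row[]) => Row[];

// Minimal sketch of a lazy, immutable query builder: each method returns
// a NEW frame carrying one more pending operation; nothing executes until
// the terminal collect() call folds the operations over the source rows.
class LazyFrame {
  constructor(private source: Row[], private ops: Op[] = []) {}

  private with(op: Op): LazyFrame {
    // Immutability: never mutate this.ops, always build a fresh instance.
    return new LazyFrame(this.source, [...this.ops, op]);
  }

  filter(column: string, op: "gt" | "eq", value: unknown): LazyFrame {
    return this.with((rows) =>
      rows.filter((r) =>
        op === "gt" ? (r[column] as number) > (value as number) : r[column] === value,
      ),
    );
  }

  limit(n: number): LazyFrame {
    return this.with((rows) => rows.slice(0, n));
  }

  collect(): Row[] {
    // Terminal method: only here do the queued operations actually run.
    return this.ops.reduce((rows, op) => op(rows), this.source);
  }
}
```

Because every call returns a fresh frame, a base frame can be reused to derive multiple independent queries without interference.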
| Operator | Description |
| --- | --- |
| `eq` | Equal |
| `neq` | Not equal |
| `gt` | Greater than |
| `gte` | Greater than or equal |
| `lt` | Less than |
| `lte` | Less than or equal |
| `in` | Membership in array |
| `not_in` | Not in array |
| `between` | Within range `[low, high]` |
| `not_between` | Outside range |
| `like` | SQL LIKE pattern (`%` = any, `_` = single char) |
| `not_like` | Negated LIKE |
| `is_null` | Value is null |
| `is_not_null` | Value is not null |
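For reference, the operator semantics can be expressed as a single predicate function. This is an illustrative sketch of the semantics in the table above, not the engine's evaluator; in particular, the LIKE translation here assumes case-sensitive matching.

```typescript
type Operand = unknown;

// SQL LIKE → RegExp: % matches any run of characters, _ matches one.
// Everything else is escaped so it matches literally.
function likeToRegex(pattern: string): RegExp {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/%/g, ".*").replace(/_/g, ".")}$`);
}

// Evaluate one filter operator against a single value (illustrative).
function matches(op: string, value: Operand, arg?: Operand): boolean {
  switch (op) {
    case "eq": return value === arg;
    case "neq": return value !== arg;
    case "gt": return (value as number) > (arg as number);
    case "gte": return (value as number) >= (arg as number);
    case "lt": return (value as number) < (arg as number);
    case "lte": return (value as number) <= (arg as number);
    case "in": return (arg as Operand[]).includes(value);
    case "not_in": return !(arg as Operand[]).includes(value);
    case "between": {
      const [lo, hi] = arg as [number, number];
      return (value as number) >= lo && (value as number) <= hi;
    }
    case "not_between": {
      const [lo, hi] = arg as [number, number];
      return (value as number) < lo || (value as number) > hi;
    }
    case "like": return likeToRegex(arg as string).test(value as string);
    case "not_like": return !likeToRegex(arg as string).test(value as string);
    case "is_null": return value === null || value === undefined;
    case "is_not_null": return value !== null && value !== undefined;
    default: throw new Error(`unknown operator: ${op}`);
  }
}
```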
df.select("id", "name", "email")
df.drop("internal_id", "debug_flag") // exclude columns
df.rename({ user_id: "userId", created_at: "createdAt" })
df.sort("amount", "desc")
df.orderBy("name", "asc") // alias for sort()
df.limit(100)
df.offset(50)
df.limit(100).offset(50) // pagination
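`limit()` and `offset()` compose into pagination: offset skips rows, then limit caps the page size. Over an in-memory array the equivalent computation is just a slice (illustrative helper, not part of the API):

```typescript
// What limit(n).offset(k) computes over an ordered result set:
// skip `offset` rows, then take at most `limit` rows.
function paginate<T>(rows: T[], limit: number, offset: number): T[] {
  return rows.slice(offset, offset + limit);
}
```

Page `n` (0-based) of size `s` is therefore `limit(s).offset(n * s)`.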
df.groupBy("region")
  .aggregate("sum", "amount", "total")
  .aggregate("count", "id", "cnt")
  .aggregate("avg", "score", "avg_score")

Supported functions: sum, avg, min, max, count, count_distinct, stddev, variance, median, percentile.
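As an illustration of the groupBy/aggregate pattern, here is what `groupBy("region").aggregate("sum", "amount", "total")` computes over in-memory rows (a sketch, not the engine's implementation):

```typescript
type Row = Record<string, unknown>;

// Group rows by `key`, sum `col` within each group, and emit one output
// row per group with the sum under `alias`.
function groupSum(rows: Row[], key: string, col: string, alias: string): Row[] {
  const totals = new Map<unknown, number>();
  for (const r of rows) {
    totals.set(r[key], (totals.get(r[key]) ?? 0) + (r[col] as number));
  }
  return [...totals].map(([k, v]) => ({ [key]: k, [alias]: v }));
}
```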

const orders = qm.table("orders")
const users = qm.table("users")
orders.join(users, { left: "user_id", right: "id" }, "inner")
orders.join(users, { left: "user_id", right: "id" }, "left")

Join types: inner, left, right, full, cross.

Hash joins spill to R2 when the memory budget is exceeded (Grace hash partitioning).
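A minimal sketch of Grace hash partitioning, with in-memory partitions standing in for R2 spill files; the hash function, partition count, and names here are illustrative, not the engine's code. Because equal keys hash to the same partition on both sides, each partition pair can be joined independently, one at a time, so only one partition's build table ever needs to fit in memory.

```typescript
type Row = Record<string, unknown>;

// Toy string hash; the real engine's hash function will differ.
function hashKey(v: unknown): number {
  const s = String(v);
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) | 0;
  return h >>> 0;
}

// Split rows into k partitions by hash(joinKey) % k.
function partition(rows: Row[], key: string, k: number): Row[][] {
  const parts: Row[][] = Array.from({ length: k }, () => []);
  for (const r of rows) parts[hashKey(r[key]) % k].push(r);
  return parts;
}

function graceHashJoin(left: Row[], right: Row[], lKey: string, rKey: string, k = 4): Row[] {
  const lParts = partition(left, lKey, k);
  const rParts = partition(right, rKey, k);
  const out: Row[] = [];
  for (let i = 0; i < k; i++) {
    // In the real engine a partition pair may be spilled to R2 and
    // re-loaded here; once loaded it is an ordinary in-memory hash join.
    const table = new Map<unknown, Row[]>();
    for (const r of rParts[i]) {
      const bucket = table.get(r[rKey]) ?? [];
      bucket.push(r);
      table.set(r[rKey], bucket);
    }
    for (const l of lParts[i]) {
      for (const r of table.get(l[lKey]) ?? []) out.push({ ...r, ...l });
    }
  }
  return out;
}
```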

df.window({
  fn: "row_number",
  partitionBy: ["region"],
  orderBy: [{ column: "amount", direction: "desc" }],
  alias: "rank",
})

Supported: row_number, rank, dense_rank, lag, lead, and rolling sum/avg/min/max/count with frame specification.
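The `row_number` window call above can be sketched over in-memory rows (illustrative only; the engine evaluates windows inside its query plan): partition by one column, sort each partition descending by another, and number rows 1..n within each partition.

```typescript
type Row = Record<string, unknown>;

// row_number() over (partition by `partitionBy` order by `orderBy` desc) as `alias`
function rowNumber(rows: Row[], partitionBy: string, orderBy: string, alias: string): Row[] {
  const parts = new Map<unknown, Row[]>();
  for (const r of rows) {
    const bucket = parts.get(r[partitionBy]) ?? [];
    bucket.push(r);
    parts.set(r[partitionBy], bucket);
  }
  const out: Row[] = [];
  for (const bucket of parts.values()) {
    bucket
      .slice() // don't mutate the partition buckets
      .sort((a, b) => (b[orderBy] as number) - (a[orderBy] as number)) // desc
      .forEach((r, i) => out.push({ ...r, [alias]: i + 1 }));
  }
  return out;
}
```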

df.computed("total", (row) => (row.price as number) * (row.qty as number))
df1.union(df2) // UNION (deduplicates)
df1.union(df2, true) // UNION ALL (keeps duplicates)
df1.intersect(df2)
df1.except(df2)
df.distinct("region", "category")
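As a sketch of the set-operation semantics, `union` deduplicates whole rows unless the UNION ALL flag is passed. Keying rows by their serialized form models this (illustrative; it assumes rows share a consistent key order, which the real engine does not need to):

```typescript
type Row = Record<string, unknown>;

// df1.union(df2) dedupes; df1.union(df2, true) is UNION ALL.
function union(a: Row[], b: Row[], all = false): Row[] {
  const combined = [...a, ...b];
  if (all) return combined; // UNION ALL keeps duplicates
  const seen = new Set<string>();
  return combined.filter((r) => {
    const key = JSON.stringify(r); // assumes stable key order across rows
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```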

These execute the query and return results:

| Method | Returns | Description |
| --- | --- | --- |
| `collect()` | `QueryResult` | Execute and return all matching rows |
| `exec()` | `QueryResult` | Alias for `collect()` |
| `first()` | `Row \| null` | Return first matching row |
| `count()` | `number` | Count matching rows |
| `exists()` | `boolean` | Check if any rows match |
| `explain()` | `ExplainResult` | Show execution plan without running |
| `describe()` | Table schema | Column names, types, row counts |
| `head(n)` | `Row[]` | First N rows (shorthand for limit + collect) |
| `shape()` | `{ rows, columns }` | Row and column count |
| `dtypes()` | `Record<string, string>` | Column name → type map |
| `sample(n)` | `Row[]` | Random sample of N rows |
| `valueCounts(col)` | `{ value, count }[]` | Frequency counts sorted desc |
| `toJSON(opts?)` | `string` | Export as JSON string |
| `toCSV(opts?)` | `string` | Export as CSV string |
| `show(n?)` | `void` | Print first N rows to console |
| `append(rows, opts?)` | `AppendResult` | Write rows (CAS-coordinated) |
| `dropTable()` | `DropResult` | Delete table and all fragments |

Familiar methods for interactive data work:

// Shape and schema
const { rows, columns } = await df.shape() // { rows: 50000, columns: 8 }
const types = await df.dtypes() // { id: "int64", name: "utf8", ... }
// Preview
await df.show(10) // console.table() first 10 rows
const sample = await df.sample(5) // random 5 rows
// Frequency analysis
const counts = await df.valueCounts("region") // [{ value: "us", count: 12000 }, ...]
// Clean and transform
df.fillNull("score", 0) // replace nulls
df.cast("amount", "number") // type conversion
// Export
const json = await df.toJSON({ pretty: true })
const csv = await df.toCSV({ delimiter: "\t" })

These are all sugar over existing methods — shape() calls explain(), valueCounts() uses groupBy().aggregate(), fillNull() and cast() use computed().
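For instance, `valueCounts()` is equivalent to a group-by count sorted descending by frequency. A plain-TypeScript model of that equivalence (illustrative, not the library's code):

```typescript
type Row = Record<string, unknown>;

// valueCounts(col) ≡ groupBy(col).aggregate("count", ...) sorted desc.
function valueCounts(rows: Row[], col: string): { value: unknown; count: number }[] {
  const counts = new Map<unknown, number>();
  for (const r of rows) counts.set(r[col], (counts.get(r[col]) ?? 0) + 1);
  return [...counts]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count);
}
```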

For long-running queries, track progress with a callback:

const result = await df.collect({
  onProgress: (info) => {
    console.log(`${info.rowsCollected} rows, ${info.bytesRead} bytes`)
  },
})
// Append rows
await qm.table("events").append([
  { id: 1, type: "click", created_at: "2026-03-06" },
])
// Append to specific R2 path with metadata
await qm.table("enriched").append(rows, {
  path: "pipelines/job-abc/enriched.lance/",
  metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
})
// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()

See Write Path for details on CAS coordination, read-after-write consistency, and catalog integration.

const similar = await qm
  .table("images")
  .vector("embedding", queryVec, 10)
  .exec()

Vector search is IVF-PQ accelerated when an index is present, and falls back to flat SIMD search otherwise.
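The flat fallback is a brute-force scan over every vector. A sketch of its ranking semantics using cosine similarity (the actual engine uses SIMD kernels and the index's configured distance metric; the names here are illustrative):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force top-k: score every vector, sort by similarity, keep k ids.
// IVF-PQ changes how candidates are found, not the ranking semantics.
function flatSearch(
  vectors: { id: number; embedding: number[] }[],
  query: number[],
  k: number,
): number[] {
  return vectors
    .map((v) => ({ id: v.id, score: cosine(v.embedding, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((v) => v.id);
}
```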