# DataFrame API
The `DataFrame` is a lazy query builder: every method returns a new `DataFrame` (instances are immutable), and nothing executes until a terminal method is called.
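The lazy-builder pattern can be illustrated with a minimal standalone sketch. The `LazyFrame` class and its in-memory rows below are illustrative only, not the library's implementation:

```ts
type Row = Record<string, unknown>

// Minimal sketch of an immutable, lazy query builder: each method records
// an operation and returns a NEW instance; the data is untouched until
// the terminal collect() runs the recorded pipeline.
class LazyFrame {
  private constructor(
    private readonly rows: Row[],
    private readonly ops: Array<(rows: Row[]) => Row[]>,
  ) {}

  static from(rows: Row[]): LazyFrame {
    return new LazyFrame(rows, [])
  }

  filter(pred: (row: Row) => boolean): LazyFrame {
    return new LazyFrame(this.rows, [...this.ops, (rs) => rs.filter(pred)])
  }

  limit(n: number): LazyFrame {
    return new LazyFrame(this.rows, [...this.ops, (rs) => rs.slice(0, n)])
  }

  // Terminal method: applies all recorded operations in order.
  collect(): Row[] {
    return this.ops.reduce((rs, op) => op(rs), this.rows)
  }
}

const df = LazyFrame.from([{ age: 20 }, { age: 30 }, { age: 40 }])
const adults = df.filter((r) => (r.age as number) > 25) // nothing executed yet
console.log(adults.limit(1).collect()) // executes here
```

Because every call returns a fresh instance, intermediate frames like `adults` can be reused and branched without affecting each other.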
## Filtering

```ts
df.filter("age", "gt", 25)
df.filter("status", "eq", "active")
df.filter("id", "in", [1, 2, 3])
df.filter("name", "like", "%alice%")
df.filter("amount", "between", [100, 500])
df.whereNotNull("email")
df.whereNull("deleted_at")
df.whereOr(
  [{ column: "region", op: "eq", value: "us" }],
  [{ column: "region", op: "eq", value: "eu" }],
)
```

| Operator | Description |
|---|---|
| `eq` | Equal |
| `neq` | Not equal |
| `gt` | Greater than |
| `gte` | Greater than or equal |
| `lt` | Less than |
| `lte` | Less than or equal |
| `in` | Membership in array |
| `not_in` | Not in array |
| `between` | Within range `[low, high]` |
| `not_between` | Outside range `[low, high]` |
| `like` | SQL LIKE pattern (`%` = any run of characters, `_` = single character) |
| `not_like` | Negated LIKE |
| `is_null` | Value is null |
| `is_not_null` | Value is not null |
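The `like` semantics in the table above can be sketched by translating the pattern into an anchored regular expression. `likeToRegExp` is a hypothetical helper for illustration, not part of the library:

```ts
// Translate a SQL LIKE pattern to an anchored RegExp:
// % → .* (any run of characters), _ → . (single character);
// everything else is matched literally.
function likeToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  return new RegExp("^" + escaped.replace(/%/g, ".*").replace(/_/g, ".") + "$")
}

console.log(likeToRegExp("%alice%").test("malice!")) // true
console.log(likeToRegExp("a_c").test("abc"))         // true
console.log(likeToRegExp("a_c").test("abbc"))        // false
```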
## Projection

```ts
df.select("id", "name", "email")
df.drop("internal_id", "debug_flag") // exclude columns
df.rename({ user_id: "userId", created_at: "createdAt" })
```

## Sorting
```ts
df.sort("amount", "desc")
df.orderBy("name", "asc") // alias for sort()
```

## Limiting
```ts
df.limit(100)
df.offset(50)
df.limit(100).offset(50) // pagination
```

## Aggregation
```ts
df.groupBy("region")
  .aggregate("sum", "amount", "total")
  .aggregate("count", "id", "cnt")
  .aggregate("avg", "score", "avg_score")
```

Supported functions: `sum`, `avg`, `min`, `max`, `count`, `count_distinct`, `stddev`, `variance`, `median`, `percentile`.
## Joins

```ts
const orders = qm.table("orders")
const users = qm.table("users")

orders.join(users, { left: "user_id", right: "id" }, "inner")
orders.join(users, { left: "user_id", right: "id" }, "left")
```

Join types: `inner`, `left`, `right`, `full`, `cross`.
Hash joins spill to R2 when the memory budget is exceeded (Grace hash partitioning).
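Grace hash partitioning can be sketched in memory: both inputs are hashed into K partitions on the join key, and each partition pair is joined independently, so only one partition's hash table needs to fit in memory at a time. In the sketch below, plain arrays stand in for R2 spill files; this is an illustration of the technique, not the engine's code:

```ts
type Row = Record<string, unknown>

// Grace hash join sketch: partition both sides by hash(key) % k,
// then hash-join each partition pair independently.
function graceHashJoin(
  left: Row[], leftKey: string,
  right: Row[], rightKey: string,
  k = 4,
): Row[] {
  const hash = (v: unknown) => {
    const s = String(v)
    let h = 0
    for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0
    return h % k
  }
  const lParts: Row[][] = Array.from({ length: k }, () => [])
  const rParts: Row[][] = Array.from({ length: k }, () => [])
  for (const r of left) lParts[hash(r[leftKey])].push(r)
  for (const r of right) rParts[hash(r[rightKey])].push(r)

  const out: Row[] = []
  for (let p = 0; p < k; p++) {
    // Build a hash table on the right partition, probe with the left.
    const table = new Map<string, Row[]>()
    for (const r of rParts[p]) {
      const key = String(r[rightKey])
      const bucket = table.get(key)
      if (bucket) bucket.push(r)
      else table.set(key, [r])
    }
    for (const l of lParts[p]) {
      for (const r of table.get(String(l[leftKey])) ?? []) {
        out.push({ ...r, ...l })
      }
    }
  }
  return out
}
```

Matching keys always hash to the same partition, which is why each partition pair can be joined in isolation.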
## Window functions

```ts
df.window({
  fn: "row_number",
  partitionBy: ["region"],
  orderBy: [{ column: "amount", direction: "desc" }],
  alias: "rank",
})
```

Supported: `row_number`, `rank`, `dense_rank`, `lag`, `lead`, and rolling `sum`/`avg`/`min`/`max`/`count` with frame specification.
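What a rolling aggregate with a frame specification computes can be sketched directly. The `rollingSum` helper below is a standalone illustration of a `ROWS BETWEEN n PRECEDING AND CURRENT ROW` frame, not the library's API:

```ts
// Rolling sum over a frame of `preceding` rows plus the current row,
// i.e. ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW.
function rollingSum(values: number[], preceding: number): number[] {
  return values.map((_, i) => {
    const start = Math.max(0, i - preceding)
    return values.slice(start, i + 1).reduce((a, b) => a + b, 0)
  })
}

console.log(rollingSum([1, 2, 3, 4], 1)) // [1, 3, 5, 7]
```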
## Computed columns

```ts
df.computed("total", (row) => (row.price as number) * (row.qty as number))
```

## Set operations
```ts
df1.union(df2) // UNION (deduplicates)
df1.union(df2, true) // UNION ALL (keeps duplicates)
df1.intersect(df2)
df1.except(df2)
```

## Distinct
```ts
df.distinct("region", "category")
```

## Terminal methods
These execute the query and return results:
| Method | Returns | Description |
|---|---|---|
| `collect()` | `QueryResult` | Execute and return all matching rows |
| `exec()` | `QueryResult` | Alias for `collect()` |
| `first()` | `Row \| null` | Return the first matching row |
| `count()` | `number` | Count matching rows |
| `exists()` | `boolean` | Check whether any rows match |
| `explain()` | `ExplainResult` | Show the execution plan without running |
| `describe()` | Table schema | Column names, types, row counts |
| `head(n)` | `Row[]` | First N rows (shorthand for limit + collect) |
| `shape()` | `{ rows, columns }` | Row and column counts |
| `dtypes()` | `Record<string, string>` | Column name → type map |
| `sample(n)` | `Row[]` | Random sample of N rows |
| `valueCounts(col)` | `{ value, count }[]` | Frequency counts, sorted descending |
| `toJSON(opts?)` | `string` | Export as a JSON string |
| `toCSV(opts?)` | `string` | Export as a CSV string |
| `show(n?)` | `void` | Print the first N rows to the console |
| `append(rows, opts?)` | `AppendResult` | Write rows (CAS-coordinated) |
| `dropTable()` | `DropResult` | Delete the table and all fragments |
## Data exploration

Familiar methods for interactive data work:
```ts
// Shape and schema
const { rows, columns } = await df.shape() // { rows: 50000, columns: 8 }
const types = await df.dtypes() // { id: "int64", name: "utf8", ... }

// Preview
await df.show(10) // console.table() the first 10 rows
const sample = await df.sample(5) // random 5 rows

// Frequency analysis
const counts = await df.valueCounts("region") // [{ value: "us", count: 12000 }, ...]

// Clean and transform
df.fillNull("score", 0) // replace nulls
df.cast("amount", "number") // type conversion

// Export
const json = await df.toJSON({ pretty: true })
const csv = await df.toCSV({ delimiter: "\t" })
```

These are all sugar over existing methods: `shape()` calls `explain()`, `valueCounts()` uses `groupBy().aggregate()`, and `fillNull()` and `cast()` use `computed()`.
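The group-then-count shape behind `valueCounts()` can be sketched over plain rows. This standalone `valueCounts` function is an illustration of the equivalence, not the library's implementation:

```ts
type Row = Record<string, unknown>

// valueCounts(col) ≈ groupBy(col) + count aggregate, sorted descending.
function valueCounts(
  rows: Row[],
  col: string,
): { value: unknown; count: number }[] {
  const counts = new Map<unknown, number>()
  for (const row of rows) counts.set(row[col], (counts.get(row[col]) ?? 0) + 1)
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
}

const sampleRows = [{ region: "us" }, { region: "eu" }, { region: "us" }]
console.log(valueCounts(sampleRows, "region"))
// [{ value: "us", count: 2 }, { value: "eu", count: 1 }]
```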
## Progress tracking

For long-running queries, track progress with a callback:

```ts
const result = await df.collect({
  onProgress: (info) => {
    console.log(`${info.rowsCollected} rows, ${info.bytesRead} bytes`)
  },
})
```

## Write operations
```ts
// Append rows
await qm.table("events").append([
  { id: 1, type: "click", created_at: "2026-03-06" },
])

// Append to a specific R2 path with metadata
await qm.table("enriched").append(rows, {
  path: "pipelines/job-abc/enriched.lance/",
  metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
})

// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()
```

See Write Path for details on CAS coordination, read-after-write consistency, and catalog integration.
## Vector search

```ts
const similar = await qm
  .table("images")
  .vector("embedding", queryVec, 10)
  .exec()
```

Searches are IVF-PQ accelerated when an index is present, and fall back to flat SIMD search otherwise.
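The flat-search fallback amounts to scoring every stored vector against the query and keeping the top K. The `topK` helper below is a minimal cosine-similarity sketch of that brute-force path; it is illustrative only, not the engine's SIMD implementation:

```ts
// Flat (brute-force) top-K search by cosine similarity.
// Returns the indices of the k most similar vectors, best first.
function topK(query: number[], vectors: number[][], k: number): number[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((s, x, i) => s + x * b[i], 0)
  const norm = (a: number[]) => Math.sqrt(dot(a, a))
  const qn = norm(query)
  return vectors
    .map((v, index) => ({ index, score: dot(query, v) / (qn * norm(v)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.index)
}

console.log(topK([1, 0], [[0, 1], [1, 0], [0.9, 0.1]], 2)) // [1, 2]
```

An IVF-PQ index avoids this full scan by probing only a few clusters and scoring compressed codes, which is why it is preferred when present.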