# DataFrame API
The DataFrame is a lazy query builder. All methods return a new DataFrame (immutable). Nothing executes until a terminal method is called.
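The lazy, immutable contract can be sketched in a few lines of plain TypeScript. This is an illustration of the pattern, not querymode's actual implementation: each non-terminal method records a step and returns a new object; only the terminal method runs the pipeline.

```ts
type Row = Record<string, unknown>

class MiniFrame {
  // Each instance holds an immutable list of pending transforms.
  private constructor(
    private readonly rows: Row[],
    private readonly steps: Array<(rows: Row[]) => Row[]>,
  ) {}

  static from(rows: Row[]): MiniFrame {
    return new MiniFrame(rows, [])
  }

  // Non-terminal: returns a NEW frame; nothing executes yet.
  filter(col: string, value: unknown): MiniFrame {
    return new MiniFrame(this.rows, [
      ...this.steps,
      (rs) => rs.filter((r) => r[col] === value),
    ])
  }

  // Terminal: applies all recorded steps in order.
  collect(): Row[] {
    return this.steps.reduce((rs, step) => step(rs), this.rows)
  }
}
```

Because every call returns a fresh frame, a base query can be branched safely: filtering a derived frame never mutates the original.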
## Filtering

```ts
df.filter("age", "gt", 25)
df.filter("status", "eq", "active")
df.filter("id", "in", [1, 2, 3])
df.filter("name", "like", "%Alice%")
df.filter("amount", "between", [100, 500])
df.whereNotNull("email")
df.whereNull("deleted_at")
df.whereOr(
  [{ column: "region", op: "eq", value: "us" }],
  [{ column: "region", op: "eq", value: "eu" }],
)
```

| Operator | Description |
|---|---|
| eq | Equal |
| neq | Not equal |
| gt | Greater than |
| gte | Greater than or equal |
| lt | Less than |
| lte | Less than or equal |
| in | Membership in array |
| not_in | Not in array |
| between | Within range [low, high] |
| not_between | Outside range |
| like | SQL LIKE pattern (% = any, _ = single char) |
| not_like | Negated LIKE |
| is_null | Value is null |
| is_not_null | Value is not null |
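The like / not_like semantics above can be illustrated by translating a pattern to a regular expression. This is a sketch of the matching rule, not the engine's matcher:

```ts
// LIKE semantics: % matches any run of characters, _ matches exactly one.
function likeToRegex(pattern: string): RegExp {
  // Escape regex metacharacters first (% and _ are not among them),
  // then expand the LIKE wildcards.
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  return new RegExp("^" + escaped.replace(/%/g, ".*").replace(/_/g, ".") + "$")
}
```

So `df.filter("name", "like", "%Alice%")` keeps rows where the name contains "Alice" anywhere, while a pattern like `"A_c"` matches exactly three characters.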
## Projection

```ts
df.select("id", "name", "email")
df.drop("internal_id", "debug_flag") // exclude columns
df.rename({ user_id: "userId", created_at: "createdAt" })
```

## Sorting
```ts
df.sort("amount", "desc")
df.orderBy("name", "asc") // alias for sort()
```

## Limiting
```ts
df.limit(100)
df.offset(50)
df.limit(100).offset(50) // pagination
```

## Aggregation
```ts
df.groupBy("region")
  .aggregate("sum", "amount", "total")
  .aggregate("count", "id", "cnt")
  .aggregate("avg", "score", "avg_score")
```

Supported functions: sum, avg, min, max, count, count_distinct, stddev, variance, median, percentile.
## Joins

```ts
const orders = qm.table("orders")
const users = qm.table("users")

orders.join(users, { left: "user_id", right: "id" }, "inner")
orders.join(users, { left: "user_id", right: "id" }, "left")
```

Join types: inner, left, right, full, cross.
Hash joins spill to R2 when memory budget is exceeded (Grace hash partitioning).
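The partitioning idea behind a Grace hash join can be sketched in plain TypeScript. This in-memory illustration keeps every partition in arrays; the engine instead spills oversized partitions to R2 and joins them partition pair by partition pair:

```ts
type Row = Record<string, unknown>

// Deterministic hash so both sides land in matching partitions.
function hashKey(value: unknown, k: number): number {
  const s = String(value)
  let h = 0
  for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0
  return h % k
}

function graceHashJoin(
  left: Row[], leftKey: string,
  right: Row[], rightKey: string,
  k = 4,
): Row[] {
  // Phase 1: partition both inputs by the same hash function.
  const lParts: Row[][] = Array.from({ length: k }, () => [])
  const rParts: Row[][] = Array.from({ length: k }, () => [])
  for (const row of left) lParts[hashKey(row[leftKey], k)].push(row)
  for (const row of right) rParts[hashKey(row[rightKey], k)].push(row)

  // Phase 2: join each matching partition pair independently,
  // building a small hash table per partition.
  const out: Row[] = []
  for (let p = 0; p < k; p++) {
    const table = new Map<unknown, Row[]>()
    for (const r of rParts[p]) {
      const key = r[rightKey]
      if (!table.has(key)) table.set(key, [])
      table.get(key)!.push(r)
    }
    for (const l of lParts[p]) {
      for (const r of table.get(l[leftKey]) ?? []) out.push({ ...l, ...r })
    }
  }
  return out
}
```

Because matching keys always hash to the same partition, each partition pair fits a bounded memory budget regardless of total input size.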
## Window functions

```ts
df.window({
  fn: "row_number",
  partitionBy: ["region"],
  orderBy: [{ column: "amount", direction: "desc" }],
  alias: "rank",
})
```

Supported: row_number, rank, dense_rank, lag, lead, and rolling sum/avg/min/max/count with frame specification.
Rolling aggregates with a frame:

```ts
df.window({
  fn: "avg",
  column: "amount",
  partitionBy: ["region"],
  orderBy: [{ column: "created_at", direction: "asc" }],
  alias: "rolling_avg",
  frame: { type: "rows", start: -2, end: 0 }, // 3-row trailing window
})
```

Frame bounds: numeric offsets (negative = preceding, positive = following), "unbounded", or "current".
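How a rows frame resolves per output row can be shown with a small self-contained function (illustrative only, not the engine's operator): a negative start offset reaches back to preceding rows, 0 means the current row, and bounds clamp at the partition edges.

```ts
// Rolling average over a rows frame [start, end] relative to each row.
// With start = -2, end = 0 this is the 3-row trailing window above.
function rollingAvg(values: number[], start: number, end: number): number[] {
  return values.map((_, i) => {
    const lo = Math.max(0, i + start) // clamp at partition start
    const hi = Math.min(values.length - 1, i + end) // clamp at partition end
    const windowVals = values.slice(lo, hi + 1)
    return windowVals.reduce((a, b) => a + b, 0) / windowVals.length
  })
}
```

Note the first two outputs average fewer than three rows because the frame is clamped, which matches standard SQL window-frame behavior.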
## Computed columns

```ts
df.computed("total", (row) => (row.price as number) * (row.qty as number))
```

## Set operations
```ts
df1.union(df2) // UNION (deduplicates)
df1.union(df2, true) // UNION ALL (keeps duplicates)
df1.intersect(df2)
df1.except(df2)
```

## Convenience filter aliases
```ts
df.where("age", "gt", 25) // alias for filter()
df.whereIn("status", ["active", "trial"]) // shorthand for in
df.whereNotIn("role", ["banned"]) // shorthand for not_in
df.whereBetween("score", 80, 100) // shorthand for between
df.whereNotBetween("age", 0, 17) // shorthand for not_between
df.whereLike("name", "%Alice%") // shorthand for like
df.whereNotLike("email", "%spam%") // shorthand for not_like
df.filterIn("region", subquery) // IN subquery (semi-join)
```

## Keyset pagination
```ts
// Cursor-based pagination (efficient for large offsets)
df.sort("id", "asc").after(lastSeenId).limit(50)
```
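The loop shape for walking a whole table with keyset cursors is: fetch a page, feed its last sort key back in as the next cursor, stop on an empty page. The sketch below models this over a plain array (the real calls would be `df.sort("id", "asc").after(cursor).limit(n)`):

```ts
// One page: rows with id strictly greater than the cursor, in sort order.
function pageAfter(ids: number[], cursor: number | null, limit: number): number[] {
  const sorted = [...ids].sort((a, b) => a - b)
  const startIdx = cursor === null ? 0 : sorted.findIndex((id) => id > cursor)
  if (startIdx === -1) return [] // cursor past the end
  return sorted.slice(startIdx, startIdx + limit)
}

// Walk all pages by threading the last seen id through as the cursor.
function allPages(ids: number[], limit: number): number[][] {
  const pages: number[][] = []
  let cursor: number | null = null
  while (true) {
    const page = pageAfter(ids, cursor, limit)
    if (page.length === 0) break
    pages.push(page)
    cursor = page[page.length - 1]
  }
  return pages
}
```

Unlike offset(), each page seeks directly to "id > cursor" rather than scanning and discarding all earlier rows, so cost stays constant as you page deeper.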
## Versioning and caching

```ts
df.version(3) // query a specific Lance dataset version
df.cache({ ttl: 60_000 }) // cache results for 60 seconds
```

## Distinct
```ts
df.distinct("region", "category")
```

## Terminal methods
These execute the query and return results:
| Method | Returns | Description |
|---|---|---|
| collect() | QueryResult | Execute and return all matching rows |
| exec() | QueryResult | Alias for collect() |
| first() | Row \| null | Return first matching row |
| count() | number | Count matching rows |
| exists() | boolean | Check if any rows match |
| explain() | ExplainResult | Show execution plan without running |
| describe() | Table schema | Column names, types, row counts |
| head(n) | Row[] | First N rows (shorthand for limit + collect) |
| shape() | { rows, columns } | Row and column count |
| dtypes() | Record<string, string> | Column name → type map |
| sample(n) | Row[] | Random sample of N rows |
| valueCounts(col) | { value, count }[] | Frequency counts sorted desc |
| toJSON(opts?) | string | Export as JSON string |
| toCSV(opts?) | string | Export as CSV string |
| show(n?) | void | Print first N rows to console |
| lazy() | LazyResultHandle | Page-at-a-time iteration without full materialization |
| stream(batchSize) | AsyncGenerator<Row[]> | Async generator over result batches |
| cursor({ batchSize }) | AsyncIterable<Row[]> | Async iterable over batches |
| materialize() | DataFrame | Execute and re-wrap as in-memory DataFrame |
| materializeAs(name, opts?) | DataFrame | Execute, write to named table, return DataFrame for next stage |
| append(rows, opts?) | AppendResult | Write rows (CAS-coordinated) |
| dropTable() | DropResult | Delete table and all fragments |
| execStream() | ReadableStream<Row> | Streaming columnar binary (remote executor) |
| toOperator() | Operator | Escape hatch: get raw operator pipeline |
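The batch-iteration contract of stream() and cursor() can be mimicked with a plain async generator. This is a self-contained sketch of the shape, not the engine's executor-backed stream:

```ts
type Row = Record<string, unknown>

// Yields rows in fixed-size batches, mirroring stream(batchSize)'s contract.
async function* streamBatches(rows: Row[], batchSize: number): AsyncGenerator<Row[]> {
  for (let i = 0; i < rows.length; i += batchSize) {
    yield rows.slice(i, i + batchSize)
  }
}

// Consumption looks the same against the real API:
//   for await (const batch of df.stream(1000)) { ... }
async function countRows(rows: Row[], batchSize: number): Promise<number> {
  let n = 0
  for await (const batch of streamBatches(rows, batchSize)) n += batch.length
  return n
}
```

The point of batch iteration is that only one batch is materialized at a time, so result sets larger than memory can still be processed.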
## Data exploration

Familiar methods for interactive data work:
```ts
// Shape and schema
const { rows, columns } = await df.shape() // { rows: 50000, columns: 8 }
const types = await df.dtypes() // { id: "int64", name: "utf8", ... }

// Preview
await df.show(10) // console.table() first 10 rows
const sample = await df.sample(5) // random 5 rows

// Frequency analysis
const counts = await df.valueCounts("region") // [{ value: "us", count: 12000 }, ...]

// Clean and transform
df.fillNull("score", 0) // replace nulls
df.cast("amount", "number") // type conversion

// Export
const json = await df.toJSON({ pretty: true })
const csv = await df.toCSV({ delimiter: "\t" })
```

These are all sugar over existing methods — shape() calls describe(), valueCounts() uses groupBy().aggregate(), fillNull() and cast() use computed().
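As an illustration of that desugaring, valueCounts() is just a group-by count sorted descending. Here is the same reduction over a plain row array (an illustrative re-implementation, not the library's code):

```ts
type Row = Record<string, unknown>

// Group rows by a column, count each group, sort counts descending.
function valueCounts(rows: Row[], col: string): { value: unknown; count: number }[] {
  const counts = new Map<unknown, number>()
  for (const row of rows) counts.set(row[col], (counts.get(row[col]) ?? 0) + 1)
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count)
}
```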
## Introspection

These methods inspect the query without executing it:
| Method | Returns | Description |
|---|---|---|
| toCode(opts?) | string | Convert query to fluent builder TypeScript code |
| toDescriptor() | QueryDescriptor | Serializable query plan (JSON) |
## Query decompilation (toCode)

Convert any query back to readable, copy-pasteable fluent builder TypeScript. Useful for logging, debugging, and LLM context compression (~50x smaller than the raw QueryDescriptor JSON).
```ts
const df = qm
  .table("orders")
  .filter("category", "eq", "Electronics")
  .filter("amount", "gte", 100)
  .select("id", "amount", "region")
  .sort("amount", "desc")
  .limit(20)
```
```ts
console.log(df.toCode())
// const result = await qm
//   .table("orders")
//   .filter("category", "eq", "Electronics")
//   .filter("amount", "gte", 100)
//   .select("id", "amount", "region")
//   .sort("amount", "desc")
//   .limit(20)
//   .collect()
```

Customize the output variable name and table function:
```ts
df.toCode({ variableName: "data", tableFn: "db" })
// const data = await db
//   .table("orders")
//   ...
```
```ts
// Or use the static function on a raw descriptor:
import { descriptorToCode } from "querymode"
descriptorToCode(someDescriptor)
```

Handles all features: filters (all 14 ops with shorthand methods), aggregates, groupBy, joins, window functions, set operations, vector search, distinct, sort, limit/offset, version, cache, computed columns, and pipe stages.
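The decompilation idea can be shown with a toy descriptor-to-code function. The `MiniDescriptor` field names below are assumptions for illustration; the real descriptorToCode covers every feature listed above:

```ts
// Hypothetical, simplified descriptor shape (not querymode's actual type).
interface MiniDescriptor {
  table: string
  filters: { column: string; op: string; value: unknown }[]
  limit?: number
}

// Emit fluent builder code for the descriptor, one chained call per line.
function miniToCode(d: MiniDescriptor, variableName = "result", tableFn = "qm"): string {
  const lines = [
    `const ${variableName} = await ${tableFn}`,
    `  .table(${JSON.stringify(d.table)})`,
    ...d.filters.map(
      (f) =>
        `  .filter(${JSON.stringify(f.column)}, ${JSON.stringify(f.op)}, ${JSON.stringify(f.value)})`,
    ),
  ]
  if (d.limit !== undefined) lines.push(`  .limit(${d.limit})`)
  lines.push("  .collect()")
  return lines.join("\n")
}
```

The fluent form carries the same information as the descriptor JSON but drops all the structural keys, which is where the size reduction comes from.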
## Schema evolution

Add or drop columns on an existing dataset:
```ts
// Add a column with a default value (applied to all existing rows)
await qm.table("users").addColumn("tier", "utf8", "free")

// Add a nullable column (default: null)
await qm.table("users").addColumn("notes", "utf8")

// Drop a column
await qm.table("users").dropColumn("legacy_field")
```

Both return a SchemaEvolutionResult:
```ts
{
  table: "users",
  operation: "add_column", // or "drop_column"
  column: "tier",
  columnsAfter: ["id", "name", "email", "tier"]
}
```

Schema changes apply to the in-memory dataset (MaterializedExecutor). In edge mode, schema evolution coordinates through MasterDO.
## Progress tracking

For long-running queries, track progress with a callback:
```ts
const result = await df.collect({
  onProgress: (info) => {
    console.log(`${info.rowsCollected} rows, ${info.bytesRead} bytes`)
  },
})
```

## Write operations
```ts
// Append rows
await qm.table("events").append([
  { id: 1, type: "click", created_at: "2026-03-06" },
])

// Partitioned write — each region value gets its own fragment.
// Enables O(1) fragment lookup at query time via the partition catalog.
await qm.table("events").append(rows, {
  partitionBy: "region",
})

// Append to a specific R2 path with metadata
await qm.table("enriched").append(rows, {
  path: "pipelines/job-abc/enriched.lance/",
  metadata: { pipelineId: "job-abc", sourceTables: "orders,users", ttl: "7d" },
})

// Drop table — deletes all fragments from R2
await qm.table("enriched").dropTable()
```

See Write Path for details on CAS coordination, read-after-write consistency, and catalog integration.
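The CAS coordination behind append() can be sketched with an in-memory stand-in for the manifest store. This is illustrative only; the actual coordination mechanism is described in the Write Path docs:

```ts
interface Manifest { version: number; fragments: string[] }

// In-memory stand-in for a versioned manifest with compare-and-swap commits.
class CasStore {
  private manifest: Manifest = { version: 0, fragments: [] }

  read(): Manifest {
    return { version: this.manifest.version, fragments: [...this.manifest.fragments] }
  }

  // Commit succeeds only if nobody advanced the version since we read it.
  compareAndSwap(expectedVersion: number, next: Manifest): boolean {
    if (this.manifest.version !== expectedVersion) return false
    this.manifest = next
    return true
  }
}

// Optimistic append: read, build the next manifest, CAS, retry on conflict.
function appendFragment(store: CasStore, fragment: string, maxRetries = 5): boolean {
  for (let i = 0; i < maxRetries; i++) {
    const current = store.read()
    const next: Manifest = {
      version: current.version + 1,
      fragments: [...current.fragments, fragment],
    }
    if (store.compareAndSwap(current.version, next)) return true
    // Lost the race: loop re-reads the newer manifest and retries.
  }
  return false
}
```

The retry loop is what makes concurrent writers safe: a losing writer re-reads the committed manifest and rebases its fragment on top rather than overwriting the winner.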
## Vector search

```ts
const similar = await qm
  .table("images")
  .vector("embedding", queryVec, 10)
  .exec()
```

IVF-PQ accelerated when an index is present; falls back to flat SIMD search.
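When no index exists, flat search is a brute-force top-k scan over every vector. A minimal sketch of that fallback (squared-Euclidean distance is an assumption here; the engine's version is SIMD-accelerated):

```ts
// Brute-force nearest-neighbor: score every vector, sort, take the k closest.
function topK(
  queryVec: number[],
  vectors: { id: number; vec: number[] }[],
  k: number,
): { id: number; distance: number }[] {
  const dist = (a: number[], b: number[]) =>
    a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0)
  return [...vectors]
    .map((v) => ({ id: v.id, distance: dist(queryVec, v.vec) }))
    .sort((a, b) => a.distance - b.distance)
    .slice(0, k)
}
```

An IVF-PQ index avoids this full scan by probing only the inverted lists nearest the query, at the cost of approximate results.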