
API Reference

textsift has a narrow public surface: one class and a handful of types. Everything else (tokenization, chunking, Viterbi CRF, BIOES merging, backend selection, OPFS caching) is internal.

Static factory. Downloads the model on the first call, then caches it (OPFS in the browser, cacheDir on Node).

```typescript
static create(options?: CreateOptions): Promise<PrivacyFilter>;
```

| field | type | default | notes |
| --- | --- | --- | --- |
| `modelSource` | `string` | HuggingFace Hub URL | Point at your own mirror (R2/S3/on-prem). |
| `backend` | `"auto" \| "wasm" \| "webgpu"` | `"auto"` | `"auto"` picks WebGPU when available, WASM otherwise. On Node, `"webgpu"` resolves to the platform's native fast path (Metal / Vulkan / Dawn). |
| `onProgress` | `(ev: ProgressEvent) => void` | (none) | Fires during download + warmup. |
| `signal` | `AbortSignal` | (none) | Cancels a slow first-time load. |
| `markers` | `MarkerStrategy` | `"[<label>]"` | Per-category replacement text. |
| `enabledCategories` | `SpanLabel[]` | all 8 | Only redact these categories. |
| `maxChunkTokens` | `number` | `2048` | Chunk size for long inputs. |
| `rules` | `Rule[]` | `[]` | Custom regex / function detectors merged with model spans. |
| `presets` | `string[]` | `[]` | Built-in rule sets. Currently: `"secrets"`. |
| `minConfidence` | `number` | `0` | Drop spans whose model confidence is below this (0..1). Rule spans always pass. |
| `cacheDir` | `string` | `~/.cache/textsift` | Node only. Override the model cache root. |
| `modelPath` | `string` | (none) | Node only. Local pre-staged ONNX file; skips cache + fetch. Companion `.onnx_data` expected at `<modelPath>_data`. |
| `offline` | `boolean` | `false` | Node only. Fail loudly on cache miss instead of fetching. No silent WASM fallback. |
```typescript
redact(text: string, options?: RedactOptions): Promise<RedactResult>;
```

Returns { input, redactedText, spans, summary, containsPii }. redactedText is the input with every detected span replaced by its marker.

```typescript
detect(text: string, options?: RedactOptions): Promise<DetectResult>;
```

Same inference, no output string built. Use when you want to highlight spans in a UI instead of producing redacted text.
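Since `start`/`end` are plain UTF-16 offsets, splitting the input into highlightable segments is a few lines. A sketch with local stand-in types (not part of the API), assuming spans arrive sorted by `start` and non-overlapping:

```typescript
// Minimal span shape mirroring DetectedSpan's offsets (local to this example).
interface SpanLike { start: number; end: number; label: string }
interface Segment { text: string; label: string | null }

// Split the input into plain/highlighted segments. Render label === null
// segments as plain text and the rest with per-label styling.
export function toSegments(input: string, spans: readonly SpanLike[]): Segment[] {
  const out: Segment[] = [];
  let pos = 0;
  for (const s of spans) {
    if (s.start > pos) out.push({ text: input.slice(pos, s.start), label: null });
    out.push({ text: input.slice(s.start, s.end), label: s.label });
    pos = s.end;
  }
  if (pos < input.length) out.push({ text: input.slice(pos), label: null });
  return out;
}
```

Feed `result.spans` from detect() straight in.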

```typescript
redactBatch(inputs: readonly string[], options?: RedactOptions): Promise<RedactResult[]>;
```

Serial processing. For real parallelism, construct multiple PrivacyFilter instances.

```typescript
classifyColumns(
  rows: readonly (readonly string[])[],
  options?: ClassifyTableOptions,
): Promise<ColumnClassification[]>;
```

Samples cells from each column of a tabular dataset, runs detect() on each sample, and returns per-column classification + confidence. Useful for audits (“which columns are PII?”), GDPR right-to-be-forgotten workflows, and vendor data-sharing prep.

```typescript
const cols = await filter.classifyColumns(rows, {
  headerRow: true,
  sampleSize: 50,
});
// → [
//   { index: 0, header: "id", label: null, confidence: 0, samples: 50, labelCounts: {} },
//   { index: 1, header: "email", label: "private_email", confidence: 0.96, samples: 50, labelCounts: { private_email: 48 } },
// ]
```
```typescript
redactTable(
  rows: readonly (readonly string[])[],
  options?: RedactTableOptions,
): Promise<string[][]>;
```

Classifies columns (or uses pre-supplied classifications), then applies one of three modes to every cell of a PII column:

- `"redact"` (default): markers via the filter’s markers strategy.
- `"synth"`: realistic fakes via markerPresets.faker(), shared across the table for cross-row consistency.
- `"drop_column"`: omit the column entirely.

```typescript
dispose(): void;
[Symbol.dispose](): void;
```

Releases GPU buffers + WASM memory. Calls after dispose throw. Safe to call more than once.

Symbol.dispose lets you use TypeScript 5.2+'s `using` declaration, so cleanup happens automatically at scope exit with no manual dispose() call:

```typescript
{
  using filter = await PrivacyFilter.create();
  const r = await filter.redact(text);
  // filter.dispose() runs automatically when the block exits
}
```

The 8 PII categories emitted by the model:

```typescript
type SpanLabel =
  | "account_number"   // credit cards, bank accounts, SSN-like identifiers
  | "private_address"  // street addresses, geographic coordinates
  | "private_email"    // email addresses
  | "private_person"   // personal names
  | "private_phone"    // phone numbers
  | "private_url"      // URLs pointing to personal accounts/profiles
  | "private_date"     // dates, birthdays
  | "secret";          // API keys, passwords, tokens
```
```typescript
interface DetectedSpan {
  label: SpanLabel;
  start: number;      // UTF-16 char offset into the original input
  end: number;        // exclusive
  text: string;       // input.slice(start, end)
  marker: string;     // replacement string
  confidence: number;
}

interface RedactResult {
  input: string;
  redactedText: string;
  spans: readonly DetectedSpan[];
  summary: Readonly<Partial<Record<SpanLabel, number>>>;
  containsPii: boolean;
  warning?: string;   // present only for chunked inputs
}
```

A MarkerStrategy is either a static per-category map or a per-span callback:

```typescript
// Static per-category map
const byCategory: MarkerStrategy = { private_person: "[NAME]", private_email: "[EMAIL]" };

// Dynamic per-span callback (indexed)
const bySpan: MarkerStrategy = (span, i) => `[${span.label.toUpperCase()}_${i}]`;
```

Return null from the callback to leave a given span unredacted.
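For example, a callback that leaves dates and low-confidence spans alone and marks everything else. The SpanArg shape is a local stand-in for DetectedSpan, and the 0.8 threshold is this example's arbitrary choice:

```typescript
// Local stand-in for the span a MarkerStrategy callback receives.
interface SpanArg { label: string; text: string; confidence: number }

// Skip dates and low-confidence spans; everything else gets an indexed marker.
export const markers = (span: SpanArg, i: number): string | null => {
  if (span.label === "private_date") return null; // leave dates in place
  if (span.confidence < 0.8) return null;         // too uncertain to redact
  return `[${span.label.toUpperCase()}_${i}]`;
};
```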

A Rule is a custom detector that runs alongside the model: either a regex (must have the /g flag) or a function returning { start, end } ranges:

```typescript
type RuleSeverity = "block" | "warn" | "track";

type Rule =
  | { label: string; severity?: RuleSeverity; marker?: string; pattern: RegExp }
  | { label: string; severity?: RuleSeverity; marker?: string; match: (text: string) => Array<{ start: number; end: number }> };
```

Output spans carry source: "rule" and the declared severity so an AI proxy can dispatch on them (block → refuse, warn → log, track → annotate). Rules and model spans coexist; on overlap the rule wins.
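A proxy-side dispatch on that severity could be sketched as follows; the OutputSpan shape and the "pass" fallback for severity-less model spans are this example's assumptions, not part of the API:

```typescript
// Local stand-ins mirroring the documented fields (not imported from textsift).
type RuleSeverity = "block" | "warn" | "track";
interface OutputSpan { source?: string; severity?: RuleSeverity; label: string }
type Verdict = "refuse" | "log" | "annotate" | "pass";

// block → refuse, warn → log, track → annotate; spans without a severity
// (model spans) pass through for normal redaction.
export function dispatch(span: OutputSpan): Verdict {
  switch (span.severity) {
    case "block": return "refuse";
    case "warn": return "log";
    case "track": return "annotate";
    default: return "pass";
  }
}
```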

```typescript
import { PrivacyFilter, secretRules } from "textsift";

// Stupidly easy: just enable the preset.
const withPreset = await PrivacyFilter.create({ presets: ["secrets"] });

// Same thing, explicit:
const explicit = await PrivacyFilter.create({ rules: secretRules });

// Mix presets with your own rules:
const mixed = await PrivacyFilter.create({
  presets: ["secrets"],
  rules: [{ label: "JIRA_TICKET", pattern: /\b[A-Z]{2,}-\d+\b/g }],
});
```

The "secrets" preset covers credentials/API keys the model wasn’t trained on: JWT, GitHub PAT (all variants), AWS access key, Slack tokens + webhooks, OpenAI/Anthropic/Google API keys, Stripe live/test keys, Stripe webhook secrets, npm tokens, PEM private-key headers. All default to severity: "block".

Performance note: the runtime unions all regex rules into one alternation and scans the input once, regardless of preset count. N rules ≈ 1 scan.

markerPresets.faker() is a built-in MarkerStrategy factory that emits realistic-looking fake values instead of [label] markers. It is stateful per instance: the same input text always maps to the same fake within the strategy’s lifetime, so relationships across mentions are preserved.

```typescript
import { PrivacyFilter, markerPresets } from "textsift";

const filter = await PrivacyFilter.create({ markers: markerPresets.faker() });
await filter.redact("Hi Alice, alice@example.com");
// → "Hi Alice Anderson, alice.anderson@example.com"
```

secret spans deliberately render as [secret] regardless of preset — emitting another credible-looking secret is a security footgun. See the Faker mode playground for an interactive demo.

```typescript
interface ColumnClassification {
  index: number;
  header?: string;
  label: SpanLabel | string | null;  // null = no PII detected in samples
  confidence: number;                // 0..1, fraction of samples matching
  samples: number;
  labelCounts: Readonly<Record<string, number>>;
}

interface ClassifyTableOptions {
  sampleSize?: number;          // default: 50
  headerRow?: boolean;          // first row is column names
  headers?: readonly string[];  // explicit, overrides headerRow
  rules?: readonly Rule[];
  presets?: readonly string[];
  signal?: AbortSignal;
}

type RedactTableMode = "redact" | "synth" | "drop_column";

interface RedactTableOptions extends ClassifyTableOptions {
  mode?: RedactTableMode;                             // default: "redact"
  classifications?: readonly ColumnClassification[];  // skip the classify step
  markers?: MarkerStrategy;
  enabledCategories?: readonly SpanLabel[];
}
```

Convert detection results to SARIF v2.1.0 for GitHub Code Scanning, GitLab SAST, etc.:

```typescript
import { writeFile } from "node:fs/promises";
import { PrivacyFilter } from "textsift";
import { toSarif, detectResultToSarif } from "textsift/sarif";

const filter = await PrivacyFilter.create();
const result = await filter.detect(text);

// Single file
const singleLog = detectResultToSarif(result, "src/test.ts");

// Multiple files
const multiLog = toSarif(
  [
    { uri: "src/a.ts", text: textA, spans: spansA },
    { uri: "src/b.ts", text: textB, spans: spansB },
  ],
  { modelSpanLevel: "warning" }, // optional: downgrade model spans
);

await writeFile("findings.sarif", JSON.stringify(multiLog, null, 2));
```

The output is GitHub-Code-Scanning-ready: partialFingerprints for cross-run dedup, per-label rule definitions in tool.driver.rules, and severity mapping (model → error, rule:block → error, rule:warn → warning) overridable via modelSpanLevel.

```typescript
type ProgressEvent =
  | { stage: "download"; loaded: number; total: number; url: string }
  | { stage: "cache-hit"; total: number }
  | { stage: "compile"; backend: "webgpu" | "wasm" }
  | { stage: "warmup" }
  | { stage: "ready" };
```
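An exhaustive onProgress handler over this union might look like the sketch below; the type is re-declared locally (copied from above) so the snippet stands alone, and the message strings are this example's choices:

```typescript
// ProgressEvent union copied from the docs (local declaration for a standalone sketch).
type ProgressEvent =
  | { stage: "download"; loaded: number; total: number; url: string }
  | { stage: "cache-hit"; total: number }
  | { stage: "compile"; backend: "webgpu" | "wasm" }
  | { stage: "warmup" }
  | { stage: "ready" };

// Exhaustive switch: the `never` check fails to compile if a stage is unhandled.
export function describe(ev: ProgressEvent): string {
  switch (ev.stage) {
    case "download": return `downloading ${Math.round((ev.loaded / ev.total) * 100)}%`;
    case "cache-hit": return `cached (${ev.total} bytes)`;
    case "compile": return `compiling for ${ev.backend}`;
    case "warmup": return "warming up";
    case "ready": return "ready";
    default: { const unreachable: never = ev; return unreachable; }
  }
}
```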