API Reference
textsift has a narrow public surface: one class and a handful of types. Everything else (tokenization, chunking, Viterbi CRF, BIOES merging, backend selection, OPFS caching) is internal.
PrivacyFilter.create(options?)
Static factory. Downloads the model on first call, then caches it via OPFS.
```ts
static create(options?: CreateOptions): Promise<PrivacyFilter>;
```

CreateOptions
| field | type | default | notes |
|---|---|---|---|
| modelSource | string | HuggingFace Hub URL | Point at your own mirror (R2/S3/on-prem). |
| backend | "auto" \| "wasm" \| "webgpu" | "auto" | "auto" picks WebGPU when available, WASM otherwise. On Node, "webgpu" resolves to the platform-native fast path (Metal / Vulkan / Dawn). |
| onProgress | (ev: ProgressEvent) => void | — | Fires during download + warmup. |
| signal | AbortSignal | — | Cancels a slow first-time load. |
| markers | MarkerStrategy | "[<label>]" | Per-category replacement text. |
| enabledCategories | SpanLabel[] | all 8 | Only redact these categories. |
| maxChunkTokens | number | 2048 | Chunk size for long inputs. |
| rules | Rule[] | [] | Custom regex / function detectors merged with model spans. |
| presets | string[] | [] | Built-in rule sets. Currently: "secrets". |
| minConfidence | number | 0 | Drop spans whose model confidence is below this (0..1). Rule spans always pass. |
| cacheDir | string | ~/.cache/textsift | Node only. Override the model cache root. |
| modelPath | string | — | Node only. Local pre-staged ONNX file; skips cache + fetch. Companion .onnx_data expected at <modelPath>_data. |
| offline | boolean | false | Node only. Fail loudly on cache miss instead of fetching. No silent WASM fallback. |
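Taken together, a typical browser-side configuration might look like the following sketch. The option names come from the table above; the concrete values (chunk size, threshold) are illustrative, not defaults. The options object is shown standalone so its shape is easy to inspect:

```ts
// Illustrative CreateOptions object; values are examples, not defaults.
const controller = new AbortController();

const createOptions = {
  backend: "auto" as const,  // WebGPU if available, else WASM
  maxChunkTokens: 1024,      // smaller chunks for latency-sensitive UIs
  minConfidence: 0.5,        // drop low-confidence model spans
  signal: controller.signal, // allow cancelling a slow first load
  onProgress: (ev: { stage: string }) => {
    console.log(`model load: ${ev.stage}`);
  },
};
```

With the real library this object would be passed straight to PrivacyFilter.create(createOptions).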
Instance methods
redact(text, options?)
```ts
redact(text: string, options?: RedactOptions): Promise<RedactResult>;
```

Returns { input, redactedText, spans, summary, containsPii }. redactedText is the input with every detected span replaced by its marker.
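The replacement step itself is mechanical. A minimal sketch of how redactedText could be assembled from spans, assuming the spans are non-overlapping and sorted by start, with UTF-16 offsets and an exclusive end (the helper name is hypothetical):

```ts
interface Span {
  start: number;  // UTF-16 offset into the input
  end: number;    // exclusive
  marker: string; // replacement text
}

// Stitch together the text between spans with each span's marker.
function applyMarkers(input: string, spans: readonly Span[]): string {
  let out = "";
  let cursor = 0;
  for (const s of spans) {
    out += input.slice(cursor, s.start) + s.marker;
    cursor = s.end;
  }
  return out + input.slice(cursor);
}
```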
detect(text, options?)
```ts
detect(text: string, options?: RedactOptions): Promise<DetectResult>;
```

Same inference, no output string built. Use when you want to highlight spans in a UI instead of producing redacted text.
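For the highlighting use case, detect()'s spans can be turned into alternating plain/PII segments to render. A sketch, assuming sorted, non-overlapping spans (toSegments is a hypothetical helper, not part of the library):

```ts
type Segment = { text: string; label: string | null };

// Split the input into segments: label = null for plain text,
// the span's label for detected PII.
function toSegments(
  input: string,
  spans: readonly { start: number; end: number; label: string }[],
): Segment[] {
  const out: Segment[] = [];
  let cursor = 0;
  for (const s of spans) {
    if (s.start > cursor) out.push({ text: input.slice(cursor, s.start), label: null });
    out.push({ text: input.slice(s.start, s.end), label: s.label });
    cursor = s.end;
  }
  if (cursor < input.length) out.push({ text: input.slice(cursor), label: null });
  return out;
}
```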
redactBatch(inputs, options?)
```ts
redactBatch(inputs: readonly string[], options?: RedactOptions): Promise<RedactResult[]>;
```

Serial processing. For real parallelism, construct multiple PrivacyFilter instances.
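One way to get that parallelism is to shard the inputs across instances. The round-robin split below is a sketch (the helper is hypothetical):

```ts
// Round-robin shard so each PrivacyFilter instance gets its own slice.
function shard<T>(items: readonly T[], n: number): T[][] {
  const out: T[][] = Array.from({ length: n }, () => [] as T[]);
  items.forEach((item, i) => out[i % n].push(item));
  return out;
}
```

Each shard would then go to its own instance, e.g. Promise.all(filters.map((f, i) => f.redactBatch(shards[i]))).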
classifyColumns(rows, options?)
```ts
classifyColumns(
  rows: readonly (readonly string[])[],
  options?: ClassifyTableOptions,
): Promise<ColumnClassification[]>;
```

Samples cells from each column of a tabular dataset, runs detect() on each sample, and returns per-column classification + confidence. Useful for audits (“which columns are PII?”), GDPR right-to-be-forgotten requests, and vendor data-sharing prep.
```ts
const cols = await filter.classifyColumns(rows, {
  headerRow: true,
  sampleSize: 50,
});
// → [
//   { index: 0, header: "id", label: null, confidence: 0, samples: 50, labelCounts: {} },
//   { index: 1, header: "email", label: "private_email", confidence: 0.96, samples: 50, labelCounts: { private_email: 48 } },
// ]
```

redactTable(rows, options?)
```ts
redactTable(
  rows: readonly (readonly string[])[],
  options?: RedactTableOptions,
): Promise<string[][]>;
```

Classifies columns (or uses pre-supplied classifications), then per cell applies one of three modes to PII columns: "redact" (default — markers via the filter’s markers strategy), "synth" (realistic fakes via markerPresets.faker(), shared across the table for cross-row consistency), or "drop_column" (omit the column entirely).
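As a sketch of what "drop_column" amounts to per row: the helper and the hand-supplied piiColumns set below are hypothetical; the real method derives the column set from its classification step.

```ts
// Drop every column whose index is in piiColumns.
function dropColumns(
  rows: readonly (readonly string[])[],
  piiColumns: ReadonlySet<number>,
): string[][] {
  return rows.map((row) => row.filter((_, i) => !piiColumns.has(i)));
}
```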
dispose() / Symbol.dispose
```ts
dispose(): void;
[Symbol.dispose](): void;
```

Releases GPU buffers + WASM memory. Calls after dispose() throw. Safe to call more than once.
Symbol.dispose lets you use TypeScript 5.2+’s using declaration so cleanup happens automatically at scope exit — no manual dispose() call:
```ts
{
  using filter = await PrivacyFilter.create();
  const r = await filter.redact(text);
  // filter.dispose() runs automatically when the block exits
}
```

SpanLabel
The 8 PII categories emitted by the model:
```ts
type SpanLabel =
  | "account_number"  // credit cards, bank accounts, SSN-like identifiers
  | "private_address" // street addresses, geographic coordinates
  | "private_email"   // email addresses
  | "private_person"  // personal names
  | "private_phone"   // phone numbers
  | "private_url"     // URLs pointing to personal accounts/profiles
  | "private_date"    // dates, birthdays
  | "secret";         // API keys, passwords, tokens
```

DetectedSpan
```ts
interface DetectedSpan {
  label: SpanLabel;
  start: number;      // UTF-16 char offset into the original input
  end: number;        // exclusive
  text: string;       // input.slice(start, end)
  marker: string;     // replacement string
  confidence: number;
}
```

RedactResult
```ts
interface RedactResult {
  input: string;
  redactedText: string;
  spans: readonly DetectedSpan[];
  summary: Readonly<Partial<Record<SpanLabel, number>>>;
  containsPii: boolean;
  warning?: string; // present only for chunked inputs
}
```

MarkerStrategy
Either a static map or a callback:
```ts
// Static per-category
const markers = { private_person: "[NAME]", private_email: "[EMAIL]" };
```

```ts
// Dynamic per-span (indexed)
const markers = (span, i) => `[${span.label.toUpperCase()}_${i}]`;
```

Return null to leave a given span unredacted.
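Combining the two: a callback that redacts only a chosen subset of labels and returns null for everything else (the subset here is illustrative):

```ts
// Redact only these labels; null passes the span through unchanged.
const keepOnly = new Set(["private_email", "secret"]);

const markers = (span: { label: string }, i: number): string | null =>
  keepOnly.has(span.label) ? `[${span.label.toUpperCase()}_${i}]` : null;
```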
Rule
A custom detector that runs alongside the model. Either a regex (must have the /g flag) or a function returning { start, end } ranges:
```ts
type RuleSeverity = "block" | "warn" | "track";

type Rule =
  | { label: string; severity?: RuleSeverity; marker?: string; pattern: RegExp }
  | { label: string; severity?: RuleSeverity; marker?: string; match: (text: string) => Array<{ start: number; end: number }> };
```

Output spans carry source: "rule" and the declared severity so an AI proxy can dispatch on them (block → refuse, warn → log, track → annotate). Rules and model spans coexist; on overlap the rule wins.
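A sketch of the function form: a detector for internal hostnames, where the label, pattern, and severity are invented for illustration.

```ts
// A function-based rule. match() returns { start, end } ranges.
const hostRule = {
  label: "INTERNAL_HOST",
  severity: "warn",
  match: (text: string): Array<{ start: number; end: number }> => {
    const re = /\b[\w-]+\.corp\.example\b/g;
    const out: Array<{ start: number; end: number }> = [];
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      out.push({ start: m.index, end: m.index + m[0].length });
    }
    return out;
  },
};
```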
Built-in presets
Section titled “Built-in presets”import { PrivacyFilter, secretRules } from "textsift";
// Stupidly easy: just enable the preset.const filter = await PrivacyFilter.create({ presets: ["secrets"] });
// Same thing, explicit:const filter = await PrivacyFilter.create({ rules: secretRules });
// Mix presets with your own:const filter = await PrivacyFilter.create({ presets: ["secrets"], rules: [{ label: "JIRA_TICKET", pattern: /\b[A-Z]{2,}-\d+\b/g }],});The "secrets" preset covers credentials/API keys the model wasn’t trained on: JWT, GitHub PAT (all variants), AWS access key, Slack tokens + webhooks, OpenAI/Anthropic/Google API keys, Stripe live/test keys, Stripe webhook secrets, npm tokens, PEM private-key headers. All default to severity: "block".
Performance note: the runtime unions all regex rules into one alternation and scans the input once, regardless of preset count. N rules ≈ 1 scan.
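The union trick can be sketched in a few lines; this is illustrative only, not the library’s actual implementation (flags other than /g are dropped here):

```ts
// Combine rule patterns into one alternation; named groups let the
// scanner recover which rule matched.
function unionPatterns(patterns: readonly RegExp[]): RegExp {
  return new RegExp(patterns.map((p, i) => `(?<r${i}>${p.source})`).join("|"), "g");
}
```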
markerPresets.faker()
Built-in MarkerStrategy factory that emits realistic-looking fake values instead of [label] markers. Stateful per instance — same input text always maps to the same fake within the strategy’s lifetime, so relationships across mentions are preserved.
```ts
import { PrivacyFilter, markerPresets } from "textsift";

const filter = await PrivacyFilter.create({ markers: markerPresets.faker() });
await filter.redact("Hi Alice, alice@example.com");
// → "Hi Alice Anderson, alice.anderson@example.com"
```

secret spans deliberately render as [secret] regardless of preset — emitting another credible-looking secret is a security footgun. See the Faker mode playground for an interactive demo.
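The cross-mention consistency can be pictured as a memo table keyed by (label, original text). The sketch below uses a stand-in generator rather than the real fake-value logic:

```ts
// Same (label, text) pair always yields the same replacement for the
// lifetime of the strategy; the generator here is a stand-in.
function makeConsistent(fake: (label: string) => string) {
  const seen = new Map<string, string>();
  return (span: { label: string; text: string }): string => {
    const key = `${span.label}:${span.text}`;
    let v = seen.get(key);
    if (v === undefined) {
      v = fake(span.label);
      seen.set(key, v);
    }
    return v;
  };
}
```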
ColumnClassification
```ts
interface ColumnClassification {
  index: number;
  header?: string;
  label: SpanLabel | string | null; // null = no PII detected in samples
  confidence: number;               // 0..1, fraction of samples matching
  samples: number;
  labelCounts: Readonly<Record<string, number>>;
}
```

ClassifyTableOptions / RedactTableOptions
```ts
interface ClassifyTableOptions {
  sampleSize?: number;         // default: 50
  headerRow?: boolean;         // first row is column names
  headers?: readonly string[]; // explicit, overrides headerRow
  rules?: readonly Rule[];
  presets?: readonly string[];
  signal?: AbortSignal;
}
```
```ts
type RedactTableMode = "redact" | "synth" | "drop_column";

interface RedactTableOptions extends ClassifyTableOptions {
  mode?: RedactTableMode; // default: "redact"
  classifications?: readonly ColumnClassification[]; // skip the classify step
  markers?: MarkerStrategy;
  enabledCategories?: readonly SpanLabel[];
}
```

SARIF export — textsift/sarif
Convert detection results to SARIF v2.1.0 for GitHub Code Scanning, GitLab SAST, etc.:
```ts
import { PrivacyFilter } from "textsift";
import { toSarif, detectResultToSarif } from "textsift/sarif";

const filter = await PrivacyFilter.create();
const result = await filter.detect(text);

// Single file
const log = detectResultToSarif(result, "src/test.ts");
```

```ts
// Multiple files
import { writeFile } from "node:fs/promises";

const log = toSarif([
  { uri: "src/a.ts", text: textA, spans: spansA },
  { uri: "src/b.ts", text: textB, spans: spansB },
], { modelSpanLevel: "warning" }); // optional: downgrade model spans

await writeFile("findings.sarif", JSON.stringify(log, null, 2));
```

The output is GitHub-Code-Scanning-ready: partialFingerprints for cross-run dedup, per-label rule definitions in tool.driver.rules, and a severity mapping (model → error, rule:block → error, rule:warn → warning) overridable via modelSpanLevel.
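The severity mapping can be sketched as a small function. Note: the rule:track → note case and the fallback handling below are assumptions for illustration, not confirmed library behavior.

```ts
type Level = "error" | "warning" | "note";

// Map a detected span to a SARIF level: model spans use modelSpanLevel;
// rule spans follow their declared severity.
function sarifLevel(
  span: { source?: "rule"; severity?: "block" | "warn" | "track" },
  modelSpanLevel: Level = "error",
): Level {
  if (span.source !== "rule") return modelSpanLevel;
  switch (span.severity) {
    case "warn": return "warning";
    case "track": return "note"; // assumption in this sketch
    default: return "error";     // "block" and unspecified
  }
}
```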
ProgressEvent
```ts
type ProgressEvent =
  | { stage: "download"; loaded: number; total: number; url: string }
  | { stage: "cache-hit"; total: number }
  | { stage: "compile"; backend: "webgpu" | "wasm" }
  | { stage: "warmup" }
  | { stage: "ready" };
```
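A minimal onProgress handler over this union might render download percentages and pass the other stages through by name. This is a sketch; the type is restated locally so the snippet is self-contained.

```ts
type ProgressEvent =
  | { stage: "download"; loaded: number; total: number; url: string }
  | { stage: "cache-hit"; total: number }
  | { stage: "compile"; backend: "webgpu" | "wasm" }
  | { stage: "warmup" }
  | { stage: "ready" };

// Render a human-readable progress line for each event.
function describe(ev: ProgressEvent): string {
  if (ev.stage === "download") {
    return `download ${Math.round((100 * ev.loaded) / ev.total)}%`;
  }
  return ev.stage;
}
```

Wire it up as onProgress: (ev) => console.log(describe(ev)) in CreateOptions.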