
API Reference

textsift has a narrow public surface: one class and a handful of types. Everything else (tokenization, chunking, Viterbi CRF, BIOES merging, backend selection, OPFS caching) is internal.

Static factory. Downloads the model on the first call, then caches it (OPFS in the browser, cacheDir on Node).

```typescript
static create(options?: CreateOptions): Promise<PrivacyFilter>;
```

| field | type | default | notes |
| --- | --- | --- | --- |
| `modelSource` | `string` | HuggingFace Hub URL | Point at your own mirror (R2/S3/on-prem). |
| `backend` | `"auto" \| "wasm" \| "webgpu"` | `"auto"` | `"auto"` picks WebGPU when available, WASM otherwise. On Node, `"webgpu"` resolves to the platform's native fast path (Metal / Vulkan / Dawn). |
| `onProgress` | `(ev: ProgressEvent) => void` | (none) | Fires during download + warmup. |
| `signal` | `AbortSignal` | (none) | Cancels a slow first-time load. |
| `markers` | `MarkerStrategy` | `"[<label>]"` | Per-category replacement text. |
| `enabledCategories` | `SpanLabel[]` | all 8 | Only redact these categories. |
| `maxChunkTokens` | `number` | `2048` | Chunk size for long inputs. |
| `rules` | `Rule[]` | `[]` | Custom regex / function detectors merged with model spans. |
| `presets` | `string[]` | `[]` | Built-in rule sets. Currently: `"secrets"`. |
| `minConfidence` | `number` | `0` | Drop spans whose model confidence is below this (0..1). Rule spans always pass. |
| `cacheDir` | `string` | `~/.cache/textsift` | Node only. Override the model cache root. |
| `modelPath` | `string` | (none) | Node only. Local pre-staged ONNX file; skips cache + fetch. Companion `.onnx_data` expected at `<modelPath>_data`. |
| `offline` | `boolean` | `false` | Node only. Fail loudly on cache miss instead of fetching. No silent WASM fallback. |
```typescript
redact(text: string, options?: RedactOptions): Promise<RedactResult>;
```

Returns { input, redactedText, spans, summary, containsPii }. redactedText is the input with every detected span replaced by its marker.

```typescript
detect(text: string, options?: RedactOptions): Promise<DetectResult>;
```

Same inference, no output string built. Use when you want to highlight spans in a UI instead of producing redacted text.
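Since `start`/`end` are plain UTF-16 offsets, splitting the input into highlightable segments is a few lines. A sketch with local stand-in types (not part of the API), assuming spans arrive sorted by `start` and non-overlapping:

```typescript
// Minimal span shape mirroring DetectedSpan's offsets (local to this example).
interface SpanLike { start: number; end: number; label: string }
interface Segment { text: string; label: string | null }

// Split the input into plain/highlighted segments. Render label === null
// segments as plain text and the rest with per-label styling.
export function toSegments(input: string, spans: readonly SpanLike[]): Segment[] {
  const out: Segment[] = [];
  let pos = 0;
  for (const s of spans) {
    if (s.start > pos) out.push({ text: input.slice(pos, s.start), label: null });
    out.push({ text: input.slice(s.start, s.end), label: s.label });
    pos = s.end;
  }
  if (pos < input.length) out.push({ text: input.slice(pos), label: null });
  return out;
}
```

Feed `result.spans` from detect() straight in.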

```typescript
redactBatch(inputs: readonly string[], options?: RedactOptions): Promise<RedactResult[]>;
```

Serial processing. For real parallelism, construct multiple PrivacyFilter instances.

```typescript
classifyColumns(
  rows: readonly (readonly string[])[],
  options?: ClassifyTableOptions,
): Promise<ColumnClassification[]>;
```

Samples cells from each column of a tabular dataset, runs detect() on each sample, and returns per-column classification + confidence. Useful for audits (“which columns are PII?”), GDPR right-to-be-forgotten workflows, and vendor data-sharing prep.

```typescript
const cols = await filter.classifyColumns(rows, {
  headerRow: true,
  sampleSize: 50,
});
// → [
//   { index: 0, header: "id", label: null, confidence: 0, samples: 50, labelCounts: {} },
//   { index: 1, header: "email", label: "private_email", confidence: 0.96, samples: 50, labelCounts: { private_email: 48 } },
// ]
```
```typescript
redactTable(
  rows: readonly (readonly string[])[],
  options?: RedactTableOptions,
): Promise<string[][]>;
```

Classifies columns (or uses pre-supplied classifications), then applies one of three modes to every cell of a PII column:

- `"redact"` (default): markers via the filter’s markers strategy.
- `"synth"`: realistic fakes via markerPresets.faker(), shared across the table for cross-row consistency.
- `"drop_column"`: omit the column entirely.

```typescript
dispose(): void;
[Symbol.dispose](): void;
```

Releases GPU buffers + WASM memory. Calls after dispose throw. Safe to call more than once.

Symbol.dispose lets you use TypeScript 5.2+'s `using` declaration, so cleanup happens automatically at scope exit with no manual dispose() call:

```typescript
{
  using filter = await PrivacyFilter.create();
  const r = await filter.redact(text);
  // filter.dispose() runs automatically when the block exits
}
```

The 8 PII categories emitted by the model:

```typescript
type SpanLabel =
  | "account_number"   // credit cards, bank accounts, SSN-like identifiers
  | "private_address"  // street addresses, geographic coordinates
  | "private_email"    // email addresses
  | "private_person"   // personal names
  | "private_phone"    // phone numbers
  | "private_url"      // URLs pointing to personal accounts/profiles
  | "private_date"     // dates, birthdays
  | "secret";          // API keys, passwords, tokens
```
```typescript
interface DetectedSpan {
  label: SpanLabel;
  start: number;      // UTF-16 char offset into the original input
  end: number;        // exclusive
  text: string;       // input.slice(start, end)
  marker: string;     // replacement string
  confidence: number;
}

interface RedactResult {
  input: string;
  redactedText: string;
  spans: readonly DetectedSpan[];
  summary: Readonly<Partial<Record<SpanLabel, number>>>;
  containsPii: boolean;
  warning?: string;   // present only for chunked inputs
}
```

A MarkerStrategy is either a static per-category map or a per-span callback:

```typescript
// Static per-category map
const byCategory: MarkerStrategy = { private_person: "[NAME]", private_email: "[EMAIL]" };

// Dynamic per-span callback (indexed)
const bySpan: MarkerStrategy = (span, i) => `[${span.label.toUpperCase()}_${i}]`;
```

Return null from the callback to leave a given span unredacted.
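For example, a callback that leaves dates and low-confidence spans alone and marks everything else. The SpanArg shape is a local stand-in for DetectedSpan, and the 0.8 threshold is this example's arbitrary choice:

```typescript
// Local stand-in for the span a MarkerStrategy callback receives.
interface SpanArg { label: string; text: string; confidence: number }

// Skip dates and low-confidence spans; everything else gets an indexed marker.
export const markers = (span: SpanArg, i: number): string | null => {
  if (span.label === "private_date") return null; // leave dates in place
  if (span.confidence < 0.8) return null;         // too uncertain to redact
  return `[${span.label.toUpperCase()}_${i}]`;
};
```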

A Rule is a custom detector that runs alongside the model: either a regex (must have the /g flag) or a function returning { start, end } ranges:

```typescript
type RuleSeverity = "block" | "warn" | "track";

type Rule =
  | { label: string; severity?: RuleSeverity; marker?: string; pattern: RegExp }
  | { label: string; severity?: RuleSeverity; marker?: string; match: (text: string) => Array<{ start: number; end: number }> };
```

Output spans carry source: "rule" and the declared severity so an AI proxy can dispatch on them (block → refuse, warn → log, track → annotate). Rules and model spans coexist; on overlap the rule wins.
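A proxy-side dispatch on that severity could be sketched as follows; the OutputSpan shape and the "pass" fallback for severity-less model spans are this example's assumptions, not part of the API:

```typescript
// Local stand-ins mirroring the documented fields (not imported from textsift).
type RuleSeverity = "block" | "warn" | "track";
interface OutputSpan { source?: string; severity?: RuleSeverity; label: string }
type Verdict = "refuse" | "log" | "annotate" | "pass";

// block → refuse, warn → log, track → annotate; spans without a severity
// (model spans) pass through for normal redaction.
export function dispatch(span: OutputSpan): Verdict {
  switch (span.severity) {
    case "block": return "refuse";
    case "warn": return "log";
    case "track": return "annotate";
    default: return "pass";
  }
}
```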

```typescript
import { PrivacyFilter, secretRules } from "textsift";

// Stupidly easy: just enable the preset.
const withPreset = await PrivacyFilter.create({ presets: ["secrets"] });

// Same thing, explicit:
const explicit = await PrivacyFilter.create({ rules: secretRules });

// Mix presets with your own rules:
const mixed = await PrivacyFilter.create({
  presets: ["secrets"],
  rules: [{ label: "JIRA_TICKET", pattern: /\b[A-Z]{2,}-\d+\b/g }],
});
```

The "secrets" preset covers credentials/API keys the model wasn’t trained on: JWT, GitHub PAT (all variants), AWS access key, Slack tokens + webhooks, OpenAI/Anthropic/Google API keys, Stripe live/test keys, Stripe webhook secrets, npm tokens, PEM private-key headers. All default to severity: "block".

Performance note: the runtime unions all regex rules into one alternation and scans the input once, regardless of preset count. N rules ≈ 1 scan.

markerPresets.faker() is a built-in MarkerStrategy factory that emits realistic-looking fake values instead of [label] markers. It is stateful per instance: the same input text always maps to the same fake within the strategy’s lifetime, so relationships across mentions are preserved.

```typescript
import { PrivacyFilter, markerPresets } from "textsift";

const filter = await PrivacyFilter.create({ markers: markerPresets.faker() });
await filter.redact("Hi Alice, alice@example.com");
// → "Hi Alice Anderson, alice.anderson@example.com"
```

secret spans deliberately render as [secret] regardless of preset — emitting another credible-looking secret is a security footgun. See the Faker mode playground for an interactive demo.

```typescript
interface ColumnClassification {
  index: number;
  header?: string;
  label: SpanLabel | string | null;  // null = no PII detected in samples
  confidence: number;                // 0..1, fraction of samples matching
  samples: number;
  labelCounts: Readonly<Record<string, number>>;
}

interface ClassifyTableOptions {
  sampleSize?: number;          // default: 50
  headerRow?: boolean;          // first row is column names
  headers?: readonly string[];  // explicit, overrides headerRow
  rules?: readonly Rule[];
  presets?: readonly string[];
  signal?: AbortSignal;
}

type RedactTableMode = "redact" | "synth" | "drop_column";

interface RedactTableOptions extends ClassifyTableOptions {
  mode?: RedactTableMode;                             // default: "redact"
  classifications?: readonly ColumnClassification[];  // skip the classify step
  markers?: MarkerStrategy;
  enabledCategories?: readonly SpanLabel[];
}
```

Convert detection results to SARIF v2.1.0 for GitHub Code Scanning, GitLab SAST, etc.:

```typescript
import { writeFile } from "node:fs/promises";
import { PrivacyFilter } from "textsift";
import { toSarif, detectResultToSarif } from "textsift/sarif";

const filter = await PrivacyFilter.create();
const result = await filter.detect(text);

// Single file
const singleLog = detectResultToSarif(result, "src/test.ts");

// Multiple files
const multiLog = toSarif(
  [
    { uri: "src/a.ts", text: textA, spans: spansA },
    { uri: "src/b.ts", text: textB, spans: spansB },
  ],
  { modelSpanLevel: "warning" }, // optional: downgrade model spans
);

await writeFile("findings.sarif", JSON.stringify(multiLog, null, 2));
```

The output is GitHub-Code-Scanning-ready: partialFingerprints for cross-run dedup, per-label rule definitions in tool.driver.rules, and severity mapping (model → error, rule:block → error, rule:warn → warning) overridable via modelSpanLevel.

```typescript
type ProgressEvent =
  | { stage: "download"; loaded: number; total: number; url: string }
  | { stage: "cache-hit"; total: number }
  | { stage: "compile"; backend: "webgpu" | "wasm" }
  | { stage: "warmup" }
  | { stage: "ready" };
```
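An exhaustive onProgress handler over this union might look like the sketch below; the type is re-declared locally (copied from above) so the snippet stands alone, and the message strings are this example's choices:

```typescript
// ProgressEvent union copied from the docs (local declaration for a standalone sketch).
type ProgressEvent =
  | { stage: "download"; loaded: number; total: number; url: string }
  | { stage: "cache-hit"; total: number }
  | { stage: "compile"; backend: "webgpu" | "wasm" }
  | { stage: "warmup" }
  | { stage: "ready" };

// Exhaustive switch: the `never` check fails to compile if a stage is unhandled.
export function describe(ev: ProgressEvent): string {
  switch (ev.stage) {
    case "download": return `downloading ${Math.round((ev.loaded / ev.total) * 100)}%`;
    case "cache-hit": return `cached (${ev.total} bytes)`;
    case "compile": return `compiling for ${ev.backend}`;
    case "warmup": return "warming up";
    case "ready": return "ready";
    default: { const unreachable: never = ev; return unreachable; }
  }
}
```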