Skip to content

Benchmarks

These numbers are snapshots, not guarantees — they will look different on your hardware. Each section names the host it was measured on.

Forward latency, median of 5 samples after 2 warmup iterations. Reproduce with tests/browser/bench.spec.ts. Baseline (tjs) is forced via createWithResolver so the umbrella’s auto-WebGPU short-circuit can’t accidentally measure WebGPU twice. Browser flags: --enable-unsafe-webgpu --use-angle=metal.

Input lengthtextsift (WebGPU)textsift (WASM MT)tjs (WebGPU default)
~7 tokens8.9 ms29.0 ms32.7 ms
~25 tokens11.8 ms44.6 ms38.5 ms
~80 tokens22.0 ms95.9 ms56.4 ms

textsift’s WebGPU is 2.6–3.7× faster than transformers.js across every input length, with the largest gap on shorter inputs where ORT-Web’s per-dispatch overhead is more visible. The differential reflects the kernels (textsift’s hand-tuned int4 matmul / banded attention vs ORT-Web’s int4 contrib ops).

Sustained throughput (30-forward sequential loop, tok/s):

Input lengthtextsift (WebGPU)textsift (WASM MT)tjs (WebGPU default)
~7 tokens801 tok/s249 tok/s249 tok/s
~25 tokens2068 tok/s558 tok/s644 tok/s
~80 tokens3644 tok/s840 tok/s1396 tok/s

Same 2.6–3.2× lead. Neither engine pipelines forwards (each await detect() blocks on the previous), so throughput ≈ 1/latency × tokens.

Node native — macOS (M2 Pro, Metal-direct)

Section titled “Node native — macOS (M2 Pro, Metal-direct)”

Synthetic-weight forward at production model dimensions, median of 10 samples after 3 warmup iterations. Reproduce with T=32 node tests/native/forward-metal.js.

Ttextsift nativeBrowser textsift WebGPU (M3 Pro)tjs WebGPU (M3 Pro)
75.2 ms8.9 ms32.7 ms
2510.0 ms11.8 ms38.5 ms
3210.8 ms
8023.8 ms22.0 ms56.4 ms

Hand-written MSL beats Dawn’s WGSL→MSL codegen by ~1.9× at T=32 on the same hardware (M2 Pro vs M3 Pro is roughly the same generation, M3 Pro slightly faster — same-hardware comparison would widen the gap). The gap is real codegen quality: control over loop unrolling, threadgroup memory layout, simdgroup matrix ops on M3+.

End-to-end PrivacyFilter.redact() on a 122-character input with 4 PII spans: ~110 ms (includes BPE tokenization, forward, Viterbi decode, span extraction, and replacement).

Node native — Linux (Intel Iris Xe, Vulkan-direct)

Section titled “Node native — Linux (Intel Iris Xe, Vulkan-direct)”

Reproduce with T=32 node tests/native/forward-vulkan.js.

Ttextsift nativeONNX Runtime Node CPUSpeedup
3228 ms~800 ms28×

This is the real Linux story: GPU-accelerated PII detection on Intel iGPU / AMD APU / non-NVIDIA hardware without CUDA, without ROCm, without a driver dance. npm install textsift ships a vendored Vulkan-direct binary that talks to whatever Mesa-supported GPU is there.

The realistic alternatives on Linux without textsift:

PathSetupT=32 latency on a typical iGPU box
ONNX Runtime Node CPUnpm i onnxruntime-node, write your own inference loop~600–800 ms
transformers.js (Node)npm i @xenova/transformers, no GPU on Node so WASM~80–100 ms
PyTorch CPUpip install torch transformers safetensors, write inference~150–500 ms
PyTorch CUDANVIDIA GPU + driver + cuda-toolkit + matched torch wheeln/a (no NVIDIA on most laptops)
textsift nativenpm install textsift~28 ms

End-to-end PrivacyFilter.redact() on a 122-character input with 4 PII spans: ~50–75 ms.

We don’t claim a cold-start speedup over transformers.js. Here’s why.

A meaningful “engine cold start” comparison would isolate parse + compile + first-inference cost from storage I/O, with the model bytes pre-loaded for every backend. We can’t do that cleanly: ORT-Web fetches model bytes inside a worker thread, which doesn’t see main-thread fetch overrides — to inject pre-loaded bytes there we’d need a Service Worker, which entangles the bench with browser plumbing more than the answer is worth.

What we can say:

  • textsift’s WebGPU init (cache hit, model on OPFS): 0.93 s end-to-end. See breakdown below.
  • transformers.js with default config + a 770 MB model: re-fetches every visit because the Cache API silently rejects payloads above its per-resource limit (QuotaExceededError). That’s a real user-visible cost in default config, but it’s a storage decision, not an engine one — a transformers.js user who plugs in an OPFS-backed adapter closes the gap.
  • The factual claim: “textsift caches large models persistently with no configuration”. Not a faster inference engine; a better-defaulted one.

From tests/browser/webgpu-coldstart.spec.ts, second warmup:

StepCost
Adapter + device request0.20 s (Metal driver warm-up)
OPFS read (770 MB model)0.36 s
ONNX parse0.001 s
GPU buffer upload0.38 s
Pipeline compile (14 WGSL shaders, parallel)0.002 s
Total0.93 s

openai/privacy-filter reports F1 96% on PII-Masking-300k in OpenAI’s own evaluation. textsift doesn’t change the model — both backends produce byte-identical span output on the same input (tests/browser/smoke.spec.ts, webgpu-smoke.spec.ts). The detection accuracy is the model’s, not ours.

The forward-parity test (tests/browser/webgpu-forward.spec.ts) confirms argmax agrees on 32/32 tokens of a deterministic input, with logit RMS drift of 0.18 vs the WASM reference — fp16 rounding compounding across 8 transformer layers. Argmax (and downstream Viterbi + span output) is preserved.

EntryMinifiedGzipped (over the wire)
textsift/browser630 KB76 KB

Plus a 90 KB .wasm file loaded async (not in the JS bundle). The native entry (textsift bare import) ships zero bytes for browser bundlers — they only resolve the ./browser subpath.

Terminal window
npm install
npm run build
npx playwright test bench.spec.ts

First run downloads the 770 MB model to OPFS (~15 seconds), subsequent runs are cache-hit fast.