Benchmarks
These numbers are snapshots, not guarantees — they will look different on your hardware. Each section names the host it was measured on.
Browser (M3 Pro, Chromium 147)
Section titled “Browser (M3 Pro, Chromium 147)”Forward latency, median of 5 samples after 2 warmup iterations. Reproduce with tests/browser/bench.spec.ts. Baseline (tjs) is forced via createWithResolver so the umbrella’s auto-WebGPU short-circuit can’t accidentally measure WebGPU twice. Browser flags: --enable-unsafe-webgpu --use-angle=metal.
| Input length | textsift (WebGPU) | textsift (WASM MT) | tjs (WebGPU default) |
|---|---|---|---|
| ~7 tokens | 8.9 ms | 29.0 ms | 32.7 ms |
| ~25 tokens | 11.8 ms | 44.6 ms | 38.5 ms |
| ~80 tokens | 22.0 ms | 95.9 ms | 56.4 ms |
textsift’s WebGPU is 2.6–3.7× faster than transformers.js across every input length, with the largest gap on shorter inputs where ORT-Web’s per-dispatch overhead is more visible. The differential reflects the kernels (textsift’s hand-tuned int4 matmul / banded attention vs ORT-Web’s int4 contrib ops).
Sustained throughput (30-forward sequential loop, tok/s):
| Input length | textsift (WebGPU) | textsift (WASM MT) | tjs (WebGPU default) |
|---|---|---|---|
| ~7 tokens | 801 tok/s | 249 tok/s | 249 tok/s |
| ~25 tokens | 2068 tok/s | 558 tok/s | 644 tok/s |
| ~80 tokens | 3644 tok/s | 840 tok/s | 1396 tok/s |
Same 2.6–3.2× lead. Neither engine pipelines forwards (each await detect() blocks on the previous), so throughput ≈ 1/latency × tokens.
Node native — macOS (M2 Pro, Metal-direct)
Section titled “Node native — macOS (M2 Pro, Metal-direct)”Synthetic-weight forward at production model dimensions, median of 10 samples after 3 warmup iterations. Reproduce with T=32 node tests/native/forward-metal.js.
| T | textsift native | Browser textsift WebGPU (M3 Pro) | tjs WebGPU (M3 Pro) |
|---|---|---|---|
| 7 | 5.2 ms | 8.9 ms | 32.7 ms |
| 25 | 10.0 ms | 11.8 ms | 38.5 ms |
| 32 | 10.8 ms | — | — |
| 80 | 23.8 ms | 22.0 ms | 56.4 ms |
Hand-written MSL beats Dawn’s WGSL→MSL codegen by ~1.9× at T=32 on the same hardware (M2 Pro vs M3 Pro is roughly the same generation, M3 Pro slightly faster — same-hardware comparison would widen the gap). The gap is real codegen quality: control over loop unrolling, threadgroup memory layout, simdgroup matrix ops on M3+.
End-to-end PrivacyFilter.redact() on a 122-character input with 4 PII spans: ~110 ms (includes BPE tokenization, forward, Viterbi decode, span extraction, and replacement).
Node native — Linux (Intel Iris Xe, Vulkan-direct)
Section titled “Node native — Linux (Intel Iris Xe, Vulkan-direct)”Reproduce with T=32 node tests/native/forward-vulkan.js.
| T | textsift native | ONNX Runtime Node CPU | Speedup |
|---|---|---|---|
| 32 | 28 ms | ~800 ms | 28× |
This is the real Linux story: GPU-accelerated PII detection on Intel iGPU / AMD APU / non-NVIDIA hardware without CUDA, without ROCm, without a driver dance. npm install textsift ships a vendored Vulkan-direct binary that talks to whatever Mesa-supported GPU is there.
The realistic alternatives on Linux without textsift:
| Path | Setup | T=32 latency on a typical iGPU box |
|---|---|---|
| ONNX Runtime Node CPU | npm i onnxruntime-node, write your own inference loop | ~600–800 ms |
| transformers.js (Node) | npm i @xenova/transformers, no GPU on Node so WASM | ~80–100 ms |
| PyTorch CPU | pip install torch transformers safetensors, write inference | ~150–500 ms |
| PyTorch CUDA | NVIDIA GPU + driver + cuda-toolkit + matched torch wheel | n/a (no NVIDIA on most laptops) |
| textsift native | npm install textsift | ~28 ms |
End-to-end PrivacyFilter.redact() on a 122-character input with 4 PII spans: ~50–75 ms.
Cold start
Section titled “Cold start”We don’t claim a cold-start speedup over transformers.js. Here’s why.
A meaningful “engine cold start” comparison would isolate parse + compile + first-inference cost from storage I/O, with the model bytes pre-loaded for every backend. We can’t do that cleanly: ORT-Web fetches model bytes inside a worker thread, which doesn’t see main-thread fetch overrides — to inject pre-loaded bytes there we’d need a Service Worker, which entangles the bench with browser plumbing more than the answer is worth.
What we can say:
- textsift’s WebGPU init (cache hit, model on OPFS): 0.93 s end-to-end. See breakdown below.
- transformers.js with default config + a 770 MB model: re-fetches every visit because the Cache API silently rejects payloads above its per-resource limit (
QuotaExceededError). That’s a real user-visible cost in default config, but it’s a storage decision, not an engine one — a transformers.js user who plugs in an OPFS-backed adapter closes the gap. - The factual claim: “textsift caches large models persistently with no configuration”. Not a faster inference engine; a better-defaulted one.
Warmup breakdown (cache hit)
Section titled “Warmup breakdown (cache hit)”From tests/browser/webgpu-coldstart.spec.ts, second warmup:
| Step | Cost |
|---|---|
| Adapter + device request | 0.20 s (Metal driver warm-up) |
| OPFS read (770 MB model) | 0.36 s |
| ONNX parse | 0.001 s |
| GPU buffer upload | 0.38 s |
| Pipeline compile (14 WGSL shaders, parallel) | 0.002 s |
| Total | 0.93 s |
Accuracy
Section titled “Accuracy”openai/privacy-filter reports F1 96% on PII-Masking-300k in OpenAI’s own evaluation. textsift doesn’t change the model — both backends produce byte-identical span output on the same input (tests/browser/smoke.spec.ts, webgpu-smoke.spec.ts). The detection accuracy is the model’s, not ours.
The forward-parity test (tests/browser/webgpu-forward.spec.ts) confirms argmax agrees on 32/32 tokens of a deterministic input, with logit RMS drift of 0.18 vs the WASM reference — fp16 rounding compounding across 8 transformer layers. Argmax (and downstream Viterbi + span output) is preserved.
Bundle size
Section titled “Bundle size”| Entry | Minified | Gzipped (over the wire) |
|---|---|---|
textsift/browser | 630 KB | 76 KB |
Plus a 90 KB .wasm file loaded async (not in the JS bundle). The native entry (textsift bare import) ships zero bytes for browser bundlers — they only resolve the ./browser subpath.
Reproduce
Section titled “Reproduce”npm installnpm run buildnpx playwright test bench.spec.tsFirst run downloads the 770 MB model to OPFS (~15 seconds), subsequent runs are cache-hit fast.