Pre-commit hook

textsift ships a pre-commit hook that scans staged files for PII before each commit. Same engine as the CLI; runs entirely on the developer machine; no data leaves the host.

.pre-commit-config.yaml
repos:
  - repo: https://github.com/teamchong/textsift
    rev: v0.1.0  # pin to a tag; pre-commit doesn't follow `main`
    hooks:
      - id: textsift-pii-scan

Then in your repo:

pip install pre-commit # one-time, if you don't have it
pre-commit install # wires .git/hooks/pre-commit
pre-commit run --all-files # initial sweep across the whole repo

The first run downloads the model (~770 MB) into ~/.cache/textsift/ (the default cache root; override it with TEXTSIFT_CACHE_DIR). Subsequent commits reuse the cached copy.

Each staged file is run through PrivacyFilter.detect() with the built-in secrets preset enabled. Findings are reported to stderr with line numbers; the commit is blocked if any blocking finding is detected.

Default blocking spans:

  • Model spans — all 8 categories the model detects (private_email, private_phone, private_person, private_address, private_url, private_date, account_number, secret).
  • Rule spans with severity: "block" — the entire secrets preset (JWT, GitHub PAT, AWS, Slack, OpenAI/Anthropic/Google/Stripe keys, PEM private keys).

Example output:

textsift-pii-scan: 3 blocking finding(s) across 2 file(s)
private_person src/test.ts:42:18 "Alice Carter"
private_email src/test.ts:42:35 "alice@example.com"
GITHUB_PAT_CLASSIC scripts/.env:1:14 "ghp_abcdefghijklmnopqrstuvwxyzABCDEF1234"
Bypass with `git commit --no-verify` (do not commit real PII).
Adjust severity with TEXTSIFT_PRECOMMIT_SEVERITY=warn|all|block.

Before invoking the model, the hook skips:

  • Files with extensions in the binary / generated list (.lock, .png, .wasm, .onnx, .pdf, .zip, .woff2, etc.)
  • Files larger than 1 MB (configurable via TEXTSIFT_PRECOMMIT_MAX_BYTES)
  • Empty files
  • Lockfiles (package-lock.json, yarn.lock, pnpm-lock.yaml, bun.lockb)
  • Minified bundles (.min.js, .min.css)

Pre-commit’s own exclude regex in .pre-commit-hooks.yaml is the first line of defence; the hook script’s extension/size check is a second pass for direct invocation.
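
The skip rules above can be sketched as a single predicate. This is a minimal illustration, not textsift's actual source: the function name `shouldSkip`, the constant names, and the abbreviated extension list are assumptions, and the real hook may order or implement the checks differently.

```typescript
// Hypothetical sketch of the pre-model file filter described above.
// The extension list is abbreviated; the real list is longer ("etc." in the docs).
const SKIP_EXTENSIONS = [".lock", ".png", ".wasm", ".onnx", ".pdf", ".zip", ".woff2"];
const LOCKFILES = ["package-lock.json", "yarn.lock", "pnpm-lock.yaml", "bun.lockb"];
const MINIFIED_SUFFIXES = [".min.js", ".min.css"];

function shouldSkip(path: string, sizeBytes: number, maxBytes: number = 1000000): boolean {
  // Compare against the file name only, case-insensitively.
  const name = (path.split("/").pop() ?? path).toLowerCase();

  if (sizeBytes === 0) return true;        // empty file
  if (sizeBytes > maxBytes) return true;   // over the size limit
  if (LOCKFILES.includes(name)) return true;
  if (MINIFIED_SUFFIXES.some((suffix) => name.endsWith(suffix))) return true;
  return SKIP_EXTENSIONS.some((ext) => name.endsWith(ext));
}
```

Under these assumptions, `shouldSkip("assets/logo.png", 5000)` is true and `shouldSkip("src/test.ts", 1200)` is false. The `maxBytes` parameter stands in for the value of TEXTSIFT_PRECOMMIT_MAX_BYTES.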

Severity (block what you actually care about)

.pre-commit-config.yaml
hooks:
  - id: textsift-pii-scan
    args: []  # no flags by default → strict mode
    # Or relax for warn-only on a noisy codebase:
    # args: [--warn-only]

Or via env var (per-developer override without editing the project config):

# Block-mode default — all model PII + secrets-preset block-severity rules
TEXTSIFT_PRECOMMIT_SEVERITY=block git commit ...
# Warn-mode — only true secrets block; model PII reported but doesn't fail
TEXTSIFT_PRECOMMIT_SEVERITY=warn git commit ...
# All-mode — anything detected blocks (most strict)
TEXTSIFT_PRECOMMIT_SEVERITY=all git commit ...
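
One way to read the three modes is as a decision function over each finding. The sketch below is an interpretation of the descriptions above, not the hook's real code: the `Finding` shape, the `"warn"` rule severity, and the name `isBlocking` are all assumptions made for illustration.

```typescript
// Hypothetical model of how a severity mode decides whether a finding blocks.
type Mode = "block" | "warn" | "all";

interface Finding {
  source: "model" | "rule";
  // Assumed: only rule spans carry a severity, and "warn" exists alongside "block".
  severity?: "block" | "warn";
}

function isBlocking(finding: Finding, mode: Mode): boolean {
  const isBlockRule = finding.source === "rule" && finding.severity === "block";
  switch (mode) {
    case "all":
      return true; // anything detected blocks
    case "warn":
      return isBlockRule; // only block-severity rules (the secrets preset) block
    case "block":
      return finding.source === "model" || isBlockRule; // model PII + block-severity rules
  }
}
```

With this reading, a `private_email` model span blocks the commit in block and all modes but is only reported in warn mode, while a secrets-preset match blocks in every mode.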

In CI, add a step to download the model before pre-commit runs so the hook is always cache-warm:

.github/workflows/lint.yml
- uses: actions/cache@v4
  with:
    path: ~/.cache/textsift
    key: textsift-model-${{ hashFiles('.pre-commit-config.yaml') }}
- run: npx textsift download
- uses: pre-commit/[email protected]
  env:
    TEXTSIFT_OFFLINE: "1"  # fail fast if cache miss

| Env var | Default | Effect |
| --- | --- | --- |
| TEXTSIFT_PRECOMMIT_SEVERITY | block | block / warn / all (see above) |
| TEXTSIFT_PRECOMMIT_MAX_BYTES | 1000000 | Skip files larger than N bytes |
| TEXTSIFT_PRECOMMIT_SECRETS | 1 | Set to 0 to disable the secrets rule preset |
| TEXTSIFT_OFFLINE | unset | Fail loudly on cache miss (no fetch). Useful in CI. |
| TEXTSIFT_CACHE_DIR | $XDG_CACHE_HOME/textsift | Override cache root |
| TEXTSIFT_MODEL_PATH | unset | Use a pre-staged ONNX file; skip cache + fetch |
| TEXTSIFT_MODEL_SOURCE | HuggingFace | Mirror or fork URL |
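
To make the defaults concrete, here is a rough sketch of how a hook script might resolve four of these variables. The `HookConfig` shape and `readConfig` name are hypothetical, and the fallback behaviour for invalid values (reverting to the default) and the truthiness rule for TEXTSIFT_OFFLINE are assumptions; the real hook may reject bad input instead.

```typescript
// Hypothetical resolution of hook settings from environment variables.
type Severity = "block" | "warn" | "all";

interface HookConfig {
  severity: Severity;
  maxBytes: number;
  secrets: boolean;
  offline: boolean;
}

function readConfig(env: Record<string, string | undefined>): HookConfig {
  const rawSeverity = env["TEXTSIFT_PRECOMMIT_SEVERITY"];
  // Assumed: unknown or missing values fall back to the documented default.
  const severity: Severity =
    rawSeverity === "warn" || rawSeverity === "all" ? rawSeverity : "block";

  const parsedMax = Number(env["TEXTSIFT_PRECOMMIT_MAX_BYTES"]);
  const maxBytes = Number.isFinite(parsedMax) && parsedMax > 0 ? parsedMax : 1000000;

  // The secrets preset is on unless explicitly set to "0".
  const secrets = env["TEXTSIFT_PRECOMMIT_SECRETS"] !== "0";

  // Assumed: any non-empty value other than "0" enables offline mode.
  const rawOffline = env["TEXTSIFT_OFFLINE"];
  const offline = rawOffline !== undefined && rawOffline !== "" && rawOffline !== "0";

  return { severity, maxBytes, secrets, offline };
}
```

In a Node script this would be called as `readConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.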

If the hook flags a false positive or you genuinely need to commit (e.g. pasting test data), bypass with:

git commit --no-verify -m "..."

This skips all hooks, not just textsift, so use it sparingly. For a recurring false positive, the intended long-term fix is an inline allow marker in the file (e.g. // pragma: textsift-allow private_phone), but this isn’t wired in yet as of v0.1.

First commit on a fresh dev machine: ~30 sec (model download + warmup; the download time depends on your connection). Subsequent commits take ~1.5–3 sec, depending on platform and how many files changed: the model loads from disk in ~1 sec, and per-file detection takes under 100 ms on Mac (Metal-direct) and under 200 ms on Linux (Vulkan-direct).

The hook script loads the model once and scans all staged files in-process. Invoking the CLI once per file would instead pay the cold-start cost for each of the N files.
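
As a back-of-the-envelope illustration of that difference, the cost model below plugs in the approximate figures quoted above (about 1 sec to load the model, about 100 ms per file). Both function names are hypothetical, and real timings will vary by machine and file size.

```typescript
// Rough cost model: load the model once vs. once per file.
function inProcessMs(files: number, loadMs: number, perFileMs: number): number {
  return loadMs + files * perFileMs; // model loaded a single time
}

function cliPerFileMs(files: number, loadMs: number, perFileMs: number): number {
  return files * (loadMs + perFileMs); // model reloaded for every file
}

console.log(inProcessMs(10, 1000, 100));  // 2000
console.log(cliPerFileMs(10, 1000, 100)); // 11000
```

For ten staged files, the in-process approach comes out at roughly 2 sec against roughly 11 sec for a per-file CLI, and the gap widens linearly with the number of files.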

| Tool | Catches | Doesn’t catch |
| --- | --- | --- |
| gitleaks | High-precision regex secrets | Names, emails, phones, addresses |
| trufflehog | Verified secrets (validates against APIs) | General PII |
| detect-secrets | Plaintext secrets via heuristics | General PII |
| textsift-pii-scan | All of the above + model-detected PII (names, emails, phone numbers, addresses, etc.) | (Should be paired with one of the above for high-precision secrets scanning if you want belt-and-braces) |

textsift’s secrets preset overlaps heavily with gitleaks’s pattern set. If you already use gitleaks, you can disable textsift’s secrets preset and let textsift focus on what only it catches:

TEXTSIFT_PRECOMMIT_SECRETS=0 pre-commit run --all-files

…or keep both for belt-and-braces coverage.