Anonymization for LLMs

Escriba can strip or replace personal data before the text reaches an LLM — and put it back afterwards. The heavy NER model runs in a separate, internal-only service (Anonimal, bundling the OpenAI Privacy Filter), enabled by pointing ANONIMAL_URL at it.

High recall by design — several detectors stack on top of each other:

  • NER model — names, organizations, locations, dates.
  • Layout-aware invoice fields — reads label → value by PDF coordinates (company name, tax ID, address…), masking structured documents field by field.
  • 20 toggleable detectors, per user — universal (email, URL, IP, MAC, credit-card Luhn-validated, IBAN mod-97), regional (e.g. Argentine CUIT/CUIL/CBU/DNI/addresses) and aggressive (long numbers, name sequences).
  • Bring Your Own Rules — upload a JSON of your own patterns/labels/keep-list. Your regex runs on RE2 (linear time → ReDoS-proof), with strict JSON parsing and hard limits.
  • Entity propagation — anything detected once is masked in every occurrence.
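
The two checksum validations named in the detector list (Luhn for card numbers, mod-97 for IBANs) can be sketched as follows. This is a minimal illustration of the standard algorithms, not Escriba's actual code; the function names and the 12-digit lower bound are assumptions made for the example.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed lower bound, to skip short numbers
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` has a plausible shape and passes the mod-97 check."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move country code + check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

Validating the checksum after a regex match is what keeps a "long number" detector from flagging every 16-digit string as a card: only candidates that pass the check are treated as card numbers or IBANs.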

| Mode | Output | Use |
|---|---|---|
| Typed | <PRIVATE_PERSON>, <ACCOUNT_NUMBER> | keep the category visible |
| Anonymous | <<ANOM_DATA>> | flatten everything |
| Pseudonymize | «PERSONA_1» + a token→original map | the LLM gateway: anonymize → send → re-hydrate locally |
| Partial mask | ••••-3456, j•••@domain.com | keep a usable hint (irreversible) |
| Stable hash | «PERSONA_7590fc» | same data → same pseudonym across documents (irreversible) |
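
As a rough illustration of the two irreversible modes, the sketch below produces outputs shaped like the examples in the table. Everything here is an assumption made for the example: the function names, the use of a keyed HMAC-SHA256 digest, the normalization step, and the six-character suffix are not confirmed by the source.

```python
import hashlib
import hmac


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}•••@{domain}"


def mask_number(number: str, keep: int = 4) -> str:
    """Keep only the last `keep` digits of a numeric identifier."""
    digits = [c for c in number if c.isdigit()]
    return "••••-" + "".join(digits[-keep:])


def stable_pseudonym(value: str, label: str, key: bytes) -> str:
    """Map the same value to the same token, with no stored restore map."""
    # Assumed normalization: collapse whitespace and ignore case.
    normalized = " ".join(value.split()).casefold()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"«{label}_{digest[:6]}»"
```

A keyed digest is used in this sketch because an unkeyed hash of a short value (such as an ID number) can be reversed by brute force; whether Escriba keys its hashes is not stated in the source.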

Two intensities are available (Balanced / Strict), and every setting is configurable per browser. The restore map and your custom rules never leave your machine.

The pseudonymize mode is the centerpiece:

  1. Convert with Pseudonymize — names become «PERSONA_1», IDs become «ID_2», etc.
  2. Send the safe text to any LLM. The model never sees the real data.
  3. Paste the reply into Re-hydrate — Escriba restores the real values, entirely in your browser, using a map that never touched the server.
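
The three steps above can be sketched as a round trip. This is a minimal Python illustration, not Escriba's implementation (the re-hydrate step is described as running in the browser); the function names, the `(value, label)` input shape, and the token regex are assumptions made for the example.

```python
import re

# Assumed token shape, modeled on «PERSONA_1» and «ID_2».
TOKEN = re.compile(r"«[A-Z]+_\d+»")


def pseudonymize(text, entities):
    """Replace each detected value with a numbered token.

    `entities` is a list of (original_value, label) pairs, assumed to
    come from the detectors. Returns (safe_text, token_map).
    """
    value_to_token = {}
    token_map = {}  # token -> original; this is the map that stays local
    for value, label in entities:
        if value and value not in value_to_token:
            token = f"«{label}_{len(value_to_token) + 1}»"
            value_to_token[value] = token
            token_map[token] = value
    if not value_to_token:
        return text, token_map
    # Longest values first, so "Ana Pérez" wins over a shorter "Ana".
    ordered = sorted(value_to_token, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    safe = pattern.sub(lambda m: value_to_token[m.group(0)], text)
    return safe, token_map


def rehydrate(reply, token_map):
    """Put the original values back; unknown tokens are left as-is."""
    return TOKEN.sub(lambda m: token_map.get(m.group(0), m.group(0)), reply)


text = "Ana Pérez (DNI 30123456) wrote to Ana Pérez's bank."
safe, token_map = pseudonymize(text, [("Ana Pérez", "PERSONA"), ("30123456", "ID")])
print(safe)  # «PERSONA_1» (DNI «ID_2») wrote to «PERSONA_1»'s bank.
print(rehydrate("Summary: «PERSONA_1» holds ID «ID_2».", token_map))
```

Because every occurrence of a value maps to the same token, the LLM can still reason about who did what, and the reply can be restored with a plain lookup.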

For PDFs and scanned images, the result card offers a “Redacted PDF” download: every detected entity is blacked out on the page using true redaction — apply_redactions removes the underlying text and the image pixels beneath each box, so the data no longer exists in the output file. The PDF’s metadata is wiped too (DocInfo + XMP), so a redacted file can’t leak the name or ID via Properties or exiftool. Scanned documents are OCR’d first. Same detection stack, zero extra RAM.
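
The redaction flow described above can be sketched with PyMuPDF, whose `Page.apply_redactions` matches the method name mentioned; that Escriba uses PyMuPDF is an assumption. The function name `redact_pdf` and the plain-string search are also illustrative: in the real pipeline the boxes would come from the detection stack, not from a text search.

```python
def redact_pdf(src_path, dst_path, needles):
    """Black out every occurrence of each string in `needles`, then strip metadata."""
    import fitz  # PyMuPDF, imported lazily so this module loads without it

    doc = fitz.open(src_path)
    try:
        for page in doc:
            for needle in needles:
                # search_for returns one rectangle per hit on the page.
                for rect in page.search_for(needle):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
            # Remove the text under each box and blank the overlapped image pixels.
            page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_PIXELS)
        doc.set_metadata({})      # clear the DocInfo dictionary
        doc.del_xml_metadata()    # drop the XMP packet
        # garbage=4 discards unreferenced objects left behind by the edits.
        doc.save(dst_path, garbage=4, deflate=True)
    finally:
        doc.close()
```

Drawing a black rectangle alone would leave the text selectable underneath; applying the redaction annotations is what removes the content from the file.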