Anonymization for LLMs
Escriba can strip or replace personal data before the text reaches an LLM — and
put it back afterwards. The heavy NER model runs in a separate, internal-only service
(Anonimal, bundling the OpenAI Privacy Filter),
enabled by pointing ANONIMAL_URL at it.
Layered detection
High recall by design. Several detectors stack on top of each other:
- NER model — names, organizations, locations, dates.
- Layout-aware invoice fields — reads label → value by PDF coordinates (company name, tax ID, address…), masking structured documents field by field.
- 20 toggleable detectors, per user — universal (email, URL, IP, MAC, credit-card Luhn-validated, IBAN mod-97), regional (e.g. Argentine CUIT/CUIL/CBU/DNI/addresses) and aggressive (long numbers, name sequences).
- Bring Your Own Rules — upload a JSON of your own patterns/labels/keep-list. Your regex runs on RE2 (linear time → ReDoS-proof), with strict JSON parsing and hard limits.
- Entity propagation — anything detected once is masked in every occurrence.
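As an illustration of the checksum-backed detectors in the list above, here is a minimal Python sketch of the standard Luhn and IBAN mod-97 checks. The function names `luhn_ok` and `iban_ok`, the 12-digit minimum, and the IBAN shape regex are assumptions made for this example; they are not Escriba's actual code or thresholds.

```python
import re


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed lower bound to skip short numbers
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_ok(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check (remainder must be 1)."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    # Letters map to 10..35 (A=10), digits map to themselves.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

Validating the checksum after a pattern match is what lets a detector like this flag a card number or IBAN without also flagging every long digit run, which is left to the separate "aggressive" detectors.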
Five output modes
| Mode | Output | Use |
|---|---|---|
| Typed | <PRIVATE_PERSON>, <ACCOUNT_NUMBER>… | keep the category visible |
| Anonymous | <<ANOM_DATA>> | flatten everything |
| Pseudonymize | «PERSONA_1» + a token→original map | the LLM gateway — anonymize → send → re-hydrate locally |
| Partial mask | ••••-3456, j•••@domain.com | keep a usable hint — irreversible |
| Stable hash | «PERSONA_7590fc» | same data → same pseudonym across documents — irreversible |
Two intensities (Balanced / Strict), all configurable per browser. The restore map and your custom rules never leave your machine.
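The two irreversible modes in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and the use of a keyed HMAC-SHA256 truncated to six hex characters is a guess that reproduces the shape of `«PERSONA_7590fc»`, not a description of how Escriba derives its hashes.

```python
import hashlib
import hmac


def partial_mask_card(number: str) -> str:
    """Keep only the last four digits, e.g. ••••-3456."""
    digits = [c for c in number if c.isdigit()]
    return "••••-" + "".join(digits[-4:])


def partial_mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}•••@{domain}"


def stable_hash(value: str, label: str, key: bytes) -> str:
    """Map the same value to the same pseudonym, with no way back."""
    # Normalizing whitespace and case is an assumption of this sketch.
    normalized = " ".join(value.split()).casefold()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"«{label}_{digest[:6]}»"
```

Because the hash depends only on the value and the key, the same person gets the same pseudonym in every document processed with that key, while a different key yields unrelated pseudonyms.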
The LLM gateway pattern
The pseudonymize mode is the centerpiece:
- Convert with Pseudonymize: names become «PERSONA_1», IDs become «ID_2», etc.
- Send the safe text to any LLM. The model never sees the real data.
- Paste the reply into Re-hydrate — Escriba restores the real values, entirely in your browser, using a map that never touched the server.
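The round trip described in the steps above can be sketched as follows. This is a simplified Python illustration, not Escriba's implementation (which runs the restore step in the browser): `pseudonymize`, `rehydrate`, and the `(surface, label)` entity pairs are hypothetical names, and the detector that produces the pairs is assumed to exist elsewhere.

```python
import re

TOKEN = re.compile(r"«[A-Z_]+_\d+»")


def pseudonymize(text: str, entities: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace every occurrence of each detected entity with a numbered token."""
    tokens: dict[str, str] = {}
    for surface, label in entities:
        if surface not in tokens:
            tokens[surface] = f"«{label}_{len(tokens) + 1}»"
    if not tokens:
        return text, {}
    # One pass, longest match first, so "Jane Doe" wins over "Jane"
    # and already inserted tokens are never rescanned.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    safe = pattern.sub(lambda m: tokens[m.group(0)], text)
    mapping = {token: surface for surface, token in tokens.items()}
    return safe, mapping


def rehydrate(reply: str, mapping: dict[str, str]) -> str:
    """Restore original values; unknown tokens are left untouched."""
    return TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), reply)


text = "Jane Doe (CUIT 20-12345678-9) wrote to us. Please call Jane Doe."
safe, mapping = pseudonymize(text, [("Jane Doe", "PERSONA"), ("20-12345678-9", "ID")])
# `safe` is what goes to the LLM; `mapping` stays local.
restored = rehydrate("Summary: «PERSONA_1» has ID «ID_2».", mapping)
```

Only `safe` needs to leave the machine. Since the map is the sole link between tokens and real values, keeping it local is what makes the reply meaningless to anyone who intercepts it.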
Visual redaction
For PDFs and scanned images, the result card offers a “Redacted PDF” download:
every detected entity is blacked out on the page using true redaction —
apply_redactions removes the underlying text and the image pixels beneath each
box, so the data no longer exists in the output file. The PDF’s metadata is wiped
too (DocInfo + XMP), so a redacted file can’t leak the name or ID via Properties or
exiftool. Scanned documents are OCR’d first. Same detection stack, zero extra RAM.
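The name `apply_redactions` matches the `Page.apply_redactions` method of PyMuPDF, so the sketch below assumes that library; the page itself does not name it. The function `redact_pdf` and its `needles` argument (plain strings standing in for detected entities) are hypothetical, and OCR of scanned pages is not covered.

```python
def redact_pdf(src: str, dst: str, needles: list[str]) -> int:
    """Black out every occurrence of each needle; return the number of boxes."""
    import fitz  # PyMuPDF (assumed dependency), imported lazily

    boxes = 0
    with fitz.open(src) as doc:
        for page in doc:
            for needle in needles:
                for rect in page.search_for(needle):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    boxes += 1
            # Removes the text under each box and blanks overlapping image pixels.
            page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_PIXELS)
        doc.set_metadata({})    # clear the DocInfo dictionary
        doc.del_xml_metadata()  # drop the XMP packet
        doc.save(dst, garbage=4, deflate=True)
    return boxes
```

Drawing a black rectangle alone would leave the text selectable underneath. Applying the redaction annotations is the step that deletes the content, and saving with garbage collection drops objects that are no longer referenced.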