Anonymization for LLMs

Escriba can strip or replace personal data before the text reaches an LLM, and put it back afterwards. The heavy NER model runs in a separate, internal-only service (Anonimal, which bundles the OpenAI Privacy Filter); to enable it, point ANONIMAL_URL at that service.

Layered detection

High recall by design — several detectors stack on top of each other:

NER model: names, organizations, locations, dates.
Layout-aware invoice fields: pairs each label with its value using PDF coordinates (company name, tax ID, address…) and masks structured documents field by field.
20 toggleable detectors, per user: universal (email, URL, IP, MAC, Luhn-validated credit card, mod-97-checked IBAN), regional (e.g. Argentine CUIT/CUIL/CBU/DNI/addresses) and aggressive (long numbers, name sequences).
Bring Your Own Rules: upload a JSON file with your own patterns, labels and keep-list. Your regexes run on RE2, whose linear-time matching rules out ReDoS, with strict JSON parsing and hard limits.
Entity propagation: anything detected once is masked in every occurrence.
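
The checksum-validated detectors mentioned above (Luhn for card numbers, mod-97 for IBANs) can be illustrated with a short sketch. This is not Escriba's code: the function names `luhn_valid` and `iban_valid` and the minimum-length cutoff are assumptions made for the example. The point is that a candidate match is only treated as a card number or IBAN if its check digits agree, which cuts false positives from arbitrary long numbers.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed cutoff: too short to be a card number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the ISO 13616 mod-97 check."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))
print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

A real detector would first find candidates with a pattern and then call a validator like these before masking. The sketch does not check country-specific IBAN lengths.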

Five output modes

Mode	Output	Use
Typed	`<PRIVATE_PERSON>`, `<ACCOUNT_NUMBER>`…	keep the category visible
Anonymous	`<<ANOM_DATA>>`	flatten everything
Pseudonymize	`«PERSONA_1»` + a token→original map	the LLM gateway — anonymize → send → re-hydrate locally
Partial mask	`••••-3456`, `j•••@domain.com`	keep a usable hint — irreversible
Stable hash	`«PERSONA_7590fc»`	same data → same pseudonym across documents — irreversible

Detection runs at one of two intensities, Balanced or Strict, and all of these options are configurable per browser. The restore map and your custom rules never leave your machine.
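
To make the two irreversible modes in the table concrete, here is a minimal sketch of partial masking and stable hashing. How Escriba actually derives its pseudonyms is not stated in this document, so everything here is an assumption for illustration: the helper names, the normalization step, the use of a keyed HMAC-SHA-256, and the truncation to six hex characters (chosen only to match the shape of `«PERSONA_7590fc»`).

```python
import hashlib
import hmac


def partial_mask_card(number: str) -> str:
    """Keep only the last four digits, e.g. ••••-3456."""
    digits = "".join(c for c in number if c.isdigit())
    return "••••-" + digits[-4:]


def partial_mask_email(address: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = address.partition("@")
    return local[:1] + "•••@" + domain


def stable_pseudonym(value: str, label: str, key: bytes) -> str:
    """Map the same value to the same pseudonym across documents.

    Assumption: a keyed hash (HMAC), so that someone without the key
    cannot confirm a guess by hashing candidate names themselves.
    """
    normalized = " ".join(value.split()).casefold()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"«{label}_{digest[:6]}»"


print(partial_mask_card("4111 1111 1111 3456"))
print(partial_mask_email("jane@domain.com"))
print(stable_pseudonym("Ada Lovelace", "PERSONA", key=b"example-key"))
```

Neither mode keeps a restore map, which is why both are listed as irreversible. A six-character suffix carries only 24 bits, so in a large corpus two different values can end up with the same pseudonym; a longer suffix would reduce that risk.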

The LLM gateway pattern

The pseudonymize mode is the centerpiece:

Convert with Pseudonymize — names become «PERSONA_1», IDs become «ID_2», etc.
Send the safe text to any LLM. The model never sees the real data.
Paste the reply into Re-hydrate — Escriba restores the real values, entirely in your browser, using a map that never touched the server.
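
The three steps above can be sketched as a round trip. This is an illustration under stated assumptions, not Escriba's implementation: the function names `pseudonymize` and `rehydrate`, the `(text, label)` entity format, and the token regex are all hypothetical. It also shows entity propagation, since every occurrence of a detected value receives the same token.

```python
import re


def pseudonymize(text, entities):
    """Replace each detected entity with a numbered token.

    `entities` is a list of (surface_text, label) pairs, assumed to come
    from the detectors. Returns the safe text and the token -> original map.
    """
    token_for = {}
    for surface, label in entities:
        if surface not in token_for:
            token_for[surface] = f"«{label}_{len(token_for) + 1}»"
    if not token_for:
        return text, {}
    # Longest first, in a single pass, so a short entity never clobbers
    # part of a longer one or part of an already inserted token.
    ordered = sorted(token_for, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    safe = pattern.sub(lambda m: token_for[m.group(0)], text)
    restore_map = {token: surface for surface, token in token_for.items()}
    return safe, restore_map


TOKEN_RE = re.compile(r"«[A-Z_]+_\d+»")


def rehydrate(reply, restore_map):
    """Put the original values back; unknown tokens are left untouched."""
    return TOKEN_RE.sub(lambda m: restore_map.get(m.group(0), m.group(0)), reply)


text = "Ana Pérez (DNI 30123456) wrote to us. Please reply to Ana Pérez."
entities = [("Ana Pérez", "PERSONA"), ("30123456", "ID")]

safe, restore_map = pseudonymize(text, entities)
print(safe)

# Only `safe` would be sent to the LLM; `restore_map` stays local.
reply = "Dear «PERSONA_1», we received your ID «ID_2»."
print(rehydrate(reply, restore_map))
```

The design choice worth noting is that re-hydration depends only on the reply text and the local map, so the step that restores real values needs no network access.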

Visual redaction

For PDFs and scanned images, the result card offers a “Redacted PDF” download in which every detected entity is blacked out on the page. This is true redaction: apply_redactions removes the underlying text and the image pixels beneath each box, so the data no longer exists in the output file. The PDF’s metadata (DocInfo and XMP) is wiped too, so a redacted file can’t leak a name or ID via the Properties dialog or exiftool. Scanned documents are OCR’d first. Redaction reuses the same detection stack, so it needs no extra RAM.