Anonimal documentation

Anonimal detects PII in text and replaces it according to a chosen mode. A detector only finds spans; the replacement is decided separately. Everything runs locally — the original data never leaves the machine.

What PII it detects

Detection depends on the active engine.

Structured data (both engines)

Emails, phone numbers, credit cards (validated with the Luhn check), URLs, IPv4 addresses and common secrets.
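The Luhn check mentioned above is a standard checksum, so it can be illustrated on its own. The following is a minimal sketch, not Anonimal's actual code; the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as spaces or dashes.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

A check like this lets a detector discard digit runs that merely look like card numbers, which reduces false positives.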

LATAM identifiers (both engines)

Argentine DNI, CUIT / CUIL (with check digit) and CBU bank numbers.
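To show what "with check digit" means for CUIT / CUIL, here is a minimal sketch of the usual modulo-11 verification. It is an illustration under assumptions, not Anonimal's implementation: the name `cuit_valid` is hypothetical, and rejecting a computed value of 10 is a simplification chosen for this sketch.

```python
# Weights applied to the first ten digits of a CUIT / CUIL.
CUIT_WEIGHTS = (5, 4, 3, 2, 7, 6, 5, 4, 3, 2)

def cuit_valid(value: str) -> bool:
    """Return True if `value` holds 11 digits with a matching check digit."""
    digits = [int(ch) for ch in value if ch.isdigit()]
    if len(digits) != 11:
        return False
    total = sum(w * d for w, d in zip(CUIT_WEIGHTS, digits))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # Assumption: a computed value of 10 is treated as invalid in this sketch.
    return check != 10 and check == digits[10]
```

The sample identifier used later in the API example, 20-12345678-6, passes this check.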

Free-form PII (ML engine only)

People’s names and addresses in running prose, via NER — the part regex cannot see.

Custom rules

User-supplied rules: always-hide / never-hide whitelists and your own {regex, placeholder} patterns.

Engine: regex (lite) vs ML

A detector exposes detect(text) → [Span]; overlaps are resolved by the longest span (ties broken by label priority).
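The overlap rule above (longest span wins, ties broken by label priority) could be sketched as follows. The `Span` fields, the `PRIORITY` order and the function name are assumptions made for illustration; the real priority table is not given in this document.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    label: str
    start: int  # inclusive offset
    end: int    # exclusive offset

# Hypothetical priority order: an earlier label wins a length tie.
PRIORITY = ("EMAIL", "URL", "PHONE", "ID")

def resolve_overlaps(spans):
    """Keep the longest span of each overlapping group."""
    def rank(s):
        prio = PRIORITY.index(s.label) if s.label in PRIORITY else len(PRIORITY)
        # Longest first, then by label priority, then by position.
        return (-(s.end - s.start), prio, s.start)

    kept = []
    for s in sorted(spans, key=rank):
        # Accept a span only if it touches nothing already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)
```

Sorting by rank first means a shorter span is only considered after every longer candidate has had its chance to claim the region.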

lite — regex only. Lightweight, offline, no model. Covers structured data and the LATAM identifiers above. It does not see free-form names or addresses. Always present, even in the lite image and the anonimal_lite library.
ml — wraps the OpenAI Privacy Filter (OPF, Apache-2.0). Accurate for free-form PII. Heavy (~2.8 GB checkpoint, ~3 GB RAM), CPU-bound, loaded lazily in the background with serialized inference. Optional.

Select with ANONIMAL_ENGINE: auto (ML if ready, else lite) · lite · ml. Requests may override the default per call with an engine field.
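The selection rule can be read as a small decision function. This is a sketch under assumptions: `resolve_engine` and its parameters are hypothetical, and defaulting to auto when ANONIMAL_ENGINE is unset is a guess rather than documented behavior.

```python
import os

VALID_ENGINES = ("auto", "lite", "ml")

def resolve_engine(requested, ml_ready, env=os.environ):
    """Pick the engine for one call.

    requested: value of the per-request engine field, or None.
    ml_ready:  whether the ML model has finished loading.
    """
    # The per-request field overrides the configured default.
    # Assumption: an unset ANONIMAL_ENGINE behaves like "auto".
    choice = requested or env.get("ANONIMAL_ENGINE", "auto")
    if choice not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {choice!r}")
    if choice == "auto":
        return "ml" if ml_ready else "lite"
    return choice
```

An explicit ml choice is returned even when the model is not ready; the caller would then have to report the engine as unavailable.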

Replacement modes

Four modes are opaque (one-way) and one is reversible. A single anonymizer is used per document, so the same value always gets the same replacement (consistency).

Mode	Result	Reversible
`typed`	`[EMAIL]` (placeholder by category)	no
`anon`	`«REDACTADO»` (single opaque token)	no
`pseudo`	`EMAIL_1` (stable numbered pseudonym)	yes (returns a map)
`mask`	`j*@.com` / `*--**-1234` (type-aware)	no
`hash`	`EMAIL_a1b2c3d4e5` (deterministic HMAC)	no
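To make the consistency guarantee concrete, here is a minimal sketch of how one anonymizer instance per document can hand out stable numbered pseudonyms. The class `Pseudonymizer`, its methods and the `(label, start, end)` span tuples are hypothetical and not the library's API.

```python
class Pseudonymizer:
    """One instance per document: same value, same token."""

    def __init__(self):
        self.counters = {}  # label -> last number used
        self.tokens = {}    # (label, value) -> token
        self.mapping = {}   # token -> original value

    def token_for(self, label, value):
        key = (label, value)
        if key not in self.tokens:
            n = self.counters.get(label, 0) + 1
            self.counters[label] = n
            token = f"{label}_{n}"
            self.tokens[key] = token
            self.mapping[token] = value
        return self.tokens[key]

    def apply(self, text, spans):
        """spans: non-overlapping (label, start, end) tuples."""
        out, cursor = [], 0
        for label, start, end in sorted(spans, key=lambda s: s[1]):
            out.append(text[cursor:start])  # untouched text
            out.append(self.token_for(label, text[start:end]))
            cursor = end
        out.append(text[cursor:])
        return "".join(out)
```

Because the lookup table lives on the instance, reusing the instance across a whole document is what keeps repeated values aligned.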

Opaque markers

typed, anon, mask and hash produce non-reversible output. Use them when you only need to share or store text safely. The hash mode is deterministic: set ANON_HASH_KEY so the same value hashes identically across restarts (stable linkage without storing a map).
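A keyed hash is what makes the output repeatable without a stored map. The sketch below assumes HMAC-SHA256 truncated to 10 hex characters, chosen only to match the shape of the `EMAIL_a1b2c3d4e5` example; the actual digest and length are not stated in this document, and `hash_token` is a hypothetical name.

```python
import hashlib
import hmac

def hash_token(label: str, value: str, key: bytes, length: int = 10) -> str:
    """Derive a deterministic, non-reversible token for `value`."""
    # Assumption: SHA-256 as the HMAC digest, hex output truncated to `length`.
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{label}_{digest[:length]}"
```

In a deployment the key would come from ANON_HASH_KEY, for example `os.environ["ANON_HASH_KEY"].encode()`. With a fixed key the same input always yields the same token; changing the key changes every token.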

Reversible mode (re-hydration)

pseudo is the reversible mode. It replaces each value with a stable token (EMAIL_1, PERSON_2, …) and returns a token → original map. The workflow:

1. POST /anonymize with mode: "pseudo" → get anonymized output plus map.
2. Send output to the LLM (the original PII never reaches it).
3. POST /deanonymize with the LLM’s answer and the same map → the original values are re-hydrated into the text.
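The re-hydration step of the workflow above amounts to substituting tokens back from the map. A minimal local sketch follows; `rehydrate` is a hypothetical name and the service's own matching rules may differ.

```python
import re

def rehydrate(text: str, mapping: dict) -> str:
    """Replace every token in `text` with its original value."""
    if not mapping:
        raise ValueError("map must be a non-empty dict")
    # Longest tokens first, so EMAIL_10 is never cut short as EMAIL_1.
    tokens = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    # Single pass: restored values are not scanned again for tokens.
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Doing the substitution in one pass avoids a subtle bug where a restored value that happens to contain a token-like string gets replaced a second time.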

Formats

Anonimal preserves the format of files that are already text: txt, md, log, srt, html, CSV (cells anonymized, columns intact) and JSON (string values anonymized, keys never touched; output stays valid JSON). A single anonymizer per file means a consistent map for the whole document.
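The JSON rule (string values anonymized, keys never touched) can be sketched as a recursive walk. This is an illustration only: `anonymize_json` and the `redact` callback are hypothetical names, and `redact` stands in for whatever detector and mode are active.

```python
import json

def anonymize_json(doc: str, redact) -> str:
    """Apply `redact` to every string value; leave keys and other types alone."""
    def walk(node):
        if isinstance(node, dict):
            # Keys are copied verbatim; only values are visited.
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        if isinstance(node, str):
            return redact(node)
        return node  # numbers, booleans, null

    return json.dumps(walk(json.loads(doc)), ensure_ascii=False)
```

Parsing and re-serializing, rather than editing the raw text, is what keeps the output valid JSON.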

Converting Word / Excel / images / audio / URLs is not Anonimal’s job — that belongs to Escriba, Extracta and Fisherboy, which feed already-converted text in. Anonimal does, however, offer real PDF redaction (/redact_pdf): genuine black-out of detected spans plus metadata removal.

REST API

All endpoints except /health are gated by require_auth (see authentication). Base URL is your deployment, e.g. http://localhost:8920.

Method	Path	What it does
`GET`	`/health`	Status + ML engine availability. Always open.
`POST`	`/detect`	`{text, rules?}` → detected spans.
`POST`	`/anonymize`	`{text, mode, engine?, rules?}` → `{engine, mode, output, spans, map, reversible, summary}`.
`POST`	`/deanonymize`	`{text, map}` → original text.
`POST`	`/anonymize_file`	File upload + `mode` → anonymized content (same format).
`POST`	`/redact_pdf`	PDF → redacted PDF (black-out + metadata wiped).

`POST /anonymize`

Request:

{
  "text": "email juan@acme.com, CUIT 20-12345678-6",
  "mode": "pseudo",
  "engine": "auto",
  "rules": null
}

Response:

{
  "engine": "lite",
  "mode": "pseudo",
  "output": "email EMAIL_1, CUIT ID_1",
  "spans": [
    { "label": "EMAIL", "start": 6, "end": 19, "text": "juan@acme.com" },
    { "label": "ID", "start": 26, "end": 39, "text": "20-12345678-6" }
  ],
  "map": { "EMAIL_1": "juan@acme.com", "ID_1": "20-12345678-6" },
  "reversible": true,
  "summary": { "EMAIL": 1, "ID": 1 }
}

The map is only populated for pseudo; reversible reflects that.

`POST /deanonymize`

{ "text": "reply to EMAIL_1", "map": { "EMAIL_1": "juan@acme.com" } }

→ { "output": "reply to juan@acme.com" }. A missing or empty map returns 422.

Legacy (drop-in) `/anonymize`

Calling POST /anonymize without a mode returns the legacy contract used by the embedded Anonimal — {text, detected_spans, redacted_text, summary} with a placeholder per span. This lets Escriba and Fisherboy point their ANONIMAL_URL at the new service without changing a line of code.

`GET /health`

Returns status, the default engine and mode, and an ml block with available, ready and error. Used by the container healthcheck.

Errors

401 (missing/invalid token or session), 413 (text or PDF over the size cap), 422 (invalid mode / missing map / invalid rules_json), 503 (ML engine or PDF support unavailable).

Authentication

Anonimal accepts two independent credentials on the API:

Service token — set ANONIMAL_TOKEN. Every request must then carry it, either as Authorization: Bearer <token> or as the X-Anonimal-Token header. This is how Escriba and Fisherboy authenticate over the internal network.
Browser session — when ANONIMAL_AUTH_ENABLED=true, a signed cookie from the /login page also satisfies the API gate (for the web UI).

If neither is configured, the API is open (it assumes localhost). /health is always reachable for healthchecks.
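For a client, the service-token scheme comes down to one extra header. The sketch below builds a POST /anonymize request with the standard library; `build_anonymize_request` is a hypothetical helper, and the base URL is the example value from this page.

```python
import json
import urllib.request

def build_anonymize_request(base_url, text, mode="pseudo", token=None):
    """Prepare (but do not send) a POST /anonymize request."""
    body = json.dumps({"text": text, "mode": mode}).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/anonymize",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    if token:
        # The X-Anonimal-Token header is the documented alternative.
        req.add_header("Authorization", f"Bearer {token}")
    return req
```

Sending it is then a matter of `urllib.request.urlopen(req)` and decoding the JSON body. A 401 would surface as `urllib.error.HTTPError`.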

Ecosystem integration

Anonimal is the single owner of anonymization in the Escriba Suite; the satellites delegate to it.

Service mode — a product with ANONIMAL_URL set calls Anonimal over HTTP (full ML coverage), authenticating with X-Anonimal-Token.
Library fallback — without ANONIMAL_URL, a product falls back to the bundled anonimal_lite (regex only, pure stdlib), so it can still anonymize standalone.

pip install "anonimal-lite @ git+https://github.com/diegoparras/anonimal.git@v0.4.0"

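from anonimal_lite import LiteEngine, Anonymizer, deanonymize

text = "email juan@acme.com, CUIT 20-12345678-6"  # sample input

eng = LiteEngine()                               # regex-only detector, no model needed
spans = eng.detect(text)                         # find the PII spans
out = Anonymizer("pseudo").process(text, spans)  # replace them with stable pseudonyms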

There are two flows into Anonimal: a human path (Extracta/Fisherboy hand off to Escriba via sessionStorage['escriba.handoff'], and Escriba’s “Anonymize” button calls the API) and an automatic path (an unattended worker calls the API directly). Either way, Anonimal stays the only place anonymization happens.

Custom rules

/detect, /anonymize (field rules) and /anonymize_file (rules_json) accept a rules object: always (always hide), never (never hide) and patterns ({regex, placeholder}). Patterns are a superset of Escriba’s rules, with optional RE2 to guard against ReDoS.
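One way to picture how the three rule kinds interact with detected spans is shown below. The semantics are assumptions made for this sketch: always and never are treated as literal strings, forced terms get a hypothetical `CUSTOM` label, and each pattern's placeholder doubles as its label. Python's `re` is used here, so the optional RE2 guard is not represented.

```python
import re

def apply_rules(text, spans, rules):
    """Adjust detected (label, start, end) spans with user rules."""
    never = set(rules.get("never") or [])
    # never: drop detections whose exact text is whitelisted.
    result = [s for s in spans if text[s[1]:s[2]] not in never]
    # always: force-hide every literal occurrence of the term.
    for term in rules.get("always") or []:
        for m in re.finditer(re.escape(term), text):
            result.append(("CUSTOM", m.start(), m.end()))
    # patterns: user regexes, each with its own placeholder.
    for p in rules.get("patterns") or []:
        for m in re.finditer(p["regex"], text):
            result.append((p["placeholder"], m.start(), m.end()))
    return sorted(set(result), key=lambda s: s[1])
```

In this sketch the combined list may still contain overlapping spans, so it would be passed through overlap resolution before replacement.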
