What is Escriba?

Escriba is the universal translator into the language of AI. It takes any document — a PDF, a Word file, a spreadsheet, an image, an audio recording, a web page, a YouTube link — and turns it into clean, anonymized Markdown, the format that large language models read best.

It solves, in one self-hostable tool, the three headaches of feeding documents to an LLM:

Noisy, token-hungry input → clean, structured Markdown.
Sensitive-data leakage → built-in PII anonymization, with reversible pseudonymization.
“Will it fit? What will it cost?” → a local LLM-prep panel that counts tokens, estimates cost with live pricing, checks context-window fit and splits the text into chunks for RAG.
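
To make the third point concrete, here is a minimal sketch of the kind of arithmetic an LLM-prep step performs. It is not Escriba's implementation: the function names, the 4-characters-per-token heuristic and the price passed in are all assumptions chosen for illustration, and a real tool would use the target model's tokenizer and current pricing.

```python
import math

# Assumption: ~4 characters per token, a common rough heuristic.
# A real tokenizer for the target model gives exact counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost in USD for a given (hypothetical) price per million tokens."""
    return tokens * usd_per_million_tokens / 1_000_000


def fits(tokens: int, context_window: int, reserved_for_reply: int = 0) -> bool:
    """True if the prompt plus the space reserved for the reply fits the window."""
    return tokens + reserved_for_reply <= context_window


def chunk(text: str, max_tokens: int, overlap_tokens: int = 0) -> list[str]:
    """Split text into overlapping character windows sized by the token budget."""
    size = max_tokens * CHARS_PER_TOKEN
    step = size - overlap_tokens * CHARS_PER_TOKEN
    if step <= 0:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

For example, `estimate_tokens("a" * 1000)` gives 250 under this heuristic, and `chunk("a" * 1000, max_tokens=100, overlap_tokens=20)` yields three chunks. Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of some duplicated tokens.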

It runs locally, in 7 languages, and is built on Microsoft MarkItDown.

Who it’s for

Anyone who pastes documents into ChatGPT, Claude or Gemini and wants the text clean — and the private bits stripped out — before it leaves their hands.
Teams and institutions that can’t send confidential files to a third-party cloud, and need a converter that runs on their own server.
Builders who want a REST API, role-based access and a single Docker image with no moving parts.

What makes it different

Control stays with the human. Escriba doesn’t send your documents anywhere. It runs on your machine, deletes files right after conversion, and lets you decide what reaches a model.
Anonymization is reversible. Pseudonymize → send to the LLM → re-hydrate the reply locally. The restore map never leaves your browser.
No AI required. The smart parts — token counting, OCR, anonymization, cost estimates — all run locally. AI is strictly optional.
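
The pseudonymize → send → re-hydrate round trip described above can be sketched as follows. This is an illustrative sketch only, not Escriba's anonymization engine: it assumes a single entity type (email addresses, matched by a deliberately simple regular expression) and hypothetical function names and placeholder format.

```python
import re

# Assumption: a deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a stable placeholder.

    Returns the masked text and a restore map (placeholder -> original).
    The restore map is the only thing that can undo the masking, so it
    is kept locally and never sent along with the text.
    """
    placeholder_for: dict[str, str] = {}
    restore_map: dict[str, str] = {}

    def swap(match: re.Match) -> str:
        value = match.group(0)
        if value not in placeholder_for:
            placeholder = f"[EMAIL_{len(placeholder_for) + 1}]"
            placeholder_for[value] = placeholder
            restore_map[placeholder] = value
        return placeholder_for[value]

    return EMAIL.sub(swap, text), restore_map


def rehydrate(text: str, restore_map: dict[str, str]) -> str:
    """Put the original values back into text that contains placeholders."""
    for placeholder, value in restore_map.items():
        text = text.replace(placeholder, value)
    return text
```

In use, only the masked text would be sent to the model. The reply comes back still containing placeholders such as `[EMAIL_2]`, and `rehydrate(reply, restore_map)` restores the real values locally. Reusing the same placeholder for repeated occurrences of a value lets the model keep track of who is who without ever seeing the underlying data.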

Next steps

Quick start — get it running in one command.
Converting documents — the day-to-day workflow.
Anonymization for LLMs — the privacy engine in depth.