What is Escriba?
Escriba is the universal translator into the language of AI. It takes any document — a PDF, a Word file, a spreadsheet, an image, an audio recording, a web page, a YouTube link — and turns it into clean, anonymized Markdown, the format that large language models read best.
It solves, in one self-hostable tool, the three headaches of feeding documents to an LLM:
- Noisy, token-hungry input → clean, structured Markdown.
- Sensitive-data leakage → built-in PII anonymization, with reversible pseudonymization.
- “Will it fit? What will it cost?” → a local LLM-prep panel that counts tokens, estimates cost with live pricing, checks context-window fit, and splits text into chunks for RAG.
It runs locally, in 7 languages, and is built on Microsoft MarkItDown.
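To make the LLM-prep checks concrete, here is a minimal Python sketch of the four steps the panel is described as performing: counting tokens, estimating cost, checking context-window fit, and chunking. It is not Escriba's code. The 4-characters-per-token heuristic, the function names, and the price and window values passed in are all illustrative assumptions; a real tool would use the target model's tokenizer and current pricing.

```python
# Illustrative sketch only: names, heuristic, and numbers are assumptions,
# not Escriba's implementation.

# Assumption: roughly 4 characters per token, a common rule of thumb for English.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate: ceiling of len(text) / 4."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost in USD for a given (hypothetical) per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def fits_context(tokens: int, context_window: int, reserved_for_reply: int = 0) -> bool:
    """True if the prompt plus the space reserved for the reply fits the window."""
    return tokens + reserved_for_reply <= context_window


def chunk_paragraphs(markdown: str, max_tokens: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A single paragraph larger than max_tokens is kept as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in markdown.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and estimate_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps each chunk readable on its own, which tends to matter for retrieval; a production chunker would likely also split on headings and add overlap.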
Who it’s for
- Anyone who pastes documents into ChatGPT, Claude or Gemini and wants the text clean, with the private bits stripped out, before it leaves their hands.
- Teams and institutions that can’t send confidential files to a third-party cloud, and need a converter that runs on their own server.
- Builders who want a REST API, role-based access and a single Docker image with no moving parts.
What makes it different
- Control stays at the human layer. Escriba doesn’t send your documents anywhere. It runs on your machine, deletes files right after conversion, and lets you decide what reaches a model.
- Anonymization is reversible. Pseudonymize → send to the LLM → re-hydrate the reply locally. The restore map never leaves your browser.
- No AI required. The smart parts — token counting, OCR, anonymization, cost estimates — all run locally. AI is strictly optional.
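The reversible round trip described above (pseudonymize, send to the LLM, re-hydrate the reply locally) can be sketched in a few lines of Python. This is a simplified illustration, not Escriba's engine: it detects only email addresses with a basic regular expression, and the `[EMAIL_n]` placeholder format and function names are assumptions made for the example.

```python
import re

# Assumption: a deliberately simple email pattern; real PII detection covers
# many more entity types (names, phone numbers, IDs, and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a stable placeholder.

    Returns the redacted text and a restore map (placeholder -> original value).
    The restore map is what must stay local.
    """
    restore_map: dict[str, str] = {}
    seen: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            placeholder = f"[EMAIL_{len(seen) + 1}]"
            seen[value] = placeholder
            restore_map[placeholder] = value
        return seen[value]

    return EMAIL_RE.sub(replace, text), restore_map


def rehydrate(text: str, restore_map: dict[str, str]) -> str:
    """Put the original values back into text that contains placeholders."""
    for placeholder, value in restore_map.items():
        text = text.replace(placeholder, value)
    return text


redacted, restore_map = pseudonymize(
    "Contact ana@example.com or bob@example.org; ana@example.com is preferred."
)
# Only `redacted` would be sent to a model; the reply is restored locally.
reply = rehydrate("Write to [EMAIL_2] first.", restore_map)
```

Because the same value always maps to the same placeholder, the model can still reason about who is who, while the restore map never has to leave the local environment.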
Next steps
- Quick start: get it running in one command.
- Converting documents: the day-to-day workflow.
- Anonymization for LLMs: the privacy engine in depth.