What Escriba runs on
Escriba is glue around excellent open-source projects. Total transparency matters more than looking clever, so here is exactly what runs under the hood, what each piece does, and when — if ever — Escriba reaches the network.
The engines
| What it does | Project | License |
|---|---|---|
| Core document → Markdown conversion | Microsoft MarkItDown | MIT |
| Web framework / API | FastAPI · Uvicorn | MIT / BSD |
| PDF parsing, page selection & true redaction | PyMuPDF | AGPL-3.0 / commercial |
| Advanced PDF layout extraction (opt-in) | OpenDataLoader PDF | open-source |
| OCR for images & scanned PDFs | Tesseract + OCRmyPDF | Apache-2.0 / MPL-2.0 |
| Audio & video transcription | faster-whisper (OpenAI Whisper) | MIT |
| Text → audio (local voices) | Piper | MIT |
| Web pages & YouTube transcripts | yt-dlp | Unlicense |
| PII detection (the anonymization engine) | OpenAI Privacy Filter | Apache-2.0 |
| Safe user-defined regex rules | google-re2 | BSD-3 |
| Token counting | tiktoken | MIT |
| RAG chunking | semchunk | MIT |
| Export to 10 formats | Pandoc | GPL-2.0+ |
| Live per-model pricing | OpenRouter (public API) | — |
| Rate limiting | embedded Redis | — |
| Preview HTML sanitization (in your browser) | DOMPurify | Apache-2.0 / MPL-2.0 |
License — Escriba is MIT
Escriba’s own source code is released under the MIT License, one of the most permissive licenses there is. In plain terms:
What you can do
- Use it for anything, including commercial use.
- Modify the source and adapt it to your needs.
- Distribute it, and redistribute your modified versions.
- Use it privately and sublicense it inside your own product.
What you have to do
- Keep the original copyright notice and license text in all copies or substantial portions of the software.
What isn’t covered
- It’s provided “as is”, with no warranty — the author isn’t liable for how it’s used.
Transcription models (Whisper)
Transcription uses faster-whisper, an optimized runtime for OpenAI’s Whisper.
You choose the model size with WHISPER_MODEL — tiny, base (default), small,
medium or large-v3. Bigger models are more accurate but heavier and slower on CPU;
see System requirements for what each one needs.
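As a rough illustration of how a WHISPER_MODEL setting could be turned into a loaded model, here is a minimal sketch. The helper names (`resolve_whisper_model`, `load_model`) are hypothetical and not taken from Escriba's source, and the `int8` CPU compute type is an assumption about a sensible default, not a documented Escriba choice.

```python
import os

# Sizes listed on this page; "base" is the documented default.
ALLOWED_SIZES = ("tiny", "base", "small", "medium", "large-v3")
DEFAULT_SIZE = "base"


def resolve_whisper_model(env=None):
    """Read WHISPER_MODEL from the environment, falling back to the default."""
    env = os.environ if env is None else env
    size = env.get("WHISPER_MODEL", "").strip().lower() or DEFAULT_SIZE
    if size not in ALLOWED_SIZES:
        raise ValueError(
            f"WHISPER_MODEL={size!r} is not one of {', '.join(ALLOWED_SIZES)}"
        )
    return size


def load_model(size):
    """Load the model with faster-whisper (hypothetical helper)."""
    # Imported lazily so this module loads even without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumption: int8 on CPU trades a little accuracy for lower memory use.
    return WhisperModel(size, device="cpu", compute_type="int8")
```

Validating the value up front means a typo in the setting fails at startup with a clear message, before any model download is attempted.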
About the anonymization engine
The PII engine is built on the OpenAI Privacy Filter (OPF, Apache-2.0), a NER model that detects names, organizations, locations and more. Escriba wraps it with layout-aware invoice-field reading, validated detectors (credit-card Luhn, IBAN mod-97), and your own rules running on Google’s RE2 engine (linear-time, ReDoS-proof).
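To make "validated detectors" concrete, the two checksums named above can be sketched as follows. This is a generic implementation of the standard Luhn and IBAN mod-97 algorithms, not Escriba's actual code; the function names and the length bounds are assumptions chosen for illustration.

```python
import string

IBAN_CHARS = string.digits + string.ascii_uppercase


def luhn_valid(candidate):
    """Return True if the digits pass the Luhn checksum used by card numbers."""
    compact = candidate.replace(" ", "").replace("-", "")
    # Assumed length bounds for card numbers; reject anything non-numeric.
    if not 12 <= len(compact) <= 19:
        return False
    if any(ch not in string.digits for ch in compact):
        return False
    total = 0
    for index, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(candidate):
    """Return True if the string passes the IBAN mod-97 check."""
    compact = candidate.replace(" ", "").upper()
    if not 15 <= len(compact) <= 34:
        return False
    if any(ch not in IBAN_CHARS for ch in compact):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A checksum pass after a pattern match is what separates a real card number or IBAN from a random digit run, which keeps false positives down without touching the network.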
When does Escriba talk to the internet?
By design, conversion, OCR, transcription and anonymization all run locally on your server. Uploaded files are deleted right after conversion and nothing is stored. The only times Escriba makes an outbound request are these — all either user-initiated or optional:
- You convert a URL or a YouTube link. Escriba fetches that page/transcript (via yt-dlp). Obviously.
- You enable an AI provider. Only then does text go to the provider you chose (OpenAI, Gemini or OpenRouter). The default is No AI, and nothing is sent.
- Live model pricing. The LLM panel fetches the price/context list from OpenRouter — a public catalog with no document data in the request. It’s cached, and the feature simply shows nothing if offline.
- First-run model download. The Whisper and NER models are downloaded once (or pre-baked into the image), then run fully offline.
That’s the complete list. Your documents themselves never leave your machine unless you explicitly point Escriba at an external AI provider.
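The "cached, shows nothing if offline" behavior described for live pricing can be sketched like this. It is an illustration only: `PricingCache` and `fetch_catalog` are hypothetical names, the one-hour TTL is arbitrary, and the catalog URL and its `data` field are assumptions about OpenRouter's public API, not details taken from Escriba.

```python
import json
import time
import urllib.request

# Assumed endpoint for OpenRouter's public model catalog.
CATALOG_URL = "https://openrouter.ai/api/v1/models"


def fetch_catalog(url=CATALOG_URL, timeout=5.0):
    """Fetch the public model list; the request carries no document data."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("data", [])


class PricingCache:
    """Cache the catalog in memory and degrade to an empty list when offline."""

    def __init__(self, fetch=fetch_catalog, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._models = None
        self._fetched_at = 0.0

    def get(self):
        now = self._clock()
        if self._models is not None and now - self._fetched_at < self._ttl:
            return self._models  # still fresh, no network call
        try:
            self._models = self._fetch()
            self._fetched_at = now
        except (OSError, ValueError):
            # Offline or malformed response: keep the last good copy, if any.
            return self._models or []
        return self._models
```

Because the fetch function is injected, the offline path can be exercised without a network connection, and a failed refresh never raises into the UI.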