What Escriba runs on

Escriba is glue around excellent open-source projects. Total transparency matters more than looking clever, so here is exactly what runs under the hood, what each piece does, and when — if ever — Escriba reaches the network.

The engines

What it does	Project	License
Core document → Markdown conversion	Microsoft MarkItDown	MIT
Web framework / API	FastAPI · Uvicorn	MIT / BSD-3-Clause
PDF parsing, page selection & true redaction	PyMuPDF	AGPL-3.0 / commercial
Advanced PDF layout extraction (opt-in)	OpenDataLoader PDF	open-source
OCR for images & scanned PDFs	Tesseract + OCRmyPDF	Apache-2.0 / MPL-2.0
Audio & video transcription	faster-whisper (OpenAI Whisper)	MIT
Text → audio (local voices)	Piper	MIT
Web pages & YouTube transcripts	yt-dlp	Unlicense
PII detection (the anonymization engine)	OpenAI Privacy Filter	Apache-2.0
Safe user-defined regex rules	google-re2	BSD-3-Clause
Token counting	tiktoken	MIT
RAG chunking	semchunk	MIT
Export to 10 formats	Pandoc	GPL-2.0+
Live per-model pricing	OpenRouter (public API)	—
Rate limiting	embedded Redis	—
Preview HTML sanitization (in your browser)	DOMPurify	Apache-2.0 / MPL-2.0

License — Escriba is MIT

Escriba’s own source code is released under the MIT License, one of the most permissive licenses there is. In plain terms:

What you can do

Use it for anything, including commercial use.
Modify the source and adapt it to your needs.
Distribute it, and redistribute your modified versions.
Use it privately and sublicense it inside your own product.

What you have to do

Keep the original copyright notice and license text in all copies or substantial portions of the software.

What isn’t covered

It’s provided “as is”, with no warranty — the author isn’t liable for how it’s used.

Transcription models (Whisper)

Transcription uses faster-whisper, an optimized runtime for OpenAI’s Whisper. You choose the model size with WHISPER_MODEL — tiny, base (default), small, medium or large-v3. Bigger models are more accurate but heavier and slower on CPU; see System requirements for what each one needs.
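As a rough illustration of the setting above, here is a minimal sketch of how a WHISPER_MODEL value could be read and validated before handing it to faster-whisper. The helper names (resolve_whisper_model, load_model) are hypothetical, and the device="cpu" / compute_type="int8" settings are assumptions chosen for a CPU-only server, not necessarily what Escriba does internally.

```python
import os

# Model sizes listed above; "base" is the documented default.
ALLOWED_MODELS = ("tiny", "base", "small", "medium", "large-v3")
DEFAULT_MODEL = "base"


def resolve_whisper_model(env=None):
    """Return the model size named by WHISPER_MODEL, or the default if unset."""
    env = os.environ if env is None else env
    value = env.get("WHISPER_MODEL", "").strip() or DEFAULT_MODEL
    if value not in ALLOWED_MODELS:
        raise ValueError(
            f"WHISPER_MODEL={value!r} is not one of: {', '.join(ALLOWED_MODELS)}"
        )
    return value


def load_model(env=None):
    """Load the chosen model with faster-whisper (hypothetical helper)."""
    # Imported lazily so validation works without the package installed.
    from faster_whisper import WhisperModel

    # device and compute_type are illustrative, CPU-friendly assumptions.
    return WhisperModel(resolve_whisper_model(env), device="cpu", compute_type="int8")
```

With the model loaded, model.transcribe("audio.mp3") returns a lazy iterator of segments (each with start, end and text) plus an info object. Failing fast on an unknown size avoids discovering a typo only when the first transcription is requested.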

About the anonymization engine

The PII engine is built on the OpenAI Privacy Filter (OPF, Apache-2.0), an NER model that detects names, organizations, locations and more. Escriba wraps it with layout-aware invoice-field reading, checksum-validated detectors (Luhn for credit cards, mod-97 for IBANs), and your own rules running on Google’s RE2 engine, which matches in linear time and so avoids the catastrophic backtracking behind ReDoS.
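To make the two checksums concrete, here is a minimal sketch of a Luhn check and an IBAN mod-97 check. The function names and the length limits are illustrative assumptions, not Escriba's actual code. The point of a validated detector is that a pattern match alone is not enough: a candidate is only treated as a card number or IBAN if its checksum also holds, which cuts down false positives on arbitrary digit runs.

```python
import string


def luhn_valid(number: str) -> bool:
    """Luhn checksum for card numbers; spaces and dashes are ignored."""
    if any(c not in string.digits + " -" for c in number):
        return False
    digits = [int(c) for c in number if c in string.digits]
    if len(digits) < 12:  # assumed lower bound for a plausible card number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check: rearranged IBAN, read as a number, mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    if not (s[:2].isalpha() and s[2:4].isdigit()):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

For example, luhn_valid("4111 1111 1111 1111") and iban_valid("GB82 WEST 1234 5698 7654 32") both hold, while changing a single digit in either string makes the check fail. Note that the mod-97 check does not verify country-specific IBAN lengths.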

When does Escriba talk to the internet?

By design, conversion, OCR, transcription and anonymization all run locally on your server. Uploaded files are deleted right after conversion and nothing is stored. The only times Escriba makes an outbound request are these — all either user-initiated or optional:

You convert a URL or a YouTube link. Escriba fetches that page/transcript (via yt-dlp). Obviously.
You enable an AI provider. Only then does text go to the provider you chose (OpenAI, Gemini or OpenRouter). The default is No AI, and nothing is sent.
Live model pricing. The LLM panel fetches the price/context list from OpenRouter — a public catalog with no document data in the request. It’s cached, and the feature simply shows nothing if offline.
First-run model download. The Whisper and NER models are downloaded once (or pre-baked into the image), then run fully offline.

That’s the complete list. Your documents themselves never leave your machine unless you explicitly point Escriba at an external AI provider.