Text to audio (podcast)

Escriba doesn’t just turn documents into text for an LLM — it can also turn the result back into sound. From any conversion, open the Audio / Podcast option to generate an MP3 and listen to your document.

There are two output modes:

  • Narration: a single voice reads the document end to end.
  • Podcast: an AI writes a short two-host dialogue about the document (host + expert), and Escriba voices it with two alternating voices and stitches the segments together. Podcast mode needs an AI provider configured, because the provider writes the script; narration does not.

And two voice engines:

  • Local (Piper): the default. Voices run on your server, fully offline, so the text never leaves your machine. Escriba ships 14 voices across Spanish, English, Portuguese, French, Italian, German and Chinese.
  • Cloud (OpenAI): optional, higher quality. Uses your OpenAI API key; the text is sent to OpenAI only when you pick a cloud voice. Useful for languages without a local voice (e.g. Japanese).
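
To make the podcast flow concrete, here is a minimal Python sketch of the step between the AI-written script and synthesis: pairing each dialogue turn with one of two voices. It is an illustration only, not Escriba's code. The `HOST:` / `EXPERT:` line format, the `plan_segments` helper and the voice names are all assumptions made for the example.

```python
# Hypothetical voice identifiers; Escriba's real voice names may differ.
VOICES = {"HOST": "voice_a", "EXPERT": "voice_b"}


def plan_segments(script: str) -> list[tuple[str, str]]:
    """Turn 'SPEAKER: text' lines into ordered (voice, text) pairs.

    Assumes the script labels every turn with HOST or EXPERT.
    """
    segments = []
    for line in script.splitlines():
        speaker, sep, text = line.partition(":")
        speaker = speaker.strip().upper()
        text = text.strip()
        if not sep or speaker not in VOICES or not text:
            continue  # skip blank lines and anything that is not a dialogue turn
        segments.append((VOICES[speaker], text))
    return segments


script = """HOST: Welcome. Today we look at the quarterly report.
EXPERT: Thanks. The headline is that revenue grew.
HOST: What drove that?"""

segments = plan_segments(script)
for voice, text in segments:
    print(voice, "->", text)
```

Each pair would then be synthesized with its voice and the resulting clips joined in order into a single MP3.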

Like a studio panel, you choose:

  • Voice — language + speaker (local or cloud).
  • Pitch — low / medium / high.
  • Speed — slow / normal / fast.
  • Volume — low / medium / high.

A built-in player lets you preview the audio before downloading the MP3.

Audio generation is available at the ANGEL and DIOS levels (as are audio/video and OCR); HUMANO users get it only when HUMAN_TTS=true. Because Piper synthesizes on the CPU, a per-request character limit protects the server. The limit is configurable per role:

Setting            Default   Meaning
GOD_TTS_CHARS      0         DIOS: no limit
ANGEL_TTS_CHARS    100000    ANGEL: max characters per MP3
HUMAN_TTS_CHARS    20000     HUMANO (only if HUMAN_TTS=true)
TTS_TIMEOUT        600       Max seconds per synthesis
TTS_OPENAI_MODEL   tts-1     Cloud model (tts-1 or tts-1-hd)
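
As a rough illustration of how these per-role limits could be applied, the Python sketch below resolves a role's character limit from environment-style settings and checks a text against it. The setting names and defaults come from the table; the helper names (`char_limit`, `fits`), the reading of 0 as "no limit" for every role, and HUMAN_TTS defaulting to off are assumptions, not Escriba's actual implementation.

```python
import os

# Setting names and defaults are taken from the table above.
ROLE_SETTINGS = {
    "DIOS": ("GOD_TTS_CHARS", 0),
    "ANGEL": ("ANGEL_TTS_CHARS", 100000),
    "HUMANO": ("HUMAN_TTS_CHARS", 20000),
}


def char_limit(role, env=os.environ):
    """Return the max characters per MP3 for a role, or None for no limit."""
    if role not in ROLE_SETTINGS:
        raise PermissionError(f"role {role!r} has no audio access")
    # Assumption: HUMANO is locked out unless HUMAN_TTS is explicitly "true".
    if role == "HUMANO" and str(env.get("HUMAN_TTS", "false")).lower() != "true":
        raise PermissionError("HUMANO needs HUMAN_TTS=true")
    name, default = ROLE_SETTINGS[role]
    limit = int(env.get(name, default))
    # Assumption: 0 means "no limit", as the table states for DIOS.
    return None if limit == 0 else limit


def fits(role, text, env=os.environ):
    """True if the text is within the role's per-request character limit."""
    limit = char_limit(role, env)
    return limit is None or len(text) <= limit
```

For example, `fits("ANGEL", document_text)` would return False for a 150,000-character document under the default settings, while the same call for DIOS would return True.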

See Configuration for the full list. A very long document with a local voice can take a while to synthesize — that’s the CPU, not a bug.