Fisherboy documentation
This page documents how Fisherboy works end to end: the core concepts, the web UI, the download capabilities, the REST and MCP APIs, and advanced configuration.
Concepts
The job and the envelope
Everything Fisherboy does is a job. You submit a job (a URL plus options), it is enqueued
on Redis, a worker runs the pipeline, and the result is stored as an envelope — a single
object holding the extracted content (content_md), structured records, the crawl tree, the
discovered links and metadata. You poll the job until its status is ok.
Tiered fetch (escalates only when blocked)
Fisherboy never reaches for a heavy browser until it has to. It tries the cheapest method first and steps up only when a gate detects a block or a CAPTCHA:
- Tier 0: static HTTP with httpx. Fast and cheap; works for most static pages.
- Tier 1: TLS fingerprint with curl_cffi, so the request looks like a real browser’s at the network layer.
- Tier 2: stealth browser (Camoufox / Patchright) for JavaScript-heavy and anti-bot pages; this tier also drives hidden API capture.
- Tier 3: a real browser (nodriver / Playwright) as a last resort.
The winning tier is cached per domain (TIER_CACHE_TTL_S), so the next request to the
same site starts where it last succeeded. The high tiers are lazy-imported — the base image
stays light, and each tier turns on only if its library is installed. The escalation ceiling
is set with MAX_FETCH_TIER, and tier_hint on a job can suggest a starting point.
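The escalation logic described above can be sketched roughly as follows. This is not Fisherboy's actual code: the function name, the `(html, blocked)` fetcher contract and the in-memory cache dict are all assumptions made for illustration, and the cache TTL (TIER_CACHE_TTL_S) is left out. Whether tier_hint takes precedence over the cached tier is also an assumption.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher contract: takes a URL, returns (html, blocked).
Fetcher = Callable[[str], Tuple[str, bool]]


def fetch_with_escalation(
    url: str,
    domain: str,
    fetchers: Dict[int, Fetcher],
    tier_cache: Dict[str, int],
    max_tier: int = 3,
    tier_hint: Optional[int] = None,
) -> Tuple[int, str]:
    """Try tiers from cheapest to heaviest; return (winning_tier, html)."""
    # Assumption: an explicit hint wins over the cached tier, which wins over tier 0.
    start = tier_hint if tier_hint is not None else tier_cache.get(domain, 0)
    start = min(start, max_tier)  # never start above the ceiling
    for tier in range(start, max_tier + 1):
        fetcher = fetchers.get(tier)
        if fetcher is None:
            continue  # this tier's library is not installed, skip it
        html, blocked = fetcher(url)
        if not blocked:
            tier_cache[domain] = tier  # remember where this domain succeeded
            return tier, html
    raise RuntimeError(f"all tiers up to {max_tier} were blocked for {url}")
```

With a tier 0 fetcher that reports a block and a tier 1 fetcher that succeeds, the first call tries both tiers and the second call for the same domain starts directly at tier 1.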
Hidden API capture
Single-page apps and dynamic grids render from JSON they fetch over XHR/fetch. Rather than
fight the rendered HTML, Fisherboy can watch those network responses and keep the JSON the
page already consumes — usually the most reliable and complete source of the data. Enable it
per job with capture_api.
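As a rough illustration of the idea, the sketch below filters a list of already-captured network responses and keeps only those that carry parseable JSON. The response shape (`url`, `content_type`, `body` keys) and the function name are hypothetical; how Fisherboy actually records responses in the browser tier is not described here.

```python
import json
from typing import Any, Dict, List


def keep_json_responses(responses: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep captured responses whose body is JSON, parsed into Python objects."""
    kept = []
    for resp in responses:
        content_type = str(resp.get("content_type", "")).lower()
        if "json" not in content_type:
            continue  # HTML, images, scripts and so on are not API payloads
        try:
            data = json.loads(resp["body"])
        except (KeyError, TypeError, ValueError):
            continue  # declared as JSON but missing or not parseable: drop it
        kept.append({"url": resp.get("url"), "data": data})
    return kept
```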
Privacy modes
The privacy mode is chosen per job and bounded by role (defined in
privacy_matrix.yaml, never hardcoded). The pipeline is fail-closed: if anonymization
fails, nothing raw is delivered.
- opaque (opaco): each entity becomes a stable typed marker such as «PERSON_1» or «ID_2». The LLM reasons over markers and never sees the PII; the original is not recoverable.
- reversible: the same masking, but an encrypted token-to-value map is kept so you can re-hydrate later via POST /api/revert (single-use, role-bound). The map is encrypted with a Fernet key (REVERSIBLE_KEY) and expires after REVERSIBLE_TTL_S.
- direct (directo): raw output, only for non-sensitive data.
A deterministic regex pass always runs for high-risk PII (national ID, email, IP, Luhn-valid
card, phone). When ANONIMAL_URL points at an Anonimal
instance, full NER runs on top of it; standalone falls back to the built-in regex pass only.
If a role does not allow the requested mode, the gateway returns 403 — it never silently downgrades.
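To make the marker idea concrete, here is a minimal sketch of a deterministic regex pass that replaces emails and IPv4 addresses with stable typed markers. The patterns, the `EMAIL`/`IP` marker names and the `mask` function are illustrative assumptions, not Fisherboy's real rules. In opaque mode the returned map would be discarded; in reversible mode it is what would be encrypted and stored.

```python
import re
from typing import Callable, Dict, Match, Tuple

# Simplified patterns for illustration only; real PII detection is stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace PII with typed markers; return (masked_text, marker_to_value)."""
    marker_to_value: Dict[str, str] = {}
    value_to_marker: Dict[str, str] = {}
    counters: Dict[str, int] = {}

    def replacer(kind: str) -> Callable[[Match[str]], str]:
        def _sub(match: Match[str]) -> str:
            value = match.group(0)
            if value not in value_to_marker:
                counters[kind] = counters.get(kind, 0) + 1
                marker = f"«{kind}_{counters[kind]}»"
                value_to_marker[value] = marker
                marker_to_value[marker] = value
            # The same value always maps to the same marker (stable markers).
            return value_to_marker[value]
        return _sub

    text = EMAIL_RE.sub(replacer("EMAIL"), text)
    text = IPV4_RE.sub(replacer("IP"), text)
    return text, marker_to_value
```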
Fisherboy has three access levels, each with its own password and capability limits (which tiers, proxies, API capture, captcha solver, crawl and tarantula are allowed). Roles are enforced on both REST and MCP.
| Role | opaque | reversible | direct |
|---|---|---|---|
| humano | yes | no | no |
| angel | yes | yes | no |
| dios | yes | yes | yes |
Broadly: humano gets the cheap tiers and no expensive “weapons”; angel adds browser,
proxy and capture (but no captcha solver); dios has everything. Tarantula and
browser-cookie reading are vetoed in sidekick mode.
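The role table can be expressed as a small lookup. The sketch below hardcodes the matrix purely for illustration (the real one lives in privacy_matrix.yaml), and the function name is hypothetical. A `False` result corresponds to the case where the gateway answers 403 instead of downgrading the mode.

```python
from typing import Dict, Set

# Mirrors the role table above, using the API values for each mode.
ROLE_MODES: Dict[str, Set[str]] = {
    "humano": {"opaco"},
    "angel": {"opaco", "reversible"},
    "dios": {"opaco", "reversible", "directo"},
}


def is_mode_allowed(rol: str, privacy_mode: str) -> bool:
    """Return True if the role may use the requested privacy mode."""
    # Assumption: an unknown role is allowed nothing (fail-closed).
    return privacy_mode in ROLE_MODES.get(rol, set())
```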
Using the web UI
In standalone mode Fisherboy mounts its own web UI at the service URL (e.g.
http://localhost:8000). After logging in with a role password, you:
- Paste a URL and choose the output (markdown, llms_txt or json) and the privacy mode allowed by your role.
- Optionally enable extras: pagination, API capture, a crawl depth, tarantula, a proxy, cookies.
- Run the job; the result opens in a modal editor.
The built-in editor
The result opens in a modal editor with three tabs:
- Markdown — a toolbar with live preview.
- JSON — a validating editor.
- Table — an editable table; JSON ↔ table is just switching tabs.
You can download the result as .md, .json or .csv. You can download the whole envelope,
just the data (content + records + tree + links), or a flat records array. One click sends the
result to Escriba for further conversion, anonymization, chunking and export.
Downloads
Beyond page text, Fisherboy can pull media and platform data:
- Files: direct file downloads.
- Video: via yt-dlp (YouTube, Vimeo and many others). ffmpeg, bundled in the image, muxes video + audio to high-quality mp4; without it, downloads fall back to the best progressive single file.
- Galleries / images: via gallery-dl (Instagram, X, Reddit, Pinterest, Tumblr, Flickr, DeviantArt and more).
- Comments / platform data: multi-platform; Instagram post comments and follower/following data use instaloader, which needs a session cookie (IG_SESSIONID) and is restricted to the dios role.
REST API

    POST /api/jobs            # validates schema, role × privacy mode, callback & proxy (SSRF); enqueues → 202
    GET  /api/jobs/{job_id}   # status and result (the "envelope")
    POST /api/proxy/test      # routes a request through a proxy; returns exit IP + country + latency
    POST /api/revert          # re-hydrates pseudonymized content (reversible mode)
    POST /api/login           # role login (cookie session)
    GET  /healthz             # liveness
    GET  /metrics             # Prometheus metrics

Submit a job and poll it:

    curl -X POST http://localhost:8000/api/jobs \
      -H 'content-type: application/json' \
      -d '{"url":"https://example.com/article","rol":"angel","privacy_mode":"opaco"}'
    # → { "job_id": "…", "status": "pendiente" }

    curl http://localhost:8000/api/jobs/<job_id>
    # → the envelope with anonymized content_md once status == "ok"

Job fields
| Field | Notes |
|---|---|
| url | The page to fetch. |
| rol | humano / angel / dios. |
| privacy_mode | opaco / reversible / directo (bounded by role). |
| output_format | markdown / llms_txt / json. |
| tier_hint | Suggested starting tier, 0–3. |
| crawl_depth | Depth for the spider crawl. |
| max_pages | Page budget (capped by CRAWL_MAX_PAGES). |
| paginate | Sweep pagination. |
| capture_api | Capture hidden XHR/fetch JSON. |
| tarantula | Capture each node’s content + API into a data tree. |
| extract_schema | JSON Schema for structured extraction (with output_format=json). |
| proxy | Per-job proxy override. |
| cookies | Session cookies for the request. |
| callback_url | Webhook to receive the envelope on completion. |
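The submit-and-poll flow can also be scripted. The sketch below is a hedged Python equivalent of the curl example, using only the standard library. The helper names (`http_json`, `submit_and_wait`), the polling interval and the timeout are assumptions; the only status values taken from this page are "pendiente" and "ok", so other failure statuses are not handled and the loop simply times out.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

BASE_URL = "http://localhost:8000"  # example service URL used on this page


def http_json(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Send a JSON request to the service and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_and_wait(
    job: Dict[str, Any],
    request: Callable[..., Dict[str, Any]] = http_json,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """POST the job, then poll GET /api/jobs/{job_id} until status is "ok"."""
    job_id = request("POST", "/api/jobs", job)["job_id"]
    deadline = time.monotonic() + timeout_s
    while True:
        envelope = request("GET", f"/api/jobs/{job_id}")
        if envelope.get("status") == "ok":
            return envelope
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not reach status 'ok' in {timeout_s}s")
        sleep(interval_s)
```

The `request` and `sleep` parameters exist so the loop can be exercised without a running service, for example by passing a stub that returns canned responses.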
The same pipeline is exposed as MCP tools (submit_job, get_job, revert) so n8n,
Claude Code or Escriba can enqueue without hand-writing HTTP:

    python -m app.mcp_server   # requires fastmcp

The MCP server’s role ceiling is set by MCP_ROLE (it does not trust a role claimed by the
caller).
Advanced configuration
Proxies
Paste a proxy in any format: host:port, host:port:user:pass,
user:pass@host:port or a full URL. Fisherboy normalizes it (socks5 supported). The
Test button (or POST /api/proxy/test) routes a request through it and returns your exit
IP + country + latency, with an actionable hint if it can’t connect. Configure a pool with
PROXIES, choose rotation with PROXY_ROTATION (round_robin / random / sticky), and
tune PROXY_COOLDOWN_S and PROXY_ATTEMPTS. A job can override the pool with its own proxy.
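A normalizer for the four accepted formats might look like the sketch below. It is an illustration under assumptions: the function name is made up, the default scheme of `http` for scheme-less input is a guess, and IPv6 hosts are not handled.

```python
from urllib.parse import quote, urlsplit


def normalize_proxy(raw: str, default_scheme: str = "http") -> str:
    """Turn any of the accepted proxy notations into a proxy URL."""
    raw = raw.strip()
    if "://" in raw:
        parts = urlsplit(raw)
        if not parts.hostname or not parts.port:
            raise ValueError(f"proxy URL needs host and port: {raw!r}")
        return raw  # already a full URL (http, https, socks5, ...)
    if "@" in raw:
        # user:pass@host:port
        creds, _, hostport = raw.rpartition("@")
        user, _, password = creds.partition(":")
        host, _, port = hostport.rpartition(":")
    else:
        fields = raw.split(":")
        if len(fields) == 2:      # host:port
            host, port = fields
            user = password = ""
        elif len(fields) == 4:    # host:port:user:pass
            host, port, user, password = fields
        else:
            raise ValueError(f"unrecognized proxy format: {raw!r}")
    if not host or not port.isdigit():
        raise ValueError(f"unrecognized proxy format: {raw!r}")
    # Percent-encode credentials so special characters survive in the URL.
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}@" if user else ""
    return f"{default_scheme}://{auth}{host}:{port}"
```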
Cookies
Access pages behind a login or a region lock without a browser extension. Paste cookies as Netscape
cookies.txt, JSON or name=value pairs, or read them straight from your local browser
(Chrome / Firefox / Edge / Brave). Browser-cookie reading is standalone only and vetoed
in sidekick mode.
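The three pasted formats could be told apart along these lines. This is a sketch only: the function name is hypothetical, the JSON shapes accepted here (a flat object, or a list of objects with `name` and `value` keys) are assumptions, and the Netscape branch relies on the standard seven tab-separated fields of a cookies.txt line.

```python
import json
from typing import Dict


def parse_cookies(raw: str) -> Dict[str, str]:
    """Parse cookies given as JSON, Netscape cookies.txt, or name=value pairs."""
    raw = raw.strip()
    if raw.startswith(("{", "[")):
        data = json.loads(raw)
        if isinstance(data, dict):
            return {str(k): str(v) for k, v in data.items()}
        return {str(c["name"]): str(c["value"]) for c in data}
    if "\t" in raw:
        # Netscape format: domain, subdomain flag, path, secure, expiry, name, value
        cookies: Dict[str, str] = {}
        for line in raw.splitlines():
            if line.startswith("#HttpOnly_"):
                line = line[len("#HttpOnly_"):]  # HttpOnly cookies carry this prefix
            elif not line.strip() or line.startswith("#"):
                continue  # blank line or comment
            fields = line.split("\t")
            if len(fields) == 7:
                cookies[fields[5]] = fields[6]
        return cookies
    # Fallback: "name=value; other=value" pairs, as in a Cookie header
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```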
CAPTCHA
The default anti-CAPTCHA strategy is prevention by escalation (CAPTCHA_SOLVER=none): the
fetch gate detects a CAPTCHA and steps up a tier. Optionally, an external API solver can be
configured with CAPTCHA_SOLVER=external, CAPTCHA_SOLVER_URL and CAPTCHA_SOLVER_KEY
(gated by role).
Pagination
Set paginate on a job to sweep multi-page listings. Fisherboy handles common schemes:
ASP.NET postbacks, “next” links and ?page= query patterns. The total is bounded by
max_pages and the hard CRAWL_MAX_PAGES cap.
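For the ?page= scheme, the page budget works out to the smaller of the job's max_pages and the global cap. The sketch below generates the URLs such a sweep would visit; the function name, the `page` parameter name and the 1-based first page are assumptions, and a real sweep would presumably also stop early when a listing runs out of pages.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(
    listing_url: str,
    max_pages: int,
    crawl_max_pages: int,
    param: str = "page",
    first_page: int = 1,
) -> List[str]:
    """Build the ?page=N URLs to sweep, bounded by both page limits."""
    budget = max(0, min(max_pages, crawl_max_pages))  # the hard cap always wins
    parts = urlsplit(listing_url)
    # Keep every other query parameter, drop any existing page number.
    base_query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param
    ]
    urls = []
    for n in range(first_page, first_page + budget):
        query = urlencode(base_query + [(param, str(n))])
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment)))
    return urls
```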
Browser tiers
For tiers 2 and 3, tune the headless browser with BROWSER_HEADLESS, BROWSER_SETTLE_S
(wait after load), BROWSER_SCROLL (trigger lazy-load), BROWSER_LOCALE and
BROWSER_USER_AGENT.
Spider & tarantula (deep crawl)
- Spider: follows internal links into a tree (with section scoping) up to crawl_depth, optionally combined with pagination.
- Tarantula: the deep mode. It walks each node and captures both its content and its hidden API into a single data tree. Tarantula is gated to high roles and vetoed in sidekick mode.
Multi-page crawling respects robots.txt when RESPECT_ROBOTS=1.
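A depth-limited spider of this kind is essentially a breadth-first walk over internal links. The sketch below shows one way to do it; the function name, the `get_links` callable (standing in for fetch plus link extraction) and the `allowed` callable (standing in for a robots.txt check) are all hypothetical, and section scoping is omitted.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlsplit


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    crawl_depth: int,
    max_pages: int,
    allowed: Callable[[str], bool] = lambda url: True,
) -> Dict[str, List[str]]:
    """Breadth-first crawl of internal links; returns {visited_url: child_urls}."""
    host = urlsplit(start_url).netloc
    tree: Dict[str, List[str]] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(tree) < max_pages:  # stop once the page budget is spent
        url, depth = queue.popleft()
        if not allowed(url):
            continue  # e.g. disallowed by robots.txt
        children: List[str] = []
        if depth < crawl_depth:  # pages at the depth limit are visited but not expanded
            for link in get_links(url):
                if urlsplit(link).netloc == host and link not in seen:
                    seen.add(link)
                    children.append(link)
                    queue.append((link, depth + 1))
        tree[url] = children
    return tree
```

Children listed for a page are the links discovered there; some of them may never become keys of the result if the budget runs out or `allowed` rejects them. A robots.txt check from the standard library's `urllib.robotparser` could be passed in as `allowed`.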
Security and limits
Fisherboy is fail-closed and hardened: anti-SSRF (DNS resolved; private/loopback/link-local/
cloud-metadata ranges blocked, re-validated on every redirect hop and every browser
request, including the proxy override), per-job secret scrubbing (proxy creds, captcha key,
cookies never appear in the envelope or webhook), role gating on REST and MCP, rate-limiting
(MAX_JOBS_PER_MIN), hard page and byte caps (CRAWL_MAX_PAGES, JOB_MAX_TOTAL_BYTES), and a
non-root container. Review the production checklist before exposing it, and never set
ALLOW_PRIVATE_TARGETS=1 or FISHERBOY_OPEN_GOD=1 in production.
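The address-range part of an anti-SSRF check can be sketched with the standard `ipaddress` module. This illustrates the general technique, not Fisherboy's implementation: the function names are made up, and the re-validation on every redirect hop and browser request described above is not shown. The cloud-metadata address 169.254.169.254 is covered because it falls in the link-local range.

```python
import ipaddress
import socket


def is_blocked_address(ip: str) -> bool:
    """Return True for addresses a fetcher should refuse to connect to."""
    addr = ipaddress.ip_address(ip)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it is judged as IPv4.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def host_is_safe(host: str) -> bool:
    """Resolve the host and require every resolved address to be public."""
    infos = socket.getaddrinfo(host, None)
    # info[4][0] is the address string; drop any IPv6 zone suffix such as %eth0.
    return all(not is_blocked_address(info[4][0].split("%")[0]) for info in infos)
```

Checking every resolved address, rather than only the first, avoids a hostname that resolves to both a public and a private address slipping through.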
Optional persistence and observability
Postgres + pgvector persistence is optional (DATABASE_URL, plus EMBEDDINGS_ENABLED for a
vector store); without it the system runs on Redis alone and degrades gracefully. A
Prometheus + Loki + Grafana stack is available via docker-compose.observability.yml.