Fisherboy documentation

This page documents how Fisherboy works end to end: the core concepts, the web UI, the download capabilities, the REST and MCP APIs, and advanced configuration.

Everything Fisherboy does is a job. You submit a job (a URL plus options), it is enqueued on Redis, a worker runs the pipeline, and the result is stored as an envelope — a single object holding the extracted content (content_md), structured records, the crawl tree, the discovered links and metadata. You poll the job until its status is ok.

Tiered fetch (escalates only when blocked)

Fisherboy never reaches for a heavy browser until it has to. It tries the cheapest method first and steps up only when a gate detects a block or a CAPTCHA:

  • Tier 0 — static HTTP with httpx. Fast and cheap; works for most static pages.
  • Tier 1 — TLS fingerprint with curl_cffi, so the request looks like a real browser’s at the network layer.
  • Tier 2 — stealth browser (Camoufox / Patchright) for JavaScript-heavy and anti-bot pages; this tier also drives hidden API capture.
  • Tier 3 — a real browser (nodriver / Playwright) as a last resort.

The winning tier is cached per domain (TIER_CACHE_TTL_S), so the next request to the same site starts where it last succeeded. The high tiers are lazy-imported — the base image stays light, and each tier turns on only if its library is installed. The escalation ceiling is set with MAX_FETCH_TIER, and tier_hint on a job can suggest a starting point.
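The escalation loop and per-domain cache described above can be sketched roughly as follows. This is a minimal illustration, not Fisherboy's actual code: `TierCache`, `fetch_with_escalation`, the `(body, blocked)` fetcher contract and the rule that a cached tier takes precedence over `tier_hint` are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical fetcher contract: returns (body, blocked). A real gate would
# inspect status codes, challenge pages and CAPTCHA markers.
Fetcher = Callable[[str], Tuple[str, bool]]


class TierCache:
    """Remembers the last winning tier per domain for ttl_s seconds."""

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._entries: Dict[str, Tuple[int, float]] = {}

    def get(self, domain: str) -> Optional[int]:
        entry = self._entries.get(domain)
        if entry is None:
            return None
        tier, stored_at = entry
        if time.monotonic() - stored_at > self.ttl_s:
            del self._entries[domain]  # expired, forget it
            return None
        return tier

    def put(self, domain: str, tier: int) -> None:
        self._entries[domain] = (tier, time.monotonic())


def fetch_with_escalation(
    url: str,
    fetchers: List[Fetcher],      # index = tier number, cheapest first
    cache: TierCache,
    max_tier: int,                # plays the role of MAX_FETCH_TIER
    tier_hint: Optional[int] = None,
) -> Tuple[int, str]:
    domain = urlsplit(url).hostname or ""
    start = cache.get(domain)
    if start is None:
        # Assumption: the hint only matters when nothing is cached.
        start = tier_hint if tier_hint is not None else 0
    ceiling = min(max_tier, len(fetchers) - 1)
    start = min(start, ceiling)
    for tier in range(start, ceiling + 1):
        body, blocked = fetchers[tier](url)
        if not blocked:
            cache.put(domain, tier)  # next request starts here
            return tier, body
    raise RuntimeError(f"blocked at every tier up to {ceiling}")
```

With stub fetchers where tiers 0 and 1 report a block and tier 2 succeeds, the first call tries tiers 0, 1 and 2 in order, and a second call to the same domain goes straight to tier 2.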

Single-page apps and dynamic grids render from JSON they fetch over XHR/fetch. Rather than fight the rendered HTML, Fisherboy can watch those network responses and keep the JSON the page already consumes — usually the most reliable and complete source of the data. Enable it per job with capture_api.

The privacy mode is chosen per job and bounded by role (defined in privacy_matrix.yaml, never hardcoded). The pipeline is fail-closed: if anonymization fails, nothing raw is delivered.

  • opaque (opaco) — each entity becomes a stable typed marker such as «PERSON_1» or «ID_2». The LLM reasons over markers and never sees the PII; the original is not recoverable.
  • reversible — the same masking, but an encrypted token-to-value map is kept so you can re-hydrate later via POST /api/revert (single-use, role-bound). The map is encrypted with a Fernet key (REVERSIBLE_KEY) and expires after REVERSIBLE_TTL_S.
  • direct (directo) — raw output, only for non-sensitive data.

A deterministic regex pass always runs for high-risk PII (national ID, email, IP, Luhn-valid card, phone). When ANONIMAL_URL points at an Anonimal instance, full NER runs on top of it; standalone falls back to the built-in regex pass only.
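To make the deterministic pass concrete, here is a small sketch of regex-based masking with a Luhn check and stable typed markers. It is only an illustration under stated assumptions: the patterns are simplified, the marker labels (`EMAIL`, `IP`, `CARD`) and the `mask` function are hypothetical, and national ID and phone detection are left out because their formats are country-specific.

```python
import re
from typing import Dict, Tuple

# Simplified patterns for illustration; production rules would be stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # does not check octet range
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")     # 13 to 19 digits


def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of a candidate card number."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask(text: str) -> str:
    """Replace each detected entity with a stable marker such as «EMAIL_1»."""
    markers: Dict[Tuple[str, str], str] = {}
    counters: Dict[str, int] = {}

    def marker_for(kind: str, value: str) -> str:
        key = (kind, value)
        if key not in markers:  # same value always maps to the same marker
            counters[kind] = counters.get(kind, 0) + 1
            markers[key] = f"«{kind}_{counters[kind]}»"
        return markers[key]

    def card(match: "re.Match[str]") -> str:
        raw = match.group()
        if not luhn_valid(raw):
            return raw          # digit run that is not a valid card
        return marker_for("CARD", re.sub(r"\D", "", raw))

    text = CARD_RE.sub(card, text)
    text = EMAIL_RE.sub(lambda m: marker_for("EMAIL", m.group().lower()), text)
    text = IPV4_RE.sub(lambda m: marker_for("IP", m.group()), text)
    return text
```

Because the marker table is keyed on the normalized value, repeated mentions of the same email or card collapse to the same marker, which is what lets a downstream LLM reason about "the same entity" without seeing it.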

If a role does not allow the requested mode, the gateway returns 403 — it never silently downgrades.

Fisherboy has three access levels, each with its own password and capability limits (which tiers, proxies, API capture, captcha solver, crawl and tarantula are allowed). Roles are enforced on both REST and MCP.

Role     opaque   reversible   direct
humano   yes      -            -
angel    yes      yes          -
dios     yes      yes          yes

Broadly: humano gets the cheap tiers and no expensive “weapons”; angel adds browser, proxy and capture (but no captcha solver); dios has everything. Tarantula and browser-cookie reading are vetoed in sidekick mode.
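The role and privacy-mode rule (refuse with 403, never downgrade) can be sketched as below. The matrix is written inline only to keep the example self-contained; per the text above, the real one lives in privacy_matrix.yaml. `Forbidden` and `check_privacy_mode` are hypothetical names.

```python
from typing import Dict, FrozenSet

# Inline stand-in for privacy_matrix.yaml (illustration only).
PRIVACY_MATRIX: Dict[str, FrozenSet[str]] = {
    "humano": frozenset({"opaco"}),
    "angel": frozenset({"opaco", "reversible"}),
    "dios": frozenset({"opaco", "reversible", "directo"}),
}


class Forbidden(Exception):
    """Maps to an HTTP 403 at the gateway."""
    status_code = 403


def check_privacy_mode(
    role: str,
    mode: str,
    matrix: Dict[str, FrozenSet[str]] = PRIVACY_MATRIX,
) -> str:
    allowed = matrix.get(role)
    if allowed is None or mode not in allowed:
        # Fail closed: refuse instead of silently picking a weaker mode.
        raise Forbidden(f"role {role!r} may not use privacy mode {mode!r}")
    return mode
```

An unknown role is treated the same as a disallowed mode, so a typo in the role name cannot widen access.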

In standalone mode Fisherboy mounts its own web UI at the service URL (e.g. http://localhost:8000). After logging in with a role password, you:

  1. Paste a URL and choose the output (markdown, llms_txt or json) and the privacy mode allowed by your role.
  2. Optionally enable extras — pagination, API capture, a crawl depth, tarantula, a proxy, cookies.
  3. Run the job.

The result opens in a modal editor with three tabs:

  • Markdown — a toolbar with live preview.
  • JSON — a validating editor.
  • Table — an editable table; JSON ↔ table is just switching tabs.

You can download the result as .md, .json or .csv, and choose between the whole envelope, just the data (content + records + tree + links), or a flat records array. One click sends the result to Escriba for further conversion, anonymization, chunking and export.

Beyond page text, Fisherboy can pull media and platform data:

  • Files — direct file downloads.
  • Video — via yt-dlp (YouTube, Vimeo and many others). ffmpeg, bundled in the image, muxes video + audio to high-quality mp4; without it, downloads fall back to the best progressive single file.
  • Galleries / images — via gallery-dl (Instagram, X, Reddit, Pinterest, Tumblr, Flickr, DeviantArt and more).
  • Comments / platform data — multi-platform; Instagram post comments and follower/following data use instaloader, which needs a session cookie (IG_SESSIONID) and is restricted to the dios role.

POST /api/jobs           # validates schema, role × privacy mode, callback & proxy (SSRF); enqueues → 202
GET  /api/jobs/{job_id}  # status and result (the "envelope")
POST /api/proxy/test     # routes a request through a proxy; returns exit IP + country + latency
POST /api/revert         # re-hydrates pseudonymized content (reversible mode)
POST /api/login          # role login (cookie session)
GET  /healthz            # liveness
GET  /metrics            # Prometheus metrics

Submit a job and poll it:

curl -X POST http://localhost:8000/api/jobs \
-H 'content-type: application/json' \
-d '{"url":"https://example.com/article","rol":"angel","privacy_mode":"opaco"}'
# → { "job_id": "…", "status": "pendiente" }
curl http://localhost:8000/api/jobs/<job_id>
# → the envelope with anonymized content_md once status == "ok"

Field            Notes
url              The page to fetch.
rol              humano / angel / dios.
privacy_mode     opaco / reversible / directo (bounded by role).
output_format    markdown / llms_txt / json.
tier_hint        Suggested starting tier, 0 to 3.
crawl_depth      Depth for the spider crawl.
max_pages        Page budget (capped by CRAWL_MAX_PAGES).
paginate         Sweep pagination.
capture_api      Capture hidden XHR/fetch JSON.
tarantula        Capture each node’s content + API into a data tree.
extract_schema   JSON Schema for structured extraction (with output_format=json).
proxy            Per-job proxy override.
cookies          Session cookies for the request.
callback_url     Webhook to receive the envelope on completion.
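
For scripted use, the submit-and-poll flow can be wrapped in a few lines of Python. This is a sketch using only the standard library and the endpoints and field names listed above. The helper names (`build_job`, `http_json`, `run_job`) are hypothetical, authentication is omitted to mirror the curl example, and only the `ok` status is handled because the other status values are not documented here.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

BASE_URL = "http://localhost:8000"  # assumption: standalone default from the docs


def build_job(url: str, rol: str, privacy_mode: str, **options: Any) -> Dict[str, Any]:
    """Assemble a job body; options left as None are dropped."""
    payload: Dict[str, Any] = {"url": url, "rol": rol, "privacy_mode": privacy_mode}
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


def http_json(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Minimal JSON transport over urllib."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_job(
    payload: dict,
    request: Callable[..., dict] = http_json,
    interval_s: float = 1.0,
    timeout_s: float = 120.0,
) -> dict:
    """Submit a job, then poll until the envelope reports status 'ok'."""
    job = request("POST", "/api/jobs", payload)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        envelope = request("GET", f"/api/jobs/{job['job_id']}")
        if envelope.get("status") == "ok":
            return envelope
        time.sleep(interval_s)
    raise TimeoutError(f"job {job['job_id']} not finished after {timeout_s}s")
```

A call such as `run_job(build_job("https://example.com/article", "angel", "opaco", capture_api=True))` would then return the envelope. Passing `callback_url` instead avoids polling altogether.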

The same pipeline is exposed as MCP tools (submit_job, get_job, revert) so n8n, Claude Code or Escriba can enqueue without hand-writing HTTP:

python -m app.mcp_server # requires fastmcp

The MCP server’s role ceiling is set by MCP_ROLE (it does not trust a role claimed by the caller).

Paste a proxy in any format (host:port, host:port:user:pass, user:pass@host:port or a full URL) and Fisherboy normalizes it (socks5 supported). The Test button (or POST /api/proxy/test) routes a request through it and returns your exit IP + country + latency, with an actionable hint if it can’t connect. Configure a pool with PROXIES, choose rotation with PROXY_ROTATION (round_robin / random / sticky), and tune PROXY_COOLDOWN_S and PROXY_ATTEMPTS. A job can override the pool with its own proxy.
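
As an illustration of what normalizing those proxy formats might involve, the sketch below maps each accepted shape to a single URL form. `normalize_proxy` is a hypothetical helper, the `http` default scheme is an assumption, and bare IPv6 hosts are not handled.

```python
from urllib.parse import quote, urlsplit


def normalize_proxy(raw: str, default_scheme: str = "http") -> str:
    """Return scheme://[user:pass@]host:port for any of the accepted shapes."""
    raw = raw.strip()
    if "://" in raw:
        parts = urlsplit(raw)
        # .port raises ValueError by itself when the port is not numeric.
        if not parts.hostname or parts.port is None:
            raise ValueError(f"proxy URL needs host and port: {raw!r}")
        return raw
    if "@" in raw:
        # user:pass@host:port
        creds, _, hostport = raw.rpartition("@")
        user, _, password = creds.partition(":")
        host, _, port = hostport.rpartition(":")
    else:
        fields = raw.split(":")
        if len(fields) == 2:       # host:port
            host, port = fields
            user = password = ""
        elif len(fields) == 4:     # host:port:user:pass
            host, port, user, password = fields
        else:
            raise ValueError(f"unrecognized proxy format: {raw!r}")
    if not host or not port.isdigit():
        raise ValueError(f"proxy needs host and numeric port: {raw!r}")
    # Percent-encode credentials so special characters survive in a URL.
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}@" if user else ""
    return f"{default_scheme}://{auth}{host}:{port}"
```

Reducing every input to one canonical form first keeps the later steps (SSRF validation, rotation, secret scrubbing) working on a single shape.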

Access pages behind a login or a region lock without a browser extension. Paste cookies as Netscape cookies.txt, JSON or name=value pairs, or read them straight from your local browser (Chrome / Firefox / Edge / Brave). Browser-cookie reading is standalone only and vetoed in sidekick mode.

The default anti-CAPTCHA strategy is prevention by escalation (CAPTCHA_SOLVER=none): the fetch gate detects a CAPTCHA and steps up a tier. Optionally, an external API solver can be configured with CAPTCHA_SOLVER=external, CAPTCHA_SOLVER_URL and CAPTCHA_SOLVER_KEY (gated by role).

Set paginate on a job to sweep multi-page listings. Fisherboy handles common schemes — ASP.NET postbacks, “next” links and ?page= query patterns. The total is bounded by max_pages and the hard CRAWL_MAX_PAGES cap.

For tiers 2 and 3, tune the headless browser with BROWSER_HEADLESS, BROWSER_SETTLE_S (wait after load), BROWSER_SCROLL (trigger lazy-load), BROWSER_LOCALE and BROWSER_USER_AGENT.

  • Spider — follow internal links into a tree (with section scoping) up to crawl_depth, optionally combined with pagination.
  • Tarantula — the deep mode: it walks each node and captures both its content and its hidden API into a single data tree. Tarantula is gated to high roles and vetoed in sidekick mode.
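
The Spider mode can be pictured as a breadth-first walk bounded by depth and page budget. The sketch below is an assumption-laden illustration: `crawl` and the `get_links` callback are hypothetical, "internal" is taken to mean "same host", and section scoping, pagination and robots.txt handling are left out.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List
from urllib.parse import urljoin, urlsplit


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],  # returns hrefs found on a page
    crawl_depth: int,
    max_pages: int,
) -> Dict[str, List[str]]:
    """Breadth-first crawl; returns {visited_url: [internal child links]}."""
    origin = urlsplit(start_url).netloc
    tree: Dict[str, List[str]] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(tree) < max_pages:
        url, depth = queue.popleft()
        children: List[str] = []
        tree[url] = children
        if depth >= crawl_depth:
            continue  # visit the page but do not follow its links
        for href in get_links(url):
            link = urljoin(url, href).split("#")[0]  # resolve, drop fragment
            if urlsplit(link).netloc != origin or link in seen:
                continue  # external or already queued
            seen.add(link)
            children.append(link)
            queue.append((link, depth + 1))
    return tree
```

Note that a child listed under a visited page may itself be missing from the result when the page budget runs out first.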

Multi-page crawling respects robots.txt when RESPECT_ROBOTS=1.

Fisherboy is fail-closed and hardened: anti-SSRF (DNS resolved; private/loopback/link-local/cloud-metadata ranges blocked, re-validated on every redirect hop and every browser request, including the proxy override), per-job secret scrubbing (proxy creds, captcha key, cookies never appear in the envelope or webhook), role gating on REST and MCP, rate-limiting (MAX_JOBS_PER_MIN), hard page and byte caps (CRAWL_MAX_PAGES, JOB_MAX_TOTAL_BYTES), and a non-root container. Review the production checklist before exposing it, and never set ALLOW_PRIVATE_TARGETS=1 or FISHERBOY_OPEN_GOD=1 in production.
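
The anti-SSRF check amounts to resolving the target and refusing any address in a non-public range. A minimal sketch with the standard `ipaddress` and `socket` modules follows. The function names are hypothetical and the exact set of ranges Fisherboy blocks is not spelled out here, so the categories below are an assumption.

```python
import ipaddress
import socket
from typing import Callable, List, Optional
from urllib.parse import urlsplit


def is_blocked_ip(ip: str) -> bool:
    """True for private, loopback, link-local and other non-public ranges."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local      # includes 169.254.169.254 (cloud metadata)
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def default_resolve(host: str) -> List[str]:
    """Resolve a hostname to every address it maps to."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def assert_public_target(
    url: str,
    resolve: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Raise unless every address behind the URL's host is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported target: {url!r}")
    ips = (resolve or default_resolve)(parts.hostname)
    if not ips:
        raise ValueError(f"host did not resolve: {parts.hostname!r}")
    for ip in ips:
        if is_blocked_ip(ip):
            raise PermissionError(f"{parts.hostname} resolves to blocked address {ip}")
    return ips
```

A check like this has to run again on each redirect hop, as described above, since a public URL can redirect to an internal one. Connecting to the validated address rather than re-resolving the name also narrows the window for DNS rebinding.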

Postgres + pgvector persistence is optional (DATABASE_URL, plus EMBEDDINGS_ENABLED for a vector store); without it the system runs on Redis alone and degrades gracefully. A Prometheus + Loki + Grafana stack is available via docker-compose.observability.yml.