Fisherboy

Any web page, ready for your AI.

Fisherboy is the web-extraction satellite of the Escriba family. Point it at any page and get back clean Markdown or structured JSON: pruned of navigation and boilerplate, anonymized before it reaches a model or a third party, and ready to feed to an LLM. It escalates only when a site fights back, from a plain HTTP request all the way up to a real browser, and it can capture the hidden JSON/XHR that single-page apps already load.

Fisherboy is self-hostable as a single Docker image. It runs standalone with its own web UI, or headless behind Escriba as a REST + MCP service.

  • Anyone collecting web content for an LLM — articles, documentation, product grids, search results — who wants clean Markdown instead of raw HTML.
  • Builders and automators driving extraction from curl, n8n, Claude Code or Escriba over REST or MCP.
  • Privacy-conscious users who need PII stripped or pseudonymized before any data reaches a model or a third party.
  • Self-hosters who want everything to run on their own hardware, with role-based access and an audited security posture.

Page to Markdown or JSON

Clean fit_markdown (Crawl4AI) with a Trafilatura fallback, or structured extraction to a JSON Schema via an LLM.

Tiered anti-blocking fetch

Escalates only when blocked: tier 0 static HTTP, tier 1 TLS fingerprint, tier 2 stealth browser, tier 3 real browser. The winning tier is cached per domain.
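The escalation idea can be sketched in a few lines of Python. This is an illustration only, not Fisherboy's actual code: the `TieredFetcher` class, the `Fetcher` signature (return the body, or `None` when blocked) and the stub fetchers are all hypothetical stand-ins for the four real tiers.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlparse

# Assumed signature: a fetcher returns the page body, or None when blocked.
Fetcher = Callable[[str], Optional[str]]


class TieredFetcher:
    """Try cheap tiers first; remember which tier worked for each domain."""

    def __init__(self, tiers: List[Fetcher]) -> None:
        self.tiers = tiers
        self.winning_tier: Dict[str, int] = {}  # domain -> tier index

    def fetch(self, url: str) -> Tuple[int, str]:
        domain = urlparse(url).netloc
        # Start at the cached tier for this domain, or at tier 0.
        start = self.winning_tier.get(domain, 0)
        for tier in range(start, len(self.tiers)):
            body = self.tiers[tier](url)
            if body is not None:
                self.winning_tier[domain] = tier
                return tier, body
        raise RuntimeError(f"all tiers blocked for {url}")


# Stub tiers for demonstration: the first two are "blocked".
def blocked(url: str) -> Optional[str]:
    return None


def ok(url: str) -> Optional[str]:
    return "<html>ok</html>"


fetcher = TieredFetcher([blocked, blocked, ok, ok])
tier, body = fetcher.fetch("https://example.com/a")
print(tier, fetcher.winning_tier)
```

In this sketch a second request to the same domain skips the two blocked tiers and starts at the cached one. It never steps back down to a cheaper tier; whether the real cache expires or retries lower tiers is not stated here.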

Hidden API capture

Instead of fighting rendered HTML, watch the XHR/fetch JSON the page already loads — the most reliable way to scrape SPAs and dynamic grids.
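To make the idea concrete, here is a minimal sketch of the filtering step, assuming the browser layer has already recorded the page's network responses. The `CapturedResponse` record and `extract_json_payloads` function are hypothetical names for illustration; they are not Fisherboy's API.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class CapturedResponse:
    """Hypothetical record of one network response seen while a page loads."""
    url: str
    resource_type: str  # e.g. "xhr", "fetch", "document", "image"
    content_type: str
    body: str


def extract_json_payloads(
    responses: List[CapturedResponse],
) -> List[Tuple[str, Any]]:
    """Keep only XHR/fetch responses whose body parses as JSON."""
    payloads = []
    for r in responses:
        if r.resource_type not in ("xhr", "fetch"):
            continue
        if "json" not in r.content_type.lower():
            continue
        try:
            payloads.append((r.url, json.loads(r.body)))
        except json.JSONDecodeError:
            continue  # mislabeled or truncated body: skip it
    return payloads


captured = [
    CapturedResponse("https://shop.example/", "document", "text/html", "<html>"),
    CapturedResponse("https://shop.example/api/items", "xhr",
                     "application/json", '{"items": [{"id": 1}]}'),
    CapturedResponse("https://shop.example/logo.png", "image", "image/png", ""),
]
print(extract_json_payloads(captured))
```

The structured data arrives already parsed, so there is no need to reverse-engineer it from the rendered grid.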

Spider & tarantula crawl

Follow internal links into a tree, sweep pagination, and capture each node’s content and API responses into a data tree.
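The link-following part can be sketched as a breadth-first walk that keeps only same-host links and records parent-to-child edges. This is an assumption-laden illustration: `crawl_tree`, the `get_links` callback and the depth limit are hypothetical, and pagination sweeps and API capture are left out.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List
from urllib.parse import urljoin, urlparse


def crawl_tree(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> Dict[str, List[str]]:
    """Breadth-first walk over internal links; returns parent -> children."""
    root_host = urlparse(root).netloc
    tree: Dict[str, List[str]] = {root: []}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the node, but do not expand it further
        for href in get_links(url):
            child = urljoin(url, href)  # resolve relative links
            if urlparse(child).netloc != root_host or child in seen:
                continue  # skip external links and already-visited pages
            seen.add(child)
            tree[url].append(child)
            tree[child] = []
            queue.append((child, depth + 1))
    return tree


# A fake site stands in for real fetching and link extraction.
site = {
    "https://ex.com/": ["/a", "/b", "https://other.com/x"],
    "https://ex.com/a": ["/b", "/c"],
}
print(crawl_tree("https://ex.com/", lambda u: site.get(u, [])))
```

Each page appears once, under the first parent that linked to it, so the result is a tree rather than a general graph.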

Download everything

Files, video (yt-dlp), galleries (gallery-dl) and platform comments — beyond just the page text.

PII anonymization

Three privacy modes — opaque, reversible and direct — bounded by role and fail-closed, with full NER via Anonimal or a built-in regex fallback.
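As a rough illustration of a regex fallback, the sketch below handles e-mail addresses only. The mode semantics are assumptions, not Fisherboy's documented behavior: "opaque" is taken to mean an irreversible placeholder and "reversible" a numbered token with a lookup table kept on your side. The "direct" mode is not modeled, and any unknown mode is rejected rather than passed through.

```python
import re
from typing import Dict, Tuple

# Simplified e-mail pattern for illustration; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text: str, mode: str) -> Tuple[str, Dict[str, str]]:
    """Return (anonymized text, token -> original mapping)."""
    mapping: Dict[str, str] = {}
    if mode == "opaque":
        # Assumed meaning: irreversible placeholder, nothing to restore.
        return EMAIL_RE.sub("[EMAIL]", text), mapping
    if mode == "reversible":
        # Assumed meaning: stable numbered tokens plus a private mapping.
        tokens: Dict[str, str] = {}  # original -> token

        def replace(match: "re.Match[str]") -> str:
            original = match.group(0)
            if original not in tokens:
                token = f"<EMAIL_{len(tokens) + 1}>"
                tokens[original] = token
                mapping[token] = original
            return tokens[original]

        return EMAIL_RE.sub(replace, text), mapping
    # Fail closed: refuse unknown modes instead of returning raw text.
    raise ValueError(f"unsupported mode: {mode!r}")


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back using the private mapping."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


masked, table = anonymize("Write to ana@example.com today.", "reversible")
print(masked)
print(restore(masked, table))
```

Keeping the mapping out of the text that is sent onward is what makes the reversible mode safe to use with a model: only the caller can turn the tokens back into real values.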

Proxies & cookies

Paste a proxy in any format and test your exit IP; paste cookies or read them from your local browser for pages behind a login.
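Accepting a proxy "in any format" usually comes down to normalizing a few common notations into one URL. The sketch below shows one way to do that; `normalize_proxy` is a hypothetical helper, and the set of formats it accepts is an assumption rather than the list Fisherboy supports.

```python
def normalize_proxy(raw: str, default_scheme: str = "http") -> str:
    """Normalize common proxy notations to scheme://[user:pass@]host:port.

    Handles: host:port, host:port:user:pass, user:pass@host:port,
    and any of these with a leading scheme such as socks5://.
    """
    raw = raw.strip()
    scheme = default_scheme
    if "://" in raw:
        scheme, raw = raw.split("://", 1)

    creds = ""
    if "@" in raw:
        # user:pass@host:port
        creds, hostport = raw.rsplit("@", 1)
    else:
        parts = raw.split(":")
        if len(parts) == 4:
            # host:port:user:pass
            hostport = f"{parts[0]}:{parts[1]}"
            creds = f"{parts[2]}:{parts[3]}"
        elif len(parts) == 2:
            hostport = raw
        else:
            raise ValueError(f"unrecognized proxy format: {raw!r}")

    host, _, port = hostport.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"unrecognized proxy format: {raw!r}")

    auth = f"{creds}@" if creds else ""
    return f"{scheme}://{auth}{host}:{port}"


for example in ("1.2.3.4:8080", "1.2.3.4:8080:user:pw",
                "socks5://user:pw@proxy.example:1080"):
    print(normalize_proxy(example))
```

Once every input is reduced to one canonical URL, testing the exit IP is a single request through that proxy.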

Role-based access

Three levels — dios / angel / humano — each with its own password and capability limits, enforced on REST and MCP.

REST + MCP

Drive it from curl, n8n, Claude Code or Escriba. The same pipeline is exposed as MCP tools.