Skip to content
Day job 2026 Solo developer · In progress

Knowledge Scraper — Autonomous Research Agent

An autonomous Python research agent — give it any topic and it plans sub-questions, searches the web, fetches + extracts content, scores sources with Claude, and synthesizes a structured report with citations. Live progress streamed over SSE.

Knowledge Scraper — Autonomous Research Agent screenshot
Year
2026
Type
Day job
Stack
9
Outcomes
4
/ The challenge

What needed solving

Internal teams burn hours collecting context for new topics — manual googling, copy-pasting into docs, no source tracking, no relevance ranking. Static search isn't enough; the work needs actual reasoning over what to search next based on what's already been found.

/ What I built

The solution

Built a closed-loop agent in Python: Claude plans 3–6 sub-questions from the topic, runs them through DuckDuckGo, fetches results with httpx + Playwright fallback, extracts clean content with trafilatura, scores each source for relevance/credibility, and reflects on whether coverage is sufficient or needs another loop. FastAPI exposes a Server-Sent Events stream so the UI shows agent progress in real time.

Python 3.11FastAPIAnthropic ClaudePlaywrightBeautifulSouptrafilaturaDuckDuckGoSQLModelServer-Sent Events
/ Outcomes

What changed

  • Closed-loop agent: plan → search → fetch → extract → score → reflect → synthesize — runs end-to-end from a single prompt
  • SSE-streamed progress so users see each step (searching, fetching URL X, scoring, reflecting) instead of staring at a spinner
  • Configurable token + URL budgets per run keep cost and runtime predictable
  • Past runs persisted in SQLite so users can revisit, export, or share previous reports
/ Under the hood

Technical highlights

Multi-step planner-loop agent using Claude (sonnet-4.6) with prompt caching on system prompt + reused context
Tool-using design: search tool, fetch tool, extract tool, score tool — each composable and replaceable per provider
Hybrid scraping: httpx + BeautifulSoup for static HTML, Playwright (headless Chromium) for JS-rendered fallback
Polite by default — respects robots.txt, real User-Agent, per-domain rate limiting, graceful failure on dead URLs
Structured output: TL;DR, Key Findings (cited), Conflicting Information, Open Questions, Sources panel with relevance bars

Need something like this?

I take on a small number of projects each quarter. Let's talk if your idea fits.