Day job 2026 Solo developer · In progress

Knowledge Scraper — Autonomous Research Agent

An autonomous Python research agent — give it any topic and it plans sub-questions, searches the web, fetches + extracts content, scores sources with Claude, and synthesizes a structured report with citations. Live progress streamed over SSE.

Hire me for similar

Knowledge Scraper — Autonomous Research Agent screenshot

Year: 2026
Type: Day job
Stack: 9
Outcomes: 4

/ The challenge

What needed solving

Internal teams burn hours collecting context for new topics — manual googling, copy-pasting into docs, no source tracking, no relevance ranking. Static search isn't enough; the work needs actual reasoning over what to search next based on what's already been found.

/ What I built

The solution

Built a closed-loop agent in Python: Claude plans 3–6 sub-questions from the topic, runs them through DuckDuckGo, fetches results with httpx + Playwright fallback, extracts clean content with trafilatura, scores each source for relevance/credibility, and reflects on whether coverage is sufficient or needs another loop. FastAPI exposes a Server-Sent Events stream so the UI shows agent progress in real time.

Python 3.11FastAPIAnthropic ClaudePlaywrightBeautifulSouptrafilaturaDuckDuckGoSQLModelServer-Sent Events

/ Outcomes

What changed

Closed-loop agent: plan → search → fetch → extract → score → reflect → synthesize — runs end-to-end from a single prompt
SSE-streamed progress so users see each step (searching, fetching URL X, scoring, reflecting) instead of staring at a spinner
Configurable token + URL budgets per run keep cost and runtime predictable
Past runs persisted in SQLite so users can revisit, export, or share previous reports

/ Under the hood

Technical highlights

Multi-step planner-loop agent using Claude (sonnet-4.6) with prompt caching on system prompt + reused context

Tool-using design: search tool, fetch tool, extract tool, score tool — each composable and replaceable per provider

Hybrid scraping: httpx + BeautifulSoup for static HTML, Playwright (headless Chromium) for JS-rendered fallback

Polite by default — respects robots.txt, real User-Agent, per-domain rate limiting, graceful failure on dead URLs

Structured output: TL;DR, Key Findings (cited), Conflicting Information, Open Questions, Sources panel with relevance bars

Need something like this?

I take on a small number of projects each quarter. Let's talk if your idea fits.

Start a project Ask my AI first