Day job 2026 Solo developer · In production

News Desk — Autonomous Story-Discovery Agent

An automated newsroom agent that reproduces an editorial team's morning routine: it searches the web and ~52 curated sources across 11 countries for textile/apparel news, scores and de-dupes the best stories with an LLM, confirms each against its primary source, then drafts publish-ready articles. The pipeline runs in two phases with a human in the loop, and progress is streamed live over Server-Sent Events (SSE).


News Desk — Autonomous Story-Discovery Agent — Day job project by Manuj Rai

Year: 2026
Type: Day job
Role: Solo developer
Status: In production

/ The challenge

What needed solving

Fibre2Fashion's news desk ran a manual morning routine — Google for textile news, visit dozens of competitor and source sites across many countries, judge what's relevant, confirm it against a primary source, then rewrite it. It was hours of repetitive triage every day, with no de-duplication across days and no consistent ranking of what actually mattered.

/ What I built

The solution

I built a two-phase, human-in-the-loop agent in Python + FastAPI. Phase one is a fast discovery stage: it turns the beat plus the team's 45 editorial keywords into ~20 date-aware queries and fans out across SerpAPI/DuckDuckGo, Google News RSS, and an RSS-first crawl of the curated sources. An LLM then judges relevance (direct vs. indirect), collapses cross-outlet duplicates, and ranks by newsworthiness + source authority + recency. In phase two, the editor ticks the stories worth pursuing, and only those get the expensive steps: full-text fetch (httpx with a Playwright fallback), clustering into distinct candidates, and primary-source confirmation. One click then drafts a house-style, publish-ready article from the sources, with every step streamed to a live activity log.
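The ranking step described above (newsworthiness + source authority + recency) could be sketched as a weighted sum. This is only an illustration: the weights, the 24-hour half-life, and the names `Story`, `recency_score`, and `rank` are assumptions, not values taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical weights and half-life; the real values are not given in this write-up.
WEIGHTS = {"newsworthiness": 0.5, "authority": 0.3, "recency": 0.2}
RECENCY_HALF_LIFE_HOURS = 24.0


@dataclass
class Story:
    title: str
    newsworthiness: float  # 0..1, e.g. produced by the LLM relevance pass
    authority: float       # 0..1, e.g. looked up from a source-tier table
    published: datetime    # timezone-aware publication time


def recency_score(published: datetime, now: datetime) -> float:
    """Exponential decay: 1.0 when just published, 0.5 after one half-life."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / RECENCY_HALF_LIFE_HOURS)


def rank_score(story: Story, now: datetime) -> float:
    """Weighted sum of the three signals named in the text."""
    return (
        WEIGHTS["newsworthiness"] * story.newsworthiness
        + WEIGHTS["authority"] * story.authority
        + WEIGHTS["recency"] * recency_score(story.published, now)
    )


def rank(stories: list[Story], now: datetime) -> list[Story]:
    """Return stories ordered from highest to lowest score."""
    return sorted(stories, key=lambda s: rank_score(s, now), reverse=True)
```

With equal newsworthiness and authority, a story published an hour ago outranks one from two days ago, which matches the morning-scan use case where yesterday's news should sink.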

Python 3.11, FastAPI, OpenAI · Gemini · Claude, Playwright, trafilatura, SerpAPI, SQL Server, SQLAlchemy + Alembic, Server-Sent Events, Prometheus, Docker

/ Outcomes

What changed

Turns the desk's manual morning scan into one click — a deduped shortlist of ~20–30 ready-to-draft story candidates
Two-phase pipeline keeps cost low: a scan is ~4 LLM calls / a few thousand tokens on gpt-4o-mini, and only editor-picked stories get the expensive full fetch
One config-driven UI client serves two hosts — the FastAPI SPA and the company's .NET admin portal — so a fix or feature lands once
Production-hardened: SSRF-guarded fetching, robots.txt + rate limits, retries with backoff, per-job token/URL budgets, structured JSON logs + Prometheus metrics, Docker + CI, durable across restarts
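As an illustration of the SSRF guard listed above, a fetcher can refuse any URL whose host resolves to an internal address. This is a minimal sketch using only the standard library; the function names and the injectable `resolver` parameter are assumptions made for the example, not the project's actual code.

```python
import ipaddress
import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit

Resolver = Callable[[str], Iterable[str]]


def resolve_host(host: str) -> list[str]:
    """Default resolver: every address the hostname maps to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def ip_is_internal(ip_text: str) -> bool:
    """True for loopback, private, link-local and other non-public ranges."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def url_is_safe(url: str, resolver: Resolver = resolve_host) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public IPs."""
    try:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.hostname:
            return False
        addresses = list(resolver(parts.hostname))
        # Reject if ANY resolved address is internal, not just the first one.
        return bool(addresses) and not any(ip_is_internal(a) for a in addresses)
    except (OSError, ValueError):
        # DNS failure or malformed URL/address: fail closed.
        return False
```

A check like this is only one layer. A real fetcher would also need to validate every redirect hop and connect to the address it actually checked, otherwise a DNS answer that changes between check and fetch can bypass it.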

/ Under the hood

Technical highlights

Provider-agnostic LLM layer (OpenAI / Gemini / Anthropic Claude) in JSON mode with backoff retries and per-job token accounting — drafting can run on a stronger model than high-volume discovery
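One way to picture the provider-agnostic layer is a small interface that every vendor adapter implements, wrapped by a shared retry and budget helper. The sketch below is an assumption-laden illustration: `LLMProvider`, `JobBudget`, and `call_json` are hypothetical names, and real adapters would call the vendor SDKs inside `complete`.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class LLMProvider(Protocol):
    """Hypothetical interface each vendor adapter would implement."""

    def complete(self, prompt: str) -> tuple[str, int]:
        """Return (raw JSON text, tokens used)."""
        ...


@dataclass
class JobBudget:
    """Per-job token accounting with a hard cap."""
    max_tokens: int
    used_tokens: int = 0

    def charge(self, tokens: int) -> None:
        self.used_tokens += tokens
        if self.used_tokens > self.max_tokens:
            raise RuntimeError("per-job token budget exceeded")


def call_json(
    provider: LLMProvider,
    prompt: str,
    budget: JobBudget,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call the provider, parse JSON, retry transient failures with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            text, tokens = provider.complete(prompt)
            budget.charge(tokens)  # tokens are spent even if parsing fails
            return json.loads(text)
        except (ConnectionError, TimeoutError, json.JSONDecodeError):
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")
```

Because callers depend only on the interface, a mapping such as stage name to provider instance would be enough to route discovery to a cheaper model and drafting to a stronger one.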

RSS-first harvest: prefers a source's feed (configured or auto-discovered) over scraping, with a search-snippet fallback so a bot-blocked page never loses a story

Bot-wall detection → Playwright re-fetch, with layered text extraction (trafilatura → JSON-LD → article container → paragraph sweep) so blocked or messy pages still yield clean copy
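The layered extraction can be modelled as an ordered list of extractors where the first one to return enough text wins. The sketch below is a simplified stand-in: it implements only a JSON-LD layer and a paragraph sweep with regular expressions, omits the trafilatura and article-container layers, and uses a hypothetical `MIN_CHARS` threshold.

```python
import json
import re
from html import unescape
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]
MIN_CHARS = 200  # hypothetical "good enough" threshold

_JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)
_P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def from_json_ld(html: str) -> Optional[str]:
    """Use schema.org articleBody from an embedded JSON-LD block, if present."""
    for raw in _JSONLD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and isinstance(item.get("articleBody"), str):
                return item["articleBody"].strip()
    return None


def from_paragraphs(html: str) -> Optional[str]:
    """Last resort: join the text of every <p> element."""
    paras = [unescape(_TAG_RE.sub("", p)).strip() for p in _P_RE.findall(html)]
    text = "\n".join(p for p in paras if p)
    return text or None


def extract_text(
    html: str, layers: list[Extractor], min_chars: int = MIN_CHARS
) -> Optional[str]:
    """Try each layer in order; accept the first result that is long enough."""
    for layer in layers:
        text = layer(html)
        if text and len(text) >= min_chars:
            return text
    return None
```

A call such as `extract_text(html, [from_json_ld, from_paragraphs])` keeps the ordering explicit, so adding or reordering layers does not touch the fallback logic.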

Cross-scan de-duplication (URL + content-hash + headline overlap) so the same story never resurfaces across days
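The three signals named above (URL, content hash, headline overlap) might be combined roughly as follows. This is a sketch under stated assumptions: the tracking-parameter list, the 0.8 overlap threshold, and the in-memory `SeenStories` class are all hypothetical, and a production version would persist its state in the database so it survives across days.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}   # hypothetical list
TRACKING_PREFIXES = ("utm_",)
HEADLINE_OVERLAP_THRESHOLD = 0.8      # hypothetical threshold


def canonical_url(url: str) -> str:
    """Lowercase host, drop 'www.', tracking params, trailing slash, fragment."""
    parts = urlsplit(url)
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    ]
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(sorted(query)), ""))


def content_hash(text: str) -> str:
    """Hash of the body after case and whitespace normalization."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def headline_tokens(title: str) -> frozenset[str]:
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


class SeenStories:
    """In-memory stand-in for a persistent store of earlier scans."""

    def __init__(self) -> None:
        self.urls: set[str] = set()
        self.hashes: set[str] = set()
        self.headlines: list[frozenset[str]] = []

    def is_duplicate(self, url: str, title: str, text: str) -> bool:
        if canonical_url(url) in self.urls or content_hash(text) in self.hashes:
            return True
        tokens = headline_tokens(title)
        for seen in self.headlines:
            smaller = min(len(tokens), len(seen))
            # Share of the shorter headline's words found in the other one.
            if smaller and len(tokens & seen) / smaller >= HEADLINE_OVERLAP_THRESHOLD:
                return True
        return False

    def remember(self, url: str, title: str, text: str) -> None:
        self.urls.add(canonical_url(url))
        self.hashes.add(content_hash(text))
        self.headlines.append(headline_tokens(title))
```

The overlap measure here divides by the shorter headline, which catches rewrites that append words to the same headline but can over-match very short titles, so the threshold would need tuning on real data.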

Cheap primary-source confirmation with no LLM call — targeted search + authority-tier preference + a Jaccard title-overlap guard
