May 28, 2026 | 4 min read | Updated June 22, 2026

Building an autonomous news desk: a two-phase story-discovery agent

An editorial team's morning routine — search dozens of sources, judge what matters, confirm it, then draft — turned into a two-phase, human-in-the-loop agent streamed live over SSE. Here's the architecture and the decisions that kept it cheap and trustworthy.

Tags: AI Agents, Python, FastAPI, LLMs

On this page

Why two phases
Getting the inputs without getting blocked
Ranking and de-duplication
Confirming a story without paying for it
Provider-agnostic by design
Production hardening is the actual work
Streaming the work, not just the result
What I'd tell my past self
FAQ

Every morning, a news desk does the same thing: open Google, visit a few dozen competitor and source sites across a dozen countries, decide what's actually relevant, confirm each story against a primary source, and rewrite it in house style. It's hours of repetitive triage — and because it's manual, there's no memory across days and no consistent ranking of what mattered.

I built an agent that reproduces that routine for a textile/apparel newsroom (the full case study is written up separately). The hard part wasn't calling an LLM. It was making the thing cheap, trustworthy, and controllable enough that an editor would actually use it.

Why two phases

The naive version is one big pipeline: search everything, fetch every page, summarize, rank, draft. That's expensive and slow — you pay full-text fetch and LLM cost on stories nobody wanted.

So I split it in two, with a human in the middle:

Discovery (cheap). Expand the beat plus the team's ~45 editorial keywords into ~20 date-aware queries, fan out across search, Google News RSS, and an RSS-first crawl of ~52 curated sources, then have an LLM score, de-duplicate, and rank the candidates. A full scan is roughly four LLM calls and a few thousand tokens on a small model.
Pursuit (expensive, opt-in). The editor ticks the stories worth chasing. Only those get the full-text fetch, clustering into distinct candidates, primary-source confirmation, and one-click drafting.

That single boundary — let a human pick before you spend — is what keeps the economics sane. The model triages; the editor decides; the expensive work runs on a shortlist.
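
As a rough sketch of that boundary (not the production code; `Candidate`, `discover`, and `pursue` are names made up for illustration, and the score is assumed to come from the discovery-phase LLM), the point is simply that the expensive step is only ever called for URLs the editor picked:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    url: str
    title: str
    score: float  # relevance score from the discovery LLM (assumed, higher is better)


def discover(candidates: Iterable[Candidate], limit: int = 30) -> list[Candidate]:
    """Cheap phase: rank already-scored candidates and keep a shortlist."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:limit]


def pursue(
    shortlist: list[Candidate],
    picked_urls: set[str],
    expensive_step: Callable[[Candidate], str],
) -> dict[str, str]:
    """Expensive phase: run fetch/confirm/draft only on what the editor ticked."""
    return {c.url: expensive_step(c) for c in shortlist if c.url in picked_urls}
```

Anything the editor does not tick never reaches `expensive_step`, so it never costs a full-text fetch or a drafting call.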

Getting the inputs without getting blocked

Discovery is only as good as what it can read. The harvest is RSS-first: if a source publishes a feed (configured or auto-discovered), prefer it over scraping. When a page is bot-walled, fall back to a search snippet so a blocked URL never silently drops a story.

For the pages that do need fetching, the extractor is layered — trafilatura, then JSON-LD, then the article container, then a paragraph sweep — and a bot-wall detector re-fetches through Playwright when the cheap path returns junk. Messy or hostile pages still yield clean copy more often than not.
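
The layering can be sketched as an ordered fallback chain. This is a simplified illustration using only the standard library: `extract_layered` and `paragraph_sweep` are hypothetical names, the `min_chars` threshold is an assumed way to decide that a layer "returned junk", and the real layers (trafilatura, JSON-LD, article container) would be plugged in as the earlier entries of the list.

```python
import re
from html import unescape
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def paragraph_sweep(html: str) -> Optional[str]:
    """Last-resort layer: collect the text of every <p> element."""
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, flags=re.I | re.S)
    cleaned = [unescape(re.sub(r"<[^>]+>", "", p)).strip() for p in paragraphs]
    text = "\n\n".join(p for p in cleaned if p)
    return text or None


def extract_layered(
    html: str,
    extractors: list[tuple[str, Extractor]],
    min_chars: int = 200,
) -> tuple[Optional[str], Optional[str]]:
    """Try each layer in order; return (layer name, text) for the first usable result."""
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            continue  # a failing layer should not abort the whole chain
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return None, None  # caller can escalate, e.g. re-fetch through a headless browser
```

Returning the layer name alongside the text makes it easy to log which layer actually produced the copy.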

Ranking and de-duplication

An LLM judges relevance as direct vs. indirect to the beat, then collapses the same story told by five different outlets into one item, and ranks what's left by newsworthiness, source authority, and recency. Cross-scan de-duplication (URL + content-hash + headline overlap) means a story that appeared yesterday doesn't resurface today. (Retrieval — picking the right source chunks rather than ranking whole stories — is a related but distinct problem; I dig into it in metadata-filtered RAG.)
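
Here is one way the cross-scan check could look, as a sketch under assumptions: the normalization rules (dropping `www.`, the query string, and the fragment), the 0.8 headline threshold, and the in-memory `SeenStories` class are all illustrative choices, and a real deployment would persist the seen set between days rather than hold it in memory.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme/host; drop 'www.', trailing slash, query, and fragment."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removeprefix("www.")
    return urlunsplit((parts.scheme.lower(), host, parts.path.rstrip("/"), "", ""))


def content_hash(text: str) -> str:
    """Hash the body after lowercasing and collapsing whitespace."""
    collapsed = " ".join(text.lower().split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def headline_tokens(title: str) -> frozenset[str]:
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


class SeenStories:
    """Remembers stories from earlier scans (in memory only in this sketch)."""

    def __init__(self, headline_threshold: float = 0.8) -> None:
        self.urls: set[str] = set()
        self.hashes: set[str] = set()
        self.headlines: list[frozenset[str]] = []
        self.headline_threshold = headline_threshold

    def is_duplicate(self, url: str, title: str, body: str) -> bool:
        if normalize_url(url) in self.urls or content_hash(body) in self.hashes:
            return True
        tokens = headline_tokens(title)
        for seen in self.headlines:
            union = tokens | seen
            if union and len(tokens & seen) / len(union) >= self.headline_threshold:
                return True
        return False

    def add(self, url: str, title: str, body: str) -> None:
        self.urls.add(normalize_url(url))
        self.hashes.add(content_hash(body))
        self.headlines.append(headline_tokens(title))
```

Dropping the whole query string is the aggressive option; sites that put the article ID in a query parameter would need a gentler rule.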

Confirming a story without paying for it

A draft is only safe if the story is real. I confirm each one against a primary source — but without an LLM call. It's a targeted search, a preference for higher-authority source tiers, and a Jaccard title-overlap guard to reject loose matches. Cheap, deterministic, and good enough to catch the obvious fabrications and rumor-mill reposts.
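
A minimal sketch of that guard, assuming search hits arrive already labelled with an authority tier (the `SearchHit` type, the tier numbering where 1 is most authoritative, and the 0.5 overlap threshold are illustrative assumptions, not the values used in production):

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchHit:
    title: str
    url: str
    tier: int  # hypothetical authority tier: 1 = primary/most authoritative


def title_tokens(title: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two titles: |A ∩ B| / |A ∪ B| over word tokens."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def confirm(story_title: str, hits: list[SearchHit], min_overlap: float = 0.5) -> Optional[SearchHit]:
    """Return the best confirming hit, or None if nothing matches closely enough."""
    matches = [h for h in hits if jaccard(story_title, h.title) >= min_overlap]
    if not matches:
        return None
    # Prefer higher authority first, then the closer title match.
    return min(matches, key=lambda h: (h.tier, -jaccard(story_title, h.title)))
```

A story with no hit above the threshold stays unconfirmed, which is the signal to hold it back from drafting.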

Provider-agnostic by design

The LLM layer is provider-agnostic — OpenAI, Gemini, or Claude — all in JSON mode with backoff retries and per-job token accounting. That matters for cost control: high-volume discovery runs on a small, fast model, while drafting (where quality shows) can use a stronger one. Swapping providers is a config change, not a rewrite.
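
To illustrate the idea, here is a small sketch of task-based model routing with per-job token accounting. Everything in it is an assumption for illustration: the `LLMClient` interface, the `ModelRouter` and `BudgetExceeded` names, and the model names in the usage example are placeholders, and the backoff retries are left out for brevity.

```python
from dataclasses import dataclass
from typing import Protocol


class BudgetExceeded(RuntimeError):
    """Raised when a job has used up its token budget."""


class LLMClient(Protocol):
    # Hypothetical provider adapter: returns (parsed JSON, tokens consumed).
    def complete_json(self, model: str, prompt: str) -> tuple[dict, int]: ...


@dataclass
class ModelRouter:
    client: LLMClient
    models: dict[str, str]  # task name -> model name, loaded from config
    token_budget: int       # per-job cap
    tokens_used: int = 0

    def run(self, task: str, prompt: str) -> dict:
        if self.tokens_used >= self.token_budget:
            raise BudgetExceeded(f"job used {self.tokens_used} of {self.token_budget} tokens")
        result, tokens = self.client.complete_json(self.models[task], prompt)
        self.tokens_used += tokens
        return result
```

With a mapping such as `{"triage": "small-fast-model", "draft": "stronger-model"}`, changing provider or model means editing the mapping and swapping the adapter behind `LLMClient`; the calling code stays the same.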

Production hardening is the actual work

The agent is the easy 60%. The last 40% is what makes it deployable:

SSRF-guarded fetching — it follows URLs the internet hands it, so the fetcher refuses private/loopback ranges.
robots.txt + rate limits on every outbound request.
Per-job token and URL budgets, so a pathological run can't burn the month's spend.
Retries with backoff, structured JSON logs, and Prometheus metrics.
Durable across restarts — a job that was mid-flight resumes, not restarts.
Docker + CI so a fix lands the same way every time.
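
For the SSRF guard in particular, a minimal standard-library sketch might look like this. The function names are made up for illustration, and the injectable `resolve` parameter is an assumption added so the check can be exercised without real DNS.

```python
import ipaddress
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    """True only for addresses that are not private, loopback, link-local, etc."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_multicast or addr.is_reserved or addr.is_unspecified
    )


def _resolve(host: str) -> list[str]:
    # getaddrinfo returns tuples whose last element is (address, port, ...).
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_safe_url(url: str, resolve: Optional[Callable[[str], list[str]]] = None) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = (resolve or _resolve)(parts.hostname)
    except OSError:  # includes DNS resolution failures
        return False
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

A check like this has to run again on every redirect hop, and the fetcher should connect to the address that was validated; otherwise a hostname can resolve to a public IP for the check and a private one for the request.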

Streaming the work, not just the result

Every step — each query, each source, each judgement — streams to the editor live over Server-Sent Events. That turned out to be more than polish: a visible activity log is what makes an autonomous system trustable. When you can watch it reason and reject, you stop second-guessing it. One config-driven client even serves two hosts — a FastAPI SPA and a .NET admin portal — so a fix ships once.
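
On the wire, an SSE stream is just text frames separated by blank lines. A small sketch of the framing, with hypothetical event names and payloads (the real event schema isn't shown in this post):

```python
import json
from typing import Iterable, Iterator


def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Event: an event name, a JSON data line, a blank line."""
    payload = json.dumps(data, ensure_ascii=False)  # json.dumps output has no raw newlines
    return f"event: {event}\ndata: {payload}\n\n"


def stream_steps(steps: Iterable[tuple[str, dict]]) -> Iterator[str]:
    """Yield one frame per agent step, then a final 'done' frame."""
    for event, data in steps:
        yield sse_frame(event, data)
    yield sse_frame("done", {})
```

In FastAPI, a generator like `stream_steps(...)` can be returned through `StreamingResponse` with `media_type="text/event-stream"`, and the browser side can consume it with the standard `EventSource` API.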

What I'd tell my past self

Put the human at the most expensive boundary. Don't ask for approval on trivia; ask right before the costly work.
Determinism where you can afford it. Confirmation didn't need an LLM, and not using one made it cheaper and easier to reason about.
Observability is a feature, not ops overhead. The live log is the difference between a demo and a tool people rely on.

The result: a manual morning scan became one click and a deduped shortlist of ~20–30 ready-to-draft candidates — at a fraction of the cost a fetch-everything pipeline would have run up.

Frequently asked questions

What is a two-phase AI agent? It splits the work into a cheap discovery phase, where an LLM scores and ranks candidates, and an expensive pursuit phase that only runs on the items a human picks. You never pay full-text fetch and generation cost on stories nobody wanted.
How do you keep an LLM agent's costs down? Put the human at the most expensive boundary, run high-volume triage on a small fast model and reserve a stronger model for drafting, cap per-job token and URL budgets, and use deterministic checks instead of LLM calls wherever they're good enough.
Why stream agent steps over Server-Sent Events? A live activity log is what makes an autonomous system trustable: when an editor can watch it search, judge, and reject in real time, they stop second-guessing it. SSE is a simple one-way, server-to-client stream over plain HTTP, which is all a live log needs.
How do you stop an autonomous agent from drafting fake news? Each story is confirmed against a primary source before drafting, using a targeted search with source-authority preference and a title-overlap guard. That catches obvious fabrications and rumor reposts without spending an extra LLM call.
