Metadata-filtered RAG: two-stage retrieval that stops returning irrelevant chunks
Metadata-filtered RAG fixes single-shot retrieval that returns junk on multi-topic corpora. How I built a metadata pre-filter, vector search, and LLM rerank pipeline.
Internal teams kept asking the same questions against the same long handbooks and policy docs — leave policy, expense rules, onboarding steps — and kept getting bad answers. Plain keyword search missed context. So I reached for the obvious upgrade: single-shot RAG over the whole corpus. Embed every chunk, embed the question, pull the top-k by cosine similarity, stuff them into the prompt. It demoed well and then fell over on real questions. The fix that actually held up was metadata-filtered RAG — a two-stage retriever that narrows candidates by metadata before semantic search ever runs.
The failure mode was specific and frustrating. When a question touched only one section of one document, the retriever would still happily return chunks from five unrelated documents that happened to share vocabulary. Ask about parental leave and you'd get a confident, well-written answer half-built from the sabbatical policy. The model wasn't hallucinating from nothing — it was faithfully summarizing irrelevant chunks I handed it. That's the dirty secret of naive RAG on a multi-topic corpus: top-k similarity has no idea what document or section a question belongs to.
This was my first production AI shipment (the AI Chatbot case study), so I want to be honest about the level: I'm actively learning agentic AI, not claiming a decade of ML. But the fix here isn't exotic ML. It's plumbing and judgment, and it's the kind of thing nobody tells you until you've shipped a retriever that embarrasses you.
Why single-shot similarity returns junk
Embedding similarity measures semantic closeness, not topical relevance. Those aren't the same thing. "Notice period for resignation" and "notice period for terminating a vendor contract" are close in vector space — same words, same shape — and a cosine ranker can't tell that one lives in the HR handbook and the other in the procurement policy.
On a single-topic corpus this rarely bites. On a corpus that spans HR, IT, finance, and legal, it bites constantly, because the nearest neighbors of a query are scattered across documents that have nothing to do with the question. You can crank up k to compensate, but that just feeds more off-topic chunks to the LLM and makes the answer worse, not better.
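A toy illustration of the vocabulary problem, assuming nothing about the real embedding model: bag-of-words counts stand in for embeddings, and the three chunk texts are made up for the example. Real embeddings are far richer, but the shape of the failure is similar: the off-topic procurement chunk scores well because it shares most of the query's words.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Bag-of-words counts: a crude stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical chunks from three different documents.
chunks = {
    "hr-handbook": "the notice period for resignation is four weeks",
    "procurement-policy": "the notice period for terminating a vendor contract is four weeks",
    "it-security": "rotate passwords every ninety days",
}

query = bow_vector("what is the notice period for resignation")
scores = {name: cosine(query, bow_vector(text)) for name, text in chunks.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

With k=2, the procurement chunk lands in the prompt next to the HR chunk, even though it answers a different question.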
The instinct is to throw a bigger model or a fancier embedding at it. The real fix is upstream: narrow the candidate set before you ever compute similarity.
Attach metadata at ingest — that's where the work is
The whole pipeline depends on one decision made at ingest time: every chunk carries metadata. When I split a document, I don't just store the text and its vector — I store where it came from.
def chunk_document(doc: Document) -> list[Chunk]:
    chunks = []
    for section in split_into_sections(doc):  # respect headings
        for piece in window(section.text, size=800, overlap=120):
            chunks.append(Chunk(
                text=piece,
                embedding=None,  # filled in batch later
                metadata={
                    "source": doc.filename,      # "hr-handbook-2026.pdf"
                    "section": section.heading,  # "Parental Leave"
                    "tags": doc.tags,            # ["hr", "benefits"]
                    "doc_type": doc.type,        # "policy"
                },
            ))
    return chunks
Two things matter here. First, chunk on structure, not blindly on character count. Splitting along headings keeps each section coherent, so "Parental Leave" stays one unit instead of bleeding into the next policy. Second, the metadata isn't decorative. It's the filter key for stage one. Garbage metadata at ingest means a useless filter later, so this is the step worth being slow and careful about.
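The two helpers that chunk_document relies on are not shown in the post. Here is a minimal sketch of what they might look like, under loud assumptions: this split_into_sections takes raw text with Markdown-style "#" headings (the real one receives a Document and may parse PDFs), and window counts characters, not tokens.

```python
import re
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Section:
    heading: str
    text: str


def split_into_sections(raw: str) -> list[Section]:
    """Split text on '#' headings (an assumed input format)."""
    sections: list[Section] = []
    heading, body = "Untitled", []

    def flush() -> None:
        # Store the section collected so far, skipping empty bodies.
        text = "\n".join(body).strip()
        if text:
            sections.append(Section(heading, text))

    for line in raw.splitlines():
        match = re.match(r"^#+\s+(.+)", line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections


def window(text: str, size: int = 800, overlap: int = 120) -> Iterator[str]:
    """Yield character windows of `size`; neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    start = 0
    while start < len(text):
        yield text[start:start + size]
        if start + size >= len(text):
            break  # the last window already reached the end of the text
        start += size - overlap


print(list(window("abcdefghij", size=4, overlap=1)))
```

Because window only ever sees one section's text, a chunk cannot straddle two headings, which is what keeps the "section" metadata trustworthy.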
Two-stage retrieval: filter, then rank
With metadata in place, retrieval becomes two stages instead of one, with an LLM rerank on top: metadata filter → vector similarity → LLM rerank.
def retrieve(query: str, filters: dict, k: int = 6) -> list[Chunk]:
    # Stage 1: cheap metadata pre-filter. Off-topic chunks never
    # reach the vector search at all.
    candidates = store.where(filters)  # e.g. {"tags": "hr"}

    # Stage 2: semantic search *within* the filtered set.
    q_vec = embed(query)
    ranked = top_k(candidates, q_vec, k=k * 3)  # over-fetch on purpose

    # Stage 3: LLM rerank the survivors for true relevance.
    return llm_rerank(query, ranked)[:k]
The ordering is the entire point. The metadata pre-filter runs first, so an off-topic chunk is gone before vector search ever sees it — it can't sneak into the top-k no matter how similar its wording is. Vector similarity then does what it's actually good at: ordering chunks that are already on-topic. Finally a cheap LLM rerank pass reads the survivors and reorders them by genuine relevance to the question, catching the cases where two chunks are equally "close" but only one actually answers.
Why rerank at all, when the filter already narrowed things? Because the filter is coarse — it knows topic, not intent — and cosine ordering is approximate. The rerank is the precision step: small, bounded input, so it's affordable. I deliberately over-fetch (k * 3) into the reranker and let it cut down, rather than trusting raw similarity rank.
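To make the ordering concrete, here is a self-contained, in-memory sketch of the same pipeline. It is not the production code: the store is a plain list, embeddings are hand-written two-dimensional vectors, and the rerank step is a pluggable callable (an identity function below) where the real system calls an LLM. The matches helper and its list-membership rule for tags are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


def matches(metadata: dict, filters: dict) -> bool:
    """Every filter key must match; list-valued fields match on membership."""
    for key, wanted in filters.items():
        value = metadata.get(key)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: list[float],
    filters: dict,
    chunks: list[Chunk],
    rerank: Callable[[list[Chunk]], list[Chunk]],
    k: int = 6,
) -> list[Chunk]:
    # Stage 1: metadata pre-filter, before any similarity is computed.
    candidates = [c for c in chunks if matches(c.metadata, filters)]
    # Stage 2: rank only the survivors, over-fetching 3x for the reranker.
    candidates.sort(key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    shortlist = candidates[: k * 3]
    # Stage 3: the reranker reorders the shortlist; keep the top k.
    return rerank(shortlist)[:k]


corpus = [
    Chunk("HR: notice period for resignation", [1.0, 0.0], {"tags": ["hr"]}),
    Chunk("Procurement: notice period for vendors", [1.0, 0.0], {"tags": ["procurement"]}),
    Chunk("HR: parental leave", [0.6, 0.8], {"tags": ["hr", "benefits"]}),
]
hits = retrieve([1.0, 0.0], {"tags": "hr"}, corpus, rerank=lambda cs: cs, k=2)
print([c.text for c in hits])
```

The procurement chunk has exactly the same vector as the query, so a similarity-only ranker would put it at the top. Here it is dropped in stage one and never gets scored.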
Where does the filter come from? Sometimes the UI scopes it ("ask within HR"). More often the agent infers it — which is the other half of the system.
An agent loop, not a single RAG call
Even good retrieval is the wrong tool for some questions. "Summarize the parental leave policy" wants retrieval. "List every deadline mentioned in section 4" wants deterministic extraction, not fuzzy semantic search — and single-shot RAG is genuinely bad at it, because it'll grab three chunks and miss the fourth.
So retrieval is wrapped in an agent loop (LangChain tools + structured tool-calling) that decides how to answer before it answers:
- retrieve(query, filters): the two-stage pipeline above, for open-ended questions.
- summarize(section): pull a whole section and condense, when the user names one.
- extract(schema, section): structured extraction into a typed shape, for "extract X from section Y."
tools = [retrieve_tool, summarize_tool, extract_tool]
llm_with_tools = llm.bind_tools(tools)

def answer(question: str) -> str:
    ai = llm_with_tools.invoke(question)  # model picks the tool
    if not ai.tool_calls:  # model replied directly without calling a tool
        return ai.content
    call = ai.tool_calls[0]  # name + parsed args
    if call["name"] == "extract":
        return run_structured_extraction(call["args"])  # deterministic, schema-bound
    return dispatch(call)  # retrieve / summarize
The win is letting the model choose the retrieval strategy instead of forcing every question through one pipeline. Structured questions route to a structured tool whose output is schema-constrained, so it's far more reliable than hoping the right chunks land in a prompt. This is the same "deterministic where you can afford it" instinct I leaned on hard in the autonomous news desk and its agent build — don't spend an LLM's fuzziness on a job a typed tool does exactly.
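The post does not show run_structured_extraction, so the following is only an illustration of what "schema-bound" buys you, using a swapped-in technique: a plain regex over ISO-formatted dates instead of an LLM with a typed output schema. The Deadline shape and the assumption that deadlines appear as YYYY-MM-DD are both hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deadline:
    due: date
    sentence: str  # the sentence the date was found in, for context


ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_deadlines(section_text: str) -> list[Deadline]:
    """Scan every sentence of a section, so no date is skipped by ranking."""
    deadlines: list[Deadline] = []
    for sentence in re.split(r"(?<=[.!?])\s+", section_text.strip()):
        for year, month, day in ISO_DATE.findall(sentence):
            try:
                due = date(int(year), int(month), int(day))
            except ValueError:
                continue  # looks like a date but is not a valid one
            deadlines.append(Deadline(due, sentence))
    return deadlines


section = (
    "Submit receipts by 2026-01-31. "
    "Managers approve by 2026-02-15. "
    "Late claims are rejected."
)
for item in extract_deadlines(section):
    print(item.due.isoformat(), "|", item.sentence)
```

The contrast with retrieval is that this walks the whole section and returns a typed list, so "it grabbed three chunks and missed the fourth" cannot happen.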
I'll be honest about the trade-off: agent loops add latency and a class of "the model picked the wrong tool" bugs you don't have with a fixed pipeline. For a low-traffic internal tool that's a fine price. For a high-QPS path I'd think harder, or constrain the routing more tightly.
Make it observable: FastAPI, streaming, tracing
The service is a FastAPI app deployed on Render with streaming responses and per-request tracing. Streaming isn't just a nicer UX — like the live activity log on the news desk, watching tokens arrive (and seeing which tool fired and which chunks came back) is how you debug a probabilistic system in production. When an answer is wrong, the trace tells you whether the filter was too narrow, the vector search missed, or the rerank mis-ordered — three very different fixes.
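One way to make those three failure cases distinguishable is to record which chunk IDs survive each stage. This is a framework-independent sketch, not the service's actual tracing code: RetrievalTrace, its field names, and diagnose are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalTrace:
    request_id: str
    filters: dict
    after_filter: list[str] = field(default_factory=list)  # chunk IDs
    after_vector: list[str] = field(default_factory=list)
    after_rerank: list[str] = field(default_factory=list)


def diagnose(trace: RetrievalTrace, expected_id: str) -> str:
    """Name the first stage at which the chunk we wanted disappeared."""
    if expected_id not in trace.after_filter:
        return "filter too narrow"
    if expected_id not in trace.after_vector:
        return "vector search missed"
    if expected_id not in trace.after_rerank:
        return "rerank mis-ordered"
    return "retrieved"


trace = RetrievalTrace(
    request_id="req-1",
    filters={"tags": "hr"},
    after_filter=["leave-1", "leave-2", "sabbatical-1"],
    after_vector=["leave-1", "sabbatical-1"],
    after_rerank=["leave-1"],
)
print(diagnose(trace, "leave-2"))
```

Given a wrong answer and the ID of the chunk that should have been used, diagnose points at the stage to fix.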
Embeddings and chat completion run on OpenAI, but the provider is swappable per environment, so a model or pricing change is config, not a rewrite — the same provider-agnostic discipline the news desk needed.
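A minimal sketch of what "config, not a rewrite" can look like, assuming the choice is read from environment variables. The variable names (LLM_PROVIDER, CHAT_MODEL, EMBEDDING_MODEL) and the placeholder model names are invented for the example, not taken from the actual service.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    provider: str
    chat_model: str
    embedding_model: str


def load_provider_config(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Read provider settings from the environment (hypothetical variable names)."""
    env = os.environ if env is None else env
    return ProviderConfig(
        provider=env.get("LLM_PROVIDER", "openai"),
        chat_model=env.get("CHAT_MODEL", "chat-model-placeholder"),
        embedding_model=env.get("EMBEDDING_MODEL", "embedding-model-placeholder"),
    )


# Passing a mapping makes the per-environment override easy to see and test.
staging = load_provider_config({"LLM_PROVIDER": "other-provider", "CHAT_MODEL": "m1"})
print(staging)
```

One caveat worth keeping in mind: swapping the embedding model usually means re-embedding the corpus, since vectors from different models are not comparable.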
What I'd tell my past self
- Do the work at ingest. Good metadata on every chunk is what makes the cheap pre-filter possible. Skimp here and no clever retrieval saves you.
- Filter before you rank. A metadata pre-filter that runs before vector search keeps off-topic chunks out of the ranker entirely — that single ordering fixed most of the "confident but wrong" answers.
- Don't force every question through RAG. Let an agent route structured questions to a deterministic extraction tool. Semantic search is for fuzzy intent, not for "list all the dates."
- Stream and trace from day one. You cannot debug a retriever you can't watch.
The outcome: questions that touch one section of one document now get answers built from that section, because the off-topic chunks never make it past stage one. Same corpus, same models — a different order of operations.
Frequently asked questions
- What is metadata-filtered RAG?
- Metadata-filtered RAG attaches metadata (source, section, tags, document type) to each chunk at ingest, then uses it to pre-filter the candidate set before vector search runs. Off-topic chunks are removed by the metadata filter first, so semantic similarity only ranks chunks that are already on the right topic.
- Why does single-shot RAG return irrelevant chunks on multi-topic corpora?
- Embedding similarity measures semantic closeness, not topical relevance. On a corpus spanning many topics, a query's nearest neighbors are scattered across unrelated documents that share vocabulary, so top-k similarity pulls in off-topic chunks. A metadata pre-filter fixes this by narrowing candidates before similarity is computed.
- What is two-stage retrieval in a RAG pipeline?
- Two-stage retrieval runs a cheap metadata pre-filter first, then vector similarity search within the filtered set, then an LLM rerank pass over the survivors. The filter ensures only on-topic chunks reach the ranker, vector search orders them, and the rerank step delivers final precision.
- When should a RAG system use a tool-using agent instead of plain retrieval?
- Use an agent loop when questions vary in shape. Open-ended questions suit semantic retrieval, but structured questions like 'extract every deadline from section 4' suit a deterministic, schema-bound extraction tool. An agent with tool-calling picks the right strategy per question instead of forcing everything through one RAG pipeline.