RAG in Production: Chunking, Hybrid Search, and Agentic Retrieval
Chunking strategies, hybrid search, agentic retrieval loops, GraphRAG, and an honest answer to whether long-context models have made RAG obsolete — everything the memory article deferred.
The Memory & State Management article introduced retrieval as Tier 2 external memory, showed how to expose it as a tool call, and recommended a short list of vector databases. That was deliberate — the goal there was architecture. This piece picks up exactly where it left off and covers the question that piece couldn't answer in one section: how do you make retrieval actually work?
Important
This article builds directly on Memory & State Management in LLM Agents. If you haven't read that piece, the "retrieval as a tool call" framing and the vector store comparison live there — they won't be repeated here.
The gap between a retrieval system that "works in a demo" and one that works reliably in production is almost always in three places: how documents are chunked before indexing, how queries are executed at retrieval time, and whether the agent loops back to check retrieval quality. We'll cover all three — and then address the increasingly loud claim that long-context models have made RAG unnecessary.
The Chunking Problem
Chunking is the single largest determinant of RAG quality, and it gets almost no attention in introductory material. You can have the best embedding model, the most powerful re-ranker, and a perfectly designed tool interface — and still get poor retrieval because the underlying chunks are wrong.
The root issue: a chunk is the unit of both indexing and retrieval. Once a document is split into chunks and embedded, you can only retrieve at that granularity. A chunk that spans two unrelated topics produces an embedding that accurately represents neither. A chunk that cuts a sentence in half misses the context that makes it relevant.
Fixed-Size Chunking
The default in most tutorials: split every N characters (or tokens), with an optional overlap of M characters.
```python
def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + size])
        start += size - overlap
    return chunks
```
When it works: Homogeneous prose — novels, transcripts, uniform logs — where any arbitrary window is roughly as coherent as any other.
When it fails: Semi-structured documents (API docs, contracts, code files, research papers). A 512-character cut in the middle of a Python function produces two chunks that each mean nothing in isolation. Overlap helps at the edges but cannot fix a chunk that spans two conceptually distinct sections.
Semantic Chunking
Rather than splitting on character count, split on content boundaries — paragraph breaks, section headings, punctuation patterns, or similarity drops between successive sentences.
The most practical implementation: embed sentences sequentially, compute cosine similarity between adjacent sentences, and insert a split wherever similarity drops below a threshold.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i - 1], embeddings[i]) / (
            np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
        )
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    if current:
        chunks.append(" ".join(current))
    return chunks
```
Trade-off: Requires an embedding pass over the full document at index time. For large corpora this adds meaningful cost and latency. The payoff is chunks that are coherent by construction — each one covers one topic.
Late Chunking (Jina AI, 2024)
Late chunking inverts the usual order. Instead of splitting first and embedding each chunk independently, you embed the entire document as one sequence, then pool the resulting token-level embeddings into chunk-sized segments.
The key advantage: each token's embedding reflects the full document context, not just the tokens in its local window. A sentence that refers to a concept established three paragraphs earlier carries that context in its embedding, even after the document is split.
This requires a model that produces token-level embeddings (such as jina-embeddings-v3). The late-chunking library handles the pooling:
```python
# pip install late-chunking jina-embeddings
from late_chunking import LateChunker

chunker = LateChunker(model="jinaai/jina-embeddings-v3")
chunks = chunker.chunk(document_text, chunk_size=256)
# chunks[i].embedding is context-aware across the full document
```
Trade-off: Encoding a full document as a single sequence requires a model with a long enough context window for that document. Impractical for very long documents without a sliding window approach. Best suited for medium-length, densely cross-referential documents — technical specifications, legal contracts, research papers.
Contextual Retrieval (Anthropic, 2024)
A different approach to the same problem: keep your existing chunking strategy, but before embedding each chunk, prepend a short LLM-generated summary that situates the chunk within the larger document.
```python
CONTEXT_PROMPT = """
Document title: {title}
Document summary: {summary}

The following is a chunk from this document. Add a 1-2 sentence context
that situates this chunk within the broader document, then output the chunk verbatim.

Chunk:
{chunk}
"""

def add_context(chunk: str, title: str, summary: str, llm) -> str:
    prompt = CONTEXT_PROMPT.format(title=title, summary=summary, chunk=chunk)
    context = llm.complete(prompt)
    return f"{context}\n\n{chunk}"
```
The contextualized chunk — context prepended, original chunk preserved — is what gets embedded. Because the embedding now captures "this is from the contracts section of the employee handbook, specifically about termination clauses," retrieval precision improves significantly for queries that touch specific sub-topics.
Trade-off: Adds an LLM call per chunk at index time. For a 10,000-chunk corpus, that can cost $5–15 depending on the model. Index it once, amortize the cost over every query. Anthropic's internal benchmarks showed a 49% reduction in retrieval failures versus standard chunking.
Adaptive Chunking
Recent work (arXiv:2603.25333, LREC 2026) proposes automatically selecting the best chunking strategy per document type using a lightweight classifier trained on document structure signals. In practice, most teams don't need this yet — the gains over a well-tuned semantic chunker are marginal.
Practical recommendation for a team starting out: Use semantic chunking for prose documents and fixed-size chunking with generous overlap (30–50%) for code and structured logs. Add contextual retrieval if you have budget and precision matters — it's the highest-leverage single improvement before you touch anything else in the pipeline.
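That recommendation can be wired up as a small dispatcher at index time. The sketch below is illustrative only: the `doc_type` labels, the overlap values, and the helper names are assumptions, not a fixed API.

```python
def fixed_chunks(text: str, size: int, overlap: int) -> list[str]:
    # Same sliding-window scheme as the fixed-size chunker shown earlier.
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start : start + size])
        start += step
    return chunks

def chunk_document(text: str, doc_type: str, semantic_chunker=None) -> list[str]:
    """Route a document to a chunking strategy by type.

    doc_type labels ('code', 'log', 'prose') are illustrative.
    semantic_chunker is an optional callable wrapping a semantic
    splitter, once an embedding model is wired in.
    """
    if doc_type in {"code", "log"}:
        # Fixed-size with ~37% overlap, inside the 30-50% band recommended above.
        return fixed_chunks(text, size=512, overlap=192)
    if semantic_chunker is not None:
        return semantic_chunker(text)
    # Prose fallback when no embedding model is available.
    return fixed_chunks(text, size=512, overlap=64)
```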
Hybrid Search: Why Vectors Alone Aren't Enough
Dense vector search — embed a query, find the nearest neighbors — is powerful for semantic similarity. But "semantic similarity" is not always what you want.
Consider these queries against a codebase:
- "What does the `process_payment` function do?" — the answer is in the function with that exact name. Dense search will find it, but so will an exact string match, faster and with zero ambiguity.
- "Where is account ID 4829-B handled?" — `4829-B` is a specific identifier. No synonym exists. Semantic embedding is nearly useless here; keyword search gets it in one lookup.
Dense embeddings underperform on: exact identifiers, version numbers, proper nouns, specialized terminology, and any case where the query is already precise. This is not a failure of embedding models — it is a fundamental property of the representation.
BM25 (Best Match 25) is a lexical ranking function that scores documents by term frequency and inverse document frequency. It has no semantic understanding, but it is extremely accurate for exact and near-exact matches, and it runs in milliseconds on an inverted index.
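For intuition about what the ranking function computes, here is a minimal BM25 sketch over pre-tokenized documents. It scans every document rather than using an inverted index, and `k1` and `b` use common defaults; a production system should use a real engine (Elasticsearch, ParadeDB) instead.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against a tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each term.
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for q in query:
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            # Term frequency saturates via k1; b normalizes for doc length.
            s += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores
```

Note how a query like `["4829-b"]` scores zero on every document that lacks the literal token — exactly the precision that dense retrieval can't guarantee.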
Recent work on financial query benchmarks (arXiv:2604.01733) found that:
- BM25 outperforms dense retrieval on structured and numeric data
- Hybrid search achieves Recall@5 of 0.816 and MRR@3 of 0.605 on 23,088 financial queries — substantially above either method alone
The right architecture is hybrid: run both in parallel and merge the results.
Reciprocal Rank Fusion (RRF)
RRF fuses two ranked lists without needing to normalize scores across different systems (which is notoriously difficult). Each document receives a score based on its position in each list:
RRF(d) = Σ 1 / (k + rank_r(d)) for each ranked list r in R
where k is a smoothing constant (typically 60) and R is the set of ranked lists.
```python
def reciprocal_rank_fusion(
    results_a: list[str],
    results_b: list[str],
    k: int = 60,
) -> list[str]:
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(results_a, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    for rank, doc_id in enumerate(results_b, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```
Weaviate and Qdrant have hybrid search with RRF built in. For teams on PostgreSQL, ParadeDB adds BM25 indexing and hybrid search on top of pgvector. If you are starting a new project today, enable hybrid from day one — the marginal complexity is low and the recall improvement on real-world corpora is consistent.
Query Expansion
Two common techniques to improve query coverage:
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the query with an LLM, embed that answer, and retrieve against it. Useful when the query is short and ambiguous — the hypothetical answer occupies the embedding space closer to real answers.
Multi-query: Rewrite the query in 3–5 different phrasings and take the union of retrieved results. Catches synonyms and alternative framings.
Warning
Query expansion helps for ambiguous or open-ended queries. For precise queries — exact identifiers, specific function names, known document titles — it can hurt by introducing noise. Benchmark both modes on your actual query distribution before enabling either by default.
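As a concrete sketch of the multi-query variant: generate rephrasings, retrieve for each, and union the results in first-seen order. Here `llm` and `retrieve` are stand-ins for your model client and retriever, not a specific library API.

```python
def multi_query_retrieve(query, llm, retrieve, n_variants=3, top_k=5):
    """llm(prompt) -> str and retrieve(query, k) -> list[doc_id] are stand-ins."""
    prompt = (
        f"Rewrite this search query in {n_variants} different phrasings, "
        f"one per line, no numbering:\n{query}"
    )
    variants = [query] + [v.strip() for v in llm(prompt).splitlines() if v.strip()]
    seen, merged = set(), []
    for q in variants:
        for doc_id in retrieve(q, top_k):
            if doc_id not in seen:  # union with first-seen ordering
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

Running the original query first means exact-match results keep their head-of-list position even when the variants add noise.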
Agentic RAG: Smarter Retrieval Loops
Standard RAG is a single shot: query arrives, retrieval fires once, results are injected into context, model generates. It works for simple lookups. For complex reasoning tasks — multi-step research, multi-document synthesis, questions that require iterating on what you found — a single retrieval pass is often insufficient.
Agentic RAG patterns replace the "retrieve once" model with an adaptive loop where the agent decides when to retrieve, what to retrieve for, and whether the results are good enough.
FLARE — Forward-Looking Active Retrieval
FLARE (Jiang et al., EMNLP 2023) takes a speculative approach: the agent begins generating an answer, and when it produces a token with low confidence (below a threshold), it pauses, forms a retrieval query based on what it was about to say, retrieves, and continues informed by the results.
The key insight: the agent's own generation is a signal about what it doesn't know. When it's about to confabulate, confidence drops — and that's exactly when retrieval should fire.
Best for: Long-form generation tasks where the agent needs to weave together multiple facts — research summaries, report generation, detailed explanations.
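A minimal sketch of the FLARE control loop, assuming a `generate` callable that returns a draft segment plus its minimum token probability (a real implementation would read this from the API's logprobs) and a `retrieve` tool returning evidence text. Both names and the threshold are illustrative.

```python
def flare_generate(question, generate, retrieve, threshold=0.8, max_rounds=8):
    """generate(prompt) -> (segment, min_token_prob); retrieve(query) -> str."""
    answer, evidence = "", ""
    for _ in range(max_rounds):
        prompt = f"{evidence}\nQ: {question}\nA so far: {answer}"
        segment, confidence = generate(prompt)
        if confidence < threshold:
            # Low-confidence draft: use it as the retrieval query, then
            # regenerate that segment with the evidence in context.
            evidence += "\n" + retrieve(segment)
            segment, confidence = generate(
                f"{evidence}\nQ: {question}\nA so far: {answer}"
            )
        if not segment:
            break
        answer += segment
    return answer
```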
Self-RAG — Retrieve Only When Needed, Then Critique
Self-RAG (Asai et al., ICLR 2024 oral) trains a model with special reflection tokens that appear in the generation stream:
- `[Retrieve]` — model decides whether retrieval is needed for the next segment
- `[IsREL]` — is the retrieved passage relevant to the query?
- `[IsSUP]` — does the passage support the claim about to be made?
- `[IsUSE]` — is the final output useful given the retrieved context?
At inference time, these tokens enable segment-level beam search: the model generates multiple candidate segments, scores them against the critique tokens, and selects the best-supported path.
Evaluated on PubHealth, ARC-Challenge, TriviaQA, ASQA, and FactScore — Self-RAG outperforms both vanilla RAG and chain-of-thought prompting on factuality metrics across all benchmarks.
Note
Self-RAG requires a model trained with the reflection token objective — you can't prompt-engineer it into an existing model. Pre-trained 7B and 13B Llama2-based Self-RAG models are available on HuggingFace. For teams that need the capability without retraining, CRAG (below) achieves a meaningful portion of the benefit through prompting alone.
CRAG — Corrective RAG
CRAG (Yan et al., 2024) adds an evaluator step after retrieval: a lightweight model assesses whether the retrieved documents are actually relevant to the query. Based on the score:
- High relevance: proceed with the retrieved chunks
- Low relevance: discard and fall back to web search
- Ambiguous: use both, with explicit source attribution in the context
The fallback to live web search is the defining feature — it handles the case where the internal knowledge base simply doesn't contain what's needed, rather than confidently generating from irrelevant chunks.
Best for: Any agent where the corpus may have coverage gaps and hallucinating from irrelevant context is unacceptable — customer support, medical question answering, legal research.
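A prompting-level sketch of the CRAG routing logic. The `evaluate` callable stands in for the lightweight relevance grader, `web_search` for the fallback tool, and the thresholds are illustrative; results carry a source tag so the prompt can attribute them.

```python
def corrective_retrieve(query, retrieve, evaluate, web_search,
                        high=0.7, low=0.3):
    """evaluate(query, chunk) -> float in [0, 1] grades relevance."""
    chunks = retrieve(query)
    best = max((evaluate(query, c) for c in chunks), default=0.0)
    if best >= high:   # confident hit: internal corpus only
        return [("internal", c) for c in chunks]
    if best < low:     # miss: discard and fall back to the web
        return [("web", c) for c in web_search(query)]
    # Ambiguous: keep both, tagged for explicit source attribution
    return ([("internal", c) for c in chunks]
            + [("web", c) for c in web_search(query)])
```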
Adaptive RAG
Adaptive RAG (Jeong et al., NAACL 2024) routes each query to one of three retrieval strategies based on a query complexity classifier:
- No retrieval — simple factual queries the model can answer from parametric memory
- Single-step retrieval — one retrieval call is sufficient
- Multi-step retrieval — iterative retrieval, where each step's results inform the next query
The classifier is a fine-tuned small model (e.g., a DistilBERT-scale classifier trained on query complexity labels). The routing decision adds ~10ms of overhead but eliminates unnecessary retrieval on queries that don't need it and enables richer multi-step retrieval on queries that do.
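The routing logic can be sketched as follows, with `classify` standing in for the fine-tuned complexity classifier, `answer` for the generation call, and a deliberately naive follow-up reformulation for the multi-step path.

```python
def adaptive_answer(query, classify, retrieve, answer, max_hops=3):
    """classify(query) -> 'none' | 'single' | 'multi'; answer(query, context)."""
    label = classify(query)
    if label == "none":
        return answer(query, [])          # parametric memory is enough
    if label == "single":
        return answer(query, retrieve(query))
    # Multi-step: each hop's top result seeds the next retrieval query.
    context, q = [], query
    for _ in range(max_hops):
        hits = retrieve(q)
        if not hits:
            break
        context.extend(hits)
        q = f"{query} [given: {hits[0]}]"  # naive reformulation
    return answer(query, context)
```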
Pattern Comparison
| Pattern | Retrieval trigger | Overhead | Requires training | Best for |
|---|---|---|---|---|
| FLARE | Low-confidence token | Low | No (prompting) | Long-form generation |
| Self-RAG | Reflection token decision | Medium | Yes (reflection tokens) | High-accuracy factual tasks |
| CRAG | Post-retrieval quality check | Low | No (prompting + small evaluator) | Corpora with coverage gaps |
| Adaptive RAG | Query complexity classifier | Low | Small classifier (~10ms) | Mixed-complexity query streams |
GraphRAG: When Vector Search Fails on Multi-Hop Questions
Standard vector RAG excels at lookup: find the chunks most similar to this query. It struggles with questions that require synthesizing relationships across a large corpus — questions like:
- "What are the common themes across all product feedback from Q1?"
- "Which engineers have contributed to both the authentication service and the billing pipeline?"
- "What changed in our compliance posture between 2023 and 2025?"
These queries require global understanding, not local similarity. No single chunk is the answer; the answer emerges from relationships across many chunks.
Microsoft GraphRAG (Edge et al., 2024) addresses this by building an entity-relationship graph from the corpus using an LLM, then applying hierarchical community detection (Leiden algorithm) to cluster related entities. Each community gets a summary. Queries that need global synthesis route to these community summaries rather than raw chunks.
The result is dramatic on aggregation queries: on dataset-wide synthesis questions ("what are the main themes across this entire corpus?"), GraphRAG substantially outperforms standard vector RAG — which often returns irrelevant results or simply fails on questions that have no single-chunk answer.
The cost is significant: building the GraphRAG index requires LLM calls for every entity extraction and community summarization pass. For a 1M-token corpus, expect index-build costs in the range of $10–50 depending on model choice. A recent paper (UnWeaver, arXiv:2603.29875) argues that GraphRAG is "overkill for lookup" and recommends a simpler entity-augmented vector approach for most use cases.
When to use GraphRAG:
- Corpus-wide analytical questions are a documented user need
- The corpus is relatively stable (high rebuild cost amortizes poorly over rapidly changing data)
- Per-query latency budget allows for community summary lookups
When not to: Start with vector RAG. Add GraphRAG when you have confirmed that multi-hop synthesis questions are failing and that the index rebuild cost is acceptable.
Is RAG Still Necessary? The Long-Context Question
With Gemini 2.5 Pro supporting up to 1M tokens and Claude Sonnet 4.6 supporting 200K, a reasonable question is: why retrieve at all? Just put the whole corpus in context.
This point gets made loudly every time a new long-context model ships. Let's look at the data.
The Performance Case
A study comparing retrieval-augmented and full-context approaches across 9 long-context tasks (arXiv:2310.03025) found:
- 4K context + RAG ≈ 16K full context — retrieval matched a 4x larger context window at a fraction of the cost
- Retrieval-augmented Llama2-70B-32K outperformed GPT-3.5-turbo-16K on all 9 tasks, including tasks where GPT-3.5-16K had the entire document in context
The "lost in the middle" phenomenon is a key reason why (Liu et al., 2023): models reliably underperform when the relevant content is buried in the middle of a long context, performing best when it appears near the beginning or end. Retrieval places the most relevant content at the front of the context, in the zone of peak attention.
The Cost Case
At current pricing (Claude Sonnet 4.6, April 2026), stuffing a 500K-token legal corpus into every query costs roughly $1.50 per call at $3/MTok input pricing. An agent handling 10,000 queries/day would spend ~$15,000/day in context costs alone — before output tokens and before any response actually benefits from the full corpus.
RAG with a 4K retrieval window cuts that to roughly $0.012 per call at the same input pricing. The difference is more than two orders of magnitude.
Prompt Caching Partially Closes the Gap
Anthropic, OpenAI, and Google all offer prompt caching: repeatedly submitted prefix sequences are cached server-side at substantially reduced cost. Cache reads on Anthropic are billed at 10% of the base input price (~90% savings), though cache writes carry a 25% surcharge — so the net benefit depends on your cache hit rate. OpenAI's caching offers up to 90% savings on input tokens. Minimum prefix lengths apply before caching activates (2,048 tokens for Claude Sonnet 4.6; 4,096 for Gemini 2.5 Pro; 1,024 for OpenAI). Gemini's context cache TTL defaults to 1 hour.
For static corpora that don't change between queries — a fixed product documentation set, a stable legal reference — prompt caching makes long-context more viable. The first call pays full price; subsequent calls within the TTL pay cache rates.
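As a sketch, a Messages API request body with the static corpus in a cached system block might look like the following. The field names follow Anthropic's published prompt-caching docs, but verify them against the current API reference; the model id is illustrative.

```python
def build_cached_request(corpus_text: str, question: str) -> dict:
    """Request body shape for Anthropic's Messages API with prompt caching.

    The static corpus lives in the system prefix marked with cache_control,
    so repeated calls within the cache TTL pay cache-read rates for it.
    """
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": corpus_text,  # must exceed the minimum cacheable length
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }
```

Only the per-query question varies between calls; everything before it stays byte-identical, which is what makes the prefix cacheable.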
Prompt caching does not solve the problem for dynamic corpora (support ticket histories, codebases under active development, news feeds) where the corpus changes faster than caches expire.
Decision Table
| Scenario | Recommended approach |
|---|---|
| Corpus fits in 8K tokens | Full context — no retrieval needed |
| Static corpus, repeated identical prefix | Long-context + prompt caching |
| Dynamic corpus, frequent updates | RAG |
| Corpus >200K tokens | RAG (most long-context models can't take it all anyway) |
| Queries require global synthesis across corpus | GraphRAG or long-context community summaries |
| Sub-second latency required | RAG (context window attention scales quadratically) |
| Precise lookup: exact IDs, specific names | Hybrid search (BM25 + dense) |
The 2026 consensus among practitioners: long-context and RAG are complementary. The optimal architecture for a large knowledge-intensive agent is retrieval-augmented long-context — use retrieval to select the most relevant 8–32K of context, then let the model attend to that focused window with full coherence.
Failure Modes and Mitigations
The RAG glossary entry lists four limitations without solutions. Here they are with mitigations.
1. Hallucinated Citations
The model receives a retrieved chunk, acknowledges it in reasoning, but fabricates specific sub-facts (dates, names, numbers) that aren't in the chunk. The retrieved context is real; the claimed detail is not.
Mitigation: Add a citation verification step as a separate tool call after generation. The verifier checks each factual claim against the source chunks and flags unsupported statements. For high-stakes outputs, make this a required post-generation step.
```python
VERIFY_PROMPT = """
Retrieved source:
{source}

Claim to verify:
{claim}

Does the source directly support this claim? Answer YES, NO, or PARTIAL with a brief reason.
"""
```
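A verification loop built on a template equivalent to the prompt above might look like this. Here `llm` is a stand-in for your model client, and parsing the verdict from the reply's first word is a simplification; a production verifier would use structured output.

```python
DEFAULT_VERIFY_TEMPLATE = (
    "Retrieved source:\n{source}\n\nClaim to verify:\n{claim}\n\n"
    "Does the source directly support this claim? "
    "Answer YES, NO, or PARTIAL with a brief reason."
)

def verify_claims(claims, sources, llm, template=DEFAULT_VERIFY_TEMPLATE):
    """Return (claim, verdict) pairs for every claim the model did not
    mark YES; llm(prompt) -> str is a stand-in for the model call."""
    source = "\n---\n".join(sources)
    flagged = []
    for claim in claims:
        reply = llm(template.format(source=source, claim=claim))
        verdict = reply.strip().split()[0].upper().rstrip(".,")
        if verdict != "YES":
            flagged.append((claim, verdict))
    return flagged
```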
2. Context Overflow
Too many chunks injected → the model's reasoning degrades. Counterintuitively, adding more context beyond a certain point reduces answer quality — the model either loses track of what's relevant or the relevant content gets "lost in the middle."
Mitigation: Enforce a hard token budget per retrieval step. Retrieve more candidates than you need (top-20), apply re-ranking, then truncate to the top-5 that fit your budget. Track the token cost of injected context in your observability stack — context bloat is usually invisible until it starts hurting accuracy.
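Enforcing the hard budget takes only a few lines. The sketch below assumes chunks arrive already re-ranked best-first, and it uses a crude whitespace token count by default; swap in a real tokenizer in production.

```python
def fit_to_budget(ranked_chunks, budget_tokens, count_tokens=None, top_n=5):
    """Keep the top-ranked chunks that fit a hard token budget.

    ranked_chunks must be in re-ranked order, best first.
    count_tokens defaults to a crude whitespace count.
    """
    count = count_tokens or (lambda text: len(text.split()))
    kept, used = [], 0
    for chunk in ranked_chunks[:top_n]:
        cost = count(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that would blow the budget
        kept.append(chunk)
        used += cost
    return kept
```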
3. Parametric Contradiction
The model's parametric knowledge (from training) contradicts the retrieved content, and the model favors what it "knows" over what was retrieved. This is especially common when the retrieved information represents a recent change that post-dates training.
Mitigation: Explicit instruction in the system prompt: "When the retrieved context contradicts your prior knowledge, always defer to the retrieved context. If you are uncertain whether a claim is supported by the retrieved context, say so explicitly." Combine with confidence elicitation: ask the model to rate its certainty and cite the specific chunk that supports each claim.
4. Retrieval Misfire
The agent forms a poor retrieval query — either too broad, too narrow, or semantically off — and the returned chunks don't address the actual information need. The model then either confabulates or reports that it couldn't find the answer when the information exists in the corpus.
Mitigation: Implement a CRAG-style quality check: after retrieval, score the relevance of returned chunks against the query. If the score is below a threshold, trigger a reformulated query or web search fallback. Logging every retrieval query and its results is essential for diagnosing misfires in production.
5. Stale Index
The corpus is updated but the index is not re-synced. The agent confidently returns outdated information because it's semantically similar to the query.
Mitigation: Attach an `updated_at` timestamp to every document at index time. Apply a recency filter on retrieval: down-rank or exclude documents older than a defined freshness threshold for queries where recency matters. Implement an automated sync job — treat the index as a derived artifact of the canonical data store, not a separate system.
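One way to implement the down-ranking is an exponential decay on the similarity score, keyed off the `updated_at` timestamp. The half-life here is an illustrative tunable, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

def recency_weighted(results, now=None, half_life_days=30.0):
    """results: list of (score, updated_at) pairs.

    Decays each similarity score exponentially with document age, so a
    stale-but-similar document ranks below a fresh, slightly-less-similar one.
    """
    now = now or datetime.now(timezone.utc)
    out = []
    for score, updated_at in results:
        age_days = max((now - updated_at).total_seconds() / 86400, 0.0)
        decay = 0.5 ** (age_days / half_life_days)
        out.append((score * decay, updated_at))
    return sorted(out, key=lambda pair: pair[0], reverse=True)
```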
Can RAG Be Replaced Entirely?
The pattern that would replace RAG would need to provide: low-latency access to large factual corpora, updatability without retraining, citation capability, and cost-efficient scaling. Let's check the alternatives:
Fine-tuning / continued pretraining: Knowledge is encoded in weights, so retrieval latency drops to zero. But knowledge is frozen at training time — any update requires a new training run. The Atlas framework (Izacard et al., 2022) demonstrated this concretely: a retrieval-augmented Atlas-11B model outperforms GPT-3 (175B parameters) on knowledge-intensive NLP benchmarks — more than 15× larger — by grounding generation in retrieved evidence rather than memorized parameters. In practice, practitioners combine both: fine-tune for behavioral patterns and format, use RAG for factual grounding. The volume of research combining both approaches reflects this — nobody is treating them as mutually exclusive.
Long-context models alone: As covered above — cost-prohibitive at scale, degraded attention on long inputs, no update mechanism without re-prompting, no citation by default.
The 2026 consensus: The search for "RAG killer" techniques consistently lands on the same result — complementary techniques that make RAG better, not replacements. The optimal 2026 retrieval stack looks like this:
```
Query
  ↓
[Adaptive Router] → no-retrieval path (simple factual queries)
  ↓ (retrieval needed)
[Hybrid Search: BM25 + Dense]
  ↓
[Re-ranker: cross-encoder]
  ↓
[CRAG evaluator] → web fallback if quality < threshold
  ↓
[Top-K injected into 32K context window]
  ↓
[Prompt cache hit check]
  ↓
LLM generation
  ↓
[Citation verifier]
  ↓
Response
```
Each layer addresses a specific failure mode. None of them, individually or collectively, eliminates the need for retrieval — they make retrieval reliable.
RAG didn't get replaced. It got upgraded.
What's Next
The next piece in this series covers Pydantic AI — the type-safe, Python-native framework that makes the retrieval-as-tool pattern shown in the memory article feel natural in a production codebase. If you haven't read the memory article yet, start there — the tool interface that Pydantic AI wraps around retrieval builds directly on the architecture introduced in Tier 2.