
Agent Evaluation: How to Test What You Can't Fully Predict

Evaluation is the discipline that separates an agent that demos well from one that survives production. This piece builds the vocabulary — correctness, faithfulness, tool-call accuracy, trajectory evaluation — and the workflow behind it.

David Akuma · May 13, 2026 · 13 min read


You shipped the agent. The ReAct loop is clean, the system prompt has its four sections, the tools call correctly. The demo was flawless. Then real users arrived, and the thing started failing in ways your test suite never saw — not crashing, not throwing, just quietly producing confident, wrong answers.

That gap — between "looks great in the demo" and "doesn't fail silently in prod" — is the single most underserved problem in agent engineering. Evaluation is the discipline that closes it.

This article establishes the vocabulary the rest of this series leans on. If you take one thing away: if you don't have a golden set, you don't have eval — you have vibes.

Important

This builds directly on "The ReAct Loop Unpacked" and "Prompt Engineering for Agent Roles". Eval is the feedback loop that makes both of those tractable. Without it, every "improvement" to your loop or your prompt is a guess you can't verify.

Why Your Unit Tests Stopped Working

Traditional testing rests on one assumption: given input A, the system returns output B, every time. assert output == expected. That assumption is the first casualty of building agents.

Three things break it:

Non-determinism. LLMs are probabilistic. Even at temperature 0, identical outputs across calls are not guaranteed, and an agent's path can shift further with a slow API call that triggers a timeout or retry, a re-ranked retrieval, or a provider-side model update. A hard equality assertion fails even when the agent's alternative path is equally correct.
Open context space. A pure function has finite edge cases. An agent interacts with the web, a database, a user who phrases things ten different ways. You cannot enumerate the states it will encounter.
Emergent behavior. Agents find solutions you didn't script. A rigid test marks a more efficient path as a failure simply because it deviated from the expected sequence.
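To make the first and third points concrete: two runs can phrase the same correct answer differently, so an equality assertion rejects one of them while a check on the property that matters accepts both. This is a minimal sketch, assuming a hypothetical refund-policy question whose correct answer must mention a 30-day window; the helper name and the example strings are made up for illustration.

```python
import re

def mentions_30_day_window(answer: str) -> bool:
    """Property check: the answer states a 30-day window, in any phrasing."""
    return re.search(r"\b(30|thirty)[ -]days?\b", answer.lower()) is not None

expected = "Refunds are accepted within 30 days."
run_a = "Refunds are accepted within 30 days."
run_b = "You can return the item for a refund within thirty days of purchase."
wrong_run = "Refunds are accepted within 90 days."

# Exact equality accepts only the run that happens to match the wording.
exact_results = [run == expected for run in (run_a, run_b)]

# The property check accepts both correct runs.
property_results = [mentions_30_day_window(run) for run in (run_a, run_b)]

print(exact_results)                       # [True, False]
print(property_results)                    # [True, True]
print(mentions_30_day_window(wrong_run))   # False
```

A regex is about the simplest property check there is. The point is only that the assertion targets what must be true of the answer, not how it is worded.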

The result is the failure mode that should keep you up at night.

The Silent Failure

In production, the dangerous bug is not the one that throws. It's the one that returns 200 OK with a plausible, professional, wrong answer. The CI pipeline is green — the plumbing works — but the reasoning drifted. Nobody gets paged. The agent "thinks" it succeeded; the user finds out later that it didn't.

This is why eval is not a checkbox at the end of the pipeline. It is the instrument that detects drift the plumbing can't see.

The Four Words You Need

Most "my agent is broken" conversations are imprecise because the team lacks shared vocabulary for how it's broken. These four terms are the minimum. Every case study later in this series references them.

Correctness

Did the final answer match the expected outcome? This is outcome accuracy. For anything beyond exact-match QA, you don't grade it with string equality — you grade it with a rubric, often via an LLM judge scoring helpfulness and accuracy on a discrete scale rather than a binary pass/fail.

Faithfulness

Did the answer stay grounded in the retrieved context, or did the model invent it? This is the metric that matters most for RAG agents. An answer can be correct in the abstract and still unfaithful — right by luck, not by grounding. Unfaithful-but-correct answers are time bombs: they fail the moment the question shifts slightly.
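A crude way to get a feel for faithfulness is to flag answer sentences whose content words barely appear in the retrieved context. The sketch below is a lexical proxy only, written for illustration: the function names and the 0.5 overlap threshold are assumptions, and production-grade faithfulness metrics typically use a model to verify each claim rather than word overlap.

```python
import re

def _content_words(text: str) -> set[str]:
    """Lowercased tokens of 4+ characters, a rough stand-in for content terms."""
    return set(re.findall(r"[a-z0-9]{4,}", text.lower()))

def unsupported_sentences(answer: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Return answer sentences whose content words mostly do not appear in the context."""
    context_words = _content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _content_words(sentence)
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "The Model X battery is rated for 400 kilometers of range per charge."
answer = (
    "The Model X battery is rated for 400 kilometers per charge. "
    "It also recharges fully in twelve minutes."
)
flagged = unsupported_sentences(answer, context)
print(flagged)  # ['It also recharges fully in twelve minutes.']
```

The second sentence may or may not be true in the real world. What the check reports is that nothing in the context supports it, which is exactly the distinction between correctness and faithfulness.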

Tool-Call Accuracy

Did the agent pick the right tool, with the right arguments, in the right order? Break it into three sub-checks:

Selection — did it choose the correct tool at all?
Argument quality — were the parameters valid and well-formed?
Sequencing — did it call tools in a logically required order?

A high-level "the task succeeded" hides which of these failed. Measuring them separately tells you where to fix.
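The three sub-checks can be sketched as separate functions so a report shows which one failed. Everything here is hypothetical: the `ToolCall` shape, the `lookup_order` and `issue_refund` tools, and their schemas are invented for the example rather than taken from any framework.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

# Hypothetical tool schemas: required argument names and their types.
SCHEMAS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}

def selection_ok(calls, expected_tools):
    """Selection: every expected tool was called at least once."""
    called = {c.name for c in calls}
    return all(tool in called for tool in expected_tools)

def arguments_ok(calls):
    """Argument quality: each call has exactly the schema's fields with the right types."""
    for c in calls:
        schema = SCHEMAS.get(c.name)
        if schema is None or set(c.args) != set(schema):
            return False
        if not all(isinstance(c.args[k], t) for k, t in schema.items()):
            return False
    return True

def sequencing_ok(calls, before, after):
    """Sequencing: the first `before` call precedes the first `after` call."""
    names = [c.name for c in calls]
    return before in names and after in names and names.index(before) < names.index(after)

# A run that refunds first and looks the order up afterwards.
calls = [
    ToolCall("issue_refund", {"order_id": "A17", "amount": 25.0}),
    ToolCall("lookup_order", {"order_id": "A17"}),
]
report = {
    "selection": selection_ok(calls, ["lookup_order", "issue_refund"]),
    "arguments": arguments_ok(calls),
    "sequencing": sequencing_ok(calls, before="lookup_order", after="issue_refund"),
}
print(report)  # {'selection': True, 'arguments': True, 'sequencing': False}
```

A single pass/fail flag would report this run as "failed". The split report says the tools and arguments were fine and the order was the problem.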

Trajectory Evaluation

Did the agent take a sensible path, or did it stumble into a good answer by accident? Trajectory eval looks at the sequence of thoughts, tool calls, and observations — the trace — not just the destination. It's what catches the "lucky hallucination": the run where the reasoning was wrong but the final answer happened to land. Those runs pass an outcome check and then detonate in production the next time the dice roll differently.
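One simple way to separate a grounded pass from a lucky one is to require that the expected value shows up in at least one observation in the trace, not only in the final answer. The sketch below assumes a hypothetical `Step` record and a substring comparison, which is far cruder than a real trajectory evaluator but enough to show the idea.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str      # "thought", "tool_call", or "observation"
    content: str

def classify_run(trace, final_answer, expected_answer):
    """Label a run by outcome and by whether the trace supports that outcome."""
    outcome_ok = expected_answer in final_answer
    supported = any(
        step.kind == "observation" and expected_answer in step.content
        for step in trace
    )
    if outcome_ok and supported:
        return "pass"
    if outcome_ok:
        return "lucky"   # right answer, but nothing observed backs it up
    return "fail"

grounded = [
    Step("tool_call", "get_price(sku='K-9')"),
    Step("observation", "price: 42.00 USD"),
]
ungrounded = [
    Step("tool_call", "get_price(sku='K-9')"),
    Step("observation", "error: upstream timeout"),
]
labels = [
    classify_run(grounded, "The price is 42.00 USD.", "42.00"),
    classify_run(ungrounded, "The price is 42.00 USD.", "42.00"),
]
print(labels)  # ['pass', 'lucky']
```

Both runs pass an outcome-only check. Only the trace shows that the second one answered after the tool had failed.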

Tip

A quick litmus test for which metric you need: if your agent occasionally returns the right answer for the wrong reason, you have a trajectory problem that correctness alone will never surface.

Two Layers: Components and the Loop

Robust eval separates the brain from the body. You test the deterministic pieces like software, and you test the agent loop like a system.

Layer 1 — Unit Tests on Deterministic Components

Plenty of your agent is not stochastic, and that part deserves ordinary unit tests:

The output parser that turns model text into a structured tool call
The retriever's chunking and ranking logic
Tool schemas and argument validation
Any post-processing or formatting step

These have known-correct outputs. Test them with assert. If your JSON parser breaks, no amount of LLM-as-judge will save you, and you shouldn't waste a judge call discovering it.
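As an illustration of Layer 1, here is a sketch of an output parser with plain assert-style tests. The wire format (`{"tool": ..., "args": {...}}`) and the function name `parse_tool_call` are assumptions for the example; substitute whatever format your agent actually emits.

```python
import json

def parse_tool_call(model_text: str) -> dict:
    """Parse model output of the assumed form {"tool": ..., "args": {...}}."""
    try:
        data = json.loads(model_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("tool"), str):
        raise ValueError("missing string field 'tool'")
    if not isinstance(data.get("args"), dict):
        raise ValueError("missing object field 'args'")
    return {"tool": data["tool"], "args": data["args"]}

def test_parses_well_formed_call():
    text = '{"tool": "search", "args": {"query": "kappa"}}'
    assert parse_tool_call(text) == {"tool": "search", "args": {"query": "kappa"}}

def test_rejects_truncated_json():
    try:
        parse_tool_call('{"tool": "search", "args": {')
    except ValueError:
        return
    raise AssertionError("expected ValueError for truncated JSON")

def test_rejects_missing_args():
    try:
        parse_tool_call('{"tool": "search"}')
    except ValueError:
        return
    raise AssertionError("expected ValueError for missing args")
```

These tests are deterministic, run in milliseconds, and cost nothing, which is why they belong in front of any judge-based check.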

Layer 2 — End-to-End Trajectory Checks

This is where the non-determinism lives, and where you evaluate the loop as a whole. When you compare an agent's path against a reference trajectory, you have three matching strategies — and choosing the wrong strictness is a common mistake:

Strategy	What it requires	Use when
Exact match	Identical sequence; any extra step fails	Regulated flows (verify identity before data access)
In-order match	All required steps in order; extra steps allowed	Real-world tasks with noisy APIs or harmless retries
Any-order match	All required actions; order irrelevant	Independent sub-tasks ("check weather in NYC and London")

Reaching for exact match on a task that tolerates retries will fail good agents. Reaching for any-order on a task where sequence is a safety property will pass dangerous ones.
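The three strategies are short enough to sketch directly. This version assumes a trajectory is reduced to a list of tool names, which is a simplification; the step names in the example are hypothetical.

```python
from collections import Counter

def exact_match(actual, reference):
    """Identical sequence; any extra, missing, or reordered step fails."""
    return list(actual) == list(reference)

def in_order_match(actual, reference):
    """Reference steps appear in order as a subsequence; extra steps are allowed."""
    remaining = iter(actual)
    # `step in remaining` advances the iterator, so later steps must come later.
    return all(step in remaining for step in reference)

def any_order_match(actual, reference):
    """Every reference step appears at least as often as required; order is ignored."""
    have, need = Counter(actual), Counter(reference)
    return all(have[step] >= count for step, count in need.items())

reference = ["verify_identity", "fetch_account", "send_summary"]
with_retry = ["verify_identity", "fetch_account", "fetch_account", "send_summary"]
reordered = ["fetch_account", "verify_identity", "send_summary"]

results = {
    name: (exact_match(run, reference),
           in_order_match(run, reference),
           any_order_match(run, reference))
    for name, run in (("with_retry", with_retry), ("reordered", reordered))
}
print(results)
# {'with_retry': (False, True, True), 'reordered': (False, False, True)}
```

The harmless retry fails only exact match. The run that fetched the account before verifying identity still passes any-order match, which is the dangerous case if ordering is a safety property.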

Neither layer is sufficient alone. Component tests pass while the agent loops forever; trajectory checks pass while a tool silently returns stale data. You need both.

LLM-as-Judge: Powerful and Easy to Get Wrong

When there's no programmatic ground truth — tone, reasoning quality, "is this explanation actually helpful" — you reach for a judge model: a separate, capable LLM that scores outputs against a rubric. By 2026 this is the default for evaluating non-deterministic systems at scale. It's also the place teams hurt themselves most.

A bad judge prompt is worse than no eval, because it gives you a number you'll trust.

The Biases Are Real and Documented

Position bias: in pairwise comparisons, judges tend to favor a response because of where it appears, most often the one shown first. Mitigation: swap positions, run twice, and only count a verdict that survives the swap.
Verbosity bias: judges rate longer, more confident-sounding answers higher, filler and all. Mitigation: explicitly penalize wordiness in the rubric.
Self-preference bias: a model favors outputs from its own family. Mitigation: judge with a different model family than the one under test.
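The position-swap mitigation can be sketched as a wrapper around any pairwise judge. The `Judge` signature and the two stub judges are assumptions for the example; a real judge would wrap a model call. Treating a verdict that flips under the swap as a tie is one common convention, not the only one.

```python
from typing import Callable

# A judge takes (question, first_response, second_response) and returns
# "first" or "second". In practice this would wrap a model call.
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Run the comparison in both orders; only a consistent verdict counts."""
    forward = judge(question, a, b)    # a shown first
    backward = judge(question, b, a)   # b shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else "tie"

def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: it always picks the first slot."""
    return "first"

def prefers_shorter(question: str, first: str, second: str) -> str:
    """Stub judge whose preference depends on content, not on position."""
    return "first" if len(first) <= len(second) else "second"

biased_verdict = debiased_preference(always_first, "q", "short", "much longer answer")
stable_verdict = debiased_preference(prefers_shorter, "q", "short", "much longer answer")
print(biased_verdict, stable_verdict)  # tie a
```

The swap doubles the judge cost, which is usually an acceptable price for not shipping a preference that was decided by slot order.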

Writing a Judge That Holds Up

Treat the judge prompt like a contract, not a vibe:

Concrete role — "You are a senior technical auditor for RAG systems," not "rate this response."
Discrete rubric, not a 1–10 scale — define each level: 1: hallucinated, 2: correct but incomplete, 3: fully grounded and accurate.
Reasoning before score — force chain-of-thought; it measurably improves agreement with humans.
Structured output — JSON with reasoning, score, and violated_constraints.

# Fill in with JUDGE_PROMPT.format(task=..., context=..., response=...).
# The JSON braces in the last lines are doubled so str.format treats them
# as literal text rather than as placeholders.
JUDGE_PROMPT = """
You are a senior technical auditor evaluating an AI agent's response.

Task: {task}
Retrieved context: {context}
Agent response: {response}

Score each dimension on the discrete scale defined below. Think step by
step and write your reasoning BEFORE assigning any score.

faithfulness: 1=invents facts not in context, 2=mostly grounded with
  minor additions, 3=every claim traceable to context
correctness:  1=wrong, 2=partially correct, 3=fully correct
tool_use:     1=wrong tool, 2=right tool with wrong args, 3=right tool
  and args

Return JSON: {{"reasoning": str, "faithfulness": int, "correctness": int,
"tool_use": int, "violated_constraints": [str]}}
"""

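A judge reply is model output too, so it is worth validating before any score reaches a dashboard. The sketch below checks a reply against the JSON shape the prompt above requests. The function name `parse_judge_output` and the sample reply are illustrative assumptions.

```python
import json

DIMENSIONS = ("faithfulness", "correctness", "tool_use")

def parse_judge_output(raw: str) -> dict:
    """Validate a judge reply against the JSON shape the prompt asks for."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if not isinstance(data.get("reasoning"), str) or not data["reasoning"].strip():
        raise ValueError("reasoning must be a non-empty string")
    for dim in DIMENSIONS:
        score = data.get(dim)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or score not in (1, 2, 3):
            raise ValueError(f"{dim} must be an integer from 1 to 3, got {score!r}")
    violated = data.get("violated_constraints")
    if not isinstance(violated, list) or not all(isinstance(v, str) for v in violated):
        raise ValueError("violated_constraints must be a list of strings")
    return data

sample_reply = json.dumps({
    "reasoning": "Every claim appears in the retrieved context.",
    "faithfulness": 3, "correctness": 3, "tool_use": 2,
    "violated_constraints": [],
})
parsed = parse_judge_output(sample_reply)
print(parsed["tool_use"])  # 2
```

Rejecting out-of-range or missing scores outright is a deliberate choice here: a silently coerced score is the judge-side version of the silent failure described earlier.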
Calibrate Before You Trust It

Here is the step most teams skip: an unvalidated judge is just a confident stranger. Before you let a judge gate deploys, have humans label 50–100 examples, then measure how often the judge agrees. Use Cohen's kappa ($\kappa$), which corrects for agreement that would happen by chance — a $\kappa > 0.8$ is the threshold for "I trust this judge." If it's lower, tighten the rubric and re-measure.
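For reference, Cohen's kappa is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement rate and $p_e$ is the agreement expected by chance from each rater's label frequencies. A small self-contained sketch follows; the ten human and judge labels are made-up data for illustration, and in practice a library implementation is a reasonable choice.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal rates, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined
        # (0/0), treated here as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

human = [3, 3, 2, 1, 3, 2, 1, 1, 3, 2]
judge = [3, 3, 2, 1, 3, 3, 1, 2, 3, 2]
kappa = cohens_kappa(human, judge)
print(round(kappa, 2))  # 0.69
```

In this toy data the judge matches the humans on 8 of 10 items, yet kappa lands near 0.69 because some of that agreement is expected by chance. Raw agreement alone would have overstated how far to trust the judge.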

The Framework Landscape

You don't have to build the harness yourself. As of 2026 the practical lineup:

Framework	Strength	Best for
OpenAI Evals	Standardized benchmarks	Baselining against industry references
PromptFoo	CLI-first, fast iteration	Rapid prompt loops, red-teaming
RAGAS	RAG-specific metrics	Faithfulness and context relevance
DeepEval	Pytest-native	"Unit test" style tool-call evals
Opik (Comet)	Production-scale tracing	Millions of live traces, turn-level metrics
LangSmith	Deep trace debugging	Multi-step trajectories, human-in-the-loop

Start where your pain is. RAG agent? RAGAS for retrieval metrics. Autonomous tool-user? DeepEval or LangSmith for the action-vs-reasoning split. Don't adopt six tools because a table listed them.

Golden Datasets: The Unglamorous Work That Moves the Needle

A golden set is a curated, versioned collection of cases with known-good behavior. It is the ground truth your evals run against, and building it is the least exciting, highest-leverage work in agent engineering.

Start Small, Curate From Reality

You do not need a thousand cases to start. The honest numbers:

20–50 cases — enough to catch the failures that actually matter early on.
30–50 per core capability — the floor for statistically meaningful comparisons between versions.
100+ — enough to start evaluating your evaluator (see kappa, above).

Spread coverage across three tiers, roughly:

Happy paths (~50–60%) — standard queries the agent should nail.
Edge cases (~25%) — ambiguity, multi-step chains, partial information.
Adversarial / failure cases (~15%) — tool 500s, unanswerable prompts where the right move is to refuse gracefully.
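A coverage check over the golden set makes a lopsided tier mix visible early. This sketch assumes each case carries a hypothetical `tier` field; the bands are an assumption too, set a few points either side of the rough split above.

```python
from collections import Counter

# Acceptable share of cases per tier. The bands are an assumption: a few
# points of slack around the rough split described in the text.
BANDS = {"happy": (0.45, 0.65), "edge": (0.20, 0.30), "adversarial": (0.10, 0.20)}

def coverage_report(cases):
    """Return {tier: (share, within_band)} for a list of golden cases."""
    counts = Counter(case["tier"] for case in cases)
    total = len(cases)
    report = {}
    for tier, (low, high) in BANDS.items():
        share = counts[tier] / total if total else 0.0
        report[tier] = (round(share, 2), low <= share <= high)
    return report

# A typical early golden set: almost all happy paths.
golden_set = (
    [{"id": f"h{i}", "tier": "happy"} for i in range(18)]
    + [{"id": f"e{i}", "tier": "edge"} for i in range(2)]
)
report = coverage_report(golden_set)
print(report)
# {'happy': (0.9, False), 'edge': (0.1, False), 'adversarial': (0.0, False)}
```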

Mine Production Failures

The most valuable cases aren't invented — they're recovered. Every time the agent fails in production, the question is: "What test case would have caught this?" Capture the trace, strip any PII, define how the agent should have behaved, and add it to the golden set permanently. Tools like LangSmith and Arize Phoenix let you promote a real production trace straight into the set. Today's incident becomes tomorrow's regression guard.
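The capture-scrub-define-add steps can be sketched as one small function. The trace fields, the case schema, and the two regexes are assumptions for illustration; pattern-based scrubbing like this catches only obvious emails and phone-like numbers and is not a substitute for a real PII review.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def to_golden_case(trace: dict, expected_behavior: str, incident_id: str) -> dict:
    """Turn a failed production trace into a golden case with obvious PII removed."""
    return {
        "id": f"incident-{incident_id}",
        "source": "production",
        "input": scrub(trace["user_input"]),
        "observed_output": scrub(trace["final_output"]),
        "expected_behavior": expected_behavior,
    }

trace = {
    "user_input": "Cancel my order, reach me at jo@example.com or +1 415 555 0100.",
    "final_output": "Order cancelled. Confirmation sent to jo@example.com.",
}
case = to_golden_case(
    trace,
    expected_behavior="Ask for the order number before cancelling anything.",
    incident_id="0042",
)
print(case["input"])
# Cancel my order, reach me at [EMAIL] or [PHONE].
```

Keeping the observed output next to the expected behavior preserves the original failure, which helps when someone later asks why the case exists.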

Version It Like Code

The ground truth evolves as your tools and capabilities do. Treat the golden set as a first-class artifact:

Semantic versioning (v1.2.0-eval): major = new capability, minor = added cases.
Snapshot it alongside the git commit of the agent it grades.
Keep a holdout set you never look at during development and run only before a major release. The moment you optimize against your eval set, it stops measuring generalization and starts measuring memorization.
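One way to make these three practices mechanical is a manifest that records the version, the agent commit, a content fingerprint, and a deterministic holdout split. The sketch below is an assumption-heavy illustration: the manifest fields, the 20% holdout share, and the hash-based split are choices made for the example, and `"<git sha>"` is a placeholder for the real commit.

```python
import hashlib
import json

def content_hash(cases: list[dict]) -> str:
    """Stable fingerprint of the set's content, independent of case order."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def is_holdout(case_id: str, holdout_percent: int = 20) -> bool:
    """Deterministically assign a case to the holdout split by hashing its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return digest[0] * 100 // 256 < holdout_percent

def build_manifest(cases, version, agent_commit):
    """Describe one snapshot: version, the agent commit it grades, and the splits."""
    return {
        "version": version,
        "agent_commit": agent_commit,
        "content_hash": content_hash(cases),
        "dev_ids": [c["id"] for c in cases if not is_holdout(c["id"])],
        "holdout_ids": [c["id"] for c in cases if is_holdout(c["id"])],
    }

cases = [{"id": f"case-{i:03d}", "input": f"question {i}"} for i in range(50)]
manifest = build_manifest(cases, version="v1.2.0-eval", agent_commit="<git sha>")
print(manifest["version"], len(manifest["dev_ids"]), len(manifest["holdout_ids"]))
```

Hashing the case id keeps a case in the same split across versions, so adding new cases never quietly moves an old holdout case into the development set.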

Warning

The fastest way to fool yourself is to tune prompts until they pass the golden set, then ship. If you only ever test against cases you've already seen, you're measuring overfitting, not quality. A blind holdout is the cheapest insurance against this.

Why I Keep Coming Back to This: The Agentic Plane

I'll be candid — agent evaluation is the problem I'm most actively researching right now, and it's the reason I keep circling the idea of an agentic plane.

Most teams treat eval as a script you run before a deploy. That's the right starting point, but it doesn't scale to systems where dozens of agents call each other, share memory, and act on live state. At that point evaluation stops being a step and becomes infrastructure — a control plane that observes every trace, scores trajectories continuously, flags drift before users feel it, and feeds failures straight back into the golden set without a human copy-pasting traces.

That's the agentic plane I'm working toward: the layer where correctness, faithfulness, tool-call accuracy, and trajectory scoring aren't a pre-flight check but a standing signal across the whole fleet. The 2025–2026 shift toward agent-as-a-judge — using a stronger model to evaluate the entire trace rather than just the final text — and toward simulation-based testing in synthetic sandboxes is the early scaffolding for exactly this. The frameworks above are the components. The plane is what you get when you wire them into one continuous loop.

I'll go deeper on this in later pieces. For now, the takeaway is narrower and more useful: build the loop small first. You earn the right to a plane by first having a golden set you trust.

Putting It Together

Evaluation is the feedback loop that makes everything else in this series honest. Tune a system prompt without it and you're guessing. Adjust a ReAct loop without it and you can't tell whether you improved anything or just moved the failure.

The minimal eval checklist before an agent goes to production:

You can name how it fails: correctness, faithfulness, tool-call, or trajectory
Deterministic components have ordinary unit tests
The loop has end-to-end trajectory checks with the right match strictness
Your judge model uses a discrete rubric and reasons before scoring
The judge is calibrated against humans ($\kappa > 0.8$) before it gates anything
A golden set of at least 20–50 real cases exists and is versioned like code
Every production failure becomes a new golden case
A blind holdout set guards against overfitting your own evals

The next article applies this to failures that will happen: error recovery and retry strategies. You can't design good recovery until you can define what "correct" looks like — which is exactly what eval gives you.
