Part 1 of 3. Part 2 walks through τ-bench step by step. Part 3 covers the successors τ²-bench (dual-control coordination) and τ³-bench (knowledge retrieval and voice).
Table of contents
- Why agent evals aren’t software tests
- The three observability primitives
- Types of runs
- What to evaluate at each level
- Reliability is its own axis: pass^k
- Two pillars of a reliable eval harness
- No single method is enough: the filter funnel
- Five steps to build an eval suite that doesn’t lie to you
- What’s next
- References
Why agent evals aren’t software tests
In traditional software, the code is the source of truth. Read the function, know what happens. Inputs are constrained (forms, buttons, typed parameters), outputs are deterministic, and behaviour is fully specified before runtime.
Agents break every one of those assumptions:
- Non-deterministic outputs. Same input, different trajectories.
- Unconstrained inputs. Natural language is unbounded.
- Emergent behaviour. The agent decides actions, calls tools, and mutates state autonomously.
So in agents, the traces are the source of truth. The code just defines a prompt and a set of tools. You don’t know what the agent does until you run it. That single shift is why observability and evaluation in agents are tightly coupled in a way they never are in conventional software: you cannot test what you cannot observe, and you cannot reason about what you have not traced.
This reframes debugging too. Software debugging is finding the failed function in a stack trace. Agent debugging is debugging reasoning: what went into the LLM, what came out, what context was available, which tools were called and in what order, what the model decided to do with the tool’s response. The bug is rarely a bad line of code. It is usually a bad decision in the middle of a trajectory you didn’t know existed until you saw the trace.
The three observability primitives
Almost every agent observability platform builds on the same three nested concepts. The names vary slightly between vendors but the shapes are the same.
Run (Single step). One atomic operation: one LLM call, one tool invocation, or one retrieval step. Has inputs, outputs, latency, cost, metadata. The smallest unit you can observe and evaluate.
Trace (Full Turn). A full agent execution from one user message to the agent’s final response, with no human intervention in between. The agent loops through multiple runs (LLM calls, tool calls, retrievals) until it decides it is done.
Thread (Multiple Turns). A full conversation: multiple traces linked by human turns. Each new user message kicks off a new trace; the thread groups them.
[Figure: the message stream of an agent conversation, read top to bottom, with three brackets on the right marking the three scopes.]
No level alone tells the full story. A run can be perfect (the LLM made a sensible call) inside a trace that fails (the agent picked the wrong tool earlier and never recovered) inside a thread that succeeds (the user re-prompted and the agent fixed itself in the next trace). You evaluate at all three.
Types of runs
Runs come in a few standard shapes. Most agent stacks emit these four:
- LLM-call run. Model name, prompt, completion, token counts, cost, latency, finish reason. The atom of “what the model said.”
- Tool-call run. Tool name, arguments (schema-validated), return value, error if any, latency.
- Retrieval run. Query, retrieved documents, similarity scores, store identifier. Logged separately because retrieval failures and tool failures look different.
- Generation / output run. The final user-visible response for the trace.
Custom runs are common too: input pre-processing, guardrail checks, validators, post-hoc summarisation steps. Anything you might want to evaluate independently should be its own run.
A minimal schema looks roughly like this:
```python
from dataclasses import dataclass, field
from typing import Literal, Any

@dataclass
class Run:
    id: str
    parent_id: str | None  # None for the root run of a trace
    trace_id: str
    type: Literal["llm", "tool", "retrieval", "generation", "custom"]
    name: str  # e.g. "gpt-4o", "search_docs", "rerank"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    started_at: float
    ended_at: float
    cost_usd: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```
Two things matter about this shape. First, the parent_id link makes the run tree reconstructible, so you can render a waterfall view of any trace. Second, inputs and outputs are stored verbatim. You will want them later. Re-running an LLM call against a captured input is how you bisect failures; replaying a tool call against captured args is how you reproduce a flaky integration.
What to evaluate at each level
Different levels admit different metrics. Picking the wrong level is the single most common eval-design mistake.
Run level. Did this individual operation work? Latency, cost, schema validity (did the tool call have the right arguments?), token usage. Programmatic checks dominate here.
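A minimal example of such a run-level programmatic check: validating a tool call's arguments against a declared schema. The refund tool and its schema here are hypothetical, invented for illustration.

```python
# Hypothetical tool schema: argument name -> expected Python type.
REFUND_SCHEMA = {"order_id": str, "amount_usd": float}

def check_tool_args(args: dict, schema: dict) -> list[str]:
    """Return a list of schema violations for one tool-call run.
    An empty list means the run passes this check."""
    errors = []
    for key, expected in schema.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(args[key]).__name__}"
            )
    for key in args:
        if key not in schema:
            errors.append(f"unexpected argument: {key}")
    return errors
```

Checks like this are cheap enough to run on every tool-call run in production, not just in the eval suite.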
Trace level. Did this full agent attempt achieve the goal?
- Outcome: end-state match against an annotated goal. Cleanest signal you can get.
- Path quality: number of tool calls, number of retries, whether the policy was followed, whether the agent looped.
- Rule compliance: did the trace violate domain constraints (e.g. refunding outside the allowed window)? Often best graded with a programmatic policy checker, sometimes an LLM judge.
Thread level. Across the whole conversation, did the user end up where they wanted? Often the thing that actually matters in production. Hardest to grade automatically because it depends on user intent that may not be in any single message. LLM-as-judge with a careful rubric, or human review.
A useful sanity check: if your only metric is at the run level, you are evaluating the LLM, not the agent. If your only metric is at the thread level, you cannot tell why anything failed. You want all three.
One cross-cutting distinction worth naming on top of these levels, a framing Anthropic articulates well in their evals post: capability evals ask “what can this agent do?” and should start at low pass rates so there’s a hill to climb. Regression evals ask “does it still do what it used to?” and should sit near 100%, with any drop flagging a break. As your agent matures, capability tasks that get reliably solved graduate into the regression suite. Run both on every change. Hill-climbing capability without watching regression is how you silently break things you’ve already shipped.
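As a sketch, that graduation loop can be expressed as a tiny triage function. The task names and the graduate-at-100% threshold are illustrative assumptions, not prescriptions.

```python
def triage(capability_rates: dict[str, float],
           regression_rates: dict[str, float],
           graduate_at: float = 1.0) -> tuple[list[str], list[str]]:
    """One eval run's bookkeeping: each dict maps task id -> pass rate
    over that run's trials. Capability tasks at or above the threshold
    graduate into the regression suite; any regression task below 100%
    is flagged as a break."""
    graduated = [t for t, r in capability_rates.items() if r >= graduate_at]
    broken = [t for t, r in regression_rates.items() if r < 1.0]
    return graduated, broken
```

Run it on every change: a non-empty `broken` list blocks the ship, while `graduated` tasks move suites so the capability set keeps its headroom.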
Reliability is its own axis: pass^k
A 70%-pass agent that flips outcomes on identical inputs is not safe to deploy. The eval problem is no longer “did it pass on average” but “does it pass every time on the same input.” That is a separate metric.
τ-bench formalises this with pass^k: run each task k independent times and record the fraction of tasks that succeed on all k runs. A pass^1 (one-shot) score of 70% can collapse to a pass^8 below 25% if the agent is genuinely inconsistent. The decay between pass^1 and pass^k is the reliability signal. We will go deeper on this in Part 2; for now, it’s enough to have it as a metric in your toolbox.
[Interactive chart: drag the slider to vary k and watch the two metrics diverge.]
Watch the gap widen as k grows. At k = 1, both metrics just measure your single-run success rate. But as k grows, pass@k (did it work at least once?) artificially inflates your confidence, while pass^k (does it work every time?) reveals the brutal reality of your agent’s reliability. Framing adapted from Anthropic’s Demystifying Evals for AI Agents.
The practical implication is that you cannot ship reliability with a single-run eval. Every task in your suite needs to be runnable k times, with isolated state, and your harness needs to compute pass^k as a first-class metric.
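A minimal sketch of both metrics, assuming each task's trials are recorded as a list of booleans (one entry per isolated run):

```python
def pass_hat_k(trials: dict[str, list[bool]], k: int) -> float:
    """pass^k: fraction of tasks whose k independent trials ALL succeed."""
    return sum(all(results[:k]) for results in trials.values()) / len(trials)

def pass_at_k(trials: dict[str, list[bool]], k: int) -> float:
    """pass@k: fraction of tasks that succeed at least once in k trials."""
    return sum(any(results[:k]) for results in trials.values()) / len(trials)
```

On a toy suite where one task always passes, one alternates pass/fail, and one always fails, pass@8 reads 2/3 while pass^8 reads 1/3: the same transcripts, two very different stories about deployability.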
Two pillars of a reliable eval harness
Most eval suites fail because they test agents like pure functions. They assert on the sequence of tool calls (the path) and they run trials sequentially in the same environment (shared state). To get real reliability numbers, you have to invert both.
1. Absolute state isolation (The pass^k prerequisite)
If you want to measure pass^k, every single trial must run in a hermetically sealed environment. Reusing state across trials causes correlated failures and hides real reliability problems.
The war story: I once spent a week debugging an agent whose pass rate kept dropping from 80% to 20% over the weekend. It turned out the eval harness was creating a test user in a staging CRM for Trial 1, but failing to clean it up. By Trial 50, the agent was failing simply because the create_user tool was throwing a “duplicate email” error. The agent was fine; the harness was leaking state.
In τ-bench, every single trial spins up an isolated, in-memory SQLite database. If the agent needs to modify a booking, it does so in a universe that only exists for that specific run. When the trace ends, the universe is destroyed.
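A sketch of per-trial isolation in the same spirit, using Python's sqlite3 with an in-memory database. The bookings schema here is illustrative, not τ-bench's actual schema.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def fresh_env():
    """Give each trial its own in-memory database, seeded from scratch
    and destroyed on exit. No state can leak between trials."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO bookings VALUES ('B1', 'confirmed')")
    conn.commit()
    try:
        yield conn
    finally:
        conn.close()  # the universe is destroyed with the trial
```

Any mutation the agent makes in trial 1 is invisible to trial 2; a second `fresh_env()` always starts from the seeded state, which is exactly the property pass^k measurement depends on.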
2. Outcome grading (not path grading)
Define success at the highest-value granularity. Compare final states against goal states programmatically.
If you mandate that an agent must call search_kb before reply, you’ll fail an agent that correctly answers from its context window. If you mandate it must call update_shipping exactly once, you’ll fail an agent that correctly retries after a network timeout.
In τ-bench, grading doesn’t look at the Run tree at all. It just queries the final database state to see if the user’s goal was achieved.
When you must use LLM-as-judge graders for things like tone or policy compliance, score one rubric dimension at a time, build in partial credit, and keep the prompts boring. But whenever possible, grade the database.
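A sketch of what "grade the database" looks like, assuming an illustrative bookings table: the grader compares final state against an annotated goal and never consults the run tree.

```python
import sqlite3

def grade_outcome(conn: sqlite3.Connection, goal: dict[str, str]) -> bool:
    """Outcome grading: pass iff every booking's final status matches the
    annotated goal state, regardless of which tool calls got it there.
    Table and column names are illustrative."""
    final = dict(conn.execute("SELECT id, status FROM bookings"))
    return all(final.get(booking_id) == status for booking_id, status in goal.items())
```

An agent that retried a flaky tool three times and an agent that nailed it first try both pass, because the user ends up in the same place either way.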
No single method is enough: the filter funnel
No single method catches every failure. Think of evaluation as a series of increasingly fine sieves in a filter funnel. Each layer catches a different type of error, and tying your methods directly to the observability primitives (Run, Trace, Thread) ensures you don’t leave gaps.
| Primitive | Eval Methods | What it Catches | Limitations |
|---|---|---|---|
| Run (Single step) | Automated schema validation, latency monitors, token counting, static analysis. | Malformed tool calls, API timeouts, context window overflows. | Tells you nothing about whether the agent achieved the user’s goal. |
| Trace (Full turn) | LLM-as-judge rubrics, programmatic state verification, policy compliance checkers. | Hallucinations, logic loops, failure to complete the task, domain rule violations. | Can diverge from real-world usage if the simulated task is unrealistic. |
| Thread (Multiple turns) | User feedback (thumbs up/down), A/B testing on completion rates, manual transcript review. | User frustration, multi-turn drift, UX issues, unanticipated edge cases. | Slow, sparse, and often lacks ground truth for why it failed. |
The takeaway: pick a framework early and don’t agonise over the choice. They differ at the margins. LangSmith is most natural if you already live in the LangChain ecosystem; Arize Phoenix is the open-source option for tracing plus grading. None of them will save you if your tasks are vague or your graders are wrong. Invest your energy there.
Five steps to build an eval suite that doesn’t lie to you
Most eval suites fail in the same predictable ways. The loop you actually want is: ship, observe, mine failures, fix, validate, repeat.
1. Mine production for focused test sets. Don’t guess failure modes up-front. Ship a thin agent early, instrument everything, and let real users surface the edge cases. Curate these into small, focused test sets: twenty well-curated tasks isolating specific concepts beat two hundred sloppy ones. A useful sanity check before a task enters the suite: write the reference solution. If you can’t write a clear one, the task is too vague to grade reliably and needs sharpening before it’s worth including. As your agent improves, keep mining production for harder tasks to prevent your suite from saturating at a useless 100% pass rate.
The first time I did this, I spent two weeks pre-launch building an eval set that covered every failure mode the team could brainstorm. Week one of production traffic surfaced twelve new failure modes we hadn’t imagined. The largest single category, users replying to the agent’s clarifying questions in a way the prompt had no handler for, accounted for roughly a third of reported bugs. The pre-launch work was not wasted, but the production traces were close to ten times higher signal per hour.
2. Isolate trials and divide the labour. Run each task against a completely fresh state. Reusing state (like a database) across trials causes correlated failures and hides real reliability problems. To maintain this rigour without burning out, split the work: a dedicated evals team owns the infrastructure and isolation, while domain experts and product teams contribute the actual tasks.
3. Grade outcomes, not paths. Define success at the highest-value granularity. Compare final states against goal states programmatically whenever possible. When you must use LLM-as-judge graders, score one rubric dimension at a time, build in partial credit, and keep the prompts boring.
4. Read the transcripts. Sample failed traces and read them manually. You won’t know if your agent is actually failing or if your graders are just wrong until you read fifty failures in one sitting. If you have to squint at a failure to figure out why it failed, the eval is wrong, not the agent.
The first time I did this, roughly 40% of what the dashboard had marked as policy violations were actually the LLM judge misreading the policy. Three rules we had escalated as “frequently broken” weren’t broken at all once I re-graded by hand. The agent was fine; the grader was the bug. You will not catch that from aggregate scores.
5. Balance both directions. The balanced-set idea is well-articulated in Anthropic’s evals roadmap; the 60/40 ratio below is what’s worked for me. If you only test where a behaviour should occur, you optimise for over-triggering and ship an agent that fires too eagerly. Include the “don’t do this” cases too: a web-search agent needs tasks where searching is wrong, not just where it’s right; a refund agent needs cases where the user is asking for a refund they aren’t entitled to. Roughly 60/40 should-fire to should-not-fire, weighted toward whichever direction is more costly when wrong.
What’s next
Part 2 walks through τ-bench (Yao et al., 2024), the benchmark that crystallised most of these ideas: a database-grounded conversational eval with policy documents, a simulated user, and the pass^k metric we sketched here. Part 3 covers the successors τ²-bench (dual-control coordination, where the agent and user can both act on the world) and τ³-bench (retrieval and voice).
If you only have time to study one agent benchmark, τ-bench is it. If you are shipping agents in 2026, τ²-bench and τ³-bench are closer to your reality.
References
- Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045
- Anthropic. Demystifying Evals for AI Agents. anthropic.com/engineering/demystifying-evals-for-ai-agents