Part 1 of 3. Part 2 walks through τ-bench step by step. Part 3 covers the successors τ²-bench (dual-control coordination) and τ³-bench (knowledge retrieval and voice).
Table of contents
- Why agent evals aren’t software tests
- The three observability primitives
- Types of runs
- What to evaluate at each level
- Reliability is its own axis: pass^k
- Two pillars of a reliable eval harness
- No single method is enough: the filter funnel
- Five steps to build an eval suite that doesn’t lie to you
- What’s next
- References
Why agent evals aren’t software tests
In traditional software, the code is the source of truth. Read the function, know what happens. Inputs are constrained (forms, buttons, typed parameters), outputs are deterministic, and behaviour is fully specified before runtime.
Agents break every one of those assumptions:
- Non-deterministic outputs. Same input, different trajectories.
- Unconstrained inputs. Natural language is unbounded.
- Emergent behaviour. The agent decides actions, calls tools, and mutates state autonomously.
So in agents, the traces are the source of truth. The code just defines a prompt and a set of tools. You don’t know what the agent does until you run it. That single shift is why observability and evaluation in agents are tightly coupled in a way they never are in conventional software: you cannot test what you cannot observe, and you cannot reason about what you have not traced.
This reframes debugging too. Software debugging is finding the failed function in a stack trace. Agent debugging is debugging reasoning: what went into the LLM, what came out, what context was available, which tools were called and in what order, what the model decided to do with the tool’s response. The bug is rarely a bad line of code. It is usually a bad decision in the middle of a trajectory you didn’t know existed until you saw the trace.
The three observability primitives
Almost every agent observability platform builds on the same three nested concepts. The names vary slightly between vendors but the shapes are the same.
Run (Single step). One atomic operation: one LLM call, one tool invocation, or one retrieval step. Has inputs, outputs, latency, cost, metadata. The smallest unit you can observe and evaluate.
Trace (Full Turn). A full agent execution from one user message to the agent’s final response, with no human intervention in between. The agent loops through multiple runs (LLM calls, tool calls, retrievals) until it decides it is done.
Thread (Multiple Turns). A full conversation: multiple traces linked by human turns. Each new user message kicks off a new trace; the thread groups them.
[Figure: the message stream of an agent conversation, read top to bottom, with three brackets on the right marking the three scopes.]
No level alone tells the full story. A run can be perfect (the LLM made a sensible call) inside a trace that fails (the agent picked the wrong tool earlier and never recovered) inside a thread that succeeds (the user re-prompted and the agent fixed itself in the next trace). You evaluate at all three.
Types of runs
Runs come in a few standard shapes. Most agent stacks emit these four:
- LLM-call run. Model name, prompt, completion, token counts, cost, latency, finish reason. The atom of “what the model said.”
- Tool-call run. Tool name, arguments (schema-validated), return value, error if any, latency.
- Retrieval run. Query, retrieved documents, similarity scores, store identifier. Logged separately because retrieval failures and tool failures look different.
- Generation / output run. The final user-visible response for the trace.
Custom runs are common too: input pre-processing, guardrail checks, validators, post-hoc summarisation steps. Anything you might want to evaluate independently should be its own run.
A minimal schema looks roughly like this:
```python
from dataclasses import dataclass, field
from typing import Literal, Any

@dataclass
class Run:
    id: str
    parent_id: str | None  # None for the root run of a trace
    trace_id: str
    type: Literal["llm", "tool", "retrieval", "generation", "custom"]
    name: str  # e.g. "gpt-4o", "search_docs", "rerank"
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    started_at: float
    ended_at: float
    cost_usd: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```
Two things matter about this shape. First, the parent_id link makes the run tree reconstructible, so you can render a waterfall view of any trace. Second, inputs and outputs are stored verbatim. You will want them later. Re-running an LLM call against a captured input is how you bisect failures; replaying a tool call against captured args is how you reproduce a flaky integration.
What to evaluate at each level
Different levels admit different metrics. Picking the wrong level is the single most common eval-design mistake.
Run level. Did this individual operation work? Latency, cost, schema validity (did the tool call have the right arguments?), token usage. Programmatic checks dominate here.
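A minimal example of such a run-level programmatic check: validating a tool call's arguments against a declared schema. The refund tool and its schema here are hypothetical, invented for illustration.

```python
# Hypothetical tool schema: argument name -> expected Python type.
REFUND_SCHEMA = {"order_id": str, "amount_usd": float}

def check_tool_args(args: dict, schema: dict) -> list[str]:
    """Return a list of schema violations for one tool-call run.
    An empty list means the run passes this check."""
    errors = []
    for key, expected in schema.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(args[key]).__name__}"
            )
    for key in args:
        if key not in schema:
            errors.append(f"unexpected argument: {key}")
    return errors
```

Checks like this are cheap enough to run on every tool-call run in production, not just in the eval suite.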
Trace level. Did this full agent attempt achieve the goal?
- Outcome: end-state match against an annotated goal. Cleanest signal you can get.
- Path quality: number of tool calls, number of retries, whether the policy was followed, whether the agent looped.
- Rule compliance: did the trace violate domain constraints (e.g. refunding outside the allowed window)? Often best graded with a programmatic policy checker, sometimes an LLM judge.
Thread level. Across the whole conversation, did the user end up where they wanted? Often the thing that actually matters in production. Hardest to grade automatically because it depends on user intent that may not be in any single message. LLM-as-judge with a careful rubric, or human review.
A useful sanity check: if your only metric is at the run level, you are evaluating the LLM, not the agent. If your only metric is at the thread level, you cannot tell why anything failed. You want all three.
One cross-cutting distinction worth naming on top of these levels, a framing Anthropic articulates well in their evals post: capability evals ask “what can this agent do?” and should start at low pass rates so there’s a hill to climb. Regression evals ask “does it still do what it used to?” and should sit near 100%, with any drop flagging a break. As your agent matures, capability tasks that get reliably solved graduate into the regression suite. Run both on every change. Hill-climbing capability without watching regression is how you silently break things you’ve already shipped.
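As a sketch, that graduation loop can be expressed as a tiny triage function. The task names and the graduate-at-100% threshold are illustrative assumptions, not prescriptions.

```python
def triage(capability_rates: dict[str, float],
           regression_rates: dict[str, float],
           graduate_at: float = 1.0) -> tuple[list[str], list[str]]:
    """One eval run's bookkeeping: each dict maps task id -> pass rate
    over that run's trials. Capability tasks at or above the threshold
    graduate into the regression suite; any regression task below 100%
    is flagged as a break."""
    graduated = [t for t, r in capability_rates.items() if r >= graduate_at]
    broken = [t for t, r in regression_rates.items() if r < 1.0]
    return graduated, broken
```

Run it on every change: a non-empty `broken` list blocks the ship, while `graduated` tasks move suites so the capability set keeps its headroom.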
Reliability is its own axis: pass^k
A 70%-pass agent that flips outcomes on identical inputs is not safe to deploy. The eval problem is no longer “did it pass on average” but “does it pass every time on the same input.” That is a separate metric.
τ-bench formalises this with pass^k: run each task k independent times and record the fraction of tasks that succeed on all k runs. A pass^1 (one-shot) score of 70% can collapse to a pass^8 below 25% if the agent is genuinely inconsistent. The decay between pass^1 and pass^k is the reliability signal. We will go deeper on this in Part 2; for now, it’s enough to have it as a metric in your toolbox.
[Interactive chart: drag the slider to vary k and watch the two metrics diverge.]
Watch the gap widen as k grows. At k = 1, both metrics just measure your single-run success rate. But as k grows, pass@k (did it work at least once?) artificially inflates your confidence, while pass^k (does it work every time?) reveals the brutal reality of your agent’s reliability. Framing adapted from Anthropic’s Demystifying Evals for AI Agents.
The practical implication is that you cannot ship reliability with a single-run eval. Every task in your suite needs to be runnable k times, with isolated state, and your harness needs to compute pass^k as a first-class metric.
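A minimal sketch of both metrics, assuming each task's trials are recorded as a list of booleans (one entry per isolated run):

```python
def pass_hat_k(trials: dict[str, list[bool]], k: int) -> float:
    """pass^k: fraction of tasks whose k independent trials ALL succeed."""
    return sum(all(results[:k]) for results in trials.values()) / len(trials)

def pass_at_k(trials: dict[str, list[bool]], k: int) -> float:
    """pass@k: fraction of tasks that succeed at least once in k trials."""
    return sum(any(results[:k]) for results in trials.values()) / len(trials)
```

On a toy suite where one task always passes, one alternates pass/fail, and one always fails, pass@8 reads 2/3 while pass^8 reads 1/3: the same transcripts, two very different stories about deployability.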
Two pillars of a reliable eval harness
Most eval suites fail because they test agents like pure functions. They assert on the sequence of tool calls (the path) and they run trials sequentially in the same environment (shared state). To get real reliability numbers, you have to invert both.
1. Absolute state isolation (The pass^k prerequisite)
If you want to measure pass^k, every single trial must run in a hermetically sealed environment. Reusing state across trials causes correlated failures and hides real reliability problems.
The war story: I once spent a week debugging an agent whose pass rate kept dropping from 80% to 20% over the weekend. It turned out the eval harness was creating a test user in a staging CRM for Trial 1, but failing to clean it up. By Trial 50, the agent was failing simply because the create_user tool was throwing a “duplicate email” error. The agent was fine; the harness was leaking state.
In τ-bench, every single trial spins up an isolated, in-memory SQLite database. If the agent needs to modify a booking, it does so in a universe that only exists for that specific run. When the trace ends, the universe is destroyed.
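A sketch of per-trial isolation in the same spirit, using Python's sqlite3 with an in-memory database. The bookings schema here is illustrative, not τ-bench's actual schema.

```python
import sqlite3
from contextlib import contextmanager

@contextmanager
def fresh_env():
    """Give each trial its own in-memory database, seeded from scratch
    and destroyed on exit. No state can leak between trials."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO bookings VALUES ('B1', 'confirmed')")
    conn.commit()
    try:
        yield conn
    finally:
        conn.close()  # the universe is destroyed with the trial
```

Any mutation the agent makes in trial 1 is invisible to trial 2; a second `fresh_env()` always starts from the seeded state, which is exactly the property pass^k measurement depends on.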
2. Outcome grading (not path grading)
Define success at the highest-value granularity. Compare final states against goal states programmatically.
If you mandate that an agent must call search_kb before reply, you’ll fail an agent that correctly answers from its context window. If you mandate it must call update_shipping exactly once, you’ll fail an agent that correctly retries after a network timeout.
In τ-bench, grading doesn’t look at the Run tree at all. It just queries the final database state to see if the user’s goal was achieved.
When you must use LLM-as-judge graders for things like tone or policy compliance, score one rubric dimension at a time, build in partial credit, and keep the prompts boring. But whenever possible, grade the database.
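A sketch of what "grade the database" looks like, assuming an illustrative bookings table: the grader compares final state against an annotated goal and never consults the run tree.

```python
import sqlite3

def grade_outcome(conn: sqlite3.Connection, goal: dict[str, str]) -> bool:
    """Outcome grading: pass iff every booking's final status matches the
    annotated goal state, regardless of which tool calls got it there.
    Table and column names are illustrative."""
    final = dict(conn.execute("SELECT id, status FROM bookings"))
    return all(final.get(booking_id) == status for booking_id, status in goal.items())
```

An agent that retried a flaky tool three times and an agent that nailed it first try both pass, because the user ends up in the same place either way.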
No single method is enough: the filter funnel
No single method catches every failure. Think of evaluation as a series of increasingly fine sieves in a filter funnel. Each layer catches a different type of error, and tying your methods directly to the observability primitives (Run, Trace, Thread) ensures you don’t leave gaps.
| Primitive | Eval Methods | What it Catches | Limitations |
|---|---|---|---|
| Run (Single step) | Automated schema validation, latency monitors, token counting, static analysis. | Malformed tool calls, API timeouts, context window overflows. | Tells you nothing about whether the agent achieved the user’s goal. |
| Trace (Full turn) | LLM-as-judge rubrics, programmatic state verification, policy compliance checkers. | Hallucinations, logic loops, failure to complete the task, domain rule violations. | Can diverge from real-world usage if the simulated task is unrealistic. |
| Thread (Multiple turns) | User feedback (thumbs up/down), A/B testing on completion rates, manual transcript review. | User frustration, multi-turn drift, UX issues, unanticipated edge cases. | Slow, sparse, and often lacks ground truth for why it failed. |
The takeaway: pick a framework early and don’t agonise over the choice. They differ at the margins. LangSmith is most natural if you already live in the LangChain ecosystem; Arize Phoenix is the open-source option for tracing plus grading. None of them will save you if your tasks are vague or your graders are wrong. Invest your energy there.
Five steps to build an eval suite that doesn’t lie to you
Most eval suites fail in the same predictable ways. The loop you actually want is: ship, observe, mine failures, fix, validate, repeat.
1. Mine production for focused test sets. Don’t guess failure modes up-front. Ship a thin agent early, instrument everything, and let real users surface the edge cases. Curate these into small, focused test sets: twenty well-curated tasks isolating specific concepts beat two hundred sloppy ones. A useful sanity check before a task enters the suite: write the reference solution. If you can’t write a clear one, the task is too vague to grade reliably and needs sharpening before it’s worth including. As your agent improves, keep mining production for harder tasks to prevent your suite from saturating at a useless 100% pass rate.
The first time I did this, I spent two weeks pre-launch building an eval set that covered every failure mode the team could brainstorm. Week one of production traffic surfaced twelve new failure modes we hadn’t imagined. The largest single category, users replying to the agent’s clarifying questions in a way the prompt had no handler for, accounted for roughly a third of reported bugs. The pre-launch work was not wasted, but the production traces were close to ten times higher signal per hour.
2. Isolate trials and divide the labour. Run each task against a completely fresh state. Reusing state (like a database) across trials causes correlated failures and hides real reliability problems. To maintain this rigour without burning out, split the work: a dedicated evals team owns the infrastructure and isolation, while domain experts and product teams contribute the actual tasks.
3. Grade outcomes, not paths. Define success at the highest-value granularity. Compare final states against goal states programmatically whenever possible. When you must use LLM-as-judge graders, score one rubric dimension at a time, build in partial credit, and keep the prompts boring.
4. Read the transcripts. Sample failed traces and read them manually. You won’t know if your agent is actually failing or if your graders are just wrong until you read fifty failures in one sitting. If you have to squint at a failure to figure out why it failed, the eval is wrong, not the agent.
The first time I did this, roughly 40% of what the dashboard had marked as policy violations were actually the LLM judge misreading the policy. Three rules we had escalated as “frequently broken” weren’t broken at all once I re-graded by hand. The agent was fine; the grader was the bug. You will not catch that from aggregate scores.
5. Balance both directions. The balanced-set idea is well-articulated in Anthropic’s evals roadmap; the 60/40 ratio below is what’s worked for me. If you only test where a behaviour should occur, you optimise for over-triggering and ship an agent that fires too eagerly. Include the “don’t do this” cases too: a web-search agent needs tasks where searching is wrong, not just where it’s right; a refund agent needs cases where the user is asking for a refund they aren’t entitled to. Roughly 60/40 should-fire to should-not-fire, weighted toward whichever direction is more costly when wrong.
What’s next
Part 2 walks through τ-bench (Yao et al., 2024), the benchmark that crystallised most of these ideas: a database-grounded conversational eval with policy documents, a simulated user, and the pass^k metric we sketched here. Part 3 covers the successors τ²-bench (dual-control coordination, where the agent and user can both act on the world) and τ³-bench (retrieval and voice).
If you only have time to study one agent benchmark, τ-bench is it. If you are shipping agents in 2026, τ²-bench and τ³-bench are closer to your reality.
References
- Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045
- Anthropic. Demystifying Evals for AI Agents. anthropic.com/engineering/demystifying-evals-for-ai-agents