
What an eval suite is, and how to build one

An eval suite is not one thing. It is a layered set of checks with different costs, latencies, and confidence levels. This post walks through what the layers are, how to build the dataset (the part most teams under-do), how grading actually works in practice, and how the whole thing wires into your CI.

A version of this conversation happens at most teams I have worked with. The deploy from yesterday regressed something the team didn’t know they cared about. Nobody can reproduce the regression locally. The rollback is hours away. Someone on Slack asks why we didn’t catch this before deploy.

The answer is always the same. We had a few example prompts saved in a Notion page, somebody ran them by hand on bigger changes, results lived in DMs. That is not a suite. That is vibes in a hard hat.

This post is about what an eval suite actually is, why it has to be layered, how to build the dataset, how grading works in practice, and how the whole thing wires into your CI. It’s the hands-on companion to Part 1 of the agent evals series, which covers the conceptual framework (observability primitives, what to evaluate at which level, pass^k as a reliability metric). This one is about what’s on your laptop and in your CI by Friday.

The headline if you only read this paragraph: aim for fifty well-specified tasks, not five hundred sloppy ones. Use deterministic graders wherever they fit. Calibrate any LLM judge against humans before you trust it. Read your failures yourself. The rest is detail.


A suite, not a notebook

The most common reason teams say they have an eval suite and don’t is that they have one big notebook with twelve test cases, run by hand twice a quarter, with results in Slack DMs. That is a notebook. A suite is something that runs automatically, scores reliably, has versioned inputs and versioned graders, and is something a deploy can fail.

Practically, a suite is a layered set of checks. Each layer has its own cost, latency, and confidence-vs-coverage trade-off. Most teams need at least the first three to call themselves serious. Mature teams have all five.

In order of cost and coverage:

Layer 0 · Vibes: eyeball ten outputs in a notebook.
Layer 1 · Unit-test-style asserts: JSON schema, regex match, tool-call argument checks.
Layer 2 · Golden-task regression set: 20 to 50 frozen tasks; the layer your deploy actually gates on.
Layer 3 · Production-trace replays: sample real sessions, diff against the previous version.
Layer 4 · End-to-end agent eval: τ-bench style; slow, expensive, run on release gates.

Layer 0: vibes. Eyeball ten outputs in a notebook. Free, fast, almost no statistical power, no coverage. The thing you do every time you change a prompt, for an hour before bed. Worth keeping as a sanity layer even when you have all the others, because it’s the only layer where you actually look at the model’s behaviour and notice things the graders haven’t been told to check for.

Layer 1: unit-test-style asserts. Deterministic checks on specific outputs. JSON schema validation. Regex match on the final answer. “No PII in this field.” “Tool call must use these argument names.” “Output is a single integer between 0 and 100.” Cheap, fast, high precision on the things they cover, blind to everything else.

These are the lowest-effort layer to add and the one most teams skip in favour of fancier graders. Almost any non-trivial LLM feature has at least five things you can unit-test deterministically. Find them first.
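
To make Layer 1 concrete, here’s a minimal sketch in plain Python, standard library only. The output shape and field names (answer, confidence) are hypothetical stand-ins for whatever your feature actually emits; the point is that each check is deterministic and names its own failure.

# Sketch of Layer 1 asserts over a hypothetical JSON output with an
# "answer" string and an integer "confidence". Swap in your own fields.
import json
import re

def check_output(raw: str) -> list[str]:
    failures = []
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    # Schema-style check: required key, required type.
    if not isinstance(out.get("answer"), str):
        failures.append("missing or non-string 'answer'")
    else:
        # Regex check: no email-shaped PII in the answer field.
        if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", out["answer"]):
            failures.append("possible email address in 'answer'")
    # Range check: confidence must be an integer between 0 and 100.
    conf = out.get("confidence")
    if not (isinstance(conf, int) and 0 <= conf <= 100):
        failures.append("'confidence' not an int in [0, 100]")
    return failures  # empty list means every assert passed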

Layer 2: golden-task regression set. A frozen set of inputs paired with annotated good outputs or grading functions. Runs on every model or prompt change. This is the layer your deploy actually gates on.

Twenty to fifty well-specified tasks beats two hundred sloppy ones. Each task should isolate one concept, one tool flow, one edge case. Mix in adversarial cases (the user pushes back, the policy is ambiguous, the right answer is “refuse”) deliberately, not by accident.

Layer 3: production-trace replays. Sample real user sessions from production logs. Replay them against the new model or prompt. Diff the output against the previous version. The thing your golden set will miss because the production distribution shifted and you didn’t update.

The point of this layer isn’t to grade the new outputs (you can’t, you don’t have a gold answer for each replayed input). It’s to surface the cases where the new version behaves materially differently from the old one. Diffs that look weird get triaged into the golden set as new tasks.
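
The replay-and-diff mechanics can be very small. In this sketch, traces is whatever anonymised sample you pulled from your logs, and generate_old / generate_new are hypothetical wrappers around the previous and candidate versions; the whitespace normalisation is a judgment call you’d tune to your output format.

# Sketch of a Layer 3 replay diff. All three inputs are assumptions:
# traces is a list of {"input": ...} dicts sampled from production logs,
# generate_old / generate_new wrap the two versions under comparison.
def replay_diff(traces, generate_old, generate_new):
    flagged = []
    for trace in traces:
        old = generate_old(trace["input"])
        new = generate_new(trace["input"])
        # Cheap normalisation so whitespace-only changes don't flag.
        if " ".join(old.split()) != " ".join(new.split()):
            flagged.append({"input": trace["input"], "old": old, "new": new})
    return flagged  # triage by hand; the weird ones become golden tasks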

Layer 4: end-to-end agent eval. τ-bench-style. Multi-turn, simulated user, policy-aware grading. Slow, expensive, run rarely (release gates, major model upgrades). Most teams don’t need this in their CI loop. The ones that do are shipping agents whose failures involve dialogue or coordination, not just single-call correctness.

The layers don’t replace each other; they catch different failures. A regression that breaks JSON parsing will trip Layer 1 before Layer 2 has a chance to score it. A regression that quietly degrades answer quality on the long tail of inputs will only show up in Layer 3 when you compare distributions. A regression that breaks multi-turn negotiation will only show up in Layer 4. A team that only has Layer 2 is catching about half of what they’d catch with all five, and they don’t know which half.

Building the dataset

The optimisers, the dashboards, the LLM-as-judge calibrations: none of it matters if the inputs are wrong. Most teams I’ve worked with spent weeks bikeshedding which eval framework to use before they had a single golden task. Reverse that order.

There are four sources of tasks, in roughly increasing order of effort per task and decreasing order of quantity.

| Source | Pros | Cons | Default share of a 50-task suite |
|---|---|---|---|
| Production traces | Real user distribution; highest signal; tests what the agent will actually see | Requires logging + anonymisation; biased to current usage; uncurated samples over-fit the easy 80% | ~30 cases |
| Hand-written hard cases | Captures every edge case you’ve personally hit; high precision on known failures; long shelf life | Slow to write; biased to what you can imagine; doesn’t scale to volume | ~10 cases |
| LLM-synthesised | Cheap volume; good for breadth across categories or paraphrases | Biases match the synthesising model; lower trust on hard cases; risks self-evaluation if you reuse the model under test | ~5 cases |
| Adversarial / red-team | Catches the worst regressions you’d otherwise ship; surfaces failures you didn’t think to test for | Time-intensive; can drift into edge cases nobody actually triggers | ~5 cases |

The rest of this section is what each row of that table means in practice.

Production traces. The easiest, highest-signal source. Sample real user sessions from your logs, anonymise them, find the interesting ones, turn them into tasks. The bar for “interesting” is: a failure mode you can name, a tricky case you’d want to run on every release, a behaviour the agent should specifically have. The trap is grabbing 200 sampled traces uncurated; you end up with a suite that mostly tests the easy 80%.

Hand-written hard cases. The twenty examples that capture every edge case you’ve personally hit. These are gold. Do not let them rot. When a teammate reports “the agent did the wrong thing here” in Slack, that’s a hand-written hard case in the making.

LLM-synthesised cases. Cheap volume, biases match the synthesising model, lower-trust than real traces. Use for breadth on tasks that need a lot of variants (e.g. classification across many categories) but not for the hard cases. If your synthesiser is the same model you’re testing, you’ve created a self-evaluation problem; use a different model family if you do this.

Adversarial / red-team. Try to break your own product. Half a day of trying to make the agent refuse a legitimate request, prompt-inject itself, leak the system prompt, or escape its guardrails will yield more usable tasks than a week of automated synthesis. The cases you catch this way are usually the worst regressions you would otherwise have shipped.

The mix matters. A suite that’s 100% production traces over-fits to current user behaviour. A suite that’s 100% synthetic over-fits to whatever the synthesising model thinks failures look like. My current default for a new suite of fifty tasks: roughly thirty production-sourced, ten hand-written hard cases, five synthesised for breadth, five adversarial.

Two sanity checks before a task enters the suite

First, write the reference solution. If you can’t write a clean one, the task is too vague to grade reliably and needs sharpening. If a colleague would grade the same input differently from you, neither of you can grade it consistently across releases either.

Second, decide what a failure looks like. Not just “wrong” (too coarse). Is this a “wrong tool” failure, a “wrong arguments” failure, a “right answer wrong format” failure, a “missed the question entirely” failure? Tasks whose failure modes you can’t enumerate are not yet ready.

Grading: where most suites quietly stop being maintained

You can have perfect tasks and a beautiful harness and still have a useless suite if the graders are wrong. Most of the suites I’ve seen abandoned were abandoned because the grader signal lost its connection to user-visible quality. The team stopped trusting the numbers. The suite became a CI step nobody read.

Three flavours of grader, in roughly increasing order of cost and decreasing order of throughput.

Deterministic graders. Regex match, JSON schema validation, exact match against a gold string, executable check (“does this generated SQL run and return the right row count”). Use these first, wherever they fit. They are free, fast, infinitely reproducible, and don’t drift.

The trap: forcing a deterministic check on something that doesn’t fit (“the agent’s apology must contain the word sorry”) creates a grader that’s strict and wrong. The check passes only the specific phrasing you anticipated. If the model’s behaviour is genuinely deterministic-checkable, use a deterministic check. If it’s not, don’t reach for a regex that will create false negatives every time the model paraphrases.
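
The executable check deserves a sketch, because it’s the highest-value deterministic grader and the one people assume needs infrastructure. Assuming a small SQLite fixture database per task (the fixture_db and expected_rowcount fields are hypothetical extensions of the task format), it’s a few lines; the (output, task) signature matches the harness later in this post.

# Sketch of an executable grader: run the model's generated SQL against a
# per-task fixture database and compare row counts. fixture_db and
# expected_rowcount are assumed task fields, not part of the harness below.
import sqlite3

def sql_rowcount_match(output: str, task: dict) -> float:
    con = sqlite3.connect(task["fixture_db"])  # e.g. "fixtures/orders.db"
    try:
        rows = con.execute(output).fetchall()
    except sqlite3.Error:
        return 0.0  # SQL that doesn't run is a failure, not a crash
    finally:
        con.close()
    return float(len(rows) == task["expected_rowcount"])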

Semantic graders (LLM-as-judge, embedding similarity). Necessary when the right answer is “any reasonable refusal” or “a summary that mentions these three facts in any order”. The model you grade with is itself a parameter of your suite. It has its own biases, its own failure modes, its own cost.

Calibrate before you trust. Take a sample of fifty task outputs, score them by hand, score them with your judge, and compute the agreement. If the judge agrees with you on 90% of them, it’s usable. At 60%, it’s a coin flip. There’s a separate post planned on the inter-rater calibration stats you actually want here (Cohen’s kappa, Krippendorff’s alpha, when each one applies); the simple agreement rate is the right first check.
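
The first check really is this small. A sketch, assuming parallel lists of binary pass/fail labels, yours and the judge’s:

# Raw agreement between hand labels and judge labels. Kappa and friends
# are the follow-up post; this is the first-pass sanity check.
def agreement(human: list[int], judge: list[int]) -> float:
    assert len(human) == len(judge) and human, "need parallel, non-empty labels"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

A reasonable gate is agreement(human_labels, judge_labels) >= 0.9 before the judge is allowed anywhere near CI.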

The other thing about LLM judges: don’t use the same model as both candidate and judge. Same-family judges show systematic preference for outputs that look like their own writing. Mix the family on the judge side, or run two judges from different families and require they agree.

Human grading. The only ground truth. Slow, expensive, and the thing that calibrates everything else. Use sparingly: it’s expensive enough that you only want to use it where it’s the only option, and where the calibration signal pays for the cost. The protocol that worked for me was rotating two engineers through a one-hour weekly session, fifty samples each, two-thirds the LLM judge agreed on and one-third it disagreed on. The disagreements are where you learn whether the judge is converging or drifting.

Pairwise versus absolute

If you’re comparing two candidate versions (A/B between two prompts, or this model vs that model), pairwise preference is much higher signal than absolute scoring. “Which of these two answers is better” is easier to grade reliably than “rate this answer from 1 to 5”.

If you want to track absolute quality over time, you need an absolute scale, but absolute scores are high-variance. Watch long-run trends, not single-release deltas. A 5% absolute-score drop on one release means almost nothing in isolation; the same drop sustained over three releases means something.
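
One wrinkle worth handling in code: LLM judges have a known position bias, preferring whichever answer appears first. A minimal pairwise grader runs both orders and treats order-dependent verdicts as ties. Here judge is a hypothetical call to a different-family model that returns the single letter A or B.

# Pairwise preference with a position swap. judge(prompt) -> "A" | "B"
# is an assumed wrapper around a different-family judge model.
def pairwise_prefer(question: str, ans_a: str, ans_b: str, judge) -> str:
    prompt = "Which answer is better, A or B?\nQ: {q}\nA: {a}\nB: {b}"
    first = judge(prompt.format(q=question, a=ans_a, b=ans_b))
    second = judge(prompt.format(q=question, a=ans_b, b=ans_a))  # swapped
    if first == "A" and second == "B":
        return "a"    # preferred in both orders
    if first == "B" and second == "A":
        return "b"
    return "tie"      # order-dependent verdicts don't count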

Operations: how the suite earns its keep

A suite that runs once a quarter is a suite that doesn’t catch regressions. Three operational pieces matter.

CI integration. Cheap layers (0, 1) run on every PR. Medium-cost layers (2, 3) run on every merge to main or pre-release. Expensive layers (4) run on release gates and major model upgrades. The discipline of “what gates a merge” forces you to keep cheap layers cheap, which is the right pressure.

Cost management. A 200-task suite × five trials per task × $0.05 per call is $50 per CI run. Run that on every PR and you’re spending serious money on something nobody is reading. The tactics that work: cache LLM responses (provider-side or your own), sample down to 20 tasks for PR-time runs and the full 200 for nightly, parallelise aggressively, use cheaper models in the grading harness when only the candidate model is being tested.
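
The cache doesn’t need to be clever. A sketch of a disk cache keyed on candidate plus prompt, with call_model standing in for your actual SDK wrapper:

# Minimal disk cache around the model call. Key on candidate_id as well as
# the prompt: a cache shared across candidate versions would silently reuse
# old outputs and hide exactly the diffs the suite exists to catch.
import hashlib
from pathlib import Path

CACHE_DIR = Path(".eval_cache")

def cached_generate(prompt: str, candidate_id: str, call_model) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{candidate_id}\n{prompt}".encode()).hexdigest()
    hit = CACHE_DIR / key
    if hit.exists():
        return hit.read_text()
    result = call_model(prompt)
    hit.write_text(result)
    return result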

Versioning. You will change the eval set. You need to know what version a given historical run was scored on. Commit the eval set to git, treat it like code, and tag the version. When you compare quality across releases, you’re really comparing (version-of-suite × version-of-model). If the suite version changed, the comparison is invalid without re-running the old model on the new suite.
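
If the task files live in git alongside the code, stamping each run with the eval-set commit is one subprocess call; record it next to the run_id and the (version-of-suite × version-of-model) pairing stays recoverable later.

# Sketch: resolve the current eval-set version, assuming the task files
# are committed to the same git repo the suite runs from.
import subprocess

def eval_set_version() -> str:
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()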

Two failure modes worth naming

Drift. Production behaviour shifts (a new user cohort, a new use case, a viral tweet about your product). Your eval distribution stops matching reality. The number on the dashboard stays good but production quality is degrading. Defence: a quarterly trace-replay run that’s just “sample fifty new production traces, replay against the current and previous model, diff the outputs, manually triage anything weird”. The triaged cases go into the golden set as new tasks.

Overfitting to the suite. Prompt iteration converges on “pass the suite” instead of “be good”. The model gets better at the specific tasks you scored and worse at the underlying capability. Defence: a held-out validation set you don’t iterate on. The optimiser sees the trainset, you see the held-out set. When held-out scores stop tracking train scores, you’ve overfit. This is the same discipline as any ML training loop; people forget that prompt iteration is a training loop.
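
The held-out set needs no infrastructure either. A seeded split over the task files is enough, provided you have the discipline never to read holdout failures while iterating:

# Stable train/holdout split over task files. The seed keeps the split
# identical across sessions; iterate on train, report on holdout.
import glob
import random

def split_tasks(tasks_glob="tasks/*.json", holdout_frac=0.2, seed=7):
    paths = sorted(glob.glob(tasks_glob))
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * holdout_frac)
    return paths[cut:], paths[:cut]  # (train, holdout)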

Tooling: pick a framework for the harness, own the eval set yourself

This is the only section where I’ll name specific products and only because the question always comes up. The frameworks people actually use in 2026: LangSmith, Braintrust, Arize Phoenix, Langfuse, Patronus. They differ mostly at the margins (dashboards, integrations, self-host story, pricing). Pick one early, don’t agonise.

The thing not to outsource: the eval set itself (the tasks, the gold answers, the graders). Your tasks know your domain in a way no vendor can. The harness is commodity; the suite is yours.

If you’re building this from scratch with no framework, the absolute minimum is: a JSON file per task (input, expected output, grading function name), a grader registry mapping function names to deterministic and LLM-judge implementations, a runner that iterates tasks and emits a CSV with one row per (task, grader, candidate, score, run_id), and a simple report comparing two run IDs.

In its absolute thinnest form that’s around 100 lines of Python across three files. The three blocks below are the whole thing; the prose after them assumes you’ve read the code.

Task file (tasks/geo_001.json)
// tasks/geo_001.json
{
  "id": "geo_001",
  "input": "What is the capital of France?",
  "expected": "Paris",
  "grader": "exact_match"
}
Grader registry (graders.py)
# graders.py
GRADERS = {}

def grader(name):
    def deco(fn):
        GRADERS[name] = fn
        return fn
    return deco

@grader("exact_match")
def exact_match(output, task):
    return float(output.strip().lower() == task["expected"].strip().lower())

@grader("contains")
def contains(output, task):
    return float(task["expected"].lower() in output.lower())

# Add an llm_judge grader with the same signature when you need one.
# Calibrate it before wiring it into CI.
Runner and compare (run.py)
# run.py
import csv, json, glob, sys, uuid
from graders import GRADERS
from candidate import generate  # your model call: str -> str

def run(candidate_id, tasks_glob="tasks/*.json", out="runs.csv"):
    # One short run_id per invocation; every row written below shares it.
    run_id = str(uuid.uuid4())[:8]
    with open(out, "a", newline="") as f:
        w = csv.writer(f)
        for path in sorted(glob.glob(tasks_glob)):
            with open(path) as fh:
                task = json.load(fh)
            output = generate(task["input"])
            score = GRADERS[task["grader"]](output, task)
            w.writerow([run_id, task["id"], task["grader"], candidate_id, score])
    return run_id

def compare(run_a, run_b, path="runs.csv"):
    scores = {}
    for row in csv.reader(open(path)):
        rid, tid, _, _, score = row
        scores.setdefault(tid, {})[rid] = float(score)
    # Only diff tasks scored in both runs, so a task that errored in one
    # run can't masquerade as a regression.
    for tid, s in sorted(scores.items()):
        a, b = s.get(run_a), s.get(run_b)
        if a is not None and b is not None and a != b:
            print(f"{tid}: {a} -> {b}")

if __name__ == "__main__":
    cmd, *args = sys.argv[1:]
    if cmd == "run": print(run(args[0]))
    elif cmd == "compare": compare(args[0], args[1])

Two notes on that snippet. generate(input: str) -> str is whatever you wrap your model call in (Anthropic SDK, OpenAI SDK, your own retry-and-cache wrapper); it’s kept out of the example so the post doesn’t date itself to one SDK version. And the compare function only prints diffs where both runs have a score, which is deliberate: a task that errored in run A but completed in run B won’t appear as a fake regression.

That’s it. One JSON file per task, a decorator-registered grader, a runner that appends to a CSV, a diff between two run IDs. You’d add an LLM-judge grader (same signature, calibrate it first), retries, parallelism, and a nicer report before you called this production-grade. But this is enough to gate a deploy. The frameworks add storage, dashboards, history, and integrations. None of those are the suite. The suite is the tasks.
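
To actually gate a deploy you need one more piece on top of compare: an exit code. A sketch of a gate script over the same five-column CSV, assuming CI treats a non-zero exit as a failed check:

Deploy gate (gate.py)
# gate.py
import csv, sys

def regressions(baseline_id, candidate_id, path="runs.csv"):
    scores = {}
    for rid, tid, _, _, score in csv.reader(open(path)):
        scores.setdefault(tid, {})[rid] = float(score)
    # A regression is any task scored in both runs that got worse.
    return [
        tid for tid, s in scores.items()
        if baseline_id in s and candidate_id in s and s[candidate_id] < s[baseline_id]
    ]

if __name__ == "__main__":
    bad = regressions(sys.argv[1], sys.argv[2])
    for tid in bad:
        print(f"regressed: {tid}")
    sys.exit(1 if bad else 0)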

The failure-analysis loop, end to end

This is the loop most teams don’t have written down anywhere. A failing CI run becomes a fix.

A new failing task in CI: yesterday’s prompt passed, today’s prompt fails. The grader output is “wrong answer: expected Paris, got France.”

Pull a few similar failures. Look at the trace for that task and three or four other tasks tagged with the same concept (geography, capital cities, whatever the cluster is). Is this one task or a cluster? One task is probably noise. A cluster is a real regression.

Form a hypothesis. “The prompt update I made yesterday weakened the ‘be specific’ instruction; the model is generalising answers.” This is a guess. It’s enough to act on.

Test the hypothesis. Run the failing cluster against the previous prompt (which passed) and the new prompt (which fails). Confirm the cluster fails on new and passes on old. If the cluster behaviour is the same on both, the change wasn’t the cause and you need a different hypothesis.

Make a targeted fix. Either revert the relevant change, or modify the prompt to address the regression specifically. Resist the urge to make broader changes here; you’ll lose the ability to attribute the fix later.

Run the failure cluster plus a small holdout. The cluster confirms the fix. The holdout confirms you didn’t make something else worse.

Commit, with a message naming the cluster and the new behaviour you’re keeping.

Five to fifteen minutes per failure if you have the discipline to do all six steps. The reason most teams don’t is they jump from step one to step five without two through four, and end up with a fix that’s a vibe rather than a verified targeted change.

Anti-patterns I see almost every time

The giant eval notebook nobody runs. This is the failure mode the post opened with and the most common one I see in the wild. If your suite isn’t runnable as a CLI command in CI, you don’t have a suite, you have a document.

Grading the hard examples by hand each release. Often done with the promise of “we’ll automate it later”, which rarely happens. If a task needs human grading every release, it doesn’t belong in the golden set; it belongs in a manual-review pile, run separately and on its own cadence.

LLM-as-judge without calibration. Trusting an off-the-shelf gpt-4o-judge on day one because the marketing said it agrees with humans 90% of the time on the benchmark in the paper. Maybe it does on that benchmark, but you have no idea whether it does on your domain, and the only way to find out is the fifty-sample calibration. Run it before you wire the judge into CI, not after the dashboard has been wrong for a month.

Same model as judge and candidate. Judges tend to prefer their own writing style, which biases the scores in a way that looks like signal but is really self-preference. Use a different family on the judge side, or run two judges from different families and require they agree.

Suite-bloat. The instinct is to add a task every time a regression slips through. The trap is never removing tasks that became trivially passing once the underlying bug got fixed. Suites swell from 50 tasks to 500 and most of the 500 are noise that nobody reads. A useful quarterly habit is to kick out anything with a 100% pass rate over the last three releases, archiving them in a “trivially passing” pile in case you need them again.
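
The kick-out is scriptable against the same runs.csv. In this sketch, release_run_ids is the list of run IDs you’ve tagged as releases, bookkeeping the harness above doesn’t do for you:

# Find tasks that scored 1.0 in every one of the given release runs;
# these are candidates for the "trivially passing" archive.
import csv

def trivially_passing(release_run_ids, path="runs.csv"):
    scores = {}
    for rid, tid, _, _, score in csv.reader(open(path)):
        if rid in release_run_ids:
            scores.setdefault(tid, []).append(float(score))
    return [
        tid for tid, vals in scores.items()
        if len(vals) == len(release_run_ids) and all(v == 1.0 for v in vals)
    ]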

Optimising the suite, not the agent. Prompt iteration sessions where the dashboard number goes up but the agent feels worse on a free-form test you ran by hand. That’s the sign your suite has stopped tracking what users care about, and the right response is to refresh tasks from current production traces rather than tune harder against the existing set.

What’s next

You will not have everything. You will have something that gates a deploy and tells you on Tuesday morning that yesterday’s prompt regressed three tasks. That is enough to start. The rest grows from there.

Once the suite is running, the two follow-up posts in this series cover the parts that come next: calibrating LLM judges with Cohen’s kappa and the rest of the inter-rater stats (the part this post hand-waves through), and how to iterate prompts without overfitting your suite (the loop that uses the suite once you have one).

If you want the conceptual framework behind why this works, Part 1 of the agent evals series covers the observability primitives (run, trace, thread), what to evaluate at which level, and the pass^k reliability metric. This post is the hands-on version of the same view.
