Tag: evals
All articles tagged "evals".
-
Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench
Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.
-
LLMs playing Just One: Why Same-Model LLM Ensembles Mode-Collapse
Four Claude Haiku instances, each asked independently for a clue for 'toast', all reply 'bread'. Four Sonnets do it more often. Four Opuses do it even more often. I built a tiny benchmark using the board game Just One to measure when LLM ensembles collapse and what makes them stop. The mixed-family ensemble plus an anti-correlation prompt hits 3.25× the single-model baseline.
-
Breaking Down Agent Evals (Part 2): τ-bench Deep Dive
Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.
-
Breaking Down Agent Evals (Part 1): A Practitioner's Guide
Part 1 of 3. Why traces (not code) are the source of truth in agents, the three observability primitives, run types, the metrics that matter at each level, the pass^k reliability metric, a five-step methodology for building an eval suite, and a filter-funnel argument for why no single eval method is enough.