Tag: agents

All the articles with the tag "agents".

Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench

Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.

Published: 10 May, 2026
· agents / evals / benchmarks

Breaking Down Agent Evals (Part 2): τ-bench Deep Dive

Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.

Published: 15 Mar, 2026
· agents / evals / benchmarks

Breaking Down Agent Evals (Part 1): A Practitioner's Guide

Part 1 of a 3-part series. Why traces (not code) are the source of truth for agents, the three observability primitives, run types, the metrics that matter at each level, the pass^k reliability metric, a five-step methodology for building an eval suite, and a filter-funnel view of why no single eval method is enough.

Published: 10 Feb, 2026
· agents / evals / observability

How to Mitigate the Lost-in-the-Middle Effect in LLMs

A look at why long contexts quietly break LLMs, why important information is easier to use at the boundaries than in the middle, and why agents that periodically restate their goals at the end of the context often work better.

Published: 15 Aug, 2025
· llm / context-engineering / agents