Notes on language models, agents, and the ways they break.
I write about the internals of LLM systems: decoding, evals, memory, and multi-agent coordination, usually with a benchmark or a worked example to anchor each post. Long-form, occasionally interactive, mostly things I wish someone had written before I had to figure them out.
Worth reading first
-
GEPA: How an LLM Can Write a Better Prompt Than RL Can Train One
A walkthrough of GEPA (Agrawal et al., ICLR 2026), the reflective prompt optimiser that beats GRPO with up to 35× fewer rollouts by reading its own trace logs in plain English. The four-step loop, a worked iteration on a multi-hop QA system, the Pareto trick that keeps the candidate pool diverse, and where 98% of the rollout budget actually goes.
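Not the paper's algorithm, just the shape of it, with toy stand-ins: evaluate and reflect_and_rewrite below are placeholders for running the LLM program over a task set and for an LLM proposing a new prompt after reading its own traces.

```python
import random

# Toy stand-ins: in GEPA, evaluation runs the LLM program over a task set and
# reflection is an LLM rewriting the prompt after reading execution traces.
def evaluate(prompt: str) -> list[float]:
    rng = random.Random(prompt)               # deterministic per candidate
    return [rng.random() for _ in range(5)]   # one score per task

def reflect_and_rewrite(prompt: str) -> str:
    return prompt + " (revised)"              # an LLM edit in the real thing

def gepa_like_search(seed: str, budget: int = 20) -> str:
    """Schematic of a GEPA-style loop, NOT the paper's exact algorithm:
    reflect, rewrite, evaluate, and keep any candidate that matches or beats
    the pool's best score on at least one task, so the pool stays diverse
    instead of collapsing onto one average-best prompt."""
    pool = {seed: evaluate(seed)}
    for _ in range(budget):
        parent = random.choice(list(pool))
        child = reflect_and_rewrite(parent)
        child_scores = evaluate(child)
        best = [max(s[t] for s in pool.values()) for t in range(len(child_scores))]
        if any(c >= b for c, b in zip(child_scores, best)):
            pool[child] = child_scores
    return max(pool, key=lambda p: sum(pool[p]))

print(gepa_like_search("Answer the question using the given passages."))
```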
-
Setting Logits to Negative Infinity: How LLMs Actually Output JSON
Structured outputs aren't a validation layer; they're a decoding-time intervention. How logit masking actually works, why token boundaries make it hard, and why reordering one field in your Pydantic schema can move accuracy by 90 points.
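To make "decoding-time intervention" concrete, a minimal sketch with a toy vocabulary (no real tokenizer or grammar engine): tokens outside the currently allowed set get their logits set to negative infinity, so softmax gives them exactly zero probability.

```python
import math

def mask_logits(logits: list[float], allowed: set[int]) -> list[float]:
    """Set every disallowed token's logit to -inf so softmax gives it zero mass."""
    return [l if i in allowed else -math.inf for i, l in enumerate(logits)]

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary: after an opening '{', a JSON grammar might only allow '"' or '}'.
vocab = ['{', '}', '"', 'a', ':']
logits = [1.2, 0.3, 0.1, 2.5, 0.7]   # raw model scores; 'a' would be the argmax
allowed = {1, 2}                      # ids of '}' and '"'

probs = softmax(mask_logits(logits, allowed))
print([round(p, 3) for p in probs])   # all mass on '}' and '"', zero elsewhere
```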
-
Prompts are Hyperparameters
A practitioner's tour of DSPy, MIPROv2 and GEPA. The reframe (prompts are parameters of an LLM program, not the artefact you ship), the five axes any optimiser can tune, how MIPROv2 and GEPA actually work, where this set of methods quietly disappoints, and a decision tree for picking one.
-
LLMs playing Just One: Why Same-Model LLM Ensembles Mode-Collapse
Four Claude Haiku instances, asked independently for a clue for 'toast', all reply 'bread'. Four Sonnets collide more often; four Opuses more often still. I built a tiny benchmark around the board game Just One to measure when LLM ensembles collapse onto the same answer and what makes them stop. A mixed-family ensemble with an anti-correlation prompt hits 3.25× the single-model baseline.
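The post's metric is its own; as a hypothetical sketch of the measurement's shape, count distinct clues across one round of independent samples:

```python
from collections import Counter

def distinct_fraction(clues: list[str]) -> float:
    """Distinct clues divided by ensemble size: 1.0 is fully diverse,
    1/len(clues) is total mode collapse."""
    counts = Counter(c.strip().lower() for c in clues)
    return len(counts) / len(clues)

print(distinct_fraction(["bread", "bread", "bread", "bread"]))  # 0.25, collapsed
print(distinct_fraction(["bread", "butter", "jam", "crunch"]))  # 1.0, diverse
```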
-
What an eval suite is, and how to build one
An eval suite is not one thing. It is a layered set of checks with different costs, latencies, and confidence levels. This post walks through what the layers are, how to build the dataset (the part most teams under-do), how grading actually works in practice, and how the whole thing wires into your CI.
-
Breaking Down Agent Evals (Part 3): τ²-bench and τ³-bench
Part 3 of 3. How τ²-bench introduced dual control by giving the user its own tools, what τ³-bench added with sprawling document retrieval and full-duplex voice, and what production agent eval still does not measure.
-
Breaking Down Agent Evals (Part 2): τ-bench Deep Dive
Part 2 of 3. How τ-bench unified a simulated user, domain policies, and a real-world consequence model into one benchmark, why pass^k changed how the field talks about agent quality, and how its design principles transfer to your own eval suite.
-
Breaking Down Agent Evals (Part 1B): Eval Calibration
A primer on eval calibration: what it means for your scoring pipeline to be trustworthy, the four levels (rubric, human-to-human, LLM-to-human, LLM-to-LLM), the common biases that turn a good-looking dashboard into a fiction, and how to read Cohen's kappa without the textbook. Built around small interactive applets.
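Cohen's kappa is compact enough to carry along while reading (standard formula, toy labels): observed agreement, discounted by the agreement two raters would reach by chance alone.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters pick the same label independently.
    pa, pb = Counter(rater_a), Counter(rater_b)
    expected = sum(pa[l] / n * pb[l] / n for l in pa.keys() | pb.keys())
    return (observed - expected) / (1 - expected)

# Two graders label the same 10 outputs pass/fail and agree on 9 of them.
a = ["pass"] * 6 + ["fail"] * 4
b = ["pass"] * 5 + ["fail"] * 5
print(round(cohens_kappa(a, b), 3))  # 0.8: strong, but not the 0.9 raw agreement
```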
-
Breaking Down Agent Evals (Part 1A): Building the Eval Suite, Hands-On
The code companion to Part 1. The same five-step methodology, walked file by file: the toy agent, the eval-case schema, the JSONL dataset, an exact-match grader, an LLM judge, and the runner that ties it together and exits non-zero on regression.
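None of the repo's actual code; a minimal sketch of the last three pieces in that list, under assumed names (cases.jsonl, a hard-coded stand-in agent): an exact-match grader and a runner that exits non-zero so CI fails on regression.

```python
import json
import sys

def exact_match(expected: str, actual: str) -> bool:
    """Cheapest grading layer: normalised string equality."""
    return expected.strip().lower() == actual.strip().lower()

def run_suite(path: str, agent) -> int:
    """Grade every JSONL case ({"input": ..., "expected": ...}) and count failures."""
    failures = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            output = agent(case["input"])
            if not exact_match(case["expected"], output):
                failures += 1
                print(f"FAIL: {case['input']!r} -> {output!r}")
    return failures

if __name__ == "__main__":
    toy_agent = lambda q: "4" if q == "2+2?" else "unsure"   # stand-in agent
    sys.exit(1 if run_suite("cases.jsonl", toy_agent) else 0)  # non-zero fails CI
```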
-
Breaking Down Agent Evals (Part 1): A Practitioner's Guide
Part 1 of a 3-part series. Why traces (not code) are the source of truth in agents, the three observability primitives, run types, the metrics that matter at each level, the pass^k reliability metric, a five-step methodology for building an eval suite, and a filter-funnel argument for why no single eval method is enough.
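For orientation before the series, a minimal estimator of pass^k in the τ-bench sense: the probability that all k independent attempts at a task succeed, estimated from n recorded trials with c successes.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate that k fresh i.i.d. trials would ALL succeed,
    given c successes observed in n trials (requires k <= n)."""
    return comb(c, k) / comb(n, k)

# A task that passes 80% of the time looks fine at k=1 and dire at k=8.
for k in (1, 2, 4, 8):
    print(k, round(pass_hat_k(10, 8, k), 3))   # 0.8, 0.622, 0.333, 0.022
```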
-
Why Streaming LLMs Need Attention Sinks
A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how a four-token reservation lets streaming inference run to four million tokens with no quality loss.
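The eviction policy is small enough to sketch over token ids instead of real KV tensors; the four-token sink matches the post, while the window size here is shrunk for illustration.

```python
def evict(cache: list[int], n_sink: int = 4, window: int = 8) -> list[int]:
    """StreamingLLM-style eviction: always keep the first n_sink positions
    (the attention sinks) plus a sliding window of the most recent tokens;
    everything in between is dropped."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache: list[int] = []
for token in range(20):            # stream 20 tokens through a tiny cache
    cache = evict(cache + [token])
print(cache)                       # [0, 1, 2, 3, 12, 13, ..., 19]
```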
-
How PPO Actually Works
PPO walked through from vanilla policy gradients, through the trust region story that motivates it, to the clipped objective you actually run. Intuition first, math when it pays off. Written for ML people who have not done much RL.
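The endpoint of that walk as one standalone function, with invented batch numbers: the clipped surrogate takes the pessimistic minimum of the raw importance-weighted advantage and its clipped counterpart.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2) -> float:
    """PPO's clipped surrogate: take the pessimistic min of the raw
    importance-weighted advantage and its clipped version, so moving the
    policy ratio outside [1-eps, 1+eps] earns no extra reward."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))   # negated: we minimise

logp_old = np.log(np.array([0.30, 0.10, 0.25]))
logp_new = np.log(np.array([0.45, 0.05, 0.26]))   # one action moved a lot
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_loss(logp_new, logp_old, adv))
```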
-
Context Engineering for Long Agent Loops: The Case for Recitation
A look at why long contexts quietly break LLMs, why important information is easier to use at the boundaries than in the middle, and why agents that periodically restate their goals at the end of the context often work better.
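A hypothetical sketch of the pattern (the function and names are mine, not the post's): each turn, restate the goal at the end of the assembled context, where the model uses it most reliably, instead of only at the top, hundreds of steps back.

```python
def build_prompt(goal: str, history: list[str], max_items: int = 50) -> str:
    """Assemble an agent turn with the goal recited at the END of the context,
    at the recency boundary, rather than stated once at the start."""
    recent = history[-max_items:]   # keep only the most recent steps
    return "\n".join(
        [f"Goal: {goal}", *recent,
         f"Reminder before you act: the goal is still: {goal}"]
    )

history = [f"step {i}: tool output ..." for i in range(200)]
print(build_prompt("migrate the database schema without downtime", history)[:200])
```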