Breaking Down Agent Evals (Part 1B): Eval Calibration
A primer on eval calibration: what it means for your scoring pipeline to be trustworthy, the four levels (rubric, human-to-human, LLM-to-human, LLM-to-LLM), the common biases that turn a good-looking dashboard into a fiction, and how to read Cohen's kappa without the textbook. Built around small interactive applets.
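Before the applets, it may help to see what sits behind the kappa number. The sketch below is a minimal illustration, not the pipeline used in this series: it computes Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e is the agreement you would expect by chance given how often each rater uses each label. The function name `cohens_kappa` and the `human` / `llm` label lists are made up for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' usage rates, then sum.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label for everything; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: ten eval items scored by a human and by an LLM judge.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
llm = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

print(round(cohens_kappa(human, llm), 2))  # 0.58
```

In this made-up example the two raters agree on 8 of 10 items, which looks like 80% on a dashboard, but both of them say "pass" 60% of the time, so 52% agreement would be expected by chance alone. Kappa discounts that baseline, which is why it lands at 0.58 rather than 0.80.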