Tag: inference
All articles tagged "inference".
-
Setting Logits to Negative Infinity: How LLMs Actually Output JSON
Structured outputs aren't a validation layer; they're a decoding-time intervention. How logit masking actually works, why token boundaries make it hard, and why reordering one field in your Pydantic schema can move accuracy by 90 points.
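A minimal sketch of the core move the teaser names, assuming a 1-D PyTorch logit vector; `allowed_token_ids` is a hypothetical stand-in for whatever token set the JSON grammar permits at the current decoding step:

```python
import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    """Set every disallowed token's logit to -inf so softmax gives it probability 0."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0  # allowed tokens keep their original logits
    return logits + mask

# Sampling from the masked distribution can only pick grammar-legal tokens:
# probs = torch.softmax(mask_logits(logits, allowed_token_ids), dim=-1)
```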
-
Why Streaming LLMs Need Attention Sinks
A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how a four-token reservation lets streaming inference run to four million tokens with no quality loss.
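A minimal sketch of the four-token reservation, assuming KV-cache tensors indexed by sequence position along the first dimension; the names and window size are illustrative, not the StreamingLLM API:

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             num_sinks: int = 4, window: int = 1020):
    """Sliding-window eviction that always keeps the first `num_sinks`
    positions (the attention sinks) plus the most recent `window` tokens."""
    seq_len = keys.shape[0]
    if seq_len <= num_sinks + window:
        return keys, values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sinks),                  # sink tokens at the start
        torch.arange(seq_len - window, seq_len),  # recent sliding window
    ])
    return keys[keep], values[keep]
```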