Tag: inference
All articles tagged "inference".
-
Setting Logits to Negative Infinity: How LLMs Actually Output JSON
Structured outputs aren't a validation layer; they're a decoding-time intervention. How logit masking actually works, why token boundaries make it hard, and why reordering one field in your Pydantic schema can move accuracy by 90 points.
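A minimal sketch of the core move the teaser names, assuming a 1-D PyTorch logit vector; `allowed_token_ids` is a hypothetical stand-in for whatever token set the JSON grammar permits at the current decoding step:

```python
import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    """Set every disallowed token's logit to -inf so softmax gives it probability 0."""
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0  # allowed tokens keep their original logits
    return logits + mask

# Sampling from the masked distribution can only pick grammar-legal tokens:
# probs = torch.softmax(mask_logits(logits, allowed_token_ids), dim=-1)
```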
-
Why Streaming LLMs Need Attention Sinks
A walkthrough of attention sinks: what they are, why softmax produces them by accident, why naive sliding-window inference collapses without them, and how a four-token reservation lets streaming inference run to four million tokens with no quality loss.
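A minimal sketch of the four-token reservation, assuming KV-cache tensors indexed by sequence position along the first dimension; the names and window size are illustrative, not the StreamingLLM API:

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             num_sinks: int = 4, window: int = 1020):
    """Sliding-window eviction that always keeps the first `num_sinks`
    positions (the attention sinks) plus the most recent `window` tokens."""
    seq_len = keys.shape[0]
    if seq_len <= num_sinks + window:
        return keys, values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sinks),                  # sink tokens at the start
        torch.arange(seq_len - window, seq_len),  # recent sliding window
    ])
    return keys[keep], values[keep]
```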