So it’s 1/3 the price of Opus 4.1…
> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens
…and potentially uses a lot fewer tokens?
Excited to stress test this in Claude Code, looks like a great model on paper!
Also, it's becoming increasingly important to look at token usage rather than just per-token cost. They say Opus 4.5 (with high reasoning) used 50% fewer tokens than Sonnet 4.5. So you get a higher score on SWE-bench Verified, you pay more per token, but you use fewer tokens and overall pay less!
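Rough back-of-the-envelope sketch of that trade-off (the prices and token counts below are purely illustrative, not Anthropic's actual pricing or benchmark numbers):

    # Illustrative numbers only -- not actual Anthropic pricing or benchmark token counts.
    sonnet_price_per_mtok = 15.0   # hypothetical $/M output tokens
    opus_price_per_mtok   = 25.0   # hypothetical $/M output tokens (pricier per token)

    sonnet_output_tokens = 1_000_000                      # hypothetical tokens to finish a task suite
    opus_output_tokens   = sonnet_output_tokens * 0.5     # the "50% fewer tokens" claim

    sonnet_cost = sonnet_output_tokens / 1e6 * sonnet_price_per_mtok   # $15.00
    opus_cost   = opus_output_tokens   / 1e6 * opus_price_per_mtok     # $12.50

    print(f"Sonnet: ${sonnet_cost:.2f}, Opus: ${opus_cost:.2f}")
    # Higher per-token price, fewer tokens, lower total spend.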
Especially in extraction tasks, this shows up as inventing data or rationalizing around clear roadblocks.
My biggest hack so far is giving them an out named "edge_case" and telling them it is REALLY helpful if they identify edge cases. Simply renaming "fail_closed" or "dead_end" options to "edge_case", with helpful wording, makes Qwen models adhere to their prompting more closely.
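A minimal sketch of what that renaming looks like in practice (the schema, field names, and prompt wording here are hypothetical, not from any particular framework):

    # Hypothetical structured-output schema for an extraction task.
    # Instead of a "dead_end" / "fail_closed" escape hatch, the model gets an
    # "edge_case" option framed as a helpful contribution rather than a failure.
    EXTRACTION_SCHEMA = {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                # was: ["ok", "dead_end"] -- renamed so declining to invent data
                # reads as a win ("you found an edge case!") instead of a loss
                "enum": ["ok", "edge_case"],
            },
            "value": {"type": ["string", "null"]},
            "edge_case_reason": {"type": ["string", "null"]},
        },
        "required": ["status", "value"],
    }

    SYSTEM_PROMPT = (
        "Extract the requested field. If the document doesn't clearly contain it, "
        "set status to 'edge_case' and explain why -- identifying edge cases is "
        "REALLY helpful and counts as a successful result."
    )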
It feels like there are hundreds of these small hacks that people must have discovered... why isn't there a centralized place where these learnings are recorded?
Snark aside, inference is still being done at a loss. Anthropic, the most profitable AI vendor, is operating at roughly a -140% margin. xAI is the worst, at somewhere around -3,600%.
But posting something positive and getting slammed in the comments? That's depressing. So the barrier to posting something positive seems higher.
https://www.sciencedirect.com/science/article/abs/pii/002210...