On June 30, 2026, Anthropic released Claude Sonnet 5, calling it the most agentic Sonnet model the company has shipped. The pitch is simple. Get most of what the flagship Opus 4.8 model can do, at a fraction of the price. For anyone evaluating whether to trust a language model with production work, that pitch only matters if the numbers hold up once you look past the announcement.
We pulled the benchmark data Anthropic published, an independent code review study run against the new model, and reactions from developers who tested it on launch day. The real story is not whether Sonnet 5 is good. It is where it gets worse while getting better, and that part never makes it into the launch post.
What Anthropic Is Claiming
Anthropic positions Sonnet 5 as capable of planning multi-step tasks, using tools like browsers and terminals, and checking its own output without being asked to. The company says it narrows the gap with Opus 4.8 while staying meaningfully cheaper, and that teams can raise or lower the model's effort setting to trade cost for capability on a given task.
Pricing tells the real story of where Anthropic wants this model to sit. Through August 31, 2026, Sonnet 5 costs $2 per million input tokens and $10 per million output tokens. After that, standard pricing is $3 and $15. Opus 4.8 costs $5 and $25 for the same volumes. Sonnet 5 is also the new default model for Free and Pro tier users. One detail worth knowing if you are estimating costs: Sonnet 5 uses an updated tokenizer, and the same text can map to roughly 1.0 to 1.35 times as many tokens as before, which quietly eats into the headline price advantage. Full pricing and capability details are in Anthropic's announcement.
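To make the tokenizer effect concrete, here is a rough Python sketch of the arithmetic. It assumes the 1.0 to 1.35 multiplier applies equally to input and output tokens, which the announcement does not spell out, and the function names are ours, not part of any Anthropic SDK.

```python
# Prices quoted above, in USD per million tokens.
SONNET_5_INTRO = {"input": 2.0, "output": 10.0}     # through August 31, 2026
SONNET_5_STANDARD = {"input": 3.0, "output": 15.0}  # afterward


def effective_price(price_per_million, token_multiplier):
    """Price per million tokens as counted by the *previous* tokenizer.

    Assumption: the same text produces `token_multiplier` times as many
    tokens under the new tokenizer, so the bill scales by that factor.
    """
    if token_multiplier <= 0:
        raise ValueError("token_multiplier must be positive")
    return price_per_million * token_multiplier


def effective_range(prices, low=1.0, high=1.35):
    """Best-case and worst-case effective price for each token type."""
    return {
        kind: (
            round(effective_price(price, low), 2),
            round(effective_price(price, high), 2),
        )
        for kind, price in prices.items()
    }


if __name__ == "__main__":
    print("intro   ", effective_range(SONNET_5_INTRO))
    print("standard", effective_range(SONNET_5_STANDARD))
```

Under these assumptions, the worst case turns the introductory $2 / $10 into roughly $2.70 / $13.50 per million old-tokenizer tokens, and the standard $3 / $15 into roughly $4.05 / $20.25. Where your own text lands in the range is something to measure on a sample of real prompts rather than assume.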
What The Benchmarks Actually Show
On SWE-bench Pro, a harder agentic coding benchmark built to resist test set leakage, Sonnet 5 scores 63.2%, up from Sonnet 4.6's 58.1%. Opus 4.8 still leads at 69.2%. On Terminal-Bench 2.1, Sonnet 5 jumps to 80.4% from Sonnet 4.6's 67.0%, the single biggest gain in the comparison. On OSWorld-Verified, a computer use test, Sonnet 5 scores 81.2% against Sonnet 4.6's 78.5%. On Humanity's Last Exam with tools enabled, Sonnet 5 reaches 57.4%, close to Opus 4.8's 57.9%. Full comparison numbers are in this benchmark breakdown.
Sonnet 4.6 vs. Sonnet 5 vs. Opus 4.8
Numbers below are listed in that order: Sonnet 4.6, then Sonnet 5, then Opus 4.8.
- SWE-bench Pro (agentic coding): 58.1% vs. 63.2% vs. 69.2%
- Terminal-Bench 2.1: 67.0% vs. 80.4% vs. not reported
- OSWorld-Verified (computer use): 78.5% vs. 81.2% vs. not reported
- Humanity's Last Exam, with tools: 46.8% vs. 57.4% vs. 57.9%
- Input / output price per million tokens: $3 / $15 vs. $2 / $10 (through August 31, 2026; $3 / $15 afterward) vs. $5 / $25
The pattern is the same on every benchmark where Opus 4.8 numbers are reported. Sonnet 5 narrows the gap with Opus. It does not close it. On the hardest agentic coding test in the set, Opus is still 6 percentage points ahead, and Anthropic's own comparison charts show Opus outperforming Sonnet 5 at every effort level except the lowest. The near-Opus performance claim depends heavily on which task and which effort setting you are looking at.
What Happened When Reviewers Actually Tested It
CodeRabbit ran Sonnet 5 through its own code review benchmark and found a genuine tradeoff, not a clean win. Precision improved from roughly 29% under Sonnet 4.6 to 38 to 40% under Sonnet 5, meaning far fewer false-positive flags in code review. But the bug catch rate fell to 50 to 51%, down from Sonnet 4.6's 63% and their production baseline's 57%. The model got pickier. It also got worse at finding real bugs.
That is the tradeoff that actually matters, not the headline SWE-bench score, because it is exactly the failure mode that decides whether an AI system is safe to run unsupervised. A reviewer that flags fewer wrong things but also catches fewer real bugs is not obviously more accurate. It depends entirely on which mistake costs you more, a missed production bug or a wasted engineering hour chasing a false alarm. CodeRabbit also found that pushing the model to its highest effort setting roughly doubled the cost without finding meaningfully more bugs, a diminishing return worth knowing before you flip every task to maximum effort by default.
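The "which mistake costs you more" question can be put into numbers. The Python sketch below takes the precision and catch rate figures reported above and asks how expensive a missed bug has to be, relative to a false alarm, before the older profile comes out cheaper. The midpoints used for Sonnet 5 (39% precision, 50.5% catch rate), the count of 100 real bugs, and all names in the code are our assumptions for illustration, not part of CodeRabbit's methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewerProfile:
    name: str
    precision: float   # share of flags that are real bugs
    catch_rate: float  # share of real bugs that get flagged


# Figures from the study as reported above; Sonnet 5 uses range midpoints.
SONNET_46 = ReviewerProfile("Sonnet 4.6", precision=0.29, catch_rate=0.63)
SONNET_5 = ReviewerProfile("Sonnet 5", precision=0.39, catch_rate=0.505)


def expected_errors(profile, real_bugs):
    """Return (missed_bugs, false_alarms) for a given number of real bugs."""
    caught = profile.catch_rate * real_bugs
    missed = real_bugs - caught
    # precision = caught / total_flags, so total_flags = caught / precision
    false_alarms = caught / profile.precision - caught
    return missed, false_alarms


def expected_cost(profile, real_bugs, cost_missed_bug, cost_false_alarm):
    """Total cost of both error types, in whatever unit the costs use."""
    missed, false_alarms = expected_errors(profile, real_bugs)
    return missed * cost_missed_bug + false_alarms * cost_false_alarm


def break_even_ratio(a, b, real_bugs=100):
    """Ratio cost_missed_bug / cost_false_alarm at which a and b cost the same.

    Raises ZeroDivisionError if both profiles miss the same number of bugs.
    """
    missed_a, alarms_a = expected_errors(a, real_bugs)
    missed_b, alarms_b = expected_errors(b, real_bugs)
    return (alarms_a - alarms_b) / (missed_b - missed_a)


if __name__ == "__main__":
    print(round(break_even_ratio(SONNET_46, SONNET_5), 2))
```

With these inputs the break-even ratio comes out at about six. If a missed production bug costs your team more than roughly six times what a false alarm costs, the older, noisier profile is the cheaper one under this simple model. If false alarms are relatively expensive, the newer profile wins. The model ignores bug severity and reviewer fatigue, so treat it as a starting point for your own numbers.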
What Developers Are Saying
Reaction on Hacker News split along familiar lines. Several commenters praised the agentic improvements, noting Sonnet 5 finishes tasks that used to stall out with earlier Sonnet models and checks its own work without being prompted. Zapier's Daniel Shepard told TechCrunch the model completed complex automation workflows that "used to stall halfway," calling it a "no brainer" for day-to-day automation. Lovable's Fabian Hedin praised its safety behavior, saying a model that knows when to say no is just as important as one that knows how to build.
Other developers were less convinced. Multiple Hacker News commenters pointed out that Opus 4.8 beats Sonnet 5 on a cost-per-task basis in most scenarios, with one summing it up as "the cost per task chart is telling me that I should never use Sonnet 5 above medium effort level, Opus always performs better." Another compared it unfavorably to a competing open model, calling it "basically GLM-5.2 level, at 2x cost, but also 2x faster." One user reported the model refusing a requested change and then denying it had, calling it "much lazier than any Claude model I have used." The complaint was not that Sonnet 5 is a bad model. It was that its position between the cheaper and flagship options is genuinely confusing, and picking the right one for a task now takes testing your own workload rather than trusting a chart.
What This Means For Agent Tooling
The self-verification behavior Anthropic highlights, the model checking its own output without being told to, is the capability that matters most for anyone building on top of Claude rather than just chatting with it. Every agent tool inherits whatever reliability the underlying model has, whether it is a browser automation script, an inbox triage agent, or a support bot running unattended. A model that catches more of its own mistakes needs less scaffolding wrapped around it to catch them instead.
That is also why the CodeRabbit result above matters past code review. Precision going up while the catch rate goes down is not a coding-specific problem. It shows up in any agent that has to decide when to flag something versus when to just act. We build production AI Workers for clients, and the same pattern holds across the agent stacks we have tested, not only Claude. A more agentic model changes what a support agent, an email triage tool, or an ops assistant can safely do without a human watching, and it changes unevenly. Some tasks get meaningfully safer to hand off. Others do not move at all, no matter which model you swap in underneath.
What This Means If You Are Running AI In Production
None of this makes Sonnet 5 a bad release. It is a real improvement over Sonnet 4.6 on nearly every measured axis. But the CodeRabbit result is a clean example of something we see across every model generation. Benchmark gains do not automatically translate into production reliability, because better on average and safe to trust without a human checking the output are two different claims. A model can get more precise and less thorough in the same release. It can score higher on a public benchmark and still fail differently on your specific data. That part never makes it into the launch post.
This is the problem behind AI output reliability. Raw model capability is necessary, but it is not sufficient. Whichever frontier model your team runs today, in six months you will be running a different one, and each new release ships with its own failure pattern. The teams that stay reliable are the ones with a validation layer that catches what the model itself misses, independent of which model is underneath.
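As a minimal illustration of what "independent of which model is underneath" can mean in code, the Python sketch below wraps any model behind a plain callable and runs the same checks on whatever comes back. Everything here is hypothetical: the `Result` type, the validator functions, and the stand-in model are ours, and a real validation layer would check far more than length and emptiness.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Any model is treated as "prompt in, text out", so swapping vendors or
# versions does not touch the validation code. (Hypothetical interface.)
Model = Callable[[str], str]
# A validator returns an error message, or None if the output passes.
Validator = Callable[[str], Optional[str]]


@dataclass
class Result:
    output: str
    errors: List[str]

    @property
    def needs_human(self) -> bool:
        """True when at least one check failed and a person should look."""
        return bool(self.errors)


def run_with_validation(model: Model, prompt: str,
                        validators: List[Validator]) -> Result:
    """Call the model once, then run every validator on its output."""
    output = model(prompt)
    errors = [msg for check in validators
              if (msg := check(output)) is not None]
    return Result(output=output, errors=errors)


def not_empty(output: str) -> Optional[str]:
    return "empty output" if not output.strip() else None


def max_length(limit: int) -> Validator:
    def check(output: str) -> Optional[str]:
        if len(output) > limit:
            return f"output longer than {limit} characters"
        return None
    return check


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Stand-in for a real model call.
        return "Refund approved."

    result = run_with_validation(fake_model, "Handle ticket 123",
                                 [not_empty, max_length(500)])
    print(result.needs_human, result.errors)
```

The design choice that matters is the boundary: the checks never know which model produced the text, so they keep working unchanged when the model underneath is replaced.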
If you are already running Sonnet, Opus, or any other frontier model in production and want to know where it is actually failing on your data, book a 30-minute call and we will walk through it.



