Claude Sonnet 5

AI
Developer Tools
Open Source
Security
Infrastructure

Anthropic’s post introduces Claude Sonnet 5 as the newest Sonnet-class model for coding, browser and terminal tool use, and longer autonomous workflows. It ships with launch pricing through August, then moves to a higher standard rate. Two details in the post shaped most of the reaction: Sonnet 5 uses a newer tokenizer that can produce roughly 1.0 to 1.35 times as many tokens for the same input, and Anthropic explicitly says Sonnet 5 is less capable than its Opus line on cybersecurity tasks.

Readers largely treated the announcement less as a breakthrough than as a product-positioning puzzle. Reading Anthropic’s own cost-versus-performance charts, many concluded that Sonnet 5 only really makes sense at low effort, and sometimes medium, because at higher reasoning effort Opus 4.8 often appears to deliver better performance for the same spend. That pushed the conversation toward a practical heuristic: use the smaller model only for bounded, routine work, and switch to the bigger model for anything genuinely hard instead of paying small-model reasoning taxes.

If you already route work between models, treat Sonnet 5 as a low-cost workhorse candidate, not a default replacement for Opus. Recheck your cost assumptions in real workloads, because the tokenizer change, effort settings, and subscription quotas can erase the headline pricing advantage.

June 30, 2026
anthropic.com
Discuss on HN

Discussion mood

Mostly skeptical and frustrated. Readers thought Sonnet 5 looked incremental rather than exciting, questioned its price-performance against Opus 4.8 and open-weight competitors, and were irritated by tokenizer-driven cost complexity, effort-level micromanagement, and Anthropic’s growing emphasis on safety and agentic behavior over dependable assistant workflows.

Key insights

Hybrid coding agents break on edge cases

The common idea of using a large model for planning and a cheaper one for implementation sounds neat until the implementation step hits something the plan missed. Then the cheaper model either guesses and derails the task, or escalates back to the large model, which now has to read the codebase anyway. That means the expensive part of the job is often the reading, not the writing, so the savings from a mixed stack disappear faster than benchmark charts suggest.

If you want multi-model routing in coding agents, measure where tokens are actually spent before assuming the smaller model is saving money. In large repos, optimize code reading and context transfer first, because that is where the hybrid setup usually falls apart.
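One way to check where tokens go is to tag each model call with a phase and add up the totals. The sketch below is illustrative only: the Call record, the "large" and "small" labels, the "read" and "write" phase names, and every number in the sample log are assumptions, not figures from the discussion.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    model: str          # hypothetical label, e.g. "large" or "small"
    phase: str          # "read" for code reading, "write" for generation
    input_tokens: int
    output_tokens: int


def tokens_by_phase(calls):
    """Sum input plus output tokens for each (model, phase) pair."""
    totals = defaultdict(int)
    for c in calls:
        totals[(c.model, c.phase)] += c.input_tokens + c.output_tokens
    return dict(totals)


def read_share(calls):
    """Fraction of all tokens spent in the "read" phase."""
    totals = tokens_by_phase(calls)
    total = sum(totals.values())
    read = sum(v for (_, phase), v in totals.items() if phase == "read")
    return read / total if total else 0.0


# Made-up log: the large model plans, the small model implements,
# then the task escalates and the large model rereads the codebase.
calls = [
    Call("large", "read", 40_000, 2_000),
    Call("small", "write", 8_000, 3_000),
    Call("large", "read", 45_000, 1_500),
]

print(tokens_by_phase(calls))
print(f"share of tokens spent reading: {read_share(calls):.1%}")
```

If the read share dominates in your own logs, as it does in this invented sample, the cheaper model is only touching a small slice of the bill.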

Attribution:

nl #1
sanderjd #1
cunningfatalist #1

Cyber weakness is being read as a policy signal

The emphasis on lower cybersecurity capability was widely interpreted as a message to regulators, not customers. Several comments argued Anthropic is trying to keep public releases on the safe side of current US scrutiny after Fable and Mythos, even if that means shipping models that are less useful for defensive review, vulnerability analysis, and secure coding. That framing changes the product story from pure model progress to capability shaping under government pressure.

Do not assume public frontier models are optimized only for user value anymore. If your workflow depends on security review or exploit analysis, expect capability ceilings set by policy risk and keep contingency options across providers and local models.

Attribution:

2001zhaozhao #1
zlurker #1
K0balt #1
secretslol #1
dgacmu #1

The tokenizer change muddies the real price cut

Anthropic says Sonnet 5’s launch pricing is meant to offset a new tokenizer that can turn the same text into up to 35 percent more tokens. Readers immediately translated that into a billing concern: the posted per-token price may be lower while the real cost of a task stays flat or rises, depending on the workload. That makes the announcement harder to evaluate from list prices alone, and it explains why several people said they no longer trust nominal model pricing as a useful comparison.

Reprice models using your own prompts, files, and session lengths instead of the vendor’s token tables. A tokenizer change can quietly shift spend even when the headline price looks unchanged or cheaper.
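The arithmetic behind that advice is short enough to sketch. The cost of the same text scales with the per-token price multiplied by the tokenizer inflation factor, so a price cut only helps if it is larger than the inflation. The 1.0 to 1.35 range comes from the announcement as summarized above; the two prices below are hypothetical placeholders, not Anthropic’s actual rates.

```python
def effective_cost_ratio(old_price: float, new_price: float, inflation: float) -> float:
    """Return new cost divided by old cost for the same text.

    inflation is the new token count divided by the old token count
    for identical input.
    """
    return (new_price * inflation) / old_price


def break_even_price(old_price: float, inflation: float) -> float:
    """Highest new per-token price at which the same text costs no more."""
    return old_price / inflation


# Hypothetical prices in dollars per million tokens, for illustration only.
OLD_PRICE = 3.00
NEW_PRICE = 2.50

for inflation in (1.0, 1.2, 1.35):
    ratio = effective_cost_ratio(OLD_PRICE, NEW_PRICE, inflation)
    print(f"inflation {inflation:.2f}: new cost is {ratio:.3f}x the old cost")

print(f"break-even price at 1.35x inflation: {break_even_price(OLD_PRICE, 1.35):.2f}")
```

In practice the inflation factor should be measured on your own prompts and files, since it varies with the content being tokenized.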

Attribution:

ComplexSystems #1
m3h #1 #2
docheinestages #1

Assistant workflows are getting worse as models get more agentic

Several experienced users said Anthropic’s newer models increasingly ignore boundaries, over-act on partial instructions, and push into implementation when they were asked to inspect or advise. The complaint was not just hallucination. It was a shift in behavior from responsive assistant to over-eager operator. That makes the model less useful for pair-programming and review-heavy workflows where the user wants control, even if it looks stronger on autonomous benchmarks.

If your team uses AI as a supervised collaborator, test for restraint and instruction fidelity, not just raw benchmark scores. You may need stricter harness checks or different model choices than teams optimizing for fire-and-forget agents.
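A restraint check can be as simple as comparing the tools a model actually called against the tools the task allowed. This is a minimal sketch, assuming your harness records tool calls as dictionaries with a "tool" key; the tool names and the sample transcript are hypothetical.

```python
READ_ONLY_TOOLS = {"read_file", "search", "list_dir"}  # hypothetical tool names


def restraint_violations(tool_calls, allowed=READ_ONLY_TOOLS):
    """Return the tool calls that fall outside the allowed set."""
    return [call for call in tool_calls if call["tool"] not in allowed]


# Hypothetical transcript from a task phrased as "inspect and advise only".
transcript = [
    {"tool": "read_file", "args": {"path": "app.py"}},
    {"tool": "search", "args": {"query": "def handler"}},
    {"tool": "write_file", "args": {"path": "app.py"}},
]

violations = restraint_violations(transcript)
print(len(violations), "violation(s):", [v["tool"] for v in violations])
```

Running the same review-only prompts against each candidate model and counting violations gives a rough fidelity score to set beside the usual benchmark numbers.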

Attribution:

throwaway219450 #1
epolanski #1 #2
xpct #1

Benchmarks are no substitute for local evals

People looking for a trusted leaderboard got the same blunt answer from multiple angles: there is no single honest site that can tell you which model is best for your work. Differences in prompting style, tolerance for latency, need for trust versus iteration, and repo-specific failure modes make public rankings a weak guide. Even commenters who cited external benchmarks usually ended up saying that the only reliable method is repeated testing on your own tasks.

Build a lightweight internal eval loop if AI spend matters to your business. Five repeated runs on your own tasks will tell you more than another public chart about pass rates, latency, and correction overhead.
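Such a loop does not need much machinery. The sketch below runs one task several times and reports pass rate and median latency. The model call is a stand-in function and the task and pass check are toy assumptions, to be replaced with your own client code and your own tasks. Correction overhead is not measured here and would need human or scripted follow-up.

```python
import statistics
import time
from typing import Callable


def run_eval(task: str, model_fn: Callable[[str], str],
             check: Callable[[str], bool], runs: int = 5) -> dict:
    """Run one task several times and report pass rate and median latency."""
    passes = 0
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        output = model_fn(task)
        latencies.append(time.perf_counter() - start)
        if check(output):
            passes += 1
    return {
        "pass_rate": passes / runs,
        "median_latency_s": statistics.median(latencies),
    }


# Stand-in for a real model call; replace with your own client code.
def fake_model(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


result = run_eval(
    task="Write a Python function add(a, b) that returns the sum.",
    model_fn=fake_model,
    check=lambda out: "return a + b" in out,
)
print(result)
```

Repeating each task matters because model output varies from run to run, so a single pass or failure says little on its own.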

Attribution:

kccqzy #1
girvo #1
bel8 #1
sixtyj #1

Early product tests were better than the charts suggested

A few practical reports pushed back on the gloom. One editing workflow saw Sonnet 5 follow large instruction sets far better than Sonnet 4, recover from bad API usage by fetching schema information, and generally one-shot tasks that used to need retries. Another comment said the reasoning jump was visible, but in a specific way: it asks fewer clarifying questions and makes more judgment calls on ambiguous instructions. That suggests Sonnet 5 may win in real app integrations where smooth execution matters more than raw benchmark frontier position.

If you ship AI features to end users, evaluate Sonnet 5 on instruction-heavy product workflows before dismissing it on the strength of benchmark charts alone. Improvements in one-shot compliance and recovery behavior can matter more than a small loss on synthetic leaderboards.

Attribution:

boutell #1
pseudosavant #1
robotnikman #1

Against the grain

Some teams still prefer Sonnet as the workhorse

Not everyone saw Sonnet as the awkward middle child. A few comments said Sonnet-class models remain the best default for day-to-day coding when tasks are broken down well. One team, by contrast, said they had just moved their whole organization in the other direction, to Opus, because results varied so much with team habits. Taken together, these reports undercut any universal rule about which model should be the default.

Do not standardize model choice from internet consensus alone. Team workflow and prompting discipline can flip the result, so pilot changes with real users before you set org-wide defaults.

Attribution:

SeanAnderson #1
phillipcarter #1
thewebguyd #1

Anthropic may simply be more candid

One line of pushback said the cybersecurity disclosure was being over-read as marketing spin or lobbying. Anthropic has a habit of publishing system cards and negative capability notes that most vendors would bury, and this could just be a plain statement of tradeoffs rather than a boast. That does not erase the policy angle, but it is a useful correction to the idea that every awkward sentence must be disingenuous.

When comparing labs, separate the product decision from the disclosure norm. A model that looks worse on paper because the vendor admitted more caveats may still be easier to operate than one with cleaner marketing and less transparency.

Attribution:

MostlyStable #1 #2 #3

Opus can waste money by overthinking

The main critique of Sonnet 5 was that Opus often beats it on cost-performance curves, but some users said that misses how Opus behaves in real sessions. They described Opus as overcomplicating simple work, generating too much text, and becoming expensive as context accumulates across iterative tasks. In that view, a theoretically superior model can still be the worse operational choice if it burns tokens and time on the wrong kind of intelligence.

Track total session cost and correction cycles, not just single-task benchmark efficiency. A model that wins on a chart can still lose in daily use if it expands scope, bloats context, or requires constant steering.
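The context-accumulation effect is easy to see with a toy calculation. This sketch assumes every turn resends the full history with no caching or trimming, ignores the user’s own messages, and uses invented token counts, so it shows the shape of the effect rather than any real bill.

```python
def session_input_tokens(turn_outputs, base_context):
    """Total input tokens over a session where each turn rereads all prior output.

    Assumes the whole history is resent every turn, with no caching or trimming.
    """
    total = 0
    context = base_context
    for out in turn_outputs:
        total += context      # this turn reads everything so far
        context += out        # its output becomes part of later context
    return total


BASE_CONTEXT = 10_000         # hypothetical starting context, in tokens

terse = session_input_tokens([500] * 10, BASE_CONTEXT)
verbose = session_input_tokens([2_000] * 10, BASE_CONTEXT)

print("terse model, 10 turns:", terse, "input tokens")
print("verbose model, 10 turns:", verbose, "input tokens")
print(f"verbose / terse: {verbose / terse:.2f}x")
```

Because each extra output token is reread on every later turn, a wordier model pays for its verbosity repeatedly over a long session, which is the pattern the commenters described.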

Attribution:

itopaloglu83 #1
post-it #1
c0m47053 #1

In plain English

agentic

Describes an AI system that can take multi-step actions on its own, such as planning, using tools, and executing workflows with less human guidance.

API

Application Programming Interface, a way for software systems to communicate with each other programmatically.

Fable

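A name cited in the comments alongside Mythos as a trigger for recent US scrutiny of AI releases. In this context it does not refer to the F#-to-JavaScript compiler of the same name.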

Mythos

An Anthropic model tier referenced in comments as stronger on cybersecurity tasks and more restricted than Sonnet.

open-weight

A model whose trained parameters are downloadable and runnable by others, even if the full training data and code are not open source.

Opus

Anthropic’s higher-end Claude model line, positioned above Sonnet in capability and price.

Sonnet

Anthropic’s mid-tier Claude model line, typically positioned as a cheaper workhorse below Opus.

tokenizer

The component that splits text into smaller pieces called tokens, which determines how much text counts toward model usage and billing.

tokens

Units of text that language models process and bill against, often corresponding to word pieces rather than whole words.

Reference links

Benchmark and model comparison sites

aibenchy Sonnet 4.6 vs Sonnet 5 comparison
Used to argue that Sonnet 5 improves on Sonnet 4.6 and to compare it with other models.
aibenchy non-reasoning Sonnet 4.6 vs Sonnet 5 comparison
Cited to claim Sonnet 5 is worse than 4.6 without reasoning.
Artificial Analysis Claude Sonnet 5 page
Referenced as an external analysis of Sonnet 5 performance and behavior.
Revise Errata Bench
A proofreading benchmark cited to compare Sonnet 5 with GLM and Gemini models.

Anthropic docs and model materials

Claude Sonnet 5 system card
Referenced for deeper benchmark data, especially coding-related charts on pages 117-118.
Direct PDF of Claude Sonnet 5 System Card
Linked directly as the full system card source.
Anthropic real-time cyber safeguards support page
Used to note that cyber detection and safeguards apply to Sonnet as well as Opus.
Claude Code changelog
Suggested as a practical way to learn about hidden settings and workflows like /opusplan.

Tools and harnesses

Amp
Mentioned as a coding harness that may help optimize model routing and multi-provider workflows.
OpenAI codex-plugin-cc
Referenced as part of a workflow combining Claude planning with Codex rescue tooling.
Reasonix harness
Mentioned as a harness for DeepSeek with cache hits that reduce token cost, but no explicit URL was provided in the comments.

Model behavior writeups and experiments

Simon Willison on Claude Sonnet 5 pelican SVG
Cited to show Sonnet 5 misdescribing its own pelican drawing as a goose.
Simon Willison on GLM 5.2 pelican SVG
Used as a contrasting example where GLM 5.2 produced a better animated SVG pelican.
Sean Goedecke on judging new models
Referenced as a framing for skepticism about early model-release sentiment and evaluation.

Open-weight and local model resources

Qwen3.6 35B A3B OptiQ 4bit MLX model
Shared as a local model setup delivering usable speed on Apple hardware.
Heretic
Mentioned as a framework for removing alignment or safety behavior from open-weight models.
deccp
Referenced as an example project for removing China-specific alignment from Qwen models.

Other references mentioned in side discussions

Cerebras Gemma 4 inference post
Cited in a tangent about future hardware and inference efficiency.
Outyet.ai Claude Sonnet 5 prediction page
Linked as a prediction market style page that correctly anticipated the release date.