Anthropic Claims ‘Best Coding Model in the World’ With Claude Sonnet 4.5—We Tested It

In brief

Anthropic released Claude Sonnet 4.5, calling it the best coding model yet.
The model scored 77.2% on SWE-bench Verified, rising to 82% with parallel compute.
Anthropic claimed improvements on alignment and safety, but jailbreakers cracked it within minutes.

Anthropic released Claude Sonnet 4.5 on Monday, calling it “the best coding model in the world” and shipping a suite of new developer tools alongside it. The company said the model can stay focused for more than 30 hours on complex, multi-step coding tasks and shows gains in reasoning and mathematical capabilities.

Introducing Claude Sonnet 4.5—the best coding model in the world.

It’s the strongest model for building complex agents. It’s the best model at using computers. And it shows substantial gains on tests of reasoning and math. pic.twitter.com/7LwV9WPNAv

— Claude (@claudeai) September 29, 2025

The model scored 77.2% on SWE-bench Verified, a benchmark that measures real-world software coding abilities, according to Anthropic’s announcement. That score rises to 82% when using parallel test-time compute. This puts the new model ahead of the best offerings from OpenAI and Google, and even ahead of Anthropic’s own Claude Opus 4.1. (Under the company’s naming scheme, Haiku is the small model, Sonnet is the mid-sized one, and Opus is the largest and most powerful in the family.)

Image: Anthropic

Claude Sonnet 4.5 also leads on OSWorld, a benchmark testing AI models on real-world computer tasks, scoring 61.4%. Four months ago, Claude Sonnet 4 held the lead at 42.2%. The model also shows improved capabilities across reasoning and math benchmarks, as well as in assessments by experts in specific business fields like finance, law and medicine.

We tried the model, and our first quick test found it capable of generating our usual “AI vs Journalists” game from a single zero-shot prompt, with no iterations, tweaks, or retries. The model produced functional code faster than Claude Opus 4.1 while maintaining top-quality output. The application it created showed visual polish comparable to OpenAI’s outputs, a change from earlier Claude versions that typically produced less refined interfaces.

Anthropic released several new features with the model. Claude Code now includes checkpoints, which save progress and allow users to roll back to previous states. The company refreshed the terminal interface and shipped a native VS Code extension. The Claude API gained a context editing feature and a memory tool that lets agents run longer and handle greater complexity. Claude apps now include code execution and file creation for spreadsheets, slides, and documents directly in conversations.

Pricing remains unchanged from Claude Sonnet 4 at $3 per million input tokens and $15 per million output tokens. All Claude Code updates are available to all users, while Claude Developer Platform updates, including the Agent SDK, are available to all developers.
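To put those rates in perspective, here is a minimal sketch of the arithmetic in Python. The function name and the token counts in the example are hypothetical, chosen only for illustration; the two prices are the ones quoted above, and the sketch ignores any discounts or surcharges that may apply in practice.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 3.00
OUTPUT_PRICE_PER_MILLION = 15.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in US dollars for one request.

    Hypothetical helper: it applies the flat per-token rates only and
    does not model any other billing details.
    """
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Illustrative workload: 200,000 input tokens and 50,000 output tokens.
    cost = estimate_cost(200_000, 50_000)
    print(f"Estimated cost: ${cost:.2f}")  # Estimated cost: $1.35
```

Because output tokens cost five times as much as input tokens, a job that generates a lot of code can cost more for its output than for the far larger context it reads in.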

Anthropic also called Claude Sonnet 4.5 “our most aligned frontier model yet,” saying it made substantial improvements in reducing concerning behaviors like sycophancy, deception, power-seeking, and encouraging delusional thinking. The company also said it made progress on defending against prompt injection attacks, which it identified as one of the most serious risks for users of agentic and computer use capabilities.

Of course, it took Pliny—the world’s most famous AI prompt engineer—a few minutes to jailbreak it and generate drug recipes like it was the most normal thing in the world.

The release comes as competition intensifies among AI companies over coding capabilities. OpenAI released GPT-5 last month, while Google’s models compete on various benchmarks. The launch could come as a shock to some prediction markets, which until a few hours ago were almost completely certain that Gemini would be the best model of the month.

It may be a race against time. Right now, the model does not appear on the rankings, but LM Arena announced it was already available for ranking. Depending on the number of interactions, the outcome tomorrow could be pretty surprising, considering Claude Opus 4.1 is in second place and Claude Sonnet 4.5 outperforms it on Anthropic’s reported benchmarks.

Anthropic is also releasing a temporary research preview called “Imagine with Claude,” available to Max subscribers for five days. In the experiment, Claude generates software on the fly with no predetermined functionality or prewritten code, responding and adapting to requests as users interact.

“What you see is Claude creating in real time,” the company said. Anthropic described it as a demonstration of what’s possible when combining the model with appropriate infrastructure.
