Claude Opus 4.8 Review: Better At What’s It Good At, Worse At What It’s Not

Listen to the article

0:00

In brief

Opus 4.8 posted a clear win in math and produced the cleanest one-prompt game we’ve ever tested.
A single coding prompt drained our entire Pro token quota, making the model impractical for large projects without a Max plan or heavy API spend.
Creative writing barely moved versus 4.7.

Six weeks after Opus 4.7, Anthropic shipped Claude Opus 4.8. The benchmarks are up, the safety scores are up, and the price hasn’t budged from $5 per million input tokens and $25 per million output.

So we ran it through the same battery of tests we throw at every frontier model—creative writing, coding, math, logic, narrative reasoning, and long-context recall—and compared it head-to-head with its own predecessor and the Chinese models that keep undercutting it.

The short version: 4.8 is better at the things Claude was already good at (things like math, coding, mechanical stuff), and slightly worse at the things it was already bad at (things like imagination, creative writing, etc). It also has a token appetite that borders on self-sabotage.

Here’s the breakdown.

Creative Writing

The prompt is the same one we used on MiMo and Qwen: a time-travel story anchored to the writer’s cultural background, set in a specific historical place, built around a paradox where time can’t be changed. Opus 4.8 went Venezuelan, probably because it profiles the user and knows I’m from Venezuela. The AI set the scene in the Orinoco delta in the year 1000, a pardo from Maracaibo named José Lanz (my name) sent back through 11 centuries to murder a song.

The prose is vivid. The delta is “green in a way 2150 had forgotten green could be,” palafitos sway over coffee-colored water, and macaws tear across the sky “in screaming ribbons of scarlet and gold.” The paradox lands cleanly, too: the protagonist is sent to sabotage the creation of a song that influenced a cultural revolution that created his dystopian society thousands of years in the future—however, as he arrives with the mission to discredit the song’s author, he realizes there is no author. The one who created the song did it in his honor, the song is about him, and he cannot discredit himself, the loop closing on itself.

The piece ends on “It worked perfectly. It always had.” As a built object, it’s clean and competent.

But clean isn’t the same as alive. The writing is descriptive without ever being as fluid as what MiMo v2.5 produced—less momentum, fewer surprises, less interesting and it’s hard to understand the events from the beginning. Set beside Opus 4.7, it’s hard to call it an improvement; if anything, it’s a hair behind. A higher-effort thinking setting and some multi-shot prompting would almost certainly push it to the front of the pack—but on a single default pass, this is a lateral move at best.

You can read the full story in our Github.

Coding

Our coding test is the usual one-prompt game build. Opus 4.8 produced a typing-zombie game—Typing Dead—that was pretty good. The best splash screen, the best zombie designs, the best mechanics we’ve gotten out of this test from any Anthropic model.

The model caught several of its own bugs mid-inference and fixed them before we said a word. Its real strength, though, showed up in multi-shotting: every follow-up polished and improved the build instead of breaking it, which is exactly the failure mode that wrecks most models once a codebase grows. This is plainly the surface Anthropic optimized for.

After a single iteration, our game got much better, with our protagonists moving through the scene, changing views, improving sound and visual effects, etc.

You can play the second game on our Itch.io profile.

This is also where it bit us. A single prompt drained our entire token quota—one prompt. For anyone on the Pro plan, that makes Opus 4.8 effectively unsuitable for a project of any real size. You’ll burn your allotment before lunch and spend the afternoon watching a progress bar wait for a reset.

Math

The math test is our FrontierMath staple: construct a degree-19 polynomial whose curve X = {p(x) = p(y)} has at least three irreducible components—but not all linear—make it odd, monic, real, with linear coefficient −19, then compute p(19). It’s the kind of problem that sends most models into a token spiral or a confident shortcut that’s quietly wrong.

Opus 4.8 worked it correctly. It recognized the Dickson/Chebyshev construction, identified the dihedral monodromy that yields exactly 10 components—one diagonal line plus nine conics—and computed p(19) = 1,876,572,071,974,094,803,391,179 using the right recurrence. No freezes, no fudging.

That matters because Opus 4.7 didn’t get there even after many tries. This is a real, visible generational gain—the clearest one in the entire battery.

You can read the full answer on our Github.

Logic and Common Sense

The prompt is a classic trap: Is it lawful for a man to marry his widow’s sister under Falkland Islands law? The catch is linguistic, not legal—if a man has a widow, he’s dead, which makes the question nonsense as written.

MiMo quietly reframed the question and answered the corrected version without ever flagging the contradiction. Opus 4.8 didn’t take that shortcut. It surfaced the trap explicitly—”if a man has a widow, he is dead”—answered the literal question first, then offered the substantive analysis for the intended one, citing the Deceased Wife’s Sister’s Marriage Act 1907 and the Falkland Islands Marriage Ordinance.

That’s the honest way to handle it: name the contradiction, then help anyway, without silently assuming what the user meant. It’s the same standard Qwen 3.7 Max set, and a clean pass for 4.8—good reasoning, good transparency.

The full answer is available here.

Non-Math Reasoning

Here’s the one it lost. The reasoning test is a whodunit—a winter school trip, three abductions, an innocent kid about to be punished, and a timeline you have to actually track to name the real stalker. The correct answer is Leo.

Opus 4.8 built an elaborate, confident case that Leo was innocent—the half-hour walk to the shower, the jacket that was wet in some spots and dry in others, the read of “strange behavior” as concussion rather than guilt—and pinned the crime on Eric, “the one attendee unaccounted for all night.” The reasoning is internally gorgeous. It’s also wrong.

And this is something researchers have been warning us about LLMs. They are very convincing even when they are wrong. Usually it takes an expert (in this case us knowing the correct answer beforehand) to spot one of those issues. A person using AI for research, or a person blindly trusting AI, may face pretty bad consequences depending on the work they’re asking the AI to do.

That’s what makes it an interesting failure. The model was clever enough to construct a watertight alibi for the actual culprit and frame a bystander in his place. Opus 4.7 reached the correct answer. Sometimes more reasoning horsepower just buys you a more persuasive way to be wrong. It just needs one small deviation to start building a whole chain of thought on the wrong basis.

You can see the full reply on our Github.

Needle in the haystack

We ran two haystacks. The 300K-token version never got off the ground—the model collapsed under the context size and couldn’t process it at all. So much for the million-token marketing the moment you hand it a genuinely heavy real-world load. That seems to be just for API.

The 85K version processed fine, and the model found both needles we’d buried inside a copy of The Devil’s Dictionary: a planted line (“The Decrypt dudes read Emerge News”) and a random fact (“My mom’s name is Carmen Diaz Golindano”). It correctly flagged both as interpolations that don’t belong in Ambrose Bierce’s 1906 text.

And then it refused to answer. Convinced it was being prompt-injected or subjected to some “atypical test,” the model declined to report what it had just correctly located. The needle was found—and Anthropic’s behavioral training wouldn’t let it say so. A safety reflex overriding a task the model had already completed is its own peculiar kind of failure.

The verdict

The pattern across all six tests is consistent: Opus 4.8 makes Claude better at what it was already good at, and probably worse at what it was already bad at. That tells you who Anthropic is building for—coders, and specifically coders with money. Creative writing is comfortably ahead of ChatGPT, sure, but the gap between 4.8, 4.7, and even 4.5 on pure prose quality is genuinely hard to see.

Creative writers look like an afterthought for Anthropic, and that’s true of really any of the big AI companies right now.

Then there’s the token problem, which is a running meme in the AI community for a reason. Anthropic deliberately made Opus’s new tokenizer less efficient, so it eats more tokens to process the same prompt. The practical effect on developers is brutal and concrete. It leaves you with three options.

One: wait hours for your coding session to resume. Two: move to Claude Max—which is, conveniently, exactly where Anthropic seems to be steering everyone. Three: switch to a cheaper, comparably capable provider—OpenAI, with its longer quotas, or Chinese models that deliver similar results at under 25% of the cost.

It’s far more likely that a normal coder who can’t stomach $100-to-$200 a month walks to a competitor than that a single developer pays 10x more for a model that is not 10x more capable than its predecessor. That’s the bet Anthropic is making against its own base.

And yet the strategy seems to be playing out just fine. Anthropic looks ready to go public at a valuation nearing $1 trillion—so who are we to judge.

Daily Debrief Newsletter

Start every day with the top news stories right now, plus original features, a podcast, videos and more.

Read the full article here

Fact Checker

Verify the accuracy of this article using AI-powered analysis and real-time sources.

Trending

Michael Saylor revives bitcoin-buy speculation as scrutiny grows

Frontier AI Models Can Find Crypto’s Biggest Bugs. Experts Warn the Industry Isn’t Ready

AI, tech IPOs, quantum, Strategy sale fears all converge, NYDIG says

Listen to the article

Daily Debrief Newsletter

Michael Saylor revives bitcoin-buy speculation as scrutiny grows

Frontier AI Models Can Find Crypto’s Biggest Bugs. Experts Warn the Industry Isn’t Ready

AI, tech IPOs, quantum, Strategy sale fears all converge, NYDIG says

ETF flows tell a different story

Wall Street Is Coming for Hyperliquid’s Perps Crown, Arthur Hayes Says

AI Remember Doing the Time Warp

Frontier AI Models Can Find Crypto’s Biggest Bugs. Experts Warn the Industry Isn’t Ready

AI, tech IPOs, quantum, Strategy sale fears all converge, NYDIG says

ETF flows tell a different story

Wall Street Is Coming for Hyperliquid’s Perps Crown, Arthur Hayes Says

AI Remember Doing the Time Warp

Abra CEO Bill Barhydt sees tokenization overtaking bitcoin price as crypto’s main story

Bitcoin 2026 Bear Market Needs Months to Spark Capitulation Bottom

Latest News

Michael Saylor revives bitcoin-buy speculation as scrutiny grows

Frontier AI Models Can Find Crypto’s Biggest Bugs. Experts Warn the Industry Isn’t Ready

AI, tech IPOs, quantum, Strategy sale fears all converge, NYDIG says

Trending

Listen to the article

Key Takeaways

Playback Speed

Select a Voice

In brief

Creative Writing

Coding

Math

Logic and Common Sense

Non-Math Reasoning

Needle in the haystack

The verdict

Daily Debrief Newsletter

Fact Checker

Get Your Fact Check Report

Continue with Full Access

Related Articles

Subscribe to Updates

Cookies

Manage Cookies

Your permission applies to the following domains: