DeepSeek, Xiaomi Just Made Frontier AI 99% Cheaper. American Labs Went the Other Way

In brief

DeepSeek made its 75% V4-Pro discount permanent on May 22, locking in output at $0.87 per million tokens.
Xiaomi cut MiMo-V2.5 prices by up to 99% on May 26, with cached input now at $0.0036 per million tokens for the Pro model.
OpenAI’s GPT-5.5 doubled output prices to $30 per million tokens at launch, and Anthropic’s Claude Opus 4.7 shipped with an updated tokenizer that can inflate actual costs by up to 35%.

Earlier this week, DeepSeek made permanent the 75% discount on DeepSeek V4-Pro, which had been set to expire. Now fellow Chinese AI lab Xiaomi has slashed MiMo-V2.5 API prices by up to 99% for cached inputs. Two of the most capable AI models on the market just got aggressively cheaper, while American labs moved in the opposite direction.

Quick explainer for the non-developers in the room: When you use ChatGPT or Claude in a browser, you’re paying a flat subscription—or nothing. When a company builds a product on top of an AI model, they pay per token, where a token is roughly three-quarters of a word. Every message sent, every reply generated, every document processed: all of it adds up at a rate measured in millions of tokens.

An API is the raw pipe that makes this possible: it lets an app, an agent, or a website use the model in its own environment. So token pricing determines whether an AI-powered product is economically viable or a money pit.
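To make per-token billing concrete, here is a minimal Python sketch of how a single API call might be priced. The function name `request_cost` and the example token counts are illustrative assumptions; the rates are the DeepSeek V4-Pro figures quoted in this article, and a real invoice may include other line items.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Return the dollar cost of one call; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical call: a 2,000-token prompt that produces a 500-token reply,
# priced at the V4-Pro rates quoted in this article ($0.435 in, $0.87 out).
cost = request_cost(2_000, 500, input_rate=0.435, output_rate=0.87)
print(f"${cost:.6f} per call")
```

Multiply that per-call figure by the number of calls a product makes in a month and the rate card becomes the main driver of its running costs.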

Token plans are a subscription wrapper on top of that. You buy credits upfront; the model eats through them. Xiaomi’s billing upgrade gives users 5 to 8 times more tokens at the same price. The Max plan at $100 now gets you 82 billion tokens, up from 1.6 billion.

For context, 82 billion tokens is more than 60 billion words.

Why the cuts are real, not marketing

Fuli Luo, head of Xiaomi’s MiMo team and a former core DeepSeek developer who co-built DeepSeek-V2, published a technical explanation on X. The biggest savings come from a smarter way of storing and reusing information the AI has already processed. Instead of repeatedly doing the same work, Xiaomi’s system can remember much more data at once—about five times more than before. That means the AI needs far less computing power, cutting storage and processing costs by around 80%.

Behind the MiMo API Price Reduction:
The deepest price cut, up to 99%, is for Input (Cache Hit). The core reason is our inference framework now supports hierarchical KV cache optimization for SWA. Production inference engine tests show this optimization increases cached token…

— Fuli Luo (@_LuoFuli) May 27, 2026

“Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even,” Luo wrote. “If more architectures that save compute and KV [key-value] cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry.”
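The general idea of reusing work the model has already done can be illustrated with a toy prefix cache. This is a sketch of the concept only, not Xiaomi's hierarchical KV cache for SWA; the class name `ToyPrefixCache` and the integer token IDs are made up for illustration. It counts how many tokens of a request would count as cache hits and how many as fresh input.

```python
class ToyPrefixCache:
    """Remembers token prefixes it has seen (a stand-in for stored KV state)."""

    def __init__(self) -> None:
        self._prefixes = set()

    def process(self, tokens: list) -> tuple:
        """Return (cache_hit_tokens, cache_miss_tokens) for one request."""
        hit = 0
        # Find the longest leading run of tokens that was processed before.
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._prefixes:
                hit = n
                break
        # Store every prefix of this request so later requests can reuse it.
        for n in range(1, len(tokens) + 1):
            self._prefixes.add(tuple(tokens[:n]))
        return hit, len(tokens) - hit


cache = ToyPrefixCache()
system_prompt = [11, 12, 13, 14]  # stable instructions, identical on every call
first = cache.process(system_prompt + [21, 22])   # nothing cached yet
second = cache.process(system_prompt + [31])      # shares the system prompt
print(first, second)
```

The second request reuses the shared system prompt and only its new token is a miss, which is the pattern that cache-hit pricing rewards.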

DeepSeek’s architecture lands in the same place differently. V4 uses two interleaved attention types—one compressing every four tokens for selective attention, another collapsing every 128 tokens for global context at minimal compute. At one million tokens of context, V4-Pro’s KV cache is 10% the size of its predecessor’s, and single-token inference runs at 27% of the previous compute cost.
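As rough arithmetic on the two compression rates described above, the sketch below counts how many compressed entries a one-million-token context would leave under each scheme. It assumes, purely for illustration, that each block of tokens collapses into exactly one cache entry; the article does not spell out the real layout, and the function name `compressed_entries` is hypothetical.

```python
import math


def compressed_entries(context_tokens: int, block_size: int) -> int:
    """Entries left if every `block_size` tokens collapse into one (assumption)."""
    return math.ceil(context_tokens / block_size)


context = 1_000_000
fine = compressed_entries(context, 4)      # the 4-token compression
coarse = compressed_entries(context, 128)  # the 128-token compression
print(fine, coarse)
```

These counts are not the 10% figure quoted above, which describes total KV cache size relative to the predecessor. They only show why coarser compression shrinks what has to be stored.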

The result is a model 98% cheaper than GPT-5.5 Pro with competitive performance.

Silicon Valley’s bet

Claude Opus 4.7 costs $5 per million input tokens and $25 per million output tokens. Anthropic kept the rate card flat but shipped it with a new tokenizer that can produce up to 35% more tokens for the same input text. So the price didn’t go up. Your bill still might.
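The effect of a flat rate card combined with a tokenizer that emits more tokens is simple arithmetic. The sketch below uses the figures quoted above ($5 input, $25 output, up to 35% more tokens); the function name `effective_rate` is illustrative, and 35% is the stated upper bound, not a typical value.

```python
def effective_rate(list_rate: float, token_inflation: float) -> float:
    """Dollars paid for text that used to fit in one million tokens."""
    return list_rate * (1 + token_inflation)


worst_case_input = effective_rate(5.00, 0.35)
worst_case_output = effective_rate(25.00, 0.35)
print(f"input: ${worst_case_input:.2f}, output: ${worst_case_output:.2f}")
```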

GPT-5.5, released in late April, doubled its predecessor’s output price to $30 per million tokens. Gemini 2.5 Pro sits at $1.25 input and $10 output, which is cheap by American standards.

DeepSeek V4-Pro is a 1.6 trillion parameter model that gives you the knowledge base of a massive model at a fraction of the compute cost. It now permanently runs at $0.435 input and $0.87 output per million tokens. That’s a model that scored 80.6% on SWE-bench Verified against Claude Opus 4.6’s 80.8%, on a benchmark measuring real GitHub issue resolution, not cherry-picked demos. The output-price gap between models with essentially the same coding score: about 29x at Opus’s $25 rate, or about 34x measured against GPT-5.5’s $30.
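The output-price multiples can be checked directly from the rates quoted in this article. The sketch assumes Opus output is billed at $25 per million tokens for both 4.6 and 4.7, since the article says the rate card stayed flat; the dictionary name `OUTPUT_RATES` is illustrative.

```python
# Output prices in dollars per million tokens, as quoted in this article.
OUTPUT_RATES = {
    "DeepSeek V4-Pro": 0.87,
    "Claude Opus (flat rate card)": 25.00,
    "GPT-5.5": 30.00,
}

baseline = OUTPUT_RATES["DeepSeek V4-Pro"]
multiples = {name: rate / baseline for name, rate in OUTPUT_RATES.items()}
for name, multiple in multiples.items():
    print(f"{name}: {multiple:.1f}x the V4-Pro output price")
```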

MiMo-V2.5-Pro matches that same $0.435/$0.87 per million tokens after the new cuts. Cache hits drop to $0.0036 per million tokens. For context, that’s cheaper per token than most people pay per character in an SMS.

DeepSeek and Xiaomi aren’t alone

These cuts landed in a market where Chinese models were already much cheaper before any of this. MiniMax M2.7, which trades punches with Claude Opus on coding benchmarks per Artificial Analysis, costs $0.30 input and $1.20 output per million tokens—about 5% of Opus 4.7’s output rate.

Kimi K2.5 from Moonshot AI, with 76.8% on SWE-bench Verified, runs $0.60 input and $2.50 output. GLM-5.1 from Z.AI beat Claude Opus 4.6 on a key coding benchmark earlier this quarter. Four Chinese frontier models shipped in a 12-day window in early May, all under one-third of Opus 4.7’s per-token cost.
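The "about 5%" figure and the wider gap can be reproduced from the prices quoted in this article. The sketch below covers only models for which the article gives both rates, so GLM-5.1 is left out; the names `OPUS_47`, `CHINESE_MODELS` and `share_of_opus` are illustrative.

```python
# Dollars per million tokens, as quoted in this article.
OPUS_47 = {"input": 5.00, "output": 25.00}

CHINESE_MODELS = {
    "DeepSeek V4-Pro": {"input": 0.435, "output": 0.87},
    "MiMo-V2.5-Pro": {"input": 0.435, "output": 0.87},
    "MiniMax M2.7": {"input": 0.30, "output": 1.20},
    "Kimi K2.5": {"input": 0.60, "output": 2.50},
}


def share_of_opus(prices: dict, kind: str) -> float:
    """Price as a fraction of the Opus 4.7 rate for the same token kind."""
    return prices[kind] / OPUS_47[kind]


for name, prices in CHINESE_MODELS.items():
    print(f"{name}: input {share_of_opus(prices, 'input'):.1%}, "
          f"output {share_of_opus(prices, 'output'):.1%} of Opus 4.7")
```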

To visualize the gap, this chart shows how Chinese models stack up against the three most popular American AI providers (Anthropic, OpenAI, and Meta) in terms of price-to-quality ratio.

Image: Artificialanalysis.ai

The Q2 2026 gap between Chinese and American frontier models sits at 15–30x, depending on which models you compare—and that’s the baseline, before any cache discounts.

What this week’s cuts do is collapse that gap further for the specific workloads that actually run in production: agent pipelines with stable system prompts, document processors, retrieval tools, things that hit cache constantly. At $0.003625 per million cached input tokens, DeepSeek V4-Pro’s cost for repeated context is functionally a rounding error.
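To see why cache pricing matters for agent-style workloads, the sketch below blends the two V4-Pro input rates quoted in this article ($0.435 uncached, $0.003625 cached) for a hypothetical request. The 10,000-token system prompt, the 500-token user message and the function name `input_cost` are assumptions made for illustration.

```python
UNCACHED_PRICE = 0.435    # dollars per million uncached input tokens
CACHED_PRICE = 0.003625   # dollars per million cached input tokens


def input_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Dollar cost of one request's input, split by cache hit and miss."""
    return (cached_tokens * CACHED_PRICE
            + fresh_tokens * UNCACHED_PRICE) / 1_000_000


# Hypothetical agent call: a stable 10,000-token system prompt plus
# a 500-token user message that changes every time.
with_cache = input_cost(cached_tokens=10_000, fresh_tokens=500)
without_cache = input_cost(cached_tokens=0, fresh_tokens=10_500)
print(f"with cache: ${with_cache:.6f}, without: ${without_cache:.6f}")
```

In this made-up mix, almost all of the remaining input cost comes from the 500 fresh tokens, not from the 10,000 repeated ones.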
