FSNN | Free Speech News Network
Friday, May 8
Cryptocurrency & Free Speech Finance

Google Found a Way to Make Local AI Up to 3x Faster—No New Hardware Required

By News Room · 2 hours ago · 5 Mins Read · 326 Views

In brief

  • Google released Multi-Token Prediction (MTP) drafters for Gemma 4, delivering up to a 3x speedup at inference without any degradation in output quality.
  • The technique—called speculative decoding—uses a lightweight “drafter” model to predict several tokens at once, which the main model then verifies in parallel, bypassing the one-token-at-a-time bottleneck.
  • MTP drafters are available on Hugging Face, Kaggle, and Ollama under the same Apache 2.0 license as Gemma 4, and work with tools like vLLM, MLX, and SGLang.

Running an AI model on your own computer is great—until it isn’t.

The promise is privacy, no subscription fees, and no data leaving your machine. The reality, for most people, is watching a cursor blink for five seconds between sentences.

That bottleneck has a name: inference speed. And it has nothing to do with how smart the model is. It's a memory-bandwidth problem. Standard AI models generate text one word fragment, called a token, at a time, and the hardware has to shuttle billions of parameters from memory to its compute units just to produce each single token. It's slow by design. On consumer hardware, it's painful.
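A back-of-envelope calculation shows the scale of the problem. The figures below (model size, weight precision, memory bandwidth) are illustrative assumptions, not measurements of any particular machine:

```python
# Back-of-envelope: why token generation is memory-bound, not compute-bound.
# Each new token requires streaming roughly every parameter from memory once.
# All numbers here are illustrative assumptions.

params = 26e9          # a 26B-parameter model
bytes_per_param = 2    # 16-bit weights
bandwidth = 100e9      # ~100 GB/s, plausible consumer memory bandwidth

bytes_per_token = params * bytes_per_param    # 52 GB moved per token
tokens_per_sec = bandwidth / bytes_per_token  # under 2 tokens per second

print(f"{tokens_per_sec:.2f} tokens/sec")     # 1.92
```

At rates like that, the chip's arithmetic units sit mostly idle while memory catches up, which is exactly the slack that speculative decoding exploits.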

The workaround most people reach for is running smaller, weaker models—or heavily compressed versions, called quantized models, that sacrifice some quality for speed. Neither solution is great. You get something that runs, but it’s not the model you actually wanted.

Now Google has a different idea. The company just released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models—a technique that can deliver up to a 3x speedup without touching the model’s quality or reasoning ability at all.

The approach is called speculative decoding, and it’s been around as a concept for years. Google researchers published the foundational paper back in 2022. The idea didn’t go mainstream until now because it required the right architecture to make it work at scale.

Here’s the short version of how it works. Instead of making the big, powerful model do all the work alone, you pair it with a tiny “drafter” model. The drafter is fast and cheap—it predicts several tokens at once in less time than the main model would take to produce just one. Then the big model checks all of those guesses in a single pass. If the guesses are right, then you get the whole sequence for the price of one forward pass.

According to Google, “if the target model agrees with the draft, it accepts the entire sequence in a single forward pass—and even generates an additional token of its own in the process.”

Nothing is sacrificed: The large model—Gemma 4’s 31B dense version, for example—still verifies every token, and the output quality is identical. You’re just exploiting idle compute power that was sitting unused during the slow parts.
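The draft-then-verify loop can be sketched end to end with stand-in models. Everything below is a toy: the "target" and "drafter" are simple arithmetic rules rather than real LLMs, and the single verification pass is simulated token by token (a real transformer scores all draft positions in one parallel forward pass). The point is the control flow: the output matches plain decoding exactly, with far fewer target passes.

```python
# Toy greedy speculative decoding. A cheap "drafter" proposes K tokens;
# the "target" verifies them (simulated here per token, but a real model
# would score all K positions in one parallel forward pass) and, if every
# guess survives, emits one bonus token of its own.

def target_next(seq):
    # Pretend big model: next token is the running sum mod 7
    return sum(seq) % 7

def draft_next(seq):
    # Pretend drafter: agrees with the target except right after a 3
    return 0 if seq[-1] == 3 else sum(seq) % 7

def speculative_generate(prompt, n_tokens, k=4):
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_tokens:
        # 1. Drafter autoregressively proposes k tokens (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target checks all k guesses in ONE (simulated) forward pass
        target_passes += 1
        accepted = []
        for tok in draft:
            if target_next(seq + accepted) == tok:
                accepted.append(tok)        # guess confirmed
            else:
                # First mismatch: take the target's own token and stop
                accepted.append(target_next(seq + accepted))
                break
        else:
            # All k accepted: the same pass yields one extra token for free
            accepted.append(target_next(seq + accepted))
        seq += accepted
    return seq[len(prompt):len(prompt) + n_tokens], target_passes

tokens, passes = speculative_generate([1, 2], 8)
print(tokens, passes)   # same 8 tokens as plain decoding, in 3 passes not 8
```

The acceptance rate drives the speedup: the more often the drafter's guesses match what the target would have said anyway, the more tokens you bank per verification pass.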

Google says the drafter models share the target model’s KV cache—a memory structure that stores already-processed context—so they don’t waste time recalculating things the larger model already knows. For the smaller edge models designed for phones and Raspberry Pi devices, the team even built an efficient clustering technique to further cut generation time.
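The cache-sharing idea can also be illustrated in miniature. The sketch below is a toy, not Gemma's actual KV cache: the "attention state" for a prefix is just a rolling checksum standing in for real key/value tensors. It shows why a drafter that shares the target's cache pays nothing to re-process context the target has already seen:

```python
# Toy sketch of a shared KV cache. The per-prefix "attention state" is a
# rolling checksum here, standing in for real key/value tensors.
# Purely illustrative, not Gemma's implementation.

work = {"steps": 0}  # counts per-token computation actually performed

def attend(prefix, cache):
    """Return the state for `prefix`, computing only the un-cached suffix."""
    key = tuple(prefix)
    if key in cache:
        return cache[key]            # prefix already processed: free
    state = attend(prefix[:-1], cache) if len(key) > 1 else 0
    work["steps"] += 1               # one token's worth of new work
    cache[key] = (state * 31 + key[-1]) % 10_007
    return cache[key]

shared = {}
attend([5, 1, 4, 1, 5, 9], shared)   # target model processes the context
print(work["steps"])                 # 6: one step per context token

attend([5, 1, 4, 1, 5, 9], shared)   # drafter reuses the target's cache
print(work["steps"])                 # still 6: zero recomputation
```

With separate caches, the drafter would redo all six steps for the same context; sharing one cache makes its proposals nearly free.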

This isn’t the only attempt the AI world has made at parallelizing text generation. Diffusion-based language models—like Mercury from Inception Labs—tried a completely different approach: Instead of predicting one token at a time, they start with noise and iteratively refine the entire output. That’s fast on paper, but diffusion LLMs have struggled to match the quality of traditional transformer models, leaving them more of a research curiosity than a practical tool.

Speculative decoding is different because it doesn’t change the underlying model at all. It’s a serving optimization, not an architecture replacement. The same Gemma 4 you’d already run gets faster.

The practical upside is real. A Gemma 4 26B model running on an Nvidia RTX Pro 6000 desktop GPU gets roughly twice the tokens per second with the MTP drafter enabled, according to Google’s own benchmarks. On Apple Silicon, batch sizes of 4 to 8 requests unlock around 2.2x speedups. Not quite the 3x ceiling in every scenario, but still a meaningful difference between “barely usable” and “actually fast enough to work with.”

The context matters here. When Chinese model DeepSeek shocked the market in January 2025—wiping $600 billion from Nvidia’s market cap in a single day—the core lesson was that efficiency gains can hit harder than raw compute. Running smarter beats throwing more hardware at the problem. Google’s MTP drafter is another move in that direction, except aimed squarely at the consumer end of the market.

The AI industry right now revolves around a triangle of inference, training, and memory. A breakthrough in any one corner tends to boost or shock the entire ecosystem. DeepSeek's training approach (achieving powerful models with lower-end hardware) was one example; Google's TurboQuant paper (shrinking AI memory use without losing quality) was another. Both rattled the markets as companies tried to figure out what to do.

Google says the drafter unlocks “improved responsiveness: drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows”—the kind of tasks that demand low latency to feel useful at all.

Use cases snap into focus quickly: A local coding assistant that doesn’t lag; a voice interface that responds before you’ve forgotten what you asked; an agentic workflow that doesn’t make you wait three seconds between steps. All of this, on hardware you already own.

The MTP drafters are available now on Hugging Face, Kaggle, and Ollama, under the Apache 2.0 license. They work with vLLM, MLX, SGLang, and Hugging Face Transformers out of the box.


© 2026 GlobalBoost Media. All Rights Reserved.