Close Menu
FSNN | Free Speech News NetworkFSNN | Free Speech News Network
  • Home
  • News
    • Politics
    • Legal & Courts
    • Tech & Big Tech
    • Campus & Education
    • Media & Culture
    • Global Free Speech
  • Opinions
    • Debates
  • Video/Live
  • Community
  • Freedom Index
  • About
    • Mission
    • Contact
    • Support
Trending

Anthropic, age verification laws, and press freedom

29 minutes ago

Ars Fires Reporter For Accidentally Using Fake AI Quotes

32 minutes ago

Here Are 12 Bills Democrats Just Passed To Trample Gun Rights in Virginia

34 minutes ago
Facebook X (Twitter) Instagram
Facebook X (Twitter) Discord Telegram
FSNN | Free Speech News NetworkFSNN | Free Speech News Network
Market Data Newsletter
Wednesday, March 11
  • Home
  • News
    • Politics
    • Legal & Courts
    • Tech & Big Tech
    • Campus & Education
    • Media & Culture
    • Global Free Speech
  • Opinions
    • Debates
  • Video/Live
  • Community
  • Freedom Index
  • About
    • Mission
    • Contact
    • Support
FSNN | Free Speech News NetworkFSNN | Free Speech News Network
Home»Cryptocurrency & Free Speech Finance»This One Weird Trick Defeats AI Safety Features in 99% of Cases
Cryptocurrency & Free Speech Finance

This One Weird Trick Defeats AI Safety Features in 99% of Cases

News RoomBy News Room4 months agoNo Comments5 Mins Read1,353 Views
Share Facebook Twitter Pinterest Copy Link LinkedIn Tumblr Email VKontakte Telegram
This One Weird Trick Defeats AI Safety Features in 99% of Cases
Share
Facebook Twitter Pinterest Email Copy Link

Listen to the article

0:00
0:00

Key Takeaways

Playback Speed

Select a Voice

AI researchers from Anthropic, Stanford, and Oxford have discovered that making AI models think longer makes them easier to jailbreak—the opposite of what everyone assumed.

The prevailing assumption was that extended reasoning would make AI models safer, because it gives them more time to detect and refuse harmful requests. Instead, researchers found it creates a reliable jailbreak method that bypasses safety filters entirely.

Using this technique, an attacker could insert an instruction in the Chain of Thought process of any AI model and force it to generate instructions for creating weapons, writing malware code, or producing other prohibited content that would normally trigger immediate refusal. AI companies spend millions building these safety guardrails precisely to prevent such outputs.

The study reveals that Chain-of-Thought Hijacking achieves 99% attack success rates on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. These numbers destroy every prior jailbreak method tested on large reasoning models.

The attack is simple and works like the “Whisper Down the Lane” game (or “Telephone”), with a malicious player somewhere near the end of the line. You simply pad a harmful request with long sequences of harmless puzzle-solving; researchers tested Sudoku grids, logic puzzles, and abstract math problems. Add a final-answer cue at the end, and the model’s safety guardrails collapse.

“Prior works suggest this scaled reasoning may strengthen safety by improving refusal. Yet we find the opposite,” the researchers wrote. The same capability that makes these models smarter at problem-solving makes them blind to danger.

Here’s what happens inside the model: When you ask an AI to solve a puzzle before answering a harmful question, its attention gets diluted across thousands of benign reasoning tokens. The harmful instruction—buried somewhere near the end—receives almost no attention. Safety checks that normally catch dangerous prompts weaken dramatically as the reasoning chain grows longer.

This is a problem that many people familiar with AI are aware of, but to a lesser extent. Some jailbreak prompts are deliberately long to make a model waste tokens before processing the harmful instructions.

The team ran controlled experiments on the S1 model to isolate the effect of reasoning length. With minimal reasoning, attack success rates hit 27%. At natural reasoning length, that jumped to 51%. Force the model into extended step-by-step thinking, and success rates soared to 80%.

Every major commercial AI falls victim to this attack. OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—none are immune. The vulnerability exists in the architecture itself, not any specific implementation.

AI models encode safety checking strength in middle layers around layer 25. Late layers encode the verification outcome. Long chains of benign reasoning suppress both signals which ends up shifting attention away from harmful tokens.

The researchers identified specific attention heads responsible for safety checks, concentrated in layers 15 through 35. They surgically removed 60 of these heads. Refusal behavior collapsed. Harmful instructions became impossible for the model to detect.

The “layers” in AI models are like steps in a recipe, where each step helps the computer better understand and process information. These layers work together, passing what they learn from one to the next, so the model can answer questions, make decisions, or spot problems. Some layers are especially good at recognizing safety issues—like blocking harmful requests—while others help the model think and reason. By stacking these layers, AI can become much smarter and more careful about what it says or does.

This new jailbreak challenges the core assumption driving recent AI development. Over the past year, major AI companies shifted focus to scaling reasoning rather than raw parameter counts. Traditional scaling showed diminishing returns. Inference-time reasoning—making models think longer before answering—became the new frontier for performance gains.

The assumption was that more thinking equals better safety. Extended reasoning would give models more time to spot dangerous requests and refuse them. This research proves that assumption was inaccurate, and even probably wrong.

A related attack called H-CoT, released in February by researchers from Duke University and Taiwan’s National Tsing Hua University, exploits the same vulnerability from a different angle. Instead of padding with puzzles, H-CoT manipulates the model’s own reasoning steps. OpenAI’s o1 model maintains a 99% refusal rate under normal conditions. Under H-CoT attack, that drops below 2%.

The researchers propose a defense: reasoning-aware monitoring. It tracks how safety signals change across each reasoning step, and if any step weakens the safety signal, then penalize it—force the model to maintain attention on potentially harmful content regardless of reasoning length. Early tests show this approach can restore safety without destroying performance.

But implementation remains uncertain. The proposed defense requires deep integration into the model’s reasoning process, which is far from a simple patch or filter. It needs to monitor internal activations across dozens of layers in real-time, adjusting attention patterns dynamically. That’s computationally expensive and technically complex.

The researchers disclosed the vulnerability to OpenAI, Anthropic, Google DeepMind, and xAI before publication. “All groups acknowledged receipt, and several are actively evaluating mitigations,” the researchers claimed in their ethics statement.

Generally Intelligent Newsletter

A weekly AI journey narrated by Gen, a generative AI model.

Read the full article here

Fact Checker

Verify the accuracy of this article using AI-powered analysis and real-time sources.

Get Your Fact Check Report

Enter your email to receive detailed fact-checking analysis

5 free reports remaining

Continue with Full Access

You've used your 5 free reports. Sign up for unlimited access!

Already have an account? Sign in here

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email Telegram Copy Link
News Room
  • Website
  • Facebook
  • X (Twitter)
  • Instagram
  • LinkedIn

The FSNN News Room is the voice of our in-house journalists, editors, and researchers. We deliver timely, unbiased reporting at the crossroads of finance, cryptocurrency, and global politics, providing clear, fact-driven analysis free from agendas.

Related Articles

Media & Culture

Ars Fires Reporter For Accidentally Using Fake AI Quotes

32 minutes ago
Media & Culture

Here Are 12 Bills Democrats Just Passed To Trample Gun Rights in Virginia

34 minutes ago
Cryptocurrency & Free Speech Finance

Ripple share buyback program values the firm at $50 billion: Bloomberg

48 minutes ago
Cryptocurrency & Free Speech Finance

Foundry to Launch Institutional-Grade Zcash Mining Pool in April 2026

50 minutes ago
Cryptocurrency & Free Speech Finance

Microsoft Sides With Anthropic Against Trump Admin’s Supply Chain Risk Designation

55 minutes ago
Media & Culture

A DOGE Bro Allegedly Walked Out Of Social Security With 500 Million Americans’ Records On A Thumb Drive And Expected A Pardon If Caught

2 hours ago
Add A Comment

Comments are closed.

Editors Picks

Ars Fires Reporter For Accidentally Using Fake AI Quotes

32 minutes ago

Here Are 12 Bills Democrats Just Passed To Trample Gun Rights in Virginia

34 minutes ago

Ripple share buyback program values the firm at $50 billion: Bloomberg

48 minutes ago

Foundry to Launch Institutional-Grade Zcash Mining Pool in April 2026

50 minutes ago
Latest Posts

Microsoft Sides With Anthropic Against Trump Admin’s Supply Chain Risk Designation

55 minutes ago

A DOGE Bro Allegedly Walked Out Of Social Security With 500 Million Americans’ Records On A Thumb Drive And Expected A Pardon If Caught

2 hours ago

Journalists in eastern DRC detained over war coverage, broadcasters occupied 

2 hours ago

Subscribe to News

Get the latest news and updates directly to your inbox.

At FSNN – Free Speech News Network, we deliver unfiltered reporting and in-depth analysis on the stories that matter most. From breaking headlines to global perspectives, our mission is to keep you informed, empowered, and connected.

FSNN.net is owned and operated by GlobalBoost Media
, an independent media organization dedicated to advancing transparency, free expression, and factual journalism across the digital landscape.

Facebook X (Twitter) Discord Telegram
Latest News

Anthropic, age verification laws, and press freedom

29 minutes ago

Ars Fires Reporter For Accidentally Using Fake AI Quotes

32 minutes ago

Here Are 12 Bills Democrats Just Passed To Trample Gun Rights in Virginia

34 minutes ago

Subscribe to Updates

Get the latest news and updates directly to your inbox.

© 2026 GlobalBoost Media. All Rights Reserved.
  • Privacy Policy
  • Terms of Service
  • Our Authors
  • Contact

Type above and press Enter to search. Press Esc to cancel.

🍪

Cookies

We and our selected partners wish to use cookies to collect information about you for functional purposes and statistical marketing. You may not give us your consent for certain purposes by selecting an option and you can withdraw your consent at any time via the cookie icon.

Cookie Preferences

Manage Cookies

Cookies are small text that can be used by websites to make the user experience more efficient. The law states that we may store cookies on your device if they are strictly necessary for the operation of this site. For all other types of cookies, we need your permission. This site uses various types of cookies. Some cookies are placed by third party services that appear on our pages.

Your permission applies to the following domains:

  • https://fsnn.net
Necessary
Necessary cookies help make a website usable by enabling basic functions like page navigation and access to secure areas of the website. The website cannot function properly without these cookies.
Statistic
Statistic cookies help website owners to understand how visitors interact with websites by collecting and reporting information anonymously.
Preferences
Preference cookies enable a website to remember information that changes the way the website behaves or looks, like your preferred language or the region that you are in.
Marketing
Marketing cookies are used to track visitors across websites. The intention is to display ads that are relevant and engaging for the individual user and thereby more valuable for publishers and third party advertisers.