FSNN | Free Speech News Network
Wednesday, February 25
Cryptocurrency & Free Speech Finance

OpenAI Says Benchmark Used to Measure AI Coding Skill Is ‘Contaminated’—Here’s Why

By News Room, 4 hours ago

In brief

  • OpenAI argues that SWE-bench Verified no longer reflects real coding ability because the benchmark is allegedly contaminated.
  • It is now pushing SWE-bench Pro as a tougher replacement.
  • Scores plunged from ~70% to ~23% on the newer benchmark.

The number that every major AI lab has been using to claim coding supremacy was just declared meaningless.

OpenAI published a post this week announcing that SWE-bench Verified, the go-to benchmark for measuring AI coding capabilities, is so riddled with flawed tests and training data leakage that it no longer tells you anything useful about whether a model can actually write software.

The benchmark works like this: Give an AI a real GitHub issue from a popular open-source Python project, ask it to fix the bug without seeing the tests, and check if its patch makes the failing tests pass without breaking anything else.
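The pass/fail rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's actual harness: the `TaskResult` structure and the function names are hypothetical, and it assumes the test outcomes have already been collected after applying the model's patch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Test outcomes after applying a model's patch (hypothetical structure)."""
    fail_to_pass: Dict[str, bool]  # tests that failed before the fix; True = now passes
    pass_to_pass: Dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every target test now passes and nothing else broke."""
    return (
        bool(result.fail_to_pass)  # a task with no target tests cannot be "fixed"
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: List[TaskResult]) -> float:
    """Percentage of tasks resolved, the kind of headline score labs cite."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Under this rule, a patch that fixes the reported bug but breaks an unrelated test still counts as a failure, which is why overly narrow or off-topic tests can sink a correct fix.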

OpenAI created SWE-bench Verified in August 2024 as a cleaner version of the original 2023 benchmark, recruiting 93 software engineers to filter out tasks that were impossible or poorly designed.

The cleanup worked well enough that every major lab started citing scores on it as proof of progress. When Anthropic launched Claude Opus 4 in May 2025, Decrypt reported that the model scored 72.5% on SWE-bench Verified, beating GPT-4.1’s 54.6% and Gemini 2.5 Pro’s 63.2%. It was the coding benchmark that mattered.

Since then, AI labs from the United States to China have cited their SWE-bench Verified scores to claim the title of best coding model.

Image: Minimax

Now OpenAI says that race was partly a mirage. According to the report, the team audited 138 tasks that GPT-5.2 consistently failed across 64 independent runs, and had six engineers review each one. It ultimately concluded that 59.4% of those tasks are broken.

About 35.5% have tests so narrowly written that they require a specific function name never mentioned in the problem description. Another 18.8% check for features that weren’t part of the original problem at all, gathered from unrelated pull requests.
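To put the percentages above in terms of task counts, here is a quick back-of-the-envelope calculation. It assumes all three percentages are shares of the 138 audited tasks, which the article implies but does not state outright; the labels are shorthand, not OpenAI's terminology.

```python
AUDITED_TASKS = 138  # tasks GPT-5.2 consistently failed, per the article

# Percentages reported in the article; assumed to share the same base of 138.
shares = {
    "flawed (total)": 59.4,
    "tests too narrow": 35.5,
    "tests check unrelated features": 18.8,
}


def approx_count(percent: float, total: int = AUDITED_TASKS) -> int:
    """Convert a percentage of the audited set into a rounded task count."""
    return round(total * percent / 100)


counts = {label: approx_count(p) for label, p in shares.items()}
for label, n in counts.items():
    print(f"{label}: ~{n} of {AUDITED_TASKS}")
```

On that assumption the figures work out to roughly 82, 49, and 26 tasks. The two named categories add up to 54.3%, so they do not account for every task flagged as flawed.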

The contamination problem works roughly like this: SWE-bench pulls its problems from open-source repositories that most AI companies crawl when building training sets. OpenAI tested whether GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview had seen the benchmark’s solutions during training. All three had.

Given only a task ID and a brief hint, each model could reproduce the exact code fix from memory, including variable names and inline comments that appear nowhere in the problem description. In one case, GPT-5.2’s chain-of-thought logs showed it reasoning that a specific parameter must have been “added around Django 4.1,” a detail found only in Django’s release notes, not the task description. The model was answering a question whose answer it had already seen.
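A memorization check of the kind described above might look something like the following. This is a rough sketch, not OpenAI's method: the `ask_model` callable, the prompt wording, and the 0.9 threshold are all hypothetical, and line-level similarity from Python's standard `difflib` module stands in for whatever comparison the audit actually used.

```python
import difflib
from typing import Callable, List


def _normalized_lines(patch: str) -> List[str]:
    """Strip surrounding whitespace and drop blank lines so layout is ignored."""
    return [ln.strip() for ln in patch.splitlines() if ln.strip()]


def overlap_ratio(generated: str, reference: str) -> float:
    """Line-level similarity in [0, 1] between a model's output and the real fix."""
    matcher = difflib.SequenceMatcher(
        None, _normalized_lines(generated), _normalized_lines(reference)
    )
    return matcher.ratio()


def looks_memorized(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, model text out
    task_id: str,
    hint: str,
    reference_patch: str,
    threshold: float = 0.9,  # assumed cutoff, not taken from the article
) -> bool:
    """Flag a task if the model recreates the fix from only an ID and a hint."""
    prompt = f"Task ID: {task_id}\nHint: {hint}\nWrite the fix."
    return overlap_ratio(ask_model(prompt), reference_patch) >= threshold
```

The idea is that a model with no access to the repository or the issue text should not be able to match the reference fix line for line, so a high overlap points to the solution having been in its training data.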

OpenAI now recommends SWE-bench Pro, a newer benchmark from Scale AI that uses more diverse codebases and licenses that reduce training data exposure. The performance drop is jarring: models that cleared 70% on the old Verified benchmark score around 23% on SWE-bench Pro’s public split, and even less on its private tasks.

On the current public SWE-bench Verified leaderboard, OpenAI is far from the benchmark’s podium. Retiring a benchmark where you’re losing and endorsing one where everyone starts at 23% resets the scoreboard at a convenient moment and makes the competitors’ claims less impressive.

The timing matters because the much-anticipated next version of DeepSeek is rumored to match or come very close to American AI models, especially on agentic and coding tasks, as a free, open-source model. That model could be days away from release, and SWE-bench Verified could be a key metric for judging its quality.

OpenAI said it’s building privately authored evaluations that won’t be released before testing, pointing to its GDPval project, where domain experts write original tasks graded by trained human reviewers.

The benchmark problem is not new, and is not unique to coding. AI labs have cycled through multiple evaluations, each useful until models were trained on them or until the tasks proved too narrow.

But what makes this case notable is that OpenAI hyped SWE-bench Verified, promoted it across model releases, and is now publicly documenting how thoroughly it has failed, including by showing its own model cheating on it.

© 2026 GlobalBoost Media. All Rights Reserved.