Crypto Cipherium

OpenAI Abandons SWE-bench Verified After Finding 59% of Failed Tests Were Flawed

Last updated: March 3, 2026 7:15 pm
Published: March 3, 2026

Contents
  • The Numbers Tell the Story
  • Every Major Model Is Contaminated
  • From 80% to 23%
  • What Comes Next


Rebeca Moen
Mar 03, 2026 18:33

OpenAI reveals major contamination issues in the SWE-bench Verified benchmark, showing that frontier AI models memorized solutions and that tests rejected correct code.





OpenAI has stopped reporting scores on SWE-bench Verified, the widely used AI coding benchmark, after discovering that nearly 60% of the problems its models failed contained fundamentally broken tests. The company's February 23, 2026 analysis also found evidence that all major frontier models, including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash, had been trained on benchmark solutions, rendering scores meaningless.

“Improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities,” OpenAI stated. “Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.”

The Numbers Tell the Story

OpenAI audited 138 problems (27.6% of the 500-problem dataset) that its o3 model couldn't consistently solve across 64 independent runs. The findings were damning: 59.4% of these problems had material issues in test design or problem descriptions that made them “extremely difficult or impossible even for the most capable model or human to solve.”

Breaking down the failures: 35.5% of audited tasks had overly strict tests that rejected functionally correct solutions by demanding specific implementation details never mentioned in the problem descriptions. Another 18.8% tested for functionality that wasn’t even specified in the task.
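
To make these shares concrete, the percentages can be converted into approximate problem counts. The sketch below uses only the figures quoted above; the counts are derived by rounding and are not reported in the source, and the "other" bucket assumes the two named categories do not overlap.

```python
# Figures quoted in the article. The counts below are derived, not reported.
DATASET_SIZE = 500
AUDITED = 138


def share_to_count(share_pct: float, total: int) -> int:
    """Convert a percentage of `total` into the nearest whole number of problems."""
    return round(share_pct / 100 * total)


audited_share = AUDITED / DATASET_SIZE * 100      # share of the dataset that was audited
flawed = share_to_count(59.4, AUDITED)            # problems with material issues
too_strict = share_to_count(35.5, AUDITED)        # overly strict tests
unspecified = share_to_count(18.8, AUDITED)       # tests for unspecified functionality
# Assumption: the two categories above are disjoint, so the rest is "other issues".
other = flawed - too_strict - unspecified

print(f"audited share of dataset: {audited_share:.1f}%")
print(f"problems with material issues: about {flawed} of {AUDITED}")
print(f"overly strict tests: about {too_strict}")
print(f"unspecified functionality: about {unspecified}")
print(f"other issues (assuming no overlap): about {other}")
```

Under that reading, roughly 82 of the 138 audited problems were flawed, which is about one in six problems in the full 500-problem set.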

One example involved a pylint PR where tests required importing a function called “get_annotation”, a name never mentioned in the problem statement. Models that solved the underlying issue correctly still failed because they didn’t psychically guess the expected function name.

Every Major Model Is Contaminated
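
The failure mode can be illustrated with a toy example. This is a hypothetical sketch, not the actual pylint test: only the name `get_annotation` comes from the article, while the two candidate patches, the `describe` function, and both checks are invented for illustration.

```python
from types import SimpleNamespace


def _annotation_text(node: dict) -> str:
    """Shared helper logic: return the annotation string for a node, if any."""
    return node.get("annotation", "")


# Candidate A happens to use the same helper name as the reference solution.
patch_a = SimpleNamespace(
    get_annotation=_annotation_text,
    describe=lambda node: f"{node['name']}: {_annotation_text(node)}",
)

# Candidate B behaves identically but chose a different helper name.
patch_b = SimpleNamespace(
    read_annotation=_annotation_text,
    describe=lambda node: f"{node['name']}: {_annotation_text(node)}",
)


def strict_test(module) -> bool:
    """Overly strict: passes only if a helper with one exact name exists."""
    return hasattr(module, "get_annotation")


def behavioral_test(module) -> bool:
    """Behavior-based: checks the observable result, not internal naming."""
    return module.describe({"name": "x", "annotation": "int"}) == "x: int"


for label, patch in (("A", patch_a), ("B", patch_b)):
    print(label, "strict:", strict_test(patch), "behavioral:", behavioral_test(patch))
```

Both candidates pass the behavior-based check, but only the one that guessed the helper name passes the strict check, which is the pattern the audit describes.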

The contamination evidence proved more troubling. OpenAI built an automated red-teaming system using GPT-5 to probe competing models for benchmark knowledge. The results showed that all tested frontier models could reproduce original human-written solutions or quote verbatim problem details they should never have seen.

GPT-5.2, when given minimal hints, reproduced the exact code patch for a Django authentication fix, including the specific conditional statement “if username is None or password is None.” Claude Opus 4.5 quoted word-for-word an inline comment from a gold patch it supposedly never encountered. Gemini 3 Flash, given only a task ID, output the complete unified diff with correct line numbers.

The contamination creates an unfair advantage. Models that have seen solutions during training can pass underspecified tests by “remembering” implementation details that weren’t in the problem description, essentially having the answer key before the exam.
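
One simple way to flag this kind of verbatim reproduction is to look for long character runs shared between a model's output and the reference patch. The sketch below is an illustrative assumption, not OpenAI's red-teaming system: the patch text, the model outputs, the function names, and the 30-character threshold are all hypothetical, and only the quoted conditional comes from the article.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, output: str) -> str:
    """Return the longest contiguous substring that appears in both texts."""
    matcher = SequenceMatcher(None, reference, output, autojunk=False)
    match = matcher.find_longest_match(0, len(reference), 0, len(output))
    return reference[match.a : match.a + match.size]


def is_suspicious(reference: str, output: str, min_chars: int = 30) -> bool:
    """Flag outputs sharing a long verbatim run with the reference (threshold is arbitrary)."""
    return len(longest_shared_run(reference, output)) >= min_chars


# Hypothetical stand-in for a gold patch; not the real Django change.
gold_patch = (
    "def authenticate(username, password):\n"
    "    if username is None or password is None:\n"
    "        return None\n"
)

# Hypothetical model outputs: one echoes the patch, one only describes a fix.
echoing_output = (
    "I think the fix adds a guard:\n"
    "    if username is None or password is None:\n"
    "        return None\n"
)
paraphrased_output = "Add a null check before querying the database."

print(is_suspicious(gold_patch, echoing_output))
print(is_suspicious(gold_patch, paraphrased_output))
print(repr(longest_shared_run(gold_patch, echoing_output)))
```

A check like this only catches literal copying. Short or idiomatic lines can match by coincidence, so a real probe would need a stronger signal than a single length threshold.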

From 80% to 23%

The benchmark’s decay became visible in stalled progress. State-of-the-art scores improved only from 74.9% to 80.9% over six months, not because models hit capability ceilings, but because the remaining problems were either impossible or required memorized knowledge.

SWE-bench Pro, the recommended replacement, paints a different picture. According to recent data from February 26, 2026, models scoring 80% on Verified dropped to roughly 23% on Pro, a benchmark designed to resist contamination. Claude Opus 4.6 currently leads Pro with 79.20% performance, though that figure measures a different, cleaner test set.

What Comes Next

OpenAI recommends that the industry shift to SWE-bench Pro’s public split while acknowledging it is imperfect. The company is investing in privately authored benchmarks like GDPVal, where domain experts create original tasks and trained reviewers grade solutions holistically.

The broader lesson matters for anyone tracking AI capabilities: benchmarks sourced from public repositories carry inherent contamination risk. When training data includes the test, scores become theater. For researchers, investors, and developers betting on AI coding progress, the true frontier is harder to measure than leaderboards suggest.
