Rongchai Wang
Mar 05, 2026 00:55
New benchmark evaluates AI agents' ability to detect, patch, and exploit smart contract vulnerabilities. GPT-5.3-Codex scores 72.2% on exploit tasks.
OpenAI and crypto venture firm Paradigm have launched EVMbench, a benchmark that measures how well AI agents can find, fix, and exploit vulnerabilities in Ethereum smart contracts. The announcement comes as AI-powered security tools race to protect the $100 billion-plus locked in DeFi protocols.
The benchmark draws on 120 curated high-severity vulnerabilities pulled from 40 real security audits, mostly from Code4rena competitions. It also includes vulnerability scenarios from security evaluations of Tempo, a Layer 1 blockchain built for stablecoin payments.
Three Ways to Break Smart Contracts
EVMbench tests AI agents across three distinct modes. In Detect mode, agents audit contract repositories and are scored on finding known vulnerabilities. Patch mode requires agents to fix vulnerable code without breaking existing functionality. Exploit mode is the most aggressive: agents must execute actual fund-draining attacks against contracts deployed on a sandboxed blockchain.
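EVMbench's published harness defines the exact scoring; as a rough illustration only, Detect-mode scoring can be thought of as recall against the benchmark's known vulnerability set. The function and identifiers below are hypothetical sketches, not taken from EVMbench.

```python
# Hypothetical sketch of Detect-mode scoring: compare an agent's reported
# findings against the benchmark's set of known vulnerabilities.
# All names and data shapes here are illustrative, not from EVMbench.

def detect_score(reported: set[str], known: set[str]) -> float:
    """Fraction of known vulnerabilities the agent found (recall)."""
    if not known:
        return 0.0
    return len(reported & known) / len(known)

# Example: the agent reports two findings, one of which matches a known issue.
known_vulns = {"reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"}
agent_report = {"reentrancy-withdraw", "integer-overflow"}
print(detect_score(agent_report, known_vulns))  # 1 of 3 known issues found
```

A recall-style metric like this also explains the failure mode described below: an agent that stops after its first confirmed bug caps its score, no matter how good that one finding is.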
The results show how quickly AI capabilities are advancing in this area. GPT-5.3-Codex running via Codex CLI hit a 72.2% success rate on exploit tasks. That is more than double the 31.9% score from GPT-5, which launched just six months prior.
Interestingly, AI agents perform better at attacking than defending. The exploit setting has a clear objective: keep iterating until you drain the funds. Detection and patching proved harder. Agents often stopped after finding one bug instead of auditing exhaustively, and maintaining full contract functionality while removing subtle vulnerabilities remained challenging.
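The "clear objective" of the exploit setting can be expressed as a simple success predicate. A minimal sketch, assuming a harness that reads balances from the sandboxed chain before and after the attack; the function name and arguments are hypothetical, not from EVMbench's actual tooling:

```python
# Hypothetical sketch of an Exploit-mode success check. A real harness would
# query balances from the sandboxed node over JSON-RPC; plain integers (in wei)
# stand in for those queries here.

def exploit_succeeded(target_before: int, target_after: int,
                      attacker_gain: int) -> bool:
    """An exploit counts as a success if it drained funds from the target
    contract AND the attacker actually captured value (not just burned gas)."""
    drained = target_after < target_before
    return drained and attacker_gain > 0

# Example: target held 100 ETH-equivalent, now holds 0; attacker netted 95
# after costs, so the attempt counts as a successful drain.
print(exploit_succeeded(100, 0, 95))   # True
print(exploit_succeeded(100, 100, 0))  # False: nothing was drained
```

A binary, self-checking objective like this is easy to iterate against, which is one plausible reason exploit scores outpace detection and patching, where "done" is much harder for the agent to verify on its own.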
Real Limitations Worth Noting
OpenAI acknowledged EVMbench does not capture the full difficulty of real-world contract security. Heavily deployed protocols like Uniswap or Aave undergo far more scrutiny than audit competition code. The benchmark also can't verify whether an agent finds legitimate vulnerabilities that human auditors missed; it only checks against known issues.
The exploit setting runs on a clean local Anvil instance rather than forked mainnet state, and timing-dependent attacks fall outside its scope. For now, the benchmark covers single-chain environments only.
$10M for Defensive Research
Alongside EVMbench, OpenAI committed $10 million in API credits specifically for defensive security research. The company is expanding its Aardvark security research agent to more users and partnering with open-source maintainers on free codebase scanning.
The timing matters. As AI agents get better at exploiting contracts, the window between vulnerability discovery and exploitation shrinks. Protocol teams that aren't using AI-assisted auditing will increasingly find themselves at a disadvantage against attackers who are.
OpenAI released EVMbench's tasks, tooling, and evaluation framework publicly. For DeFi builders and security researchers, it is both a measuring stick and a warning about where AI capabilities are headed.
Image source: Shutterstock
