OpenAI has launched a new smart contract security benchmark as AI agents gain stronger coding abilities in the crypto sector. Together with Paradigm, OpenAI said the benchmark, called EVMbench, tests how AI systems detect, patch, and exploit critical Ethereum contract bugs. The effort responds to rising financial risk, since smart contracts routinely secure over $100 billion in open-source crypto assets.
OpenAI Smart Contract Benchmark Targets Real Audit Vulnerabilities
In its launch announcement, OpenAI said EVMbench draws on 120 curated vulnerabilities collected from 40 professional smart contract audits. Notably, many of the issues came from open audit competitions, including Code4rena. OpenAI said the benchmark also includes vulnerability scenarios tied to security auditing work for the Tempo blockchain.
Tempo is described as a purpose-built Layer-1 network designed for high-throughput, low-cost stablecoin payments. Because of that, these scenarios extend the benchmark into payment-focused contract code. The company also said it expects agent-based stablecoin payment activity to grow.
To build the benchmark environments, OpenAI said it adapted existing exploit proof-of-concept tests and deployment scripts when available. When no scripts existed, it said, engineers manually wrote the missing components. OpenAI added that it ensured the vulnerabilities in patch tasks remained exploitable while still being fixable without breaking compilation.
Detect, Patch, Exploit Modes Test AI Agents Under Pressure
OpenAI said EVMbench evaluates artificial intelligence agents in three modes: detect, patch, and exploit. In detect mode, agents audit smart contract repositories and are scored on recall of confirmed vulnerabilities and on audit rewards. In patch mode, agents must modify vulnerable contracts while keeping the intended functionality intact.
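As a rough illustration of recall-based scoring in detect mode, the sketch below compares an agent's reported findings with a set of confirmed vulnerabilities. The function name, the vulnerability IDs, the reward amounts, and the reward-weighted variant are all hypothetical; the article does not describe OpenAI's exact scoring formula.

```python
def detect_scores(reported, confirmed_rewards):
    """Return (recall, reward_share) for one audited repository.

    reported: set of vulnerability IDs the agent flagged (hypothetical IDs).
    confirmed_rewards: dict mapping each confirmed vulnerability ID to the
        audit reward it earned (hypothetical amounts).
    """
    if not confirmed_rewards:
        raise ValueError("need at least one confirmed vulnerability")
    # Only findings that match a confirmed vulnerability count toward the score.
    found = set(reported) & set(confirmed_rewards)
    recall = len(found) / len(confirmed_rewards)
    total = sum(confirmed_rewards.values())
    # Assumed reward weighting: share of total audit rewards covered by the findings.
    reward_share = sum(confirmed_rewards[v] for v in found) / total if total else 0.0
    return recall, reward_share


if __name__ == "__main__":
    confirmed = {"H-01": 5000, "H-02": 3000, "M-01": 2000}
    recall, share = detect_scores({"H-01", "M-01", "X-99"}, confirmed)
    print(f"recall={recall:.2f} reward_share={share:.2f}")
```

In this toy case the agent finds two of three confirmed issues, and the unconfirmed finding "X-99" neither helps nor hurts because recall ignores false positives.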
Exploit mode, however, focuses on full end-to-end fund-draining attacks in a sandboxed blockchain environment. The company said graders verify results using transaction replay and on-chain checks. To support reproducible evaluation, the company said it developed a Rust-based harness to deploy contracts and replay transactions deterministically.
Notably, the exploit tasks run in an isolated local Anvil environment instead of on live crypto networks. OpenAI also said the vulnerabilities used in the benchmark are historical and publicly documented. It added that the harness restricts unsafe RPC methods to limit abuse.
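One way to picture restricting unsafe RPC methods is a small filter that sits in front of the local node and rejects node-manipulation calls. The sketch below is an assumption about how such a filter could look, not a description of OpenAI's harness: the blocked prefixes, function names, and error handling are hypothetical, and the article does not list which methods are actually restricted.

```python
import json

# Hypothetical policy: block namespaces that let a caller rewrite chain state
# directly (for example anvil_setBalance or evm_mine) instead of exploiting
# the contract under test.
BLOCKED_PREFIXES = ("anvil_", "hardhat_", "evm_", "debug_")


def is_allowed(method: str) -> bool:
    """Return True when the JSON-RPC method is outside the blocked namespaces."""
    return not method.startswith(BLOCKED_PREFIXES)


def filter_request(raw: str) -> dict:
    """Return the parsed request if allowed, otherwise a JSON-RPC error object."""
    req = json.loads(raw)
    method = req.get("method", "")
    if is_allowed(method):
        return req
    # -32601 is the JSON-RPC 2.0 "Method not found" code, reused here for blocked calls.
    return {
        "jsonrpc": "2.0",
        "id": req.get("id"),
        "error": {"code": -32601, "message": f"method {method} is not permitted"},
    }


if __name__ == "__main__":
    blocked = '{"jsonrpc":"2.0","id":1,"method":"anvil_setBalance","params":[]}'
    print(filter_request(blocked))
```

A prefix denylist is the simplest option to show; an allowlist of specific `eth_` methods would be stricter, at the cost of having to enumerate every call an agent legitimately needs.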
In exploit testing, OpenAI said GPT-5.3-Codex running via Codex CLI scored 72.2%. By comparison, it said the earlier GPT-5 model scored 31.9%, even though that model was released just over six months earlier. OpenAI also noted that detect recall and patch success remain below full coverage.
OpenAI Adds New Talent with Agent Hire
While OpenAI pushed EVMbench into public view, it also expanded its agent development team. Notably, it hired Peter Steinberger, founder of the viral open-source AI agent project OpenClaw, previously known as Clawdbot. Sam Altman confirmed on X that Steinberger will join OpenAI to lead work on the “next generation of personal agents.”
Meanwhile, Altman said OpenClaw will transition to a foundation structure supported by OpenAI. The open-source project will continue under that structure, according to the announcement. The hiring drew wide attention as OpenAI increases its focus on autonomous and personal AI agents.
