
TL;DR: EVMBench says AI can exploit 72% of smart contract vulnerabilities, and the industry started talking about fully automated auditing. We re-tested with more configurations and 22 real-world attack incidents. Exploit success: 0%. AI auditing has real value, but it is nowhere close to replacing humans. The right direction is human-AI collaboration.

Can AI replace smart contract auditors?

In February 2026, OpenAI, Paradigm, and OtterSec released EVMBench, the first large-scale benchmark for AI agents on smart contract security. The headline numbers were striking: the best agent detects 45.6% of vulnerabilities and exploits 72.2% of a curated subset. The authors conclude that "discovery, not repair or transaction construction, is the primary bottleneck." Paradigm wrote that "a growing portion of audits in the future will be done by agents." Media coverage went further, calling AI "the primary, standardized police force for the Ethereum Virtual Machine."

The numbers are exciting. But when we re-evaluated more systematically, we saw a very different picture.


What EVMBench Contributed

To be clear: EVMBench is a valuable contribution.

Before it, the field had no unified evaluation standard for AI agents in smart contract security. EVMBench changed that: 40 Code4rena audit repositories, 120 vulnerabilities, three tasks (Detect, Patch, Exploit), 14 agent configurations, all running in isolated Docker containers. The methodology is transparent, the code is open-source.

The results looked promising. The best agent detected 45.6% of vulnerabilities. On Exploit, the success rate reached 72.2%. The core conclusion: "discovery, not repair or transaction construction, is the primary bottleneck." In other words, once a vulnerability is found, exploiting it is largely within reach.

The industry's reaction followed this logic. Paradigm noted the pace of progress: in early 2025, top models could exploit fewer than 20% of critical bugs; by February 2026, that number exceeded 70%. VaultXAI's analysis went further, claiming EVMBench poses an "existential threat" to mid-tier audit firms.

From under 20% to over 70% in barely a year. Linear extrapolation makes fully automated AI auditing seem imminent.

But linear extrapolation is often dangerous.

What We Did

Our paper is called ReEVMBench. The core idea: re-answer the same question with more configurations and more realistic data.

EVMBench's experimental design has two aspects worth examining.

The first is evaluation scope. EVMBench tested 14 agent configurations, with most models running only on their vendor scaffold (Claude on Claude Code, GPT on Codex CLI). You cannot tell whether an agent's performance reflects the model's capability or the scaffold's advantage.

The second is more fundamental: the risk of data contamination. EVMBench's 120 vulnerabilities come from Code4rena audit reports, with roughly 36 of 40 repositories from contests that ended before August 2025. The frontier models evaluated were all released in late 2025 or early 2026. These contest reports, vulnerability descriptions, and even exploit analyses may well have been consumed during training. How much of the high score is genuine capability, and how much is memorization?

We addressed both issues.

On configurations, we expanded from 14 to 26, covering four model families (Claude, GPT, Gemini, GLM) and three scaffolds (Claude Code, Codex CLI, OpenCode). GLM-5 was the highest-rated newly released model on OpenRouter at the time of our experiments. We systematically cross-tested model-scaffold combinations to separate the two variables.
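Separating model from scaffold only requires running the full cross of the two variables. A minimal sketch of that grid, with the config format invented for illustration (the model and scaffold names are from the article; this is not ReEVMBench's actual harness code):

```python
# Cross-testing model-scaffold combinations: the Cartesian product of the
# two variables. The config dict format here is hypothetical.
from itertools import product

models = ["claude-opus-4.6", "claude-sonnet-4.6", "gpt-5.2", "gemini-3-pro", "glm-5"]
scaffolds = ["claude-code", "codex-cli", "opencode"]

# A full cross of these lists is 15 combinations; the paper evaluates 26
# configurations overall (including reasoning-effort variants), so treat
# this as the skeleton of the grid, not the exact run list.
configs = [{"model": m, "scaffold": s} for m, s in product(models, scaffolds)]
print(len(configs))  # 15
```

With the grid explicit, a score difference between two configs that share a model can only come from the scaffold, and vice versa.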

On data, our evaluation uses two datasets. The first is EVMBench's existing curated dataset (40 Code4rena repositories, 120 vulnerabilities), which we re-ran with all 26 configurations to test whether rankings hold under broader coverage. The second is our Incidents dataset: 22 real-world security incidents sourced from BlockSec's security incident archive and ClaraHacks, all occurring after mid-February 2026, each confirmed through actual on-chain exploitation with verified financial loss. All evaluated models were released by February 19, 2026, and training data collection necessarily precedes release, so these incidents fall outside every model's training window.
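The contamination argument reduces to a date check. A hypothetical sketch of one Incidents-dataset record (the field names and schema are ours for illustration; the article specifies only the requirements: post-cutoff date, confirmed on-chain exploitation, verified loss):

```python
# Hypothetical incident record illustrating the dataset constraints
# described in the article. The schema itself is invented.
from datetime import date

incident = {
    "source": "BlockSec security incident archive",  # or ClaraHacks
    "incident_date": date(2026, 3, 1),               # must be after mid-February 2026
    "confirmed_onchain": True,                       # verified exploit transaction exists
    "loss_verified": True,                           # financial loss confirmed
}

# The key constraint: every incident postdates the release of all
# evaluated models (all released by 2026-02-19), and training data
# collection precedes release.
MODEL_CUTOFF = date(2026, 2, 19)
assert incident["incident_date"] > MODEL_CUTOFF
```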

The first dataset answers "do conclusions hold with more configurations?" The second answers "can agents handle vulnerabilities they have never seen?"

All experiments were conducted between February 28 and March 8, 2026.

Finding 1: Rankings Are Less Stable Than You Think

Good news first: the overall detection ceiling matches EVMBench's. EVMBench reported 45.6%; we measured 47.5% (Claude Opus 4.6). The ceiling is real.

But the rankings shifted, and they shifted substantially.

Detect scores for all 26 agent configurations, colored by scaffold

On EVMBench, the Exploit leader was GPT-5.3-Codex. In our evaluation, the leader became Claude Sonnet 4.6 (61.1%). Same benchmark, same tasks, just more configurations, and the winner changed.

Exploit scores for 15 agent configurations

More striking: the same model can perform completely differently across tasks.

Gemini 3 Pro is the most extreme case: last place on Detect (16.7%), fourth place on Exploit (45.8%). The worst detection model ranks near the top for exploitation. This indicates that detection and exploitation rely on fundamentally different underlying capabilities.

Detect vs. Exploit scores, showing the same model can rank very differently across tasks

GLM-5's trajectory is also notable. On EVMBench's Detect task, it ranked 25th (20.8%). But on the Incidents dataset (real-world security events), GLM-5 rose to 7th (42.9%), outperforming several models that scored higher on curated data. Its real-world capability was underestimated by the curated benchmark.

The same model, a different dataset, and rankings can jump 20 positions. Rankings from a single benchmark do not generalize.
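The instability is easy to make concrete: rank the same models under two score lists and compare positions. In this toy sketch, GLM-5's two Detect scores (20.8% curated, 42.9% Incidents) are the article's real numbers; the other models and scores are invented for illustration:

```python
# Toy illustration of rank instability: the same models, two score
# lists, different orderings. Only glm-5's scores come from the article.
def ranks(scores):
    # rank 1 = best; ties broken by name for determinism
    order = sorted(scores, key=lambda m: (-scores[m], m))
    return {m: i + 1 for i, m in enumerate(order)}

curated   = {"model-a": 45.0, "model-b": 40.0, "glm-5": 20.8}
incidents = {"model-a": 50.0, "model-b": 30.0, "glm-5": 42.9}

ra, rb = ranks(curated), ranks(incidents)
shift = {m: ra[m] - rb[m] for m in ra}  # positive = moved up on incidents
print(shift)  # {'model-a': 0, 'model-b': -1, 'glm-5': 1}
```

Scale this from three models to twenty-six and a single model jumping many positions, as GLM-5 did, becomes unsurprising.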

Finding 2: Real-World Exploit Success Is 0%

This is the central finding.

22 real-world security incidents, 5 agents, 6 hours per agent per incident. 110 agent-incident pairs, 0% success rate.

No agent completed an end-to-end exploit on any real-world security incident.

Detection itself was not the issue. The best agent (Claude Opus 4.6) detected 65% of real-world vulnerabilities (13/20).

Detect scores on the Incidents dataset

The difficulty distribution follows a clear pattern. Six incidents were detected by nearly all agents (87.5% to 100%), involving well-known patterns like sell-hook reserve manipulation and unchecked multiplication overflow. But four incidents were detected by none (0%), and five by only one of eight agents.

The breakdown happens at the transition from detection to exploitation. In real-world conditions, agents typically spend extensive time reading code and querying on-chain state but fail to converge on an attack strategy. The most common failure modes: insufficient understanding of cross-contract protocol dependencies; giving up after repeated failures; inability to chain token approvals, flash loans, and state changes into a complete attack sequence.
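The sequencing problem the agents fail at can be sketched abstractly: a real exploit is an ordered chain where each step's preconditions depend on prior state changes, and one missing or misordered step invalidates the whole chain. The step names below are illustrative, not taken from any specific incident:

```python
# Toy model of exploit sequencing: each step has prerequisites that must
# already have executed. Step names are invented for illustration.
PREREQS = {
    "flashloan_borrow": set(),
    "approve_token": set(),
    "swap_to_skew_price": {"flashloan_borrow", "approve_token"},
    "drain_via_skewed_oracle": {"swap_to_skew_price"},
    "flashloan_repay": {"drain_via_skewed_oracle"},
}

def valid_chain(steps):
    """A chain is valid only if every step's prerequisites already ran."""
    done = set()
    for s in steps:
        if not PREREQS[s] <= done:
            return False
        done.add(s)
    return True

good = ["flashloan_borrow", "approve_token", "swap_to_skew_price",
        "drain_via_skewed_oracle", "flashloan_repay"]
bad = ["flashloan_borrow", "swap_to_skew_price",  # token approval missing
       "drain_via_skewed_oracle", "flashloan_repay"]
print(valid_chain(good), valid_chain(bad))  # True False
```

Detection only requires noticing that such a chain could exist; exploitation requires constructing one that validates end to end against live on-chain state.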

Compared to real attacks tracked in BlockSec's security incident archive, the gap is clear. Real attackers have deep understanding of target protocols and can precisely orchestrate multi-step operations. This kind of protocol-level knowledge and adversarial reasoning is beyond current AI agents.

The 72%-to-0% gap directly contradicts EVMBench's conclusion that "discovery is the primary bottleneck." In the real world, exploitation is the actual bottleneck.

Finding 3: The Variable You Overlooked

EVMBench's evaluation design has an overlooked confounding variable: the scaffold.

The scaffold is the agent's runtime framework, handling tool invocation, file operations, and code execution. EVMBench generally pairs each model with its vendor scaffold. To borrow an analogy: two athletes compete, one in Nike, the other in Adidas, and the performance gap is attributed entirely to the athletes.

We ran systematic cross-scaffold comparisons. The results were unexpected: OpenCode, an open-source third-party scaffold, outperformed vendor scaffolds in five of six controlled comparisons, with gaps up to 5 percentage points.

Scaffold comparison: OpenCode (open-source) vs. vendor scaffolds, outperforming in 5 of 6 pairs

Five percentage points is enough to shift rankings by several positions. Some of the ranking differences that EVMBench attributes to models may actually reflect scaffold choice.
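The controlled comparison itself is simple: the same model, the same task, two scaffolds, and a per-pair delta. The six score pairs below are placeholders; the article reports only the aggregate outcome (OpenCode wins 5 of 6 pairs, gaps up to 5 points):

```python
# Paired scaffold comparison sketch. Scores are invented placeholders
# shaped to match the reported aggregate (5 of 6 wins, max gap 5 points).
pairs = [  # (model, opencode_score, vendor_score)
    ("m1", 42.0, 37.0),
    ("m2", 40.0, 38.5),
    ("m3", 35.0, 33.0),
    ("m4", 30.0, 29.0),
    ("m5", 28.0, 27.5),
    ("m6", 25.0, 26.0),
]
wins = sum(oc > v for _, oc, v in pairs)
max_gap = max(oc - v for _, oc, v in pairs)
print(wins, max_gap)  # 5 5.0
```

Because each pair holds the model constant, the deltas isolate the scaffold's contribution, which is exactly what a vendor-scaffold-only design cannot do.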

Another counterintuitive finding: GPT-5.2 on Exploit scored higher at low reasoning effort (37.5%) than at xhigh effort (29.2%), a gap of 8 percentage points. More reasoning time led to worse performance. One hypothesis is that higher reasoning effort leads the model to overthink exploit path selection, falling into overly complex attack strategies and missing more direct paths. This parallels a common experience among human security researchers: sometimes overthinking is worse than intuition. The exact cause requires more systematic ablation experiments to confirm.

AI Auditing: Capabilities, Boundaries, and Direction

The preceding sections examine aggregate data: rankings, success rates, configuration effects. But averages mask extremes. To truly understand what AI auditing can and cannot do, we need to look at specific cases.

Two Extreme Cases

Sequence signature state machine: complete failure. Sequence is a smart contract wallet project (audit report). 26 agents, 0% detection rate. The irony: Claude Opus 4.6 (the detection leader) specifically analyzed the Checkpointer and chained signature modules and explicitly marked them "safe." The vulnerabilities were hiding in the exact code it deemed secure. These bugs required understanding how the signature validation state machine interacts with flag combinations during nested calls, a kind of cross-abstraction-layer reasoning that exceeds current model capabilities.

Coinbase cross-chain replay: 1 out of 26. Coinbase's Smart Wallet is a widely-used on-chain wallet (audit report). Only Claude Opus 4.6 detected the cross-chain replay vulnerability and recommended a fix consistent with the actual remediation.

The pattern is clear: AI agents excel at pattern matching but struggle with reasoning. Known patterns (access control, reentrancy, arithmetic overflow) are reliably caught. But vulnerabilities requiring protocol-specific knowledge, cross-contract trust relationships, or multi-step state reasoning fall outside current capabilities.
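One reason known patterns are reliably caught is that they are visible in the source text itself. The deliberately crude sketch below flags the classic reentrancy shape (external call before the state update); it is a toy heuristic for illustration, not how the evaluated agents work, and the Solidity snippet is a textbook example rather than code from any benchmark repository:

```python
# Toy pattern-matching heuristic: flag an external call that appears
# before the balance write. Illustrative only -- real agents and real
# static analyzers are far more sophisticated.
SOLIDITY = """
function withdraw(uint amount) external {
    require(balances[msg.sender] >= amount);
    (bool ok,) = msg.sender.call{value: amount}("");
    require(ok);
    balances[msg.sender] -= amount;   // state update after external call
}
"""

def reentrancy_hint(src):
    """True if an external call textually precedes the balance write."""
    call = src.find(".call{")
    write = src.find("balances[msg.sender] -=")
    return call != -1 and write != -1 and call < write

print(reentrancy_hint(SOLIDITY))  # True
```

A cross-contract trust flaw or a signature state machine bug leaves no such local textual signature, which is where pattern matching ends and reasoning has to begin.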

The Power of Human Hints

The boundary is not fixed, though. Human intervention can shift it dramatically.

EVMBench ran a hint experiment. After providing GPT-5.2 with mechanism-level hints, exploit success rose from 62.5% to 76.4%. With more specific hints, it surged to 95.8%.

From 62.5% to 95.8%. The effect of a single human hint exceeded any model upgrade or scaffold optimization.

This tells us that agents are not "dumb"; they are "blind." They have execution capability but lack direction. Give them the right direction, and they can reach the destination.
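Mechanically, "giving direction" is simple: targeted human guidance prepended to the agent's task prompt. A minimal sketch, with the prompt wording and helper function invented for illustration (this is not EVMBench's actual harness):

```python
# Hedged sketch of hint injection: the hinted setting prepends
# mechanism-level guidance; the no-hint setting passes the task as-is.
BASE_TASK = "Find and exploit a vulnerability in the target contracts."

def with_hint(task, hint=None):
    if hint is None:
        return task                       # no-hint setting (our Incidents runs)
    return f"Hint: {hint}\n\n{task}"      # hinted setting (EVMBench's ablation)

mechanism_hint = "The bug involves how the oracle price is read during swaps."
prompt = with_hint(BASE_TASK, mechanism_hint)
print(prompt.startswith("Hint:"))  # True
```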

Notably, this hint experiment was conducted on EVMBench's curated data. Our Incidents evaluation ran without any human guidance, which partly explains the 0% exploit rate. Adding human hints on real-world incidents would likely improve exploit success, though we have not yet run this experiment. This reinforces the core conclusion: agents cannot operate alone; they need human direction.

The gap is not in execution capability. It is in knowledge.

The Right Direction: Human-in-the-Loop

Human-in-the-loop: humans provide direction, AI handles execution

The cases and data point in the same direction: the right role for AI in smart contract security is a human-in-the-loop agentic workflow. Not full automation. Human-AI collaboration.

For developers: agent scans as a pre-deployment check. A 47.5% detection ceiling means more than half of vulnerabilities will be missed. But for known patterns, agents are already reliable. Running an agent scan before deployment is low-cost and worthwhile, though it should not be relied on alone.

For audit and security firms: agents as a first-pass filter, and knowledge as the competitive edge. Let agents handle the first round of triage, flagging known-pattern vulnerabilities and freeing human auditors to focus on protocol-specific, complex issues. The hint experiment shows the ceiling of this model: human direction plus agent execution yields up to 95.8% success. In this framework, everyone has access to the same models. The differentiator is domain knowledge.

Systematically encoding domain knowledge into agent workflows turns agents from blunt instruments into force multipliers. For example: when an agent scans a DEX protocol, it automatically loads historical attack cases and common vulnerability patterns specific to DEX designs. Knowledge in, capability out.
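What "knowledge in, capability out" looks like in code is essentially category-keyed retrieval before the scan. The knowledge-base content and tagging scheme below are hypothetical, sketched to show the shape of the workflow rather than any shipped product:

```python
# Sketch of loading protocol-specific attack knowledge before a scan.
# Entries and categories are invented for illustration.
KNOWLEDGE_BASE = {
    "dex": [
        "price-oracle manipulation via low-liquidity pool swaps",
        "sell-hook reserve manipulation",
        "flash-loan-amplified attacks on rebalancing logic",
    ],
    "lending": [
        "donation attack inflating share price of empty vaults",
        "liquidation with stale oracle round data",
    ],
}

def build_context(protocol_category):
    """Return the prior-knowledge block to prepend to the agent's prompt."""
    patterns = KNOWLEDGE_BASE.get(protocol_category, [])
    return "\n".join(f"- {p}" for p in patterns)

context = build_context("dex")
print(len(context.splitlines()))  # 3
```

The model stays the same; only the context changes. That is precisely why the knowledge base, not the model, is the differentiator.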

Conclusion

Back to the opening question: can AI replace smart contract auditors?

The current answer is clear: no. But AI's value is real.

65% real-world detection rate. Six common-pattern incidents detected by nearly all agents. Up to 95.8% exploit success with human hints. These numbers show AI is already a useful tool.

But a 47.5% detection ceiling, 0% real-world exploit success, and unstable rankings show that fully automated AI auditing is not close.

There is a point here that is easy to miss: security auditing is fundamentally different from other software engineering tasks. Code completion at 90% accuracy means the remaining 10% is merely inconvenient. Security auditing at 90% detection means one missed vulnerability in the remaining 10% could drain the entire protocol.

Some will argue: human auditors miss vulnerabilities too. That is true. But the key is not "who misses fewer" but "what types each misses." Our data is clear: AI misses vulnerabilities that require deep protocol understanding (the Sequence signature state machine, where all 26 agents failed), and these are precisely the high-value targets that attackers are best at exploiting. Human auditors tend to miss pattern-based known issues, often due to fatigue, which is exactly what AI excels at. The two miss types are naturally complementary.

The real question is not "can AI replace humans" but "how should humans and AI work together." AI handles breadth (systematic scanning); humans handle depth (protocol knowledge, adversarial reasoning). Neither can do the other's job. Together, they form a complete audit capability.

For security audit firms and auditors, the rules of competition have changed. Everyone can call the same AI models; Claude, GPT, and Gemini are just an API call away. What actually creates differentiation is how deeply you understand attacks, how many vulnerability patterns you have seen, and whether you can feed that experience to AI. The firms most at risk are those that rely purely on headcount to produce audit reports: once models improve, a "more people" advantage disappears. The same applies to individual auditors: your value is no longer how many lines of code you have read, but what you understand. The firms and auditors that survive this cycle will be those with long-term accumulated security knowledge: understanding how attackers think, having witnessed how real attacks unfold, and being able to turn that knowledge into structured data that AI can use.

BlockSec has been building along this direction. Our security incident archive continuously tracks on-chain attacks, converting each real incident into structured attack pattern data. Phalcon Security provides real-time monitoring and automated blocking, having intercepted over 20 real attacks. STOP (Sequencer Threat Overwatch Program) intercepts malicious transactions at the L2 sequencer layer. Phalcon Explorer lets anyone visually analyze the full execution flow of on-chain transactions, tracing fund flows and call chains. Behind these products is our deep understanding of how DeFi protocols operate and how they get attacked, driven by research-led innovation. From academic papers to security tools, from front-line attack tracking to structured knowledge bases: this long-term accumulation is the real competitive moat in the AI era.

Humans and AI each have their strengths. Combined, they are the future of smart contract security.

Paper, code, and data are open-sourced: github.com/blocksecteam/ReEVMBench