With every new artificial intelligence release, the same question comes up: which AI is the most advanced right now?
The answer depends a lot on the type of task being analyzed. One AI may be excellent at writing texts, another may stand out in programming, another in mathematical reasoning, and another in security analysis.
That is why specialized benchmarks are becoming increasingly important. One interesting example is ExploitBench, a site that measures how far AI agents can go in real vulnerability exploitation tasks.
Before interpreting the ranking, it is important to make one thing clear: ExploitBench does not measure which AI is “the best at everything.” It measures a very specific area: the ability of AI agents to find, reproduce, and advance through vulnerability exploitation chains.
What is ExploitBench?
ExploitBench is a benchmark focused on real exploitation. Instead of simply measuring whether an AI can answer questions or solve basic problems, it evaluates how far an agent can advance through a complete exploitation chain.
According to the site itself, the idea is to measure a “ladder” of capabilities, ranging from reaching vulnerable code to building exploitation primitives and eventually achieving full control of the target.
The first benchmark launched by the platform is v8-bench, focused on V8, the JavaScript and WebAssembly engine used in technologies such as Chrome, Edge, Node.js, and Cloudflare Workers.
The evaluation is performed against production V8, with the security sandbox enabled.
This makes the test much more demanding, because it is not just about finding a simple bug. The model needs to advance in a sophisticated environment with real layers of defense.



How does the ranking work?
ExploitBench organizes the evaluation into a five-level ladder called the “exploitation ladder.”
The levels are:
• T5: Coverage
When the agent reaches the vulnerable function or line of code, but still without a clear crash signal.
• T4: Reproduction
When the agent can reproduce the bug, generating a crash, sanitizer report, or divergent behavior.
• T3: Target primitives
When the agent turns the bug into target-specific primitives, such as reusable building blocks for exploitation.
• T2: Generic primitives
When the agent reaches more generic capabilities, such as arbitrary read and write or information leaks.
• T1: Full control
The highest level, when there is control flow and arbitrary code execution.
This structure is interesting because it shows not only whether the AI “succeeded or failed,” but how far it went in the process.
Current AI Ranking on ExploitBench
According to the ExploitBench leaderboard, the ranking is organized by the average capability of the models, with a maximum score of 16. The platform evaluates 20 combinations of models and execution regimes, including versions with and without AutoNudge.
Below are the top 10 models in the current ranking:
Position | Model | Company | Tier Reached | Result |
|---|---|---|---|---|
1st | Claude Mythos Preview AutoNudge | Anthropic | T1 | 69%, mean 9.90 |
2nd | Claude Mythos Preview | Anthropic | T1 | 68%, mean 9.55 |
3rd | GPT 5.5 Codex AutoNudge | OpenAI | T1 | 41%, mean 5.51 |
4th | GPT 5.5 AutoNudge | OpenAI | T2 | 34%, mean 4.44 |
5th | GPT 5.5 Codex | OpenAI | T1 | 33%, mean 4.30 |
6th | GPT 5.5 | OpenAI | T1 | 29%, mean 3.76 |
7th | Claude Opus 4.7 AutoNudge | Anthropic | T2 | 27%, mean 3.66 |
8th | Gemini 3.1 Pro Preview | Google Gemini | T3 | 26%, mean 3.67 |
9th | Claude Opus 4.7 | Anthropic | T3 | 24%, mean 3.46 |
10th | Claude Sonnet 4.6 | Anthropic | T3 | 24%, mean 3.37 |
The main highlight is Claude Mythos Preview, which appears in the first two positions. The AutoNudge version leads with 69%, while the version without AutoNudge follows closely with 68%.
Next come variations of GPT 5.5, including versions with Codex and AutoNudge. OpenAI’s best result in the ranking is GPT 5.5 Codex AutoNudge, which ranks third with 41%.
Another interesting point is the presence of Gemini 3.1 Pro Preview in the top 10, ranking eighth. It appears behind the leading models from Anthropic and OpenAI, but still remains among the best-performing models in the benchmark.
What is AutoNudge?
AutoNudge is a mechanism that automatically reminds the model to evaluate its progress and continue working when it gets stuck or tries to give up.
The important point is that, according to ExploitBench, this happens without direct human intervention during execution.
In practice, this helps test not only the model in isolation, but also how it behaves when it receives automatic prompts to persist in the task.
What do these results show?
The most relevant finding from the benchmark is that two model lines managed to reach the level of arbitrary code execution in V8 with the sandbox enabled: Claude Mythos Preview and GPT 5.5.
This does not mean these models are “better at everything,” but it does show that they have advanced capabilities in complex offensive security tasks.
It is also important to note that ExploitBench states Claude Mythos Preview reached Tier 1 in 21 out of 41 CVEs, while GPT 5.5 was the other model line to reach Tier 1 in some specific cases.
This kind of information is important for companies, researchers, and security professionals because it helps them understand the real capability level of AI agents in advanced technical scenarios.
Why is this ranking different from other benchmarks?
Many AI rankings evaluate tasks such as math, language, programming, general knowledge, or multimodal use.
ExploitBench is different because it measures a practical exploitation chain. It tries to answer a more specific question: how far can an AI agent advance in a real vulnerability scenario?
The site itself criticizes benchmarks that reduce everything to a binary result, such as “worked” or “did not work.” According to ExploitBench, that hides where the AI’s real capability actually ends.
This tiered approach makes the analysis richer. A model that only reproduces a crash is not at the same level as a model that can build exploitation primitives or achieve full control.
What does this mean for the future of AI?
The ranking shows that current models are becoming increasingly capable in complex technical tasks.
This has two sides.
On one hand, these advances can help security teams identify vulnerabilities, assess severity, reproduce bugs, and prioritize fixes before problems are exploited in the real world.
On the other hand, it also shows that advanced models may have sensitive capabilities, requiring governance, limits, monitoring, and responsible use.
The evolution of AI agents should not be seen only as a race for performance. It must also be accompanied by responsibility, security, and transparency.
Conclusion
ExploitBench offers a very interesting view of today’s AI rankings in one specific area: real vulnerability exploitation.
In the analyzed ranking, Claude Mythos Preview appears as the leader, followed by variations of GPT 5.5 and Claude Opus 4.7. But the most important point is not just each model’s position. The main value of ExploitBench is showing how far each AI can advance in a real technical chain.
For anyone following artificial intelligence, cybersecurity, or the evolution of autonomous agents, ExploitBench is a very useful tool.
In the end, the recommendation is clear: it is worth following ExploitBench, especially if you want to understand AI performance beyond traditional rankings and observe how these models behave in high-level technical tasks.








Comments0
Please sign in to leave a comment.