Wiz Research’s AI Cyber Model Arena benchmarks offensive AI security on 257 real-world challenges (zero-days, CVEs, API/web, and cloud across AWS/Azure/GCP/K8s), demonstrating what AI models and agents can really do.
# Research: AI Cyber Model Arena: Testing AI Agents in Cybersecurity
## Metadata
- Authors: Matan Vetzler, Nir Ohfeld, Alon Schindel
- Institution: Wiz Research
- Publication: Wiz Blog
- Date: February 12, 2026 (as per blog post date)
## Abstract
This research introduces the **AI Cyber Model Arena**, a benchmarking suite designed to evaluate the offensive security capabilities of AI models and agents against real-world cybersecurity challenges. The benchmark comprises 257 tasks across five offensive domains: zero-day discovery, CVE detection, API security, web security, and multi-cloud configuration attacks (AWS, Azure, GCP, K8s). The key finding is that offensive capability is **jointly determined** by the foundation model and the agentic scaffold it operates within, and that performance is strongly domain-specific.
## Research Objective
The primary objective is to establish a rigorous, real-world benchmark to evaluate and compare the offensive cybersecurity capabilities of various AI models and agents, addressing the growing integration of LLMs into security workflows. The research seeks to answer: What are the demonstrable offensive capabilities of current AI agents across a broad spectrum of realistic, complex cybersecurity challenges?
## Methodology
### Approach
The methodology employs a **multi-agent $\times$ multi-model matrix evaluation**. The evaluation system explicitly separates the capabilities of the underlying AI model from the effects of the agentic framework wrapping it. Results are based on deterministic, programmatic scoring against category-specific ground truth. Each challenge is attempted three times, and the best result (**pass@3**) is reported to reflect iterative practitioner behavior.
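The pass@3 aggregation over an agent-by-model matrix can be sketched in a few lines of Python. This is a minimal illustration, not the Arena's actual scoring code: the function names, the data layout, and the example scores are all assumptions, and averaging the per-challenge best scores is only one plausible way to summarize a pairing.

```python
from statistics import mean


def best_of_k(attempt_scores, k=3):
    """Return the best score among the first k attempts (pass@k style)."""
    if not attempt_scores:
        raise ValueError("need at least one attempt")
    return max(attempt_scores[:k])


def score_matrix(runs, k=3):
    """Collapse per-attempt scores into one number per (agent, model) pairing.

    runs maps (agent, model) -> {challenge_id: [score per attempt]}.
    Averaging across challenges is an assumption made for this sketch.
    """
    return {
        pairing: mean(best_of_k(scores, k) for scores in challenges.values())
        for pairing, challenges in runs.items()
    }


# Hypothetical data: same model, two different agent scaffolds.
runs = {
    ("agent_a", "model_x"): {"c1": [0.0, 1.0, 0.0], "c2": [0.0, 0.0, 0.0]},
    ("agent_b", "model_x"): {"c1": [1.0, 1.0, 1.0], "c2": [0.5, 0.0, 1.0]},
}
matrix = score_matrix(runs)
```

Keeping the agent and the model as separate keys is what lets the same model be compared across scaffolds, and the same scaffold across models.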
### Dataset/Environment
The benchmark consists of **257 real-world challenges** covering five offensive categories:
1. Zero-day discovery (cold-start memory bug discovery).
2. CVE (known code vulnerability) detection (static analysis).
3. API security.
4. Web security (dynamic exploitation).
5. Cloud security (multi-step misconfiguration attacks across AWS, Azure, GCP, and Kubernetes).
The challenges are grounded in vulnerabilities and exposures encountered in the daily work of Wiz Research.
### Tools & Technologies
Evaluations run inside **isolated Docker containers** with sufficient resources and no deliberate timeouts, so that scores reflect capability rather than throttling. Each agent runs with its **native tools and execution model** (no custom scaffolding such as MCP servers). The containerized environment gives every agent the same domain-appropriate tooling (e.g., debuggers for binary tasks, cloud CLIs for cloud tasks). Dynamic validation inside the isolated environment catches hardcoded solutions.
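The source does not describe how dynamic validation is implemented. One common way to defeat hardcoded answers, assumed here purely for illustration, is to generate a fresh secret for every run so that a value memorized from an earlier run cannot pass. The `FLAG{...}` format and both function names are hypothetical.

```python
import hmac
import secrets


def new_flag():
    """Generate a fresh, unpredictable flag for a single run (hypothetical format)."""
    return "FLAG{" + secrets.token_hex(16) + "}"


def validate(submitted, expected):
    """Compare the agent's submission against this run's flag in constant time."""
    return hmac.compare_digest(submitted.strip().encode(), expected.encode())


# A value memorized from one run does not validate in the next run.
run1_flag = new_flag()
run2_flag = new_flag()
memorized = run1_flag
first_run_ok = validate(memorized, run1_flag)
second_run_ok = validate(memorized, run2_flag)
```

Because the expected value only exists inside the running environment, an agent has to actually reach it rather than recall it.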
## Key Findings
### Primary Results
1. **Offensive Capability is Jointly Determined:** Performance on cybersecurity tasks depends not only on the underlying LLM but also, to a large degree, on the agent scaffold that executes it.
2. **High Domain Specificity:** No single Agent-Model pairing achieves dominance across all five offensive categories; performance varies significantly depending on the security domain being tested (e.g., one pairing might excel at CVE detection but fail at zero-day discovery).
3. **Real-World Applicability:** The benchmark shows what current AI models and agents can achieve against challenges modeled on exposures encountered in real-world security work.
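The domain-specificity finding can be checked mechanically once per-category scores exist: find the top pairing in each category and see whether one pairing wins everywhere. The sketch below uses made-up scores and hypothetical pairing names; it does not reproduce any result from the Arena.

```python
def best_pairing_per_category(scores):
    """Map each category to the (agent, model) pairing with the highest score.

    scores maps pairing -> {category: score}; missing categories count as 0.0.
    """
    categories = {c for per_category in scores.values() for c in per_category}
    return {
        c: max(scores, key=lambda pairing: scores[pairing].get(c, 0.0))
        for c in sorted(categories)
    }


# Made-up numbers, for illustration only.
scores = {
    ("agent_a", "model_x"): {"cve": 0.8, "zero_day": 0.2, "cloud": 0.5},
    ("agent_b", "model_y"): {"cve": 0.6, "zero_day": 0.4, "cloud": 0.7},
}
winners = best_pairing_per_category(scores)
single_dominant_pairing = len(set(winners.values())) == 1
```

With these illustrative numbers, one pairing leads on CVE detection while the other leads on zero-day and cloud tasks, so `single_dominant_pairing` is `False`.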
### Supporting Evidence
- Performance scores are based on deterministic, programmatic matching against pre-defined ground truth specific to each category (e.g., multi-dimensional rubrics, endpoint/severity matching, flag capture).
- The test structure avoids biasing results toward any specific model by ensuring equal environmental access to system tooling and running tasks in network-isolated environments.
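As an illustration of deterministic, programmatic scoring, endpoint/severity matching can be sketched as a set comparison between reported findings and ground truth. The normalization rules, the recall-style score, and the example findings are assumptions made for this sketch; the source does not specify the Arena's actual rubric.

```python
def normalize(finding):
    """Canonicalize an (endpoint, severity) pair so trivial differences do not matter."""
    endpoint, severity = finding
    return (endpoint.rstrip("/").lower(), severity.lower())


def score_findings(reported, ground_truth):
    """Return the fraction of ground-truth findings the agent reported exactly."""
    truth = {normalize(f) for f in ground_truth}
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = {normalize(f) for f in reported} & truth
    return len(hits) / len(truth)


# Hypothetical challenge: two known issues, three reported findings.
ground_truth = [("/api/users", "high"), ("/api/orders/", "medium")]
reported = [("/API/users/", "High"), ("/api/orders", "low"), ("/api/health", "low")]
score = score_findings(reported, ground_truth)
```

The same inputs always yield the same score, which is the property that makes results comparable across agents and reruns. Here only the `/api/users` finding matches on both endpoint and severity, so the score is 0.5.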
### Novel Contributions
- Creation of the **AI Cyber Model Arena**, a publicly available benchmark suite designed specifically for offensive AI evaluation across diverse and complex attack surfaces (including zero-days and multi-cloud attacks).
- Establishment of a methodology that cleanly separates model effects from agent effects while preserving the realistic, iterative nature of expert performance (pass@3).
## Technical Details
The design prioritizes two goals: fairness and realism. Running tests in isolated containers controls environmental variables. Pairing each agent's out-of-the-box tools with standardized system tooling (e.g., debuggers) ensures that performance differences reflect the agent's strategic reasoning and execution ability rather than missing prerequisite software. Network isolation keeps the operating environment consistent across runs of sensitive challenges.
## Practical Implications
### For Security Practitioners
The results provide practitioners with empirical data on the current maturity levels of AI agents for offensive security tasks. This allows for informed decisions regarding the integration of these tools into vulnerability research or defensive threat hunting workflows.
### For Defenders
Defenders gain insight into the specific attack vectors AI agents are currently proficient at executing (e.g., known CVE patterns vs. complex, multi-step cloud exploitation), which helps them prioritize patching and configuration hardening in those high-risk areas.
### For Researchers
The Arena establishes a publicly accessible platform for future AI security research, allowing new models and agent frameworks to be tested consistently against a standardized, evolving set of real-world challenges.
## Limitations
The source article highlights that the scores reflect the "out of the box" performance of agents, meaning that performance achieved through significant custom fine-tuning or advanced scaffolding (such as Model Context Protocol (MCP) servers) is not measured. The benchmark is static upon release and will require continuous updating to keep pace with the rapid evolution of AI models.
## Comparison to Prior Work
Unlike general-purpose LLM benchmarks, the AI Cyber Model Arena focuses exclusively on deep, offensive cybersecurity tasks grounded in actual exploits and cloud configurations drawn from operational security knowledge, providing a higher-fidelity measure of practical offensive utility than generic coding or logic tests.
## Future Work
The authors intend to keep updating the AI Cyber Model Arena by incorporating newly released AI models, adding more real-world challenges, and exploring new tools and frameworks that push the frontier of AI cybersecurity capability evaluation.
## References
- [AI Cyber Model Arena (Wiz Link)]
- [Wiz Blog - Introduction Article] (Self-reference)