pith. sign in

arxiv: 2606.29981 · v1 · pith:MKIQAT5Qnew · submitted 2026-06-29 · 💻 cs.CR

Hephaestus: Toward a Cybersecurity AI Scientist

Pith reviewed 2026-06-30 05:40 UTC · model grok-4.3

classification 💻 cs.CR
keywords cybersecurityAI scientistmulti-agent systemnon-stationarydigital twinscyber rangesAI-native defenseagent security
0
0 comments X

The pith

AI-native cybersecurity forms a distinct scientific object requiring its own specialized AI scientist architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that cybersecurity differs from stable scientific domains in three key ways: its units of study are security events and interaction traces, its tools and models are non-stationary, and evaluation needs digital twins and cyber ranges rather than fixed benchmarks. Existing AI scientist systems are designed for steady-state domains and thus fall short here. The authors introduce the Cybersecurity AI Scientist as a modular multi-agent system handling problem framing through reporting, with objectives in risk, trust, incident, and energy. They focus on shifting to resilient agent legions in defense. A sympathetic reader would care because cyber threats move at machine speed, demanding automated research that accounts for constant change.

Core claim

AI-native cybersecurity is a different kind of scientific object whose recurring units of study are security events and interaction traces, not static assets; whose model and tool substrate is non-stationary, not steady-state; and whose credible evaluation depends on digital twins, cyber ranges, and auditable evidence rather than on a single benchmark score. This object is called the Cybersecurity AI Scientist, realized as a modular, role-specialized multi-agent research system anchored in a four-zeros frame.

What carries the argument

The Cybersecurity AI Scientist: a modular, role-specialized multi-agent research system coordinating problem framing, threat modeling, tool generation, controlled experimentation, evaluation, governance, and scientific reporting, with concrete objectives in a four-zeros frame of risk, trust, incident, and energy dimensions.

If this is right

  • AI-native defense replaces steady-state perimeters with resilient agent legions.
  • The category of terminal security is deconstructed into agent security.
  • Future systems can build on this defined object separate from any single realization.
  • Empirical programs can use the architecture for benchmarks and experiments in cyber ranges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar non-stationary challenges may exist in other fields like real-time financial trading or adaptive robotics, suggesting the need for domain-specific AI scientists.
  • Testing the multi-agent coordination could involve running the system in controlled cyber ranges to measure research output speed and quality.
  • The four-zeros frame might provide a structured way to balance multiple objectives in security research beyond this paper's scope.

Load-bearing premise

The non-stationary character of cybersecurity tools and events creates a fundamental categorical difference from stable scientific domains that existing AI scientist systems cannot handle without a wholly new specialized architecture.

What would settle it

An experiment showing that a general-purpose AI scientist system can automate cybersecurity research tasks effectively using only standard benchmarks and without adaptations for non-stationarity or specialized evaluation methods.

Figures

Figures reproduced from arXiv: 2606.29981 by Jiaqi Li, Lidong Zhai, Lvyang Zhang, Wen Lu, Yang Zhao.

Figure 1
Figure 1. Figure 1: Cybersecurity AI Scientist is motivated by two coupled shifts. The scientific object [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: One operational view of the Cybersecurity AI Scientist. The framework is not a single [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: From terminal-anchored defense to resilient agent legions. Each legion carries an event [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
read the original abstract

Cyber offense is moving to machine speed; cyber research itself is not. Existing AI scientist systems make end-to-end research automation increasingly plausible, but they target relatively stable scientific domains. We argue that AI-native cybersecurity is a different kind of scientific object. Its recurring units of study are security events and interaction traces, not static assets; its model and tool substrate is non-stationary, not steady-state; and credible evaluation depends on digital twins, cyber ranges, and auditable evidence rather than on a single benchmark score. We call this object the Cybersecurity AI Scientist. A practical realization is a modular, role-specialized multi-agent research system that coordinates problem framing, threat modeling, tool generation, controlled experimentation, evaluation, governance, and scientific reporting, and that anchors its concrete objectives in a four-zeros frame spanning risk, trust, incident, and energy dimensions. As a representative agenda we focus on AI-native defense, where steady-state perimeters give way to resilient agent legions and the classical category of terminal security is itself being deconstructed into agent security. This paper defines the object, separates it from any single organizational realization, and offers an architecture and an agenda on which later systems, benchmarks, and empirical programs can be built.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper argues that cybersecurity constitutes a distinct scientific object for AI research automation, differing from stable domains in three ways: recurring units of study are security events and interaction traces rather than static assets; the model and tool substrate is non-stationary; and evaluation relies on digital twins, cyber ranges, and auditable evidence rather than single benchmark scores. It defines this as the 'Cybersecurity AI Scientist,' realized as a modular multi-agent system coordinating problem framing, threat modeling, tool generation, experimentation, evaluation, governance, and reporting, anchored in a four-zeros frame (risk, trust, incident, energy). The paper separates the object definition from any specific implementation and offers an agenda focused on AI-native defense, where perimeters yield to resilient agent legions and terminal security becomes agent security.

Significance. If the distinctions hold and the proposed architecture is realized and validated, the work could establish a foundational conceptual framework for specialized AI systems in dynamic cybersecurity research, guiding future benchmarks, empirical programs, and multi-agent designs in AI-native defense. The paper explicitly separates the object from organizational realizations, which is a strength for enabling multiple implementations.

major comments (2)
  1. [Abstract] Abstract: The central claim that the three listed differences (event/trace units, non-stationary substrate, digital-twin evaluation) entail a 'fundamental categorical difference' requiring a wholly new specialized multi-agent architecture is asserted without analysis showing why targeted adaptations to existing AI scientist systems (e.g., dynamic tool interfaces or range-based evaluators) would be insufficient. This assumption is load-bearing for introducing the new object and the necessity of the proposed realization.
  2. [Abstract] Abstract: The four-zeros frame is introduced as anchoring concrete objectives, but no derivation, mapping, or example is provided showing how it operationalizes the three domain differences or coordinates the listed agent roles (problem framing through reporting).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments. We agree that both points identify areas where the manuscript requires additional analysis and elaboration to fully support its claims. We outline targeted revisions below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the three listed differences (event/trace units, non-stationary substrate, digital-twin evaluation) entail a 'fundamental categorical difference' requiring a wholly new specialized multi-agent architecture is asserted without analysis showing why targeted adaptations to existing AI scientist systems (e.g., dynamic tool interfaces or range-based evaluators) would be insufficient. This assumption is load-bearing for introducing the new object and the necessity of the proposed realization.

    Authors: We agree that the manuscript asserts the categorical distinction on the basis of the three differences without providing an explicit comparative analysis against possible adaptations of existing AI-scientist systems. This is a substantive gap. In the revised version we will insert a new subsection (placed after the domain-differences argument) that (a) briefly reviews representative existing AI-scientist architectures, (b) examines whether dynamic tool interfaces or range-based evaluators could be extended to accommodate non-stationary substrates and event/trace units, and (c) identifies concrete limitations that remain even after such extensions. This addition will make the necessity claim evidence-based rather than asserted. revision: yes

  2. Referee: [Abstract] Abstract: The four-zeros frame is introduced as anchoring concrete objectives, but no derivation, mapping, or example is provided showing how it operationalizes the three domain differences or coordinates the listed agent roles (problem framing through reporting).

    Authors: We accept that the four-zeros frame is introduced without derivation, explicit mapping to the three domain differences, or an illustrative example of coordination across agent roles. In revision we will add a dedicated subsection that (1) derives each zero (risk, trust, incident, energy) from the non-stationary, event-trace, and digital-twin characteristics; (2) provides a table mapping each zero to the seven agent roles; and (3) walks through a short worked example of how the frame shapes objective setting and hand-offs in a representative threat-modeling-to-reporting workflow. These additions will make the frame operational rather than declarative. revision: yes

Circularity Check

0 steps flagged

No significant circularity; conceptual definition from domain properties

full rationale

The paper's central move is to list three domain properties of cybersecurity (recurring units as security events/traces, non-stationary substrate, evaluation via digital twins/ranges) and from them define a new object called the 'Cybersecurity AI Scientist,' then propose a multi-agent architecture as one realization. This is a definitional framing, not a derivation chain that reduces by construction to its own inputs. No equations, fitted parameters, or predictions appear. No self-citations are invoked as load-bearing justification for the categorical difference or uniqueness. The architecture is presented as a practical response to the defined object rather than derived from prior results by the same authors. The absence of external benchmarks for the 'fundamental categorical difference' claim is a matter of argumentative strength, not circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the authors' characterization of cybersecurity as categorically non-stationary and on the introduction of a new named scientific object without independent evidence or derivation from prior results.

axioms (1)
  • domain assumption Cybersecurity research units (security events and interaction traces) and tool substrates are non-stationary in a way that distinguishes them categorically from stable scientific domains.
    Invoked in the abstract to justify the need for a distinct Cybersecurity AI Scientist object.
invented entities (1)
  • Cybersecurity AI Scientist no independent evidence
    purpose: A modular, role-specialized multi-agent research system coordinating problem framing, threat modeling, tool generation, experimentation, evaluation, governance, and reporting for cybersecurity.
    Newly defined object introduced to organize the proposed agenda; no prior existence or external falsifiable evidence is provided.

pith-pipeline@v0.9.1-grok · 5751 in / 1522 out tokens · 39229 ms · 2026-06-30T05:40:18.473164+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 30 canonical work pages · 7 internal anchors

  1. [1]

    Project glasswing: Securing critical software for the ai era.https://www.anthropi c.com/glasswing, 2026

    Anthropic. Project glasswing: Securing critical software for the ai era.https://www.anthropi c.com/glasswing, 2026. Anthropic announcement, April 7, 2026; introduces Claude Mythos Preview as the frontier model powering Project Glasswing

  2. [2]

    Expanding project glasswing.https://www.anthropic.com/news/expanding-p roject-glasswing, 2026

    Anthropic. Expanding project glasswing.https://www.anthropic.com/news/expanding-p roject-glasswing, 2026. Anthropic news, 2026; expansion of Claude Mythos Preview access to critical-infrastructure partners

  3. [3]

    An AI system to help scientists write expert-level empirical software

    Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Renee Johnston Anton Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulan...

  4. [4]

    Opensec: Measuring incident response agent calibration under adversarial evidence, 2026

    Jarrod Barnes. Opensec: Measuring incident response agent calibration under adversarial evidence, 2026. URLhttps://arxiv.org/abs/2601.21083

  5. [5]

    Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models, 2024

    Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models, 2024. URLhttps://arxiv.org/abs/2404.13161

  6. [6]

    Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

    Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, 12 Zhi Chen, and Wanxiang Che. Ai4research: A survey of artificial intelligence for scientific research, 2025. URLhttps://arxiv.org/abs/2507.01903

  7. [7]

    Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

    Alankrit Chona, Igor Kozlov, and Ambuj Kumar. Cyber defense benchmark: Agentic threat hunting evaluation for llms in secops, 2026. URLhttps://arxiv.org/abs/2604.19533

  8. [8]

    Multi-agent penetration testing ai for the web, 2025

    Isaac David et al. Multi-agent penetration testing ai for the web, 2025. URLhttps://arxiv. org/abs/2508.20816

  9. [9]

    Darpa ai cyber challenge aims to secure nation’s most critical software.https://www.darpa.mil/news/2023/ai-cyber-challenge-software,

    Defense Advanced Research Projects Agency. Darpa ai cyber challenge aims to secure nation’s most critical software.https://www.darpa.mil/news/2023/ai-cyber-challenge-software,

  10. [10]

    DARPA News, August 9, 2023

  11. [11]

    Ai cyber challenge marks pivotal inflection point for cyber defense.https://www.darpa.mil/news/2025/aixcc-results, 2025

    Defense Advanced Research Projects Agency. Ai cyber challenge marks pivotal inflection point for cyber defense.https://www.darpa.mil/news/2025/aixcc-results, 2025. DARPA News, August 8, 2025

  12. [12]

    Pentestgpt: An llm-empowered automatic penetration testing tool, 2023

    Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: An llm-empowered automatic penetration testing tool, 2023. URLhttps://arxiv.org/abs/2308.06782

  13. [13]

    What makes a good llm agent for real-world penetration testing?, 2026

    Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good llm agent for real-world penetration testing?, 2026. URL https://arxiv.org/abs/2602.17622

  14. [14]

    LLM Agents can Autonomously Exploit One-day Vulnerabilities

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities, 2024. URLhttps://arxiv.org/abs/2404.08144

  15. [15]

    Measuring ai agents’ progress on multi-step cyber attack scenarios,

    Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, and Jessica Wang. Measuring ai agents’ progress on multi-step cyber attack scenarios,

  16. [16]

    URLhttps://arxiv.org/abs/2603.11214

  17. [17]

    Ai-driven guided response for security operation centers with microsoft copilot for security, 2024

    Scott Freitas, Jovan Kalajdjieski, Amir Gharib, and Rob McCann. Ai-driven guided response for security operation centers with microsoft copilot for security, 2024. URLhttps://arxiv. org/abs/2407.09017

  18. [18]

    E.et al.A multi-agent system for automating scientific discovery.Nature https://doi.org/10.1038/s41586-026-10652-y (2026) doi:10.1038/s41586-026-10652-y

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Dmytro Shved, Gavin J. Gyimesi, Jon M. Laurent, Samantha M. Wright, Muhammed T. Razzak, Andrew D. White, Silvia C. Finnemann, Michaela M. Hinks, and Samuel G. Rodriques. A multi- agent system for automating scientific discovery.Nature, 2026. doi: 10.1038/s41586-026...

  19. [19]

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Petar Sirkovic, Artiom Myaskovsky, Grzegorz Glowaty, Felix Weissenberger, Alessio Orlandi, Dan Popovici, Anil Palepu, Keran Rong, Ryutaro Tanno, Khaled Saab, Fan Zhang, Jacob Blum, Andrew Carroll, Kavita Kulkarni, Nenad Tomašev, Dina Zverinski, Ivor Rendulic, Elahe Vedadi, Florian Hasler, Luka Riman...

  20. [20]

    Benchmarking practices in llm-driven offensive security: Testbeds, metrics, and experiment design, 2025

    Andreas Happe and Jürgen Cito. Benchmarking practices in llm-driven offensive security: Testbeds, metrics, and experiment design, 2025. URLhttps://arxiv.org/abs/2504.10112

  21. [21]

    Pengfei He, Ash Fox, Lesly Miculicich, Stefan Friedli, Daniel Fabian, Burak Gokturk, Jiliang Tang, Chen-Yu Lee, Tomas Pfister, and Long T. Le. Co-redteam: Orchestrated security discovery and exploitation with llm agents, 2026. URLhttps://arxiv.org/abs/2602.02164

  22. [22]

    Security of ai agents, 2024

    Yifeng He, Ethan Wang, Yuyang Rong, Zifei Cheng, and Hao Chen. Security of ai agents, 2024. URLhttps://arxiv.org/abs/2406.08689

  23. [23]

    A survey of agentic ai and cybersecurity: Challenges, opportunities and use-case prototypes, 2026

    Sahaya Jestus Lazer, Kshitiz Aryal, Maanak Gupta, and Elisa Bertino. A survey of agentic ai and cybersecurity: Challenges, opportunities and use-case prototypes, 2026. URLhttps: //arxiv.org/abs/2601.05293

  24. [24]

    Lin et al

    Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, and Daniel E. Ho. Comparing ai agents to cybersecurity professionals in real-world penetration testing, 2025. URLhttps://arxiv.org/abs/2512.09882

  25. [25]

    Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, and Yu Feng. Synthesizing multi-agent harnesses for vulnerability discovery, 2026. URL https://arxiv.org/abs/2604.20801

  26. [26]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery, 2024. URLhttps: //arxiv.org/abs/2408.06292

  27. [27]

    Llm4sr: A survey on large language models for scientific research, 2025

    Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, and Xinya Du. Llm4sr: A survey on large language models for scientific research, 2025. URLhttps://arxiv.org/abs/2501.04306

  28. [28]

    Hacksynth: Llm agent and evaluation framework for autonomous penetration testing, 2024

    Lajos Muzsai, David Imolai, and András Lukács. Hacksynth: Llm agent and evaluation framework for autonomous penetration testing, 2024. URLhttps://arxiv.org/abs/2412.0 1778

  29. [29]

    Skarlinski, Sam Cox, Jon M

    Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URLhttps://arxiv.org/ abs/2409.13740

  30. [30]

    A survey of ai scientists, 2025

    Guiyao Tie, Pan Zhou, and Lichao Sun. A survey of ai scientists, 2025. URL https: //arxiv.org/abs/2510.23045

  31. [31]

    A summer of security: empowering cyber defenders with ai.https://blog.goo gle/innovation-and-ai/technology/safety-security/cybersecurity-updates-summe r-2025/, 2025

    Kent Walker. A summer of security: empowering cyber defenders with ai.https://blog.goo gle/innovation-and-ai/technology/safety-security/cybersecurity-updates-summe r-2025/, 2025. Google Blog, July 15, 2025

  32. [32]

    Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models, 2024

    Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, and Joshua Saxe. Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models, 2024. URLhttps://arxiv.org/abs/2408.01605. 14

  33. [33]

    The digital cybersecurity expert: How far have we come?, 2025

    Dawei Wang, Geng Zhou, Xianglong Li, Yu Bai, Li Chen, Ting Qin, Jian Sun, and Dan Li. The digital cybersecurity expert: How far have we come?, 2025. URLhttps://arxiv.org/ab s/2504.11783

  34. [34]

    ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

    Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, Kurt Thomas, Yan Shoshitaishvili, Wenbo Guo, Jingxuan He, Thorsten Holz, and Dawn Song. Exploitgym: Can ai agents turn security vulnerabilities into real attacks?, 2026. URLhttps://arxiv.org/abs/2605.11086

  35. [35]

    Cybergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale, 2025

    Zhun Wang et al. Cybergym: Evaluating ai agents’ real-world cybersecurity capabilities at scale, 2025. URLhttps://arxiv.org/abs/2506.02548

  36. [36]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URLhttps://arxiv.org/abs/2504.08066

  37. [37]

    Llama-3.1-foundationai-securityllm-reasoning-8b technical report, 2026

    Zhuoran Yang et al. Llama-3.1-foundationai-securityllm-reasoning-8b technical report, 2026. URLhttps://arxiv.org/abs/2601.21051

  38. [38]

    Teams of llm agents can exploit zero-day vulnerabilities, 2024

    Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of llm agents can exploit zero-day vulnerabilities, 2024. URLhttps: //arxiv.org/abs/2406.01637

  39. [39]

    Cve-bench: A benchmark for ai agents’ ability to exploit real-world web application vulnerabilities, 2025

    Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. Cve-bench: A benchmark for ai agents’ ability to exploit real-world web application vulnerabilities, 2025. URLhttps://arxiv.org/abs/2503 .17332. 15