pith. machine review for the scientific record.

arxiv: 2604.05719 · v1 · submitted 2026-04-07 · 💻 cs.CR · cs.AI · cs.SE

Recognition: 1 theorem link · Lean Theorem

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords LLM-based AutoPT · Automated Penetration Testing · Systematization of Knowledge · Empirical Evaluation · Agent Architecture · Cybersecurity · Benchmarking

The pith

LLM-based automated penetration testing frameworks are classified across six design dimensions and compared on a unified benchmark for the first time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to bring structure to the scattered research on LLM-powered tools for automated hacking by mapping their internal designs and testing many of them side by side. A reader would care because new frameworks appear rapidly, yet it remains unclear which architectural choices actually produce reliable autonomous attacks. The work reviews 15 frameworks through the lenses of agent architecture, planning, memory, execution, external knowledge, and benchmarks, then runs large-scale experiments on a shared test suite. Over 10 billion tokens were processed and more than 1,500 logs were examined by a panel of experts, producing both a taxonomy and performance data. This supplies a common reference point for judging current capabilities and identifying concrete gaps in end-to-end autonomy.

Core claim

We deliver the first Systematization of Knowledge on LLM-based AutoPT frameworks by reviewing existing designs across six dimensions and by executing large-scale empirical comparisons of 13 open-source frameworks plus two baselines on a unified benchmark, with all logs manually reviewed over four months by more than 15 cybersecurity experts.

What carries the argument

The six-dimensional taxonomy of agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks, which both organizes the review of framework designs and structures the unified empirical evaluation.
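To make the taxonomy concrete, here is a minimal sketch of a framework card structured along the six dimensions. All field names and example values are illustrative assumptions by Pith, not identifiers drawn from the paper; the paper's own framework cards are the authoritative source.

```python
from dataclasses import dataclass, field

@dataclass
class FrameworkCard:
    """One AutoPT framework described along the six SoK dimensions.

    Every field name and example value here is hypothetical.
    """
    name: str
    architecture: str                    # e.g. "single-agent" or "multi-agent"
    plan: str                            # planning strategy, e.g. phased sub-goals
    memory: str                          # e.g. rolling window vs. summarized history
    execution: str                       # e.g. sandboxed shell, browser, MCP tools
    external_knowledge: str              # e.g. dense / sparse / tool-based retrieval
    benchmarks: list[str] = field(default_factory=list)

# Hypothetical example entry; values do not come from the paper.
example = FrameworkCard(
    name="ExampleAutoPT",
    architecture="single-agent",
    plan="phased sub-goal decomposition",
    memory="rolling conversation window",
    execution="sandboxed shell tools",
    external_knowledge="tool-based retrieval",
    benchmarks=["unified CTF benchmark"],
)
```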

If this is right

  • New frameworks can adopt the six-dimensional taxonomy to deliberately improve weak areas such as memory retention or execution reliability.
  • The published benchmark and logs enable direct, reproducible comparisons for any future AutoPT system (a minimal scoring sketch follows this list).
  • Identified limitations in current agent plans and external knowledge use point to specific research targets for increasing end-to-end success rates.
  • The scale of token consumption documented in the experiments supplies practical guidance on the computational cost of deploying these agents.
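
One way to read the reproducible-comparisons point: given per-challenge execution logs, per-framework success rates reduce to a small aggregation. The log schema below (framework, challenge, success) is an assumed toy format; the paper does not publish its schema in the text above.

```python
from collections import defaultdict

# Hypothetical log records; the paper's actual log format is not shown here.
logs = [
    {"framework": "CyberStrike", "challenge": "web-01", "success": True},
    {"framework": "CyberStrike", "challenge": "pwn-03", "success": False},
    {"framework": "Tinyctfer",   "challenge": "web-01", "success": True},
]

def success_rates(records):
    """Fraction of solved challenges per framework."""
    solved, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["framework"]] += 1
        solved[r["framework"]] += bool(r["success"])
    return {fw: solved[fw] / total[fw] for fw in total}

print(success_rates(logs))  # e.g. {'CyberStrike': 0.5, 'Tinyctfer': 1.0}
```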

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread use of the benchmark could reduce duplicated effort by letting research groups measure incremental gains against a shared baseline.
  • The same review-plus-unified-test method could be applied to LLM agents in other security-sensitive tasks such as malware analysis or vulnerability disclosure.
  • If open-source results remain modest, closed-source commercial tools may need separate evaluation to determine whether they close the autonomy gap.

Load-bearing premise

The 13 open-source frameworks, the two baselines, and the chosen unified benchmark are representative enough of the field that the architectural patterns and performance observations generalize.

What would settle it

A later study that adds several omitted frameworks, re-runs the identical benchmark, and finds substantially different success rates or entirely new architectural patterns would show the original selection was not representative.

Figures

Figures reproduced from arXiv: 2604.05719 by Chang You, Chao Zhang, Cheng Huang, Fan Shi, Hanlin Sun, Haoran Ou, Hongda Sun, Jiancheng Zhang, Jianguo Zhao, Jiaren Peng, Junyi Liu, Kunshu Song, Renyang Liu, Rui Yan, Shuqiao Zhang, Xuan Tian, Yan Wang, Yuqiang Sun, Yutong Jiao, Zeqin Li.

Figure 1
Figure 1. The systematization framework of AutoPT. The upper section aligns with the traditional PT lifecycle; the lower section deconstructs the AutoPT architecture into six core dimensions: Agent Architecture, Agent Plan, Agent Memory, Agent Execution, External Knowledge, and Benchmarks. view at source ↗
Figure 2
Figure 2. Taxonomy of planning strategies in AutoPT frameworks. view at source ↗
Figure 3
Figure 3. Illustration of various retrieval strategies, highlighting three primary mechanisms: dense retrieval based on vector embeddings, sparse retrieval focusing on exact entity keyword matching, and tool-based retrieval, which delegates retrieval autonomy to the LLM by selecting predefined testing methodologies from a library. (A minimal sketch of these three mechanisms follows the figure list.) view at source ↗
Figure 4
Figure 4. Tool usage distribution and call volume across frameworks by challenge difficulty. view at source ↗
Figure 5
Figure 5. Tool call distribution of major tools across backbone LLMs and frameworks. view at source ↗
Figure 6
Figure 6. Tool call composition and configuration effects in CyberStrike. view at source ↗
Figure 7
Figure 7. LLM calls and token consumption of each framework on successfully compromised Easy and Medium challenges. view at source ↗
Figure 8
Figure 8. Workflow of the CTFSOLVER framework. view at source ↗
Figure 9
Figure 9. Framework card of CTFSOLVER. view at source ↗
Figure 10
Figure 10. Workflow of the LuaN1ao framework. view at source ↗
Figure 11
Figure 11. Framework card of LuaN1ao. view at source ↗
Figure 12
Figure 12. Workflow of the Tinyctfer framework. view at source ↗
Figure 13
Figure 13. Framework card of Tinyctfer. view at source ↗
Figure 14
Figure 14. Workflow of the XBow-Comp framework. view at source ↗
Figure 15
Figure 15. Framework card of XBow-Competition. view at source ↗
Figure 16
Figure 16. Workflow of the Cruiser framework. view at source ↗
Figure 17
Figure 17. Framework card of Cruiser. view at source ↗
Figure 18
Figure 18. Workflow of the CHYing framework. view at source ↗
Figure 19
Figure 19. Framework card of CHYing. view at source ↗
Figure 20
Figure 20. Workflow of the SickHackShark framework. view at source ↗
Figure 21
Figure 21. Framework card of SickHackShark. view at source ↗
Figure 22
Figure 22. Workflow of the newmapta framework. view at source ↗
Figure 23
Figure 23. Framework card of newmapta. view at source ↗
Figure 24
Figure 24. Workflow of the sub-agent framework. view at source ↗
Figure 25
Figure 25. Framework card of sub-agent. view at source ↗
Figure 26
Figure 26. Workflow of the CyberStrike framework. view at source ↗
Figure 27
Figure 27. Framework card of CyberStrike. view at source ↗
Figure 28
Figure 28. Workflow of the H-Pentest framework. view at source ↗
Figure 29
Figure 29. Framework card of H-Pentest. view at source ↗
Figure 30
Figure 30. The prompt of baseline-kimi and baseline-cc. view at source ↗
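
As promised above, a minimal sketch of the three retrieval mechanisms from Figure 3, over a toy methodology library. The bag-of-words "embedding" and the library entries are stand-in assumptions; a real framework would use an embedding model and a proper index.

```python
from collections import Counter

LIBRARY = {
    "sqli-methodology": "test inputs for SQL injection using quote and union payloads",
    "priv-esc-linux": "enumerate SUID binaries and kernel version for privilege escalation",
}

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def dense_retrieve(query, library):
    """Dense retrieval: rank documents by (toy) vector similarity."""
    q = embed(query)
    def sim(doc):
        d = embed(doc)
        return sum(q[w] * d[w] for w in q)
    return max(library, key=lambda k: sim(library[k]))

def sparse_retrieve(query, library):
    """Sparse retrieval: exact keyword overlap between query and document."""
    terms = set(query.lower().split())
    return max(library, key=lambda k: len(terms & set(library[k].lower().split())))

def tool_based_retrieve(llm_choice, library):
    """Tool-based retrieval: the LLM itself names a predefined methodology."""
    return library.get(llm_choice, "no such methodology")

print(dense_retrieve("union based sql injection", LIBRARY))   # sqli-methodology
print(sparse_retrieve("SUID privilege escalation", LIBRARY))  # priv-esc-linux
```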
read the original abstract

The rapid advancement of Large Language Models (LLMs) has created new opportunities for Automated Penetration Testing (AutoPT), spawning numerous frameworks aimed at achieving end-to-end autonomous attacks. However, despite the proliferation of related studies, existing research generally lacks systematic architectural analysis and large-scale empirical comparisons under a unified benchmark. Therefore, this paper presents the first Systematization of Knowledge (SoK) focusing on the architectural design and comprehensive empirical evaluation of current LLM-based AutoPT frameworks. At systematization level, we comprehensively review existing framework designs across six dimensions: agent architecture, agent plan, agent memory, agent execution, external knowledge, and benchmarks. At empirical level, we conduct large-scale experiments on 13 representative open-source AutoPT frameworks and 2 baseline frameworks utilizing a unified benchmark. The experiments consumed over 10 billion tokens in total and generated more than 1,500 execution logs, which were manually reviewed and analyzed over four months by a panel of more than 15 researchers with expertise in cybersecurity. By investigating the latest progress in this rapidly developing field, we provide researchers with a structured taxonomy to understand existing LLM-based AutoPT frameworks and a large-scale empirical benchmark, along with promising directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to deliver the first Systematization of Knowledge (SoK) on LLM-based Automated Penetration Testing (AutoPT) frameworks. It reviews designs across six dimensions (agent architecture, plan, memory, execution, external knowledge, and benchmarks), then performs large-scale experiments on 13 open-source frameworks plus two baselines under a unified benchmark. The experiments used >10B tokens, produced >1,500 logs that were manually reviewed by >15 experts over four months, and yielded a taxonomy, performance insights, identified limitations, and future research directions.

Significance. If the taxonomy is comprehensive and the empirical findings are reproducible, the work would provide a valuable structured reference and benchmark for the emerging intersection of LLMs and automated security testing. The scale of the token usage and multi-expert review are positive signals of effort, but the absence of reproducibility safeguards on the manual analysis limits the strength of the empirical contribution.

major comments (2)
  1. [Empirical Evaluation] Empirical Evaluation section (and associated methodology description): The central empirical claims rest on manual review of >1,500 execution logs by >15 researchers. No inter-rater reliability metrics (e.g., Cohen’s kappa or Fleiss’ kappa), annotation guidelines, disagreement-resolution protocol, or bias-mitigation steps are reported. This is load-bearing because the reported framework rankings, failure-mode distributions, and limitation insights are derived directly from these subjective judgments; without these details the evaluation cannot be considered reproducible or objective.
  2. [Framework Selection] Framework Selection subsection: The justification for choosing exactly these 13 open-source frameworks (plus the two baselines) as “representative” is not sufficiently detailed. The paper must show that the selection criteria (e.g., GitHub stars, recency, architectural diversity) were applied systematically and that excluded frameworks would not materially alter the taxonomy or performance conclusions.
minor comments (2)
  1. [Systematization] The six-dimensional taxonomy is introduced in the abstract and systematization section but the mapping of individual frameworks to each dimension is not presented in a single, easily scannable table; adding such a summary table would improve readability.
  2. [Experiments] Token-count and log-volume statistics are given in aggregate; breaking them down by framework (or at least by category) would allow readers to assess whether computational cost correlates with reported performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing honest responses and committing to revisions where the comments identify gaps in the current version. Our goal is to strengthen the reproducibility and clarity of the work without misrepresenting the original contributions.

read point-by-point responses
  1. Referee: [Empirical Evaluation] Empirical Evaluation section (and associated methodology description): The central empirical claims rest on manual review of >1,500 execution logs by >15 researchers. No inter-rater reliability metrics (e.g., Cohen’s kappa or Fleiss’ kappa), annotation guidelines, disagreement-resolution protocol, or bias-mitigation steps are reported. This is load-bearing because the reported framework rankings, failure-mode distributions, and limitation insights are derived directly from these subjective judgments; without these details the evaluation cannot be considered reproducible or objective.

    Authors: We agree that the absence of explicit details on the manual review process represents a limitation in the current manuscript's reproducibility. The reviews were conducted by a panel of over 15 cybersecurity experts over four months, with logs assigned to multiple reviewers where possible and disagreements resolved through group discussion and consensus. In the revised version, we will expand the methodology description to include: the annotation guidelines used (e.g., standardized failure-mode categories and success criteria), the disagreement-resolution protocol, bias-mitigation steps such as log randomization and independent initial reviews, and any available inter-rater agreement statistics (e.g., pairwise agreement percentages; a minimal sketch of such a kappa computation follows these responses). While we did not compute formal metrics like Cohen’s or Fleiss’ kappa during the original process and cannot retroactively apply them without re-reviewing all logs, we will report the available agreement data and explicitly discuss the subjective nature of the analysis as a limitation. This revision will make the empirical contribution more transparent and objective. revision: yes

  2. Referee: [Framework Selection] Framework Selection subsection: The justification for choosing exactly these 13 open-source frameworks (plus the two baselines) as “representative” is not sufficiently detailed. The paper must show that the selection criteria (e.g., GitHub stars, recency, architectural diversity) were applied systematically and that excluded frameworks would not materially alter the taxonomy or performance conclusions.

    Authors: We acknowledge that the Framework Selection subsection could benefit from greater explicitness regarding the systematic application of criteria. The 13 frameworks were chosen to ensure coverage across the six taxonomy dimensions while prioritizing open-source availability, recency (primarily 2023–2024 releases), and indicators of adoption such as GitHub stars and community activity; the two baselines were included for controlled comparison. In the revision, we will expand this subsection with a clear enumeration of the selection criteria (including thresholds and sources), a table summarizing how each selected framework maps to the taxonomy dimensions, and a discussion of notable excluded frameworks (e.g., those that are closed-source, non-functional, or duplicates of included ones) with arguments that their inclusion would not materially change the taxonomy structure or the high-level performance patterns observed. This will demonstrate the representativeness of the sample without altering the paper's core claims. revision: yes
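
For reference on the agreement statistics discussed in the exchange above: Cohen's kappa for two annotators is kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement rate and p_e the agreement expected by chance. A minimal sketch on hypothetical labels follows; nothing here reflects the paper's actual review data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e the agreement expected from each annotator's label
    frequencies alone.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical success/failure judgments on eight shared logs.
a = ["success", "fail", "fail", "success", "fail", "success", "fail", "fail"]
b = ["success", "fail", "success", "success", "fail", "success", "fail", "fail"]
print(round(cohens_kappa(a, b), 3))  # 0.75 on this toy data
```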

Circularity Check

0 steps flagged

No circularity in SoK derivation or empirical claims

full rationale

The paper performs a literature systematization across six architectural dimensions and runs experiments on 13 external open-source frameworks plus baselines under a unified benchmark. No equations, fitted parameters, predictions, or model derivations appear in the provided text that could reduce to the paper's own inputs by construction. The central claims rest on review of prior work and analysis of independent systems rather than self-referential steps, self-citation chains, or renamed empirical patterns. This matches the default expectation of no significant circularity for a survey-style paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As a systematization paper the central contribution is a new taxonomy and benchmark rather than new theory; it draws frameworks and evaluation practices from existing literature without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption Standard practices of literature review and empirical benchmarking in computer science are sufficient to produce a representative SoK.
    Invoked when claiming the 13 frameworks and unified benchmark cover the field.

pith-pipeline@v0.9.0 · 5582 in / 1176 out tokens · 37702 ms · 2026-05-10T18:48:54.003012+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

138 extracted references · 33 canonical work pages · 7 internal anchors

  1. [1]

    Information security technology: Baseline for classified protection of cybersecurity

    Information security technology: Baseline for classified protection of cybersecurity. https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=BAFB47E8874764186BDB7865E8344DAF, 2019

  2. [2]

    HexStrike AI MCP Agents

    0x4m4. HexStrike AI MCP Agents. https://github.com/0x4m4/hexstrike-ai, 2026

  3. [3]

    EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities

    Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik R Narasimhan, Ramesh Karri, and Ofir Press. EnIGMA: Interactive tools substantially assist LM agents in finding security vulnerabilities...

  4. [4]

    Metasploit penetration testing cookbook

    Monika Agarwal and Abhinav Singh. Metasploit penetration testing cookbook . Packt Publishing Birmingham, 2013

  5. [5]

    Breachseek: A multi-agent automated penetration tester

    Ibrahim Alshehri, Adnan Alshehri, Abdulrahman Almalki, Majed Bamardouf, and Alaqsa Akbar. Breachseek: A multi-agent automated penetration tester. arXiv preprint arXiv:2409.03789 , 2024

  6. [6]

    Introducing the model context protocol

    Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/model-context-protocol, 2024

  7. [7]

    Agent skills

    Anthropic. Agent skills. https://agentskills.io/home, 2025

  8. [8]

    Claude code

    Anthropic. Claude code. https://github.com/anthropics/claude-code, 2026

  9. [9]

    Claude opus 4.6 system card

    Anthropic. Claude opus 4.6 system card. Technical report, Anthropic, 2026

  10. [10]

    Introducing Claude Opus 4.6

    Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026

  11. [11]

    Software penetration testing

    Brad Arkin, Scott Stender, and Gary McGraw. Software penetration testing. IEEE Security & Privacy , 3(1):84–87, 2005

  12. [12]

    Pentest-ai, an llm-powered multi-agents framework for penetration testing automation leveraging mitre attack

    Stanislas G Bianou and Rodrigue G Batogna. Pentest-ai, an llm-powered multi-agents framework for penetration testing automation leveraging mitre attack. In 2024 IEEE International Conference on Cyber Security and Resilience (CSR) , pages 763–770. IEEE, 2024

  13. [13]

    About penetration testing

    Matt Bishop. About penetration testing. IEEE Security & Privacy , 5(6):84–87, 2007

  14. [14]

    Coverage-based greybox fuzzing as markov chain

    Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. Coverage-based greybox fuzzing as markov chain. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security , pages 1032–1043, 2016

  15. [15]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  16. [16]

    The diamond model of intrusion analysis

    Sergio Caltagirone, Andrew Pendergast, and Christopher Betz. The diamond model of intrusion analysis. 2013

  17. [17]

    picoCTF

    Carnegie Mellon University. picoCTF. https://picoctf.org/, 2026

  18. [18]

    Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657 , 2025

  19. [19]

    tinyctfer

    chainreactors. tinyctfer. https://github.com/chainreactors/tinyctfer, 2026

  20. [20]

    RedTeamLLM: An agentic AI framework for offensive security

    Brian Challita and Pierre Parrend. Redteamllm: an agentic ai framework for offensive security. arXiv preprint arXiv:2505.06913 , 2025

  21. [21]

    Under the hoodie: Lessons from a season of penetration testing, 2018

    Rapid7 Global Consulting. Under the hoodie: Lessons from a season of penetration testing, 2018

  22. [22]

    The growing importance of exposure management: Key insights from gartner hype cycle for security operations 2024

    Jamie Cowper. The growing importance of exposure management: Key insights from gartner hype cycle for security operations 2024. https://www.rapid7.com/blog/post/2024/09/13/the-growing-importance-of-exposure-management-our-key-insights-from-gartner-r-hype-cycle-for-security-operations-2024/, 2024

  23. [23]

    crewAI: Fast and Flexible Multi-Agent Automation Framework

    CrewAI. crewAI: Fast and Flexible Multi-Agent Automation Framework. https://github.com/crewaiinc/crewai, 2026

  24. [24]

    CVE: Common Vulnerabilities and Exposures

    CVE Program. CVE: Common Vulnerabilities and Exposures. https://www.cve.org/, 2026

  25. [25]

    Refpentester: A knowledge-informed self-reflective penetration testing framework based on large language models

    Hanzheng Dai, Yuanliang Li, Jun Yan, and Zhibo Zhang. Refpentester: A knowledge-informed self-reflective penetration testing framework based on large language models. arXiv preprint arXiv:2505.07089, 2025

  26. [26]

    Multi-agent penetration testing AI for the web

    Isaac David and Arthur Gervais. Multi-agent penetration testing ai for the web. arXiv preprint arXiv:2508.20816 , 2025

  27. [27]

    What makes a good llm agent for real-world penetration testing?, 2026

    Gelei Deng, Yi Liu, Yuekang Li, Ruozhao Yang, Xiaofei Xie, Jie Zhang, Han Qiu, and Tianwei Zhang. What makes a good llm agent for real-world penetration testing?, 2026

  28. [28]

    PentestGPT: Evaluating and harnessing large language models for automated penetration testing

    Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, 2024

  29. [29]

    Cyberstrikeai

    Ed1s0nZ. Cyberstrikeai. https://github.com/Ed1s0nZ/CyberStrikeAI, 2026

  30. [30]

    Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector (DORA), 2022

    European Parliament and Council of the European Union. Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector (DORA), 2022. Official Journal of the European Union, L 333/1. Accessed: 2026-04-04

  31. [31]

    A survey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491–6501, 2024

  32. [32]

    Llm agents can autonomously exploit one-day vulnerabilities, 2024

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities, 2024

  33. [33]

    LLM agents can autonomously hack websites, 2024

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. Llm agents can autonomously hack websites. arXiv preprint arXiv:2402.06664 , 2024

  34. [34]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 , 2(1):32, 2023

  35. [35]

    PentestAgent

    GH05TCREW. PentestAgent. https://github.com/GH05TCREW/PentestAgent , 2026

  36. [36]

    Automated Planning: theory and practice

    Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: theory and practice. Elsevier, 2004

  37. [37]

    Autopenbench: A vulnerability testing benchmark for generative agents

    Luca Gioacchini, Alexander Delsanto, Idilio Drago, Marco Mellia, Giuseppe Siracusano, and Roberto Bifulco. Autopenbench: A vulnerability testing benchmark for generative agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1615–1624, 2025

  38. [38]

    Gemini 3.1 Pro

    Google DeepMind. Gemini 3.1 Pro. https://deepmind.google/models/gemini/pro/, 2026

  39. [39]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  40. [40]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yifan Wu, YK Li, et al. Deepseek-coder: when the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  41. [41]

    Hack The Box

    Hack The Box. Hack The Box. https://www.hackthebox.com/, 2026

  42. [42]

    Hacking articles

    Hacking Articles. Hacking articles. https://www.hackingarticles.in/, 2026

  43. [43]

    Getting pwn'd by AI: Penetration testing with large language models

    Andreas Happe and Jürgen Cito. Getting pwn'd by AI: Penetration testing with large language models. In Proceedings of the 31st ACM joint european software engineering conference and symposium on the foundations of software engineering, pages 2082–2086, 2023

  44. [44]

    Can llms hack enterprise networks? autonomous assumed breach penetration-testing active directory networks

    Andreas Happe and Jürgen Cito. Can llms hack enterprise networks? autonomous assumed breach penetration-testing active directory networks. ACM Transactions on Software Engineering and Methodology , 2025

  45. [45]

    On the surprising efficacy of llms for penetration-testing

    Andreas Happe and Jürgen Cito. On the surprising efficacy of llms for penetration-testing. arXiv preprint arXiv:2507.00829, 2025

  46. [46]

    Got root? a linux priv-esc benchmark, 2024

    Andreas Happe and Jürgen Cito. Got root? a linux priv-esc benchmark, 2024

  47. [47]

    Llms as hackers: Autonomous linux privilege escalation attacks

    Andreas Happe, Aaron Kaplan, and Juergen Cito. Llms as hackers: Autonomous linux privilege escalation attacks. arXiv preprint arXiv:2310.11409, 2023

  48. [48]

    Autopentest: Enhancing vulnerability management with autonomous llm agents

    Julius Henke. Autopentest: Enhancing vulnerability management with autonomous llm agents. arXiv preprint arXiv:2505.10321 , 2025

  49. [49]

    H-pentest

    hexian2001. H-pentest. https://github.com/hexian2001/H-Pentest, 2026

  50. [50]

    Context rot: How increasing input tokens impacts llm performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance. https://research.trychroma.com/context-rot, retrieved October 20, 2025

  51. [51]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations, 2023

  52. [52]

    Penheal: A two-stage llm framework for automated pentesting and optimal remediation

    Junjie Huang and Quanyan Zhu. Penheal: A two-stage llm framework for automated pentesting and optimal remediation. In Proceedings of the workshop on autonomous cybersecurity, pages 11–22, 2023

  53. [53]

    Qwen2.5-Coder technical report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  54. [54]

    newmapta

    HUST-JYHLab. newmapta. https://github.com/HUST-JYHLab/newmapta, 2026

  55. [55]

    Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains

    Eric M Hutchins, Michael J Cloppert, Rohan M Amin, et al. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues in Information Warfare & Security Research , 1(1):80, 2011

  56. [56]

    Towards automated penetration testing: Introducing llm benchmark, analysis, and improvements

    Isamu Isozaki, Manil Shrestha, Rick Console, and Edward Kim. Towards automated penetration testing: Introducing llm benchmark, analysis, and improvements. In Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, pages 404–419, 2025

  57. [57]

    Measuring and augmenting large language models for solving capture-the-flag challenges

    Zimo Ji, Daoyuan Wu, Wenyuan Jiang, Pingchuan Ma, Zongjie Li, and Shuai Wang. Measuring and augmenting large language models for solving capture-the-flag challenges. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 603–617, 2025

  58. [58]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys , 55(12):1–38, 2023

  59. [59]

    Sok: Agentic skills – beyond tool use in llm agents, 2026

    Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. Sok: Agentic skills – beyond tool use in llm agents, 2026

  60. [60]

    SWE-bench: Can language models resolve real-world github issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations , 2024

  61. [61]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages 6769–6781, 2020

  62. [62]

    Metasploit: the penetration tester’s guide

    David Kennedy, Jim O’gorman, Devon Kearns, and Mati Aharoni. Metasploit: the penetration tester’s guide . No Starch Press, 2011

  63. [63]

    Pentest-r1: Towards autonomous penetration testing reasoning optimized via two-stage reinforcement learning

    He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Hui Li, and Tong Li. Pentest-r1: Towards autonomous penetration testing reasoning optimized via two-stage reinforcement learning. arXiv preprint arXiv:2508.07382, 2025

  64. [64]

    Vulnbot: Autonomous penetration testing for a multi-agent collaborative framework

    He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, and Bingzhen Wu. Vulnbot: Autonomous penetration testing for a multi-agent collaborative framework. arXiv preprint arXiv:2501.13411, 2025

  65. [65]

    Se perspective on llms: Biases in code generation, code interpretability, and code security risks

    Rrezarta Krasniqi, Depeng Xu, and Marco Vieira. Se perspective on llms: Biases in code generation, code interpretability, and code security risks. ACM Computing Surveys, 58(5):1–16, 2025

  66. [66]

    LangGraph: Low-level orchestration framework for building stateful agents

    LangChain. LangGraph: Low-level orchestration framework for building stateful agents. https://github.com/langchain-ai/langgraph, 2026

  67. [67]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics , 12:157–173, 2024

  68. [68]

    Pacebench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities.ArXiv, abs/2510.11688, oct 2025

    Zicheng Liu, Lige Huang, Jie Zhang, Dongrui Liu, Yuan Tian, and Jing Shao. Pacebench: A framework for evaluating practical ai cyber-exploitation capabilities. arXiv preprint arXiv:2510.11688 , 2025

  69. [69]

    Large language model agent: A survey

    Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, and Ming Zhang. Large language model agent: A survey...

  70. [70]

    xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models

    Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, and Phan The Duy. xoffense: An ai-driven autonomous penetration testing framework with offensive knowledge-enhanced llms and multi agent systems. arXiv preprint arXiv:2509.13021 , 2025

  71. [71]

    xbow-competition

    M-SEC. xbow-competition. https://github.com/m-sec-org/xbow-competition , 2026

  72. [72]

    Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

    Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, and Min Yang. Shell or nothing: Real-world benchmarks and memory-activated agents for automated penetration testing. arXiv preprint arXiv:2509.09207, 2025

  73. [73]

    Graphical user interfaces

    Aaron Marcus. Graphical user interfaces. In Handbook of human-computer interaction, pages 423–440. Elsevier, 1997

  74. [74]

    Cai: An open, bug bounty-ready cybersecurity ai, 2025

    Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, María Sanz-Gómez, Lidia Salas Espejo, Martiño Crespo-Álvarez, Francisco Oca-Gonzalez, Francesco Balassone, Alfonso Glera-Picón, Unai Ayucar-Carbajo, Jon Ander Ruiz-Alcalde, et al. Cai: An open, bug bounty-ready cybersecurity ai. arXiv preprint arXiv:2504.06017, 2025

  75. [75]

    Mike A Merrill, Alexander Glenn Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Jenia Jitsev, Marianna Nezhurina, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, ...

  76. [76]

    CWE-Common Weakness Enumeration

    MITRE Corporation. CWE-Common Weakness Enumeration. https://cwe.mitre.org/, 2026

  77. [77]

    Kimi Code CLI

    Moonshot AI. Kimi Code CLI. https://github.com/MoonshotAI/kimi-cli, 2026

  78. [78]

    Penetration testing and ethical hacking services market size & share analysis - growth trends and forecast (2025 - 2030)

    Mordor Intelligence. Penetration testing and ethical hacking services market size & share analysis - growth trends and forecast (2025 - 2030). https://www.mordorintelligence.com/industry-reports/penetration-testing-and-ethical-hacking-services-market, 2025

  79. [79]

    Hacksynth: LLM agent and evaluation framework for autonomous penetration testing

    Lajos Muzsai, David Imolai, and András Lukács. Hacksynth: Llm agent and evaluation framework for autonomous penetration testing. arXiv preprint arXiv:2412.01778, 2024

  80. [80]

    Rapidpen: Fully automated IP-to-shell penetration testing with LLM-based agents

    Sho Nakatani. Rapidpen: Fully automated ip-to-shell penetration testing with llm-based agents. arXiv preprint arXiv:2502.16730, 2025

Showing first 80 references.