pith. sign in

arxiv: 2605.24949 · v1 · pith:HNFVYY7Rnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI

APT-Agent: Automated Penetration Testing using Large Language Models

Pith reviewed 2026-06-30 00:08 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords penetration testinglarge language modelsautomationhallucination recoverymemory architecturecybersecurityexploitation success
0
0 comments X

The pith

APT-Agent uses a rectification module and command memory to let an LLM complete full penetration testing sequences at 84 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Penetration testing chains reconnaissance, exploitation, and exfiltration steps on complex systems, but LLMs often generate invalid commands or lose track of prior actions. The paper presents APT-Agent as a framework that adds a hybrid rectification step to fix bad commands and a command-specific memory store to keep operational context across sequences. It evaluates the system on Metasploitable 2 against seven services covering web, database, and network protocols. Under matched conditions the approach reaches 84.29 percent end-to-end exploitation success while baselines achieve 48.57 percent and 18.57 percent. This setup reduces the need for constant human oversight in security assessments.

Core claim

APT-Agent is a fully automated LLM-driven penetration testing framework that systematically orchestrates reconnaissance, exploitation, and exfiltration. It introduces a hybrid rectification module to recover hallucinated commands and a command-specific memory architecture to preserve operational context across multi-step attack sequences, achieving an 84.29 percent end-to-end exploitation success rate on Metasploitable 2 against seven vulnerable services.

What carries the argument

The hybrid rectification module paired with command-specific memory architecture, which corrects invalid LLM outputs and retains attack history for sequential decision making.

If this is right

  • Enables more complete automation of reconnaissance through exfiltration phases without constant operator input.
  • Produces higher success rates than both script-based and earlier LLM pen-testing tools on the same target services.
  • Applies across web, database, and network protocol vulnerabilities in a single framework.
  • Lowers the cognitive load on security teams by handling error recovery internally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rectification and memory techniques could transfer to other multi-step security tasks such as vulnerability scanning or incident response.
  • Testing the framework on production networks rather than lab images would reveal how well the modules handle real-world noise and defenses.
  • The architecture suggests that targeted memory designs may improve LLM reliability in other long-horizon technical domains beyond security.

Load-bearing premise

The rectification module reliably fixes hallucinated commands and the memory system preserves enough context for the LLM to finish multi-step sequences without human fixes.

What would settle it

Measure end-to-end success rates on the same seven services after disabling the rectification module or the command memory and compare against the reported 84.29 percent baseline.

Figures

Figures reproduced from arXiv: 2605.24949 by (2) CSIRO Data61), Alsharif Abuadbba (2), Dan Dongseong Kim (1) ((1) University of Queensland, Kristen Moore (2), William Guanting Li (1).

Figure 1
Figure 1. Figure 1: Example of Hallucination Challenge #2: Insufficient long-term contextual memory. Multi-step penetration tests require preserving accurate, ver￾arXiv:2605.24949v1 [cs.CR] 24 May 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Humans carry context across stages, while LLMs repeat outputs and [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of APT-Agent and a CMM, which maintains stage-specific logs of executed commands and outcomes, enabling long-term context retention and adaptive reasoning. Mirroring real-world penetration testing teams, APT-Agent separates strategic reasoning from tactical execution. The Tactic Selection Module acts as the controller, the Command Generation Module issues commands, and the Output Trans￾lation Modu… view at source ↗
Figure 4
Figure 4. Figure 4: reports service-wise results across seven target ser￾vices, with ten independent runs per service. APT-Agent consistently outperforms both baselines across nearly all services. The largest gaps appear on Apache 2.2.8, OpenSSH 4.7, and Telnet, where PentestGPT achieves at most 0–5 successes and Script Kiddie 0–2, while APT-Agent succeeds in 7–9 runs. On services including Apache 2.2.8, UnrealIRCd, and Samba… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation Study: Success Rate [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Rectification Methods Success Rate runs, APT-Agent issued 53 Metasploit module invocations, 16 of which (30.2%) were hallucinated; the Rectification Module recovered 10 of these, yielding a 62.5% in-run correction rate. The CMM’s contribution to efficiency was equally pro￾nounced: on UnrealIRCd, average EXFILTRATE iterations fell from 52 without the CMM to 3 with it, a 17-fold reduction. Together, these re… view at source ↗
Figure 7
Figure 7. Figure 7: LLM Context Comparison: Duplication Rate [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: LLM Context Methods Comparison: Success Rate [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their scale and complexity. Large Language Models (LLMs) offer new opportunities for automating these tasks, but existing approaches face two persistent challenges: hallucination of technical entities and insufficient long-term contextual memory. To address these issues, we present APT-Agent, a fully automated LLM-driven penetration testing framework that systematically orchestrates reconnaissance, exploitation, and exfiltration. APT-Agent introduces a hybrid rectification module to recover hallucinated commands and a command-specific memory architecture to preserve operational context across multi-step attack sequences. We evaluate our APT-Agent on Metasploitable 2 against seven vulnerable services spanning web, database, and network protocols. APT-Agent achieves an 84.29% end-to-end exploitation success rate, compared to 48.57% (Script Kiddie) and 18.57% (PentestGPT) under matched conditions. By reducing cognitive burden and minimizing reliance on human intervention, APT-Agent represents a step toward scalable, reliable, and cognitively efficient automation for penetration testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents APT-Agent, an LLM-based automated penetration testing framework that introduces a hybrid rectification module to recover from hallucinated commands and a command-specific memory architecture to maintain context over multi-step attacks. It evaluates the system on Metasploitable 2 targeting seven vulnerable services and reports an 84.29% end-to-end exploitation success rate, outperforming Script Kiddie (48.57%) and PentestGPT (18.57%) under matched conditions.

Significance. If the empirical results hold under rigorous evaluation, the work would demonstrate a meaningful advance in applying LLMs to security automation by addressing hallucination and context retention, with the explicit baseline comparisons on a fixed target environment providing a clear point of reference for the field.

major comments (1)
  1. [Evaluation] The central empirical claim of 84.29% success rate is presented without any description of the experimental protocol, number of trials, success criteria, variance across runs, or statistical comparison to baselines. This is load-bearing for the headline result and prevents assessment of whether the reported delta is robust or reproducible.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the evaluation requires substantially more detail on the experimental protocol to support the central claims.

read point-by-point responses
  1. Referee: [Evaluation] The central empirical claim of 84.29% success rate is presented without any description of the experimental protocol, number of trials, success criteria, variance across runs, or statistical comparison to baselines. This is load-bearing for the headline result and prevents assessment of whether the reported delta is robust or reproducible.

    Authors: We acknowledge that this comment is correct and that the manuscript as submitted does not contain the requested details. In the revised version we will add a dedicated Experimental Protocol subsection that specifies the number of independent trials, the precise success criteria (end-to-end exploitation resulting in shell access or data exfiltration), observed variance across runs, and statistical comparisons (e.g., appropriate significance tests) to the baselines. This addition will make the 84.29% result reproducible and allow readers to assess its robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical systems contribution: it describes an LLM orchestration framework (hybrid rectification + command-specific memory) and reports measured success rates on Metasploitable 2 against explicit baselines. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim (84.29 % vs. 48.57 % / 18.57 %) is an observable experimental outcome, not a quantity forced by definition or prior self-work. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not detail any free parameters, axioms, or invented entities beyond the described modules.

pith-pipeline@v0.9.1-grok · 5747 in / 1106 out tokens · 40447 ms · 2026-06-30T00:08:53.153043+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1]

    Technical guide to information security testing and assessment,

    National Institute of Standards and Technology, “Technical guide to information security testing and assessment,” U.S. Department of Commerce, Tech. Rep. Special Publication 800-115, Sep. 2008, accessed: September 03, 2025. [Online]. Available: https://csrc.nist.gov/ publications/detail/sp/800-115/final

  2. [2]

    Nautilus: Automated restful api vulnerability detection,

    G. Deng, Z. Zhang, Y . Li, Y . Liu, T. Zhang, Y . Liu, Y . Guo, and D. Wang, “Nautilus: Automated restful api vulnerability detection,” inProceedings of the 32nd USENIX Security Symposium. USENIX Association, 2023

  3. [3]

    Automated progressive red teaming,

    B. Jiang, Y . Jing, T. Shen, T. Wu, Q. Yang, and D. Xiong, “Automated progressive red teaming,”arXiv preprint arXiv:2407.03876, 2024. [Online]. Available: https://arxiv.org/abs/2407.03876

  4. [4]

    Automated penetration testing: An overview,

    F. Abu-Dabaseh and E. Alshammari, “Automated penetration testing: An overview,” inComputer Science & Information Technology (CS & IT), Apr. 2018, pp. 121–129

  5. [6]

    A Survey of Large Language Models

    [Online]. Available: https://arxiv.org/abs/2303.18223

  6. [7]

    Digital twin for smart manufacturing, A review,

    Y . Liu, T. Han, S. Ma, J. Zhang, Y . Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, L. Zhao, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, and B. Ge, “Summary of chatgpt-related research and perspective towards the future of large language models,”Meta- Radiology, vol. 1, no. 2, p. 100017, Sep. 2023. [Online]. Available: http://dx.doi.org/10.1016/j.metra...

  7. [8]

    Exploitflow, cyber security exploitation routes for game theory and ai research in robotics,

    V . Mayoral-Vilches, G. Deng, Y . Liu, M. Pinzger, and S. Rass, “Exploitflow, cyber security exploitation routes for game theory and ai research in robotics,” 2023. [Online]. Available: https: //arxiv.org/abs/2308.02152

  8. [9]

    How well does llm generate security tests?

    Y . Zhang, W. Song, Z. Ji, D. Yao, and N. Meng, “How well does llm generate security tests?”arXiv preprint arXiv:2310.00710, 2023. [Online]. Available: https://arxiv.org/abs/2310.00710

  9. [10]

    Large language models for blockchain security: A systematic literature review,

    Z. He, Z. Li, S. Yang, H. Ye, A. Qiao, X. Zhang, X. Luo, and T. Chen, “Large language models for blockchain security: A systematic literature review,”arXiv preprint arXiv:2403.14280, 2025. [Online]. Available: https://arxiv.org/abs/2403.14280

  10. [11]

    From promise to peril: Rethinking cybersecu- rity red and blue teaming in the age of llms,

    A. Abuadbba, K. Moore, D. Goel, C. Hicks, V . Mavroudis, B. Hasir- cioglu, and P. Jennings, “From promise to peril: Rethinking cybersecu- rity red and blue teaming in the age of llms,”IEEE Security & Privacy, vol. 24, no. 2, pp. 53–63, 2026

  11. [12]

    doi:10.48550/arXiv.2402.06664

    R. Fang, R. Bindu, A. Gupta, Q. Zhan, and D. Kang, “Llm agents can autonomously hack websites,”arXiv preprint arXiv:2402.06664, Feb

  12. [13]

    doi:10.48550/arXiv.2402.06664

    [Online]. Available: https://doi.org/10.48550/arXiv.2402.06664

  13. [14]

    LLM Agents can Autonomously Exploit One-day Vulnerabilities

    R. Fang, R. Bindu, A. Gupta, and D. Kang, “Llm agents can autonomously exploit one-day vulnerabilities,”arXiv preprint arXiv:2404.08144, Apr. 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2404.08144

  14. [15]

    Pentestgpt: An llm-empowered automatic penetration testing tool,

    G. Deng, Y . Liu, V . Mayoral-Vilches, P. Liu, Y . Li, Y . Xu, T. Zhang, Y . Liu, M. Pinzger, and S. Rass, “Pentestgpt: An llm-empowered automatic penetration testing tool,”arXiv preprint arXiv:2308.06782, Aug. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2308. 06782

  15. [16]

    Autoattacker: A large language model guided system to implement automatic cyber-attacks,

    J. Xu, J. W. Stokes, G. McDonald, X. Bai, D. Marshall, S. Wang, A. Swaminathan, and Z. Li, “Autoattacker: A large language model guided system to implement automatic cyber-attacks,”arXiv preprint arXiv:2403.01038, Mar. 2024. [Online]. Available: https: //doi.org/10.48550/arXiv.2403.01038

  16. [17]

    Zhang, O

    M. Zhang, O. Press, W. Merrill, A. Liu, and N. A. Smith, “How language model hallucinations can snowball,” 2023. [Online]. Available: https://arxiv.org/abs/2305.13534

  17. [18]

    Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,

    N. Li, Y . Li, Y . Liu, L. Shi, K. Wang, and H. Wang, “Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2405. 00648

  18. [19]

    SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models,

    P. Manakul, A. Liusie, and M. Gales, “SelfcheckGPT: Zero-resource black-box hallucination detection for generative large language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://openreview.net/forum?id=RwzFNbJ3Ez

  19. [20]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems. Curran Associates Inc., 2023. [Online]. Available: https://arxiv.org/abs/1706.03762

  20. [21]

    Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,

    L. Yang, H. Chen, Z. Li, X. Ding, and X. Wu, “Chatgpt is not enough: Enhancing large language models with knowledge graphs for fact-aware language modeling,”arXiv preprint arXiv:2306.11489, Jun

  21. [22]
  22. [23]

    Efficient interactive fuzzy keyword search,

    S. Ji, G. Li, C. Li, and J. Feng, “Efficient interactive fuzzy keyword search,” inProceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). New York, NY , USA: Association for Computing Machinery, 2009. [Online]. Available: https://doi.org/10.1145/1526709.1526760

  23. [24]

    Metasploitable 2,

    Rapid7, “Metasploitable 2,” 2025, accessed: September 03, 2025. [On- line]. Available: https://docs.rapid7.com/metasploit/metasploitable-2/

  24. [25]

    Llms killed the script kiddie: How agents supported by large language models change the landscape of network threat testing,

    S. Moskal, S. Laney, E. Hemberg, and U.-M. O’Reilly, “Llms killed the script kiddie: How agents supported by large language models change the landscape of network threat testing,”arXiv preprint arXiv:2309.00667, Sep. 2023. [Online]. Available: https: //arxiv.org/abs/2309.00667

  25. [26]

    Pomdp + information- decay: Incorporating defender’s behaviour in autonomous penetration testing,

    J. Schwartz, H. Kurniawati, and E. El-Mahassni, “Pomdp + information- decay: Incorporating defender’s behaviour in autonomous penetration testing,” inProceedings of the 30th International Conference on Automated Planning and Scheduling (ICAPS), 2020, pp. 235–243. [Online]. Available: https://doi.org/10.1609/icaps.v30i1.6666

  26. [27]

    Gail-pt: An intelligent penetration testing framework with generative adversarial imitation learning,

    J. Chen, S. Hu, H. Zheng, C. Xing, and G. Zhang, “Gail-pt: An intelligent penetration testing framework with generative adversarial imitation learning,”Computers & Security, vol. 126, p. 103055, 2023

  27. [28]

    Evaluation of reinforcement learning for autonomous penetration testing using a3c, q-learning and dqn,

    N. Becker, D. Reti, E. V . N. Ntagiou, M. Wallum, and H. D. Schotten, “Evaluation of reinforcement learning for autonomous penetration testing using a3c, q-learning and dqn,”arXiv preprint arXiv:2407.15656, Jul. 2024. [Online]. Available: https://doi.org/10. 48550/arXiv.2407.15656

  28. [29]

    A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing,

    Q. Li, M. Zhang, Y . Shen, R. Wang, M. Hu, Y . Li, and H. Hao, “A hierarchical deep reinforcement learning model with expert prior knowledge for intelligent penetration testing,”Computers & Security, vol. 132, p. 103358, 2023. [Online]. Available: https: //doi.org/10.1016/j.cose.2023.103358

  29. [30]

    Pentestagent: Incorporating llm agents to automated penetration testing,

    X. Shen, L. Wang, Z. Li, Y . Chen, W. Zhao, D. Sun, J. Wang, and W. Ruan, “Pentestagent: Incorporating llm agents to automated penetration testing,”arXiv preprint arXiv:2411.05185, 2025. [Online]. Available: https://arxiv.org/abs/2411.05185

  30. [31]

    Metasploit,

    Rapid7, “Metasploit,” https://www.metasploit.com/, 2025, accessed: September 03, 2025

  31. [32]

    Penetration testing for system security: Methods and practical approaches,

    W. Zhang, J. Xing, and X. Li, “Penetration testing for system security: Methods and practical approaches,”arXiv preprint arXiv:2505.19174, May 2025. [Online]. Available: https://arxiv.org/abs/2505.19174

  32. [33]

    Penetration Testing == POMDP Solving?

    C. Sarraute, O. Buffet, and J. Hoffmann, “Penetration testing == pomdp solving?”arXiv preprint arXiv:1306.4714, 2013. [Online]. Available: https://doi.org/10.48550/arXiv.1306.4714

  33. [34]

    Gpt-3.5 turbo,

    OpenAI, “Gpt-3.5 turbo,” https://platform.openai.com/docs/models/ gpt-3.5-turbo, 2022, accessed: September 03, 2025

  34. [35]

    ——, “Gpt-4o,” https://platform.openai.com/docs/models/gpt-4o, 2024, accessed: September 03, 2025

  35. [36]

    Mitre att&ck ® — enterprise tactics,

    MITRE Corporation, “Mitre att&ck ® — enterprise tactics,” accessed 2025-10-08. [Online]. Available: https://attack.mitre.org/ tactics/enterprise/

  36. [37]

    PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation,

    J. Huang and Q. Zhu, “PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation,” inProceedings of the Workshop on Autonomous Cybersecurity. ACM, Nov. 2024. [Online]. Available: https://dl.acm.org/doi/10.1145/3689933.3690831 APPENDIXA COMPARISON WITHPENHEAL PenHeal [34] also evaluates its framework on the Metas- ploitable II envi...