xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models
Pith reviewed 2026-05-18 16:33 UTC · model grok-4.3
The pith
A fine-tuned mid-scale LLM in a multi-agent setup automates penetration testing and reaches 79 percent sub-task success on benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning Qwen3-32B on Chain-of-Thought penetration testing data and placing it inside an orchestration layer that coordinates dedicated agents for reconnaissance, vulnerability scanning, and exploitation, xOffense produces autonomous workflows that reach 79.17 percent sub-task completion on AutoPenBench and AI-Pentest-Benchmark, exceeding VulnBot and PentestGPT.
What carries the argument
The orchestration layer that assigns and coordinates specialized agents powered by the fine-tuned LLM to generate precise tool commands and maintain consistent multi-step reasoning.
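The coordination pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the phase names come from the three agent roles named in the abstract, while `SharedState`, `orchestrate`, and the stub agents are hypothetical names introduced here. Each agent returns placeholder strings instead of calling a model or a tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Phase order follows the roles named in the abstract; the identifiers are hypothetical.
PHASES = ["reconnaissance", "vulnerability_scanning", "exploitation"]


@dataclass
class SharedState:
    """Findings accumulated across phases (hypothetical structure)."""
    target: str
    findings: Dict[str, List[str]] = field(default_factory=dict)


# An agent is modeled as a function from shared state to a list of findings.
# A real agent would wrap the fine-tuned LLM and tool execution; these are stubs.
Agent = Callable[[SharedState], List[str]]


def recon_agent(state: SharedState) -> List[str]:
    return [f"service-placeholder on {state.target}"]


def scan_agent(state: SharedState) -> List[str]:
    # Consumes the previous phase's output, which is what the coordination enables.
    services = state.findings.get("reconnaissance", [])
    return [f"weakness-placeholder for: {s}" for s in services]


def exploit_agent(state: SharedState) -> List[str]:
    candidates = state.findings.get("vulnerability_scanning", [])
    return [f"attempt-placeholder for: {c}" for c in candidates]


def orchestrate(target: str, agents: Dict[str, Agent]) -> SharedState:
    """Run each phase in order, passing accumulated findings forward."""
    state = SharedState(target=target)
    for phase in PHASES:
        state.findings[phase] = agents[phase](state)
        if not state.findings[phase]:
            break  # nothing to hand to the next phase
    return state


AGENTS: Dict[str, Agent] = {
    "reconnaissance": recon_agent,
    "vulnerability_scanning": scan_agent,
    "exploitation": exploit_agent,
}

result = orchestrate("lab-host.example", AGENTS)
for phase, items in result.findings.items():
    print(phase, items)
```

The design choice worth noting is that agents never call each other directly. They read from and write to one shared state, so the orchestrator alone decides ordering and when to stop.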
If this is right
- Penetration testing becomes executable as a fully machine-driven process that scales with available compute rather than expert hours.
- Results gain reproducibility because the same model and orchestration produce consistent command sequences.
- Security assessments can shift from occasional manual reviews to routine, on-demand automated runs.
- Domain-adapted mid-scale models prove capable of handling the full chain from reconnaissance to exploitation when structured with agent roles.
Where Pith is reading between the lines
- The same agent orchestration pattern could transfer to related security tasks such as continuous monitoring or post-breach analysis.
- Real-world deployment would need checks for cases where network defenses or ethical limits require human judgment to avoid unintended actions.
- Hybrid human-AI loops might emerge as a practical next step to handle the minority of sub-tasks the model does not complete.
Load-bearing premise
The fine-tuned LLM will generate precise tool commands and sustain consistent multi-step reasoning across varied penetration testing scenarios without requiring human correction or intervention.
What would settle it
A new benchmark or live target set where the framework completes under 60 percent of sub-tasks or requires repeated human intervention to continue would show the autonomy claim does not hold.
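The falsification test above reduces to a small check, sketched here. The function names, the `max_interventions` parameter, and the reading of "repeated" as more than one intervention are assumptions made for illustration. The counts in the usage lines are hypothetical, chosen only because 19 of 24 rounds to 79.17 percent; the paper's actual sub-task counts are not given on this page.

```python
def completion_rate(completed: int, total: int) -> float:
    """Fraction of sub-tasks completed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    return completed / total


def autonomy_claim_holds(
    completed: int,
    total: int,
    human_interventions: int,
    threshold: float = 0.60,     # the 60 percent line stated in the text
    max_interventions: int = 1,  # assumption: "repeated" means more than one
) -> bool:
    """True unless the run falls below the threshold or needs repeated human help."""
    rate_ok = completion_rate(completed, total) >= threshold
    autonomy_ok = human_interventions <= max_interventions
    return rate_ok and autonomy_ok


# Hypothetical counts, for arithmetic illustration only.
print(round(completion_rate(19, 24) * 100, 2))                # 79.17
print(autonomy_claim_holds(19, 24, human_interventions=0))    # True
print(autonomy_claim_holds(14, 24, human_interventions=0))    # False (about 58 percent)
```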
Original abstract
This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces xOffense, an AI-driven multi-agent framework for penetration testing that employs a fine-tuned Qwen3-32B LLM to enable fully automated workflows. Specialized agents handle reconnaissance, vulnerability scanning, and exploitation, coordinated by an orchestration layer. The authors report that xOffense achieves a 79.17% sub-task completion rate on AutoPenBench and AI-Pentest-Benchmark, outperforming systems like VulnBot and PentestGPT.
Significance. If validated, the results would indicate that domain-adapted mid-scale LLMs combined with multi-agent orchestration can provide superior, cost-efficient solutions for autonomous penetration testing. This could have implications for scaling security testing without heavy reliance on human experts.
major comments (2)
- [Abstract and Results] The abstract and results section state a 79.17% sub-task completion rate and benchmark superiority but supply no details on experimental controls, statistical significance, exact task definitions, or potential data leakage between fine-tuning and evaluation sets. This information is required to verify the central performance claim.
- [Experimental Evaluation] The claim of fully automated workflows and consistent multi-step reasoning is load-bearing for the autonomy and cost-efficiency advantages, yet the experimental evaluation provides no breakdown of command failure rates, retry counts, fraction of trajectories completed with zero human input, or total commands issued.
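The breakdown requested in the second major comment could be computed from per-trajectory logs along these lines. This is a sketch under assumptions: the `Trajectory` record and its field names are hypothetical, since the paper's logging format is not described here, and the sample values are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """One benchmark run (hypothetical log schema)."""
    commands_issued: int
    commands_failed: int
    retries: int
    human_inputs: int


def automation_metrics(trajectories: List[Trajectory]) -> Dict[str, float]:
    """Aggregate the four quantities named in the referee comment."""
    total_commands = sum(t.commands_issued for t in trajectories)
    total_failed = sum(t.commands_failed for t in trajectories)
    total_retries = sum(t.retries for t in trajectories)
    zero_human = sum(1 for t in trajectories if t.human_inputs == 0)
    return {
        "total_commands": total_commands,
        "command_failure_rate": total_failed / total_commands if total_commands else 0.0,
        "avg_retries_per_failed_command": total_retries / total_failed if total_failed else 0.0,
        "zero_human_input_fraction": zero_human / len(trajectories) if trajectories else 0.0,
    }


# Invented sample data, not results from the paper.
sample = [
    Trajectory(commands_issued=10, commands_failed=2, retries=3, human_inputs=0),
    Trajectory(commands_issued=20, commands_failed=4, retries=6, human_inputs=1),
]
metrics = automation_metrics(sample)
print(metrics)
```

Reporting the zero-human-input fraction separately from the completion rate matters because a run can complete every sub-task and still depend on human correction.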
minor comments (2)
- [Related Work] Clarify the precise differences between the proposed orchestration layer and prior multi-agent pentesting systems in the related work section.
- [Methodology] Expand the description of the Chain-of-Thought fine-tuning dataset, including its size, source, and construction process.
Simulated Author's Rebuttal
We thank the referee for their detailed and insightful comments on our manuscript. We value the opportunity to clarify and strengthen our presentation of the experimental results and evaluation methodology. Below, we address each major comment point by point.
Point-by-point responses
- Referee: [Abstract and Results] The abstract and results section state a 79.17% sub-task completion rate and benchmark superiority but supply no details on experimental controls, statistical significance, exact task definitions, or potential data leakage between fine-tuning and evaluation sets. This information is required to verify the central performance claim.
  Authors: We fully agree that these details are crucial for the credibility of our central claims. The current manuscript provides high-level results but lacks the requested granularity. In the revised manuscript, we will add comprehensive information in the Experimental Setup and Results sections. Specifically, we will describe the experimental controls (e.g., fixed environment setups and multiple runs), report statistical significance using paired t-tests or similar with p-values, provide exact definitions of sub-tasks drawn from the benchmark papers, and explicitly address data leakage by detailing how the fine-tuning dataset was curated separately from the evaluation benchmarks with no overlap. We will also include error bars or confidence intervals around the 79.17% figure to better contextualize the results.
  Revision: yes
- Referee: [Experimental Evaluation] The claim of fully automated workflows and consistent multi-step reasoning is load-bearing for the autonomy and cost-efficiency advantages, yet the experimental evaluation provides no breakdown of command failure rates, retry counts, fraction of trajectories completed with zero human input, or total commands issued.
  Authors: This is a valid observation, as our evaluation focused on overall success rates rather than these granular automation metrics. To address this, we will revise the Experimental Evaluation section to include a detailed breakdown. This will encompass: observed command failure rates across all agents, average retry counts for failed commands, confirmation that 100% of trajectories were completed with zero human input as the framework operates autonomously, and the total number of commands issued during the benchmark evaluations. These additions will directly support our claims regarding fully automated workflows and cost-efficiency.
  Revision: yes
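The rebuttal promises error bars or confidence intervals around the headline figure without naming a method. One common option for a single proportion is the Wilson score interval, sketched below. Choosing Wilson is an assumption made here, not something the authors state, and the counts in the usage line are hypothetical (19 of 24 is used only because it rounds to 79.17 percent).

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, for illustration only.
low, high = wilson_interval(19, 24)
print(f"{low:.3f} to {high:.3f}")
```

With a sample this small the interval is wide, which is the practical reason the referee's request matters: a point estimate alone says little about how far a system is ahead of a baseline.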
Circularity Check
No circularity detected in empirical framework and benchmark evaluation
The paper presents an empirical system description and benchmark results rather than a mathematical derivation chain. It introduces a multi-agent framework, describes fine-tuning an LLM on external Chain-of-Thought data, and reports measured sub-task completion rates on independent benchmarks (AutoPenBench, AI-Pentest-Benchmark). No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central performance claim to its own inputs appear in the provided text. The evaluation relies on external test sets and comparisons to prior systems, making the reported 79.17% rate an independent measurement rather than a constructed tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A fine-tuned mid-scale LLM can generate precise tool commands and maintain consistent multi-step reasoning for penetration testing tasks.
Forward citations
Cited by 2 Pith papers
- Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
  The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
- PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
  PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.