xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models

Dong Huu Nguyen Khoa; Le Tran Gia Bao; Nguyen Huu Quyen; Nguyen Vu Khai Tam; Phan The Duy; Phung Duc Luong; Van-Hau Pham

arxiv: 2509.13021 · v2 · submitted 2025-09-16 · 💻 cs.CR · cs.AI

Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy

Pith reviewed 2026-05-18 16:33 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords penetration testing · multi-agent systems · large language models · autonomous security · fine-tuned models · cybersecurity automation · vulnerability exploitation

The pith

A fine-tuned mid-scale LLM in a multi-agent setup automates penetration testing and reaches 79 percent sub-task success on benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents xOffense as a complete shift from expert-driven manual penetration testing to an automated system where specialized agents handle reconnaissance, scanning, and exploitation. It relies on a fine-tuned open-source model to generate commands and sustain reasoning chains across multiple steps. A reader would care because this promises security testing that runs without constant human oversight and scales simply by adding compute. The work shows the system beats prior tools on two established benchmarks by completing nearly four-fifths of sub-tasks.

Core claim

By fine-tuning Qwen3-32B on Chain-of-Thought penetration testing data and placing it inside an orchestration layer that coordinates dedicated agents for reconnaissance, vulnerability scanning, and exploitation, xOffense produces autonomous workflows that reach 79.17 percent sub-task completion on AutoPenBench and AI-Pentest-Benchmark, exceeding VulnBot and PentestGPT.

What carries the argument

The orchestration layer that assigns and coordinates specialized agents powered by the fine-tuned LLM to generate precise tool commands and maintain consistent multi-step reasoning.
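
The coordination pattern described here can be illustrated with a minimal sketch. It assumes a purely sequential hand-off between the three phases and a shared context that later agents can read; the names `Agent`, `Orchestrator`, and `stub_generator` are hypothetical and do not come from the paper, and the stub stands in for the fine-tuned model without running any real tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for the fine-tuned model: maps a prompt to a tool command.
CommandGenerator = Callable[[str], str]

PHASES = ["reconnaissance", "vulnerability scanning", "exploitation"]


@dataclass
class Agent:
    """One specialized agent, tied to a single penetration testing phase."""
    phase: str
    generate: CommandGenerator

    def act(self, context: Dict[str, List[str]]) -> str:
        # Build a prompt from the phase name and what earlier phases recorded.
        history = "; ".join(f"{name}: {cmds}" for name, cmds in context.items())
        return self.generate(f"phase={self.phase} | history={history}")


@dataclass
class Orchestrator:
    """Runs the agents in order and keeps a shared record of their commands."""
    agents: List[Agent]
    context: Dict[str, List[str]] = field(default_factory=dict)

    def run(self) -> Dict[str, List[str]]:
        for agent in self.agents:
            command = agent.act(self.context)
            self.context.setdefault(agent.phase, []).append(command)
        return self.context


def stub_generator(prompt: str) -> str:
    # Assumption: a real system would query the model here; this only echoes the phase.
    phase = prompt.split("|")[0].strip().split("=", 1)[1]
    return f"<command for {phase}>"


orchestrator = Orchestrator([Agent(phase, stub_generator) for phase in PHASES])
result = orchestrator.run()
```

The shared `context` dictionary is the simplest way to let the exploitation agent see what reconnaissance and scanning produced; the paper's own orchestration layer may pass state differently.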

If this is right

Penetration testing becomes executable as a fully machine-driven process that scales with available compute rather than expert hours.
Results gain reproducibility because the same model and orchestration produce consistent command sequences.
Security assessments can shift from occasional manual reviews to routine, on-demand automated runs.
Domain-adapted mid-scale models prove capable of handling the full chain from reconnaissance to exploitation when structured with agent roles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent orchestration pattern could transfer to related security tasks such as continuous monitoring or post-breach analysis.
Real-world deployment would need checks for cases where network defenses or ethical limits require human judgment to avoid unintended actions.
Hybrid human-AI loops might emerge as a practical next step to handle the minority of sub-tasks the model does not complete.

Load-bearing premise

The fine-tuned LLM will generate precise tool commands and sustain consistent multi-step reasoning across varied penetration testing scenarios without requiring human correction or intervention.

What would settle it

A new benchmark or live target set where the framework completes under 60 percent of sub-tasks or requires repeated human intervention to continue would show the autonomy claim does not hold.

Figures

Figures reproduced from arXiv: 2509.13021 by Dong Huu Nguyen Khoa, Le Tran Gia Bao, Nguyen Huu Quyen, Nguyen Vu Khai Tam, Phan The Duy, Phung Duc Luong, Van-Hau Pham.

**Figure 1.** The Overall Architecture of the xOffense Framework.

**Figure 2.** Task Coordination Graph (TCG) illustrating task dependencies and execution status. Completed tasks are shown in dark, the current task in orange, and pending tasks in light blue.

Algorithm 1 Check and Reflection Procedure
Require: TCG (Task Coordination Graph), Knowledge Repository KR
    while not all tasks completed do
        t ← NextTask(TCG)
        r ← Execute(t)
        if CheckSuccess(r) then
            MarkCompleted(t) …
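
The visible steps of Algorithm 1 (pick the next task from the TCG, execute it, check success, mark it completed) can be sketched as follows. The source truncates the algorithm after `MarkCompleted(t)`, so the failure branch below is an assumption: it logs a note to a list standing in for the Knowledge Repository and retries up to a hypothetical `max_attempts` limit. The function and parameter names are illustrative only.

```python
from typing import Callable, Dict, List, Optional, Tuple


def check_and_reflect(
    tcg: Dict[str, List[str]],        # task name -> list of prerequisite tasks
    execute: Callable[[str], bool],   # returns True when the task succeeded
    max_attempts: int = 3,            # assumption: no retry limit is visible in the source
) -> Tuple[Dict[str, str], List[str]]:
    status = {task: "pending" for task in tcg}
    attempts = {task: 0 for task in tcg}
    knowledge: List[str] = []         # stand-in for the Knowledge Repository KR

    def next_task() -> Optional[str]:
        # A task is runnable once all of its prerequisites are completed.
        for task, deps in tcg.items():
            if status[task] == "pending" and all(status[d] == "completed" for d in deps):
                return task
        return None

    while (task := next_task()) is not None:
        attempts[task] += 1
        if execute(task):
            status[task] = "completed"
        else:
            # ASSUMED reflection step: the original algorithm is elided at this point.
            knowledge.append(f"{task} failed on attempt {attempts[task]}")
            if attempts[task] >= max_attempts:
                status[task] = "failed"
    return status, knowledge


# Hypothetical three-task chain; "exploit" fails once, then succeeds.
demo_tcg = {"scan": [], "exploit": ["scan"], "escalate": ["exploit"]}
calls: List[str] = []


def flaky(task: str) -> bool:
    calls.append(task)
    return not (task == "exploit" and calls.count("exploit") == 1)


demo_status, demo_notes = check_and_reflect(demo_tcg, flaky)
```

Tasks whose prerequisites end up failed are left as `pending`, which makes the loop terminate without pretending those tasks were attempted.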

**Figure 3.** Comparison of subtask completion rates across six real-world vulnerable machines in a No-RAG setting.

**Figure 4.** Comparison of subtask completion rates across six real-world vulnerable machines with RAG setting.

Original abstract

This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

xOffense gives a concrete multi-agent setup with a fine-tuned Qwen3-32B model that hits 79% on the benchmarks and beats a couple of prior systems, but the autonomy numbers are not broken down enough to judge how much human help was really needed.

The main point is that this paper gives a concrete multi-agent setup with a fine-tuned Qwen3-32B model that hits 79.17% sub-task completion on AutoPenBench and AI-Pentest-Benchmark and beats VulnBot and PentestGPT in their tests. They split the work into three phases (reconnaissance, vulnerability scanning, and exploitation) with an orchestration layer on top, and fine-tune the model on Chain-of-Thought penetration testing examples so it can output tool commands more reliably. That architecture is straightforward to understand and gives other groups a clear recipe to try or compare against. The benchmark comparisons are also useful as a snapshot of where current AI pentesting tools stand on those particular tasks.

The soft spots are around the autonomy claims. The abstract calls the workflows fully automated and the reasoning consistent, yet there is no count of how many commands failed, how many retries occurred, or what fraction of runs finished with zero human input. If even moderate intervention was required on the harder sub-tasks, the completion rate does not fully demonstrate the cost or scalability advantage that is advertised. The same goes for missing details on experimental controls and any check for overlap between the fine-tuning data and the evaluation sets.

This paper is for people who build or evaluate LLM tools for security work. A reader who wants a practical example of domain-adapted agents and benchmark numbers will get something usable from it, though they will probably want the full logs before treating the results as settled.

The authors show clear thinking in how they structured the agents and picked the fine-tuning approach, so the paper deserves a serious referee. I would send it to review but flag the need for intervention rates and contamination checks in the revision requests.

Referee Report

2 major / 2 minor

Summary. The paper introduces xOffense, an AI-driven multi-agent framework for penetration testing that employs a fine-tuned Qwen3-32B LLM to enable fully automated workflows. Specialized agents handle reconnaissance, vulnerability scanning, and exploitation, coordinated by an orchestration layer. The authors report that xOffense achieves a 79.17% sub-task completion rate on AutoPenBench and AI-Pentest-Benchmark, outperforming systems like VulnBot and PentestGPT.

Significance. If validated, the results would indicate that domain-adapted mid-scale LLMs combined with multi-agent orchestration can provide superior, cost-efficient solutions for autonomous penetration testing. This could have implications for scaling security testing without heavy reliance on human experts.

major comments (2)

[Abstract and Results] The abstract and results section state a 79.17% sub-task completion rate and benchmark superiority but supply no details on experimental controls, statistical significance, exact task definitions, or potential data leakage between fine-tuning and evaluation sets. This information is required to verify the central performance claim.
[Experimental Evaluation] The claim of fully automated workflows and consistent multi-step reasoning is load-bearing for the autonomy and cost-efficiency advantages, yet the experimental evaluation provides no breakdown of command failure rates, retry counts, fraction of trajectories completed with zero human input, or total commands issued.
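
The breakdown requested in the second major comment could be computed directly from per-run logs. The following is a minimal sketch, assuming a hypothetical log schema: the `Trajectory` fields and the sample numbers are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """One benchmark run, summarized by hypothetical counters."""
    commands_issued: int
    commands_failed: int
    retries: int
    human_interventions: int


def autonomy_report(runs: List[Trajectory]) -> Dict[str, float]:
    """Aggregate the four automation metrics named in the referee comment."""
    if not runs:
        raise ValueError("at least one trajectory is required")
    total = sum(r.commands_issued for r in runs)
    failed = sum(r.commands_failed for r in runs)
    autonomous = sum(1 for r in runs if r.human_interventions == 0)
    return {
        "total_commands": float(total),
        "command_failure_rate": failed / total if total else 0.0,
        "mean_retries_per_run": sum(r.retries for r in runs) / len(runs),
        "zero_intervention_fraction": autonomous / len(runs),
    }


# Made-up sample data, used only to exercise the function.
sample_runs = [Trajectory(10, 2, 1, 0), Trajectory(30, 8, 3, 1)]
report = autonomy_report(sample_runs)
```

Reporting the zero-intervention fraction next to the completion rate is what would let a reader separate "the task was completed" from "the task was completed autonomously".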

minor comments (2)

[Related Work] Clarify the precise differences between the proposed orchestration layer and prior multi-agent pentesting systems in the related work section.
[Methodology] Expand the description of the Chain-of-Thought fine-tuning dataset, including its size, source, and construction process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We value the opportunity to clarify and strengthen our presentation of the experimental results and evaluation methodology. Below, we address each major comment point by point.

Referee: [Abstract and Results] The abstract and results section state a 79.17% sub-task completion rate and benchmark superiority but supply no details on experimental controls, statistical significance, exact task definitions, or potential data leakage between fine-tuning and evaluation sets. This information is required to verify the central performance claim.

Authors: We fully agree that these details are crucial for the credibility of our central claims. The current manuscript provides high-level results but lacks the requested granularity. In the revised manuscript, we will add comprehensive information in the Experimental Setup and Results sections. Specifically, we will describe the experimental controls (e.g., fixed environment setups and multiple runs), report statistical significance using paired t-tests or similar with p-values, provide exact definitions of sub-tasks drawn from the benchmark papers, and explicitly address data leakage by detailing how the fine-tuning dataset was curated separately from the evaluation benchmarks with no overlap. We will also include error bars or confidence intervals around the 79.17% figure to better contextualize the results.

revision: yes

Referee: [Experimental Evaluation] The claim of fully automated workflows and consistent multi-step reasoning is load-bearing for the autonomy and cost-efficiency advantages, yet the experimental evaluation provides no breakdown of command failure rates, retry counts, fraction of trajectories completed with zero human input, or total commands issued.

Authors: This is a valid observation, as our evaluation focused on overall success rates rather than these granular automation metrics. To address this, we will revise the Experimental Evaluation section to include a detailed breakdown. This will encompass: observed command failure rates across all agents, average retry counts for failed commands, confirmation that 100% of trajectories were completed with zero human input as the framework operates autonomously, and the total number of commands issued during the benchmark evaluations. These additions will directly support our claims regarding fully automated workflows and cost-efficiency.

revision: yes
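
The confidence interval promised in the first response can be illustrated with a Wilson score interval for a completion rate. This is a sketch under stated assumptions: the page does not give the number of sub-tasks behind the 79.17% figure, so the counts below (19 of 24) are hypothetical and were chosen only because 19/24 rounds to 79.17%. The Wilson interval is one common choice for a binomial proportion, not necessarily the method the authors would use.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 19 completed sub-tasks out of 24 (19/24 = 0.7917).
low, high = wilson_interval(19, 24)
```

With a denominator this small the interval is wide, which is why the raw sub-task counts matter as much as the headline percentage.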

Circularity Check

0 steps flagged

No circularity detected in empirical framework and benchmark evaluation

The paper presents an empirical system description and benchmark results rather than a mathematical derivation chain. It introduces a multi-agent framework, describes fine-tuning an LLM on external Chain-of-Thought data, and reports measured sub-task completion rates on independent benchmarks (AutoPenBench, AI-Pentest-Benchmark). No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central performance claim to its own inputs appear in the provided text. The evaluation relies on external test sets and comparisons to prior systems, making the reported 79.17% rate an independent measurement rather than a constructed tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the practical assumption that a fine-tuned mid-scale LLM can reliably drive autonomous tool use and reasoning in security workflows; no new mathematical axioms or invented physical entities are introduced.

axioms (1)

domain assumption: A fine-tuned mid-scale LLM can generate precise tool commands and maintain consistent multi-step reasoning for penetration testing tasks.
Invoked to justify autonomous operation of the specialized agents without human oversight.

pith-pipeline@v0.9.0 · 5760 in / 1188 out tokens · 43344 ms · 2026-05-18T16:33:26.772594+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
cs.CR 2026-04 unverdicted novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
cs.CR 2026-04 unverdicted novelty 6.0

PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 3 internal anchors

[1] Infosecurity Magazine. Nvd revamps operations amid cve surge. https://www.infosecurity-magazine.com/news/nvd-revamps-operations-cve-surge/, 2024. Accessed: 2025-07-30
[2] GBHackers. Nist facing challenges in managing cve backlog. https://gbhackers.com/nist-facing-challenges-in-managing-cve-backlog/, 2024. Accessed: 2025-07-30
[3] Isao Takaesu and Daisuke Chikamori. Deep exploit. https://www.blackhat.com/us-18/arsenal/schedule/index.html#deep-exploit-11908, 2018. Presented at Black Hat USA 2018 Arsenal, Las Vegas. Accessed: 2025-07-30
[4] Rapid7. Metasploit: penetration testing software, pen testing security. https://www.metasploit.com/, 2024. Accessed: July 27, 2024
[5] Abdul Samad, Saad Altaf, and M Junaid Arshad. Advancements in automated penetration testing for iot security by leveraging reinforcement learning. evaluation, 8:9, 2024
[6] Khuong Tran, Ashlesha Akella, Maxwell Standen, Junae Kim, David Bowman, Toby Richer, and Chin-Teng Lin. Deep hierarchical reinforcement agents for automated penetration testing. arXiv preprint arXiv:2109.06449, 2021
[7] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, Philadelphia, PA, 2024. USENIX Association
[8] Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. Pentestagent: Incorporating llm agents to automated penetration testing. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security, pages 375–391, 2025
[9] He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, and Bingzhen Wu. VulnBot: Autonomous penetration testing for a multi-agent collaborative framework. arXiv preprint arXiv:2501.13411, Jan 2025
[10] Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, and Roberto Bifulco. Autopenbench: Benchmarking generative agents for penetration testing, 2024
[11] Isamu Isozaki. Ai-pentest-benchmark: A benchmark for automated penetration testing. https://github.com/isamu-isozaki/AI-Pentest-Benchmark, 2024. GitHub repository. Accessed: 2025-07-30
[12] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
[13] Gordon Lyon. Nmap: The network mapper - free security scanner. https://nmap.org/, 2024. Accessed: July 27, 2024
[14] Chris Sullo. Nikto web server scanner. https://github.com/sullo/nikto, 2024. Accessed: July 27, 2024
[15] WPScan Team. Wpscan wordpress security scanner. https://github.com/wpscanteam/wpscan, 2024. Accessed: July 27, 2024
[16] Ryusei Maeda and Mamoru Mimura. Automating post-exploitation with deep reinforcement learning. Computers & Security, 100:102108, 2021
[17] Van-Hau Pham, Hien Do Hoang, Phan Thanh Trung, Van Dinh Quoc, Trong-Nghia To, and Phan The Duy. Raiju: Reinforcement learning-guided post-exploitation for automating security assessment of network systems. Computer Networks, 253:110706, 2024
[18] Jiacen Xu, Jack W Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li. AutoAttacker: A large language model guided system to implement automatic cyber-attacks. arXiv preprint arXiv:2403.01038, 2024
[19] Hanzheng Dai, Yuanliang Li, Zhibo Zhang, and Jun Yan. Refpentester: A knowledge-informed self-reflective penetration testing framework based on large language models. arXiv preprint arXiv:2505.07089, 2025
[20] Sho Nakatani. Rapidpen: Fully automated ip-to-shell penetration testing with llm-based agents. arXiv preprint arXiv:2502.16730, 2025
[21] Dominik M. Weber, Ioannis Tzachristas, and Aifen Sui. Perses: Unlocking privilege escalation for small llms via extensible heterogeneity. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security (ASIA CCS ’25). ACM, 2025
[22] Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities. arXiv preprint arXiv:2404.08144, 2024
[23] Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of llm agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637, Mar 2025
[24] Lajos Muzsai, David Imolai, and András Lukács. Hacksynth: Llm agent and evaluation framework for autonomous penetration testing. arXiv preprint arXiv:2412.01778, 2024
[25] Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Ramesh Karri, Prashanth Krishnamurthy, Farshad Khorrami, and Muhammad Shafique. Nyu ctf bench: A scalable open-source benchmark dataset for evaluating llms in offensive security. In NeurIPS 2024 Datasets and Benchmarks Track, 2024
[26] Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. Cve-bench: A benchmark for ai agents’ ability to exploit real-world web application vulnerabilities. arXiv preprint arXiv:2503.17332, Mar 2025
[27] Isamu Isozaki, Manil Shrestha, Rick Console, and Edward Kim. Towards automated penetration testing: Introducing LLM benchmark, analysis, and improvements. In Proceedings of the 2025 ACM Conference (companion/adjunct) on Computer and Communications Security, 2025. Accessed: 2025-08-06
[28] Julius Henke. Autopentest: Enhancing vulnerability management with autonomous llm agents. arXiv preprint arXiv:2505.10321, 2025
[29] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 51991–52008. Curran Associates, Inc., 2023
[30] Liu Qian, Song Jinke, Huang Zhiguo, Zhang Yuxuan, glide the, and liunux4odoo. langchain-chatchat. GitHub repository, 2024
[31] DirB Project. Dirb web content scanner. https://gitlab.com/kalilinux/packages/dirb, 2025
[32] Gobuster Project. Gobuster - directory/file, dns and vhost busting tool written in go. https://github.com/OJ/gobuster, 2025
[33] OWASP Amass Project. Owasp amass - in-depth attack surface mapping and asset discovery. https://github.com/owasp-amass/amass, 2025
[34] sqlmap Developers. sqlmap - automatic sql injection and database takeover tool. https://github.com/sqlmapproject/sqlmap, 2025
[35] THC Hydra Team. Thc-hydra - network logon cracker. https://github.com/vanhauser-thc/thc-hydra, 2025
[36] Openwall Project. John the ripper - password cracker. https://github.com/openwall/john, 2025
[37] Offensive Security. Exploit database (exploit-db). https://www.exploit-db.com/, 2025
[38] Carlos Polop. Hacktricks: Hacking techniques & privilege escalation encyclopedia. https://book.hacktricks.xyz/, 2025
[39] Raj Chandel and Hacking Articles Team. Hacking articles: A cyber security community blog. https://www.hackingarticles.in/, 2025
[40] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent. arXiv preprint arXiv:2507.02259, 2025
[41] Offensive Security. Kali linux: Penetration testing and ethical hacking linux distribution. https://www.kali.org/, 2025
[42] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020
[43] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022
[44] TryHackMe Team. Tryhackme: Hands-on cybersecurity training platform. https://tryhackme.com, 2024
[45] HackTheBox Team. Hack the box: Cybersecurity labs and challenges. https://www.hackthebox.com, 2024
[46] VulnHub Community. Vulnhub: Vulnerable machines for penetration testing practice. https://www.vulnhub.com, 2024
[47] HuggingFace Team. Huggingface datasets hub: Open-source datasets for machine learning. https://huggingface.co/datasets, 2024
[48] Migel Tissera and WhiteRabbitNeo Team. Whiterabbitneo cybersecurity dataset (wrn-chapter-1, wrn-chapter-2). https://huggingface.co/datasets/WhiteRabbitNeo/WRN-Chapter-1, 2024

[1] [1]

Nvd revamps operations amid cve surge.https://www.infosecurity-magazine.com/news/ nvd-revamps-operations-cve-surge/, 2024

Infosecurity Magazine. Nvd revamps operations amid cve surge.https://www.infosecurity-magazine.com/news/ nvd-revamps-operations-cve-surge/, 2024. Accessed: 2025-07- 30

work page 2024

[2] [2]

Nist facing challenges in manag- ing cve backlog.https://gbhackers.com/ nist-facing-challenges-in-managing-cve-backlog/, 2024

GBHackers. Nist facing challenges in manag- ing cve backlog.https://gbhackers.com/ nist-facing-challenges-in-managing-cve-backlog/, 2024. Accessed: 2025-07-30

work page 2024

[3] [3]

Deep exploit.https: //www.blackhat.com/us-18/arsenal/schedule/index.html# deep-exploit-11908, 2018

Isao Takaesu and Daisuke Chikamori. Deep exploit.https: //www.blackhat.com/us-18/arsenal/schedule/index.html# deep-exploit-11908, 2018. Presented at Black Hat USA 2018 Arsenal, Las Vegas. Accessed: 2025-07-30

work page 2018

[4] [4]

Metasploit — penetration testing software, pen testing security

Rapid7. Metasploit — penetration testing software, pen testing security. https://www.metasploit.com/, 2024. Accessed: July 27, 2024

work page 2024

[5] [5]

Advancements in au- tomated penetration testing for iot security by leveraging reinforcement learning.evaluation, 8:9, 2024

Abdul Samad, Saad Altaf, and M Junaid Arshad. Advancements in au- tomated penetration testing for iot security by leveraging reinforcement learning.evaluation, 8:9, 2024

work page 2024

[6] [6]

Deep hierarchical rein- forcement agents for automated penetration testing.arXiv preprint arXiv:2109.06449, 2021

Khuong Tran, Ashlesha Akella, Maxwell Standen, Junae Kim, David Bowman, Toby Richer, and Chin-Teng Lin. Deep hierarchical rein- forcement agents for automated penetration testing.arXiv preprint arXiv:2109.06449, 2021

work page arXiv 2021

[7] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, Philadelphia, PA, 2024. USENIX Association.

[8] Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. PentestAgent: Incorporating LLM agents to automated penetration testing. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security, pages 375–391, 2025.

[9] He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, and Bingzhen Wu. VulnBot: Autonomous penetration testing for a multi-agent collaborative framework. arXiv preprint arXiv:2501.13411, Jan 2025.

[10] Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, and Roberto Bifulco. AutoPenBench: Benchmarking generative agents for penetration testing, 2024.

[11] Isamu Isozaki. AI-Pentest-Benchmark: A benchmark for automated penetration testing. https://github.com/isamu-isozaki/AI-Pentest-Benchmark, 2024. GitHub repository. Accessed: 2025-07-30.

[12] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[13] Gordon Lyon. Nmap: The network mapper - free security scanner. https://nmap.org/, 2024. Accessed: July 27, 2024.

[14] Chris Sullo. Nikto web server scanner. https://github.com/sullo/nikto, 2024. Accessed: July 27, 2024.

[15] WPScan Team. WPScan WordPress security scanner. https://github.com/wpscanteam/wpscan, 2024. Accessed: July 27, 2024.

[16] Ryusei Maeda and Mamoru Mimura. Automating post-exploitation with deep reinforcement learning. Computers & Security, 100:102108, 2021.

[17] Van-Hau Pham, Hien Do Hoang, Phan Thanh Trung, Van Dinh Quoc, Trong-Nghia To, and Phan The Duy. Raiju: Reinforcement learning-guided post-exploitation for automating security assessment of network systems. Computer Networks, 253:110706, 2024.

[18] Jiacen Xu, Jack W Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, and Zhou Li. AutoAttacker: A large language model guided system to implement automatic cyber-attacks. arXiv preprint arXiv:2403.01038, 2024.

[19] Hanzheng Dai, Yuanliang Li, Zhibo Zhang, and Jun Yan. RefPentester: A knowledge-informed self-reflective penetration testing framework based on large language models. arXiv preprint arXiv:2505.07089, 2025.

[20] Sho Nakatani. RapidPen: Fully automated IP-to-shell penetration testing with LLM-based agents. arXiv preprint arXiv:2502.16730, 2025.

[21] Dominik M. Weber, Ioannis Tzachristas, and Aifen Sui. Perses: Unlocking privilege escalation for small LLMs via extensible heterogeneity. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security (ASIA CCS '25). ACM, 2025.

[22] Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM agents can autonomously exploit one-day vulnerabilities. arXiv preprint arXiv:2404.08144, 2024.

[23] Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of LLM agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637, Mar 2025.

[24] Lajos Muzsai, David Imolai, and András Lukács. HackSynth: LLM agent and evaluation framework for autonomous penetration testing. arXiv preprint arXiv:2412.01778, 2024.

[25] Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Ramesh Karri, Prashanth Krishnamurthy, Farshad Khorrami, and Muhammad Shafique. NYU CTF Bench: A scalable open-source benchmark dataset for evaluating LLMs in offensive security. In NeurIPS 2024 Datasets and Benchmarks Track, 2024.

[26] Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. CVE-Bench: A benchmark for AI agents' ability to exploit real-world web application vulnerabilities. arXiv preprint arXiv:2503.17332, Mar 2025.

[27] Isamu Isozaki, Manil Shrestha, Rick Console, and Edward Kim. Towards automated penetration testing: Introducing LLM benchmark, analysis, and improvements. In Proceedings of the 2025 ACM Conference (companion/adjunct) on Computer and Communications Security, 2025. Accessed: 2025-08-06.

[28] Julius Henke. AutoPentest: Enhancing vulnerability management with autonomous LLM agents. arXiv preprint arXiv:2505.10321, 2025.

[29] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for "mind" exploration of large language model society. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 51991–52008. Curran Associates, Inc., 2023.

[30] Liu Qian, Song Jinke, Huang Zhiguo, Zhang Yuxuan, glide the, and liunux4odoo. langchain-chatchat. GitHub repository, 2024.

[31] DirB Project. Dirb web content scanner. https://gitlab.com/kalilinux/packages/dirb, 2025.

[32] Gobuster Project. Gobuster - directory/file, DNS and vhost busting tool written in Go. https://github.com/OJ/gobuster, 2025.

[33] OWASP Amass Project. OWASP Amass - in-depth attack surface mapping and asset discovery. https://github.com/owasp-amass/amass, 2025.

[34] sqlmap Developers. sqlmap - automatic SQL injection and database takeover tool. https://github.com/sqlmapproject/sqlmap, 2025.

[35] THC Hydra Team. THC-Hydra - network logon cracker. https://github.com/vanhauser-thc/thc-hydra, 2025.

[36] Openwall Project. John the Ripper - password cracker. https://github.com/openwall/john, 2025.

[37] Offensive Security. Exploit Database (Exploit-DB). https://www.exploit-db.com/, 2025.

[38] Carlos Polop. HackTricks: Hacking techniques & privilege escalation encyclopedia. https://book.hacktricks.xyz/, 2025.

[39] Raj Chandel and Hacking Articles Team. Hacking Articles: A cyber security community blog. https://www.hackingarticles.in/, 2025.

[40] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259, 2025.

[41] Offensive Security. Kali Linux: Penetration testing and ethical hacking Linux distribution. https://www.kali.org/, 2025.

[42] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

[43] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

[44] TryHackMe Team. TryHackMe: Hands-on cybersecurity training platform. https://tryhackme.com, 2024.

[45] HackTheBox Team. Hack The Box: Cybersecurity labs and challenges. https://www.hackthebox.com, 2024.

[46] VulnHub Community. VulnHub: Vulnerable machines for penetration testing practice. https://www.vulnhub.com, 2024.

[47] HuggingFace Team. HuggingFace Datasets Hub: Open-source datasets for machine learning. https://huggingface.co/datasets, 2024.

[48] Migel Tissera and WhiteRabbitNeo Team. WhiteRabbitNeo cybersecurity dataset (WRN-Chapter-1, WRN-Chapter-2). https://huggingface.co/datasets/WhiteRabbitNeo/WRN-Chapter-1, 2024.