pith. machine review for the scientific record.

arxiv: 2604.18718 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.AI

Recognition: unknown

Towards Optimal Agentic Architectures for Offensive Security Tasks

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 03:58 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords agentic architectures · offensive security · LLM agents · security benchmarks · coordination topologies · cost-quality tradeoffs · vulnerability detection

The pith

Agentic security audits reveal a non-monotonic cost-quality frontier across coordination topologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether fixing a single coordination topology in LLM-based offensive security systems is optimal, or whether varying the number of agents and their interactions improves results. It introduces a benchmark of 20 interactive targets, each with one ground-truth vulnerability, and runs 600 experiments across five architecture families in whitebox and blackbox settings. Detection rates reach 58 percent overall for any detection and 49.8 percent for validated findings; multi-agent independent setups score highest at 64.2 percent validated detection, while single-agent systems are cheapest at 0.058 dollars per validated finding. Observability and target domain drive larger differences than topology, indicating that added coordination helps coverage only until latency, token costs, and validation difficulty are counted.

Core claim

Across 600 runs on 20 targets, validated detection reaches 49.8 percent overall. MAS-Indep attains 64.2 percent validated detection while SAS sets the efficiency baseline at 0.058 dollars per validated finding. Whitebox mode outperforms blackbox (67.0 percent versus 32.7 percent) and web targets outperform binary (74.3 percent versus 25.3 percent). Bootstrap intervals confirm that domain and observability dominate, producing a non-monotonic cost-quality frontier where broader coordination improves coverage but does not dominate once latency, token cost, and exploit-validation difficulty are included.
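
To make the headline metrics concrete, here is a minimal Python sketch of how a validated detection rate, a cost-per-validated-finding figure, and a percentile bootstrap interval are typically computed. The outcome vector, run count, and per-run costs below are hypothetical placeholders, not the paper's data, and resampling is done at the run level purely to keep the sketch short (the paper reports bootstrap intervals and paired target-level deltas).

    import random

    def validated_detection_rate(outcomes):
        # outcomes: list of 0/1 flags, one per run, 1 = validated finding
        return sum(outcomes) / len(outcomes)

    def cost_per_validated_finding(costs_usd, outcomes):
        # total spend divided by the number of validated findings
        validated = sum(outcomes)
        return sum(costs_usd) / validated if validated else float("inf")

    def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05):
        # percentile bootstrap over runs for the validated detection rate
        stats = sorted(
            validated_detection_rate([random.choice(outcomes) for _ in outcomes])
            for _ in range(n_boot)
        )
        return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

    # Hypothetical cell: 120 runs, 77 validated findings (about 64.2%), flat $0.20 per run
    outcomes = [1] * 77 + [0] * 43
    costs = [0.20] * len(outcomes)
    print(validated_detection_rate(outcomes))           # about 0.642
    print(cost_per_validated_finding(costs, outcomes))  # about 0.31 dollars per validated finding
    print(bootstrap_ci(outcomes))                       # e.g. roughly (0.56, 0.73)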

What carries the argument

Controlled benchmark of 20 interactive targets with endpoint-reachable ground-truth vulnerabilities, executed across five architecture families in whitebox and blackbox modes to measure validated detection rate and cost per finding.
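
As a quick sanity check on the stated design, a minimal sketch of the run grid follows. The architecture, model, and target labels are placeholders (only SAS and MAS-Indep are named in the text), and one run per combination is assumed because that reproduces the stated 600-run total.

    from itertools import product

    architectures = ["SAS", "MAS-Indep", "ARCH-3", "ARCH-4", "ARCH-5"]  # five families
    models = ["model-A", "model-B", "model-C"]                          # three model families
    modes = ["whitebox", "blackbox"]                                    # two access modes
    targets = [f"web-{i:02d}" for i in range(10)] + [f"bin-{i:02d}" for i in range(10)]

    runs = list(product(architectures, models, modes, targets))
    assert len(runs) == 5 * 3 * 2 * 20 == 600  # matches the 600-run core benchmark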

If this is right

  • Whitebox access yields materially higher validated detection than blackbox access.
  • Web and API targets prove substantially easier than binary targets under the same architectures.
  • Some leading whitebox topologies remain statistically close, so topology choice is secondary to observability and domain.
  • Broader coordination improves coverage only up to the point where added latency and token cost exceed the marginal gain in validated findings.
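
The last bullet is effectively a marginal-analysis claim. Below is a minimal sketch of the comparison it implies, using entirely hypothetical architecture points ordered by increasing coordination; the family name beyond SAS and MAS-Indep and all of the numbers are illustrative placeholders, not the paper's measurements.

    # Hypothetical (architecture, validated_rate, avg_cost_per_run_usd) points,
    # ordered by increasing coordination; values are illustrative placeholders.
    frontier = [
        ("SAS",       0.50, 0.04),
        ("MAS-Indep", 0.64, 0.15),
        ("MAS-Coord", 0.66, 0.40),   # hypothetical heavier-coordination family
    ]

    for (name_a, rate_a, cost_a), (name_b, rate_b, cost_b) in zip(frontier, frontier[1:]):
        gain = rate_b - rate_a    # extra validated findings per run
        extra = cost_b - cost_a   # extra spend per run
        print(f"{name_a} -> {name_b}: ${extra / gain:.2f} per additional validated finding")

Under numbers like these, the second step still improves coverage but at a sharply higher marginal price, which is the shape of the non-monotonic frontier described above.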

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production deployments may benefit from adaptive selection among topologies rather than committing to one family in advance.
  • The cost-quality frontier implies that real-world pipelines could monitor per-target latency and validation difficulty to switch architectures dynamically (a sketch of this idea follows the list).
  • The gap between any-detection and validated-detection rates highlights the practical importance of automated exploit confirmation steps that current benchmarks treat as separate.
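
Expanding on the first two bullets, here is a minimal sketch of an adaptive topology selector. The decision rules, thresholds, and budget signal are invented for illustration; the only facts carried over from the paper are that SAS was the cheapest per validated finding and MAS-Indep the highest-detecting family.

    # Hypothetical adaptive selector; rules and thresholds are invented for
    # illustration and are not conclusions drawn by the paper.
    def pick_architecture(observability, domain, budget_usd_per_run):
        if budget_usd_per_run < 0.10:
            return "SAS"        # paper's cheapest cost per validated finding
        if observability == "whitebox" and domain == "web":
            return "SAS"        # easy cell: extra coordination mostly adds cost
        return "MAS-Indep"      # harder cells: pay for broader independent coverage

    print(pick_architecture("blackbox", "binary", 0.50))  # -> MAS-Indep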

Load-bearing premise

The 20 chosen targets and the binary definition of validated detection accurately represent the distribution and difficulty of real-world offensive security tasks.

What would settle it

Repeating the experiments on a substantially larger or differently distributed set of targets, or replacing the binary validated-detection label with a graded exploit-success score, would show whether the non-monotonic frontier still holds.
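
A minimal sketch of what a graded exploit-success score could look like in place of the binary label follows; the stage names and weights are invented for illustration and are not proposed by the paper.

    # Hypothetical partial-credit rubric; a run scores 1.0 only when every stage
    # succeeds, which roughly recovers the binary validated-detection label.
    STAGES = [
        ("located_vulnerable_endpoint", 0.25),
        ("triggered_vulnerability",     0.35),
        ("produced_working_exploit",    0.40),
    ]

    def graded_score(run_log):
        # run_log: dict mapping stage name -> bool
        return sum(weight for stage, weight in STAGES if run_log.get(stage, False))

    print(graded_score({"located_vulnerable_endpoint": True,
                        "triggered_vulnerability": True,
                        "produced_working_exploit": False}))  # 0.6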

Figures

Figures reproduced from arXiv: 2604.18718 by Arthur Gervais, Isaac David.

Figure 1. Topology icons used throughout the paper for the five architecture families. The cards …
Figure 2. Validated detection heatmaps on the completed 600-run core benchmark, aggregated by …
Figure 3. Cost-quality frontier for the ten architecture-mode cells in the completed 600-run core …
Figure 4. Average per-run cost and token decomposition by architecture on the completed 600-run …
read the original abstract

Agentic security systems increasingly audit live targets with tool-using LLMs, but prior systems fix a single coordination topology, leaving unclear when additional agents help and when they only add cost. We treat topology choice as an empirical systems question. We introduce a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one endpoint-reachable ground-truth vulnerability, evaluated in whitebox and blackbox modes. The core study executes 600 runs over five architecture families, three model families, and both access modes, with a separate 60-run long-context pilot reported only in the appendix. On the completed core benchmark, detection-any reaches 58.0% and validated detection reaches 49.8%. MAS-Indep attains the highest validated detection rate (64.2%), while SAS is the strongest efficiency baseline at $0.058 per validated finding. Whitebox materially outperforms blackbox (67.0% vs. 32.7% validated detection), and web materially outperforms binary (74.3% vs. 25.3%). Bootstrap confidence intervals and paired target-level deltas show that the dominant effects are observability and domain, while some leading whitebox topologies remain statistically close. The main result is a non-monotonic cost-quality frontier: broader coordination can improve coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are taken into account.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper treats choice of agent coordination topology as an empirical systems question for LLM-based offensive security auditing. It introduces a controlled benchmark of 20 interactive targets (10 web/API, 10 binary), each with one endpoint-reachable ground-truth vulnerability, and reports results from 600 runs across five architecture families, three model families, and whitebox/blackbox access modes (plus a 60-run long-context pilot in the appendix). Core findings are an overall validated detection rate of 49.8% (MAS-Indep highest at 64.2%), SAS as the strongest efficiency baseline ($0.058 per validated finding), large gaps favoring whitebox over blackbox (67.0% vs 32.7%) and web over binary (74.3% vs 25.3%), supported by bootstrap confidence intervals and paired target-level deltas. The central claim is a non-monotonic cost-quality frontier: broader coordination can improve coverage but does not dominate once latency, token cost, and exploit-validation difficulty are accounted for.

Significance. If the benchmark is representative, the work supplies one of the first systematic, multi-architecture comparisons in this domain, with concrete, statistically supported metrics that can guide practical deployment of tool-using LLM agents for security tasks. The reporting of bootstrap CIs, target-level paired deltas, and explicit cost-quality trade-offs is a clear strength that moves beyond single-topology case studies. The non-monotonic frontier result, if robust, directly informs when additional agents add net value versus overhead.

major comments (2)
  1. [§3] Benchmark construction (§3, Target Selection and Evaluation Protocol): The non-monotonic cost-quality frontier and the relative ranking of MAS-Indep versus SAS rest on the assumption that the 20 targets and the binary 'validated detection' success criterion faithfully sample real-world difficulty, observability, and validation cost. The manuscript provides no quantitative comparison of target difficulty distribution, vulnerability types, or observability characteristics against broader offensive-security corpora, nor any sensitivity analysis to post-hoc exclusion rules. This is load-bearing for the central claim.
  2. [§5] Statistical interpretation of topology rankings (§5, Results): The paper notes that some leading whitebox topologies remain statistically close, yet still presents MAS-Indep as attaining the highest validated detection rate. No formal multiple-comparison correction or power analysis is reported for the 600-run design, leaving open whether the observed ordering (and therefore the non-monotonic frontier) is robust to small changes in the target set or success threshold.
minor comments (3)
  1. [Abstract and §4] The abstract and §4 introduce acronyms (MAS-Indep, SAS) without immediate expansion; define all architecture families on first use in the main text.
  2. [Appendix] The 60-run long-context pilot is relegated to the appendix; a one-paragraph summary of its implications for the cost-quality frontier should be added to the main discussion or conclusion.
  3. [Figures] Figure(s) depicting the cost-quality frontier would benefit from explicit annotation of the non-monotonic inflection points and error bars derived from the bootstrap intervals.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The comments on benchmark representativeness and statistical robustness are well-taken and will improve the manuscript. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [§3] Benchmark construction (§3, Target Selection and Evaluation Protocol): The non-monotonic cost-quality frontier and the relative ranking of MAS-Indep versus SAS rest on the assumption that the 20 targets and the binary 'validated detection' success criterion faithfully sample real-world difficulty, observability, and validation cost. The manuscript provides no quantitative comparison of target difficulty distribution, vulnerability types, or observability characteristics against broader offensive-security corpora, nor any sensitivity analysis to post-hoc exclusion rules. This is load-bearing for the central claim.

    Authors: We agree that demonstrating the benchmark's representativeness strengthens the central claims. No standardized public corpus of interactive targets with endpoint-reachable ground-truth vulnerabilities currently exists for direct quantitative matching, which precludes a full comparative analysis. We will revise Section 3 to add an explicit discussion of target selection criteria, diversity across domains and access modes, and limitations on generalizability. We will also perform and report a sensitivity analysis varying the validated-detection threshold and target subsets to test robustness of the topology rankings and non-monotonic frontier. revision: partial

  2. Referee: [§5] Statistical interpretation of topology rankings (§5, Results): The paper notes that some leading whitebox topologies remain statistically close, yet still presents MAS-Indep as attaining the highest validated detection rate. No formal multiple-comparison correction or power analysis is reported for the 600-run design, leaving open whether the observed ordering (and therefore the non-monotonic frontier) is robust to small changes in the target set or success threshold.

    Authors: We appreciate the call for additional statistical safeguards. The existing bootstrap CIs and paired target-level deltas already indicate that top whitebox topologies are close. In the revision we will add a power analysis for the primary detection-rate comparisons and apply a multiple-comparison correction (Bonferroni) to the paired statistical tests. These changes will clarify robustness without changing the reported ordering or the cost-quality frontier conclusion. revision: yes
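
As an editorial illustration of the safeguard promised here, a minimal sketch of a Bonferroni-adjusted paired comparison using an exact two-sided sign test on per-target discordant outcomes; the discordant counts and the third architecture label are placeholders, not the paper's data.

    from math import comb

    def sign_test_p(wins_a, wins_b):
        # exact two-sided binomial sign test on targets where the two
        # architectures disagree (one validates the finding, the other does not)
        n = wins_a + wins_b
        if n == 0:
            return 1.0
        k = max(wins_a, wins_b)
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # Hypothetical discordant-target counts for three pairwise comparisons
    comparisons = {
        "MAS-Indep vs SAS":    (6, 2),
        "MAS-Indep vs ARCH-3": (5, 3),   # placeholder third family
        "SAS vs ARCH-3":       (4, 4),
    }

    m = len(comparisons)  # Bonferroni factor = number of pairwise tests
    for name, (a, b) in comparisons.items():
        p = sign_test_p(a, b)
        print(name, "raw p:", round(p, 3), "Bonferroni-adjusted:", round(min(1.0, p * m), 3))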

standing simulated objections not resolved
  • Quantitative comparison of the 20-target difficulty distribution and observability characteristics against broader offensive-security corpora remains open, since no directly comparable public datasets exist.

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

full rationale

The paper conducts a controlled empirical benchmark with 600 runs across five architecture families on 20 fixed targets, reporting observed detection rates, costs, and a non-monotonic frontier directly from the experimental data. There are no derivations, equations, fitted parameters renamed as predictions, or self-citations that reduce the central claims to quantities defined inside the paper. The benchmark targets, whitebox/blackbox modes, and validated-detection metric are defined independently of the results, so the reported outcomes remain self-contained measurements rather than tautological restatements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that each target contains exactly one endpoint-reachable ground-truth vulnerability and that reaching it constitutes validated success; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Each of the 20 targets exposes exactly one endpoint-reachable ground-truth vulnerability that defines success.
    Used to compute validated detection rates and to treat any other finding as failure.

pith-pipeline@v0.9.0 · 5542 in / 1333 out tokens · 48575 ms · 2026-05-10T03:58:46.374422+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

    cs.CR · 2026-05 · unverdicted · novelty 6.0

    An agentic pipeline localizes the security-relevant function in 10 of 20 Ubuntu binary security updates and produces an accepted root-cause classification in 11 of 20, limited mainly by binary differencing coverage.

Reference graph

Works this paper leans on

21 extracted references · 12 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Purple llama cyberseceval: A secure coding benchmark for language models, 2023

    Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, and Joshua Saxe. Pur...

  2. [2]

    CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024

    Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models, 2024. URL https://arxiv.org/abs/2404.13161

  3. [3]

    Multi-agent penetration testing ai for the web, 2025

    Isaac David and Arthur Gervais. Multi-agent penetration testing ai for the web, 2025. URL https://arxiv.org/abs/2508.20816

  4. [4]

    Pentestgpt: Evaluating and harnessing large language models for automated penetration testing

    Gelei Deng, Yi Liu, Victor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), 2024. URL https://www.usenix.org/conference/usenixsecurity24/presentation/deng

  5. [5]

    LLM agents can autonomously exploit one-day vulnerabilities

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM agents can autonomously exploit one-day vulnerabilities, 2024. URL https://arxiv.org/abs/2404.08144

  6. [6]

    LLM agents can autonomously hack websites, 2024

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. LLM agents can autonomously hack websites, 2024. URL https://arxiv.org/abs/2402.06664

  7. [7]

    Single-agent or Multi-agent Systems? Why Not Both?, May 2025

    Mingyan Gao, Yanzi Li, Banruo Liu, Yifan Yu, Phillip Wang, Ching-Yu Lin, and Fan Lai. Single-agent or multi-agent systems? why not both?, 2025. URL https://arxiv.org/abs/2505.18286

  8. [8]

    Ai agent smart contract exploit generation

    Arthur Gervais and Liyi Zhou. Ai agent smart contract exploit generation. In Financial Cryptography and Data Security (FC), 2026. URL https://arxiv.org/abs/2507.05558

  9. [9]

    Towards a science of scaling agent systems, 2025

    Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, and Xin Liu. Towards a science of scaling agent systems, 2025. URL https://arxiv.org/abs/2512.08296

  10. [10]

    Agentbench: Evaluating LLMs as agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. In International Conference on Learning Representat...

  11. [11]

    Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity, 2024

    Zefang Liu, Jialei Shi, and John F. Buford. Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity. In AAAI 2024 Workshop on Artificial Intelligence for Cyber Security (AICS), 2024. URL https://aics.site/AICS2024/AICS_CyberBench.pdf

  12. [12]

    Common weakness enumeration, 2026

    MITRE. Common weakness enumeration, 2026. URL https://cwe.mitre.org/

  13. [13]

    Single-agent LLMs outperform multi-agent systems on multi-hop reasoning under equal thinking token budgets, 2026

    Dat Tran and Douwe Kiela. Single-agent LLMs outperform multi-agent systems on multi-hop reasoning under equal thinking token budgets, 2026. URL https://arxiv.org/abs/2604.02460

  14. [14]

    Evmbench: Evaluating ai agents on smart contract security, 2026

    Justin Wang, Andreas Bigger, Xiaohai Xu, Justin W. Lin, Andy Applebaum, Tejal Patwardhan, Alpin Yukseloglu, and Olivia Watkins. Evmbench: Evaluating ai agents on smart contract security, 2026. URL https://cdn.openai.com/evmbench/evmbench.pdf

  15. [15]

    Ai agents find $4.6m in blockchain smart contract exploits. https://red.anthropic.com/2025/smart-contracts/, December 2025

    Winnie Xiao, Cole Killian, Henry Sleight, Alan Chan, Nicholas Carlini, and Alwin Peng. Ai agents find $4.6m in blockchain smart contract exploits. https://red.anthropic.com/2025/smart-contracts/, December 2025. Anthropic Frontier Red Team blog post introducing SCONE-bench.

  16. [16]

    Rethinking the value of multi-agent workflow: A strong single agent baseline, 2026

    Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, and Ying Ding. Rethinking the value of multi-agent workflow: A strong single agent baseline, 2026. URL https://arxiv.org/abs/2601.12307

  17. [17]

    Intercode: Standardizing and benchmarking interactive coding with execution feedback, 2023

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. In Advances in Neural Information Processing Systems 36: Datasets and Benchmarks Track, 2023. URL https://arxiv.org/abs/2306.14898

  18. [18]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.03629

  19. [19]

    Zhang, Joey Ji, Celeste Menders, Riya Dulepet, T

    Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit C...

  20. [20]

    Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W

    Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham ...

  21. [21]

    Teams of LLM agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637, 2024

    Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. Teams of LLM agents can exploit zero-day vulnerabilities, 2024. URL https://arxiv.org/abs/2406.01637