FORGE: Multi-Agent Graduated Exploitation and Detection Engineering
Pith reviewed 2026-06-28 09:43 UTC · model grok-4.3
The pith
A multi-agent system with graduated exploitation depth reaches 67.8% L1+ exploitation success on 603 CVEs at USD 1.50 per CVE, and the detection rules it derives from deeper (L2+) traces are better grounded than those derived from L1 traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Graduated exploitation depth, assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), yields two results. First, exploitation success stays high regardless of metadata scores (EPSS, CVSS). Second, detection rules derived from L2+ traces are measurably better grounded than rules derived from L1 traces.
What carries the argument
The four-level exploitation taxonomy (L0 to L3) assessed by an LLM-primary oracle, which converts partial exploitation progress into reusable behavioral traces for rule generation.
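The taxonomy can be pictured as an ordered scale with two thresholds. The sketch below is illustrative only: the class and function names are hypothetical, and only the meanings of L0 and L3 are stated in the abstract, so L1 and L2 are left as unnamed intermediate levels.

```python
from enum import IntEnum


class ExploitLevel(IntEnum):
    """Four-level depth scale; only the L0 and L3 meanings come from the abstract."""
    L0 = 0  # no evidence of exploitation (per the abstract)
    L1 = 1  # intermediate level; not defined in the text reviewed here
    L2 = 2  # intermediate level; not defined in the text reviewed here
    L3 = 3  # full compromise (per the abstract)


def is_success(level: ExploitLevel) -> bool:
    """'L1+' success: any level at or above L1."""
    return level >= ExploitLevel.L1


def is_deep_trace(level: ExploitLevel) -> bool:
    """'L2+' trace: the group whose rules are compared against L1-derived rules."""
    return level >= ExploitLevel.L2


def l1_plus_rate(levels: list[ExploitLevel]) -> float:
    """Fraction of assessments that reached L1 or deeper."""
    if not levels:
        raise ValueError("no assessments to summarize")
    return sum(is_success(lv) for lv in levels) / len(levels)
```

Using an ordered integer scale keeps both thresholds (L1+ for success, L2+ for the deeper trace group) as simple comparisons, which is one way partial progress can be counted rather than discarded.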
Load-bearing premise
The LLM-primary oracle supplies reliable and unbiased labels for the four exploitation levels that can be treated as ground truth for both success and rule quality.
What would settle it
An independent human-expert labeling of the L0-L3 taxonomy on a random subset of the 603 CVEs, followed by a direct comparison of agreement rates with the LLM oracle.
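A minimal sketch of how such a comparison could be scored, assuming labels are encoded as integers 0 to 3 for L0 to L3. It reports raw agreement and Cohen's kappa, a standard chance-corrected agreement statistic; the function names are hypothetical and nothing here is taken from the paper.

```python
from collections import Counter


def raw_agreement(human: list[int], oracle: list[int]) -> float:
    """Share of items on which the human label equals the oracle label."""
    if len(human) != len(oracle) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == o for h, o in zip(human, oracle)) / len(human)


def cohens_kappa(human: list[int], oracle: list[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = raw_agreement(human, oracle)
    human_counts = Counter(human)
    oracle_counts = Counter(oracle)
    # Chance agreement: probability both raters pick the same label independently.
    p_chance = sum(human_counts[k] * oracle_counts[k] for k in human_counts) / (n * n)
    if p_chance == 1.0:
        return float("nan")  # kappa is undefined when both raters use one label only
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy labels for illustration only, not data from the paper.
human_labels = [0, 1, 2, 3, 1, 1, 0, 2]
oracle_labels = [0, 1, 2, 3, 1, 2, 0, 2]
print(raw_agreement(human_labels, oracle_labels))
print(cohens_kappa(human_labels, oracle_labels))
```

Because the taxonomy is ordinal, a weighted kappa that penalizes L0-versus-L3 disagreements more than adjacent-level ones may be a better fit; the unweighted form is shown for brevity.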
Original abstract
Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.
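The headline figures can be turned into implied totals with simple arithmetic. The sketch below is a back-of-the-envelope check, not a result reported by the paper: it assumes the 67.8% rate is rounded to one decimal and that USD 1.50 is an average over all 603 CVEs, successful or not.

```python
N_CVES = 603             # from the abstract
L1_PLUS_PCT = 67.8       # from the abstract, assumed rounded to one decimal
COST_PER_CVE_USD = 1.50  # from the abstract, assumed to be a per-CVE average


def consistent_counts(n: int, pct_rounded: float) -> list[int]:
    """Integer success counts k for which 100*k/n rounds to pct_rounded."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == pct_rounded]


counts = consistent_counts(N_CVES, L1_PLUS_PCT)
total_cost_usd = N_CVES * COST_PER_CVE_USD

print(counts)          # success counts compatible with the rounded rate
print(total_cost_usd)  # implied total spend under the averaging assumption
if counts:
    # Implied cost per L1+ CVE, again only under the averaging assumption.
    print(round(total_cost_usd / counts[0], 2))
```

Under these assumptions the only compatible success count is 409 of 603, and the implied total spend is USD 904.50.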
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FORGE, a multi-agent pipeline (Intel, Generator, Planner, Exploit, Detector) that generates vulnerable applications from CVE metadata, performs coached multi-turn exploitation labeled by an LLM-primary oracle on a four-level taxonomy (L0: no evidence to L3: full compromise), and derives Sigma and Snort detection rules from OpenTelemetry traces. On 603 CVEs from CVE-GENIE it reports 67.8% end-to-end L1+ success at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates stay near 68% across EPSS and CVSS bands; L2+-derived rules show higher span-normalized grounding than L1-derived rules (p=0.035); and 93.4% of generated Snort rules yield zero false positives on a synthetic benign corpus. A tiered knowledge architecture transfers build and exploitation experience across assessments.
Significance. If the LLM oracle labels are reliable, the work provides the first large-scale, graduated exploitation dataset that supplies both ground-truth reachability signals for prioritization validation and behavioral traces for detection engineering. The reported orthogonality between pattern-level exploitation success and metadata-based scores would be a substantive empirical result for the vulnerability management community.
major comments (1)
- [Abstract and Evaluation section] The central performance claims (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP Snort rules) rest on treating the LLM-primary oracle’s L0–L3 classifications as ground truth. No description is given of the oracle’s prompt template, few-shot examples, temperature settings, validation against human experts, calibration set, inter-rater agreement, or error analysis on known ground-truth exploits. Because these labels are used both to count successes and to select traces for rule generation, any systematic bias directly affects all headline statistics.
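To see how label bias would propagate into the headline rate, one can collapse the taxonomy to a binary outcome (L0 versus L1+) and apply the Rogan-Gladen correction for an imperfect classifier. This is a sketch of the referee's concern, not an analysis from the paper; the sensitivity and specificity values below are hypothetical placeholders, since no oracle error rates are reported.

```python
def corrected_rate(observed: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of the true positive rate behind an observed rate.

    observed    : fraction labeled L1+ by the oracle
    sensitivity : P(oracle says L1+ | truly L1+)
    specificity : P(oracle says L0  | truly L0)
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("oracle is no better than chance; correction undefined")
    estimate = (observed + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid proportion


OBSERVED_L1_PLUS = 0.678  # from the abstract

# Hypothetical oracle error profiles, for illustration only.
for sens, spec in [(1.00, 1.00), (0.95, 0.90), (1.00, 0.80)]:
    print(sens, spec, round(corrected_rate(OBSERVED_L1_PLUS, sens, spec), 4))
```

With a perfect oracle the estimate equals the observed 67.8%. Under the hypothetical profile in which one in five true L0 outcomes is mislabeled as L1+ (specificity 0.80, sensitivity 1.00), the estimate falls to about 59.8%, which is why a validation subset matters.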
minor comments (2)
- [Abstract] The cost figure of USD 1.50 per CVE should be accompanied by an explicit breakdown of which API calls, compute, and human oversight are included.
- [Evaluation section] The synthetic benign corpus used for false-positive testing should be characterized (size, traffic distribution, generation method) to allow reproducibility.
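The zero-false-positive figure depends on how alerts are counted against the benign corpus. A minimal sketch of that bookkeeping follows; the byte-substring matcher is a deliberately crude stand-in for running Snort, and the rule names, patterns, and payloads are invented for illustration.

```python
def count_benign_hits(rules: dict[str, bytes], benign_payloads: list[bytes]) -> dict[str, int]:
    """Count, per rule, how many benign payloads contain the rule's byte pattern.

    Substring matching is a toy stand-in for a real IDS engine.
    """
    return {
        rule_id: sum(pattern in payload for payload in benign_payloads)
        for rule_id, pattern in rules.items()
    }


def zero_fp_fraction(hits_per_rule: dict[str, int]) -> float:
    """Fraction of rules that raised no alert at all on the benign corpus."""
    if not hits_per_rule:
        raise ValueError("no rules to evaluate")
    clean = sum(1 for hits in hits_per_rule.values() if hits == 0)
    return clean / len(hits_per_rule)


# Hypothetical rules and benign traffic, for illustration only.
rules = {"rule-a": b"/etc/passwd", "rule-b": b"GET"}
benign = [b"GET /index.html HTTP/1.1", b"POST /login HTTP/1.1"]

hits = count_benign_hits(rules, benign)
print(hits)
print(zero_fp_fraction(hits))
```

The result is only as informative as the corpus: a small or unrepresentative benign set makes a zero-alert outcome easy to reach, which is why its size, traffic mix, and generation method need to be reported.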
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency around the LLM oracle. We agree that additional methodological details are required and will revise the manuscript accordingly.
- Referee: [Abstract and Evaluation section] The central performance claims (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP Snort rules) rest on treating the LLM-primary oracle’s L0–L3 classifications as ground truth. No description is given of the oracle’s prompt template, few-shot examples, temperature settings, validation against human experts, calibration set, inter-rater agreement, or error analysis on known ground-truth exploits. Because these labels are used both to count successes and to select traces for rule generation, any systematic bias directly affects all headline statistics.
Authors: We acknowledge that the current manuscript does not describe the oracle prompt template, few-shot examples, temperature settings, calibration set, inter-rater agreement, or error analysis. In the revised version we will add a new subsection (Evaluation: Oracle Configuration) that supplies the exact prompt template, the few-shot examples used, the temperature (0.0 for the primary labeling calls), and any calibration steps performed on a small internal set. We did not conduct a full human-expert validation or inter-rater study across all 603 CVEs because of scale and cost. However, we will report an error analysis on a randomly sampled subset of 50 CVEs for which two authors independently labeled the traces and computed agreement with the LLM oracle. We will also discuss the implications of any observed discrepancies for the headline statistics and the rule-generation pipeline.
Revision: yes
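A sketch of what the proposed 50-CVE check could look like in practice, assuming a seeded random draw and a Wilson score interval on the agreement proportion. The CVE identifiers, the seed, and the example agreement count are placeholders, not values from the paper or the rebuttal.

```python
import math
import random


def sample_cves(cve_ids: list[str], k: int = 50, seed: int = 0) -> list[str]:
    """Draw k distinct CVE ids reproducibly (seed is an arbitrary placeholder)."""
    rng = random.Random(seed)
    return sorted(rng.sample(cve_ids, k))


def wilson_interval(agree: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= agree <= n:
        raise ValueError("need 0 <= agree <= n and n > 0")
    p = agree / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Placeholder ids standing in for the 603 evaluated CVEs.
all_ids = [f"CVE-PLACEHOLDER-{i:04d}" for i in range(603)]
subset = sample_cves(all_ids)
print(len(subset))

# Hypothetical outcome: oracle matches the human consensus on 45 of 50 traces.
print(wilson_interval(45, 50))
```

Even at a hypothetical 45 of 50 matches, the interval runs from roughly 0.79 to 0.96, so a 50-item sample bounds oracle agreement only loosely.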
Circularity Check
No circularity in derivation chain
The paper reports empirical results from executing the FORGE pipeline on the external CVE-GENIE dataset of 603 CVEs, with all metrics (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP rules) presented as direct measurements against the LLM oracle assessments and synthetic corpus. No equations, fitted parameters, self-citations, or renamings appear in the provided text that would reduce any claimed result to an input by construction. The evaluation chain remains self-contained as observed outcomes on independent data rather than derived quantities.