arxiv: 2606.03453 · v1 · pith:S3MRFZEMnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI · cs.MA

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Pith reviewed 2026-06-28 09:43 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.MA
keywords multi-agent systems · vulnerability exploitation · detection rule generation · CVE analysis · LLM oracle · exploitation taxonomy · Snort rules · OpenTelemetry traces

The pith

A multi-agent system with graduated exploitation depth reaches 67.8 percent L1+ success on 603 CVEs at low cost and generates higher-quality detection rules from deeper traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FORGE, a pipeline of five agents that turns CVE metadata into vulnerable applications, runs coached multi-turn exploitation assessed on a four-level scale by an LLM oracle, and extracts Sigma and Snort rules from the resulting OpenTelemetry traces. It reports that this graduated approach yields 67.8 percent end-to-end L1+ exploitation across eight languages and 187 CWE types at roughly 1.50 USD per CVE, with success rates staying near 68 percent regardless of EPSS or CVSS band. Detection rules derived from L2+ exploitation show statistically higher span-normalized grounding than those from L1 attempts, and most generated Snort rules produce zero false positives on a synthetic benign corpus. The work treats the four exploitation levels as a bridge that supplies both prioritization ground truth and richer behavioral data for detection engineering.

Core claim

Graduated exploitation depth, assessed by an LLM-primary oracle on a four-level taxonomy from no evidence to full compromise, supplies both high exploitation success independent of metadata scores and detection rules with measurably stronger grounding when derived from L2+ traces.

What carries the argument

The four-level exploitation taxonomy (L0 to L3) assessed by an LLM-primary oracle, which converts partial exploitation progress into reusable behavioral traces for rule generation.
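
To make the "L1+" and "L2+" shorthand concrete, here is a minimal Python sketch of a four-level scale and a threshold rate. Only the glosses for L0 (no evidence) and L3 (full compromise) come from the paper's abstract; the names `ExploitLevel` and `rate_at_or_above` are hypothetical, and the labels are toy values, not data from the paper.

```python
from enum import IntEnum
from typing import Iterable


class ExploitLevel(IntEnum):
    """Four-level scale. Only the L0 and L3 glosses are stated in the abstract."""
    L0 = 0  # no evidence of exploitation
    L1 = 1  # intermediate level (not defined in this summary)
    L2 = 2  # intermediate level (not defined in this summary)
    L3 = 3  # full compromise


def rate_at_or_above(levels: Iterable[ExploitLevel], threshold: ExploitLevel) -> float:
    """Return the fraction of assessments whose level is at least `threshold`."""
    levels = list(levels)
    if not levels:
        raise ValueError("no assessments supplied")
    return sum(1 for level in levels if level >= threshold) / len(levels)


# Toy labels for five hypothetical CVEs (assumption, not paper data).
toy_levels = [
    ExploitLevel.L0,
    ExploitLevel.L1,
    ExploitLevel.L2,
    ExploitLevel.L3,
    ExploitLevel.L0,
]
l1_plus = rate_at_or_above(toy_levels, ExploitLevel.L1)  # three of five
l2_plus = rate_at_or_above(toy_levels, ExploitLevel.L2)  # two of five
```

An ordered enum keeps "L1+" a plain comparison, so partial progress is counted instead of being collapsed into pass/fail.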

Load-bearing premise

The LLM-primary oracle supplies reliable and unbiased labels for the four exploitation levels that can be treated as ground truth for both success and rule quality.

What would settle it

An independent human-expert labeling of the L0-L3 taxonomy on a random subset of the 603 CVEs, followed by a direct comparison of agreement rates with the LLM oracle.
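
One way such a comparison could be scored is with raw agreement plus Cohen's kappa, which corrects for chance agreement. The sketch below assumes levels are encoded as integers 0 to 3; the function name `cohen_kappa` and the label lists are hypothetical toy values, and the paper does not say which agreement statistic it would use.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (assumption): human expert vs. LLM oracle on six traces.
human = [0, 1, 2, 3, 1, 2]
oracle = [0, 1, 2, 3, 2, 2]
raw_agreement = sum(h == o for h, o in zip(human, oracle)) / len(human)
kappa = cohen_kappa(human, oracle)
```

Because L0 to L3 are ordered, a weighted kappa that penalizes a two-level disagreement more than a one-level one may be a better fit; the unweighted form is shown only for brevity.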

Figures

Figures reproduced from arXiv: 2606.03453 by Farooq Shaikh.

Figure 1: FORGE pipeline architecture with running example (CVE-2025-30370, jupyterlab-git command-injection RCE via $() substitution that bypasses partial shell-escaping; L3 in 4 exploit turns). The LLM-primary oracle evaluates every exploit turn against CWE-specific criteria. Five knowledge stores accumulate intelligence across assessments.
3.1 Pipeline Overview. The pipeline processes a single CVE through five st…
Figure 2: Exploitation depth and prioritization metric correlation (RQ1–RQ2).
… fall between 86.7% and 96.8%. One L3 case is a likely false positive (authorization bypass reaching L2 evidence but not full compromise); the remaining six non-VALID assessments are BORDERLINE. The audit script and per-CVE verdicts are released with the artifact. Baseline. CVE-GENIE [32] is the closest comparable system (50.9% binary suc…
Figure 3: Exploitation level distribution for the 12 most frequent primary CWEs (% of CVEs in each row reaching each level). Cell shade is the within-CWE share of the L0–L3 column; rows marked with ∗ are structurally capped at L2 by the oracle (§3.3).
Figure 4: Detection rule quality (RQ3) and cost decomposition (RQ4).
… the operational question of how much detection-relevant evidence each collected span carries. Under this normalized metric, L2+ rules achieve marginally significantly higher grounding (0.307 vs. 0.175, p=0.035): each span from deeper exploitation contributes more detection-relevant evidence. Rules are generated for 380 L1+ CVEs (mean 2.0 rules/CVE…
Original abstract

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.
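
The fixed agent ordering described in the abstract can be pictured as a simple sequential pipeline. The sketch below is illustrative only: the five agent names and the example CVE identifier come from the paper, while `Assessment`, `make_stage`, `run_pipeline`, and the placeholder outputs are hypothetical stand-ins, not FORGE's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Assessment:
    """State passed from one agent to the next (hypothetical structure)."""
    cve_id: str
    artifacts: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


Stage = Callable[[Assessment], Assessment]


def make_stage(name: str) -> Stage:
    """Build a stub agent that records a placeholder output under its name."""
    def stage(assessment: Assessment) -> Assessment:
        assessment.artifacts[name] = f"<{name} output placeholder>"
        assessment.log.append(name)
        return assessment
    return stage


# Agent names and their order are taken from the abstract.
AGENT_ORDER = ["Intel", "Generator", "Planner", "Exploit", "Detector"]
PIPELINE = [make_stage(name) for name in AGENT_ORDER]


def run_pipeline(cve_id: str) -> Assessment:
    """Run every stage once, in the fixed order, on a single CVE."""
    assessment = Assessment(cve_id)
    for stage in PIPELINE:
        assessment = stage(assessment)
    return assessment


result = run_pipeline("CVE-2025-30370")
```

A real system would replace each stub with an LLM-backed agent and would loop the Exploit stage over multiple coached turns; the fixed ordering is the only property this sketch is meant to show.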

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FORGE, a multi-agent pipeline (Intel, Generator, Planner, Exploit, Detector) that generates vulnerable applications from CVE metadata, performs coached multi-turn exploitation labeled by an LLM-primary oracle on a four-level taxonomy (L0: no evidence to L3: full compromise), and derives Sigma/Snort detection rules from OpenTelemetry traces. On 603 CVEs from CVE-GENIE it reports 67.8% end-to-end L1+ success at USD 1.50 per CVE across eight languages and 187 CWEs; exploitation rates stay near 68% regardless of EPSS or CVSS band; L2+-derived rules show higher span-normalized grounding than L1-derived rules (p=0.035); and 93.4% of generated Snort rules yield zero false positives on a synthetic benign corpus. A tiered knowledge base transfers experience across assessments.

Significance. If the LLM oracle labels are reliable, the work supplies the first large-scale, graduated exploitation dataset that provides both ground-truth reachability signals for prioritization validation and behavioral traces for detection engineering. The reported orthogonality between pattern-level exploitation success and metadata-based scores would be a substantive empirical result for the vulnerability management community.
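
A reader checking the orthogonality claim would start by computing the L1+ rate inside each scoring band and looking at the spread. The sketch below assumes each record is a (band, level) pair with levels encoded 0 to 3; the band names, the function `l1_plus_rate_by_band`, and the records are hypothetical toy values, not the paper's data or its statistical test.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def l1_plus_rate_by_band(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the share of CVEs at level 1 or higher within each band."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for band, level in records:
        totals[band] += 1
        if level >= 1:
            hits[band] += 1
    return {band: hits[band] / totals[band] for band in totals}


# Toy records (assumption): band label and oracle level per CVE.
toy_records = [
    ("low", 0), ("low", 2), ("low", 1),
    ("high", 3), ("high", 0), ("high", 1),
]
rates = l1_plus_rate_by_band(toy_records)
# A small spread across bands is what "independent of band" would look like.
spread = max(rates.values()) - min(rates.values())
```

Similar per-band rates are only descriptive; a formal independence test on the band-by-outcome counts would be needed before calling the result statistically independent.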

major comments (1)
  1. [Abstract and Evaluation section] The central performance claims (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP Snort rules) rest on treating the LLM-primary oracle’s L0–L3 classifications as ground truth. No description is given of the oracle’s prompt template, few-shot examples, temperature settings, validation against human experts, calibration set, inter-rater agreement, or error analysis on known ground-truth exploits. Because these labels are used both to count successes and to select traces for rule generation, any systematic bias directly affects all headline statistics.
minor comments (2)
  1. [Abstract] The cost figure of USD 1.50 per CVE should be accompanied by an explicit breakdown of which API calls, compute, and human oversight are included.
  2. [Evaluation section] The synthetic benign corpus used for false-positive testing should be characterized (size, traffic distribution, generation method) to allow reproducibility.
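
For reference, a zero-false-positive figure like the one questioned above is usually the share of rules that raise no alerts when replayed against benign traffic. The sketch below assumes per-rule alert counts have already been collected; the rule names, counts, and the function `zero_fp_fraction` are hypothetical, and the paper's corpus and replay method are not described in this summary.

```python
from typing import Dict


def zero_fp_fraction(alerts_per_rule: Dict[str, int]) -> float:
    """Return the share of rules that fired zero times on a benign corpus."""
    if not alerts_per_rule:
        raise ValueError("no rules supplied")
    clean = sum(1 for count in alerts_per_rule.values() if count == 0)
    return clean / len(alerts_per_rule)


# Toy alert counts (assumption): alerts each rule raised on benign traffic.
toy_alerts = {"rule-a": 0, "rule-b": 0, "rule-c": 4, "rule-d": 0}
fraction = zero_fp_fraction(toy_alerts)  # three of four rules stayed silent
```

The metric depends entirely on what the benign corpus contains, which is why its size, traffic mix, and generation method matter for reproducibility.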

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency around the LLM oracle. We agree that additional methodological details are required and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central performance claims (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP Snort rules) rest on treating the LLM-primary oracle’s L0–L3 classifications as ground truth. No description is given of the oracle’s prompt template, few-shot examples, temperature settings, validation against human experts, calibration set, inter-rater agreement, or error analysis on known ground-truth exploits. Because these labels are used both to count successes and to select traces for rule generation, any systematic bias directly affects all headline statistics.

    Authors: We acknowledge that the current manuscript does not include a description of the oracle prompt template, few-shot examples, temperature settings, calibration set, inter-rater agreement, or error analysis. In the revised version we will add a new subsection (Evaluation: Oracle Configuration) that supplies the exact prompt template, the few-shot examples used, temperature (0.0 for the primary labeling calls), and any calibration steps performed on a small internal set. We did not conduct a full human-expert validation or inter-rater study across all 603 CVEs because of scale and cost; however, we will report an error analysis on a randomly sampled subset of 50 CVEs for which two authors independently labeled the traces and computed agreement with the LLM oracle. We will also discuss the implications of any observed discrepancies for the headline statistics and rule-generation pipeline.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper reports empirical results from executing the FORGE pipeline on the external CVE-GENIE dataset of 603 CVEs, with all metrics (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP rules) presented as direct measurements against the LLM oracle assessments and synthetic corpus. No equations, fitted parameters, self-citations, or renamings appear in the provided text that would reduce any claimed result to an input by construction. The evaluation chain remains self-contained as observed outcomes on independent data rather than derived quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or newly postulated entities. The four-level taxonomy and the five agent roles are presented as design choices rather than derived quantities.

pith-pipeline@v0.9.1-grok · 5831 in / 1385 out tokens · 22754 ms · 2026-06-28T09:43:29.486484+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 28 canonical work pages · 7 internal anchors

  1. [1]

    Al Haddad, O., Ikram, M., Ahmed, E., Lee, Y.: Prompting the priorities: A first look at evaluating LLMs for vulnerability triage and prioritization. arXiv preprint arXiv:2510.18508 (2025), https://arxiv.org/abs/2510.18508

  2. [2]

    Applebaum, A., Miller, D., Strom, B., Korban, C., Wolf, R.: Intelligent, automated red team emulation. In: Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC). pp. 363–373 (2016). https://doi.org/10.1145/2991079.2991111

  3. [3]

    Brumley, D., Newsome, J., Song, D., Wang, H., Jha, S.: Towards automatic generation of vulnerability-based signatures. In: Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P). pp. 2–16 (2006). https://doi.org/10.1109/sp.2006.41

  4. [4]

    Bui, Q.C., Iannone, E., Camporese, M., Hinrichs, T., Tony, C., Tóth, L., Palomba, F., Hegedűs, P., Massacci, F., Scandariato, R.: A systematic literature review on automated exploit and security test generation. arXiv preprint arXiv:2502.04953 (2025), https://arxiv.org/abs/2502.04953

  5. [5]

    CVE.org: CVE metrics. https://www.cve.org/About/Metrics (2025), accessed: February 2026

  6. [6]

    Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., Rass, S.: PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In: Proceedings of the 33rd USENIX Security Symposium. pp. 847–864 (2024), https://www.usenix.org/conference/usenixsecurity24/presentation/deng

  7. [7]

    Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., Schiefer, N., Soklaski, R., Tamkin, A., Kaplan, J., Shlegeris, B., Bowman, S.R., Perez, E., Hubinger, E.: Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162 (2024), https://arxiv.org/abs/2406.10162

  8. [8]

    Elder, S., Rahman, M.R., Fringer, G., Kapoor, K., Williams, L.: A survey on software vulnerability exploitability assessment. ACM Computing Surveys 56(8), 1–41 (2024). https://doi.org/10.1145/3648610

  9. [9]

    Fairbanks, J., Serra, E.: Reflective beam search for automated TTP extraction and sigma rule generation from cyber threat intelligence. In: IEEE International Conference on Big Data (BigData). pp. 2130–2135 (2025). https://doi.org/10.1109/bigdata66926.2025.11401712

  10. [10]

    Fang, R., Bindu, R., Gupta, A., Kang, D.: LLM agents can autonomously exploit one-day vulnerabilities. arXiv preprint arXiv:2404.08144 (2024), https://arxiv.org/abs/2404.08144

  11. [11]

    Fleischer, F., Zhang, C., Jang, J., Cho, J., Xu, M., Kim, T.: Contextualizing sink knowledge for Java vulnerability discovery. arXiv preprint arXiv:2604.01645 (2026), https://arxiv.org/abs/2604.01645

  12. [12]

    Gordeychik, S.: Prediction meets patch queues: Empirical limits of EPSS-only prioritization using CISA KEV additions in 2025. TechRxiv preprint (2026). https://doi.org/10.36227/techrxiv.176857939.95987957/v1

  13. [13]

    Jacobs, J., Romanosky, S., Adjerid, I., Baker, W.: Improving vulnerability remediation through better exploit prediction. Journal of Cybersecurity 6(1) (2020). https://doi.org/10.1093/cybsec/tyaa015

  14. [14]

    Jacobs, J., Romanosky, S., Suciu, O., Edwards, B., Sarabi, A.: Enhancing vulnerability prioritization: Data-driven exploit predictions with community-driven insights. In: IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). pp. 194–206 (2023). https://doi.org/10.1109/EuroSPW59978.2023.00027

  15. [15]

    Kobayashi, M., Kanemoto, Y., Kotani, D., Okabe, Y.: Generation of IDS signatures through exhaustive execution path exploration in PoC codes for vulnerabilities. Journal of Information Processing 31, 591–601 (2023). https://doi.org/10.2197/ipsjjip.31.591

  16. [16]

    Koscinski, V., Nelson, M., Okutan, A., Falso, R., Mirakhorli, M.: Conflicting scores, confusing signals: An empirical study of vulnerability scoring systems. In: Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS). pp. 1904–1918 (2025). https://doi.org/10.1145/3719027.3765210

  17. [17]

    Lau, N., Sloot, L., Raj, J., Boscardin, G.M., Harris, E., Bowman, D., Brajkovski, M., Chawla, J., Zhao, D.: ZeroDayBench: Evaluating LLM agents on unseen zero-day vulnerabilities for cyberdefense. arXiv preprint arXiv:2603.02297 (2026), https://arxiv.org/abs/2603.02297

  18. [18]

    Li, H., Che, X., Wang, Y., Liao, X., Xing, L.: Execution-state-aware LLM reasoning for automated proof-of-vulnerability generation. arXiv preprint arXiv:2602.13574 (2026), https://arxiv.org/abs/2602.13574

  19. [19]

    Liang, Z., Sekar, R.: Fast and automated generation of attack signatures: A basis for building self-protecting servers. In: Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS). pp. 213–222 (2005). https://doi.org/10.1145/1102120.1102150

  20. [20]

    Liu, B., Zhao, Y., Xu, G., Wang, H.: LLM agents for automated web vulnerability reproduction: Are we there yet? arXiv preprint arXiv:2510.14700 (2025), https://arxiv.org/abs/2510.14700

  21. [21]

    Luo, X., Zhang, J., Zhou, S., Huang, R., Xiao, C., Zhu, Q., Ma, Z., Yue, X., Yue, Y., Zeng, W., Che, W.: CVE-Factory: Scaling expert-level agentic tasks for code security vulnerability. arXiv preprint arXiv:2602.03012 (2026), https://arxiv.org/abs/2602.03012

  22. [22]

    Mell, P., Spring, J.: Likely exploited vulnerabilities. Tech. Rep. NIST.CSWP.41, National Institute of Standards and Technology (2025), https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.41.pdf

  23. [23]

    FALCON: Transforming Cyber Threat Intelligence into Deployable IDS Rules with Self-Reflection

    Mitra, S., Bazarov, A., Duclos, M., Mittal, S., Piplai, A., Rahman, M.R., Zieglar, E., Rahimi, S.: FALCON: Autonomous cyber threat intelligence mining with LLMs for IDS rule generation. arXiv preprint arXiv:2508.18684 (2025), https://arxiv.org/abs/2508.18684

  24. [24]

    Newsome, J., Song, D.: Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In: Proceedings of the 12th Network and Distributed System Security Symposium (NDSS) (2005), https://www.ndss-symposium.org/ndss2005/dynamic-taint-analysis-automatic-detection-analysis-and-signaturegeneration-ex...

  25. [25]

    Parla, R.: Efficacy of EPSS in high severity CVEs found in KEV. arXiv preprint arXiv:2411.02618 (2024), https://arxiv.org/abs/2411.02618

  26. [26]

    Sánchez-Matas, A., Escribano Ruiz, P., Díaz-López, D., Perales Gómez, A.L., Nespoli, P., Martínez Pérez, G.: Simulating cyberattacks through a breach attack simulation (BAS) platform empowered by security chaos engineering (SCE). arXiv preprint arXiv:2508.03882 (2025), https://arxiv.org/abs/2508.03882

  27. [27]

    Shukla, A., Gandhi, P.A., Elovici, Y., Shabtai, A.: RuleGenie: SIEM detection rule set optimization. arXiv preprint arXiv:2505.06701 (2025), https://arxiv.org/abs/2505.06701

  28. [28]

    Simsek, D., Eghbali, A., Pradel, M.: PoCGen: Generating proof-of-concept exploits for vulnerabilities in npm packages. arXiv preprint arXiv:2506.04962 (2025), https://arxiv.org/abs/2506.04962

  29. [29]

    Smart, J., Jun, S.Y., et al.: Blueprint: Stakeholder-Specific Vulnerability Categorization guidance. Tech. rep., Sandia National Laboratories (2026), https://research-hub.nlr.gov/en/publications/blueprint-stakeholder-specific-vulnerability-categorization-guida/

  30. [30]

    Tran, T.T.V., Le, T.B.T., Truong, T.H.H., Thai, H.V., Hien, D.H., Phan, T.D.: EvoSIEM: Detecting and generating SIEM rule evasion behaviors in network systems. In: IEEE RIVF International Conference on Computing and Communication Technologies. pp. 498–503 (2025). https://doi.org/10.1109/rivf68649.2025.11365129

  31. [31]

    Uetz, R., Herzog, M., Hackländer, L., Schwarz, S., Henze, M.: You cannot escape me: Detecting evasions of SIEM rules in enterprise networks. In: Proceedings of the 33rd USENIX Security Symposium. pp. 5179–5196 (2024), https://www.usenix.org/conference/usenixsecurity24/presentation/uetz

  32. [32]

    Ullah, S., Balasubramanian, P., Guo, W., Burnett, A., Pearce, H., Kruegel, C., Vigna, G., Stringhini, G.: From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. arXiv preprint arXiv:2509.01835 (2025), https://arxiv.org/abs/2509.01835

  33. [33]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023), https://arxiv.org/abs/2210.03629

  34. [34]

    In: Proceedings of the 42nd International Conference on Machine Learning (ICML) (2025), https://arxiv.org/abs/2503.17332

    Zhu, Y., Kellermann, A., Bowman, D., Li, P., Gupta, A., Danda, A., Fang, R., Jensen, C., Ihli, E., Benn, J., Geronimo, J., Dhir, A., Rao, S., Yu, K., Stone, T., Kang, D.: CVE-Bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities. In: Proceedings of the 42nd International Conference on Machine Learning (ICML) (2025), https://a...

  35. [35]

    Zhu, Y., Kellermann, A., Gupta, A., Li, P., Fang, R., Bindu, R., Kang, D.: Teams of LLM agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637 (2024), https://arxiv.org/abs/2406.01637