FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Farooq Shaikh

arXiv: 2606.03453 · v1 · pith:S3MRFZEMnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI · cs.MA

Pith reviewed 2026-06-28 09:43 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.MA

keywords multi-agent systems · vulnerability exploitation · detection rule generation · CVE analysis · LLM oracle · exploitation taxonomy · Snort rules · OpenTelemetry traces

The pith

A multi-agent system with graduated exploitation depth reaches 67.8 percent L1+ success on 603 CVEs at low cost and generates higher-quality detection rules from deeper traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FORGE, a pipeline of five agents that turns CVE metadata into vulnerable applications, runs coached multi-turn exploitation assessed on a four-level scale by an LLM oracle, and extracts Sigma and Snort rules from the resulting OpenTelemetry traces. It reports that this graduated approach yields 67.8 percent end-to-end L1+ exploitation across eight languages and 187 CWE types at roughly 1.50 USD per CVE, with success rates staying near 68 percent regardless of EPSS or CVSS band. Detection rules derived from L2+ exploitation show statistically higher span-normalized grounding than those from L1 attempts, and most generated Snort rules produce zero false positives on a synthetic benign corpus. The work treats the four exploitation levels as a bridge that supplies both prioritization ground truth and richer behavioral data for detection engineering.

Core claim

Graduated exploitation depth, assessed by an LLM-primary oracle on a four-level taxonomy from no evidence to full compromise, supplies both high exploitation success independent of metadata scores and detection rules with measurably stronger grounding when derived from L2+ traces.

What carries the argument

The four-level exploitation taxonomy (L0 to L3) assessed by an LLM-primary oracle, which converts partial exploitation progress into reusable behavioral traces for rule generation.
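As a minimal sketch of how the four-level scale turns into the "L1+" and "L2+" figures quoted on this page, the levels can be modeled as an ordered enum and a success rate computed as the share of assessments at or above a threshold. The names `ExploitLevel` and `rate_at_or_above` are hypothetical, the labels are made up, and only the L0 and L3 endpoints are described in the text above.

```python
from enum import IntEnum


class ExploitLevel(IntEnum):
    """Four-level scale; only the L0 and L3 endpoints are described in the source."""
    L0 = 0  # no evidence
    L1 = 1  # intermediate level (not defined in detail on this page)
    L2 = 2  # intermediate level (not defined in detail on this page)
    L3 = 3  # full compromise


def rate_at_or_above(levels, threshold):
    """Fraction of assessments whose level is at least `threshold`."""
    if not levels:
        raise ValueError("need at least one assessment")
    return sum(1 for level in levels if level >= threshold) / len(levels)


# Made-up labels for five hypothetical assessments, not data from the paper.
demo = [ExploitLevel.L0, ExploitLevel.L1, ExploitLevel.L3,
        ExploitLevel.L2, ExploitLevel.L0]
l1_plus = rate_at_or_above(demo, ExploitLevel.L1)  # 3 of 5 assessments
l2_plus = rate_at_or_above(demo, ExploitLevel.L2)  # 2 of 5 assessments
```

Because the levels are ordered, a binary pass/fail metric is just one threshold on the same data, which is what lets partial progress be counted rather than discarded.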

Load-bearing premise

The LLM-primary oracle supplies reliable and unbiased labels for the four exploitation levels that can be treated as ground truth for both success and rule quality.

What would settle it

An independent human-expert labeling of the L0-L3 taxonomy on a random subset of the 603 CVEs, followed by a direct comparison of agreement rates with the LLM oracle.
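One way such a comparison could be scored is with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is a sketch under assumptions: the paper does not say which agreement statistic it would use, the function name `cohen_kappa` is hypothetical, and the labels below are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented L0-L3 labels (as integers) for eight hypothetical CVEs.
human = [0, 1, 2, 3, 1, 0, 2, 3]
oracle = [0, 1, 2, 3, 1, 0, 3, 3]
kappa = cohen_kappa(human, oracle)
```

A weighted variant would be a natural alternative here, since confusing L2 with L3 is arguably a smaller error than confusing L0 with L3 on an ordered scale.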

Figures

Figures reproduced from arXiv: 2606.03453 by Farooq Shaikh.

**Figure 1.** FORGE pipeline architecture with running example (CVE-2025-30370, jupyterlab-git command-injection RCE via $() substitution that bypasses partial shell escaping; L3 in 4 exploit turns). The LLM-primary oracle evaluates every exploit turn against CWE-specific criteria. Five knowledge stores accumulate intelligence across assessments.

3.1 Pipeline Overview. The pipeline processes a single CVE through five st…

**Figure 2.** Exploitation depth and prioritization metric correlation (RQ1–RQ2).

… fall between 86.7% and 96.8%. One L3 case is a likely false positive (authorization bypass reaching L2 evidence but not full compromise); the remaining six non-VALID assessments are BORDERLINE. The audit script and per-CVE verdicts are released with the artifact. Baseline. CVE-GENIE [32] is the closest comparable system (50.9% binary suc…

**Figure 3.** Exploitation level distribution for the 12 most frequent primary CWEs (% of CVEs in each row reaching each level). Cell shade is the within-CWE share of the L0–L3 column; rows marked with ∗ are structurally capped at L2 by the oracle (§3.3).

**Figure 4.** Detection rule quality (RQ3) and cost decomposition (RQ4).

… the operational question of how much detection-relevant evidence each collected span carries. Under this normalized metric, L2+ rules achieve marginally significantly higher grounding (0.307 vs. 0.175, p=0.035): each span from deeper exploitation contributes more detection-relevant evidence. Rules are generated for 380 L1+ CVEs (mean 2.0 rules/CVE…

read the original abstract

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.
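The per-band comparison described in the abstract (L1+ rates staying near 68% across EPSS or CVSS bands) amounts to grouping assessments by score band and computing the success rate within each group. The following is a minimal sketch, assuming each assessment is reduced to a `(band, level)` pair; the band labels, the function name `l1_plus_rate_by_band`, and the records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def l1_plus_rate_by_band(records):
    """Map each band label to the share of its assessments at level 1 or higher.

    `records` is an iterable of (band_label, level) pairs, with level in 0..3.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for band, level in records:
        totals[band] += 1
        if level >= 1:
            hits[band] += 1
    return {band: hits[band] / totals[band] for band in totals}


# Invented records: two "low" band assessments and four "high" band assessments.
records = [("low", 0), ("low", 2), ("high", 1), ("high", 0), ("high", 3), ("high", 1)]
rates = l1_plus_rate_by_band(records)
```

Similar rates across bands would be consistent with the claim that exploitation success does not track the metadata score, although a formal test of independence would need the per-band counts as well as the rates.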

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FORGE links CVE metadata to multi-agent exploitation and rule generation in one pipeline, but the reported rates and p-value rest on an unvalidated LLM oracle for the L0-L3 labels.

The paper describes a five-agent pipeline that turns CVE descriptions into vulnerable test apps, runs coached multi-turn exploitation, labels the outcome on a four-level scale via an LLM oracle, and then emits Sigma and Snort rules from the resulting OpenTelemetry traces. It also carries forward build and exploit knowledge across CVEs. On 603 CVEs spanning eight languages and 187 CWEs it reports 67.8 percent L1-or-better success at roughly $1.50 per CVE, with rates staying flat across EPSS and CVSS bands, plus a statistically higher grounding score for rules derived from L2+ traces (p=0.035) and 93.4 percent zero-FP Snort rules on synthetic benign traffic.

The concrete numbers and the end-to-end scope are the parts that stand out. Connecting the three communities through graduated depth rather than binary pass/fail is a reasonable engineering move, and the tiered knowledge store is a practical way to amortize cost. The claim that reachability appears orthogonal to metadata scores would be useful if it holds.

The central weakness is the LLM oracle itself. It supplies the L0-L3 labels that both count successes and decide which traces are rich enough for rule generation. The abstract gives no prompting details, no calibration set, no inter-rater numbers, and no error analysis against known exploits. Any consistent bias or inconsistency in those labels flows straight into the 67.8 percent figure, the EPSS/CVSS independence result, and the p=0.035 comparison. That is not a minor omission when the oracle is load-bearing.

The work is aimed at people building automated vulnerability tooling who need a concrete pipeline to test or extend. It is worth sending to referees so the oracle can be examined in detail and the evaluation can be reproduced or stress-tested; the architecture is described clearly enough that reviewers can ask for the missing validation steps without starting from scratch.

Referee Report

1 major / 2 minor

Summary. The paper introduces FORGE, a multi-agent pipeline (Intel, Generator, Planner, Exploit, Detector) that generates vulnerable applications from CVE metadata, performs coached multi-turn exploitation labeled by an LLM-primary oracle on a four-level taxonomy (L0: no evidence to L3: full compromise), and derives Sigma and Snort detection rules from OpenTelemetry traces. On 603 CVEs from CVE-GENIE it reports 67.8% end-to-end L1+ success at $1.50 per CVE across eight languages and 187 CWEs; success rates remain near 68% regardless of EPSS or CVSS band; L2+-derived rules show higher span-normalized grounding than L1-derived rules (p=0.035); and 93.4% of generated Snort rules yield zero false positives on a synthetic benign corpus. A tiered knowledge base transfers experience across assessments.

Significance. If the LLM oracle labels are reliable, the work offers the first large-scale, graduated exploitation dataset that provides both ground-truth reachability signals for prioritization validation and behavioral traces for detection engineering. The reported orthogonality between pattern-level exploitation success and metadata-based scores would be a substantive empirical result for the vulnerability management community.

major comments (1)

[Abstract and Evaluation section] The central performance claims (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP Snort rules) rest on treating the LLM-primary oracle’s L0–L3 classifications as ground truth. No description is given of the oracle’s prompt template, few-shot examples, temperature settings, validation against human experts, calibration set, inter-rater agreement, or error analysis on known ground-truth exploits. Because these labels are used both to count successes and to select traces for rule generation, any systematic bias directly affects all headline statistics.

minor comments (2)

[Abstract] The cost figure of USD 1.50 per CVE should be accompanied by an explicit breakdown of which API calls, compute, and human oversight are included.
[Evaluation section] The synthetic benign corpus used for false-positive testing should be characterized (size, traffic distribution, generation method) to allow reproducibility.
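To make the false-positive check concrete, the sketch below counts how often each rule fires on a benign corpus and reports the share of rules that never fire. It is a deliberate simplification and not Snort's matching engine: a "rule" is reduced to a list of byte patterns that must all appear in a payload, and the rule names, patterns, and payloads are invented for illustration.

```python
def false_positive_counts(rules, benign_payloads):
    """Count, per rule, how many benign payloads contain all of its patterns.

    `rules` maps a rule name to a list of byte patterns (a stand-in for
    content matches); every hit on benign traffic is a false positive.
    """
    counts = {}
    for name, patterns in rules.items():
        counts[name] = sum(
            1 for payload in benign_payloads
            if all(pattern in payload for pattern in patterns)
        )
    return counts


def zero_fp_share(counts):
    """Fraction of rules with no false positives on the benign corpus."""
    if not counts:
        raise ValueError("need at least one rule")
    return sum(1 for c in counts.values() if c == 0) / len(counts)


# Hypothetical rules and benign traffic, for illustration only.
rules = {
    "cmd_substitution": [b"$(", b"git"],
    "generic_get": [b"GET /"],
}
benign = [b"GET /index.html HTTP/1.1", b"POST /api/login HTTP/1.1"]
counts = false_positive_counts(rules, benign)
share = zero_fp_share(counts)
```

The result depends entirely on what the benign corpus contains, which is why the size, traffic mix, and generation method requested in the comment above matter for interpreting a zero-FP figure.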

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency around the LLM oracle. We agree that additional methodological details are required and will revise the manuscript accordingly.

Referee: [Abstract and Evaluation section] The central performance claims (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP Snort rules) rest on treating the LLM-primary oracle’s L0–L3 classifications as ground truth. No description is given of the oracle’s prompt template, few-shot examples, temperature settings, validation against human experts, calibration set, inter-rater agreement, or error analysis on known ground-truth exploits. Because these labels are used both to count successes and to select traces for rule generation, any systematic bias directly affects all headline statistics.

Authors: We acknowledge that the current manuscript does not describe the oracle prompt template, few-shot examples, temperature settings, calibration set, inter-rater agreement, or error analysis. In the revised version we will add a new subsection (Evaluation: Oracle Configuration) that supplies the exact prompt template, the few-shot examples used, the temperature (0.0 for the primary labeling calls), and any calibration steps performed on a small internal set. We did not conduct a full human-expert validation or inter-rater study across all 603 CVEs because of scale and cost; however, we will report an error analysis on a randomly sampled subset of 50 CVEs for which two annotators independently labeled the traces and computed agreement with the LLM oracle. We will also discuss the implications of any observed discrepancies for the headline statistics and the rule-generation pipeline.

Revision: yes
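The sampling step proposed in the response can be made reproducible by fixing a random seed, so that others can re-draw the same 50 CVEs. The sketch below is an assumption-laden illustration: the placeholder IDs, the seed, and the function names `sample_cves` and `agreement_rate` are hypothetical, and the labels are synthetic.

```python
import random


def sample_cves(cve_ids, k=50, seed=0):
    """Draw a reproducible simple random sample of k CVE IDs without replacement."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(list(cve_ids), k))


def agreement_rate(reference, candidate):
    """Share of CVEs labeled in both mappings that received the same level."""
    shared = reference.keys() & candidate.keys()
    if not shared:
        raise ValueError("no CVEs labeled by both sources")
    return sum(reference[c] == candidate[c] for c in shared) / len(shared)


# Placeholder identifiers standing in for the 603 evaluated CVEs.
all_ids = [f"CVE-0000-{i:04d}" for i in range(603)]
subset = sample_cves(all_ids)

# Synthetic labels: the oracle disagrees with the human label on one CVE.
human_labels = {cve: 1 for cve in subset}
oracle_labels = dict(human_labels)
oracle_labels[subset[0]] = 2
rate = agreement_rate(human_labels, oracle_labels)
```

Publishing the seed and the sampled IDs alongside the per-CVE verdicts would let readers check that the subset was not chosen after seeing the oracle's output.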

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper reports empirical results from executing the FORGE pipeline on the external CVE-GENIE dataset of 603 CVEs, with all metrics (67.8% L1+ exploitation, p=0.035 grounding difference, 93.4% zero-FP rules) presented as direct measurements against the LLM oracle assessments and synthetic corpus. No equations, fitted parameters, self-citations, or renamings appear in the provided text that would reduce any claimed result to an input by construction. The evaluation chain remains self-contained as observed outcomes on independent data rather than derived quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or newly postulated entities. The four-level taxonomy and the five agent roles are presented as design choices rather than derived quantities.

pith-pipeline@v0.9.1-grok · 5831 in / 1385 out tokens · 22754 ms · 2026-06-28T09:43:29.486484+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 28 canonical work pages · 7 internal anchors

[1] Al Haddad, O., Ikram, M., Ahmed, E., Lee, Y.: Prompting the priorities: A first look at evaluating LLMs for vulnerability triage and prioritization. arXiv preprint arXiv:2510.18508 (2025), https://arxiv.org/abs/2510.18508

[2] Applebaum, A., Miller, D., Strom, B., Korban, C., Wolf, R.: Intelligent, automated red team emulation. In: Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC). pp. 363–373 (2016). https://doi.org/10.1145/2991079.2991111

[3] Brumley, D., Newsome, J., Song, D., Wang, H., Jha, S.: Towards automatic generation of vulnerability-based signatures. In: Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P). pp. 2–16 (2006). https://doi.org/10.1109/sp.2006.41

[4] Bui, Q.C., Iannone, E., Camporese, M., Hinrichs, T., Tony, C., Tóth, L., Palomba, F., Hegedűs, P., Massacci, F., Scandariato, R.: A systematic literature review on automated exploit and security test generation. arXiv preprint arXiv:2502.04953 (2025), https://arxiv.org/abs/2502.04953

[5] CVE.org: CVE metrics. https://www.cve.org/About/Metrics (2025), accessed: February 2026

[6] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., Rass, S.: PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In: Proceedings of the 33rd USENIX Security Symposium. pp. 847–864 (2024), https://www.usenix.org/conference/usenixsecurity24/presentation/deng

[7] Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., Schiefer, N., Soklaski, R., Tamkin, A., Kaplan, J., Shlegeris, B., Bowman, S.R., Perez, E., Hubinger, E.: Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162 (2024), https://arxiv.org/abs/2406.10162

[8] Elder, S., Rahman, M.R., Fringer, G., Kapoor, K., Williams, L.: A survey on software vulnerability exploitability assessment. ACM Computing Surveys 56(8), 1–41 (2024). https://doi.org/10.1145/3648610

[9] Fairbanks, J., Serra, E.: Reflective beam search for automated TTP extraction and sigma rule generation from cyber threat intelligence. In: IEEE International Conference on Big Data (BigData). pp. 2130–2135 (2025). https://doi.org/10.1109/bigdata66926.2025.11401712

[10] Fang, R., Bindu, R., Gupta, A., Kang, D.: LLM agents can autonomously exploit one-day vulnerabilities. arXiv preprint arXiv:2404.08144 (2024), https://arxiv.org/abs/2404.08144

[11] Fleischer, F., Zhang, C., Jang, J., Cho, J., Xu, M., Kim, T.: Contextualizing sink knowledge for Java vulnerability discovery. arXiv preprint arXiv:2604.01645 (2026), https://arxiv.org/abs/2604.01645

[12] Gordeychik, S.: Prediction meets patch queues: Empirical limits of EPSS-only prioritization using CISA KEV additions in 2025. TechRxiv preprint (2026). https://doi.org/10.36227/techrxiv.176857939.95987957/v1

[13] Jacobs, J., Romanosky, S., Adjerid, I., Baker, W.: Improving vulnerability remediation through better exploit prediction. Journal of Cybersecurity 6(1) (2020). https://doi.org/10.1093/cybsec/tyaa015

[14] Jacobs, J., Romanosky, S., Suciu, O., Edwards, B., Sarabi, A.: Enhancing vulnerability prioritization: Data-driven exploit predictions with community-driven insights. In: IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). pp. 194–206 (2023). https://doi.org/10.1109/EuroSPW59978.2023.00027

[15] Kobayashi, M., Kanemoto, Y., Kotani, D., Okabe, Y.: Generation of IDS signatures through exhaustive execution path exploration in PoC codes for vulnerabilities. Journal of Information Processing 31, 591–601 (2023). https://doi.org/10.2197/ipsjjip.31.591

[16] Koscinski, V., Nelson, M., Okutan, A., Falso, R., Mirakhorli, M.: Conflicting scores, confusing signals: An empirical study of vulnerability scoring systems. In: Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS). pp. 1904–1918 (2025). https://doi.org/10.1145/3719027.3765210

[17] Lau, N., Sloot, L., Raj, J., Boscardin, G.M., Harris, E., Bowman, D., Brajkovski, M., Chawla, J., Zhao, D.: ZeroDayBench: Evaluating LLM agents on unseen zero-day vulnerabilities for cyberdefense. arXiv preprint arXiv:2603.02297 (2026), https://arxiv.org/abs/2603.02297

[18] Li, H., Che, X., Wang, Y., Liao, X., Xing, L.: Execution-state-aware LLM reasoning for automated proof-of-vulnerability generation. arXiv preprint arXiv:2602.13574 (2026), https://arxiv.org/abs/2602.13574

[19] Liang, Z., Sekar, R.: Fast and automated generation of attack signatures: A basis for building self-protecting servers. In: Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS). pp. 213–222 (2005). https://doi.org/10.1145/1102120.1102150

[20] Liu, B., Zhao, Y., Xu, G., Wang, H.: LLM agents for automated web vulnerability reproduction: Are we there yet? arXiv preprint arXiv:2510.14700 (2025), https://arxiv.org/abs/2510.14700

[21] Luo, X., Zhang, J., Zhou, S., Huang, R., Xiao, C., Zhu, Q., Ma, Z., Yue, X., Yue, Y., Zeng, W., Che, W.: CVE-Factory: Scaling expert-level agentic tasks for code security vulnerability. arXiv preprint arXiv:2602.03012 (2026), https://arxiv.org/abs/2602.03012

[22] Mell, P., Spring, J.: Likely exploited vulnerabilities. Tech. Rep. NIST.CSWP.41, National Institute of Standards and Technology (2025), https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.41.pdf

[23] Mitra, S., Bazarov, A., Duclos, M., Mittal, S., Piplai, A., Rahman, M.R., Zieglar, E., Rahimi, S.: FALCON: Autonomous cyber threat intelligence mining with LLMs for IDS rule generation. arXiv preprint arXiv:2508.18684 (2025), https://arxiv.org/abs/2508.18684

[24] Newsome, J., Song, D.: Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In: Proceedings of the 12th Network and Distributed System Security Symposium (NDSS) (2005), https://www.ndss-symposium.org/ndss2005/dynamic-taint-analysis-automatic-detection-analysis-and-signaturegeneration-ex...

[25] Parla, R.: Efficacy of EPSS in high severity CVEs found in KEV. arXiv preprint arXiv:2411.02618 (2024), https://arxiv.org/abs/2411.02618

[26] Sánchez-Matas, A., Escribano Ruiz, P., Díaz-López, D., Perales Gómez, A.L., Nespoli, P., Martínez Pérez, G.: Simulating cyberattacks through a breach attack simulation (BAS) platform empowered by security chaos engineering (SCE). arXiv preprint arXiv:2508.03882 (2025), https://arxiv.org/abs/2508.03882

[27] Shukla, A., Gandhi, P.A., Elovici, Y., Shabtai, A.: RuleGenie: SIEM detection rule set optimization. arXiv preprint arXiv:2505.06701 (2025), https://arxiv.org/abs/2505.06701

[28] Simsek, D., Eghbali, A., Pradel, M.: PoCGen: Generating proof-of-concept exploits for vulnerabilities in npm packages. arXiv preprint arXiv:2506.04962 (2025), https://arxiv.org/abs/2506.04962

[29] Smart, J., Jun, S.Y., et al.: Blueprint: Stakeholder-Specific Vulnerability Categorization guidance. Tech. rep., Sandia National Laboratories (2026), https://research-hub.nlr.gov/en/publications/blueprint-stakeholder-specific-vulnerability-categorization-guida/

[30] Tran, T.T.V., Le, T.B.T., Truong, T.H.H., Thai, H.V., Hien, D.H., Phan, T.D.: EvoSIEM: Detecting and generating SIEM rule evasion behaviors in network systems. In: IEEE RIVF International Conference on Computing and Communication Technologies. pp. 498–503 (2025). https://doi.org/10.1109/rivf68649.2025.11365129

[31] Uetz, R., Herzog, M., Hackländer, L., Schwarz, S., Henze, M.: You cannot escape me: Detecting evasions of SIEM rules in enterprise networks. In: Proceedings of the 33rd USENIX Security Symposium. pp. 5179–5196 (2024), https://www.usenix.org/conference/usenixsecurity24/presentation/uetz

[32] Ullah, S., Balasubramanian, P., Guo, W., Burnett, A., Pearce, H., Kruegel, C., Vigna, G., Stringhini, G.: From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. arXiv preprint arXiv:2509.01835 (2025), https://arxiv.org/abs/2509.01835

[33] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023), https://arxiv.org/abs/2210.03629

[34] Zhu, Y., Kellermann, A., Bowman, D., Li, P., Gupta, A., Danda, A., Fang, R., Jensen, C., Ihli, E., Benn, J., Geronimo, J., Dhir, A., Rao, S., Yu, K., Stone, T., Kang, D.: CVE-Bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities. In: Proceedings of the 42nd International Conference on Machine Learning (ICML) (2025), https://arxiv.org/abs/2503.17332

[35] Zhu, Y., Kellermann, A., Gupta, A., Li, P., Fang, R., Bindu, R., Kang, D.: Teams of LLM agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637 (2024), https://arxiv.org/abs/2406.01637

[1] [1]

arXiv preprint arXiv:2510.18508 (2025), https://arxiv.org/abs/2510.18508

Al Haddad, O., Ikram, M., Ahmed, E., Lee, Y.: Prompting the priorities: A first look at evaluating LLMs for vulnerability triage and prioritization. arXiv preprint arXiv:2510.18508 (2025), https://arxiv.org/abs/2510.18508

work page arXiv 2025

[2] [2]

In: Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC)

Applebaum, A., Miller, D., Strom, B., Korban, C., Wolf, R.: Intelligent, automated red team emulation. In: Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC). pp. 363–373 (2016). https://doi.org/10.1145/29 91079.2991111

work page doi:10.1145/29 2016

[3] [3]

In: Proceedings of the 2006 IEEE Sym- posium on Security and Privacy (S&P)

Brumley, D., Newsome, J., Song, D., Wang, H., Jha, S.: Towards automatic gen- eration of vulnerability-based signatures. In: Proceedings of the 2006 IEEE Sym- posium on Security and Privacy (S&P). pp. 2–16 (2006). https://doi.org/10.1109/ sp.2006.41

2006

[4] [4]

arXiv preprint arXiv:2502.04953 (2025), https://arxiv.org/abs/2502.04953

Bui, Q.C., Iannone, E., Camporese, M., Hinrichs, T., Tony, C., Tóth, L., Palomba, F., Hegedűs, P., Massacci, F., Scandariato, R.: A systematic literature review on automated exploit and security test generation. arXiv preprint arXiv:2502.04953 (2025), https://arxiv.org/abs/2502.04953

work page arXiv 2025

[5] [5]

https://www.cve.org/About/Metrics (2025), accessed: February 2026

CVE.org: CVE metrics. https://www.cve.org/About/Metrics (2025), accessed: February 2026

2025

[6] [6]

In: Proceedings of the 33rd USENIX Security Symposium

Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., Rass, S.: PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In: Proceedings of the 33rd USENIX Security Symposium. pp. 847–864 (2024), https://www.usenix.org/conference/us enixsecurity24/presentation/deng

2024

[7] [7]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., Schiefer, N., Soklaski, R., Tamkin, A., Kaplan, J., Shlegeris, B., Bowman, S.R., Perez, E., Hubinger, E.: Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv preprint arXiv:2406.10162 (2024), https://arxiv. org/abs/2406.10162

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

ACM Computing Surveys56(8), 1–41 (2024)

Elder, S., Rahman, M.R., Fringer, G., Kapoor, K., Williams, L.: A survey on software vulnerability exploitability assessment. ACM Computing Surveys56(8), 1–41 (2024). https://doi.org/10.1145/3648610 FORGE: Multi-Agent Graduated Exploitation and Detection Engineering 17

work page doi:10.1145/3648610 2024

[9] [9]

In: IEEE International Conference on Big Data (BigData)

Fairbanks, J., Serra, E.: Reflective beam search for automated TTP extraction and sigma rule generation from cyber threat intelligence. In: IEEE International Conference on Big Data (BigData). pp. 2130–2135 (2025). https://doi.org/10.110 9/bigdata66926.2025.11401712

work page arXiv 2025

[10] [10]

LLM Agents can Autonomously Exploit One-day Vulnerabilities

Fang, R., Bindu, R., Gupta, A., Kang, D.: LLM agents can autonomously exploit one-day vulnerabilities. arXiv preprint arXiv:2404.08144 (2024), https://arxiv.or g/abs/2404.08144

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Fleischer, F., Zhang, C., Jang, J., Cho, J., Xu, M., Kim, T.: Contextualizing sink knowledgeforJavavulnerabilitydiscovery.arXivpreprintarXiv:2604.01645(2026), https://arxiv.org/abs/2604.01645

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

TechRxiv preprint (2026)

Gordeychik, S.: Prediction meets patch queues: Empirical limits of EPSS-only prioritization using CISA KEV additions in 2025. TechRxiv preprint (2026). https://doi.org/10.36227/techrxiv.176857939.95987957/v1

work page doi:10.36227/techrxiv.176857939.95987957/v1 2025

[13] [13]

Journal of Cybersecurity6(1) (2020)

Jacobs, J., Romanosky, S., Adjerid, I., Baker, W.: Improving vulnerability reme- diation through better exploit prediction. Journal of Cybersecurity6(1) (2020). https://doi.org/10.1093/cybsec/tyaa015

work page doi:10.1093/cybsec/tyaa015 2020

[14] [14]

In: IEEE European Symposium on Security and Privacy Workshops (Eu- roS&PW)

Jacobs, J., Romanosky, S., Suciu, O., Edwards, B., Sarabi, A.: Enhancing vul- nerability prioritization: Data-driven exploit predictions with community-driven insights. In: IEEE European Symposium on Security and Privacy Workshops (Eu- roS&PW). pp. 194–206 (2023). https://doi.org/10.1109/EuroSPW59978.2023.00 027

work page doi:10.1109/eurospw59978.2023.00 2023

[15] [15]

Journal of Information Processing31, 591–601 (2023)

Kobayashi, M., Kanemoto, Y., Kotani, D., Okabe, Y.: Generation of IDS signatures through exhaustive execution path exploration in PoC codes for vulnerabilities. Journal of Information Processing31, 591–601 (2023). https://doi.org/10.2197/ip sjjip.31.591

work page doi:10.2197/ip 2023

[16] [16]

In: Proceed- ings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS)

Koscinski, V., Nelson, M., Okutan, A., Falso, R., Mirakhorli, M.: Conflicting scores, confusing signals: An empirical study of vulnerability scoring systems. In: Proceed- ings of the 2025 ACM SIGSAC Conference on Computer and Communications Security (CCS). pp. 1904–1918 (2025). https://doi.org/10.1145/3719027.3765210

work page doi:10.1145/3719027.3765210 2025

[17] Lau, N., Sloot, L., Raj, J., Boscardin, G.M., Harris, E., Bowman, D., Brajkovski, M., Chawla, J., Zhao, D.: ZeroDayBench: Evaluating LLM agents on unseen zero-day vulnerabilities for cyberdefense. arXiv preprint arXiv:2603.02297 (2026), https://arxiv.org/abs/2603.02297

[18] Li, H., Che, X., Wang, Y., Liao, X., Xing, L.: Execution-state-aware LLM reasoning for automated proof-of-vulnerability generation. arXiv preprint arXiv:2602.13574 (2026), https://arxiv.org/abs/2602.13574

[19] Liang, Z., Sekar, R.: Fast and automated generation of attack signatures: A basis for building self-protecting servers. In: Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS). pp. 213–222 (2005). https://doi.org/10.1145/1102120.1102150

[20] Liu, B., Zhao, Y., Xu, G., Wang, H.: LLM agents for automated web vulnerability reproduction: Are we there yet? arXiv preprint arXiv:2510.14700 (2025), https://arxiv.org/abs/2510.14700

[21] Luo, X., Zhang, J., Zhou, S., Huang, R., Xiao, C., Zhu, Q., Ma, Z., Yue, X., Yue, Y., Zeng, W., Che, W.: CVE-Factory: Scaling expert-level agentic tasks for code security vulnerability. arXiv preprint arXiv:2602.03012 (2026), https://arxiv.org/abs/2602.03012

[22] Mell, P., Spring, J.: Likely exploited vulnerabilities. Tech. Rep. NIST.CSWP.41, National Institute of Standards and Technology (2025), https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.41.pdf

[23] Mitra, S., Bazarov, A., Duclos, M., Mittal, S., Piplai, A., Rahman, M.R., Zieglar, E., Rahimi, S.: FALCON: Autonomous cyber threat intelligence mining with LLMs for IDS rule generation. arXiv preprint arXiv:2508.18684 (2025), https://arxiv.org/abs/2508.18684

[24] Newsome, J., Song, D.: Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In: Proceedings of the 12th Network and Distributed System Security Symposium (NDSS) (2005), https://www.ndss-symposium.org/ndss2005/dynamic-taint-analysis-automatic-detection-analysis-and-signaturegeneration-ex...

[25] Parla, R.: Efficacy of EPSS in high severity CVEs found in KEV. arXiv preprint arXiv:2411.02618 (2024), https://arxiv.org/abs/2411.02618

[26] Sánchez-Matas, A., Escribano Ruiz, P., Díaz-López, D., Perales Gómez, A.L., Nespoli, P., Martínez Pérez, G.: Simulating cyberattacks through a breach attack simulation (BAS) platform empowered by security chaos engineering (SCE). arXiv preprint arXiv:2508.03882 (2025), https://arxiv.org/abs/2508.03882

[27] Shukla, A., Gandhi, P.A., Elovici, Y., Shabtai, A.: RuleGenie: SIEM detection rule set optimization. arXiv preprint arXiv:2505.06701 (2025), https://arxiv.org/abs/2505.06701

[28] Simsek, D., Eghbali, A., Pradel, M.: PoCGen: Generating proof-of-concept exploits for vulnerabilities in npm packages. arXiv preprint arXiv:2506.04962 (2025), https://arxiv.org/abs/2506.04962

[29] Smart, J., Jun, S.Y., et al.: Blueprint: Stakeholder-Specific Vulnerability Categorization guidance. Tech. rep., Sandia National Laboratories (2026), https://research-hub.nlr.gov/en/publications/blueprint-stakeholder-specific-vulnerability-categorization-guida/

[30] Tran, T.T.V., Le, T.B.T., Truong, T.H.H., Thai, H.V., Hien, D.H., Phan, T.D.: EvoSIEM: Detecting and generating SIEM rule evasion behaviors in network systems. In: IEEE RIVF International Conference on Computing and Communication Technologies. pp. 498–503 (2025). https://doi.org/10.1109/rivf68649.2025.11365129

[31] Uetz, R., Herzog, M., Hackländer, L., Schwarz, S., Henze, M.: You cannot escape me: Detecting evasions of SIEM rules in enterprise networks. In: Proceedings of the 33rd USENIX Security Symposium. pp. 5179–5196 (2024), https://www.usenix.org/conference/usenixsecurity24/presentation/uetz

[32] Ullah, S., Balasubramanian, P., Guo, W., Burnett, A., Pearce, H., Kruegel, C., Vigna, G., Stringhini, G.: From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. arXiv preprint arXiv:2509.01835 (2025), https://arxiv.org/abs/2509.01835

[33] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023), https://arxiv.org/abs/2210.03629

[34] Zhu, Y., Kellermann, A., Bowman, D., Li, P., Gupta, A., Danda, A., Fang, R., Jensen, C., Ihli, E., Benn, J., Geronimo, J., Dhir, A., Rao, S., Yu, K., Stone, T., Kang, D.: CVE-Bench: A benchmark for AI agents’ ability to exploit real-world web application vulnerabilities. In: Proceedings of the 42nd International Conference on Machine Learning (ICML) (2025), https://arxiv.org/abs/2503.17332

[35] Zhu, Y., Kellermann, A., Gupta, A., Li, P., Fang, R., Bindu, R., Kang, D.: Teams of LLM agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637 (2024), https://arxiv.org/abs/2406.01637