pith. machine review for the scientific record.

arxiv: 2604.05130 · v1 · submitted 2026-04-06 · 💻 cs.SE


A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection


Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated exploit generation · multi-agent framework · vulnerability verification · zero-day discovery · LLM-based security · static analysis refinement · runtime feedback · self-refinement loop

The pith

Vulnsage uses specialized agents and runtime feedback to turn static vulnerability reports into working exploits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Vulnsage as a multi-agent framework that automates exploit generation by mimicking the steps human security researchers follow when investigating code. A central supervisor coordinates a Code Analyzer Agent to spot potential issues via static analysis, a Code Generation Agent to produce candidate exploits with an LLM, a Validation Agent to run them and capture traces, and Reflection Agents to analyze errors and either refine the exploit or dismiss the alert as a false positive. This setup targets the core problems of path coverage in fuzzing, constraint solving in symbolic execution, and high false-positive rates in static tools. If the approach works as described, it would let teams confirm which reported vulnerabilities are actually exploitable without as much manual work.
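The division of labor described above can be sketched as a minimal orchestration loop. This is a hypothetical sketch, not the paper's implementation: the agent interfaces, the retry budget, and the toy validate/reflect rules are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """A static-analysis alert to be confirmed or dismissed (hypothetical schema)."""
    sink: str
    verdict: str = "unknown"   # "exploited" | "false_positive" | "unknown"
    attempts: list = field(default_factory=list)

def analyze(program):
    """Stand-in for the Code Analyzer Agent: one alert per suspected sink."""
    return [Alert(sink=s) for s in program["suspected_sinks"]]

def generate_exploit(alert, feedback):
    """Stand-in for the Code Generation Agent; a real agent would condition
    its LLM prompt on the reflection feedback, which this toy ignores."""
    return f"payload targeting {alert.sink} (round {len(alert.attempts) + 1})"

def validate(candidate):
    """Stand-in for the Validation Agent: toy rule succeeds on round 2,
    mimicking one refinement cycle. Returns (success, execution trace)."""
    return ("round 2" in candidate, f"trace for {candidate!r}")

def reflect(trace, attempts):
    """Stand-in for the Reflection Agents: refine within a budget, else
    dismiss the alert as a false positive."""
    return ("refine", trace) if len(attempts) < 3 else ("dismiss", trace)

def supervisor(program, max_rounds=3):
    """Central supervisor: analyze once, then generate/validate/reflect per alert."""
    alerts = analyze(program)
    for alert in alerts:
        feedback = None
        for _ in range(max_rounds):
            candidate = generate_exploit(alert, feedback)
            alert.attempts.append(candidate)
            ok, trace = validate(candidate)
            if ok:
                alert.verdict = "exploited"
                break
            action, feedback = reflect(trace, alert.attempts)
            if action == "dismiss":
                alert.verdict = "false_positive"
                break
        else:
            alert.verdict = "false_positive"
    return alerts
```

Running `supervisor({"suspected_sinks": ["eval"]})` confirms the single alert after two rounds, illustrating how the loop either converges on a working exploit or exhausts its budget and dismisses the alert.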

Core claim

Vulnsage decomposes automated exploit generation into an orchestrated workflow of specialized agents: the Code Analyzer Agent performs static analysis to identify vulnerabilities and gather context; the Code Generation Agent creates candidate exploits using an LLM; the Validation Agent executes candidates and collects traces; and Reflection Agents use runtime error analysis in iterative loops to improve the exploit or reason that the original alert is a false positive. Experimental results show this process produces 34.64 percent more successful exploits than prior tools such as ExplodeJS and enables discovery of 146 verified zero-day vulnerabilities in real-world code.

What carries the argument

The iterative self-refinement loop run by the Validation Agent and Reflection Agents, which feeds execution traces and runtime error details back to improve candidate exploits or classify alerts as false positives.

Load-bearing premise

The iterative feedback loop with execution traces and runtime error analysis reliably improves exploit success rates or correctly distinguishes true vulnerabilities from false positives without introducing systematic biases or missing edge cases.

What would settle it

Independent re-testing on the same programs used for comparison shows Vulnsage produces the same number or fewer working exploits than ExplodeJS, or independent verification fails to confirm the claimed zero-day vulnerabilities.

Figures

Figures reproduced from arXiv: 2604.05130 by Qi Li, Shijian Wu, Siyi Chen, Tianhan Luo, Wenyuan Xu, Xiangyu Liu, Yilin Zhou.

Figure 1. Source code snippet of CVE-2023-39017; a successful exploit generated by VulnSage is shown.

Figure 2. Successful exploit of CVE-2023-39017 by VulnSage: the input controls the exploit execution as the attacker intended, executing a malicious JNDI lookup from a URL in line 7.

Figure 4. Overview of VulnSage. The architecture is based on a multi-agent framework.

Figure 5. The prompt for the Supervisor Agent, inspired by AgentScope, an implementation of ReAct [17].

Figure 6. An alert information of …

Figure 7. Overview of the Code Generation Agent. Constraints extraction converts the taint-flow propagation of an alert into a set of constraints: if an input satisfies them, the input to the entry function can drive execution to the sink function, as the alert information describes. Unlike symbolic execution, the constraints are described in natural language.

Figure 8. The prompt for the Correction Insight Agent.

Figure 9. A 0-day vulnerability discovered by VulnSage.
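The constraints-extraction step summarized in Figure 7's caption, turning an alert's taint-flow into natural-language conditions an input must satisfy to reach the sink, can be illustrated with a toy sketch; the alert schema and the conditions below are invented for illustration, not taken from the paper.

```python
def extract_constraints(taint_flow):
    """Render each hop of a taint-flow trace as a natural-language condition.
    Hypothetical schema: each step names a function and the condition an
    input must satisfy there to keep flowing toward the sink."""
    return [f"input reaching {step['fn']} must {step['condition']}"
            for step in taint_flow]

# Invented alert: tainted input flows parse() -> route() -> exec().
flow = [
    {"fn": "parse", "condition": "be a string, not an object"},
    {"fn": "route", "condition": "start with '../' to escape the base directory"},
    {"fn": "exec",  "condition": "remain unsanitized when it reaches the shell call"},
]
for c in extract_constraints(flow):
    print(c)
```

Unlike symbolic-execution path constraints, such conditions are consumed as prompt material by the Code Generation Agent's LLM rather than handed to a solver.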
Original abstract

Open-source libraries are widely used in modern software development, introducing significant security vulnerabilities. While static analysis tools can identify potential vulnerabilities at scale, they often generate overwhelming reports with high false positive rates. Automated Exploit Generation (AEG) emerges as a promising solution to confirm vulnerability authenticity by generating an exploit. However, traditional AEG approaches based on fuzzing or symbolic execution face path coverage and constraint-solving problems. Although LLMs show great potential for AEG, how to effectively leverage them to comprehend vulnerabilities and generate corresponding exploits is still an open question. To address these challenges, we propose Vulnsage, a multi-agent framework for AEG. Vulnsage simulates human security researchers' workflows by decomposing the complex AEG process into multiple specialized sub-agents: Code Analyzer Agent, Code Generation Agent, Validation Agent, and a set of Reflection Agents, orchestrated by a central supervisor through iterative cycles. Given a target program, the Code Analyzer Agent performs static analysis to identify potential vulnerabilities and collects relevant information for each one. The Code Generation Agent then utilizes an LLM to generate candidate exploits. The Validation Agent and Reflection Agents form a feedback-driven self-refinement loop that uses execution traces and runtime error analysis to either improve the exploit iteratively or reason about the false positive alert. Experimental evaluation demonstrates that Vulnsage succeeds in generating 34.64% more exploits than state-of-the-art tools such as ExplodeJS. Furthermore, Vulnsage has successfully discovered and verified 146 zero-day vulnerabilities in real-world scenarios, demonstrating its practical effectiveness for assisting security assessment in software supply chains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Vulnsage, a multi-agent LLM framework for automated exploit generation (AEG) that decomposes the task into specialized agents (Code Analyzer, Code Generation, Validation, and Reflection Agents) orchestrated by a supervisor. The agents perform static analysis, generate candidate exploits, and use iterative feedback from execution traces and runtime errors to refine exploits or flag false positives. The central claims are a 34.64% improvement in successful exploit generation over state-of-the-art tools such as ExplodeJS and the discovery/verification of 146 zero-day vulnerabilities in real-world open-source libraries.

Significance. If the performance and zero-day claims are substantiated with rigorous methodology, the multi-agent reflection loop could meaningfully advance AEG by improving upon the path-coverage and constraint-solving limitations of fuzzing and symbolic execution. The approach has potential practical value for confirming vulnerabilities in software supply chains and reducing false positives from static analysis tools.

major comments (2)
  1. [Abstract and Experimental Evaluation] The headline claims of 34.64% more exploits than ExplodeJS and 146 verified zero-day vulnerabilities are presented without any description of the experimental methodology, target programs/datasets, baseline implementations, success criteria for exploit generation, statistical tests, or controls for LLM non-determinism. These omissions make the central empirical claims impossible to evaluate or reproduce.
  2. [Framework Description (Validation and Reflection Agents)] The feedback loop is said to use 'execution traces and runtime error analysis' to improve exploits or reason about false positives, but no explicit decision procedure is given for labeling success (e.g., whether any non-zero exit code, specific error message, or memory corruption indicator counts as confirmation). This leaves open the possibility of inflated success rates or self-confirmation bias.
minor comments (1)
  1. [Abstract] The tool name 'ExplodeJS' is referenced without citation or description; a reference or brief explanation should be added for readers unfamiliar with it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we intend to make.

Point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The headline claims of 34.64% more exploits than ExplodeJS and 146 verified zero-day vulnerabilities are presented without any description of the experimental methodology, target programs/datasets, baseline implementations, success criteria for exploit generation, statistical tests, or controls for LLM non-determinism. These omissions make the central empirical claims impossible to evaluate or reproduce.

    Authors: We acknowledge that the abstract is necessarily concise and omits full methodological details. The Experimental Evaluation section describes the use of real-world open-source libraries as targets, ExplodeJS as a baseline, and success as verified exploit generation. To improve reproducibility and address the referee's concern, we will revise the Experimental Evaluation section to explicitly detail: the full list of target programs and datasets, baseline implementations and configurations, precise success criteria for exploit generation, statistical tests supporting the 34.64% improvement, and controls for LLM non-determinism (e.g., repeated runs with varied seeds and temperature settings). We will also incorporate a concise methodology overview into the abstract where space allows. revision: yes
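The repeated-run control the authors promise can be as simple as rerunning the whole pipeline several times per target set and reporting mean and spread of the success rate; a sketch follows, where `success_rate` and `demo_pipeline` are hypothetical stand-ins, not anything from the paper.

```python
import statistics

def success_rate(run_pipeline, targets, trials=5):
    """Repeat the full pipeline `trials` times and report the mean and
    population standard deviation of the per-trial success rate, so a
    headline number like 34.64% can carry an uncertainty estimate."""
    rates = []
    for trial in range(trials):
        ok = sum(1 for t in targets if run_pipeline(t, seed=trial))
        rates.append(ok / len(targets))
    return statistics.mean(rates), statistics.pstdev(rates)

# Deterministic toy stand-in that "fails" on a seed-dependent subset of
# targets, mimicking LLM non-determinism across runs.
def demo_pipeline(target, seed):
    return (target + seed) % 4 != 0

mean, spread = success_rate(demo_pipeline, targets=list(range(7)))
```

Reporting `mean ± spread` across seeded trials is a minimal control; stronger ones would also vary temperature and model snapshot.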

  2. Referee: [Framework Description (Validation and Reflection Agents)] The feedback loop is said to use 'execution traces and runtime error analysis' to improve exploits or reason about false positives, but no explicit decision procedure is given for labeling success (e.g., whether any non-zero exit code, specific error message, or memory corruption indicator counts as confirmation). This leaves open the possibility of inflated success rates or self-confirmation bias.

    Authors: We agree that an explicit decision procedure strengthens the description. The current framework relies on the LLM-powered Reflection Agents to interpret execution traces and runtime errors for iterative refinement or false-positive identification. In the revision, we will add a dedicated subsection under the Validation and Reflection Agents that formalizes the success-labeling criteria, including concrete indicators such as memory corruption signals, specific error patterns associated with exploitation, and combinations of non-zero exit codes with other runtime evidence. We will also describe safeguards against self-confirmation bias, such as requiring corroboration from multiple execution environments or external validation tools where feasible. revision: yes
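One shape the promised decision procedure could take is sketched below; the indicator names are illustrative assumptions, not criteria from the paper. The key property is that a bare non-zero exit code never counts as success on its own, which addresses the self-confirmation concern directly.

```python
def label_run(evidence: dict) -> str:
    """Classify one validation run from runtime evidence (hypothetical schema).
    A run is confirmed only on a positive exploitation indicator; error
    signatures alone trigger refinement, and a clean run that never reaches
    the tainted sink is merely a false-positive candidate."""
    if evidence.get("memory_corruption") or evidence.get("attacker_payload_executed"):
        return "confirmed"
    if evidence.get("exit_code", 0) != 0 and evidence.get("exploit_error_pattern"):
        return "refine"            # feed the trace back to the Reflection Agents
    if evidence.get("exit_code", 0) == 0 and not evidence.get("tainted_sink_reached"):
        return "false_positive_candidate"
    return "inconclusive"
```

A "false_positive_candidate" verdict leaves room for the cross-environment corroboration the rebuttal mentions before the alert is finally dismissed.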

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks

full rationale

The paper describes a multi-agent AEG framework (Code Analyzer, Code Generation, Validation, Reflection Agents) and reports empirical results: 34.64% more exploits than ExplodeJS plus 146 zero-day discoveries. No equations, parameters, derivations, or self-citations appear in the provided text. Success metrics rely on execution traces and runtime errors rather than any self-referential definition or fitted input renamed as prediction. The central claims are falsifiable via external reproduction and do not reduce to the framework's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The description introduces a new framework but contains no mathematical model, fitted parameters, or unstated background axioms; the contribution is architectural and empirical.

pith-pipeline@v0.9.0 · 5605 in / 1217 out tokens · 43331 ms · 2026-05-10T18:58:24.754833+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors

  1. [1]

    QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration. CoRR abs/2506.23644 (2025). arXiv:2506.23644 doi: 10.48550/ARXIV.2506.23644. Withdrawn.

  2. [2]

    Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley

  3. [3]

    AEG: Automatic Exploit Generation. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA, 6th February - 9th February 2011. The Internet Society. https://www.ndss-symposium.org/ndss2011/aeg-automatic-exploit-generation

  4. [4]

    Roberto Baldoni, Emilio Coppa, Daniele Cono D’Elia, Camil Demetrescu, and Irene Finocchi. 2018. A Survey of Symbolic Execution Techniques. ACM Comput. Surv. 51, 3 (2018), 50:1–50:39. doi: 10.1145/3182657

  5. [5]

    Masudul Hasan Masud Bhuiyan, Adithya Srinivas Parthasarathy, Nikos Vasilakis, Michael Pradel, and Cristian-Alexandru Staicu. 2023. SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1059–1070. doi: 10.110...

  6. [6]

    Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, and Ji-Rong Wen. 2024. Reflective Multi-Agent Collaboration based on Large Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Gl...

  7. [7]

    Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed Greybox Fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, Bhavani Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu (Eds.). ACM, 2329–2344. doi: 10.1145/313...

  8. [8]

    Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2019. Coverage-Based Greybox Fuzzing as Markov Chain. IEEE Trans. Software Eng. 45, 5 (2019), 489–506. doi: 10.1109/TSE.2017.2785841

  9. [9]

    Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, Richard Draves and Robbert van Renesse (Eds.). USENIX Assoc...

  10. [10]

    Darion Cassel, Nuno Sabino, Min-Chien Hsu, Ruben Martins, and Limin Jia. 2025. NodeMedic-FINE: Automatic Detection and Exploit Synthesis for Node.js Vulnerabilities. In 32nd Annual Network and Distributed System Security Symposium, NDSS 2025, San Diego, California, USA, February 24-28, 2025. The Internet Society. https://www.ndss-symposium.org/ndss-pap...

  11. [11]

    Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing Mayhem on Binary Code. In IEEE Symposium on Security and Privacy, SP 2012, 21-23 May 2012, San Francisco, California, USA. IEEE Computer Society, 380–394. doi: 10.1109/SP.2012.31

  12. [12]

    Ricardo Corin and Felipe Andrés Manzano. 2012. Taint Analysis of Security Code in the KLEE Symbolic Execution Engine. In Information and Communications Security - 14th International Conference, ICICS 2012, Hong Kong, China, October 29-31, 2012. Proceedings (Lecture Notes in Computer Science, Vol. 7618), Tat Wing Chim and Tsz Hon Yuen (Eds.). Springer, 264...

  13. [13]

    Patrick Cousot and Radhia Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977, Robert M. Graham, Michael A. Harrison, and Ravi Sethi (Ed...

  14. [14]

    Dorothy E. Denning. 1976. A Lattice Model of Secure Information Flow. Commun. ACM 19, 5 (1976), 236–243. doi: 10.1145/360051.360056

  15. [15]

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. 2024. LLM Agents can Autonomously Exploit One-day Vulnerabilities. CoRR abs/2404.08144 (2024). arXiv:2404.08144 doi: 10.48550/ARXIV.2404.08144

  16. [16]

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. 2024. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. CoRR abs/2406.01637 (2024). arXiv:2406.01637 doi: 10.48550/ARXIV.2406.01637

  17. [17]

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. 2025. Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. CoRR abs/2503.01307 (2025). arXiv:2503.01307 doi: 10.48550/ARXIV.2503.01307

  18. [18]

    Dawei Gao, Zitao Li, Weirui Kuang, Xuchen Pan, Daoyuan Chen, Zhijian Ma, Bingchen Qian, Liuyi Yao, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. 2024. AgentScope: A Flexible yet Robust Multi-Agent Platform. CoRR abs/2402.14034 (2024). arXiv:2402.14034 doi: 10.48550/ARXIV.2402.14034

  19. [19]

    GitHub Security Lab. 2021. CodeQL. GitHub. https://codeql.github.com/docs/

  20. [20]

    Katerina Goseva-Popstojanova and Andrei Perhinschi. 2015. On the capability of static code analysis to detect security vulnerabilities. Inf. Softw. Technol. 68 (2015), 18–33. doi: 10.1016/J.INFSOF.2015.08.002

  21. [21]

    Junqing He, Kunhao Pan, Xiaoqun Dong, Zhuoyang Song, LiuYiBo LiuYiBo, Qian- guosun Qianguosun, Yuxin Liang, Hao Wang, Enming Zhang, and Jiaxing Zhang

  22. [22]

    Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Comput...

  23. [23]

    Peyman Hosseini, Ignacio Castro, Iacopo Ghinassi, and Matthew Purver. 2025. Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025, Owen Rambow, Leo ...

  24. [24]

    David Jin, Qian Fu, and Yuekang Li. 2025. Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation. In 2025 IEEE Security and Privacy, SP 2025 - Workshops, San Francisco, CA, USA, May 15, 2025, Marina Blanton, William Enck, and Cristina Nita-Rotaru (Eds.). IEEE, 278–282. doi: 10.1109/SPW67851.2025.00039

  25. [25]

    Mingqing Kang, Yichao Xu, Song Li, Rigel Gjomemo, Jianwei Hou, V. N. Venkatakrishnan, and Yinzhi Cao. 2023. Scaling JavaScript Abstract Interpretation to Detect and Exploit Node.js Taint-style Vulnerability. In 44th IEEE Symposium on Security and Privacy, SP 2023, San Francisco, CA, USA, May 21-25, 2023. IEEE, 1059–1076. doi: 10.1109/SP46215.2023.10179352

  26. [26]

    Rody Kersten, Kasper Søe Luckow, and Corina S. Pasareanu. 2017. POSTER: AFL-based Fuzzing for Java with Kelinci. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, Bhavani Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu (Eds.). ACM, 2511–2513. doi: 10.1...

  27. [27]

    James C. King. 1976. Symbolic Execution and Program Testing. Commun. ACM 19, 7 (1976), 385–394. doi: 10.1145/360248.360252

  28. [28]

    Maxwell Koo. 2024. Uncovering Vulnerabilities In Open Source Libraries: A Technical Case Study. https://www.mayhem.security/blog/uncovering-vulnerabilities-in-open-source-libraries

  29. [29]

    Yihe Li, Ruijie Meng, and Gregory J. Duck. 2025. Large Language Model powered Symbolic Execution. CoRR abs/2505.13452 (2025). arXiv:2505.13452 doi: 10.48550/ARXIV.2505.13452

  30. [30]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. doi: 10.1162/TACL_A_00638

  31. [31]

    Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu

  32. [32]

    Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration. CoRR abs/2505.21471 (2025). arXiv:2505.21471 doi: 10.48550/ARXIV.2505.21471

  33. [33]

    Valentin J. M. Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J. Schwartz, and Maverick Woo. 2021. The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Trans. Software Eng. 47, 11 (2021), 2312–2331. doi: 10.1109/TSE.2019.2946563

  35. [35]

    Filipe Marques, Mafalda Ferreira, André Nascimento, Miguel E. Coimbra, Nuno Santos, Limin Jia, and José Fragoso Santos. 2025. Automated Exploit Generation for Node.js Packages. Proc. ACM Program. Lang. 9, PLDI (2025), 1341–1366. doi: 10.1145/3729304

  36. [36]

    Antonio Germán Márquez, Ángel Jesús Varela-Vaca, María Teresa Gómez-López, José A. Galindo, and David Benavides. 2024. Vulnerability impact analysis in software project dependencies based on Satisfiability Modulo Theories (SMT). Comput. Secur. 139 (2024), 103669. doi: 10.1016/J.COSE.2023.103669

  37. [37]

    Barton P. Miller, Lars Fredriksen, and Bryan So. 1990. An Empirical Study of the Reliability of UNIX Utilities. Commun. ACM 33, 12 (1990), 32–44. doi: 10.1145/96267.96279

  38. [38]

    James Newsome and Dawn Xiaodong Song. 2005. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2005, San Diego, California, USA. The Internet Society. https://www.ndss-symposium.org/ndss2005/dynamic-taint-analy...

  39. [39]

    Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. 2025. FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents. CoRR abs/2507.15241 (2025). arXiv:2507.15241 doi: 10.48550/ARXIV.2507.15241

  40. [40]

    Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, and Peyman Najafirad. 2024. AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing. CoRR abs/2409.10737 (2024). arXiv:2409.10737 doi: 10.48550/ARXIV.2409.10737

  41. [41]

    Wanzong Peng, Lin Ye, Xuetao Du, Hongli Zhang, Dongyang Zhan, Yunting Zhang, Yicheng Guo, and Chen Zhang. 2025. PwnGPT: Automatic Exploit Generation Based on Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxia...

  42. [42]

    Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL...

  43. [43]

    Francisco Ribeiro. 2023. Large Language Models for Automated Program Repair. In Companion Proceedings of the 2023 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH 2023, Cascais, Portugal, October 22-27, 2023, Vasco Thudichum Vasconcelos (Ed.). ACM, 7–9. doi: 10.1145/3618305.3623587

  44. [44]

    Qingkai Shi, Xiao Xiao, Rongxin Wu, Jinguo Zhou, Gang Fan, and Charles Zhang

  45. [45]

    Pinpoint: fast and precise sparse value flow analysis for million lines of code. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). Association for Computing Machinery, New York, NY, USA, 693–706. doi: 10.1145/3192366.3192418

  46. [46]

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Krügel, and Giovanni Vigna. 2016. SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016. IEEE Comput...

  47. [47]

    Deniz Simsek, Aryaz Eghbali, and Michael Pradel. 2025. PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages. CoRR abs/2506.04962 (2025). arXiv:2506.04962 doi: 10.48550/ARXIV.2506.04962

  48. [48]

    Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2016. Driller: Augmenting Fuzzing Through Selective Symbolic Execution. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016. T...

  49. [49]

    Ziliang Wang, Ge Li, Jia Li, Hao Zhu, and Zhi Jin. 2025. VulAgent: A Hypothesis Validation-Based Multi-Agent System for Software Vulnerability Detection. arXiv:2509.11523 [cs.SE] https://arxiv.org/abs/2509.11523

  50. [50]

    Ziyue Wang and Liyi Zhou. 2025. Agentic Discovery and Validation of Android App Vulnerabilities. CoRR abs/2508.21579 (2025). arXiv:2508.21579 doi: 10.48550/ARXIV.2508.21579

  51. [51]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, Novem...

  52. [52]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/forum?id=WE_vluYUL-X

  53. [53]

    Michał Zalewski. 2014. American fuzzy lop. http://lcamtuf.coredump.cx/afl/

  54. [54]

    Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, and Lingpeng Kong. 2023. CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara E...

  55. [55]

    Yuntong Zhang, Jiawei Wang, Dominic Berzin, Martin Mirchev, Dongge Liu, Abhishek Arya, Oliver Chang, and Abhik Roychoudhury. 2024. Fixing Security Vulnerabilities with AI in OSS-Fuzz. CoRR abs/2411.03346 (2024). arXiv:2411.03346 doi: 10.48550/ARXIV.2411.03346

  56. [56]

    Zexin Zhong, Jiangchao Liu, Diyu Wu, Peng Di, Yulei Sui, Alex X. Liu, and John C. S. Lui. 2023. Scalable Compositional Static Taint Analysis for Sensitive Data Tracing on Industrial Micro-Services. In 45th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ...

  57. [57]

    Zhuotong Zhou, Yongzhuo Yang, Susheng Wu, Yiheng Huang, Bihuan Chen, and Xin Peng. 2024. Magneto: A Step-Wise Approach to Exploit Vulnerabilities in Dependent Libraries via LLM-Empowered Directed Fuzzing. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2...