pith. machine review for the scientific record.

arxiv: 2604.05130 · v1 · submitted 2026-04-06 · 💻 cs.SE


A Multi-Agent Framework for Automated Exploit Generation with Constraint-Guided Comprehension and Reflection


Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated exploit generation · multi-agent framework · vulnerability verification · zero-day discovery · LLM-based security · static analysis refinement · runtime feedback · self-refinement loop

The pith

Vulnsage uses specialized agents and runtime feedback to turn static vulnerability reports into working exploits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Vulnsage as a multi-agent framework that automates exploit generation by mimicking the steps human security researchers follow when investigating code. A central supervisor coordinates a Code Analyzer Agent to spot potential issues via static analysis, a Code Generation Agent to produce candidate exploits with an LLM, a Validation Agent to run them and capture traces, and Reflection Agents to analyze errors and either refine the exploit or dismiss the alert as a false positive. This setup targets the core problems of path coverage in fuzzing, constraint solving in symbolic execution, and high false-positive rates in static tools. If the approach works as described, it would let teams confirm which reported vulnerabilities are actually exploitable without as much manual work.
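The division of labor described above can be sketched as a minimal orchestration loop. This is a hypothetical sketch, not the paper's implementation: the agent interfaces, the retry budget, and the toy validate/reflect rules are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """A static-analysis alert to be confirmed or dismissed (hypothetical schema)."""
    sink: str
    verdict: str = "unknown"   # "exploited" | "false_positive" | "unknown"
    attempts: list = field(default_factory=list)

def analyze(program):
    """Stand-in for the Code Analyzer Agent: one alert per suspected sink."""
    return [Alert(sink=s) for s in program["suspected_sinks"]]

def generate_exploit(alert, feedback):
    """Stand-in for the Code Generation Agent; a real agent would condition
    its LLM prompt on the reflection feedback, which this toy ignores."""
    return f"payload targeting {alert.sink} (round {len(alert.attempts) + 1})"

def validate(candidate):
    """Stand-in for the Validation Agent: toy rule succeeds on round 2,
    mimicking one refinement cycle. Returns (success, execution trace)."""
    return ("round 2" in candidate, f"trace for {candidate!r}")

def reflect(trace, attempts):
    """Stand-in for the Reflection Agents: refine within a budget, else
    dismiss the alert as a false positive."""
    return ("refine", trace) if len(attempts) < 3 else ("dismiss", trace)

def supervisor(program, max_rounds=3):
    """Central supervisor: analyze once, then generate/validate/reflect per alert."""
    alerts = analyze(program)
    for alert in alerts:
        feedback = None
        for _ in range(max_rounds):
            candidate = generate_exploit(alert, feedback)
            alert.attempts.append(candidate)
            ok, trace = validate(candidate)
            if ok:
                alert.verdict = "exploited"
                break
            action, feedback = reflect(trace, alert.attempts)
            if action == "dismiss":
                alert.verdict = "false_positive"
                break
        else:
            alert.verdict = "false_positive"
    return alerts
```

Running `supervisor({"suspected_sinks": ["eval"]})` confirms the single alert after two rounds, illustrating how the loop either converges on a working exploit or exhausts its budget and dismisses the alert.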

Core claim

Vulnsage decomposes automated exploit generation into an orchestrated workflow of specialized agents: the Code Analyzer Agent performs static analysis to identify vulnerabilities and gather context; the Code Generation Agent creates candidate exploits using an LLM; the Validation Agent executes candidates and collects traces; and Reflection Agents use runtime error analysis in iterative loops to improve the exploit or reason that the original alert is a false positive. Experimental results show this process produces 34.64 percent more successful exploits than prior tools such as ExplodeJS and enables discovery of 146 verified zero-day vulnerabilities in real-world code.

What carries the argument

The iterative self-refinement loop run by the Validation Agent and Reflection Agents, which feeds execution traces and runtime error details back to improve candidate exploits or classify alerts as false positives.

Load-bearing premise

The iterative feedback loop with execution traces and runtime error analysis reliably improves exploit success rates or correctly distinguishes true vulnerabilities from false positives without introducing systematic biases or missing edge cases.

What would settle it

Independent re-testing on the same programs used for comparison shows Vulnsage produces the same number or fewer working exploits than ExplodeJS, or independent verification fails to confirm the claimed zero-day vulnerabilities.

Figures

Figures reproduced from arXiv: 2604.05130 by Qi Li, Shijian Wu, Siyi Chen, Tianhan Luo, Wenyuan Xu, Xiangyu Liu, Yilin Zhou.

Figure 1. Source code snippet of CVE-2023-39017; a successful exploit generated by VulnSage is shown.

Figure 2. Successful exploit of CVE-2023-39017 by VulnSage: the input controls the exploit execution as the attacker intended, executing a malicious JNDI lookup from a URL in line 7.

Figure 4. Overview of VulnSage. The architecture is based on a multi-agent framework.

Figure 5. The prompt for the Supervisor Agent, inspired by AgentScope, an implementation of ReAct [17].

Figure 6. An alert information of …

Figure 7. Overview of the Code Generation Agent. Constraints extraction converts the taint-flow propagation of an alert into a set of constraints: if an input satisfies them, the input to the entry function can drive execution to the sink function, as the alert information describes. Unlike symbolic execution, the constraints are described in natural language.

Figure 8. The prompt for the Correction Insight Agent.

Figure 9. A 0-day vulnerability discovered by VulnSage.
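The constraints-extraction step summarized in Figure 7's caption, turning an alert's taint-flow into natural-language conditions an input must satisfy to reach the sink, can be illustrated with a toy sketch; the alert schema and the conditions below are invented for illustration, not taken from the paper.

```python
def extract_constraints(taint_flow):
    """Render each hop of a taint-flow trace as a natural-language condition.
    Hypothetical schema: each step names a function and the condition an
    input must satisfy there to keep flowing toward the sink."""
    return [f"input reaching {step['fn']} must {step['condition']}"
            for step in taint_flow]

# Invented alert: tainted input flows parse() -> route() -> exec().
flow = [
    {"fn": "parse", "condition": "be a string, not an object"},
    {"fn": "route", "condition": "start with '../' to escape the base directory"},
    {"fn": "exec",  "condition": "remain unsanitized when it reaches the shell call"},
]
for c in extract_constraints(flow):
    print(c)
```

Unlike symbolic-execution path constraints, such conditions are consumed as prompt material by the Code Generation Agent's LLM rather than handed to a solver.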
Original abstract

Open-source libraries are widely used in modern software development, introducing significant security vulnerabilities. While static analysis tools can identify potential vulnerabilities at scale, they often generate overwhelming reports with high false positive rates. Automated Exploit Generation (AEG) emerges as a promising solution to confirm vulnerability authenticity by generating an exploit. However, traditional AEG approaches based on fuzzing or symbolic execution face path coverage and constraint-solving problems. Although LLMs show great potential for AEG, how to effectively leverage them to comprehend vulnerabilities and generate corresponding exploits is still an open question. To address these challenges, we propose Vulnsage, a multi-agent framework for AEG. Vulnsage simulates human security researchers' workflows by decomposing the complex AEG process into multiple specialized sub-agents: Code Analyzer Agent, Code Generation Agent, Validation Agent, and a set of Reflection Agents, orchestrated by a central supervisor through iterative cycles. Given a target program, the Code Analyzer Agent performs static analysis to identify potential vulnerabilities and collects relevant information for each one. The Code Generation Agent then utilizes an LLM to generate candidate exploits. The Validation Agent and Reflection Agents form a feedback-driven self-refinement loop that uses execution traces and runtime error analysis to either improve the exploit iteratively or reason about the false positive alert. Experimental evaluation demonstrates that Vulnsage succeeds in generating 34.64% more exploits than state-of-the-art tools such as ExplodeJS. Furthermore, Vulnsage has successfully discovered and verified 146 zero-day vulnerabilities in real-world scenarios, demonstrating its practical effectiveness for assisting security assessment in software supply chains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Vulnsage, a multi-agent LLM framework for automated exploit generation (AEG) that decomposes the task into specialized agents (Code Analyzer, Code Generation, Validation, and Reflection Agents) orchestrated by a supervisor. The agents perform static analysis, generate candidate exploits, and use iterative feedback from execution traces and runtime errors to refine exploits or flag false positives. The central claims are a 34.64% improvement in successful exploit generation over state-of-the-art tools such as ExplodeJS and the discovery/verification of 146 zero-day vulnerabilities in real-world open-source libraries.

Significance. If the performance and zero-day claims are substantiated with rigorous methodology, the multi-agent reflection loop could meaningfully advance AEG by improving upon the path-coverage and constraint-solving limitations of fuzzing and symbolic execution. The approach has potential practical value for confirming vulnerabilities in software supply chains and reducing false positives from static analysis tools.

major comments (2)
  1. [Abstract and Experimental Evaluation] The headline claims of 34.64% more exploits than ExplodeJS and 146 verified zero-day vulnerabilities are presented without any description of the experimental methodology, target programs/datasets, baseline implementations, success criteria for exploit generation, statistical tests, or controls for LLM non-determinism. These omissions make the central empirical claims impossible to evaluate or reproduce.
  2. [Framework Description (Validation and Reflection Agents)] The feedback loop is said to use 'execution traces and runtime error analysis' to improve exploits or reason about false positives, but no explicit decision procedure is given for labeling success (e.g., whether any non-zero exit code, specific error message, or memory corruption indicator counts as confirmation). This leaves open the possibility of inflated success rates or self-confirmation bias.
minor comments (1)
  1. [Abstract] The tool name 'ExplodeJS' is referenced without citation or description; a reference or brief explanation should be added for readers unfamiliar with it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we intend to make.

Point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The headline claims of 34.64% more exploits than ExplodeJS and 146 verified zero-day vulnerabilities are presented without any description of the experimental methodology, target programs/datasets, baseline implementations, success criteria for exploit generation, statistical tests, or controls for LLM non-determinism. These omissions make the central empirical claims impossible to evaluate or reproduce.

    Authors: We acknowledge that the abstract is necessarily concise and omits full methodological details. The Experimental Evaluation section describes the use of real-world open-source libraries as targets, ExplodeJS as a baseline, and success as verified exploit generation. To improve reproducibility and address the referee's concern, we will revise the Experimental Evaluation section to explicitly detail: the full list of target programs and datasets, baseline implementations and configurations, precise success criteria for exploit generation, statistical tests supporting the 34.64% improvement, and controls for LLM non-determinism (e.g., repeated runs with varied seeds and temperature settings). We will also incorporate a concise methodology overview into the abstract where space allows. revision: yes
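The repeated-run control the authors promise can be as simple as rerunning the whole pipeline several times per target set and reporting mean and spread of the success rate; a sketch follows, where `success_rate` and `demo_pipeline` are hypothetical stand-ins, not anything from the paper.

```python
import statistics

def success_rate(run_pipeline, targets, trials=5):
    """Repeat the full pipeline `trials` times and report the mean and
    population standard deviation of the per-trial success rate, so a
    headline number like 34.64% can carry an uncertainty estimate."""
    rates = []
    for trial in range(trials):
        ok = sum(1 for t in targets if run_pipeline(t, seed=trial))
        rates.append(ok / len(targets))
    return statistics.mean(rates), statistics.pstdev(rates)

# Deterministic toy stand-in that "fails" on a seed-dependent subset of
# targets, mimicking LLM non-determinism across runs.
def demo_pipeline(target, seed):
    return (target + seed) % 4 != 0

mean, spread = success_rate(demo_pipeline, targets=list(range(7)))
```

Reporting `mean ± spread` across seeded trials is a minimal control; stronger ones would also vary temperature and model snapshot.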

  2. Referee: [Framework Description (Validation and Reflection Agents)] The feedback loop is said to use 'execution traces and runtime error analysis' to improve exploits or reason about false positives, but no explicit decision procedure is given for labeling success (e.g., whether any non-zero exit code, specific error message, or memory corruption indicator counts as confirmation). This leaves open the possibility of inflated success rates or self-confirmation bias.

    Authors: We agree that an explicit decision procedure strengthens the description. The current framework relies on the LLM-powered Reflection Agents to interpret execution traces and runtime errors for iterative refinement or false-positive identification. In the revision, we will add a dedicated subsection under the Validation and Reflection Agents that formalizes the success-labeling criteria, including concrete indicators such as memory corruption signals, specific error patterns associated with exploitation, and combinations of non-zero exit codes with other runtime evidence. We will also describe safeguards against self-confirmation bias, such as requiring corroboration from multiple execution environments or external validation tools where feasible. revision: yes
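One shape the promised decision procedure could take is sketched below; the indicator names are illustrative assumptions, not criteria from the paper. The key property is that a bare non-zero exit code never counts as success on its own, which addresses the self-confirmation concern directly.

```python
def label_run(evidence: dict) -> str:
    """Classify one validation run from runtime evidence (hypothetical schema).
    A run is confirmed only on a positive exploitation indicator; error
    signatures alone trigger refinement, and a clean run that never reaches
    the tainted sink is merely a false-positive candidate."""
    if evidence.get("memory_corruption") or evidence.get("attacker_payload_executed"):
        return "confirmed"
    if evidence.get("exit_code", 0) != 0 and evidence.get("exploit_error_pattern"):
        return "refine"            # feed the trace back to the Reflection Agents
    if evidence.get("exit_code", 0) == 0 and not evidence.get("tainted_sink_reached"):
        return "false_positive_candidate"
    return "inconclusive"
```

A "false_positive_candidate" verdict leaves room for the cross-environment corroboration the rebuttal mentions before the alert is finally dismissed.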

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks

full rationale

The paper describes a multi-agent AEG framework (Code Analyzer, Code Generation, Validation, Reflection Agents) and reports empirical results: 34.64% more exploits than ExplodeJS plus 146 zero-day discoveries. No equations, parameters, derivations, or self-citations appear in the provided text. Success metrics rely on execution traces and runtime errors rather than any self-referential definition or fitted input renamed as prediction. The central claims are falsifiable via external reproduction and do not reduce to the framework's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The description introduces a new framework but contains no mathematical model, fitted parameters, or unstated background axioms; the contribution is architectural and empirical.

pith-pipeline@v0.9.0 · 5605 in / 1217 out tokens · 43331 ms · 2026-05-10T18:58:24.754833+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors

  1. [1]

    QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration. CoRR abs/2506.23644 (2025). arXiv:2506.23644 doi: 10.48550/ARXIV.2506.23644. Withdrawn.

  2. [2]

    Thanassis Avgerinos, Sang Kil Cha, Brent Lim Tze Hao, and David Brumley

  3. [3]

    AEG: Automatic Exploit Generation. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2011, San Diego, California, USA, 6th February - 9th February 2011. The Internet Society. https://www.ndss-symposium.org/ndss2011/aeg-automatic-exploit-generation

  4. [4]

    Roberto Baldoni, Emilio Coppa, Daniele Cono D’Elia, Camil Demetrescu, and Irene Finocchi. 2018. A Survey of Symbolic Execution Techniques. ACM Comput. Surv. 51, 3 (2018), 50:1–50:39. doi: 10.1145/3182657

  5. [5]

    Masudul Hasan Masud Bhuiyan, Adithya Srinivas Parthasarathy, Nikos Vasilakis, Michael Pradel, and Cristian-Alexandru Staicu. 2023. SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1059–1070. doi: 10.110...

  6. [6]

    Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, and Ji-Rong Wen. 2024. Reflective Multi-Agent Collaboration based on Large Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Gl...

  7. [7]

    Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed Greybox Fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, Bhavani Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu (Eds.). ACM, 2329–2344. doi: 10.1145/313...

  8. [8]

    Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2019. Coverage-Based Greybox Fuzzing as Markov Chain. IEEE Trans. Software Eng. 45, 5 (2019), 489–506. doi: 10.1109/TSE.2017.2785841

  9. [9]

    Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, Richard Draves and Robbert van Renesse (Eds.). USENIX Assoc...

  10. [10]

    Darion Cassel, Nuno Sabino, Min-Chien Hsu, Ruben Martins, and Limin Jia. 2025. NodeMedic-FINE: Automatic Detection and Exploit Synthesis for Node.js Vulnerabilities. In 32nd Annual Network and Distributed System Security Symposium, NDSS 2025, San Diego, California, USA, February 24-28, 2025. The Internet Society. https://www.ndss-symposium.org/ndss-pap...

  11. [11]

    Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing Mayhem on Binary Code. In IEEE Symposium on Security and Privacy, SP 2012, 21-23 May 2012, San Francisco, California, USA. IEEE Computer Society, 380–394. doi: 10.1109/SP.2012.31

  12. [12]

    Ricardo Corin and Felipe Andrés Manzano. 2012. Taint Analysis of Security Code in the KLEE Symbolic Execution Engine. In Information and Communications Security - 14th International Conference, ICICS 2012, Hong Kong, China, October 29-31, 2012. Proceedings (Lecture Notes in Computer Science, Vol. 7618), Tat Wing Chim and Tsz Hon Yuen (Eds.). Springer, 264...

  13. [13]

    Patrick Cousot and Radhia Cousot. 1977. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977, Robert M. Graham, Michael A. Harrison, and Ravi Sethi (Ed...

  14. [14]

    Dorothy E. Denning. 1976. A Lattice Model of Secure Information Flow. Commun. ACM 19, 5 (1976), 236–243. doi: 10.1145/360051.360056

  15. [15]

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. 2024. LLM Agents can Autonomously Exploit One-day Vulnerabilities. CoRR abs/2404.08144 (2024). arXiv:2404.08144 doi: 10.48550/ARXIV.2404.08144

  16. [16]

    Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, and Daniel Kang. 2024. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. CoRR abs/2406.01637 (2024). arXiv:2406.01637 doi: 10.48550/ARXIV.2406.01637

  17. [17]

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. 2025. Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs. CoRR abs/2503.01307 (2025). arXiv:2503.01307 doi: 10.48550/ARXIV.2503.01307

  18. [18]

    Dawei Gao, Zitao Li, Weirui Kuang, Xuchen Pan, Daoyuan Chen, Zhijian Ma, Bingchen Qian, Liuyi Yao, Lin Zhu, Chen Cheng, Hongzhu Shi, Yaliang Li, Bolin Ding, and Jingren Zhou. 2024. AgentScope: A Flexible yet Robust Multi-Agent Platform. CoRR abs/2402.14034 (2024). arXiv:2402.14034 doi: 10.48550/ARXIV.2402.14034

  19. [19]

    GitHub Security Lab. 2021. CodeQL. GitHub. https://codeql.github.com/docs/

  20. [20]

    Katerina Goseva-Popstojanova and Andrei Perhinschi. 2015. On the capability of static code analysis to detect security vulnerabilities. Inf. Softw. Technol. 68 (2015), 18–33. doi: 10.1016/J.INFSOF.2015.08.002

  21. [21]

    Junqing He, Kunhao Pan, Xiaoqun Dong, Zhuoyang Song, LiuYiBo LiuYiBo, Qian- guosun Qianguosun, Yuxin Liang, Hao Wang, Enming Zhang, and Jiaxing Zhang

  22. [22]

    Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Comput...

  23. [23]

    Peyman Hosseini, Ignacio Castro, Iacopo Ghinassi, and Matthew Purver. 2025. Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025, Owen Rambow, Leo ...

  24. [24]

    David Jin, Qian Fu, and Yuekang Li. 2025. Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation. In 2025 IEEE Security and Privacy, SP 2025 - Workshops, San Francisco, CA, USA, May 15, 2025, Marina Blanton, William Enck, and Cristina Nita-Rotaru (Eds.). IEEE, 278–282. doi: 10.1109/SPW67851.2025.00039

  25. [25]

    Mingqing Kang, Yichao Xu, Song Li, Rigel Gjomemo, Jianwei Hou, V. N. Venkatakrishnan, and Yinzhi Cao. 2023. Scaling JavaScript Abstract Interpretation to Detect and Exploit Node.js Taint-style Vulnerability. In 44th IEEE Symposium on Security and Privacy, SP 2023, San Francisco, CA, USA, May 21-25, 2023. IEEE, 1059–1076. doi: 10.1109/SP46215.2023.10179352

  26. [26]

    Rody Kersten, Kasper Søe Luckow, and Corina S. Pasareanu. 2017. POSTER: AFL-based Fuzzing for Java with Kelinci. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, Bhavani Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu (Eds.). ACM, 2511–2513. doi: 10.1...

  27. [27]

    James C. King. 1976. Symbolic Execution and Program Testing. Commun. ACM 19, 7 (1976), 385–394. doi: 10.1145/360248.360252

  28. [28]

    Maxwell Koo. 2024. Uncovering Vulnerabilities In Open Source Libraries: A Technical Case Study. https://www.mayhem.security/blog/uncovering-vulnerabilities-in-open-source-libraries

  29. [29]

    Yihe Li, Ruijie Meng, and Gregory J. Duck. 2025. Large Language Model powered Symbolic Execution. CoRR abs/2505.13452 (2025). arXiv:2505.13452 doi: 10.48550/ARXIV.2505.13452

  30. [30]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. doi: 10.1162/TACL_A_00638

  31. [31]

    Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu

  32. [32]

    Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration. CoRR abs/2505.21471 (2025). arXiv:2505.21471 doi: 10.48550/ARXIV.2505.21471

  33. [33]

    Valentin J. M. Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J. Schwartz, and Maverick Woo. 2021. The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Trans. Software Eng. 47, 11 (2021), 2312–2331. doi: 10.1109/TSE.2019.2946563

  35. [35]

    Filipe Marques, Mafalda Ferreira, André Nascimento, Miguel E. Coimbra, Nuno Santos, Limin Jia, and José Fragoso Santos. 2025. Automated Exploit Generation for Node.js Packages. Proc. ACM Program. Lang. 9, PLDI (2025), 1341–1366. doi: 10.1145/3729304

  36. [36]

    Antonio Germán Márquez, Ángel Jesús Varela-Vaca, María Teresa Gómez-López, José A. Galindo, and David Benavides. 2024. Vulnerability impact analysis in software project dependencies based on Satisfiability Modulo Theories (SMT). Comput. Secur. 139 (2024), 103669. doi: 10.1016/J.COSE.2023.103669

  37. [37]

    Barton P. Miller, Lars Fredriksen, and Bryan So. 1990. An Empirical Study of the Reliability of UNIX Utilities. Commun. ACM 33, 12 (1990), 32–44. doi: 10.1145/96267.96279

  38. [38]

    James Newsome and Dawn Xiaodong Song. 2005. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2005, San Diego, California, USA. The Internet Society. https://www.ndss-symposium.org/ndss2005/dynamic-taint-analy...

  39. [39]

    Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. 2025. FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents. CoRR abs/2507.15241 (2025). arXiv:2507.15241 doi: 10.48550/ARXIV.2507.15241

  40. [40]

    Ana Nunez, Nafis Tanveer Islam, Sumit Kumar Jha, and Peyman Najafirad. 2024. AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing. CoRR abs/2409.10737 (2024). arXiv:2409.10737 doi: 10.48550/ARXIV.2409.10737

  41. [41]

    Wanzong Peng, Lin Ye, Xuetao Du, Hongli Zhang, Dongyang Zhan, Yunting Zhang, Yicheng Guo, and Chen Zhang. 2025. PwnGPT: Automatic Exploit Generation Based on Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxia...

  42. [42]

    Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL...

  43. [43]

    Francisco Ribeiro. 2023. Large Language Models for Automated Program Repair. In Companion Proceedings of the 2023 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH 2023, Cascais, Portugal, October 22-27, 2023, Vasco Thudichum Vasconcelos (Ed.). ACM, 7–9. doi: 10.1145/3618305.3623587

  44. [44]

    Qingkai Shi, Xiao Xiao, Rongxin Wu, Jinguo Zhou, Gang Fan, and Charles Zhang

  45. [45]

    Pinpoint: fast and precise sparse value flow analysis for million lines of code. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). Association for Computing Machinery, New York, NY, USA, 693–706. doi: 10.1145/3192366.3192418

  46. [46]

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Krügel, and Giovanni Vigna. 2016. SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016. IEEE Comput...

  47. [47]

    Deniz Simsek, Aryaz Eghbali, and Michael Pradel. 2025. PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages. CoRR abs/2506.04962 (2025). arXiv:2506.04962 doi: 10.48550/ARXIV.2506.04962

  48. [48]

    Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2016. Driller: Augmenting Fuzzing Through Selective Symbolic Execution. In 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016. T...

  49. [49]

    Ziliang Wang, Ge Li, Jia Li, Hao Zhu, and Zhi Jin. 2025. VulAgent: A Hypothesis Validation-Based Multi-Agent System for Software Vulnerability Detection. arXiv:2509.11523 [cs.SE] https://arxiv.org/abs/2509.11523

  50. [50]

    Ziyue Wang and Liyi Zhou. 2025. Agentic Discovery and Validation of Android App Vulnerabilities. CoRR abs/2508.21579 (2025). arXiv:2508.21579 doi: 10.48550/ARXIV.2508.21579

  51. [51]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, Novem...

  52. [52]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/forum?id=WE_vluYUL-X

  53. [53]

    Michał Zalewski. 2014. American fuzzy lop. http://lcamtuf.coredump.cx/afl/

  54. [54]

    Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, and Lingpeng Kong. 2023. CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara E...

  55. [55]

    Yuntong Zhang, Jiawei Wang, Dominic Berzin, Martin Mirchev, Dongge Liu, Abhishek Arya, Oliver Chang, and Abhik Roychoudhury. 2024. Fixing Security Vulnerabilities with AI in OSS-Fuzz. CoRR abs/2411.03346 (2024). arXiv:2411.03346 doi: 10.48550/ARXIV.2411.03346

  56. [56]

    Zexin Zhong, Jiangchao Liu, Diyu Wu, Peng Di, Yulei Sui, Alex X. Liu, and John C. S. Lui. 2023. Scalable Compositional Static Taint Analysis for Sensitive Data Tracing on Industrial Micro-Services. In 45th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ...

  57. [57]

    Zhuotong Zhou, Yongzhuo Yang, Susheng Wu, Yiheng Huang, Bihuan Chen, and Xin Peng. 2024. Magneto: A Step-Wise Approach to Exploit Vulnerabilities in Dependent Libraries via LLM-Empowered Directed Fuzzing. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2...