pith. machine review for the scientific record.

arXiv: 2605.06601 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.AI


Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches


Pith reviewed 2026-05-08 08:47 UTC · model grok-4.3

keywords: vulnerability reconstruction · binary patches · agentic analysis · Linux distributions · security updates · root cause analysis · language model agents

The pith

A local agent reconstructs vulnerability root causes from binary patches in 11 of 20 Linux updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates whether an offline language-model agent can determine the security implications of Linux distribution updates when only binary packages are available. It presents Patch2Vuln, a pipeline that extracts old and new ELF files from .deb packages, diffs them with Ghidra and Ghidriff, ranks the changed functions, and has the agent generate audits and validation plans. On 20 security updates and 5 negative controls, the agent localizes the relevant function in 10 cases and classifies the root cause in 11; six of the failures trace back to the binary differ or ranker omitting the key function before the agent sees it. This matters because defenders often lack source code for timely patch analysis, so binary-only methods could speed up vulnerability understanding in operational settings.

Core claim

Patch2Vuln is a resumable pipeline that extracts old/new ELF pairs from Ubuntu .deb packages, diffs them with Ghidra and Ghidriff, ranks changed functions by security relevance, builds dossiers, and directs an offline agent to output a preliminary audit, bounded validation plan, and final root-cause classification. On 20 security-update pairs, the agent localizes the verified security-relevant patch function in 10 cases and assigns an accepted final root-cause class in 11 cases, with all 5 negative controls correctly marked unknown. Oracle analysis reveals that 6 failures occur before agent reasoning due to the differencer or ranker omitting the right function.
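The "resumable" property described above can be illustrated with a minimal sketch. This is not the paper's implementation: the checkpoint file, the `run_resumable` helper, and the stage functions are hypothetical names introduced here, and the only assumption taken from the text is that stages run in a fixed order and that completed work should not be repeated after an interruption.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# A stage is a (name, function) pair. The function receives the context
# accumulated so far and returns a JSON-serializable dict of new results.
# Stage names and functions are hypothetical placeholders.
Stage = Tuple[str, Callable[[dict], dict]]


def run_resumable(stages: List[Stage], state_path: Path) -> dict:
    """Run stages in order, skipping any already recorded in state_path."""
    state: Dict[str, dict] = {}
    if state_path.exists():
        state = json.loads(state_path.read_text())

    context: dict = {}
    for name, fn in stages:
        if name in state:
            # Stage finished in an earlier run: reuse its stored output.
            context.update(state[name])
            continue
        output = fn(dict(context))
        state[name] = output
        context.update(output)
        # Persist after every stage so an interruption loses at most one stage.
        state_path.write_text(json.dumps(state, indent=2))
    return context
```

Checkpointing per stage is one plausible design for a pipeline whose early stages (headless decompilation and diffing) are expensive relative to the later agent calls; the paper may persist state differently.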

What carries the argument

The agentic pipeline combining binary differencing of ELF pairs with function ranking and offline language model reasoning to produce root-cause audits.

If this is right

  • Binary-only reconstruction can identify patch functions without source access.
  • Validation passes can produce behavioral differentials in some cases like tcpdump.
  • Negative controls are reliably classified as unknown without false positives.
  • Binary diff coverage limits the overall success rate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving binary differencing tools could raise the localization rate above 50%.
  • Extending to other distributions or architectures would test broader applicability.
  • Combining with dynamic analysis might yield more crash proofs in validation.
  • Such agents could integrate into automated patch monitoring workflows.

Load-bearing premise

The binary differencer and function ranker must reliably surface the security-relevant changed function for the agent to analyze it.
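To make this premise concrete, here is a toy ranker. Everything in it is an assumption for illustration: the feature set, the weights, the `GUARD_HINTS` list, and the function names are hypothetical and are not taken from the paper, which does not describe its ranking features in the text reproduced here. The sketch only shows why a top-k cutoff matters: a function that scores below the cutoff never reaches the agent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangedFunction:
    name: str
    similarity: float        # 0.0 (rewritten) .. 1.0 (identical), as a differ might report
    added_comparisons: int   # compare/branch instructions new in the patched version
    added_calls: List[str]   # callees present only in the patched version


# Hypothetical hint list: callees whose appearance often accompanies hardening.
GUARD_HINTS = ("__stack_chk_fail", "__memcpy_chk", "__strcpy_chk", "abort")


def score(fn: ChangedFunction) -> float:
    """Toy relevance score: small, guard-adding changes rank highest."""
    guard_calls = sum(1 for callee in fn.added_calls if callee in GUARD_HINTS)
    # Mildly reward near-identical functions, but give nothing to unchanged ones.
    smallness = fn.similarity if fn.similarity < 1.0 else 0.0
    return 2.0 * guard_calls + 1.0 * fn.added_comparisons + 0.5 * smallness


def rank(functions: List[ChangedFunction], top_k: int) -> List[ChangedFunction]:
    """Return the top_k functions by descending score; the rest are dropped."""
    return sorted(functions, key=score, reverse=True)[:top_k]
```

With any scheme of this shape, a security fix that looks like an ordinary refactor (no new guard calls, no added comparisons) can fall outside the top k, which is the failure mode the premise warns about.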

What would settle it

Observe whether the pipeline's localization success rate exceeds 10 out of 20 when tested on a new set of 20 security-update pairs from Ubuntu or another distribution.
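When reading such a comparison, the sample size matters. As a rough aid, the sketch below computes a Wilson score interval for a proportion; the choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made here, not statistics reported by the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# The reported localization result: 10 of 20 security-update pairs.
low, high = wilson_interval(10, 20)
```

For 10 of 20 the interval runs from roughly 0.30 to 0.70, so a second set of 20 pairs would need to land well outside that range before the difference could be attributed to anything other than sampling noise.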

Figures

Figures reproduced from arXiv: 2605.06601 by Arthur Gervais, Isaac David.

Figure 1: Patch2Vuln architecture. The upper lane contains the evidence visible to the agent. Human...
Figure 2: Failure localization for the 20 security-update pairs. Half of the targets are localized by the...
Original abstract

Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.
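The first pipeline step in the abstract, extracting old/new ELF pairs, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes both `.deb` packages have already been unpacked into two directories (for example with `dpkg-deb -x`), and it pairs files by identical relative path, which is a simplification chosen here. The helper names `find_elfs` and `pair_elfs` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def find_elfs(root: Path) -> Dict[str, Path]:
    """Map relative path -> file for every regular file that starts with the ELF magic."""
    found: Dict[str, Path] = {}
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        with path.open("rb") as fh:
            if fh.read(4) == ELF_MAGIC:
                found[path.relative_to(root).as_posix()] = path
    return found


def pair_elfs(old_root: Path, new_root: Path) -> List[Tuple[Path, Path]]:
    """Pair ELF files that exist at the same relative path in both trees."""
    old, new = find_elfs(old_root), find_elfs(new_root)
    return [(old[rel], new[rel]) for rel in sorted(old.keys() & new.keys())]
```

Path-based pairing misses shared libraries whose file name embeds a version that changes between releases; a real pipeline would need an extra matching rule for those cases.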

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Patch2Vuln, a local, resumable pipeline that extracts old/new ELF pairs from Ubuntu .deb packages, diffs them via Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and uses an offline language-model agent to generate a preliminary audit, bounded validation plan, and final root-cause classification. On a manually adjudicated set of 25 pairs (20 security-update pairs and 5 negative controls), the agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20; six security pairs fail before model reasoning due to binary-diff or ranker omissions, one additional context-export miss occurs, the validation pass yields only two target-level behavioral differentials (both for tcpdump) with no crash/timeout/sanitizer/memory-corruption evidence, and all negative controls are classified as unknown.

Significance. If the results hold, the work establishes a concrete baseline for agentic vulnerability reconstruction from binary-only artifacts, which is significant for operational settings where source patches or advisories are unavailable. The explicit identification of binary-diff coverage and local behavioral validation as limiting factors, together with the use of negative controls and modest framing, positions the contribution as a useful research target rather than an end-to-end solution.

major comments (2)
  1. [Evaluation] Evaluation section (abstract and results): the localization success of 10/20 and accepted classification of 11/20 are load-bearing for the central claim, yet six of the 20 security pairs fail before any agent reasoning because the binary differencer/ranker omits the relevant function; this upstream dependency means the agent's contribution is only evaluated on a reduced subset and the pipeline's overall effectiveness cannot be isolated from the quality of the differencing tool.
  2. [Validation] Validation pass (abstract and results): the bounded validation produces only two minimized behavioral old/new differentials (both tcpdump) and no crash, timeout, sanitizer finding, or memory-corruption proof across the 20 security cases; without concrete exploit or proof evidence the final root-cause classifications rest primarily on agent reasoning, weakening support for the claim that the method reconstructs verifiable vulnerabilities.
minor comments (2)
  1. [Abstract] Abstract: the distinction between 'localizes a verified security-relevant patch function' (10/20) and 'assigns an accepted final root-cause class' (11/20) is not fully explained; clarify the overlap, the adjudication criteria for acceptance, and how the two metrics relate.
  2. [Evaluation] Evaluation: add a compact table or per-pair breakdown of outcomes for the 20 security pairs (localization success, classification result, failure mode) to improve transparency and allow readers to assess patterns.
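The upstream-dependency concern in the first major comment can be made concrete with a small calculation over the aggregate counts quoted in this report. The sketch rests on two assumptions that the abstract does not confirm: that the six differ/ranker omissions and the one context-export miss are disjoint, and that all ten localized pairs are among those that reached the agent intact. The `Outcome` class is a hypothetical container introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    total: int                # security-update pairs evaluated
    differ_ranker_miss: int   # right function omitted by the differ or ranker
    context_export_miss: int  # right function lost at context export
    localized: int            # agent localized the verified patch function

    @property
    def reached_agent(self) -> int:
        # Assumes the two upstream failure categories do not overlap.
        return self.total - self.differ_ranker_miss - self.context_export_miss

    @property
    def end_to_end_rate(self) -> float:
        return self.localized / self.total

    @property
    def conditional_rate(self) -> float:
        # Assumes every localized pair is among those that reached the agent.
        return self.localized / self.reached_agent


# Aggregate counts as quoted from the abstract.
reported = Outcome(total=20, differ_ranker_miss=6, context_export_miss=1, localized=10)
```

Under those assumptions, 13 pairs reach the agent with the right function in view, giving an end-to-end rate of 10/20 and a conditional rate of 10/13 (about 0.77). The per-pair table requested in the second minor comment would let readers check both assumptions directly.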

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and the recommendation for minor revision. We address each major comment below. The manuscript already transparently reports the limitations noted by the referee, supporting our modest framing of the work as a baseline for future research.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (abstract and results): the localization success of 10/20 and accepted classification of 11/20 are load-bearing for the central claim, yet six of the 20 security pairs fail before any agent reasoning because the binary differencer/ranker omits the relevant function; this upstream dependency means the agent's contribution is only evaluated on a reduced subset and the pipeline's overall effectiveness cannot be isolated from the quality of the differencing tool.

    Authors: We thank the referee for highlighting this important aspect of the evaluation. The paper explicitly discloses in both the abstract and the results section that six security pairs fail prior to agent reasoning due to omissions by the binary differencer or ranker. Our evaluation measures the end-to-end pipeline performance, which necessarily depends on the upstream binary analysis tools. We provide oracle diagnostics precisely to allow readers to assess the agent's contribution separately on the subset where the differencing succeeds. The central claim concerns the feasibility of agentic reconstruction using binary-derived artifacts, not an isolated evaluation of the language model. We believe the current presentation is accurate and does not require revision.

    Revision: no

  2. Referee: [Validation] Validation pass (abstract and results): the bounded validation produces only two minimized behavioral old/new differentials (both tcpdump) and no crash, timeout, sanitizer finding, or memory-corruption proof across the 20 security cases; without concrete exploit or proof evidence the final root-cause classifications rest primarily on agent reasoning, weakening support for the claim that the method reconstructs verifiable vulnerabilities.

    Authors: The referee correctly observes that the bounded validation yields limited concrete evidence. This is reported in the manuscript: only two target-level behavioral differentials were produced, both for tcpdump, with no crash, timeout, sanitizer, or memory-corruption findings. The root-cause classifications rely on the agent's analysis of the binary-derived dossiers, as the local validation environment did not yield stronger proof artifacts in most cases. We frame the contribution as a baseline for agentic vulnerability reconstruction rather than a method that produces verifiable exploits. The absence of such evidence is presented as a key limitation and research target. No changes to the manuscript are necessary, as the results are presented factually and the claims are appropriately qualified.

    Revision: no
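For readers unfamiliar with the term, a "behavioral old/new differential" is an input on which the vulnerable and patched builds behave observably differently. The sketch below shows one generic way to check for that. It is not the paper's validation harness, which is not described in the text reproduced here; the `run_once` and `differs` helpers, the use of stdin as the input channel, and the choice to compare exit code plus stdout are all assumptions.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    returncode: Optional[int]  # None when the run hit the timeout
    stdout: bytes
    stderr: bytes


def run_once(argv: List[str], data: bytes, timeout_s: float) -> RunResult:
    """Run one command with `data` on stdin under a wall-clock bound."""
    try:
        proc = subprocess.run(argv, input=data, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return RunResult(None, b"", b"")
    return RunResult(proc.returncode, proc.stdout, proc.stderr)


def differs(old_argv: List[str], new_argv: List[str], data: bytes,
            timeout_s: float = 5.0) -> bool:
    """True when old and new builds disagree on exit code or stdout for `data`."""
    old = run_once(old_argv, data, timeout_s)
    new = run_once(new_argv, data, timeout_s)
    return (old.returncode, old.stdout) != (new.returncode, new.stdout)


# Stand-ins for an old and a patched binary: the "patched" script rejects
# inputs longer than four bytes, the "old" one accepts everything.
OLD = [sys.executable, "-c", "import sys; d = sys.stdin.buffer.read(); print(len(d))"]
NEW = [sys.executable, "-c",
       "import sys; d = sys.stdin.buffer.read(); "
       "sys.exit(1) if len(d) > 4 else print(len(d))"]
```

A differential of this kind shows only that the patch changed behavior on some input. It is weaker evidence than a crash, timeout, or sanitizer finding, which is the distinction the referee and the authors both draw.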

Circularity Check

0 steps flagged

No significant circularity


The paper describes an empirical pipeline (binary extraction, Ghidra/Ghidriff differencing, function ranking, dossier construction, and offline agent auditing) evaluated directly on 25 manually adjudicated Ubuntu package pairs. All reported outcomes (10/20 localization, 11/20 root-cause classification, 6 early failures from differencer/ranker) are measured against private source-patch ground truth with explicit negative controls. No equations, fitted parameters, uniqueness theorems, or self-citations appear in the derivation chain; the central claims are statistical results from the evaluation itself and do not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that Ghidra/Ghidriff can produce usable function-level diffs from stripped or optimized ELF binaries and that an offline LLM can perform security reasoning from the resulting context without source code. No free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: Ghidra and Ghidriff produce sufficiently accurate function-level diffs on real Ubuntu ELF binaries for the downstream ranking step to be meaningful.
    Invoked in the pipeline description where changed functions are ranked and fed to the agent.
  • Domain assumption: An offline language-model agent can produce preliminary audits, validation plans, and root-cause classifications that align with human ground truth when given only binary-derived context.
    Central to the evaluation metrics of localization and accepted classification.

pith-pipeline@v0.9.0 · 5578 in / 1475 out tokens · 28531 ms · 2026-05-08T08:47:52.936253+00:00 · methodology

