pith. machine review for the scientific record.

arxiv: 2605.15097 · v1 · submitted 2026-05-14 · 💻 cs.SE · cs.CR


Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries


Pith reviewed 2026-05-15 13:57 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords binary analysis · vulnerability detection · memory corruption · static slicing · LLM-based detection · runtime validation · stripped binaries · agentic framework

The pith

Veritas detects memory corruption in stripped binaries by grounding LLM reasoning in reconstructed value flows and runtime validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Veritas as a framework that recovers object semantics and value flows from lossy binary representations using a static slicer over lifted intermediate code. A dual-view LLM then reasons step by step over these flows in both decompiled C and the IR, and a multi-agent validator uses debugger checks to confirm or reject candidate bugs. The design aims to make automated detection reliable by tying abstract reasoning to concrete evidence from the program. A reader would care because memory corruption bugs in binaries are hard to find automatically once high-level information is stripped away, and effective tools could measurably improve software security. The evaluation reports 90% recall on real-world cases, with no false positives in an exhaustively checked subset of 623 candidates.
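The three-stage shape described above (static slicing, LLM detection, runtime validation) can be sketched in miniature. Everything here — the flow-object fields, the fact format, and the stub detector and validator — is illustrative, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class FlowObject:
    """A compact, witness-backed value flow: each hop cites the IR fact
    (def-use edge, call, return, global, or pointer op) that justifies it."""
    source: str                               # e.g. "read@0x401a20"
    sink: str                                 # e.g. "memcpy@0x4021f0"
    hops: list = field(default_factory=list)  # (instruction, fact-kind) witnesses

def slice_binary(ir_facts):
    """Stage 1 (static grounding): reconstruct value flows from lifted IR facts."""
    return [FlowObject(f["src"], f["dst"], f["hops"]) for f in ir_facts]

def detect(flow):
    """Stage 2 (detection): in Veritas an LLM reasons step-wise over the flow;
    this stub just flags flows that end in a memory-writing routine."""
    return any("memcpy" in h or "strcpy" in h for h, _ in flow.hops) or "memcpy" in flow.sink

def validate(flow):
    """Stage 3 (runtime grounding): stub for the debugger-backed validator,
    which would set breakpoints and inspect memory before confirming."""
    return True

def pipeline(ir_facts):
    candidates = [f for f in slice_binary(ir_facts) if detect(f)]
    return [f for f in candidates if validate(f)]
```

Each later stage consumes only the artifacts emitted by the stage before it, which is the sense in which the LLM's reasoning is "grounded."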

Core claim

Veritas achieves reliable memory corruption vulnerability detection in stripped binaries through a combination of static value-flow reconstruction, LLM-based hypothesis generation over grounded artifacts, and multi-agent runtime validation. On a benchmark of real-world cases it reaches 90% recall and produces no false positives among 623 exhaustively checked candidates, and it discovered a previously unknown Apple vulnerability that was confirmed and assigned a CVE.

What carries the argument

The key mechanism is the static slicer, which emits compact, witness-backed flow objects from LLVM-IR facts (def-use relations, calls, returns, globals, and pointer operations); these objects ground both the LLM detector and the validator that follow.
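As a concrete illustration of flow reconstruction — not the paper's algorithm — a backward slice over def-use facts can be computed with a simple worklist, retaining each traversed edge as a witness. The toy IR and all naming below are invented:

```python
def backward_slice(defs_of, uses, seed):
    """Walk def-use edges backward from `seed`, collecting every value that
    can influence it plus a witness (defining instruction, used value) per
    hop. `defs_of` maps value -> defining instruction; `uses` maps
    instruction -> values it reads."""
    worklist, visited, witnesses = [seed], set(), []
    while worklist:
        value = worklist.pop()
        if value in visited:
            continue
        visited.add(value)
        ins = defs_of.get(value)
        if ins is None:
            continue  # no local definition: argument, global input, etc.
        for used in uses.get(ins, []):
            witnesses.append((ins, used))
            worklist.append(used)
    return visited, witnesses

# Toy lifted IR:  %len = load i32, i32* %hdr ; %r = call @memcpy(%dst, %src, %len)
defs_of = {"%len": "i1: %len = load i32, i32* %hdr",
           "%r":   "i2: %r = call @memcpy(%dst, %src, %len)"}
uses = {"i1: %len = load i32, i32* %hdr": ["%hdr"],
        "i2: %r = call @memcpy(%dst, %src, %len)": ["%dst", "%src", "%len"]}
reach, witnesses = backward_slice(defs_of, uses, "%r")
```

The slice from the memcpy call reaches %len and, through the load, %hdr, so a length flowing from a header field into memcpy is captured with one witness per hop — the kind of compact, checkable artifact the detector is then handed.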

If this is right

  • If correct, this shows that semantic grounding can overcome the limitations of lossy binary representations for LLM-based analysis.
  • High recall with minimal false positives supports deployment in practical security scanning pipelines.
  • The discovery of a real CVE demonstrates applicability to large commercial software.
  • Validation through debugger-visible artifacts reduces reliance on purely static methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach could extend to detecting other classes of vulnerabilities by adjusting the flow queries and validation oracles.
  • Improving the accuracy of the IR lifting step would likely increase overall effectiveness on more complex binaries.
  • Similar grounding techniques might apply to related problems like identifying side-channel leaks or protocol flaws in binaries.

Load-bearing premise

The static slicer can reconstruct accurate value-flow relations, object semantics, and interprocedural paths from stripped binaries without critical omissions or mismatches that would mislead the LLM reasoning.

What would settle it

Finding a known memory corruption vulnerability in a stripped binary where the slicer fails to capture the relevant data flow path, resulting in the detector missing the bug despite its presence.

Figures

Figures reproduced from arXiv: 2605.15097 by Alfredo Pesoli, Lorenzo Cavallaro, Marco Valleri, Suman Jana, Xinran Zheng.

Figure 1. Failure case showing the need for semantic ground…
Figure 2. Realistic program artifacts from CVE-2020-13790, an out-of-bounds read in the PPM reader. The vulnerability arises…
Figure 3. The workflow of Veritas, organized as static grounding, controlled LLM reasoning, and runtime grounding. (1) The Semantic-driven Context Slicer extracts witness-backed flows with taint-relevant anchor cues from stripped binaries without LLMs. (2) The Dual-view Vulnerability Detector reasons step-wise along these flows, consulting LLVM IR at anchored functions where decompiled code is ambiguous. (3) The Aut…
Figure 4. The simplified code of CVE-2025-[REDACTED].
original abstract

Detecting memory corruption vulnerabilities in stripped binaries requires recovering object semantics, interprocedural propagation, and feasible triggers from low-level, lossy representations. Recent LLM-based approaches improve code understanding, but reliable detection still requires grounding in memory-relevant semantics and runtime feasibility evidence. We present Veritas, a semantically grounded framework for binary memory corruption vulnerability detection. Veritas combines a static slicer over RetDec-lifted LLVM IR, a dual-view LLM detector that reasons step by step over grounded flows using decompiled C and selective LLVM IR, and a multi-agent validator that checks hypotheses against debugger-visible artifacts and runtime evidence. The slicer reconstructs value-flow relations from LLVM-IR facts, including def-use, calls, returns, globals, and pointer operations, and emits compact witness-backed flow objects. The detector uses these artifacts to reason about control flow, bounds, and object correspondence without rediscovering whole-binary propagation. The validator confirms or rejects candidates through guided debugging, breakpoint inspection, and memory-checking oracles. We implement Veritas as a modular pipeline and evaluate it on a curated benchmark of real-world binary vulnerability cases. Veritas achieves 90% recall. For false-positive assessment, we exhaustively validate and manually verify 623 detector candidates and audit additional candidates from larger cases. The exhaustive subset produces no false positives, while the additional audit identifies two confirmed false positives. In a real-world application, Veritas discovered a previously unknown Apple vulnerability that was confirmed and assigned a CVE. These results support semantic grounding as an operational design principle for practical binary vulnerability detection.
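The validator's memory-checking-oracle idea, reduced to its simplest form: run a candidate's trigger input under a memory checker and trust only the checker's verdict, not the program's exit behavior. This sketch uses Valgrind's real `--error-exitcode` and `--quiet` flags, but the helper names and the decision rule are assumptions, not the paper's validator.

```python
import subprocess

ERROR_EXITCODE = 99  # sentinel we ask Valgrind to return on any memory error

def memcheck_cmd(binary):
    """Command line for Valgrind memcheck; --error-exitcode makes a detected
    memory error observable through the exit status alone."""
    return ["valgrind", f"--error-exitcode={ERROR_EXITCODE}", "--quiet", binary]

def confirmed_by_memcheck(returncode):
    """A candidate counts as runtime-confirmed only when the checker itself
    reported an error, not when the program merely crashed or exited oddly."""
    return returncode == ERROR_EXITCODE

def validate_candidate(binary, poc_input, timeout=30):
    """Run one candidate's proof-of-concept input under memcheck (sketch)."""
    proc = subprocess.run(memcheck_cmd(binary), input=poc_input,
                          capture_output=True, timeout=timeout)
    return confirmed_by_memcheck(proc.returncode)
```

Separating the command construction from the confirmation rule keeps the oracle testable without a binary at hand; the paper's validator additionally drives a debugger and inspects breakpoint state before accepting a candidate.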

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Veritas, a semantically grounded agentic framework for detecting memory corruption vulnerabilities in stripped binaries. It integrates a static slicer on RetDec-lifted LLVM IR that emits witness-backed flow objects capturing def-use chains, pointer operations, globals, calls, and returns; a dual-view LLM detector that performs step-by-step reasoning over both decompiled C and selective LLVM IR using these artifacts; and a multi-agent validator that confirms or rejects hypotheses via guided debugging, breakpoint inspection, and memory-checking oracles. On a curated benchmark of real-world binary cases, Veritas reports 90% recall. Exhaustive manual validation of 623 detector candidates yields zero false positives, while a larger audit identifies two confirmed false positives. The framework also discovered a previously unknown Apple vulnerability that received a CVE.

Significance. If the results hold after addressing the evaluation gaps, the work is significant for establishing semantic grounding as a practical design principle in binary vulnerability detection. By combining static value-flow reconstruction with LLM reasoning and runtime evidence, it achieves high recall with near-zero false positives on real-world cases and demonstrates utility through a confirmed CVE discovery. This hybrid pipeline offers a concrete advance over ungrounded LLM approaches or purely static methods for analyzing lossy stripped binaries.

major comments (2)
  1. [Abstract and Evaluation section] The 90% recall and zero-FP claims on the 623-candidate exhaustive subset rest on the static slicer correctly emitting faithful witness-backed flow objects for def-use, pointer semantics, and interprocedural relations. No quantitative measurement of slicer fidelity (e.g., ground-truth flow coverage, aliasing accuracy, or semantic mismatch rate against the original binaries) is reported on the benchmark set. This is load-bearing because systematic reconstruction errors from RetDec lifting of stripped binaries would propagate directly into the dual-view detector and validator, potentially inflating the reported performance.
  2. [Benchmark and §4 (or equivalent evaluation section)] The manuscript provides insufficient detail on benchmark construction, including how the real-world vulnerability cases were selected, the criteria for including or excluding candidates, and the exact process that produced the 623 candidates for exhaustive validation. Without these, it is impossible to assess whether post-hoc filtering or selection bias affects the recall and false-positive numbers.
minor comments (2)
  1. [Abstract] The abstract refers to 'exhaustive validation' of 623 candidates and 'additional audit' of larger cases; the main text should explicitly define the scope of each set and the criteria distinguishing them.
  2. [Slicer description] Notation for the witness-backed flow objects (e.g., how def-use facts are encoded) should be introduced with a small example early in the description of the slicer to aid readability.
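A small example of the kind the second minor comment requests might look like the following. The JSON field names are invented for illustration and are not the paper's actual flow-object schema.

```python
import json

# Hypothetical encoding of one witness-backed flow object: the source and
# sink of a value flow, plus one IR-level witness per hop justifying it.
flow_object = {
    "source": {"fn": "sub_401a20", "value": "%len", "origin": "call @fread"},
    "sink":   {"fn": "sub_4021f0", "site": "call @memcpy", "arg_index": 2},
    "witnesses": [
        {"fact": "def-use", "ir": "%len = load i32, i32* %hdr_len"},
        {"fact": "call",    "ir": "call void @sub_4021f0(i8* %buf, i32 %len)"},
        {"fact": "def-use", "ir": "call @memcpy(i8* %dst, i8* %src, i32 %len)"},
    ],
}
encoded = json.dumps(flow_object, indent=2)
```

The point of such a figure would be that every hop the detector reasons over is backed by a citable IR fact, so a reviewer (or the validator) can audit the chain step by step.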

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional details and evaluations where feasible.

point-by-point responses
  1. Referee: [Abstract and Evaluation section] The 90% recall and zero-FP claims on the 623-candidate exhaustive subset rest on the static slicer correctly emitting faithful witness-backed flow objects for def-use, pointer semantics, and interprocedural relations. No quantitative measurement of slicer fidelity (e.g., ground-truth flow coverage, aliasing accuracy, or semantic mismatch rate against the original binaries) is reported on the benchmark set. This is load-bearing because systematic reconstruction errors from RetDec lifting of stripped binaries would propagate directly into the dual-view detector and validator, potentially inflating the reported performance.

    Authors: We acknowledge that a direct quantitative assessment of slicer fidelity (such as flow coverage or semantic mismatch rates against ground-truth binaries) is not reported and would strengthen the claims. The exhaustive manual validation of all 623 candidates, in which we inspected the emitted flow objects for correctness against the original binaries, provides indirect evidence of sufficient fidelity for the evaluated cases. However, we agree this is insufficient as standalone validation. In the revised manuscript, we will add a new subsection in the evaluation reporting slicer metrics on a representative subset of the benchmark (e.g., precision/recall of reconstructed def-use chains and pointer relations against manually annotated ground truth), along with a limitations discussion on RetDec lifting errors. revision: yes

  2. Referee: [Benchmark and §4 (or equivalent evaluation section)] The manuscript provides insufficient detail on benchmark construction, including how the real-world vulnerability cases were selected, the criteria for including or excluding candidates, and the exact process that produced the 623 candidates for exhaustive validation. Without these, it is impossible to assess whether post-hoc filtering or selection bias affects the recall and false-positive numbers.

    Authors: We agree that greater transparency on benchmark construction is necessary to allow assessment of selection bias. The current manuscript describes the benchmark only at a high level as 'a curated benchmark of real-world binary vulnerability cases.' In the revised version, we will expand the relevant section (currently §4) with explicit details: sources used (public CVE databases, vulnerability reports, and open-source projects), inclusion criteria (e.g., confirmed memory corruption bugs with available stripped binaries and PoCs), exclusion criteria (e.g., cases requiring dynamic linking not supported by our pipeline), and the precise sampling process that yielded the 623 candidates for exhaustive validation. This will include a table or flowchart summarizing the curation pipeline. revision: yes
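The inclusion/exclusion criteria promised in the response could be made mechanical. This filter is a sketch of that idea; the criterion names are invented for illustration and do not come from the paper.

```python
def curate(cases):
    """Partition candidate cases by the (illustrative) criteria above:
    confirmed memory corruption, a stripped binary, a PoC, and no reliance
    on dynamic-linking features the pipeline does not support."""
    kept, dropped = [], []
    for case in cases:
        eligible = (case.get("bug_class") == "memory-corruption"
                    and case.get("has_stripped_binary", False)
                    and case.get("has_poc", False)
                    and not case.get("needs_unsupported_dyld", False))
        (kept if eligible else dropped).append(case)
    return kept, dropped
```

Publishing such a filter alongside the benchmark would let readers replay the curation and check for the selection bias the referee worries about.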

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent validation

full rationale

The paper presents an empirical system (static slicer over RetDec IR + dual-view LLM detector + multi-agent runtime validator) whose central claims are performance numbers measured on external benchmarks and a real-world CVE case. No equations, fitted parameters, or derivations appear; the 90% recall and FP counts are reported as direct experimental outcomes against curated cases rather than predictions derived from the inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The framework is therefore self-contained against external evidence and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from binary lifting and LLM code reasoning rather than new free parameters or invented entities.

axioms (2)
  • domain assumption RetDec lifting produces LLVM IR faithful enough for accurate value-flow reconstruction in stripped binaries
    Invoked by the static slicer component described in the abstract.
  • domain assumption LLMs can perform reliable step-by-step reasoning about memory safety when supplied with grounded flow objects and selective IR
    Central premise of the dual-view detector.

pith-pipeline@v0.9.0 · 5595 in / 1412 out tokens · 69831 ms · 2026-05-15T13:57:29.593973+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 1 internal anchor

  [1] 0xdea. 2025. semgrep-rules. https://github.com/0xdea/semgrep-rules. [Online; accessed 29-Jan-2025].
  [2] Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, et al. [n. d.]. EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities. In Forty-second International Conference on Machine Learning.
  [3] Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, et al. 2024. Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities. arXiv preprint arXiv:2409.16165 (2024).
  [4] AFL++ Team. 2023. AFL++: Combining Incremental Steps of Fuzzing Research. https://github.com/AFLplusplus/AFLplusplus. Accessed: 2026-01-20.
  [5] Anthropic. [n. d.]. Claude Code Overview. https://code.claude.com/docs/en/overview. Claude Code Docs, accessed April 6, 2026.
  [6] Anthropic. 2026. Partnering with Mozilla to Improve Firefox's Security. https://www.anthropic.com/news/mozilla-firefox-security. Accessed April 2026.
  [7] Anthropic. 2026. Project Glasswing: Securing Critical Software for the AI Era. https://www.anthropic.com/project/glasswing. Accessed April 2026.
  [8] Anthropic Frontier Red Team. 2026. Claude Mythos Preview. https://red.anthropic.com/2026/mythos-preview/. Accessed April 2026.
  [9] Avast Software. 2015. RetDec: A Retargetable Machine-Code Decompiler. https://github.com/avast/retdec. [Online; accessed 29-Jan-2025].
  [10] Thanassis Avgerinos, Sang Kil Cha, Alexandre Rebert, Edward J Schwartz, Maverick Woo, and David Brumley. 2014. Automatic exploit generation. Commun. ACM 57, 2 (2014), 74–84.
  [11] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J Schwartz. 2011. BAP: A binary analysis platform. In International Conference on Computer Aided Verification. Springer, 463–469.
  [12] Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, Vol. 8. 209–224.
  [13] Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, and Xin Xia. 2024. Reasoning runtime behavior of a program with LLM: How far are we? arXiv preprint arXiv:2403.16437 (2024).
  [14] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624 (2024).
  [15] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. 2001. Bugs as deviant behavior: A general approach to inferring errors in systems code. ACM SIGOPS Operating Systems Review 35, 5 (2001), 57–72.
  [17] Stanislav Fort. 2026. System Over Model: Zero-Day Discovery at the Jagged Frontier. https://aisle.com/blog/system-over-model-zero-day-discovery-at-the-jagged-frontier. AISLE blog.
  [18] Fraunhofer FKIE. 2025. cwe_checker: Static Binary Analysis for CWE Detection. https://github.com/fkie-cad/cwe_checker. [Online; accessed 29-Jan-2025].
  [19] Vinod Ganapathy, Somesh Jha, David Chandler, David Melski, and David Vitek. 2003. Buffer overrun detection using linear programming and static analysis. In Proceedings of the 10th ACM Conference on Computer and Communications Security. 345–354.
  [21] Jinyao Guo, Chengpeng Wang, Dominic Deluca, Jinjie Liu, Zhuo Zhang, and Xiangyu Zhang. 2025. BugScope: Learn to Find Bugs Like Human. arXiv preprint arXiv:2507.15671 (2025).
  [22] Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An autonomous LLM-agent for repository-level code auditing. arXiv preprint arXiv:2501.18160 (2025).
  [23] Nasir Hussain, Haohan Chen, Chanh Tran, Philip Huang, Zhuohao Li, Pravir Chugh, William Chen, Ashish Kundu, and Yuan Tian. 2025. VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries. arXiv preprint arXiv:2505.22010 (2025).
  [24] Yu Jiang, Jie Liang, Fuchen Ma, Yuanliang Chen, Chijin Zhou, Yuheng Shen, Zhiyong Wu, Jingzhou Fu, Mingzhe Wang, Shanshan Li, et al. 2024. When fuzzing meets LLMs: Challenges and opportunities. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 492–496.
  [25] Keen Security Lab. 2023. BinAbsInspector. https://github.com/KeenSecurityLab/BinAbsInspector. GitHub repository. Retrieved May 1, 2023.
  [26] Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. 2025. SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks. arXiv preprint arXiv:2506.11791 (2025).
  [27] Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-aware vulnerability detection through code property graph-guided large language models. In 34th USENIX Security Symposium (USENIX Security 25). 489–507.
  [28] Ziyang Li, Saikat Dutta, and Mayur Naik. 2024. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238 (2024).
  [29] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.
  [30] Puzhuo Liu, Chengnian Sun, Yaowen Zheng, Xuan Feng, Chuan Qin, Yuncheng Wang, Zhi Li, and Limin Sun. 2023. Harnessing the power of LLM to support binary taint analysis. arXiv preprint arXiv:2310.08275 (2023).
  [31] Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt, et al. 2024. ARVO: Atlas of reproducible vulnerabilities for open source software. arXiv preprint arXiv:2408.02153 (2024).
  [32] Meta. 2025. Infer Static Analyzer. https://fbinfer.com/. [Online; accessed 29-Jan-2025].
  [33] National Security Agency. 2019. Ghidra Software Reverse Engineering Framework. https://ghidra-sre.org/. [Online; accessed 29-Jan-2025].
  [34] Yuzhou Nie, Hongwei Li, Chengquan Guo, Ruizhe Jiang, Zhun Wang, Bo Li, Dawn Song, and Wenbo Guo. 2025. VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection. arXiv preprint arXiv:2512.07533 (2025).
  [35] NSA Center for Assured Software. 2017. Juliet C/C++ 1.3. https://samate.nist.gov/SARD/test-suites/112.
  [36] OpenAI. [n. d.]. Codex. https://openai.com/codex/. OpenAI Developers, accessed April 6, 2026.
  [37] OpenAI. 2026. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/. Accessed: 2026-04-25.
  [38] Chengbin Pang, Ruotong Yu, Yaohui Chen, Eric Koskinen, Georgios Portokalidis, Bing Mao, and Jun Xu. 2021. SoK: All you ever wanted to know about x86/x64 binary disassembly but were afraid to ask. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 833–851.
  [39] radareorg. 2025. radare2: Reverse Engineering Framework. https://github.com/radareorg/radare2. [Online; accessed 29-Jan-2025].
  [40] Nilo Redini, Aravind Machiry, Ruoyu Wang, Chad Spensky, Andrea Continella, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2020. Karonte: Detecting insecure multi-binary interactions in embedded firmware. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1544–1561.
  [41] Semgrep. 2025. Semgrep: Lightweight Static Analysis for Many Languages. https://github.com/semgrep/semgrep. [Online; accessed 29-Jan-2025].
  [42] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). 309–318.
  [43] Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, and Nenghai Yu. 2025. BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models. arXiv preprint arXiv:2505.07360 (2025).
  [44] Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, et al. 2025. CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution. arXiv preprint arXiv:2505.17107 (2025).
  [45] Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. 2025. LLMs in software security: A survey of vulnerability detection techniques and insights. Comput. Surveys 58, 5 (2025), 1–35.
  [46] Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T Barr, and Wei Le. 2024. To err is machine: Vulnerability detection challenges LLM reasoning. arXiv preprint arXiv:2403.17218 (2024).
  [47] Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 862–880.
  [48] Jayakrishna Vadayath, Moritz Eckert, Kyle Zeng, Nicolaas Weideman, Gokulkrishna Praveen Menon, Yanick Fratantonio, Davide Balzarotti, Adam Doupé, Tiffany Bao, Ruoyu Wang, et al. 2022. Arbiter: Bridging the static and dynamic divide in vulnerability discovery on binary programs. In 31st USENIX Security Symposium (USENIX Security 22). 413–430.
  [49] Valgrind Developers. 2002. Valgrind: An Instrumentation Framework for Building Dynamic Analysis Tools. https://valgrind.org/. [Online; accessed 29-Jan-2025].
  [50] David A Wagner, Jeffrey S Foster, Eric A Brewer, and Alexander Aiken. 2000. A first step towards automated detection of buffer overrun vulnerabilities. In NDSS, Vol. 20.
  [51] Fish Wang and Yan Shoshitaishvili. 2017. angr: The next generation of binary analysis. In 2017 IEEE Cybersecurity Development (SecDev). IEEE, 8–9.
  [52] Felix Weissberg, Lukas Pirch, Erik Imgrund, Jonas Möller, Thorsten Eisenhofer, and Konrad Rieck. 2025. LLM-based Vulnerability Discovery through the Lens of Code Metrics. arXiv preprint arXiv:2509.19117 (2025).
  [53] Xin-Cheng Wen, Yijun Yang, Cuiyun Gao, Yang Xiao, and Deheng Ye. 2025. Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data. arXiv preprint arXiv:2506.07390 (2025).
  [54] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.
  [55] Jun Xu, Dongliang Mu, Ping Chen, Xinyu Xing, Pei Wang, and Peng Liu. 2016. CREDAL: Towards locating a memory corruption vulnerability with your core dump. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 529–540.
  [56] Jun Xu, Peng Ning, Chongkyung Kil, Yan Zhai, and Chris Bookholt. 2005. Automatic diagnosis and response to memory corruption vulnerabilities. In Proceedings of the 12th ACM Conference on Computer and Communications Security. 223–234.
  [57] Hua Yan, Yulei Sui, Shiping Chen, and Jingling Xue. 2018. Spatio-temporal context reduction: A pointer-analysis-based static approach for detecting use-after-free vulnerabilities. In Proceedings of the 40th International Conference on Software Engineering. 327–337.
  [58] Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, et al. 2025. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in LLMs. arXiv preprint arXiv:2502.19411 (2025).
  [59] Alperen Yildiz, Sin G Teo, Yiling Lou, Yebo Feng, Chong Wang, and Dinil Mon Divakaran. 2025. Benchmarking LLMs and LLM-based agents in practical vulnerability detection for code repositories. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 30848–30865.
  [60] Hang Zhang, Jangha Kim, Chuhong Yuan, Zhiyun Qian, and Taesoo Kim. 2025. Statically discover cross-entry use-after-free vulnerabilities in the Linux kernel. In Network and Distributed System Security (NDSS) Symposium.