Recognition: no theorem link
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
Pith reviewed 2026-05-15 13:57 UTC · model grok-4.3
The pith
Veritas detects memory corruption in stripped binaries by grounding LLM reasoning in reconstructed value flows and runtime validation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Veritas achieves reliable memory corruption vulnerability detection in stripped binaries through a combination of static value-flow reconstruction, LLM-based hypothesis generation over grounded artifacts, and multi-agent runtime validation. On a benchmark of real-world cases it reaches 90% recall and produces no false positives among 623 exhaustively checked candidates, and it identified a previously unknown Apple vulnerability that was confirmed and assigned a CVE.
What carries the argument
The key mechanism is the static slicer, which emits compact, witness-backed flow objects from LLVM-IR facts (def-use relations, calls, returns, globals, and pointer operations); these objects ground the subsequent LLM detector and validator.
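The paper's flow-object schema is not reproduced in this review; as a reading aid, here is a minimal Python sketch of what a witness-backed flow object could look like. All field names, value names, and the example IR strings are hypothetical, not the authors' actual format.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of a witness-backed flow object. Each edge in the
# reconstructed value flow carries a "witness": the LLVM-IR fact (def-use,
# call, return, global, pointer op) that justifies the edge.

@dataclass(frozen=True)
class Witness:
    kind: str        # "def-use" | "call" | "return" | "global" | "ptr-op"
    ir_line: str     # the LLVM-IR instruction that produced the fact

@dataclass(frozen=True)
class FlowEdge:
    src: str         # defining SSA value, e.g. "%len"
    dst: str         # using SSA value or sink argument
    witness: Witness

@dataclass
class FlowObject:
    source: str                      # where the value originates
    sink: str                        # memory-relevant sink
    edges: list = field(default_factory=list)

    def add(self, src, dst, kind, ir_line):
        self.edges.append(FlowEdge(src, dst, Witness(kind, ir_line)))

# Example: a tainted length flowing from a recv() return into memcpy.
flow = FlowObject(source="recv", sink="memcpy")
flow.add("%len", "%memcpy_arg2", "def-use",
         "%len = call i64 @recv(...)")
flow.add("%memcpy_arg2", "memcpy", "call",
         "call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 %len, i1 false)")
print(len(flow.edges))  # prints 2
```

The point of the "witness" field is that the downstream detector never has to trust a bare claim that a flow exists; every edge points back at a concrete IR instruction it can re-check.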
If this is right
- If correct, this shows that semantic grounding can overcome the limitations of lossy binary representations for LLM-based analysis.
- High recall with minimal false positives supports deployment in practical security scanning pipelines.
- The discovery of a real CVE demonstrates applicability to large commercial software.
- Validation through debugger-visible artifacts reduces reliance on purely static methods.
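As one concrete (and entirely hypothetical) instance of that last point, a runtime oracle could corroborate a hypothesis against an AddressSanitizer-style crash report rather than static reasoning alone. The report text and function names below are illustrative; the paper does not specify its oracle interface.

```python
import re

# Toy memory-checking oracle in the spirit of the paper's validator:
# given stderr from a run instrumented with AddressSanitizer, decide
# whether it corroborates a memory-corruption hypothesis at a given
# function. The header regex matches the standard ASan error line.

ASAN_HEADER = re.compile(r"ERROR: AddressSanitizer: (?P<kind>[\w-]+)")

def oracle(asan_stderr: str, hypothesized_func: str) -> bool:
    if not ASAN_HEADER.search(asan_stderr):
        return False                 # no runtime crash evidence at all
    # Confirm only if a faulting frame mentions the hypothesized function.
    return hypothesized_func in asan_stderr

report = """==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000eff5
    #0 0x4005b6 in parse_header /src/parser.c:42
"""
print(oracle(report, "parse_header"))   # True
print(oracle(report, "unrelated_fn"))   # False
```

A real validator would of course inspect stack frames and addresses more carefully; the sketch only shows why runtime evidence can reject a statically plausible candidate.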
Where Pith is reading between the lines
- This approach could extend to detecting other classes of vulnerabilities by adjusting the flow queries and validation oracles.
- Improving the accuracy of the IR lifting step would likely increase overall effectiveness on more complex binaries.
- Similar grounding techniques might apply to related problems like identifying side-channel leaks or protocol flaws in binaries.
Load-bearing premise
The static slicer can reconstruct accurate value-flow relations, object semantics, and interprocedural paths from stripped binaries without critical omissions or mismatches that would mislead the LLM reasoning.
What would settle it
Finding a known memory corruption vulnerability in a stripped binary where the slicer fails to capture the relevant data flow path, resulting in the detector missing the bug despite its presence.
Figures
Original abstract
Detecting memory corruption vulnerabilities in stripped binaries requires recovering object semantics, interprocedural propagation, and feasible triggers from low-level, lossy representations. Recent LLM-based approaches improve code understanding, but reliable detection still requires grounding in memory-relevant semantics and runtime feasibility evidence. We present Veritas, a semantically grounded framework for binary memory corruption vulnerability detection. Veritas combines a static slicer over RetDec-lifted LLVM IR, a dual-view LLM detector that reasons step by step over grounded flows using decompiled C and selective LLVM IR, and a multi-agent validator that checks hypotheses against debugger-visible artifacts and runtime evidence. The slicer reconstructs value-flow relations from LLVM-IR facts, including def-use, calls, returns, globals, and pointer operations, and emits compact witness-backed flow objects. The detector uses these artifacts to reason about control flow, bounds, and object correspondence without rediscovering whole-binary propagation. The validator confirms or rejects candidates through guided debugging, breakpoint inspection, and memory-checking oracles. We implement Veritas as a modular pipeline and evaluate it on a curated benchmark of real-world binary vulnerability cases. Veritas achieves 90% recall. For false-positive assessment, we exhaustively validate and manually verify 623 detector candidates and audit additional candidates from larger cases. The exhaustive subset produces no false positives, while the additional audit identifies two confirmed false positives. In a real-world application, Veritas discovered a previously unknown Apple vulnerability that was confirmed and assigned a CVE. These results support semantic grounding as an operational design principle for practical binary vulnerability detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Veritas, a semantically grounded agentic framework for detecting memory corruption vulnerabilities in stripped binaries. It integrates a static slicer on RetDec-lifted LLVM IR that emits witness-backed flow objects capturing def-use chains, pointer operations, globals, calls, and returns; a dual-view LLM detector that performs step-by-step reasoning over both decompiled C and selective LLVM IR using these artifacts; and a multi-agent validator that confirms or rejects hypotheses via guided debugging, breakpoint inspection, and memory-checking oracles. On a curated benchmark of real-world binary cases, Veritas reports 90% recall. Exhaustive manual validation of 623 detector candidates yields zero false positives, while a larger audit identifies two confirmed false positives. The framework also discovered a previously unknown Apple vulnerability that received a CVE.
Significance. If the results hold after addressing the evaluation gaps, the work is significant for establishing semantic grounding as a practical design principle in binary vulnerability detection. By combining static value-flow reconstruction with LLM reasoning and runtime evidence, it achieves high recall with near-zero false positives on real-world cases and demonstrates utility through a confirmed CVE discovery. This hybrid pipeline offers a concrete advance over ungrounded LLM approaches or purely static methods for analyzing lossy stripped binaries.
major comments (2)
- [Abstract and Evaluation section] The 90% recall and zero-FP claims on the 623-candidate exhaustive subset rest on the static slicer correctly emitting faithful witness-backed flow objects for def-use, pointer semantics, and interprocedural relations. No quantitative measurement of slicer fidelity (e.g., ground-truth flow coverage, aliasing accuracy, or semantic mismatch rate against the original binaries) is reported on the benchmark set. This is load-bearing because systematic reconstruction errors from RetDec lifting of stripped binaries would propagate directly into the dual-view detector and validator, potentially inflating the reported performance.
- [Benchmark and §4, or equivalent evaluation section] The manuscript provides insufficient detail on benchmark construction, including how the real-world vulnerability cases were selected, the criteria for including or excluding candidates, and the exact process that produced the 623 candidates for exhaustive validation. Without these, it is impossible to assess whether post-hoc filtering or selection bias affects the recall and false-positive numbers.
minor comments (2)
- [Abstract] The abstract refers to 'exhaustive validation' of 623 candidates and 'additional audit' of larger cases; the main text should explicitly define the scope of each set and the criteria distinguishing them.
- [Slicer description] Notation for the witness-backed flow objects (e.g., how def-use facts are encoded) should be introduced with a small example early in the description of the slicer to aid readability.
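To illustrate what such a small example might look like (the manuscript's actual notation is not shown in this review), here is one hypothetical tuple encoding of def-use facts for a three-instruction IR snippet: each fact is a (defining line, using line, SSA value) triple.

```python
# Hypothetical tuple encoding of def-use facts for a small LLVM-IR
# snippet; the manuscript's real notation is not reproduced here.
ir = [
    "%n   = load i32, ptr %len_ptr",
    "%buf = call ptr @malloc(i32 %n)",
    "call void @memset(ptr %buf, i8 0, i32 %n)",
]

def def_use_facts(instructions):
    """Emit (def_line, use_line, value) triples by scanning SSA names."""
    defs = {}     # SSA value -> line index of its definition
    facts = []
    for i, inst in enumerate(instructions):
        lhs, sep, rhs = inst.partition("=")
        body = rhs if sep else inst
        used = {tok.strip("(),") for tok in body.split()
                if tok.strip("(),").startswith("%")}
        for v in sorted(used):
            if v in defs:
                facts.append((defs[v], i, v))
        if sep:
            defs[lhs.strip()] = i
    return facts

print(def_use_facts(ir))
# [(0, 1, '%n'), (1, 2, '%buf'), (0, 2, '%n')]
```

Even this toy version makes the referee's point: once the encoding is fixed, a reader can check each emitted fact against the IR lines by eye.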
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional details and evaluations where feasible.
Point-by-point responses
- Referee: [Abstract and Evaluation section] The 90% recall and zero-FP claims on the 623-candidate exhaustive subset rest on the static slicer correctly emitting faithful witness-backed flow objects for def-use, pointer semantics, and interprocedural relations. No quantitative measurement of slicer fidelity (e.g., ground-truth flow coverage, aliasing accuracy, or semantic mismatch rate against the original binaries) is reported on the benchmark set. This is load-bearing because systematic reconstruction errors from RetDec lifting of stripped binaries would propagate directly into the dual-view detector and validator, potentially inflating the reported performance.
Authors: We acknowledge that a direct quantitative assessment of slicer fidelity (such as flow coverage or semantic mismatch rates against ground-truth binaries) is not reported and would strengthen the claims. The exhaustive manual validation of all 623 candidates, in which we inspected the emitted flow objects for correctness against the original binaries, provides indirect evidence of sufficient fidelity for the evaluated cases. However, we agree this is insufficient as standalone validation. In the revised manuscript, we will add a new subsection in the evaluation reporting slicer metrics on a representative subset of the benchmark (e.g., precision/recall of reconstructed def-use chains and pointer relations against manually annotated ground truth), along with a limitations discussion on RetDec lifting errors. revision: yes
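The slicer-fidelity metrics proposed in this response admit a simple formulation: precision and recall of the reconstructed def-use edges, compared as sets against manually annotated ground truth. The definitions below are an illustration of that idea, not the authors' actual evaluation code.

```python
# Sketch of the proposed slicer-fidelity metrics: precision/recall of
# reconstructed def-use edges against an annotated ground truth, with
# both edge sets represented as sets of (def, use) pairs.

def fidelity(reconstructed: set, ground_truth: set):
    tp = len(reconstructed & ground_truth)
    precision = tp / len(reconstructed) if reconstructed else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall

truth = {("a", "b"), ("b", "c"), ("c", "d")}
recon = {("a", "b"), ("b", "c"), ("x", "y")}   # one spurious, one missed edge
p, r = fidelity(recon, truth)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

Low precision here would indicate spurious flows that could mislead the detector toward false positives; low recall would indicate missed flows that cap achievable detection recall, which is exactly the referee's propagation concern.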
- Referee: [Benchmark and §4, or equivalent evaluation section] The manuscript provides insufficient detail on benchmark construction, including how the real-world vulnerability cases were selected, the criteria for including or excluding candidates, and the exact process that produced the 623 candidates for exhaustive validation. Without these, it is impossible to assess whether post-hoc filtering or selection bias affects the recall and false-positive numbers.
Authors: We agree that greater transparency on benchmark construction is necessary to allow assessment of selection bias. The current manuscript describes the benchmark only at a high level as 'a curated benchmark of real-world binary vulnerability cases.' In the revised version, we will expand the relevant section (currently §4) with explicit details: sources used (public CVE databases, vulnerability reports, and open-source projects), inclusion criteria (e.g., confirmed memory corruption bugs with available stripped binaries and PoCs), exclusion criteria (e.g., cases requiring dynamic linking not supported by our pipeline), and the precise sampling process that yielded the 623 cases for exhaustive validation. This will include a table or flowchart summarizing the curation pipeline. revision: yes
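The curation pipeline promised in this response could be made auditable by encoding the criteria as explicit filters. The sketch below is illustrative only; the criteria fields, thresholds, and case names are assumptions, not the authors' actual rules.

```python
# Illustrative sketch of a benchmark-curation pipeline with explicit
# inclusion/exclusion criteria; all fields and case names are hypothetical.

cases = [
    {"cve": "CVE-A", "confirmed_memcorr": True, "has_stripped_bin": True,
     "has_poc": True, "needs_dynamic_linking": False},
    {"cve": "CVE-B", "confirmed_memcorr": True, "has_stripped_bin": False,
     "has_poc": True, "needs_dynamic_linking": False},
    {"cve": "CVE-C", "confirmed_memcorr": True, "has_stripped_bin": True,
     "has_poc": True, "needs_dynamic_linking": True},
]

def include(c):
    # Inclusion: confirmed memory-corruption bug with stripped binary and PoC.
    return c["confirmed_memcorr"] and c["has_stripped_bin"] and c["has_poc"]

def exclude(c):
    # Exclusion: known pipeline limitation (e.g. unsupported dynamic linking).
    return c["needs_dynamic_linking"]

benchmark = [c["cve"] for c in cases if include(c) and not exclude(c)]
print(benchmark)  # ['CVE-A']
```

Publishing the criteria in this executable form (or as the promised table/flowchart) would let readers verify that no post-hoc filtering depends on detector outcomes.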
Circularity Check
No circularity: empirical pipeline with independent validation
full rationale
The paper presents an empirical system (static slicer over RetDec IR + dual-view LLM detector + multi-agent runtime validator) whose central claims are performance numbers measured on external benchmarks and a real-world CVE case. No equations, fitted parameters, or derivations appear; the 90% recall and FP counts are reported as direct experimental outcomes against curated cases rather than predictions derived from the inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The framework is therefore self-contained against external evidence and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: RetDec lifting produces LLVM IR faithful enough for accurate value-flow reconstruction in stripped binaries
- domain assumption: LLMs can perform reliable step-by-step reasoning about memory safety when supplied with grounded flow objects and selective IR
Reference graph
Works this paper leans on
- [1] 0xdea. 2025. semgrep-rules. https://github.com/0xdea/semgrep-rules. Accessed 29 Jan 2025.
- [2] Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, et al. [n.d.]. EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities. In Forty-second International Conference on Machine Learning.
- [3] Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, et al. 2024. Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities. arXiv preprint arXiv:2409.16165 (2024).
- [4] AFL++ Team. 2023. AFL++: Combining Incremental Steps of Fuzzing Research. https://github.com/AFLplusplus/AFLplusplus. Accessed 20 Jan 2026.
- [5] Anthropic. [n.d.]. Claude Code Overview. https://code.claude.com/docs/en/overview. Claude Code Docs. Accessed 6 Apr 2026.
- [6] Anthropic. 2026. Partnering with Mozilla to Improve Firefox's Security. https://www.anthropic.com/news/mozilla-firefox-security. Accessed Apr 2026.
- [7] Anthropic. 2026. Project Glasswing: Securing Critical Software for the AI Era. https://www.anthropic.com/project/glasswing. Accessed Apr 2026.
- [8] Anthropic Frontier Red Team. 2026. Claude Mythos Preview. https://red.anthropic.com/2026/mythos-preview/. Accessed Apr 2026.
- [9] Avast Software. 2015. RetDec: A Retargetable Machine-Code Decompiler. https://github.com/avast/retdec. Accessed 29 Jan 2025.
- [10] Thanassis Avgerinos, Sang Kil Cha, Alexandre Rebert, Edward J. Schwartz, Maverick Woo, and David Brumley. 2014. Automatic exploit generation. Commun. ACM 57, 2 (2014), 74–84.
- [11] David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. 2011. BAP: A binary analysis platform. In International Conference on Computer Aided Verification. Springer, 463–469.
- [12] Cristian Cadar, Daniel Dunbar, Dawson R. Engler, et al. 2008. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, Vol. 8. 209–224.
- [15] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. 2001. Bugs as deviant behavior: A general approach to inferring errors in systems code. ACM SIGOPS Operating Systems Review 35, 5 (2001), 57–72.
- [17] Stanislav Fort. 2026. System Over Model: Zero-Day Discovery at the Jagged Frontier. https://aisle.com/blog/system-over-model-zero-day-discovery-at-the-jagged-frontier. AISLE blog.
- [18] Fraunhofer FKIE. 2025. cwe_checker: Static Binary Analysis for CWE Detection. https://github.com/fkie-cad/cwe_checker. Accessed 29 Jan 2025.
- [19] Vinod Ganapathy, Somesh Jha, David Chandler, David Melski, and David Vitek. 2003. Buffer overrun detection using linear programming and static analysis. In Proceedings of the 10th ACM Conference on Computer and Communications Security. 345–354.
- [21] Jinyao Guo, Chengpeng Wang, Dominic Deluca, Jinjie Liu, Zhuo Zhang, and Xiangyu Zhang. 2025. BugScope: Learn to Find Bugs Like Human. arXiv preprint arXiv:2507.15671 (2025).
- [24] Yu Jiang, Jie Liang, Fuchen Ma, Yuanliang Chen, Chijin Zhou, Yuheng Shen, Zhiyong Wu, Jingzhou Fu, Mingzhe Wang, Shanshan Li, et al. 2024. When fuzzing meets LLMs: Challenges and opportunities. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 492–496.
- [25] Keen Security Lab. 2023. BinAbsInspector. https://github.com/KeenSecurityLab/BinAbsInspector. GitHub repository. Retrieved 1 May 2023.
- [27] Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware vulnerability detection through code property graph-guided large language models. In 34th USENIX Security Symposium (USENIX Security 25). 489–507.
- [29] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.
- [31] Xiang Mei, Pulkit Singh Singaria, Jordi Del Castillo, Haoran Xi, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili, Adam Doupé, Hammond Pearce, Brendan Dolan-Gavitt, et al. 2024. ARVO: Atlas of reproducible vulnerabilities for open source software. arXiv preprint arXiv:2408.02153 (2024).
- [32] Meta. 2025. Infer Static Analyzer. https://fbinfer.com/. Accessed 29 Jan 2025.
- [33] National Security Agency. 2019. Ghidra Software Reverse Engineering Framework. https://ghidra-sre.org/. Accessed 29 Jan 2025.
- [35] NSA Center for Assured Software. 2017. Juliet C/C++ 1.3. https://samate.nist.gov/SARD/test-suites/112.
- [36] OpenAI. [n.d.]. Codex. https://openai.com/codex/. OpenAI Developers. Accessed 6 Apr 2026.
- [37] OpenAI. 2026. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/. Accessed 25 Apr 2026.
- [38] Chengbin Pang, Ruotong Yu, Yaohui Chen, Eric Koskinen, Georgios Portokalidis, Bing Mao, and Jun Xu. 2021. SoK: All you ever wanted to know about x86/x64 binary disassembly but were afraid to ask. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 833–851.
- [39] radareorg. 2025. radare2: Reverse Engineering Framework. https://github.com/radareorg/radare2. Accessed 29 Jan 2025.
- [40] Nilo Redini, Aravind Machiry, Ruoyu Wang, Chad Spensky, Andrea Continella, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2020. Karonte: Detecting insecure multi-binary interactions in embedded firmware. In 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 1544–1561.
- [41] Semgrep. 2025. Semgrep: Lightweight Static Analysis for Many Languages. https://github.com/semgrep/semgrep. Accessed 29 Jan 2025.
- [42] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). 309–318.
- [43] Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, and Nenghai Yu. 2025. BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models. arXiv preprint arXiv:2505.07360 (2025).
- [44] Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, et al. 2025. CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution. arXiv preprint arXiv:2505.17107 (2025).
- [45] Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. 2025. LLMs in software security: A survey of vulnerability detection techniques and insights. Comput. Surveys 58, 5 (2025), 1–35.
- [47] Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. 2024. LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 862–880.
- [48] Jayakrishna Vadayath, Moritz Eckert, Kyle Zeng, Nicolaas Weideman, Gokulkrishna Praveen Menon, Yanick Fratantonio, Davide Balzarotti, Adam Doupé, Tiffany Bao, Ruoyu Wang, et al. 2022. Arbiter: Bridging the static and dynamic divide in vulnerability discovery on binary programs. In 31st USENIX Security Symposium (USENIX Security 22). 413–430.
- [49] Valgrind Developers. 2002. Valgrind: An Instrumentation Framework for Building Dynamic Analysis Tools. https://valgrind.org/. Accessed 29 Jan 2025.
- [50] David A. Wagner, Jeffrey S. Foster, Eric A. Brewer, and Alexander Aiken. 2000. A first step towards automated detection of buffer overrun vulnerabilities. In NDSS, Vol. 20.
- [51] Fish Wang and Yan Shoshitaishvili. 2017. angr: the next generation of binary analysis. In 2017 IEEE Cybersecurity Development (SecDev). IEEE, 8–9.
- [54] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.
- [55] Jun Xu, Dongliang Mu, Ping Chen, Xinyu Xing, Pei Wang, and Peng Liu. 2016. CREDAL: Towards locating a memory corruption vulnerability with your core dump. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 529–540.
- [56] Jun Xu, Peng Ning, Chongkyung Kil, Yan Zhai, and Chris Bookholt. 2005. Automatic diagnosis and response to memory corruption vulnerabilities. In Proceedings of the 12th ACM Conference on Computer and Communications Security. 223–234.
- [57] Hua Yan, Yulei Sui, Shiping Chen, and Jingling Xue. 2018. Spatio-temporal context reduction: A pointer-analysis-based static approach for detecting use-after-free vulnerabilities. In Proceedings of the 40th International Conference on Software Engineering. 327–337.
- [58] Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, et al. 2025. Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in LLMs. arXiv preprint arXiv:2502.19411 (2025).
- [59] Alperen Yildiz, Sin G. Teo, Yiling Lou, Yebo Feng, Chong Wang, and Dinil Mon Divakaran. 2025. Benchmarking LLMs and LLM-based agents in practical vulnerability detection for code repositories. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 30848–30865.
- [60] Hang Zhang, Jangha Kim, Chuhong Yuan, Zhiyun Qian, and Taesoo Kim. 2025. Statically discover cross-entry use-after-free vulnerabilities in the Linux kernel. In Network and Distributed System Security (NDSS) Symposium.
discussion (0)