Mitigating False Positives in Static Memory Safety Analysis of Rust Programs via Reinforcement Learning
EASE 2026, 9–12 June 2026, Glasgow, Scotland, United Kingdom
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-08 18:07 UTC · model grok-4.3
The pith
Reinforcement learning can suppress false positive warnings from Rust static analyzers like Rudra by learning policies from MIR features and fuzzing feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating false-positive suppression as a reinforcement learning task, an agent can learn a policy that uses features from Rust's mid-level intermediate representation to classify warnings and selectively invoke cargo-fuzz for dynamic confirmation, yielding 65.2 percent accuracy and an F1 score of 0.659, a 17.1 percent improvement over the strongest LLM baseline.
What carries the argument
An RL agent that learns a warning-suppression policy from contextual features extracted from Rust's mid-level intermediate representation, with cargo-fuzz providing auxiliary dynamic validation signals.
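The setup described above can be made concrete with a minimal sketch. This is not the paper's implementation: the action set, the linear value function, and the four stand-in "MIR features" are all hypothetical. Only the reward values (+15 for a correct classification, −15 for an incorrect one, −5 for executing the fuzzing action) are taken from the ledger quoted later on this page, and the fuzzing action is idealized as always revealing the true label.

```python
import random

random.seed(0)

# Hypothetical action set: "fuzz" invokes cargo-fuzz before committing.
ACTIONS = ["suppress", "keep", "fuzz"]

def reward(action, is_true_bug):
    """Reward scheme quoted in the ledger: +15 correct, -15 incorrect,
    and a -5 cost for executing the fuzzing action (idealized here as
    revealing the true label, after which the agent classifies correctly)."""
    if action == "fuzz":
        return -5.0 + 15.0
    correct = (action == "keep") == is_true_bug
    return 15.0 if correct else -15.0

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Linear action-value weights over a toy 4-dimensional stand-in for
# MIR-derived features (e.g. unsafe-block count, raw-pointer derefs;
# the real feature set is not specified in this review).
W = [[0.0] * 4 for _ in ACTIONS]

def step(x, is_true_bug, eps=0.1, lr=0.01):
    """One epsilon-greedy contextual-bandit update on a single warning."""
    if random.random() < eps:
        a = random.randrange(len(ACTIONS))
    else:
        a = max(range(len(ACTIONS)), key=lambda i: dot(W[i], x))
    r = reward(ACTIONS[a], is_true_bug)
    err = r - dot(W[a], x)
    # Move the chosen action's value estimate toward the observed reward.
    W[a] = [wi + lr * err * xi for wi, xi in zip(W[a], x)]
    return ACTIONS[a], r
```

The paper itself reportedly uses PPO (its reference list cites Schulman et al.); the bandit update here is only the simplest policy-learning loop that exhibits the same reward trade-off between classifying immediately and paying for fuzzing.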
If this is right
- Raw Rudra precision rises from 25.6 percent to 59.0 percent while recall reaches 74.6 percent.
- Adding targeted fuzzing yields another 10.7 percentage points in accuracy over the RL-only version.
- The hybrid approach outperforms LLM-based warning classification by 17.1 percent in accuracy.
- Developers receive fewer spurious alerts, lowering the effort needed to trust static memory-safety results.
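The reported figures are at least internally consistent: the F1 score is the harmonic mean of precision and recall, and plugging in the claimed 59.0 percent precision and 74.6 percent recall reproduces the claimed 0.659.

```python
# Consistency check of the reported metrics: F1 is the harmonic mean
# of precision and recall.
precision, recall = 0.590, 0.746
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.659, matching the reported F1 score
```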
Where Pith is reading between the lines
- Similar RL policies could be trained for other languages that expose comparable intermediate representations.
- Embedding the agent inside continuous-integration pipelines would let teams triage only the warnings the policy retains.
- If MIR features prove insufficient on certain code patterns, augmenting the state with additional static facts could be tested directly.
Load-bearing premise
The assumption that features drawn from the mid-level intermediate representation supply enough context for the agent to learn a reliable distinction between true and false warnings, and that fuzzing feedback is accurate and unbiased.
What would settle it
A test set of Rust programs containing documented memory-safety bugs: if the trained agent suppresses a substantial fraction of the true-positive warnings that the original static analyzer had flagged, the approach is undermined; if it retains them while still filtering spurious alerts, the claim holds.
Original abstract
Static analysis tools are essential for ensuring memory safety in Rust programs, particularly as Rust gains adoption in safety-critical domains. However, existing tools such as Rudra and MirChecker suffer from high false positive rates, which diminish developer trust, increase manual review effort, and may obscure genuine vulnerabilities. This paper presents a novel reinforcement learning (RL)-based approach for automatically classifying and suppressing spurious warnings in static memory safety analysis for Rust. To achieve this, we design an RL agent that learns a warning suppression policy by extracting contextual features from Rust's Mid-level Intermediate Representation (MIR) and optimizing its decisions through interaction with static analysis outputs. To improve decision quality, we integrate dynamic validation via cargo-fuzz as an auxiliary feedback mechanism, allowing the agent to selectively validate suspicious warnings through targeted fuzz testing. Our evaluation shows that the proposed approach significantly outperforms state-of-the-art LLM-based baselines, achieving 65.2% accuracy and an F1 score of 0.659, an improvement of 17.1% over the best LLM baseline. With a recall of 74.6%, our method successfully identifies nearly three-quarters of true bugs while substantially reducing false positives, improving precision from 25.6% in raw Rudra output to 59.0%. Incorporating dynamic fuzzing further boosts performance, yielding additional improvements of 10.7 percentage points in accuracy and 8.6 percentage points in F1 score over the RL-only variant. Overall, our work demonstrates that combining reinforcement learning with hybrid static-dynamic analysis can substantially reduce false positives and improve the practical usability of memory safety verification tools for Rust.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a reinforcement learning (RL) framework to reduce false positives in static memory safety analyzers for Rust, such as Rudra. It uses contextual features from the Mid-level Intermediate Representation (MIR) to train an agent that learns a suppression policy, augmented by dynamic feedback from cargo-fuzz. The evaluation reports that this approach achieves 65.2% accuracy and an F1 score of 0.659, outperforming LLM-based baselines by 17.1%, with precision improving from 25.6% to 59.0% and recall at 74.6%. Dynamic fuzzing adds further gains.
Significance. If the empirical results hold under rigorous validation, the hybrid RL approach combining MIR-derived features with fuzzing feedback could substantially improve the practical utility of static analyzers for Rust memory safety, a domain where high false-positive rates currently limit adoption in safety-critical code. The work credits the integration of static analysis outputs with dynamic validation as a key enabler for the reported precision lift.
major comments (2)
- [Evaluation] Evaluation section: the headline metrics (65.2% accuracy, F1=0.659, precision rising from 25.6% to 59.0%, recall 74.6%) are presented without any description of dataset size, number of programs or warnings, how ground-truth labels for true bugs were obtained, RL training hyperparameters, or statistical significance tests. These omissions make it impossible to verify whether the 17.1% improvement over LLM baselines is robust or reproducible.
- [Method (Dynamic Validation)] Dynamic validation subsection: the reward signal relies on cargo-fuzz to label true positives, yet no coverage statistics, number of fuzzing campaigns, or comparison against exhaustive or symbolic validation are supplied. If fuzzing systematically misses data-dependent or rare-path violations (as is common in Rust), the RL agent receives false-negative labels and is incentivized to suppress genuine bugs, directly undermining the claimed precision and recall figures.
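The asymmetry behind this objection is that fuzzing evidence is one-sided: a cargo-fuzz crash strongly confirms a true positive, while "no crash" only weakly suggests a false positive, and only to the extent the campaign covered the relevant paths. A hedged sketch of how a pipeline might weight the two outcomes differently (the confidence values and threshold are illustrative, not taken from the paper):

```python
def fuzz_label_confidence(crashed: bool, coverage: float) -> float:
    """Map a fuzzing outcome to a confidence that the warning is a true bug.

    A crash reproduces the flagged behavior and is near-certain evidence;
    the absence of a crash is only as convincing as the coverage achieved.
    All constants here are illustrative assumptions.
    """
    if crashed:
        return 0.99
    # No crash: true-bug confidence falls from 0.5 (no information at zero
    # coverage) toward 0.1 as coverage approaches 100%.
    return 0.5 - 0.4 * min(max(coverage, 0.0), 1.0)

def should_suppress(crashed: bool, coverage: float, threshold: float = 0.2) -> bool:
    """Conservative rule: suppress only when non-crash evidence is strong."""
    return (not crashed) and fuzz_label_confidence(crashed, coverage) < threshold
```

Under this rule a low-coverage campaign that finds no crash leaves the warning in place, which is exactly the conservative integration the authors' rebuttal gestures at.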
minor comments (1)
- [Abstract] Abstract: the phrase 'state-of-the-art LLM-based baselines' is used without naming the specific models or providing citations; this should be expanded for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the evaluation and dynamic validation aspects of our work. We address each major comment below and indicate the revisions incorporated into the manuscript.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: the headline metrics (65.2% accuracy, F1=0.659, precision rising from 25.6% to 59.0%, recall 74.6%) are presented without any description of dataset size, number of programs or warnings, how ground-truth labels for true bugs were obtained, RL training hyperparameters, or statistical significance tests. These omissions make it impossible to verify whether the 17.1% improvement over LLM baselines is robust or reproducible.
Authors: We agree that the original Evaluation section omitted details required for reproducibility and verification of the results. The revised manuscript expands this section to describe the dataset size and composition (number of programs and warnings), the procedure used to obtain ground-truth labels for true bugs, the full set of RL training hyperparameters, and the statistical significance tests performed to support the reported improvements over LLM baselines. revision: yes
-
Referee: [Method (Dynamic Validation)] Dynamic validation subsection: the reward signal relies on cargo-fuzz to label true positives, yet no coverage statistics, number of fuzzing campaigns, or comparison against exhaustive or symbolic validation are supplied. If fuzzing systematically misses data-dependent or rare-path violations (as is common in Rust), the RL agent receives false-negative labels and is incentivized to suppress genuine bugs, directly undermining the claimed precision and recall figures.
Authors: We acknowledge the validity of this concern regarding potential incomplete coverage in fuzzing. The revised manuscript adds coverage statistics from the cargo-fuzz campaigns, the number of fuzzing campaigns executed, and a rationale for not performing exhaustive or symbolic validation (due to scalability limitations with Rust programs). We also include an explicit discussion of the risk of false-negative labels from missed paths and describe how the RL framework integrates the dynamic signal conservatively alongside MIR features to limit its impact; an ablation study has been added showing that precision gains remain even under partial coverage. revision: partial
Circularity Check
No circularity in empirical RL training pipeline
full rationale
The paper describes an empirical reinforcement learning approach that trains an agent on MIR-derived contextual features to suppress false positives from static analyzers like Rudra, with cargo-fuzz providing external reward signals for validation. Reported metrics (65.2% accuracy, 0.659 F1, precision lift from 25.6% to 59.0%) arise from standard training/evaluation loops against held-out data and baselines, not from any internal equations or parameters that define the target quantities by construction. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the abstract or described method; the derivation chain consists of feature extraction, policy optimization, and hybrid static-dynamic feedback, all externally grounded.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation (washburn_uniqueness_aczel): J(x) = ½(x + x⁻¹) − 1 (uniqueness unclear). Matched text: "Correct classifications receive positive rewards (+15), while incorrect classifications are penalized (-15)... Executing the fuzzing action incurs a cost penalty (-5)."
Reference graph
Works this paper leans on
- [1] https://github.com/Akileshdash/rl-guided-static-analysis-rust/blob/main/README.md. 2025.
- [2] https://github.com/Akileshdash/Rudra/blob/master/README.md. 2025.
- [3] aarc Developers. 2021. aarc crate, version 0.3.2. https://crates.io/crates/aarc/0.3.2. Accessed: 2026-01-10.
- [4] Vytautas Astrauskas, Christoph Matheja, Federico Poli, Peter Müller, and Alexander J. Summers. 2020. How Do Programmers Use Unsafe Rust? In Proceedings of the ACM on Programming Languages (OOPSLA), Vol. 4. 136:1–136:27. doi:10.1145/3428204
- [5] Vytautas Astrauskas, Peter Müller, Federico Poli, and Alexander J. Summers. 2019. Leveraging Rust Types for Modular Specification and Verification. In Proceedings of the ACM on Programming Languages (OOPSLA), Vol. 3. 1–30.
- [6] Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix, and William Pugh. 2008. Using Static Analysis to Find Bugs. IEEE Software 25, 5, 22–29. doi:10.1109/MS.2008.130
- [7] Yechan Bae, Youngsuk Kim, Ammar Askar, Jungwon Lim, and Taesoo Kim. 2021. Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP). 84–99. doi:10.1145/3477132.3483570
- [8] Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. 2010. A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World. Communications of the ACM 53, 66–75. doi:10.1145/1646353.1646374
- [9] Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed Greybox Fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS). 2329–2344.
- [10]
- [11] Montgomery Carter, Shaobo He, Jonathan Whitaker, Zvonimir Rakamarić, and Michael Emmi. 2016. SMACK Software Verification Toolchain. In Proceedings of the 38th International Conference on Software Engineering Companion. 589–592.
- [12] Microsoft Security Response Center. 2019. Why Rust for Safe Systems Programming. https://msrc-blog.microsoft.com/2019/07/22/why-rust-for-safe-systems-programming/. Accessed: 2024-12-10.
- [13] Partha Chakraborty, Mahmoud Alfadel, and Meiyappan Nagappan. 2024. RLocator: Reinforcement Learning for Bug Localization. IEEE Transactions on Software Engineering (2024).
- [14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
- [15] Maria Christakis and Christian Bird. 2016. What Developers Want and Need from Program Analysis: An Empirical Study. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE). 332–343. doi:10.1145/2970276.2970347
- [16] Cloudflare. 2020. Enjoy a Slice of QUIC, and Rust! https://blog.cloudflare.com/enjoy-a-slice-of-quic-and-rust/. Accessed: 2024-12-10.
- [17] National Vulnerability Database. 2020. CVE-2020-35905. https://nvd.nist.gov/vuln/detail/CVE-2020-35905. Accessed: 2026-01-09.
- [18] National Vulnerability Database. 2021. CVE-2020-36323. https://nvd.nist.gov/vuln/detail/CVE-2020-36323. Accessed: 2026-01-09.
- [19]
- [20] Ana Nora Evans, Bradford Campbell, and Mary Lou Soffa. 2020. Is Rust Used Safely by Software Developers? In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE). 246–257. doi:10.1145/3377811.3380413
- [21] Z. Feng. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv preprint arXiv:2002.08155 (2020).
- [22] Sushant Ghimire, Michael W. Godfrey, and Chanchal K. Roy. 2023. Yuga: Automatically Detecting Lifetime Annotation Bugs in the Rust Language. IEEE Transactions on Software Engineering 49, 4 (2023), 2075–2091. doi:10.1109/TSE.2022.3200162
- [23] Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]. https://arxiv.org/abs/2407.21783
- [24] Sarah Heckman and Laurie Williams. 2009. A Model Building Process for Identifying Actionable Static Analysis Alerts. In Proceedings of the 2009 International Conference on Software Testing Verification and Validation (ICST). 161–170. doi:10.1109/ICST.2009.47
- [25] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
- [26] Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. 2013. Why Don't Software Developers Use Static Analysis Tools to Find Bugs? In Proceedings of the 2013 International Conference on Software Engineering (ICSE). 672–681. doi:10.1109/ICSE.2013.6606613
- [27]
- [28] Ralf Jung, Jacques-Henri Jourdan, Robbert Krebbers, and Derek Dreyer. 2018. RustBelt: Securing the Foundations of the Rust Programming Language. In Proceedings of the ACM on Programming Languages (POPL), Vol. 2. 66:1–66:34. doi:10.1145/3158154
- [29] Ralf Jung, Benjamin Kimock, Christian Poveda, Eduardo Sánchez Muñoz, Oli Scherer, and Qian Wang. 2026. Miri: Practical Undefined Behavior Detection for Rust. Proceedings of the ACM on Programming Languages 10, POPL (2026), 1383–1411.
- [30] Sunghun Kim and Michael D. Ernst. 2007. Which Warnings Should I Fix First? In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE). 45–54. doi:10.1145/1287624.1287633
- [31] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS). 2123–2138. doi:10.1145/3243734.3243804
- [32] Ted Kremenek, Ken Ashcraft, Junfeng Yang, and Dawson Engler. 2004. Correlation Exploitation in Error Ranking. In Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). 83–93. doi:10.1145/1029894.1029909
- [33] Yi Li, Aws Albarghouthi, Zachary Kincaid, Mayur Naik, et al. 2019. Hybrid Program Analysis for Effective Bug Detection. In Proceedings of the ACM/IEEE International Conference on Software Engineering.
- [34] Zhuohua Li, Jincheng Wang, Mingshen Sun, and John C. S. Lui. 2021. MirChecker: Detecting Bugs in Rust Programs via Static Analysis. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS). 2183–. doi:10.1145/3460120.3484541
- [35]
- [36] Kevin Lira, Baldoino Fonseca, Wesley K. G. Assunção, Davy Baya, and Marcio Ribeiro. 2025. Beyond Code Explanations: A Ray of Hope for Cross-Language Vulnerability Repair. In 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware). IEEE, 01–09.
- [37] Guoming Long, Jingzhi Gong, Hui Fang, and Tao Chen. 2025. Learning Software Bug Reports: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology (2025).
- [38]
- [39] Nicholas D. Matsakis and Felix S. Klock. 2014. The Rust Language. In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology (HILT). 103–104. doi:10.1145/2663171.2663188
- [40] Nathalia Nascimento, Everton Guimaraes, Sai Sanjna Chintakunta, and Santhosh Anitha Boominathan. 2025. How Effective are LLMs for Data Science Coding? A Controlled Experiment. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 211–222.
- [41] National Vulnerability Database. 2021. CVE-2020-36317. https://nvd.nist.gov/vuln/detail/CVE-2020-36317. Accessed: 2026-01-09.
- [42]
- [43] Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-Based Bug Detection. Proceedings of the ACM on Programming Languages 2, OOPSLA (2018), 1–25.
- [44] Boqin Qin, Yilun Chen, Zeming Yu, Linhai Song, and Yiying Zhang. 2020. Understanding Memory and Thread Safety Practices and Issues in Real-World Rust Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 763–779. doi:10.1145/3385412.3386036
- [45] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).
- [46] Rust Fuzzing Authority. 2024. cargo-fuzz: Fuzz Testing for Rust. https://github.com/rust-fuzz/cargo-fuzz
- [47] Rust Project Developers. 2025. Rust Version 1.88.0. https://doc.rust-lang.org/beta/releases.html#version-1880-2025-06-26. Accessed: 2026-01-10.
- [48] Rust Release Notes. 2021. Rust Version 1.56.0. https://doc.rust-lang.org/beta/releases.html#version-1560-2021-10-21. Accessed: 2026-01-10.
- [49] Joseph R. Ruthruff, John Penix, J. David Morgenthaler, Sebastian Elbaum, and Gregg Rothermel. 2008. Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach. In Proceedings of the 30th International Conference on Software Engineering (ICSE). 341–350. doi:10.1145/1368088.1368135
- [50] Iman Saberi, Amirreza Esmaeili, Fatemeh Fard, and Fuxiang Chen. 2025. AdvFusion: Adapter-based Knowledge Transfer for Code Summarization on Code Language Models. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 563–574.
- [51] Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. 2018. Lessons from Building Static Analysis Tools at Google. Communications of the ACM 61, 58–66. doi:10.1145/3188720
- [52] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG]. https://arxiv.org/abs/1707.06347
- [53]
- [54] Amazon Web Services. 2020. Why AWS Loves Rust, and How We'd Like to Help. https://aws.amazon.com/blogs/opensource/why-aws-loves-rust-and-how-wed-like-to-help/. Accessed: 2024-12-10.
- [55] Ayushi Sharma, Shashank Sharma, Sai Ritvik Tanksalkar, Santiago Torres-Arias, and Aravind Machiry. 2024. Rust for Embedded Systems: Current State and Open Problems. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS). 2296–2310.
- [56]
- [57] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- [58] Linus Torvalds and Linux Kernel Team. 2022. Linux 6.1: Rust Support Merged into Mainline Kernel. https://www.kernel.org/. Accessed: 2024-12-10.
- [59]
- [60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- [61] Hui Xu, Zhuangbin Chen, Mingshen Sun, Yangfan Zhou, and Michael R. Lyu. Memory-Safety Challenge Considered Solved? An In-Depth Study with All Rust CVEs. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3:1–3:25. doi:10.1145/3466642
- [62]
- [63] Michal Zalewski. 2014. American Fuzzy Lop. http://lcamtuf.coredump.cx/afl/