Finding Memory Leaks in C/C++ Programs via Neuro-Symbolic Augmented Static Analysis
Pith reviewed 2026-05-14 22:30 UTC · model grok-4.3
The pith
MemHint augments static analyzers with LLMs and Z3 to detect 52 memory leaks in 3.4 million lines of C/C++ code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemHint parses a target codebase, applies an LLM to label each function as allocator, deallocator or neither while producing ownership summaries, discards any summary whose claimed operation cannot occur on a feasible path according to Z3, injects the surviving summaries into CodeQL and Infer, uses Z3 to drop warnings on infeasible paths, and runs a final LLM check to retain only genuine bugs.
What carries the argument
Neuro-symbolic pipeline that uses LLM classification of custom memory functions to produce ownership summaries and Z3 to validate reachability and filter infeasible paths.
Load-bearing premise
The LLM correctly identifies which functions perform memory operations and what carries ownership, and Z3 accurately determines which paths are feasible.
What would settle it
A codebase containing many project-specific allocators where the LLM misclassifies them would cause MemHint to report no more leaks than unaugmented CodeQL or Infer.
Figures
read the original abstract
Memory leaks remain prevalent in real-world C/C++ software. Static analyzers such as CodeQL provide scalable program analysis but frequently miss such bugs because they cannot recognize project-specific custom memory-management functions and lack path-sensitive control-flow modeling. We present MemHint, a neuro-symbolic pipeline that addresses both limitations by combining LLMs' semantic understanding of code with Z3-based symbolic reasoning. MemHint parses the target codebase and applies an LLM to classify each function as a memory allocator, deallocator, or neither, producing function summaries that record which argument or return value carries memory ownership, extending the analyzer's built-in knowledge beyond standard primitives such as malloc and free. A Z3-based validation step checks each summary against the function's control-flow graph, discarding those whose claimed memory operation is unreachable on any feasible path. The validated summaries are injected into CodeQL and Infer via their respective extension mechanisms. Z3 path feasibility filtering then eliminates warnings on infeasible paths, and a final LLM-based validation step confirms whether each remaining warning is a genuine bug. On eight real-world C/C++ projects totaling over 3.6M lines of code, MemHint detects 54 unique memory leaks (53 confirmed/fixed) at approximately $1.7 per detected bug, compared to 19 by vanilla CodeQL and 3 by vanilla Infer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MemHint, a neuro-symbolic pipeline that uses an LLM to classify custom C/C++ memory-management functions (allocators/deallocators) and produce ownership summaries, validates them with Z3 reachability checks on the CFG, injects the summaries into CodeQL and Infer, applies Z3 path-feasibility filtering, and uses a final LLM step to confirm warnings. On seven real-world projects (>3.4 MLOC) it reports 52 unique leaks (49 confirmed/fixed, 4 CVEs) versus 19 for vanilla CodeQL and 3 for vanilla Infer, at roughly $1.7 per detected bug.
Significance. If the LLM classifications prove reliable, the work shows a practical, low-cost route to extending scalable static analyzers to project-specific APIs by combining semantic understanding with symbolic validation. The scale of the evaluation corpus, the concrete bug counts, independent confirmations, and cost metric are concrete strengths that would be of interest to the static-analysis and software-engineering communities.
major comments (3)
- [§3] §3 (LLM-based classification pipeline): No precision, recall, or sampled manual-audit numbers are reported for the LLM's labeling of custom allocators, deallocators, and ownership semantics on the 3.4 MLOC corpus. Because the generated summaries are injected directly into CodeQL and Infer, any systematic misclassification would inflate the reported 52 detections; the Z3 step only checks reachability of the claimed operation, not semantic correctness of the label.
- [§4] §4 (Experimental evaluation): The comparison with vanilla CodeQL and Infer lacks an ablation that removes either the LLM summaries or the Z3 filtering steps, so the individual contribution of each component to the jump from 19/3 to 52 leaks cannot be quantified. In addition, no false-negative analysis or sampling of missed leaks is provided.
- [§4] §4 (Bug confirmation): The claim that 49 of the 52 leaks were independently confirmed/fixed is load-bearing for the central improvement claim, yet the manuscript gives no protocol for the confirmation process, inter-annotator agreement, or how false positives were ruled out.
minor comments (3)
- [§4] The cost figure of approximately $1.7 per bug should be accompanied by an explicit breakdown (LLM API calls, Z3 queries, etc.) in the evaluation section.
- Clarify the exact extension mechanisms used to inject the validated summaries into CodeQL and Infer (e.g., which predicates or taint rules are overridden).
- A small number of typos and inconsistent capitalization appear in the abstract and section headings; a light copy-edit pass is recommended.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions.
read point-by-point responses
-
Referee: [§3] §3 (LLM-based classification pipeline): No precision, recall, or sampled manual-audit numbers are reported for the LLM's labeling of custom allocators, deallocators, and ownership semantics on the 3.4 MLOC corpus. Because the generated summaries are injected directly into CodeQL and Infer, any systematic misclassification would inflate the reported 52 detections; the Z3 step only checks reachability of the claimed operation, not semantic correctness of the label.
Authors: We agree that quantitative metrics for the LLM classification step are missing and would strengthen the evaluation. In the revision we will add a manual audit of a random sample of 100 functions drawn from the corpus, reporting precision, recall, and F1 scores separately for allocator/deallocator identification and for ownership-summary accuracy. While the Z3 reachability check discards summaries whose claimed operations are unreachable, we acknowledge it does not verify semantic correctness; the added audit will quantify any misclassification rate. The high rate of independently confirmed bugs (49/52) supplies supporting end-to-end evidence, but we will make the classification reliability explicit. revision: yes
-
Referee: [§4] §4 (Experimental evaluation): The comparison with vanilla CodeQL and Infer lacks an ablation that removes either the LLM summaries or the Z3 filtering steps, so the individual contribution of each component to the jump from 19/3 to 52 leaks cannot be quantified. In addition, no false-negative analysis or sampling of missed leaks is provided.
Authors: We concur that an ablation study is needed to isolate component contributions. We will add results in §4 for three configurations on the same seven projects: (i) vanilla CodeQL/Infer, (ii) augmented with LLM summaries only, and (iii) full pipeline with both LLM summaries and Z3 filtering. This will quantify the incremental gains. For false negatives we will sample 10 % of functions in the largest project, manually inspect for missed leaks, and report an estimated recall; we note that exhaustive ground truth is unavailable, but sampling provides a practical bound. revision: yes
-
Referee: [§4] §4 (Bug confirmation): The claim that 49 of the 52 leaks were independently confirmed/fixed is load-bearing for the central improvement claim, yet the manuscript gives no protocol for the confirmation process, inter-annotator agreement, or how false positives were ruled out.
Authors: We will expand §4 with a precise confirmation protocol: each of the 52 warnings was reviewed independently by two authors; disagreements were resolved by joint discussion until consensus. We will report the resulting inter-annotator agreement percentage. False positives were ruled out by (a) confirming via Z3 that no deallocation occurs on any feasible path and (b) manual inspection of the call graph and ownership flow. Where possible, confirmation was further supported by submitted patches or CVEs. These details will be added to the revised manuscript. revision: yes
Circularity Check
No significant circularity; empirical results on external projects with independent validation
full rationale
The paper describes a neuro-symbolic pipeline evaluated on seven external real-world C/C++ codebases (3.4M LOC total). Claims rest on direct comparisons to unmodified CodeQL and Infer baselines plus manual confirmation of detected bugs (49/52 fixed, CVEs submitted). No mathematical derivation chain, no fitted parameters renamed as predictions, no self-citations invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results. The LLM classification and Z3 filtering steps are tool components whose correctness is assessed via external outcomes rather than by construction from the same inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Z3 solver accurately determines path feasibility in the control-flow graph
- domain assumption LLM can reliably classify functions as allocators, deallocators, or neither based on code semantics
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.