pith. sign in

arxiv: 2603.27224 · v4 · pith:QDEQBASNnew · submitted 2026-03-28 · 💻 cs.SE · cs.CR

Finding Memory Leaks in C/C++ Programs via Neuro-Symbolic Augmented Static Analysis

Pith reviewed 2026-05-14 22:30 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords memory leaksstatic analysisC/C++LLMZ3CodeQLbug detectionneuro-symbolic
0
0 comments X

The pith

MemHint augments static analyzers with LLMs and Z3 to detect 52 memory leaks in 3.4 million lines of C/C++ code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MemHint, a neuro-symbolic system that improves static analysis for memory leaks in C and C++ programs. It uses large language models to classify custom memory allocation and deallocation functions beyond standard primitives and to record ownership details for arguments or return values. Z3 then validates each classification against the function's control-flow graph and later filters warnings to infeasible paths only. The validated summaries are injected into CodeQL and Infer via their extension points, after which an LLM confirms which remaining warnings represent genuine leaks. On seven real projects totaling over 3.4 million lines, the pipeline produced 52 unique detections, 49 of which were confirmed and fixed, at roughly $1.7 per bug.

Core claim

MemHint parses a target codebase, applies an LLM to label each function as allocator, deallocator or neither while producing ownership summaries, discards any summary whose claimed operation cannot occur on a feasible path according to Z3, injects the surviving summaries into CodeQL and Infer, uses Z3 to drop warnings on infeasible paths, and runs a final LLM check to retain only genuine bugs.

What carries the argument

Neuro-symbolic pipeline that uses LLM classification of custom memory functions to produce ownership summaries and Z3 to validate reachability and filter infeasible paths.

Load-bearing premise

The LLM correctly identifies which functions perform memory operations and what carries ownership, and Z3 accurately determines which paths are feasible.

What would settle it

A codebase containing many project-specific allocators where the LLM misclassifies them would cause MemHint to report no more leaks than unaugmented CodeQL or Infer.

Figures

Figures reproduced from arXiv: 2603.27224 by Bo Wang, David Lo, Huihui Huang, Jieke Shi, Zhou Yang.

Figure 1
Figure 1. Figure 1: A memory leak in OpenSSL (v3.6.1) [5 lll➊hh f igure 1: A memory leak in OpenSSL (v3.6.1) [5 Figure 1: A memory leak in OpenSSL (v3.6.1) [ [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the MemHint pipeline. Stage 1 (Summary Generation): extract metadata, generate summaries with an LLM, and validate with Z3 SMT solver. Stage 2 (Summary-Augmented Analysis): inject validated summaries into static analyzers (e.g., CodeQL, Infer). Stage 3 (Warning Validation): filter infeasible warnings with Z3 and validate remaining ones using an LLM. (a) Valid Allocator Entry 𝑏1 Alloc Ret null R… view at source ↗
Figure 4
Figure 4. Figure 4: Z3-based Summary Validation. (a)–(b) Allocator: (a) [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Z3-based memory-leak feasibility checking Figure 5: Z3-based memory-leak feasibility checking [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

Memory leaks remain prevalent in real-world C/C++ software. Static analyzers such as CodeQL provide scalable program analysis but frequently miss such bugs because they cannot recognize project-specific custom memory-management functions and lack path-sensitive control-flow modeling. We present MemHint, a neuro-symbolic pipeline that addresses both limitations by combining LLMs' semantic understanding of code with Z3-based symbolic reasoning. MemHint parses the target codebase and applies an LLM to classify each function as a memory allocator, deallocator, or neither, producing function summaries that record which argument or return value carries memory ownership, extending the analyzer's built-in knowledge beyond standard primitives such as malloc and free. A Z3-based validation step checks each summary against the function's control-flow graph, discarding those whose claimed memory operation is unreachable on any feasible path. The validated summaries are injected into CodeQL and Infer via their respective extension mechanisms. Z3 path feasibility filtering then eliminates warnings on infeasible paths, and a final LLM-based validation step confirms whether each remaining warning is a genuine bug. On eight real-world C/C++ projects totaling over 3.6M lines of code, MemHint detects 54 unique memory leaks (53 confirmed/fixed) at approximately $1.7 per detected bug, compared to 19 by vanilla CodeQL and 3 by vanilla Infer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MemHint, a neuro-symbolic pipeline that uses an LLM to classify custom C/C++ memory-management functions (allocators/deallocators) and produce ownership summaries, validates them with Z3 reachability checks on the CFG, injects the summaries into CodeQL and Infer, applies Z3 path-feasibility filtering, and uses a final LLM step to confirm warnings. On seven real-world projects (>3.4 MLOC) it reports 52 unique leaks (49 confirmed/fixed, 4 CVEs) versus 19 for vanilla CodeQL and 3 for vanilla Infer, at roughly $1.7 per detected bug.

Significance. If the LLM classifications prove reliable, the work shows a practical, low-cost route to extending scalable static analyzers to project-specific APIs by combining semantic understanding with symbolic validation. The scale of the evaluation corpus, the concrete bug counts, independent confirmations, and cost metric are concrete strengths that would be of interest to the static-analysis and software-engineering communities.

major comments (3)
  1. [§3] §3 (LLM-based classification pipeline): No precision, recall, or sampled manual-audit numbers are reported for the LLM's labeling of custom allocators, deallocators, and ownership semantics on the 3.4 MLOC corpus. Because the generated summaries are injected directly into CodeQL and Infer, any systematic misclassification would inflate the reported 52 detections; the Z3 step only checks reachability of the claimed operation, not semantic correctness of the label.
  2. [§4] §4 (Experimental evaluation): The comparison with vanilla CodeQL and Infer lacks an ablation that removes either the LLM summaries or the Z3 filtering steps, so the individual contribution of each component to the jump from 19/3 to 52 leaks cannot be quantified. In addition, no false-negative analysis or sampling of missed leaks is provided.
  3. [§4] §4 (Bug confirmation): The claim that 49 of the 52 leaks were independently confirmed/fixed is load-bearing for the central improvement claim, yet the manuscript gives no protocol for the confirmation process, inter-annotator agreement, or how false positives were ruled out.
minor comments (3)
  1. [§4] The cost figure of approximately $1.7 per bug should be accompanied by an explicit breakdown (LLM API calls, Z3 queries, etc.) in the evaluation section.
  2. Clarify the exact extension mechanisms used to inject the validated summaries into CodeQL and Infer (e.g., which predicates or taint rules are overridden).
  3. A small number of typos and inconsistent capitalization appear in the abstract and section headings; a light copy-edit pass is recommended.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the planned revisions.

read point-by-point responses
  1. Referee: [§3] §3 (LLM-based classification pipeline): No precision, recall, or sampled manual-audit numbers are reported for the LLM's labeling of custom allocators, deallocators, and ownership semantics on the 3.4 MLOC corpus. Because the generated summaries are injected directly into CodeQL and Infer, any systematic misclassification would inflate the reported 52 detections; the Z3 step only checks reachability of the claimed operation, not semantic correctness of the label.

    Authors: We agree that quantitative metrics for the LLM classification step are missing and would strengthen the evaluation. In the revision we will add a manual audit of a random sample of 100 functions drawn from the corpus, reporting precision, recall, and F1 scores separately for allocator/deallocator identification and for ownership-summary accuracy. While the Z3 reachability check discards summaries whose claimed operations are unreachable, we acknowledge it does not verify semantic correctness; the added audit will quantify any misclassification rate. The high rate of independently confirmed bugs (49/52) supplies supporting end-to-end evidence, but we will make the classification reliability explicit. revision: yes

  2. Referee: [§4] §4 (Experimental evaluation): The comparison with vanilla CodeQL and Infer lacks an ablation that removes either the LLM summaries or the Z3 filtering steps, so the individual contribution of each component to the jump from 19/3 to 52 leaks cannot be quantified. In addition, no false-negative analysis or sampling of missed leaks is provided.

    Authors: We concur that an ablation study is needed to isolate component contributions. We will add results in §4 for three configurations on the same seven projects: (i) vanilla CodeQL/Infer, (ii) augmented with LLM summaries only, and (iii) full pipeline with both LLM summaries and Z3 filtering. This will quantify the incremental gains. For false negatives we will sample 10 % of functions in the largest project, manually inspect for missed leaks, and report an estimated recall; we note that exhaustive ground truth is unavailable, but sampling provides a practical bound. revision: yes

  3. Referee: [§4] §4 (Bug confirmation): The claim that 49 of the 52 leaks were independently confirmed/fixed is load-bearing for the central improvement claim, yet the manuscript gives no protocol for the confirmation process, inter-annotator agreement, or how false positives were ruled out.

    Authors: We will expand §4 with a precise confirmation protocol: each of the 52 warnings was reviewed independently by two authors; disagreements were resolved by joint discussion until consensus. We will report the resulting inter-annotator agreement percentage. False positives were ruled out by (a) confirming via Z3 that no deallocation occurs on any feasible path and (b) manual inspection of the call graph and ownership flow. Where possible, confirmation was further supported by submitted patches or CVEs. These details will be added to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on external projects with independent validation

full rationale

The paper describes a neuro-symbolic pipeline evaluated on seven external real-world C/C++ codebases (3.4M LOC total). Claims rest on direct comparisons to unmodified CodeQL and Infer baselines plus manual confirmation of detected bugs (49/52 fixed, CVEs submitted). No mathematical derivation chain, no fitted parameters renamed as predictions, no self-citations invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results. The LLM classification and Z3 filtering steps are tool components whose correctness is assessed via external outcomes rather than by construction from the same inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about LLM semantic understanding and solver correctness rather than new fitted parameters or invented entities.

axioms (2)
  • standard math Z3 solver accurately determines path feasibility in the control-flow graph
    Invoked in the validation step to discard unreachable summaries.
  • domain assumption LLM can reliably classify functions as allocators, deallocators, or neither based on code semantics
    Core step that extends the analyzer beyond built-in primitives.

pith-pipeline@v0.9.0 · 5552 in / 1392 out tokens · 44422 ms · 2026-05-14T22:30:44.761213+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.