pith. sign in

arxiv: 2606.21822 · v1 · pith:EN6YQ4ZSnew · submitted 2026-06-20 · 💻 cs.PL · cs.AI· cs.SE

CNnotator: LLM-Guided Memory Safety Annotation Synthesis

Pith reviewed 2026-06-26 11:19 UTC · model grok-4.3

classification 💻 cs.PL cs.AIcs.SE
keywords memory safetyC programminglarge language modelsannotation synthesisprogram verificationspecification generationCN tool
0
0 comments X

The pith

Large language models can synthesize CN annotations to represent memory usage in C programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can determine the implicit ways C code handles memory by producing annotations in the CN specification language. Memory safety errors remain a leading source of bugs in C systems, and making those usage patterns explicit is required before verification or migration to safer languages can occur. CNnotator prompts an LLM to propose candidate annotations and then runs them through CN's testing process to check correctness. On the small-to-medium programs examined, the best model reaches 90 percent success on its first attempt and 97 percent overall, while another model succeeds on 65 percent of first attempts. These rates indicate that the annotation task is becoming feasible for AI assistance rather than remaining a purely manual chore.

Core claim

CNnotator shows that LLMs can generate CN specifications that capture memory usage patterns in small-to-medium C programs. The system works by having the model propose annotations and then validating them through CN's hybrid testing and verification capabilities. Reported outcomes include 90 percent first-attempt success and 97 percent overall success for one model, together with 65 percent first-attempt success for another model. The authors conclude that this level of performance makes AI-assisted annotation practical for real-world C codebases.

What carries the argument

CNnotator, the LLM-guided loop that generates candidate CN memory annotations and validates them by running CN tests.

If this is right

  • Memory usage in existing C code can be documented without exhaustive manual reverse-engineering.
  • CN verification becomes applicable to more programs because the annotation step is automated.
  • Memory safety errors can be located earlier in the development or maintenance process.
  • Efforts to migrate C systems to memory-safe languages gain an initial explicit model of resource behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generate-and-test loop could be applied to other specification languages that describe resource usage.
  • Annotations produced this way might serve as starting points for stronger formal proofs once testing succeeds.
  • For very large codebases the method would likely need to be combined with modular decomposition so that LLMs receive manageable fragments.

Load-bearing premise

The small-to-medium programs chosen for testing are representative of memory usage patterns in larger, real-world C codebases, and successful passage of CN tests confirms that the generated annotations are correct and useful in practice.

What would settle it

A realistic C codebase where repeated LLM attempts produce no set of CN annotations that both pass CN testing and prevent actual memory errors during execution.

Figures

Figures reproduced from arXiv: 2606.21822 by Mike Dodds, Twain Byrnes.

Figure 1
Figure 1. Figure 1: CNnotator workflow. The tool preprocesses C files (merging and adding CN macros), then iterates through each [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CNnotator performance across five LLM backends on 31 C functions. (a) Overall success rate. (b) First-attempt success [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

Memory safety errors account for a large proportion of security bugs in systems written in C; modern languages such as Java and Rust prevent such bugs because they are memory-safe by design. To migrate systems to safer languages or identify memory errors, we must first determine how legacy code manipulates memory. This information is only represented implicitly in such code. In many cases, memory usage patterns are merely tedious for humans to figure out, rather than truly difficult. In this work, we ask if large language models (LLMs) can perform this task by having them synthesize annotations representing memory usage as specifications in CN, a hybrid testing/verification tool. Our tool, CNnotator, uses LLMs to automatically generate and test CN specifications. We find that current models are able to generate CN specifications for small-to-medium C programs, with the OpenAI o3 reasoning model achieving a 90% success rate on first attempts and 97% overall success, while the chat model GPT-4o correctly annotates 65% of first attempts. These results suggest AI-assisted annotation is becoming practical for real-world C codebases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents CNnotator, a tool that uses LLMs to automatically generate and test CN specifications for memory-safety properties in C programs. It claims that current models can perform this task on small-to-medium C programs, with the o3 reasoning model achieving 90% success on first attempts and 97% overall, and GPT-4o achieving 65% on first attempts. The work positions this as evidence that AI-assisted annotation is becoming practical for real-world C codebases.

Significance. If the empirical results hold under stronger evaluation, the work would show that LLMs can automate inference of implicit memory-usage patterns that are tedious but not fundamentally difficult, offering a practical aid for verification or language migration of legacy C code. The direct measurement of LLM performance on a concrete annotation task (rather than a fitted or self-referential metric) is a strength.

major comments (2)
  1. [Evaluation] Evaluation section: The reported success rates (o3: 90% first-attempt / 97% overall; GPT-4o: 65% first-attempt) are presented without any information on the number of programs, selection criteria, size/complexity distribution, or failure-mode breakdown. This information is load-bearing for the central claim that the models succeed on representative small-to-medium C programs.
  2. [Validation] Validation subsection: The paper treats passage of the CN test suites as confirmation that generated annotations are correct, yet provides no description of how the test inputs were chosen, whether they exercise aliasing or pointer-arithmetic cases, or any argument that the suites constitute a sound oracle. If the tests are derived from the same examples used in prompting or lack coverage, the success percentages do not establish soundness.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'small-to-medium C programs' is used without a concrete definition (e.g., LOC ranges or cyclomatic complexity), making it difficult to interpret the scope of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. The comments correctly identify gaps in the level of detail provided for the evaluation and validation. We will revise the manuscript to supply the missing information and strengthen the presentation of our experimental setup.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The reported success rates (o3: 90% first-attempt / 97% overall; GPT-4o: 65% first-attempt) are presented without any information on the number of programs, selection criteria, size/complexity distribution, or failure-mode breakdown. This information is load-bearing for the central claim that the models succeed on representative small-to-medium C programs.

    Authors: We agree that the Evaluation section requires additional context to support the central claim. The current manuscript reports aggregate success rates but does not characterize the benchmark in detail. In the revised version we will add a dedicated subsection (or expanded table) that reports: the exact number of programs evaluated, the selection criteria and sources, the distribution of program sizes and complexity metrics (e.g., lines of code, number of allocations and pointers), and a categorized breakdown of observed failure modes. These additions will make the applicability to small-to-medium C programs explicit and reproducible. revision: yes

  2. Referee: [Validation] Validation subsection: The paper treats passage of the CN test suites as confirmation that generated annotations are correct, yet provides no description of how the test inputs were chosen, whether they exercise aliasing or pointer-arithmetic cases, or any argument that the suites constitute a sound oracle. If the tests are derived from the same examples used in prompting or lack coverage, the success percentages do not establish soundness.

    Authors: We acknowledge the need for a clearer justification of the validation oracle. The revised Validation subsection will describe the process used to construct the test inputs, explicitly note whether they cover aliasing, pointer arithmetic, and other relevant cases, and provide an argument for why the suites constitute a reasonable (if not exhaustive) oracle for the memory-safety properties expressed in CN. We will also clarify the relationship between the test cases and any examples appearing in prompts, and we will discuss the inherent limitations of test-based validation. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of LLM annotation success

full rationale

The paper reports measured success rates (o3 at 90%/97%, GPT-4o at 65%) from an experimental pipeline that prompts LLMs to produce CN annotations and then runs them through the CN testing harness on a set of C programs. No equations, fitted parameters, or derivations are present; the results are raw counts of annotations that pass the provided test suites. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The evaluation chain is self-contained and externally falsifiable via the same LLM+CN procedure on new programs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is an empirical tool evaluation rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5724 in / 1003 out tokens · 23125 ms · 2026-06-26T11:19:33.207686+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 5 canonical work pages

  1. [1]

    Aamer and Benjamin C

    Zain K. Aamer and Benjamin C. Pierce. 2025. Bennet: Randomized Specification Testing for Heap-Manipulating Programs.Proceedings of the ACM on Program- ming Languages9, OOPSLA2 (2025), 3924–3953. doi:10.1145/3764115

  2. [2]

    Anthropic. 2025. Claude Code: An Agentic CLI Coding Assistant. Anthropic. https://claude.ai/code Accessed 2025

  3. [3]

    Rini Banerjee, Kayvan Memarian, Dhruv Makwana, Christopher Pulte, Neel Krishnaswami, and Peter Sewell. 2025. Fulminate: Testing CN Separation-Logic Specifications in C.Proceedings of the ACM on Programming Languages9, POPL, Article 43 (2025), 33 pages. doi:10.1145/3704879

  4. [4]

    Le, Christopher Ré, and Azalia Mirhoseini

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large Language Monkeys: Scal- ing Inference Compute with Repeated Sampling. arXiv:2407.21787 [cs.LG] https://arxiv.org/abs/2407.21787

  5. [5]

    Twain Byrnes. 2025. Escaping Isla Nublar: Coming Around to LLMs for Formal Methods. Galois Blog. https://www.galois.com/articles/escaping-isla-nublar- coming-around-to-llms-for-formal-methods Published August 18, 2025

  6. [6]

    Harrison Chase. 2022. LangChain. GitHub repository. https://github.com/ langchain-ai/langchain Accessed 2025

  7. [7]

    Chromium Security Team. 2025. Memory Safety. Chromium Project Documenta- tion. https://www.chromium.org/Home/chromium-security/memory-safety/ Accessed 2025

  8. [8]

    Citrus Contributors. 2017. Citrus: C to Rust Converter. GitLab repository. https: //gitlab.com/citrus-rs/citrus Accessed 2025

  9. [9]

    DARPA Information Innovation Office. 2024. TRACTOR: Translating All C to Rust. DARPA Program Information. https://www.darpa.mil/program/translating- all-c-to-rust Accessed 2025

  10. [10]

    Chongzhou Fang, Ning Miao, Shaurya Srivastav, Jialin Liu, Ruoyu Zhang, Ruijie Fang, Asmita, Ryan Tsang, Najmeh Nazari, Han Wang, and Houman Homayoun

  11. [11]

    InProceedings of the 33rd USENIX Security Symposium (USENIX Security 24)

    Large Language Models for Code Analysis: Do LLMs Really Do Their Job?. InProceedings of the 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadelphia, PA, 829–846. https://www.usenix.org/ conference/usenixsecurity24/presentation/fang

  12. [12]

    Immunant and Galois. 2024. C2Rust: Migrate C Code to Rust. GitHub repository. https://github.com/immunant/c2rust Accessed 2025

  13. [13]

    Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, and Isil Dillig. 2025. CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation. arXiv:2504.15254 [cs.SE] https://arxiv.org/abs/ 2504.15254

  14. [14]

    Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, and Isil Dillig. 2025. CRUST-Bench Project Website. Project Website. https://crust-bench.github.io Accessed 2025

  15. [15]

    Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2015. Frama-C: A Software Analysis Perspective.Formal Aspects of Computing27, 3 (2015), 573–609. doi:10.1007/s00165-014-0326-7

  16. [16]

    LangChain Inc. 2024. LangGraph: Build Resilient Language Agents as Graphs. GitHub repository. https://github.com/langchain-ai/langgraph Accessed 2025

  17. [17]

    Wei Ma, Shangqing Liu, Zhihao Lin, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, Li Li, and Yang Liu. 2024. LMs: Understanding Code Syntax and Semantics for Code Analysis. arXiv:2305.12138 [cs.SE] https://arxiv.org/abs/ 2305.12138

  18. [18]

    Vikram Nitin, Rahul Krishna, Luiz Lemos do Valle, and Baishakhi Ray. 2025. C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Tech- niques. arXiv:2501.14257 [cs.SE] https://arxiv.org/abs/2501.14257

  19. [19]

    Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and Neel Krishnaswami

    Christopher Pulte, Dhruv C. Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and Neel Krishnaswami. 2023. CN: Verifying Systems C Code with Separation-Logic Refinement Types.Proceedings of the ACM on Programming Languages7, POPL (2023), 191–220. doi:10.1145/3571194

  20. [20]

    Pierce, Cole Schlesinger, Jessica Shi, and Elizabeth Austell

    Christopher Pulte, Benjamin C. Pierce, Cole Schlesinger, Jessica Shi, and Elizabeth Austell. 2025. CN Tutorial. Online tutorial. https://rems-project.github.io/cn- tutorial/ Accessed 2025

  21. [21]

    Reynolds

    John C. Reynolds. 2002. Separation Logic: A Logic for Shared Mutable Data Struc- tures. InProceedings of the 17th Annual IEEE Symposium on Logic in Computer Sci- ence (LICS). IEEE, Copenhagen, Denmark, 55–74. doi:10.1109/LICS.2002.1029817

  22. [22]

    Jamey Sharp. 2016. Corrode: C to Rust Translator. GitHub repository. https: //github.com/jameysharp/corrode Accessed 2025

  23. [23]

    Tree-sitter Contributors. 2024. Tree-sitter: An Incremental Parsing System for Programming Tools. GitHub repository. https://github.com/tree-sitter/tree-sitter Accessed 2025