CNnotator: LLM-Guided Memory Safety Annotation Synthesis

Mike Dodds; Twain Byrnes

arxiv: 2606.21822 · v1 · pith:EN6YQ4ZSnew · submitted 2026-06-20 · 💻 cs.PL · cs.AI· cs.SE

CNnotator: LLM-Guided Memory Safety Annotation Synthesis

Twain Byrnes , Mike Dodds This is my paper

Pith reviewed 2026-06-26 11:19 UTC · model grok-4.3

classification 💻 cs.PL cs.AIcs.SE

keywords memory safetyC programminglarge language modelsannotation synthesisprogram verificationspecification generationCN tool

0 comments

The pith

Large language models can synthesize CN annotations to represent memory usage in C programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can determine the implicit ways C code handles memory by producing annotations in the CN specification language. Memory safety errors remain a leading source of bugs in C systems, and making those usage patterns explicit is required before verification or migration to safer languages can occur. CNnotator prompts an LLM to propose candidate annotations and then runs them through CN's testing process to check correctness. On the small-to-medium programs examined, the best model reaches 90 percent success on its first attempt and 97 percent overall, while another model succeeds on 65 percent of first attempts. These rates indicate that the annotation task is becoming feasible for AI assistance rather than remaining a purely manual chore.

Core claim

CNnotator shows that LLMs can generate CN specifications that capture memory usage patterns in small-to-medium C programs. The system works by having the model propose annotations and then validating them through CN's hybrid testing and verification capabilities. Reported outcomes include 90 percent first-attempt success and 97 percent overall success for one model, together with 65 percent first-attempt success for another model. The authors conclude that this level of performance makes AI-assisted annotation practical for real-world C codebases.

What carries the argument

CNnotator, the LLM-guided loop that generates candidate CN memory annotations and validates them by running CN tests.

If this is right

Memory usage in existing C code can be documented without exhaustive manual reverse-engineering.
CN verification becomes applicable to more programs because the annotation step is automated.
Memory safety errors can be located earlier in the development or maintenance process.
Efforts to migrate C systems to memory-safe languages gain an initial explicit model of resource behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generate-and-test loop could be applied to other specification languages that describe resource usage.
Annotations produced this way might serve as starting points for stronger formal proofs once testing succeeds.
For very large codebases the method would likely need to be combined with modular decomposition so that LLMs receive manageable fragments.

Load-bearing premise

The small-to-medium programs chosen for testing are representative of memory usage patterns in larger, real-world C codebases, and successful passage of CN tests confirms that the generated annotations are correct and useful in practice.

What would settle it

A realistic C codebase where repeated LLM attempts produce no set of CN annotations that both pass CN testing and prevent actual memory errors during execution.

Figures

Figures reproduced from arXiv: 2606.21822 by Mike Dodds, Twain Byrnes.

**Figure 1.** Figure 1: CNnotator workflow. The tool preprocesses C files (merging and adding CN macros), then iterates through each [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: CNnotator performance across five LLM backends on 31 C functions. (a) Overall success rate. (b) First-attempt success [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Memory safety errors account for a large proportion of security bugs in systems written in C; modern languages such as Java and Rust prevent such bugs because they are memory-safe by design. To migrate systems to safer languages or identify memory errors, we must first determine how legacy code manipulates memory. This information is only represented implicitly in such code. In many cases, memory usage patterns are merely tedious for humans to figure out, rather than truly difficult. In this work, we ask if large language models (LLMs) can perform this task by having them synthesize annotations representing memory usage as specifications in CN, a hybrid testing/verification tool. Our tool, CNnotator, uses LLMs to automatically generate and test CN specifications. We find that current models are able to generate CN specifications for small-to-medium C programs, with the OpenAI o3 reasoning model achieving a 90% success rate on first attempts and 97% overall success, while the chat model GPT-4o correctly annotates 65% of first attempts. These results suggest AI-assisted annotation is becoming practical for real-world C codebases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs can generate CN annotations for small C programs at reported high rates, but the evaluation gives almost no details on programs or test strength so the numbers are hard to interpret.

read the letter

The paper's main point is that LLMs can synthesize CN memory-safety annotations for C code, with o3 reaching 90% first-try success and 97% overall while GPT-4o hits 65%. That is the concrete result they report.

What is new is the direct use of LLMs for CN specification synthesis. CN combines testing and verification, so automating the annotation step for legacy C code is a reasonable target. The work does a straightforward job of running the same task across models and reporting the percentages.

The soft spots sit in the evaluation. The abstract gives success rates but says nothing about how many programs were tested, how they were chosen, or what the CN test suites actually exercise. The stress-test concern lands: if the tests miss aliasing, pointer arithmetic, or other edge cases, then annotations that pass could still be wrong or incomplete. Without failure-mode analysis or controls, the numbers do not strongly support the claim that the approach is becoming practical.

This is for people working on program analysis tools for C, especially those interested in memory safety or migration assistance. A reader in that niche might pick up the idea of LLM-guided annotation and try it themselves, but would treat the reported rates as preliminary.

I would send it for peer review once the authors add the missing details on benchmarks and validation. The core idea is worth checking, but the current evidence is too thin for a strong conclusion.

Referee Report

2 major / 1 minor

Summary. The paper presents CNnotator, a tool that uses LLMs to automatically generate and test CN specifications for memory-safety properties in C programs. It claims that current models can perform this task on small-to-medium C programs, with the o3 reasoning model achieving 90% success on first attempts and 97% overall, and GPT-4o achieving 65% on first attempts. The work positions this as evidence that AI-assisted annotation is becoming practical for real-world C codebases.

Significance. If the empirical results hold under stronger evaluation, the work would show that LLMs can automate inference of implicit memory-usage patterns that are tedious but not fundamentally difficult, offering a practical aid for verification or language migration of legacy C code. The direct measurement of LLM performance on a concrete annotation task (rather than a fitted or self-referential metric) is a strength.

major comments (2)

[Evaluation] Evaluation section: The reported success rates (o3: 90% first-attempt / 97% overall; GPT-4o: 65% first-attempt) are presented without any information on the number of programs, selection criteria, size/complexity distribution, or failure-mode breakdown. This information is load-bearing for the central claim that the models succeed on representative small-to-medium C programs.
[Validation] Validation subsection: The paper treats passage of the CN test suites as confirmation that generated annotations are correct, yet provides no description of how the test inputs were chosen, whether they exercise aliasing or pointer-arithmetic cases, or any argument that the suites constitute a sound oracle. If the tests are derived from the same examples used in prompting or lack coverage, the success percentages do not establish soundness.

minor comments (1)

[Abstract] Abstract: The phrase 'small-to-medium C programs' is used without a concrete definition (e.g., LOC ranges or cyclomatic complexity), making it difficult to interpret the scope of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. The comments correctly identify gaps in the level of detail provided for the evaluation and validation. We will revise the manuscript to supply the missing information and strengthen the presentation of our experimental setup.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The reported success rates (o3: 90% first-attempt / 97% overall; GPT-4o: 65% first-attempt) are presented without any information on the number of programs, selection criteria, size/complexity distribution, or failure-mode breakdown. This information is load-bearing for the central claim that the models succeed on representative small-to-medium C programs.

Authors: We agree that the Evaluation section requires additional context to support the central claim. The current manuscript reports aggregate success rates but does not characterize the benchmark in detail. In the revised version we will add a dedicated subsection (or expanded table) that reports: the exact number of programs evaluated, the selection criteria and sources, the distribution of program sizes and complexity metrics (e.g., lines of code, number of allocations and pointers), and a categorized breakdown of observed failure modes. These additions will make the applicability to small-to-medium C programs explicit and reproducible. revision: yes
Referee: [Validation] Validation subsection: The paper treats passage of the CN test suites as confirmation that generated annotations are correct, yet provides no description of how the test inputs were chosen, whether they exercise aliasing or pointer-arithmetic cases, or any argument that the suites constitute a sound oracle. If the tests are derived from the same examples used in prompting or lack coverage, the success percentages do not establish soundness.

Authors: We acknowledge the need for a clearer justification of the validation oracle. The revised Validation subsection will describe the process used to construct the test inputs, explicitly note whether they cover aliasing, pointer arithmetic, and other relevant cases, and provide an argument for why the suites constitute a reasonable (if not exhaustive) oracle for the memory-safety properties expressed in CN. We will also clarify the relationship between the test cases and any examples appearing in prompts, and we will discuss the inherent limitations of test-based validation. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of LLM annotation success

full rationale

The paper reports measured success rates (o3 at 90%/97%, GPT-4o at 65%) from an experimental pipeline that prompts LLMs to produce CN annotations and then runs them through the CN testing harness on a set of C programs. No equations, fitted parameters, or derivations are present; the results are raw counts of annotations that pass the provided test suites. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The evaluation chain is self-contained and externally falsifiable via the same LLM+CN procedure on new programs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is an empirical tool evaluation rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5724 in / 1003 out tokens · 23125 ms · 2026-06-26T11:19:33.207686+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 5 canonical work pages

[1]

Aamer and Benjamin C

Zain K. Aamer and Benjamin C. Pierce. 2025. Bennet: Randomized Specification Testing for Heap-Manipulating Programs.Proceedings of the ACM on Program- ming Languages9, OOPSLA2 (2025), 3924–3953. doi:10.1145/3764115

work page doi:10.1145/3764115 2025
[2]

Anthropic. 2025. Claude Code: An Agentic CLI Coding Assistant. Anthropic. https://claude.ai/code Accessed 2025

2025
[3]

Rini Banerjee, Kayvan Memarian, Dhruv Makwana, Christopher Pulte, Neel Krishnaswami, and Peter Sewell. 2025. Fulminate: Testing CN Separation-Logic Specifications in C.Proceedings of the ACM on Programming Languages9, POPL, Article 43 (2025), 33 pages. doi:10.1145/3704879

work page doi:10.1145/3704879 2025
[4]

Le, Christopher Ré, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large Language Monkeys: Scal- ing Inference Compute with Repeated Sampling. arXiv:2407.21787 [cs.LG] https://arxiv.org/abs/2407.21787

Pith/arXiv arXiv 2024
[5]

Twain Byrnes. 2025. Escaping Isla Nublar: Coming Around to LLMs for Formal Methods. Galois Blog. https://www.galois.com/articles/escaping-isla-nublar- coming-around-to-llms-for-formal-methods Published August 18, 2025

2025
[6]

Harrison Chase. 2022. LangChain. GitHub repository. https://github.com/ langchain-ai/langchain Accessed 2025

2022
[7]

Chromium Security Team. 2025. Memory Safety. Chromium Project Documenta- tion. https://www.chromium.org/Home/chromium-security/memory-safety/ Accessed 2025

2025
[8]

Citrus Contributors. 2017. Citrus: C to Rust Converter. GitLab repository. https: //gitlab.com/citrus-rs/citrus Accessed 2025

2017
[9]

DARPA Information Innovation Office. 2024. TRACTOR: Translating All C to Rust. DARPA Program Information. https://www.darpa.mil/program/translating- all-c-to-rust Accessed 2025

2024
[10]

Chongzhou Fang, Ning Miao, Shaurya Srivastav, Jialin Liu, Ruoyu Zhang, Ruijie Fang, Asmita, Ryan Tsang, Najmeh Nazari, Han Wang, and Houman Homayoun
[11]

InProceedings of the 33rd USENIX Security Symposium (USENIX Security 24)

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?. InProceedings of the 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadelphia, PA, 829–846. https://www.usenix.org/ conference/usenixsecurity24/presentation/fang
[12]

Immunant and Galois. 2024. C2Rust: Migrate C Code to Rust. GitHub repository. https://github.com/immunant/c2rust Accessed 2025

2024
[13]

Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, and Isil Dillig. 2025. CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation. arXiv:2504.15254 [cs.SE] https://arxiv.org/abs/ 2504.15254

arXiv 2025
[14]

Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, and Isil Dillig. 2025. CRUST-Bench Project Website. Project Website. https://crust-bench.github.io Accessed 2025

2025
[15]

Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2015. Frama-C: A Software Analysis Perspective.Formal Aspects of Computing27, 3 (2015), 573–609. doi:10.1007/s00165-014-0326-7

work page doi:10.1007/s00165-014-0326-7 2015
[16]

LangChain Inc. 2024. LangGraph: Build Resilient Language Agents as Graphs. GitHub repository. https://github.com/langchain-ai/langgraph Accessed 2025

2024
[17]

Wei Ma, Shangqing Liu, Zhihao Lin, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, Li Li, and Yang Liu. 2024. LMs: Understanding Code Syntax and Semantics for Code Analysis. arXiv:2305.12138 [cs.SE] https://arxiv.org/abs/ 2305.12138

Pith/arXiv arXiv 2024
[18]

Vikram Nitin, Rahul Krishna, Luiz Lemos do Valle, and Baishakhi Ray. 2025. C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Tech- niques. arXiv:2501.14257 [cs.SE] https://arxiv.org/abs/2501.14257

arXiv 2025
[19]

Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and Neel Krishnaswami

Christopher Pulte, Dhruv C. Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and Neel Krishnaswami. 2023. CN: Verifying Systems C Code with Separation-Logic Refinement Types.Proceedings of the ACM on Programming Languages7, POPL (2023), 191–220. doi:10.1145/3571194

work page doi:10.1145/3571194 2023
[20]

Pierce, Cole Schlesinger, Jessica Shi, and Elizabeth Austell

Christopher Pulte, Benjamin C. Pierce, Cole Schlesinger, Jessica Shi, and Elizabeth Austell. 2025. CN Tutorial. Online tutorial. https://rems-project.github.io/cn- tutorial/ Accessed 2025

2025
[21]

Reynolds

John C. Reynolds. 2002. Separation Logic: A Logic for Shared Mutable Data Struc- tures. InProceedings of the 17th Annual IEEE Symposium on Logic in Computer Sci- ence (LICS). IEEE, Copenhagen, Denmark, 55–74. doi:10.1109/LICS.2002.1029817

work page doi:10.1109/lics.2002.1029817 2002
[22]

Jamey Sharp. 2016. Corrode: C to Rust Translator. GitHub repository. https: //github.com/jameysharp/corrode Accessed 2025

2016
[23]

Tree-sitter Contributors. 2024. Tree-sitter: An Incremental Parsing System for Programming Tools. GitHub repository. https://github.com/tree-sitter/tree-sitter Accessed 2025

2024

[1] [1]

Aamer and Benjamin C

Zain K. Aamer and Benjamin C. Pierce. 2025. Bennet: Randomized Specification Testing for Heap-Manipulating Programs.Proceedings of the ACM on Program- ming Languages9, OOPSLA2 (2025), 3924–3953. doi:10.1145/3764115

work page doi:10.1145/3764115 2025

[2] [2]

Anthropic. 2025. Claude Code: An Agentic CLI Coding Assistant. Anthropic. https://claude.ai/code Accessed 2025

2025

[3] [3]

Rini Banerjee, Kayvan Memarian, Dhruv Makwana, Christopher Pulte, Neel Krishnaswami, and Peter Sewell. 2025. Fulminate: Testing CN Separation-Logic Specifications in C.Proceedings of the ACM on Programming Languages9, POPL, Article 43 (2025), 33 pages. doi:10.1145/3704879

work page doi:10.1145/3704879 2025

[4] [4]

Le, Christopher Ré, and Azalia Mirhoseini

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. 2024. Large Language Monkeys: Scal- ing Inference Compute with Repeated Sampling. arXiv:2407.21787 [cs.LG] https://arxiv.org/abs/2407.21787

Pith/arXiv arXiv 2024

[5] [5]

Twain Byrnes. 2025. Escaping Isla Nublar: Coming Around to LLMs for Formal Methods. Galois Blog. https://www.galois.com/articles/escaping-isla-nublar- coming-around-to-llms-for-formal-methods Published August 18, 2025

2025

[6] [6]

Harrison Chase. 2022. LangChain. GitHub repository. https://github.com/ langchain-ai/langchain Accessed 2025

2022

[7] [7]

Chromium Security Team. 2025. Memory Safety. Chromium Project Documenta- tion. https://www.chromium.org/Home/chromium-security/memory-safety/ Accessed 2025

2025

[8] [8]

Citrus Contributors. 2017. Citrus: C to Rust Converter. GitLab repository. https: //gitlab.com/citrus-rs/citrus Accessed 2025

2017

[9] [9]

DARPA Information Innovation Office. 2024. TRACTOR: Translating All C to Rust. DARPA Program Information. https://www.darpa.mil/program/translating- all-c-to-rust Accessed 2025

2024

[10] [10]

Chongzhou Fang, Ning Miao, Shaurya Srivastav, Jialin Liu, Ruoyu Zhang, Ruijie Fang, Asmita, Ryan Tsang, Najmeh Nazari, Han Wang, and Houman Homayoun

[11] [11]

InProceedings of the 33rd USENIX Security Symposium (USENIX Security 24)

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?. InProceedings of the 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadelphia, PA, 829–846. https://www.usenix.org/ conference/usenixsecurity24/presentation/fang

[12] [12]

Immunant and Galois. 2024. C2Rust: Migrate C Code to Rust. GitHub repository. https://github.com/immunant/c2rust Accessed 2025

2024

[13] [13]

Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, and Isil Dillig. 2025. CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation. arXiv:2504.15254 [cs.SE] https://arxiv.org/abs/ 2504.15254

arXiv 2025

[14] [14]

Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, and Isil Dillig. 2025. CRUST-Bench Project Website. Project Website. https://crust-bench.github.io Accessed 2025

2025

[15] [15]

Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2015. Frama-C: A Software Analysis Perspective.Formal Aspects of Computing27, 3 (2015), 573–609. doi:10.1007/s00165-014-0326-7

work page doi:10.1007/s00165-014-0326-7 2015

[16] [16]

LangChain Inc. 2024. LangGraph: Build Resilient Language Agents as Graphs. GitHub repository. https://github.com/langchain-ai/langgraph Accessed 2025

2024

[17] [17]

Wei Ma, Shangqing Liu, Zhihao Lin, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, Li Li, and Yang Liu. 2024. LMs: Understanding Code Syntax and Semantics for Code Analysis. arXiv:2305.12138 [cs.SE] https://arxiv.org/abs/ 2305.12138

Pith/arXiv arXiv 2024

[18] [18]

Vikram Nitin, Rahul Krishna, Luiz Lemos do Valle, and Baishakhi Ray. 2025. C2SaferRust: Transforming C Projects into Safer Rust with NeuroSymbolic Tech- niques. arXiv:2501.14257 [cs.SE] https://arxiv.org/abs/2501.14257

arXiv 2025

[19] [19]

Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and Neel Krishnaswami

Christopher Pulte, Dhruv C. Makwana, Thomas Sewell, Kayvan Memarian, Peter Sewell, and Neel Krishnaswami. 2023. CN: Verifying Systems C Code with Separation-Logic Refinement Types.Proceedings of the ACM on Programming Languages7, POPL (2023), 191–220. doi:10.1145/3571194

work page doi:10.1145/3571194 2023

[20] [20]

Pierce, Cole Schlesinger, Jessica Shi, and Elizabeth Austell

Christopher Pulte, Benjamin C. Pierce, Cole Schlesinger, Jessica Shi, and Elizabeth Austell. 2025. CN Tutorial. Online tutorial. https://rems-project.github.io/cn- tutorial/ Accessed 2025

2025

[21] [21]

Reynolds

John C. Reynolds. 2002. Separation Logic: A Logic for Shared Mutable Data Struc- tures. InProceedings of the 17th Annual IEEE Symposium on Logic in Computer Sci- ence (LICS). IEEE, Copenhagen, Denmark, 55–74. doi:10.1109/LICS.2002.1029817

work page doi:10.1109/lics.2002.1029817 2002

[22] [22]

Jamey Sharp. 2016. Corrode: C to Rust Translator. GitHub repository. https: //github.com/jameysharp/corrode Accessed 2025

2016

[23] [23]

Tree-sitter Contributors. 2024. Tree-sitter: An Incremental Parsing System for Programming Tools. GitHub repository. https://github.com/tree-sitter/tree-sitter Accessed 2025

2024