pith. sign in

arxiv: 2602.13851 · v3 · submitted 2026-02-14 · 💻 cs.SE · cs.AI

Evaluating LLM-Generated ACSL Annotations for Formal Verification

Pith reviewed 2026-05-15 22:12 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords ACSL annotationsformal verificationLLM-generated specificationsFrama-CCASP benchmarkrule-based annotationSMT solversC program verification
0
0 comments X

The pith

Rule-based methods produce more reliable ACSL annotations than one-shot LLM outputs for C program verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates strategies for automatically creating ACSL annotations that allow formal verification of real-world C programs. It directly compares a rule-based Python script, Frama-C's RTE plugin, and three large language models on a filtered subset of the CASP benchmark. Each set of generated annotations is fed into Frama-C's WP plugin together with multiple SMT solvers to measure proof success rates, timeouts, and processing time. A sympathetic reader would care because dependable software requires accurate specifications, and the results indicate whether current LLMs can practically assist or replace established automation techniques in this domain. The study concludes that rule-based approaches deliver more consistent verification success while LLM performance varies substantially.

Core claim

Through one-shot generation on the filtered CASP benchmark, the paper shows that rule-based annotation methods achieve higher rates of successful proofs under Frama-C's WP plugin and SMT solvers than the outputs produced by DeepSeek-V3.2, GPT-5.2, or OLMo 3.1 32B Instruct, with the latter exhibiting greater variability in both success and solver behavior.

What carries the argument

The one-shot annotation generation comparison, executed via Frama-C WP on the filtered CASP benchmark and measured by proof success rates across multiple SMT solvers.

If this is right

  • Rule-based scripts remain the safer default choice when high assurance verification of C code is required.
  • LLM outputs can still serve as starting points but need post-processing or additional checks to reach reliable proof rates.
  • One-shot LLM annotation generation is currently insufficient as a standalone replacement for established automation in formal verification pipelines.
  • Further prompt refinement or hybrid rule-plus-LLM workflows would be needed before LLMs become primary tools for ACSL specification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A hybrid pipeline that lets LLMs propose annotations and then applies rule-based refinement could combine the strengths of both approaches.
  • The performance gap observed here suggests that benchmarks focused on verification outcomes, rather than annotation similarity alone, should guide future LLM training for specification tasks.
  • Expanding the evaluation to include iterative interaction with the verifier or fine-tuning on verification traces could narrow the reliability difference between LLMs and rule-based methods.
  • These results point toward treating LLMs as assistive rather than autonomous components in formal methods toolchains for the near term.

Load-bearing premise

The filtered CASP subset represents typical real-world C programs and that one-shot prompting supplies a fair head-to-head test of the methods.

What would settle it

Running the same generation methods on an unfiltered or larger CASP collection, or repeating the comparison with multi-shot or engineered prompts, and finding that any LLM then matches or exceeds the rule-based proof success rate.

Figures

Figures reproduced from arXiv: 2602.13851 by Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan.

Figure 1
Figure 1. Figure 1: End-to-end research workflow illustrating how a curated dataset is progressively transformed into pure C files, enriched with ACSL annotations generated via a combination of automated scripts, Frama-C RTE, and large language models (DeepSeek-V3.2, GPT-5.2, and OLMo-3.1 32B Instruct), then consolidated into annotated C files that are evaluated using weakest-precondition (WP) prover tests to produce verifica… view at source ↗
Figure 2
Figure 2. Figure 2: Visual summary of EVA analysis results. (a) Number of alarms per file. (b) Distribution of statement coverage for successful and failed analyses. (c) Relationship between kernel warnings and coverage. by the Python script and the RTE plugin, achieve the highest mean proof success rates while maintaining low timeout counts, confirming their reliability for automated verification. Among LLM-based approaches,… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of WP verification outcomes across ACSL generation methods. The figure contrasts proof success, solver robustness, and their interaction for tool-generated and LLM-generated specifications. generation for structurally conventional C programs without requiring deep semantic reasoning [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: a presents a comparative analysis of Qed time variability, where the reported values correspond to the standard deviation of Qed times (in milliseconds) across four solvers, while Figure 4b presents the proof success distribution. Alt-Ergo and CVC5 exhibit similar variability profiles, with tightly clustered distributions and median standard deviation values around 7.5 ms, indicating rela￾tively stable and… view at source ↗
Figure 5
Figure 5. Figure 5: Solver Performance for RTE-Generated ACSL 4.5 GPT-5.2 Generation [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Solver Performance for GPT-5–Generated ACSL previous configurations, dispersion is notably lower, with standard deviations confined to a narrow interval between 13.97 and 14.72, suggesting more stable proof coverage across tasks. Timeout counts, however, remain uneven: Alt-Ergo records 130 timeouts, substantially exceeding Z3 (22), CVC5 (23), and especially CVC4 (13). Qed averages are considerably higher t… view at source ↗
Figure 7
Figure 7. Figure 7: Solver Performance for DeepSeek–Generated ACSL Figure 8a presents a comparative analysis of Qed time variability for ACSL specifications generated by the OLMo3 model, where the reported values correspond to the standard deviation of Qed times (in milliseconds) across the four solvers, while the Figure 8b presents the proof success distribution. In contrast to DeepSeek- and GPT-5– generated specifications, … view at source ↗
read the original abstract

Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper presents an empirical evaluation of automated ACSL annotation generation strategies for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three large language models (DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct). The study focuses on one-shot annotation generation, assessing how these approaches perform when directly applied to verification tasks. Using a filtered subset of the CASP benchmark, we evaluate generated annotations through Frama-C's WP plugin with multiple SMT solvers, analyzing proof success rates, solver timeouts, and internal processing time. Our results show that rule-based approaches remain more reliable for verification success, while LLM-based methods exhibit more variable performance. These findings highlight both the current limitations and the potential of LLMs as complementary tools for automated specification generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. This paper presents an empirical evaluation of ACSL annotation generation for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three LLMs (DeepSeek-V3.2, GPT-5.2, OLMo 3.1 32B Instruct) using one-shot prompting on a filtered CASP benchmark subset. Annotations are assessed for verifiability via Frama-C's WP plugin with multiple SMT solvers, focusing on proof success rates, solver timeouts, and processing time. The central claim is that rule-based methods are more reliable for verification success while LLM-based methods show greater variability.

Significance. If the results hold under fair conditions, the work provides concrete empirical data on the relative strengths of established rule-based tools versus current LLMs for generating verifiable specifications in C. This is useful for practitioners in formal verification, as it quantifies performance gaps using public benchmarks and standard tools like Frama-C, and could inform hybrid approaches. The direct measurement of proof outcomes via external solvers adds grounding to the comparative assessment.

major comments (1)
  1. [Evaluation setup (abstract and methods)] The one-shot prompting strategy for LLMs (as described in the abstract) may not constitute a fair apples-to-apples comparison with the rule-based script and Frama-C RTE, which are purpose-built for ACSL/C. Without few-shot examples, chain-of-thought, or task-specific instructions, any observed variability in LLM performance could stem from unequal prompting effort rather than fundamental limitations, especially if the filtered CASP subset contains patterns already well-handled by rules. This assumption is load-bearing for the headline claim that rule-based approaches remain more reliable.
minor comments (3)
  1. [Abstract] The abstract provides no details on the exact prompt templates used for the three LLMs or the specific filtering criteria applied to the CASP benchmark subset; these should be documented to allow replication.
  2. [Results] No statistical significance tests are mentioned for differences in proof success rates or timeouts across methods; adding these would strengthen the comparative claims.
  3. [Evaluation] Clarify the exact versions or configurations of the SMT solvers and Frama-C WP plugin used, as these can affect timeout and success metrics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful and detailed review. We address the single major comment below.

read point-by-point responses
  1. Referee: [Evaluation setup (abstract and methods)] The one-shot prompting strategy for LLMs (as described in the abstract) may not constitute a fair apples-to-apples comparison with the rule-based script and Frama-C RTE, which are purpose-built for ACSL/C. Without few-shot examples, chain-of-thought, or task-specific instructions, any observed variability in LLM performance could stem from unequal prompting effort rather than fundamental limitations, especially if the filtered CASP subset contains patterns already well-handled by rules. This assumption is load-bearing for the headline claim that rule-based approaches remain more reliable.

    Authors: We appreciate the referee's concern regarding fairness. Our study is explicitly scoped to one-shot prompting for the LLMs, as stated in the abstract and methods, to establish a baseline for direct, minimally prompted application of current models. The rule-based script and Frama-C RTE are specialized tools by design, and the comparison is intended to quantify their reliability relative to LLMs under standard one-shot conditions rather than to claim optimality for LLMs. The manuscript does not assert that LLMs cannot improve with advanced prompting; it reports observed performance gaps in the evaluated setting. We therefore maintain that the central claim holds for the conditions described, while acknowledging that further prompting engineering lies beyond the paper's stated objectives. revision: no

Circularity Check

0 steps flagged

No circularity in empirical evaluation

full rationale

The paper reports direct experimental measurements of annotation generation success rates and verification outcomes using external tools (Frama-C WP, SMT solvers) on a benchmark subset. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the described methodology or results. All claims rest on observable tool outputs rather than internal redefinitions or ansatzes, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted parameters, axioms, or invented entities; it relies on standard software verification tools and an existing public benchmark.

pith-pipeline@v0.9.0 · 5467 in / 1183 out tokens · 24729 ms · 2026-05-15T22:12:29.037197+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Arshad Beg, Diarmuid O’Donoghue, and Rosemary Monahan. 2025. Leveraging LLMs for Formal Software Requirements: Challenges and Prospects. In Proceedings of OVERLAY 2025: Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis (CEUR Workshop Proceedings, Vol. 4142) . CEUR-WS.org, 95–105. https://ceur-ws.org /Vol-4142/paper11.pdf Eval...

  2. [2]

    Arshad Beg, Diarmuid O’Donoghue, and Rosemary Monahan. 2025. Traceable and Verifiable Software Requirements: A Synthesis of AI- Enabled Formal Methods. Journal of Software Testing, Verification and Reliability (2025). doi:10.5281/zenodo.17772835 Submitted; Version v1 available on Zenodo

  3. [3]

    Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively Translat- ing Unstructured Natural Language to Temporal Logics with Large Language Models. arXiv: 2303.04864 [cs.LO] https://arxiv.org/abs/23 03.04864

  4. [4]

    Wen Fan, Marilyn Rego, Xin Hu, Sanya Dod, Zhaorui Ni, Danning Xie, Jenna DiVincenzo, and Lin Tan. 2025. Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in Veri- Fast. arXiv: 2411.02318 [cs.SE] https://arxiv.org/abs/2411.02318

  5. [5]

    Wenji Fang, Mengming Li, Min Li, Zhiyuan Yan, Shang Liu, Hongce Zhang, and Zhiyao Xie. 2024. AssertLLM: Generating Hardware Ver- ification Assertions from Design Specifications via Multi-LLMs. In 2024 IEEE LLM Aided Design Workshop (LAD) . 1–1. doi:10.1109/LA D62341.2024.10691792

  6. [6]

    George Granberry, Wolfgang Ahrendt, and Moa Johansson

  7. [7]

    In Integrated Formal Methods , Nikolai Kosma- tov and Laura Kovács (Eds.)

    Specify What? Enhancing Neural Specification Synthesis by Symbolic Methods. In Integrated Formal Methods , Nikolai Kosma- tov and Laura Kovács (Eds.). Springer Nature Switzerland, Cham, 307–325

  8. [8]

    George Granberry, Wolfgang Ahrendt, and Moa Johansson. 2025. To- wards Integrating Copiloting and Formal Methods. In Leveraging Ap- plications of Formal Methods, Verification and Validation. Specifica- tion and Verification , Tiziana Margaria and Bernhard Steffen (Eds.). Springer Nature Switzerland, Cham, 144–158

  9. [9]

    Niclas Hertzberg, Merlijn Sevenhuijsen, Liv Kåreborn, and Anna Lokrantz. 2025. CASP: An Evaluation Dataset for Formal Verification of C Code. In Bridging the Gap Between AI and Reality - Third Interna- tional Conference on Bridging the Gap between AI and Reality, AISoLA 2025, Rhodes, Greece, November 1-5, 2025, Selected Papers (Lecture Notes in Computer S...

  10. [10]

    doi:10.1007/978-3-032-07132-3_5

  11. [11]

    Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma

    Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subha- jit Roy, and Rahul Sharma. 2023. Finding Inductive Loop Invari- ants using Large Language Models. CoRR abs/2311.07948 (2023). arXiv:2311.07948 doi:10.48550/ARXIV.2311.07948

  12. [12]

    Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof Automation with Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineer- ing, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024 , Vladimir Filkov, Baishakhi Ray, and Minghui Zhou (Eds.). ACM, 1509–

  13. [13]

    doi:10.1145/3691620.3695521

  14. [14]

    Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2024. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. (2024). arXiv: 2401.08807 [cs.SE] https://arxiv.org/ abs/2401.08807

  15. [15]

    Anmol Nayak, Hari Prasad Timmapathini, Vidhya Murali, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao, and Amalinda Post. 2022. Req2Spec: Transforming Software Require- ments into Formal Specifications Using Natural Language Processing. In Requirements Engineering: Foundation for Software Quality: 28th International Working Conference, REFSQ 2022, Bir...

  16. [16]

    Sv-llm: An agentic approach for soc security verification using large language models,

    Dipayan Saha, Shams Tarek, Hasan Al Shaikh, Khan Thamid Hasan, Pavan Sai Nalluri, Md. Ajoad Hasan, Nashmin Alam, Jingbo Zhou, Sujan Kumar Saha, Mark Tehranipoor, and Farimah Farahmandi. 2025. SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models. (2025). arXiv: 2506.20415 [cs.CR] https: //arxiv.org/abs/2506.20415

  17. [17]

    Merlijn Sevenhuijsen, Khashayar Etemadi, and Mattias Nyberg. 2025. VeCoGen: Automating Generation of Formally Verified C Code With Large Language Models. In 13th IEEE/ACM International Conference on Formal Methods in Software Engineering, FormaliSE@ICSE 2025, Ot- tawa, ON, Canada, April 27-28, 2025 . IEEE, 101–112. doi:10.1109/FO RMALISE66629.2025.00017

  18. [18]

    Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. 2025. Hilbert: Recursively Building For- mal Proofs with Informal Reasoning. CoRR abs/2509.22819 (2025). arXiv:2509.22819 doi:10.48550/ARXIV.2509.22819

  19. [19]

    Cordeiro, and Liping Zhao

    Weiqi Wang, Marie Farrell, Lucas C. Cordeiro, and Liping Zhao. 2025. Supporting Software Formal Verification with Large Language Mod- els: An Experimental Study. In 33rd IEEE International Requirements Engineering Conference, RE 2025, Valencia, Spain, September 1-5, 2025 . IEEE, 423–431. doi:10.1109/RE63999.2025.00049

  20. [20]

    Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchant- ing Program Specification Synthesis by Large Language Models Using Static Analysis and Program Verification. In Computer Aided Verifica- tion - 36th International Conference, CA V 2024, Montreal, QC, Canada, July 24-27, 2024, Proce...

  21. [21]

    Haoze Wu, Clark Barrett, and Nina Narodytska. 2024. Lemur: Inte- grating Large Language Models in Automated Program Verification. arXiv:2310.04870 [cs.FL] https://arxiv.org/abs/2310.04870

  22. [22]

    Yedi Zhang, Sun Yi Emma, Annabelle Lee Jia En, and Jin Song Dong

  23. [23]

    Darijani, J

    RvLLM: LLM Runtime Verification with Domain Knowledge. CoRR abs/2505.18585 (2025). arXiv: 2505.18585 doi:10.48550/ARXIV.2 505.18585

  24. [24]

    Yihang Zhao. 2025. Leveraging Large Language Models for Ontology Requirements Engineering. In The Semantic Web: ESWC 2025 Satellite Events - Portoroz, Slovenia, June 1-5, 2025, Proceedings (Lecture Notes in Computer Science, Vol. 15832) , Edward Curry, Valentina Presutti, John P. McCrae, Mehwish Alam, Pieter Colpaert, Josiane Xavier Par- reira, Diego Coll...