Evaluating LLM-Generated ACSL Annotations for Formal Verification

Arshad Beg; Diarmuid O'Donoghue; Rosemary Monahan

arxiv: 2602.13851 · v3 · submitted 2026-02-14 · 💻 cs.SE · cs.AI

Evaluating LLM-Generated ACSL Annotations for Formal Verification

Arshad Beg , Diarmuid O'Donoghue , Rosemary Monahan This is my paper

Pith reviewed 2026-05-15 22:12 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords ACSL annotationsformal verificationLLM-generated specificationsFrama-CCASP benchmarkrule-based annotationSMT solversC program verification

0 comments

The pith

Rule-based methods produce more reliable ACSL annotations than one-shot LLM outputs for C program verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates strategies for automatically creating ACSL annotations that allow formal verification of real-world C programs. It directly compares a rule-based Python script, Frama-C's RTE plugin, and three large language models on a filtered subset of the CASP benchmark. Each set of generated annotations is fed into Frama-C's WP plugin together with multiple SMT solvers to measure proof success rates, timeouts, and processing time. A sympathetic reader would care because dependable software requires accurate specifications, and the results indicate whether current LLMs can practically assist or replace established automation techniques in this domain. The study concludes that rule-based approaches deliver more consistent verification success while LLM performance varies substantially.

Core claim

Through one-shot generation on the filtered CASP benchmark, the paper shows that rule-based annotation methods achieve higher rates of successful proofs under Frama-C's WP plugin and SMT solvers than the outputs produced by DeepSeek-V3.2, GPT-5.2, or OLMo 3.1 32B Instruct, with the latter exhibiting greater variability in both success and solver behavior.

What carries the argument

The one-shot annotation generation comparison, executed via Frama-C WP on the filtered CASP benchmark and measured by proof success rates across multiple SMT solvers.

If this is right

Rule-based scripts remain the safer default choice when high assurance verification of C code is required.
LLM outputs can still serve as starting points but need post-processing or additional checks to reach reliable proof rates.
One-shot LLM annotation generation is currently insufficient as a standalone replacement for established automation in formal verification pipelines.
Further prompt refinement or hybrid rule-plus-LLM workflows would be needed before LLMs become primary tools for ACSL specification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A hybrid pipeline that lets LLMs propose annotations and then applies rule-based refinement could combine the strengths of both approaches.
The performance gap observed here suggests that benchmarks focused on verification outcomes, rather than annotation similarity alone, should guide future LLM training for specification tasks.
Expanding the evaluation to include iterative interaction with the verifier or fine-tuning on verification traces could narrow the reliability difference between LLMs and rule-based methods.
These results point toward treating LLMs as assistive rather than autonomous components in formal methods toolchains for the near term.

Load-bearing premise

The filtered CASP subset represents typical real-world C programs and that one-shot prompting supplies a fair head-to-head test of the methods.

What would settle it

Running the same generation methods on an unfiltered or larger CASP collection, or repeating the comparison with multi-shot or engineered prompts, and finding that any LLM then matches or exceeds the rule-based proof success rate.

Figures

Figures reproduced from arXiv: 2602.13851 by Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan.

**Figure 1.** Figure 1: End-to-end research workflow illustrating how a curated dataset is progressively transformed into pure C files, enriched with ACSL annotations generated via a combination of automated scripts, Frama-C RTE, and large language models (DeepSeek-V3.2, GPT-5.2, and OLMo-3.1 32B Instruct), then consolidated into annotated C files that are evaluated using weakest-precondition (WP) prover tests to produce verifica… view at source ↗

**Figure 2.** Figure 2: Visual summary of EVA analysis results. (a) Number of alarms per file. (b) Distribution of statement coverage for successful and failed analyses. (c) Relationship between kernel warnings and coverage. by the Python script and the RTE plugin, achieve the highest mean proof success rates while maintaining low timeout counts, confirming their reliability for automated verification. Among LLM-based approaches,… view at source ↗

**Figure 3.** Figure 3: Comparison of WP verification outcomes across ACSL generation methods. The figure contrasts proof success, solver robustness, and their interaction for tool-generated and LLM-generated specifications. generation for structurally conventional C programs without requiring deep semantic reasoning [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: a presents a comparative analysis of Qed time variability, where the reported values correspond to the standard deviation of Qed times (in milliseconds) across four solvers, while Figure 4b presents the proof success distribution. Alt-Ergo and CVC5 exhibit similar variability profiles, with tightly clustered distributions and median standard deviation values around 7.5 ms, indicating relatively stable and… view at source ↗

**Figure 5.** Figure 5: Solver Performance for RTE-Generated ACSL 4.5 GPT-5.2 Generation [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Solver Performance for GPT-5–Generated ACSL previous configurations, dispersion is notably lower, with standard deviations confined to a narrow interval between 13.97 and 14.72, suggesting more stable proof coverage across tasks. Timeout counts, however, remain uneven: Alt-Ergo records 130 timeouts, substantially exceeding Z3 (22), CVC5 (23), and especially CVC4 (13). Qed averages are considerably higher t… view at source ↗

**Figure 7.** Figure 7: Solver Performance for DeepSeek–Generated ACSL Figure 8a presents a comparative analysis of Qed time variability for ACSL specifications generated by the OLMo3 model, where the reported values correspond to the standard deviation of Qed times (in milliseconds) across the four solvers, while the Figure 8b presents the proof success distribution. In contrast to DeepSeek- and GPT-5– generated specifications, … view at source ↗

read the original abstract

Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper presents an empirical evaluation of automated ACSL annotation generation strategies for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three large language models (DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct). The study focuses on one-shot annotation generation, assessing how these approaches perform when directly applied to verification tasks. Using a filtered subset of the CASP benchmark, we evaluate generated annotations through Frama-C's WP plugin with multiple SMT solvers, analyzing proof success rates, solver timeouts, and internal processing time. Our results show that rule-based approaches remain more reliable for verification success, while LLM-based methods exhibit more variable performance. These findings highlight both the current limitations and the potential of LLMs as complementary tools for automated specification generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rule-based ACSL generation still beats the tested LLMs on verification success in this one-shot CASP comparison, though the prompting setup may tilt the results.

read the letter

The main thing to know is that on the filtered CASP subset, the rule-based Python script and Frama-C RTE produce more annotations that actually verify with Frama-C WP and SMT solvers than DeepSeek-V3.2, GPT-5.2, or OLMo 3.1 32B Instruct do in one-shot mode. The paper reports higher proof success rates and fewer timeouts for the non-LLM approaches, with LLMs showing more inconsistent results across the benchmark cases. This is a straightforward empirical head-to-head that runs the generated specs through the real verification pipeline rather than stopping at annotation syntax or manual checks. It adds specific numbers for these three models against the established baselines on this public benchmark, which fills in a gap in the reported comparisons. The evaluation design is grounded because it uses the same WP plugin and solvers for all methods and measures concrete outcomes like proof success and internal processing time. The work is honest about the current limitations of LLMs in this setting without overclaiming. The soft spot is the one-shot constraint on the LLMs. Rule-based tools are purpose-built for ACSL and C, while the models receive minimal direct input with no examples, chain-of-thought, or task-specific tuning. That difference could account for part of the performance gap, and the paper does not explore whether better prompting would narrow it. Details on the exact prompt templates and the precise filtering rules for the CASP subset are also thin, and there are no statistical significance tests reported on the success-rate differences. If the subset leans toward patterns that rules already handle cleanly, the results might look less decisive on a broader sample. This paper is for researchers working on formal verification of C code who want current baseline numbers on LLM annotation generation. It does not introduce new techniques but supplies useful data points for anyone deciding whether to integrate LLMs into their annotation workflow. I would send it to peer review. The core comparison is clear and reproducible enough to benefit from referee input on prompting choices and benchmark selection.

Referee Report

1 major / 3 minor

Summary. This paper presents an empirical evaluation of ACSL annotation generation for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three LLMs (DeepSeek-V3.2, GPT-5.2, OLMo 3.1 32B Instruct) using one-shot prompting on a filtered CASP benchmark subset. Annotations are assessed for verifiability via Frama-C's WP plugin with multiple SMT solvers, focusing on proof success rates, solver timeouts, and processing time. The central claim is that rule-based methods are more reliable for verification success while LLM-based methods show greater variability.

Significance. If the results hold under fair conditions, the work provides concrete empirical data on the relative strengths of established rule-based tools versus current LLMs for generating verifiable specifications in C. This is useful for practitioners in formal verification, as it quantifies performance gaps using public benchmarks and standard tools like Frama-C, and could inform hybrid approaches. The direct measurement of proof outcomes via external solvers adds grounding to the comparative assessment.

major comments (1)

[Evaluation setup (abstract and methods)] The one-shot prompting strategy for LLMs (as described in the abstract) may not constitute a fair apples-to-apples comparison with the rule-based script and Frama-C RTE, which are purpose-built for ACSL/C. Without few-shot examples, chain-of-thought, or task-specific instructions, any observed variability in LLM performance could stem from unequal prompting effort rather than fundamental limitations, especially if the filtered CASP subset contains patterns already well-handled by rules. This assumption is load-bearing for the headline claim that rule-based approaches remain more reliable.

minor comments (3)

[Abstract] The abstract provides no details on the exact prompt templates used for the three LLMs or the specific filtering criteria applied to the CASP benchmark subset; these should be documented to allow replication.
[Results] No statistical significance tests are mentioned for differences in proof success rates or timeouts across methods; adding these would strengthen the comparative claims.
[Evaluation] Clarify the exact versions or configurations of the SMT solvers and Frama-C WP plugin used, as these can affect timeout and success metrics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful and detailed review. We address the single major comment below.

read point-by-point responses

Referee: [Evaluation setup (abstract and methods)] The one-shot prompting strategy for LLMs (as described in the abstract) may not constitute a fair apples-to-apples comparison with the rule-based script and Frama-C RTE, which are purpose-built for ACSL/C. Without few-shot examples, chain-of-thought, or task-specific instructions, any observed variability in LLM performance could stem from unequal prompting effort rather than fundamental limitations, especially if the filtered CASP subset contains patterns already well-handled by rules. This assumption is load-bearing for the headline claim that rule-based approaches remain more reliable.

Authors: We appreciate the referee's concern regarding fairness. Our study is explicitly scoped to one-shot prompting for the LLMs, as stated in the abstract and methods, to establish a baseline for direct, minimally prompted application of current models. The rule-based script and Frama-C RTE are specialized tools by design, and the comparison is intended to quantify their reliability relative to LLMs under standard one-shot conditions rather than to claim optimality for LLMs. The manuscript does not assert that LLMs cannot improve with advanced prompting; it reports observed performance gaps in the evaluated setting. We therefore maintain that the central claim holds for the conditions described, while acknowledging that further prompting engineering lies beyond the paper's stated objectives. revision: no

Circularity Check

0 steps flagged

No circularity in empirical evaluation

full rationale

The paper reports direct experimental measurements of annotation generation success rates and verification outcomes using external tools (Frama-C WP, SMT solvers) on a benchmark subset. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the described methodology or results. All claims rest on observable tool outputs rather than internal redefinitions or ansatzes, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted parameters, axioms, or invented entities; it relies on standard software verification tools and an existing public benchmark.

pith-pipeline@v0.9.0 · 5467 in / 1183 out tokens · 24729 ms · 2026-05-15T22:12:29.037197+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results show that rule-based approaches remain more reliable for verification success, while LLM-based methods exhibit more variable performance.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat ≃ Nat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Using a filtered subset of the CASP benchmark, we evaluate generated annotations through Frama-C’s WP plugin with multiple SMT solvers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Arshad Beg, Diarmuid O’Donoghue, and Rosemary Monahan. 2025. Leveraging LLMs for Formal Software Requirements: Challenges and Prospects. In Proceedings of OVERLAY 2025: Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis (CEUR Workshop Proceedings, Vol. 4142) . CEUR-WS.org, 95–105. https://ceur-ws.org /Vol-4142/paper11.pdf Eval...

work page 2025
[2]

Arshad Beg, Diarmuid O’Donoghue, and Rosemary Monahan. 2025. Traceable and Verifiable Software Requirements: A Synthesis of AI- Enabled Formal Methods. Journal of Software Testing, Verification and Reliability (2025). doi:10.5281/zenodo.17772835 Submitted; Version v1 available on Zenodo

work page doi:10.5281/zenodo.17772835 2025
[3]

Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively Translat- ing Unstructured Natural Language to Temporal Logics with Large Language Models. arXiv: 2303.04864 [cs.LO] https://arxiv.org/abs/23 03.04864

work page arXiv 2023
[4]

Wen Fan, Marilyn Rego, Xin Hu, Sanya Dod, Zhaorui Ni, Danning Xie, Jenna DiVincenzo, and Lin Tan. 2025. Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in Veri- Fast. arXiv: 2411.02318 [cs.SE] https://arxiv.org/abs/2411.02318

work page arXiv 2025
[5]

Wenji Fang, Mengming Li, Min Li, Zhiyuan Yan, Shang Liu, Hongce Zhang, and Zhiyao Xie. 2024. AssertLLM: Generating Hardware Ver- ification Assertions from Design Specifications via Multi-LLMs. In 2024 IEEE LLM Aided Design Workshop (LAD) . 1–1. doi:10.1109/LA D62341.2024.10691792

work page doi:10.1109/la 2024
[6]

George Granberry, Wolfgang Ahrendt, and Moa Johansson

work page
[7]

In Integrated Formal Methods , Nikolai Kosma- tov and Laura Kovács (Eds.)

Specify What? Enhancing Neural Specification Synthesis by Symbolic Methods. In Integrated Formal Methods , Nikolai Kosma- tov and Laura Kovács (Eds.). Springer Nature Switzerland, Cham, 307–325

work page
[8]

George Granberry, Wolfgang Ahrendt, and Moa Johansson. 2025. To- wards Integrating Copiloting and Formal Methods. In Leveraging Ap- plications of Formal Methods, Verification and Validation. Specifica- tion and Verification , Tiziana Margaria and Bernhard Steffen (Eds.). Springer Nature Switzerland, Cham, 144–158

work page 2025
[9]

Niclas Hertzberg, Merlijn Sevenhuijsen, Liv Kåreborn, and Anna Lokrantz. 2025. CASP: An Evaluation Dataset for Formal Verification of C Code. In Bridging the Gap Between AI and Reality - Third Interna- tional Conference on Bridging the Gap between AI and Reality, AISoLA 2025, Rhodes, Greece, November 1-5, 2025, Selected Papers (Lecture Notes in Computer S...

work page 2025
[10]

doi:10.1007/978-3-032-07132-3_5

work page doi:10.1007/978-3-032-07132-3_5
[11]

Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma

Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subha- jit Roy, and Rahul Sharma. 2023. Finding Inductive Loop Invari- ants using Large Language Models. CoRR abs/2311.07948 (2023). arXiv:2311.07948 doi:10.48550/ARXIV.2311.07948

work page doi:10.48550/arxiv.2311.07948 2023
[12]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof Automation with Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineer- ing, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024 , Vladimir Filkov, Baishakhi Ray, and Minghui Zhou (Eds.). ACM, 1509–

work page 2024
[13]

doi:10.1145/3691620.3695521

work page doi:10.1145/3691620.3695521
[14]

Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2024. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. (2024). arXiv: 2401.08807 [cs.SE] https://arxiv.org/ abs/2401.08807

work page arXiv 2024
[15]

Anmol Nayak, Hari Prasad Timmapathini, Vidhya Murali, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao, and Amalinda Post. 2022. Req2Spec: Transforming Software Require- ments into Formal Specifications Using Natural Language Processing. In Requirements Engineering: Foundation for Software Quality: 28th International Working Conference, REFSQ 2022, Bir...

work page 2022
[16]

Sv-llm: An agentic approach for soc security verification using large language models,

Dipayan Saha, Shams Tarek, Hasan Al Shaikh, Khan Thamid Hasan, Pavan Sai Nalluri, Md. Ajoad Hasan, Nashmin Alam, Jingbo Zhou, Sujan Kumar Saha, Mark Tehranipoor, and Farimah Farahmandi. 2025. SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models. (2025). arXiv: 2506.20415 [cs.CR] https: //arxiv.org/abs/2506.20415

work page arXiv 2025
[17]

Merlijn Sevenhuijsen, Khashayar Etemadi, and Mattias Nyberg. 2025. VeCoGen: Automating Generation of Formally Verified C Code With Large Language Models. In 13th IEEE/ACM International Conference on Formal Methods in Software Engineering, FormaliSE@ICSE 2025, Ot- tawa, ON, Canada, April 27-28, 2025 . IEEE, 101–112. doi:10.1109/FO RMALISE66629.2025.00017

work page doi:10.1109/fo 2025
[18]

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. 2025. Hilbert: Recursively Building For- mal Proofs with Informal Reasoning. CoRR abs/2509.22819 (2025). arXiv:2509.22819 doi:10.48550/ARXIV.2509.22819

work page doi:10.48550/arxiv.2509.22819 2025
[19]

Cordeiro, and Liping Zhao

Weiqi Wang, Marie Farrell, Lucas C. Cordeiro, and Liping Zhao. 2025. Supporting Software Formal Verification with Large Language Mod- els: An Experimental Study. In 33rd IEEE International Requirements Engineering Conference, RE 2025, Valencia, Spain, September 1-5, 2025 . IEEE, 423–431. doi:10.1109/RE63999.2025.00049

work page doi:10.1109/re63999.2025.00049 2025
[20]

Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchant- ing Program Specification Synthesis by Large Language Models Using Static Analysis and Program Verification. In Computer Aided Verifica- tion - 36th International Conference, CA V 2024, Montreal, QC, Canada, July 24-27, 2024, Proce...

work page doi:10.1007/978-3-031-65630-9_16 2024
[21]

Haoze Wu, Clark Barrett, and Nina Narodytska. 2024. Lemur: Inte- grating Large Language Models in Automated Program Verification. arXiv:2310.04870 [cs.FL] https://arxiv.org/abs/2310.04870

work page arXiv 2024
[22]

Yedi Zhang, Sun Yi Emma, Annabelle Lee Jia En, and Jin Song Dong

work page
[23]

Darijani, J

RvLLM: LLM Runtime Verification with Domain Knowledge. CoRR abs/2505.18585 (2025). arXiv: 2505.18585 doi:10.48550/ARXIV.2 505.18585

work page doi:10.48550/arxiv.2 2025
[24]

Yihang Zhao. 2025. Leveraging Large Language Models for Ontology Requirements Engineering. In The Semantic Web: ESWC 2025 Satellite Events - Portoroz, Slovenia, June 1-5, 2025, Proceedings (Lecture Notes in Computer Science, Vol. 15832) , Edward Curry, Valentina Presutti, John P. McCrae, Mehwish Alam, Pieter Colpaert, Josiane Xavier Par- reira, Diego Coll...

work page doi:10.1007/978-3-031-99554-5_40 2025

[1] [1]

Arshad Beg, Diarmuid O’Donoghue, and Rosemary Monahan. 2025. Leveraging LLMs for Formal Software Requirements: Challenges and Prospects. In Proceedings of OVERLAY 2025: Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis (CEUR Workshop Proceedings, Vol. 4142) . CEUR-WS.org, 95–105. https://ceur-ws.org /Vol-4142/paper11.pdf Eval...

work page 2025

[2] [2]

Arshad Beg, Diarmuid O’Donoghue, and Rosemary Monahan. 2025. Traceable and Verifiable Software Requirements: A Synthesis of AI- Enabled Formal Methods. Journal of Software Testing, Verification and Reliability (2025). doi:10.5281/zenodo.17772835 Submitted; Version v1 available on Zenodo

work page doi:10.5281/zenodo.17772835 2025

[3] [3]

Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively Translat- ing Unstructured Natural Language to Temporal Logics with Large Language Models. arXiv: 2303.04864 [cs.LO] https://arxiv.org/abs/23 03.04864

work page arXiv 2023

[4] [4]

Wen Fan, Marilyn Rego, Xin Hu, Sanya Dod, Zhaorui Ni, Danning Xie, Jenna DiVincenzo, and Lin Tan. 2025. Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in Veri- Fast. arXiv: 2411.02318 [cs.SE] https://arxiv.org/abs/2411.02318

work page arXiv 2025

[5] [5]

Wenji Fang, Mengming Li, Min Li, Zhiyuan Yan, Shang Liu, Hongce Zhang, and Zhiyao Xie. 2024. AssertLLM: Generating Hardware Ver- ification Assertions from Design Specifications via Multi-LLMs. In 2024 IEEE LLM Aided Design Workshop (LAD) . 1–1. doi:10.1109/LA D62341.2024.10691792

work page doi:10.1109/la 2024

[6] [6]

George Granberry, Wolfgang Ahrendt, and Moa Johansson

work page

[7] [7]

In Integrated Formal Methods , Nikolai Kosma- tov and Laura Kovács (Eds.)

Specify What? Enhancing Neural Specification Synthesis by Symbolic Methods. In Integrated Formal Methods , Nikolai Kosma- tov and Laura Kovács (Eds.). Springer Nature Switzerland, Cham, 307–325

work page

[8] [8]

George Granberry, Wolfgang Ahrendt, and Moa Johansson. 2025. To- wards Integrating Copiloting and Formal Methods. In Leveraging Ap- plications of Formal Methods, Verification and Validation. Specifica- tion and Verification , Tiziana Margaria and Bernhard Steffen (Eds.). Springer Nature Switzerland, Cham, 144–158

work page 2025

[9] [9]

Niclas Hertzberg, Merlijn Sevenhuijsen, Liv Kåreborn, and Anna Lokrantz. 2025. CASP: An Evaluation Dataset for Formal Verification of C Code. In Bridging the Gap Between AI and Reality - Third Interna- tional Conference on Bridging the Gap between AI and Reality, AISoLA 2025, Rhodes, Greece, November 1-5, 2025, Selected Papers (Lecture Notes in Computer S...

work page 2025

[10] [10]

doi:10.1007/978-3-032-07132-3_5

work page doi:10.1007/978-3-032-07132-3_5

[11] [11]

Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma

Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subha- jit Roy, and Rahul Sharma. 2023. Finding Inductive Loop Invari- ants using Large Language Models. CoRR abs/2311.07948 (2023). arXiv:2311.07948 doi:10.48550/ARXIV.2311.07948

work page doi:10.48550/arxiv.2311.07948 2023

[12] [12]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof Automation with Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineer- ing, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024 , Vladimir Filkov, Baishakhi Ray, and Minghui Zhou (Eds.). ACM, 1509–

work page 2024

[13] [13]

doi:10.1145/3691620.3695521

work page doi:10.1145/3691620.3695521

[14] [14]

Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2024. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. (2024). arXiv: 2401.08807 [cs.SE] https://arxiv.org/ abs/2401.08807

work page arXiv 2024

[15] [15]

Anmol Nayak, Hari Prasad Timmapathini, Vidhya Murali, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao, and Amalinda Post. 2022. Req2Spec: Transforming Software Require- ments into Formal Specifications Using Natural Language Processing. In Requirements Engineering: Foundation for Software Quality: 28th International Working Conference, REFSQ 2022, Bir...

work page 2022

[16] [16]

Sv-llm: An agentic approach for soc security verification using large language models,

Dipayan Saha, Shams Tarek, Hasan Al Shaikh, Khan Thamid Hasan, Pavan Sai Nalluri, Md. Ajoad Hasan, Nashmin Alam, Jingbo Zhou, Sujan Kumar Saha, Mark Tehranipoor, and Farimah Farahmandi. 2025. SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models. (2025). arXiv: 2506.20415 [cs.CR] https: //arxiv.org/abs/2506.20415

work page arXiv 2025

[17] [17]

Merlijn Sevenhuijsen, Khashayar Etemadi, and Mattias Nyberg. 2025. VeCoGen: Automating Generation of Formally Verified C Code With Large Language Models. In 13th IEEE/ACM International Conference on Formal Methods in Software Engineering, FormaliSE@ICSE 2025, Ot- tawa, ON, Canada, April 27-28, 2025 . IEEE, 101–112. doi:10.1109/FO RMALISE66629.2025.00017

work page doi:10.1109/fo 2025

[18] [18]

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye. 2025. Hilbert: Recursively Building For- mal Proofs with Informal Reasoning. CoRR abs/2509.22819 (2025). arXiv:2509.22819 doi:10.48550/ARXIV.2509.22819

work page doi:10.48550/arxiv.2509.22819 2025

[19] [19]

Cordeiro, and Liping Zhao

Weiqi Wang, Marie Farrell, Lucas C. Cordeiro, and Liping Zhao. 2025. Supporting Software Formal Verification with Large Language Mod- els: An Experimental Study. In 33rd IEEE International Requirements Engineering Conference, RE 2025, Valencia, Spain, September 1-5, 2025 . IEEE, 423–431. doi:10.1109/RE63999.2025.00049

work page doi:10.1109/re63999.2025.00049 2025

[20] [20]

Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchant- ing Program Specification Synthesis by Large Language Models Using Static Analysis and Program Verification. In Computer Aided Verifica- tion - 36th International Conference, CA V 2024, Montreal, QC, Canada, July 24-27, 2024, Proce...

work page doi:10.1007/978-3-031-65630-9_16 2024

[21] [21]

Haoze Wu, Clark Barrett, and Nina Narodytska. 2024. Lemur: Inte- grating Large Language Models in Automated Program Verification. arXiv:2310.04870 [cs.FL] https://arxiv.org/abs/2310.04870

work page arXiv 2024

[22] [22]

Yedi Zhang, Sun Yi Emma, Annabelle Lee Jia En, and Jin Song Dong

work page

[23] [23]

Darijani, J

RvLLM: LLM Runtime Verification with Domain Knowledge. CoRR abs/2505.18585 (2025). arXiv: 2505.18585 doi:10.48550/ARXIV.2 505.18585

work page doi:10.48550/arxiv.2 2025

[24] [24]

Yihang Zhao. 2025. Leveraging Large Language Models for Ontology Requirements Engineering. In The Semantic Web: ESWC 2025 Satellite Events - Portoroz, Slovenia, June 1-5, 2025, Proceedings (Lecture Notes in Computer Science, Vol. 15832) , Edward Curry, Valentina Presutti, John P. McCrae, Mehwish Alam, Pieter Colpaert, Josiane Xavier Par- reira, Diego Coll...

work page doi:10.1007/978-3-031-99554-5_40 2025