An Empirical Study of LLM-Generated Specifications for VeriFast

Danning Xie; Jenna DiVincenzo; Lin Tan; Marilyn Rego; Minh Tran; Sanya Dod; Wen Fan; Xin Hu

arxiv: 2606.26490 · v1 · pith:X5VVVKIWnew · submitted 2026-06-25 · 💻 cs.SE · cs.AI· cs.LO· cs.PL

An Empirical Study of LLM-Generated Specifications for VeriFast

Wen Fan , Minh Tran , Sanya Dod , Xin Hu , Marilyn Rego , Danning Xie , Jenna DiVincenzo , Lin Tan This is my paper

Pith reviewed 2026-06-26 04:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.LOcs.PL

keywords LLM-generated specificationsVeriFastseparation logicstatic verificationempirical evaluationfunctional preservationverification successC programs

0 comments

The pith

LLMs generate functionally preserving specifications for VeriFast over 91 percent of the time but verify successfully only 31.4 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how effectively large language models can create the complex auxiliary specifications needed for separation logic verifiers like VeriFast to check heap-manipulating C programs. It applies ten models using eight different prompting strategies and three kinds of input to 303 functions, then measures whether the generated code and specs keep the original behavior and whether VeriFast can prove them. Results indicate high rates of functional preservation but only modest verification rates, with the majority of failures due to incorrect use of VeriFast-specific features. The work identifies concrete factors such as model choice and input type that improve outcomes and highlights the need for better domain knowledge in LLMs.

Core claim

Through systematic testing, the authors establish that LLMs preserve functional behavior in both the source code and the generated specifications at rates exceeding 91 percent, yet only 31.4 percent of the cases lead to successful verification by VeriFast. Models such as Gemini 2.5 Pro and the inclusion of formal contracts as prompts yield higher success rates. The dominant source of errors, accounting for 94 percent, is the LLMs' incorrect handling of the domain-specific knowledge required by separation logic verifiers.

What carries the argument

The multi-stage empirical evaluation using eight prompting approaches on ten LLMs with three input types to generate and check specifications for 303 C functions, focusing on functional equivalence and verifiability.

If this is right

Higher success rates occur when using Gemini 2.5 Pro compared to other models.
Including formal contracts in the prompt improves verification outcomes.
The bulk of verification failures stem from errors in applying separation logic concepts rather than from altering program semantics.
Functional preservation remains high across most configurations, indicating LLMs reliably maintain observable behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fine-tuning or retrieval-augmented generation focused on separation logic rules could raise verification success rates.
The same limitations may affect LLM use with other separation logic tools beyond VeriFast.
Scaling the method to full programs rather than individual functions might expose additional challenges in maintaining consistency across specifications.

Load-bearing premise

The 303 C functions along with the eight prompting approaches and three input types adequately represent practical VeriFast usage scenarios, and the automated checks for functional preservation and verifiability measure real-world utility without systematic bias.

What would settle it

Applying the identical prompting and evaluation protocol to a fresh collection of C functions from industrial codebases or to a different separation logic verifier would produce markedly different functional preservation or verification success percentages.

Figures

Figures reproduced from arXiv: 2606.26490 by Danning Xie, Jenna DiVincenzo, Lin Tan, Marilyn Rego, Minh Tran, Sanya Dod, Wen Fan, Xin Hu.

**Figure 1.** Figure 1: Static verification in VeriFast of the append_tail function for singly-linked lists. 1, verifying append_tail requires 34 lines of auxiliary specification code compared to 16 lines of program code2 . In SL verifiers, predicates enable the expression of heap properties for recursive heap data structures, such as lseg and llist on lines 4-9 in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: The overview of the study III. STUDY DESIGN To evaluate LLMs on generating VeriFast specifications, we aim to answer the following research questions: RQ1 How well do LLMs preserve functional behavior when generating specifications for proofs of correctness in VeriFast across different prompts, input types, and LLMs? RQ2 How successful are LLMs at generating specifications that result in verified code with… view at source ↗

read the original abstract

Static verification tools can assure industrial scale software, but require significant human labor to write specifications. This is particularly true of static verifiers based on separation logic (SL verifiers), which excel at verifying heapmanipulating programs, but require many complex auxiliary specifications to reason about heap structure. Recent work applies large language models (LLMs) to generate code, tests, and proofs, including specifications for verifiers, but mostly targeting non-SL verifiers. To address this gap, this paper thoroughly evaluates how well LLMs perform when prompted to generate specifications for verifying 303 C functions with the SL verifier VeriFast. We explored eight prompting approaches, ten LLMs, and three input types in two stages. Quantitative and qualitative analyses are used to assess the LLM-generated code and specifications for functional behavior, verifiability and errors. The results show that LLMs preserve functional behavior in source code and specifications (both over 91%), but achieve modest verification success (31.4%). Using Gemini 2.5 Pro and providing formal contracts lead to higher success rates in our setting. Moreover, most errors (94%) come from LLMs' mistakes in the domainspecific knowledge of SL verifiers such as VeriFast. These findings provide guidance for optimizing LLM-generated specifications for SL verifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

First quantitative look at LLMs writing specs for a separation-logic verifier shows high functional preservation but only 31% verification success, mostly from domain-knowledge gaps.

read the letter

The paper's core finding is that LLMs keep functional behavior in both code and specs above 91% across their runs, yet only reach 31.4% verification success with VeriFast, and 94% of the failures trace to missing knowledge of how separation logic works in that tool. Gemini 2.5 Pro plus formal contracts helped in their tests.

What is new is the focus on an SL verifier. Earlier LLM specification work targeted other tools, so the numbers on 303 C functions, eight prompting strategies, ten models, and three input types fill a gap with concrete data rather than another generic code-generation study.

The work is straightforward in its setup and reports the metrics clearly, including the error breakdown. That gives readers a usable baseline for this specific verifier.

The main soft spot is whether the 303 functions and the automated checks for preservation and verifiability are representative or carry selection bias. The abstract does not spell out how the functions were chosen or how the 94% classification was validated, so the domain-knowledge claim rests on procedures that need more detail to assess. The modest success rate itself is not surprising given the verifier's demands, but it limits how far the results generalize without further testing.

This is for researchers working on LLM support for formal methods, especially separation logic. Someone looking for initial empirical numbers on VeriFast would find it useful as a starting point.

It deserves peer review because it supplies the first reported data on this combination of models and verifier, even if the methodology section will need tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical evaluation of LLMs for generating specifications for the separation logic verifier VeriFast on 303 C functions. It examines eight prompting approaches, ten LLMs, and three input types across two stages, finding that generated code and specifications preserve functional behavior in over 91% of cases, achieve 31.4% verification success (higher with Gemini 2.5 Pro and formal contracts), and that 94% of errors arise from domain-specific knowledge gaps in SL verifiers.

Significance. If the measurements prove robust, the work fills a gap in LLM-assisted specification generation for heap-manipulating programs under separation logic, offering quantitative guidance on prompting strategies and highlighting domain knowledge as the dominant failure mode. The scale of the study (303 functions, multiple models and inputs) strengthens its potential utility for tool builders and practitioners.

major comments (3)

[§4] §4 (Experimental Setup): The selection criteria and sampling method for the 303 C functions are not described in sufficient detail to evaluate selection bias or representativeness, which directly affects the reliability of the reported 31.4% verification success rate and the generalizability claim.
[§5] §5 (Results, functional preservation): The automated procedure used to measure functional behavior preservation (claimed >91% for both code and specifications) is not specified, including whether it relies on test suites, differential execution, or manual review; without this, the central quantitative claims cannot be verified.
[§5.3] §5.3 (Error Analysis): The classification process that attributes 94% of errors to domain-specific knowledge of VeriFast/SL verifiers lacks an explicit coding scheme, inter-rater reliability measure, or example annotations, making the dominant-error conclusion load-bearing yet unverifiable from the reported methodology.

minor comments (2)

[Abstract] Abstract: The phrase 'two stages' is used without defining what the stages consist of or how they differ in prompting or evaluation.
[Figures/Tables] Table captions and axis labels in the results figures should explicitly state the exact success metric (e.g., 'VeriFast verification success rate') to avoid ambiguity with functional-preservation percentages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving methodological transparency, and we will revise the paper accordingly to address them.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup): The selection criteria and sampling method for the 303 C functions are not described in sufficient detail to evaluate selection bias or representativeness, which directly affects the reliability of the reported 31.4% verification success rate and the generalizability claim.

Authors: We agree that §4 requires additional detail on function selection. The 303 functions were drawn from open-source C repositories and prior VeriFast benchmarks, with a focus on heap-manipulating code; we will expand the section to explicitly describe the sources, inclusion criteria, sampling procedure, and any steps taken to mitigate bias. revision: yes
Referee: [§5] §5 (Results, functional preservation): The automated procedure used to measure functional behavior preservation (claimed >91% for both code and specifications) is not specified, including whether it relies on test suites, differential execution, or manual review; without this, the central quantitative claims cannot be verified.

Authors: The preservation metric combined differential execution against available test suites with manual behavioral comparison for functions lacking tests. We will add a precise description of this procedure, including any automation scripts and decision rules, to the revised §5. revision: yes
Referee: [§5.3] §5.3 (Error Analysis): The classification process that attributes 94% of errors to domain-specific knowledge of VeriFast/SL verifiers lacks an explicit coding scheme, inter-rater reliability measure, or example annotations, making the dominant-error conclusion load-bearing yet unverifiable from the reported methodology.

Authors: We will augment §5.3 with the full coding scheme (distinguishing functional, syntactic, and domain-knowledge errors), inter-rater agreement statistics, and additional annotated examples to make the 94% attribution fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This paper is an empirical evaluation reporting measured outcomes (functional preservation rates, verification success, error distributions) from direct experiments on 303 C functions across LLMs, prompting strategies, and input types. No derivation chain, equations, fitted parameters presented as predictions, or self-referential definitions exist in the work. All reported metrics are computed from the experimental runs themselves without reduction to prior fitted values or self-citations that bear the central claim. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study. It introduces no free parameters, mathematical axioms, or invented entities; all reported figures are observed experimental outcomes.

pith-pipeline@v0.9.1-grok · 5781 in / 1277 out tokens · 37393 ms · 2026-06-26T04:46:42.930464+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 5 canonical work pages

[1]

Chaff: Engineering an efficient sat solver,

M. W. Moskewiczet al., “Chaff: Engineering an efficient sat solver,” in Proc. of the 38th annual Design Automation Conf., 2001, pp. 530–535

2001
[2]

An axiomatic basis for computer programming,

C. A. R. Hoare, “An axiomatic basis for computer programming,” Communications of the ACM, vol. 12, no. 10, pp. 576–580, 1969

1969
[3]

Dafny: An automatic program verifier for functional correctness,

K. R. M. Leino, “Dafny: An automatic program verifier for functional correctness,” inInternational conference on logic for programming artificial intelligence and reasoning. Springer, 2010, pp. 348–370

2010
[4]

Verus: Verifying rust progs. using linear ghost types,

A. Lattuadaet al., “Verus: Verifying rust progs. using linear ghost types,” Proc. of the ACM on PLs., vol. 7, no. OOPSLA1, pp. 286–315, 2023

2023
[5]

Formally verified cloud-scale authorization,

A. Chakarovet al., “Formally verified cloud-scale authorization,” in 2025 IEEE/ACM 47th Int’l. Conf. on Softw. Eng. (ICSE). IEEE Computer Society, 2025, pp. 703–703

2025
[6]

Software verification with verifast: Industrial case studies,

P. Philippaertset al., “Software verification with verifast: Industrial case studies,”Science of Computer Programming, vol. 82, pp. 77–97, 2014

2014
[7]

Codet: Code generation with generated tests,

B. Chenet al., “Codet: Code generation with generated tests,”arXiv preprint arXiv:2207.10397, 2022

Pith/arXiv arXiv 2022
[8]

Automatic generation of programming exercises and code explanations using large language models,

S. Sarsaet al., “Automatic generation of programming exercises and code explanations using large language models,” inProc. of the 2022 ACM Conf. on Int’l. Computing Educ. Research-Vol. 1, 2022, pp. 27–43

2022
[9]

Software testing with large language models: Survey, landscape, and vision,

J. Wanget al., “Software testing with large language models: Survey, landscape, and vision,”IEEE Trans. Softw. Eng., 2024

2024
[10]

Fuzz4all: Universal fuzzing with llms,

C. S. Xiaet al., “Fuzz4all: Universal fuzzing with llms,” inProc. of the IEEE/ACM 46th Int’l Conf. on Softw. Eng., 2024, pp. 1–13

2024
[11]

The daikon system for dynamic detection of likely invariants,

M. D. Ernstet al., “The daikon system for dynamic detection of likely invariants,”Science of computer prog., vol. 69, no. 1-3, pp. 35–45, 2007

2007
[12]

Can large language models reason about program invariants?

K. Peiet al., “Can large language models reason about program invariants?” inInt’l Conf. on ML. PMLR, 2023, pp. 27 496–27 520

2023
[13]

Impact of large language models on generating software specifications,

D. Xieet al., “Impact of large language models on generating software specifications,”arXiv preprint arXiv:2306.03324, 2023

arXiv 2023
[14]

Classinvgen: Class invariant synthesis using large language models,

C. Sunet al., “Classinvgen: Class invariant synthesis using large language models,” inInt’l Symp on AI Verif.Springer, 2025, pp. 64–96

2025
[15]

Clause2inv: A generate-combine-check framework for loop invariant inference,

W. Caoet al., “Clause2inv: A generate-combine-check framework for loop invariant inference,”Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1009–1030, 2025

2025
[16]

Lyra: Orchestrating dual correction in automated theorem proving,

C. Zhenget al., “Lyra: Orchestrating dual correction in automated theorem proving,”arXiv preprint arXiv:2309.15806, 2023

arXiv 2023
[17]

Leandojo: Theorem proving with retrieval-augmented language models,

K. Yanget al., “Leandojo: Theorem proving with retrieval-augmented language models,”Adv. in Neural Info. Processing Syst., vol. 36, 2024

2024
[18]

Lisa: Language models of isabelle proofs,

A. Q. Jianget al., “Lisa: Language models of isabelle proofs,” in6th Conf. on AI and Theorem Proving, 2021, pp. 378–392

2021
[19]

Llmstep: Llm proofstep suggestions in lean,

S. Wellecket al., “Llmstep: Llm proofstep suggestions in lean,”arXiv preprint arXiv:2310.18457, 2023

arXiv 2023
[20]

Baldur: Whole-proof generation and repair with large language models,

E. Firstet al., “Baldur: Whole-proof generation and repair with large language models,” inProc. of the 31st ACM Joint European Softw. Eng. Conf. and Symp. on the Found. of Softw. Eng., 2023, pp. 1229–1241

2023
[21]

Vecogen: Automating generation of formally verified c code with large language models,

M. Sevenhuijsenet al., “Vecogen: Automating generation of formally verified c code with large language models,” in2025 IEEE/ACM 13th Int’l Conf. on FM in SE (FormaliSE). IEEE, 2025, pp. 101–112

2025
[22]

Enchanting program specification synthesis by large language models using static analysis and program verification,

C. Wenet al., “Enchanting program specification synthesis by large language models using static analysis and program verification,” inInt’l Conf. on Computer Aided Verification. Springer, 2024, pp. 302–328

2024
[23]

Can chatgpt support software verification?

C. Janßenet al., “Can chatgpt support software verification?” inInt’l Conf. on Fundam. Approaches to SE. Springer, 2024, pp. 266–279

2024
[24]

Leveraging llms for program verification,

A. Kamathet al., “Leveraging llms for program verification,” in2024 FM in Computer-Aided Design (FMCAD). IEEE, 2024, pp. 107–118

2024
[25]

Closing the Gap: A User Study on the Real- world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE,

L. Maet al.,SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. IEEE Press, 2025, p. 16–28. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00129

work page doi:10.1109/icse55347.2025.00129 2025
[26]

Towards ai-assisted synthesis of verified dafny methods,

M. R. H. Misuet al., “Towards ai-assisted synthesis of verified dafny methods,”Proc. of the ACM on SE, vol. 1, no. FSE, pp. 812–835, 2024

2024
[27]

Laurel: Unblocking automated verification with large language models,

E. Mugnieret al., “Laurel: Unblocking automated verification with large language models,”Proc. ACM Program. Lang., vol. 9, no. OOPSLA1, Apr. 2025. [Online]. Available: https://doi.org/10.1145/3720499

work page doi:10.1145/3720499 2025
[28]

Automatic generation of loop invariants in dafny with large language models,

J. Pascoal Fariaet al., “Automatic generation of loop invariants in dafny with large language models,” inInternational Conference on Fundamentals of Software Engineering. Springer, 2025, pp. 138–154

2025
[29]

Leveraging large language models to boost dafny’s developers productivity,

´A. F. Silvaet al., “Leveraging large language models to boost dafny’s developers productivity,” inProc. of the 2024 IEEE/ACM 12th Int’l Conf. on FM in SE (FormaliSE), 2024, pp. 138–142

2024
[30]

Evaluating llm-driven user-intent formalization for verification-aware languages,

S. K. Lahiri, “Evaluating llm-driven user-intent formalization for verification-aware languages,” inCONFERENCE ON FORMAL METH- ODS IN COMPUTER-AIDED DESIGN–FMCAD 2024, 2024, p. 142

2024
[31]

dafny-annotator: Ai-assisted verification of dafny programs,

G. Poesiaet al., “dafny-annotator: Ai-assisted verification of dafny programs,”arXiv preprint arXiv:2411.15143, 2024

arXiv 2024
[32]

Leveraging large language models for automated proof synthesis in rust,

J. Yaoet al., “Leveraging large language models for automated proof synthesis in rust,”arXiv preprint arXiv:2311.03739, 2023

arXiv 2023
[33]

Automated proof generation for rust code via self- evolution,

T. Chenet al., “Automated proof generation for rust code via self- evolution,” inThe 13th Int’l Conf. on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=2NqssmiXLu

2025
[34]

Autoverus: Automated proof generation for rust code,

C. Yanget al., “Autoverus: Automated proof generation for rust code,” Proc. ACM Program. Lang., vol. 9, no. OOPSLA2, Oct. 2025. [Online]. Available: https://doi.org/10.1145/3763174

work page doi:10.1145/3763174 2025
[35]

Evaluating the ability of gpt-4o to generate verifiable specifications in verifast,

M. Regoet al., “Evaluating the ability of gpt-4o to generate verifiable specifications in verifast,” in2025 IEEE/ACM 2nd Int’l Conf on AI Foundation Models and Softw. Eng. (Forge), 2025, pp. 246–251

2025
[36]

Dafnypro: Llm-assisted automated verification for dafny programs,

D. Banerjeeet al., “Dafnypro: Llm-assisted automated verification for dafny programs,”arXiv preprint arXiv:2601.05385, 2026

arXiv 2026
[37]

Inferring multiple helper dafny assertions with llms,

´A. Silvaet al., “Inferring multiple helper dafny assertions with llms,” arXiv preprint arXiv:2511.00125, 2025

arXiv 2025
[38]

Sep. logic: A logic for shared mutable data structs

J. C. Reynolds, “Sep. logic: A logic for shared mutable data structs.” in Proc. 17th Ann. IEEE Symp. on Logic in CS. IEEE, 2002, pp. 55–74

2002
[39]

Sep. logic and abstraction,

M. Parkinsonet al., “Sep. logic and abstraction,” inProc. of the 32nd ACM SIGPLAN-SIGACT Symp. on Princ. of PLs., 2005, pp. 247–258

2005
[40]

Viper: A verification infrastructure for permission- based reasoning,

P. M ¨ulleret al., “Viper: A verification infrastructure for permission- based reasoning,” inVerification, Model Checking, and Abstract In- terpretation: 17th Int’l Conf., VMCAI 2016, St. Petersburg, FL, USA, January 17-19, 2016. Proceedings 17. Springer, 2016, pp. 41–62

2016
[41]

Gillian, part i: a multi-language platform for symbolic execution,

J. Fragoso Santoset al., “Gillian, part i: a multi-language platform for symbolic execution,” inProc. of the 41st ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2020, pp. 927–942

2020
[42]

Verifast: A powerful, sound, predictable, fast verifier for c and java,

B. Jacobset al., “Verifast: A powerful, sound, predictable, fast verifier for c and java,” inNASA FM Symp.Springer, 2011, pp. 41–55

2011
[43]

Implicit dynamic frames,

J. Smanset al., “Implicit dynamic frames,”ACM Transactions on Prog. Langs. and Syst. (TOPLAS), vol. 34, no. 1, pp. 1–58, 2012

2012
[44]

Attention is all you need,

A. Vaswaniet al., “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017
[45]

Llama: Open and efficient foundation language models,

H. Touvronet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[46]

Deepseek-v3 technical report,

A. Liuet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[47]

Codegen: An open large language model for code with multi-turn program synth

E. Nijkampet al., “Codegen: An open large language model for code with multi-turn program synth.”arXiv preprint arXiv:2203.13474, 2022

Pith/arXiv arXiv 2022
[48]

Starcoder 2 and the stack v2: The next generation,

A. Lozhkovet al., “Starcoder 2 and the stack v2: The next generation,” arXiv preprint arXiv:2402.19173, 2024

Pith/arXiv arXiv 2024
[49]

Language models are few-shot learners,

T. Brownet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

1901
[50]

Palm: Scaling language modeling with pathways,

A. Chowdheryet al., “Palm: Scaling language modeling with pathways,” J. of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023

2023
[51]

The llama 3 herd of models,

A. Grattafioriet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[52]

An explanation of in-context learning as implicit bayesian inference,

S. M. Xieet al., “An explanation of in-context learning as implicit bayesian inference,”arXiv preprint arXiv:2111.02080, 2021

Pith/arXiv arXiv 2021
[53]

Chain-of-thought prompting elicits reasoning in llms,

J. Weiet al., “Chain-of-thought prompting elicits reasoning in llms,” Adv. in neural info. proc. syst., vol. 35, pp. 24 824–24 837, 2022

2022
[54]

Retrieval-augmented generation for knowledge-intensive nlp tasks,

P. Lewiset al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,”Adv. in neural info. proc. syst., vol. 33, pp. 9459–9474, 2020

2020
[55]

Solving quantitative reasoning probls. with lang. models,

A. Lewkowyczet al., “Solving quantitative reasoning probls. with lang. models,”Adv. in neural info. proc. syst., vol. 35, pp. 3843–3857, 2022

2022
[56]

Star: Bootstrapping reasoning with reasoning,

E. Zelikmanet al., “Star: Bootstrapping reasoning with reasoning,”Adv. in Neural Info. Proc. Syst., vol. 35, pp. 15 476–15 488, 2022

2022
[57]

Swe-agent: Agent-comptr. interfaces enable autom. softw. eng

J. Yanget al., “Swe-agent: Agent-comptr. interfaces enable autom. softw. eng.”Adv. in Neural Info. Proc. Syst., vol. 37, pp. 50 528–50 652, 2024

2024
[58]

Repocoder: Repository-level code completion through iterative retrieval and gen

F. Zhanget al., “Repocoder: Repository-level code completion through iterative retrieval and gen.”arXiv preprint arXiv:2303.12570, 2023

arXiv 2023
[59]

Repoagent: An llm-powered open-source framework for repository-level code documentation generation,

Q. Luoet al., “Repoagent: An llm-powered open-source framework for repository-level code documentation generation,”arXiv preprint arXiv:2402.16667, 2024

arXiv 2024
[60]

Llms for validating network protocol parsers,

M. Zhenget al., “Llms for validating network protocol parsers,” in2025 IEEE Secur. and Priv. Wkshps. (SPW). IEEE, 2025, pp. 56–64

2025
[61]

Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,

Y . Denget al., “Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,” inProc. of the 32nd ACM SIGSOFT int’l symp. on softw. test. and anal., 2023, pp. 423–435

2023
[62]

Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,

C. Lemieuxet al., “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in2023 IEEE/ACM 45th Int’l Conf. on Softw. Eng. (ICSE). IEEE, 2023, pp. 919–931

2023
[63]

Autocoderover: Autonomous program improvement,

Y . Zhanget al., “Autocoderover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 1592–1604

2024
[64]

Swe-bench: Can language models resolve real- world github issues?

C. E. Jimenezet al., “Swe-bench: Can language models resolve real- world github issues?”arXiv preprint arXiv:2310.06770, 2023

Pith/arXiv arXiv 2023
[65]

Cocomic: Code completion by jointly modeling in-file and cross-file context,

Y . Dinget al., “Cocomic: Code completion by jointly modeling in-file and cross-file context,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 3433–3445

2024
[66]

Closing the Gap: A User Study on the Real- world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE,

K. Thompsonet al., “Rango: Adaptive retrieval-augmented proving for automated software verification,” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 347–359. [Online]. Available: https: //doi.org/10.1109/ICSE55347.2025.00161

work page doi:10.1109/icse55347.2025.00161 2025
[67]

Adaptive proof refinement with llm-guided strategy selection,

M. Luet al., “Adaptive proof refinement with llm-guided strategy selection,”arXiv preprint arXiv:2510.25103, 2025

arXiv 2025
[68]

Magnushammer: A transformer-based approach to premise selection,

M. Mikułaet al., “Magnushammer: A transformer-based approach to premise selection,” inThe 12th Int’l Conf. on Learning Rep., 2024. [Online]. Available: https://openreview.net/forum?id=oYjPk8mqA V

2024
[69]

Robertsonet al.,The probabilistic relevance framework: BM25 and beyond

S. Robertsonet al.,The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009, vol. 4

2009
[70]

Dense passage retrieval for open-domain question answering

V . Karpukhinet al., “Dense passage retrieval for open-domain question answering.” inEMNLP (1), 2020, pp. 6769–6781

2020
[71]

Verifast 25.11,

B. Jacobset al., “Verifast 25.11,” https://github.com/verifast/verifast, 2025

2025
[72]

Annotation inference for separation logic based verifiers,

F. V ogelset al., “Annotation inference for separation logic based verifiers,” inInternational Conference on Formal Methods for Open Object-Based Distributed Systems. Springer, 2011, pp. 319–333

2011
[73]

Z3: An efficient smt solver,

L. De Mouraet al., “Z3: An efficient smt solver,” inInt’l Conf. on Tools & Algo. for the Const. and Anal. of Syst.Springer, 2008, pp. 337–340

2008
[74]

Evaluating large language models in class-level code generation,

X. Duet al., “Evaluating large language models in class-level code generation,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639219

work page doi:10.1145/3597503.3639219 2024
[75]

Understanding interobserver agreement: the kappa statistic,

A. J. Vieraet al., “Understanding interobserver agreement: the kappa statistic,”Fam med, vol. 37, no. 5, pp. 360–363, 2005

2005
[76]

Can large language models transform natural language intent into formal method postconditions?

M. Endreset al., “Can large language models transform natural language intent into formal method postconditions?”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1889–1912, 2024

1912
[77]

Automated lemma synth. in symbolic-heap sep. logic,

Q.-T. Taet al., “Automated lemma synth. in symbolic-heap sep. logic,” Proc. of the ACM on Prog. Langs., vol. 2, no. POPL, pp. 1–29, 2017

2017
[78]

Compositional shape analysis by means of bi- abduction,

C. Calcagnoet al., “Compositional shape analysis by means of bi- abduction,” inProc. of the 36th annual ACM SIGPLAN-SIGACT symp. on Princ. of prog. langs., 2009, pp. 289–300

2009
[79]

Compositional shape analysis by means of bi-abduction,

——, “Compositional shape analysis by means of bi-abduction,”Journal of the ACM (JACM), vol. 58, no. 6, pp. 1–66, 2011

2011
[80]

Inferring loop invariants using postconditions

C. A. Furiaet al., “Inferring loop invariants using postconditions.”Fields of logic and computation, vol. 6300, pp. 277–300, 2010

2010

Showing first 80 references.

[1] [1]

Chaff: Engineering an efficient sat solver,

M. W. Moskewiczet al., “Chaff: Engineering an efficient sat solver,” in Proc. of the 38th annual Design Automation Conf., 2001, pp. 530–535

2001

[2] [2]

An axiomatic basis for computer programming,

C. A. R. Hoare, “An axiomatic basis for computer programming,” Communications of the ACM, vol. 12, no. 10, pp. 576–580, 1969

1969

[3] [3]

Dafny: An automatic program verifier for functional correctness,

K. R. M. Leino, “Dafny: An automatic program verifier for functional correctness,” inInternational conference on logic for programming artificial intelligence and reasoning. Springer, 2010, pp. 348–370

2010

[4] [4]

Verus: Verifying rust progs. using linear ghost types,

A. Lattuadaet al., “Verus: Verifying rust progs. using linear ghost types,” Proc. of the ACM on PLs., vol. 7, no. OOPSLA1, pp. 286–315, 2023

2023

[5] [5]

Formally verified cloud-scale authorization,

A. Chakarovet al., “Formally verified cloud-scale authorization,” in 2025 IEEE/ACM 47th Int’l. Conf. on Softw. Eng. (ICSE). IEEE Computer Society, 2025, pp. 703–703

2025

[6] [6]

Software verification with verifast: Industrial case studies,

P. Philippaertset al., “Software verification with verifast: Industrial case studies,”Science of Computer Programming, vol. 82, pp. 77–97, 2014

2014

[7] [7]

Codet: Code generation with generated tests,

B. Chenet al., “Codet: Code generation with generated tests,”arXiv preprint arXiv:2207.10397, 2022

Pith/arXiv arXiv 2022

[8] [8]

Automatic generation of programming exercises and code explanations using large language models,

S. Sarsaet al., “Automatic generation of programming exercises and code explanations using large language models,” inProc. of the 2022 ACM Conf. on Int’l. Computing Educ. Research-Vol. 1, 2022, pp. 27–43

2022

[9] [9]

Software testing with large language models: Survey, landscape, and vision,

J. Wanget al., “Software testing with large language models: Survey, landscape, and vision,”IEEE Trans. Softw. Eng., 2024

2024

[10] [10]

Fuzz4all: Universal fuzzing with llms,

C. S. Xiaet al., “Fuzz4all: Universal fuzzing with llms,” inProc. of the IEEE/ACM 46th Int’l Conf. on Softw. Eng., 2024, pp. 1–13

2024

[11] [11]

The daikon system for dynamic detection of likely invariants,

M. D. Ernstet al., “The daikon system for dynamic detection of likely invariants,”Science of computer prog., vol. 69, no. 1-3, pp. 35–45, 2007

2007

[12] [12]

Can large language models reason about program invariants?

K. Peiet al., “Can large language models reason about program invariants?” inInt’l Conf. on ML. PMLR, 2023, pp. 27 496–27 520

2023

[13] [13]

Impact of large language models on generating software specifications,

D. Xieet al., “Impact of large language models on generating software specifications,”arXiv preprint arXiv:2306.03324, 2023

arXiv 2023

[14] [14]

Classinvgen: Class invariant synthesis using large language models,

C. Sunet al., “Classinvgen: Class invariant synthesis using large language models,” inInt’l Symp on AI Verif.Springer, 2025, pp. 64–96

2025

[15] [15]

Clause2inv: A generate-combine-check framework for loop invariant inference,

W. Caoet al., “Clause2inv: A generate-combine-check framework for loop invariant inference,”Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1009–1030, 2025

2025

[16] [16]

Lyra: Orchestrating dual correction in automated theorem proving,

C. Zhenget al., “Lyra: Orchestrating dual correction in automated theorem proving,”arXiv preprint arXiv:2309.15806, 2023

arXiv 2023

[17] [17]

Leandojo: Theorem proving with retrieval-augmented language models,

K. Yanget al., “Leandojo: Theorem proving with retrieval-augmented language models,”Adv. in Neural Info. Processing Syst., vol. 36, 2024

2024

[18] [18]

Lisa: Language models of isabelle proofs,

A. Q. Jianget al., “Lisa: Language models of isabelle proofs,” in6th Conf. on AI and Theorem Proving, 2021, pp. 378–392

2021

[19] [19]

Llmstep: Llm proofstep suggestions in lean,

S. Wellecket al., “Llmstep: Llm proofstep suggestions in lean,”arXiv preprint arXiv:2310.18457, 2023

arXiv 2023

[20] [20]

Baldur: Whole-proof generation and repair with large language models,

E. Firstet al., “Baldur: Whole-proof generation and repair with large language models,” inProc. of the 31st ACM Joint European Softw. Eng. Conf. and Symp. on the Found. of Softw. Eng., 2023, pp. 1229–1241

2023

[21] [21]

Vecogen: Automating generation of formally verified c code with large language models,

M. Sevenhuijsenet al., “Vecogen: Automating generation of formally verified c code with large language models,” in2025 IEEE/ACM 13th Int’l Conf. on FM in SE (FormaliSE). IEEE, 2025, pp. 101–112

2025

[22] [22]

Enchanting program specification synthesis by large language models using static analysis and program verification,

C. Wenet al., “Enchanting program specification synthesis by large language models using static analysis and program verification,” inInt’l Conf. on Computer Aided Verification. Springer, 2024, pp. 302–328

2024

[23] [23]

Can chatgpt support software verification?

C. Janßenet al., “Can chatgpt support software verification?” inInt’l Conf. on Fundam. Approaches to SE. Springer, 2024, pp. 266–279

2024

[24] [24]

Leveraging llms for program verification,

A. Kamathet al., “Leveraging llms for program verification,” in2024 FM in Computer-Aided Design (FMCAD). IEEE, 2024, pp. 107–118

2024

[25] [25]

Closing the Gap: A User Study on the Real- world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE,

L. Maet al.,SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. IEEE Press, 2025, p. 16–28. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00129

work page doi:10.1109/icse55347.2025.00129 2025

[26] [26]

Towards ai-assisted synthesis of verified dafny methods,

M. R. H. Misuet al., “Towards ai-assisted synthesis of verified dafny methods,”Proc. of the ACM on SE, vol. 1, no. FSE, pp. 812–835, 2024

2024

[27] [27]

Laurel: Unblocking automated verification with large language models,

E. Mugnieret al., “Laurel: Unblocking automated verification with large language models,”Proc. ACM Program. Lang., vol. 9, no. OOPSLA1, Apr. 2025. [Online]. Available: https://doi.org/10.1145/3720499

work page doi:10.1145/3720499 2025

[28] [28]

Automatic generation of loop invariants in dafny with large language models,

J. Pascoal Fariaet al., “Automatic generation of loop invariants in dafny with large language models,” inInternational Conference on Fundamentals of Software Engineering. Springer, 2025, pp. 138–154

2025

[29] [29]

Leveraging large language models to boost dafny’s developers productivity,

´A. F. Silvaet al., “Leveraging large language models to boost dafny’s developers productivity,” inProc. of the 2024 IEEE/ACM 12th Int’l Conf. on FM in SE (FormaliSE), 2024, pp. 138–142

2024

[30] [30]

Evaluating llm-driven user-intent formalization for verification-aware languages,

S. K. Lahiri, “Evaluating llm-driven user-intent formalization for verification-aware languages,” inCONFERENCE ON FORMAL METH- ODS IN COMPUTER-AIDED DESIGN–FMCAD 2024, 2024, p. 142

2024

[31] [31]

dafny-annotator: Ai-assisted verification of dafny programs,

G. Poesiaet al., “dafny-annotator: Ai-assisted verification of dafny programs,”arXiv preprint arXiv:2411.15143, 2024

arXiv 2024

[32] [32]

Leveraging large language models for automated proof synthesis in rust,

J. Yaoet al., “Leveraging large language models for automated proof synthesis in rust,”arXiv preprint arXiv:2311.03739, 2023

arXiv 2023

[33] [33]

Automated proof generation for rust code via self- evolution,

T. Chenet al., “Automated proof generation for rust code via self- evolution,” inThe 13th Int’l Conf. on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=2NqssmiXLu

2025

[34] [34]

Autoverus: Automated proof generation for rust code,

C. Yanget al., “Autoverus: Automated proof generation for rust code,” Proc. ACM Program. Lang., vol. 9, no. OOPSLA2, Oct. 2025. [Online]. Available: https://doi.org/10.1145/3763174

work page doi:10.1145/3763174 2025

[35] [35]

Evaluating the ability of gpt-4o to generate verifiable specifications in verifast,

M. Regoet al., “Evaluating the ability of gpt-4o to generate verifiable specifications in verifast,” in2025 IEEE/ACM 2nd Int’l Conf on AI Foundation Models and Softw. Eng. (Forge), 2025, pp. 246–251

2025

[36] [36]

Dafnypro: Llm-assisted automated verification for dafny programs,

D. Banerjeeet al., “Dafnypro: Llm-assisted automated verification for dafny programs,”arXiv preprint arXiv:2601.05385, 2026

arXiv 2026

[37] [37]

Inferring multiple helper dafny assertions with llms,

´A. Silvaet al., “Inferring multiple helper dafny assertions with llms,” arXiv preprint arXiv:2511.00125, 2025

arXiv 2025

[38] [38]

Sep. logic: A logic for shared mutable data structs

J. C. Reynolds, “Sep. logic: A logic for shared mutable data structs.” in Proc. 17th Ann. IEEE Symp. on Logic in CS. IEEE, 2002, pp. 55–74

2002

[39] [39]

Sep. logic and abstraction,

M. Parkinsonet al., “Sep. logic and abstraction,” inProc. of the 32nd ACM SIGPLAN-SIGACT Symp. on Princ. of PLs., 2005, pp. 247–258

2005

[40] [40]

Viper: A verification infrastructure for permission- based reasoning,

P. M ¨ulleret al., “Viper: A verification infrastructure for permission- based reasoning,” inVerification, Model Checking, and Abstract In- terpretation: 17th Int’l Conf., VMCAI 2016, St. Petersburg, FL, USA, January 17-19, 2016. Proceedings 17. Springer, 2016, pp. 41–62

2016

[41] [41]

Gillian, part i: a multi-language platform for symbolic execution,

J. Fragoso Santoset al., “Gillian, part i: a multi-language platform for symbolic execution,” inProc. of the 41st ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2020, pp. 927–942

2020

[42] [42]

Verifast: A powerful, sound, predictable, fast verifier for c and java,

B. Jacobset al., “Verifast: A powerful, sound, predictable, fast verifier for c and java,” inNASA FM Symp.Springer, 2011, pp. 41–55

2011

[43] [43]

Implicit dynamic frames,

J. Smanset al., “Implicit dynamic frames,”ACM Transactions on Prog. Langs. and Syst. (TOPLAS), vol. 34, no. 1, pp. 1–58, 2012

2012

[44] [44]

Attention is all you need,

A. Vaswaniet al., “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017

[45] [45]

Llama: Open and efficient foundation language models,

H. Touvronet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[46] [46]

Deepseek-v3 technical report,

A. Liuet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024

[47] [47]

Codegen: An open large language model for code with multi-turn program synth

E. Nijkampet al., “Codegen: An open large language model for code with multi-turn program synth.”arXiv preprint arXiv:2203.13474, 2022

Pith/arXiv arXiv 2022

[48] [48]

Starcoder 2 and the stack v2: The next generation,

A. Lozhkovet al., “Starcoder 2 and the stack v2: The next generation,” arXiv preprint arXiv:2402.19173, 2024

Pith/arXiv arXiv 2024

[49] [49]

Language models are few-shot learners,

T. Brownet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

1901

[50] [50]

Palm: Scaling language modeling with pathways,

A. Chowdheryet al., “Palm: Scaling language modeling with pathways,” J. of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023

2023

[51] [51]

The llama 3 herd of models,

A. Grattafioriet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[52] [52]

An explanation of in-context learning as implicit bayesian inference,

S. M. Xieet al., “An explanation of in-context learning as implicit bayesian inference,”arXiv preprint arXiv:2111.02080, 2021

Pith/arXiv arXiv 2021

[53] [53]

Chain-of-thought prompting elicits reasoning in llms,

J. Weiet al., “Chain-of-thought prompting elicits reasoning in llms,” Adv. in neural info. proc. syst., vol. 35, pp. 24 824–24 837, 2022

2022

[54] [54]

Retrieval-augmented generation for knowledge-intensive nlp tasks,

P. Lewiset al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,”Adv. in neural info. proc. syst., vol. 33, pp. 9459–9474, 2020

2020

[55] [55]

Solving quantitative reasoning probls. with lang. models,

A. Lewkowyczet al., “Solving quantitative reasoning probls. with lang. models,”Adv. in neural info. proc. syst., vol. 35, pp. 3843–3857, 2022

2022

[56] [56]

Star: Bootstrapping reasoning with reasoning,

E. Zelikmanet al., “Star: Bootstrapping reasoning with reasoning,”Adv. in Neural Info. Proc. Syst., vol. 35, pp. 15 476–15 488, 2022

2022

[57] [57]

Swe-agent: Agent-comptr. interfaces enable autom. softw. eng

J. Yanget al., “Swe-agent: Agent-comptr. interfaces enable autom. softw. eng.”Adv. in Neural Info. Proc. Syst., vol. 37, pp. 50 528–50 652, 2024

2024

[58] [58]

Repocoder: Repository-level code completion through iterative retrieval and gen

F. Zhanget al., “Repocoder: Repository-level code completion through iterative retrieval and gen.”arXiv preprint arXiv:2303.12570, 2023

arXiv 2023

[59] [59]

Repoagent: An llm-powered open-source framework for repository-level code documentation generation,

Q. Luoet al., “Repoagent: An llm-powered open-source framework for repository-level code documentation generation,”arXiv preprint arXiv:2402.16667, 2024

arXiv 2024

[60] [60]

Llms for validating network protocol parsers,

M. Zhenget al., “Llms for validating network protocol parsers,” in2025 IEEE Secur. and Priv. Wkshps. (SPW). IEEE, 2025, pp. 56–64

2025

[61] [61]

Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,

Y . Denget al., “Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,” inProc. of the 32nd ACM SIGSOFT int’l symp. on softw. test. and anal., 2023, pp. 423–435

2023

[62] [62]

Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,

C. Lemieuxet al., “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in2023 IEEE/ACM 45th Int’l Conf. on Softw. Eng. (ICSE). IEEE, 2023, pp. 919–931

2023

[63] [63]

Autocoderover: Autonomous program improvement,

Y . Zhanget al., “Autocoderover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 1592–1604

2024

[64] [64]

Swe-bench: Can language models resolve real- world github issues?

C. E. Jimenezet al., “Swe-bench: Can language models resolve real- world github issues?”arXiv preprint arXiv:2310.06770, 2023

Pith/arXiv arXiv 2023

[65] [65]

Cocomic: Code completion by jointly modeling in-file and cross-file context,

Y . Dinget al., “Cocomic: Code completion by jointly modeling in-file and cross-file context,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 3433–3445

2024

[66] [66]

Closing the Gap: A User Study on the Real- world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE,

K. Thompsonet al., “Rango: Adaptive retrieval-augmented proving for automated software verification,” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 347–359. [Online]. Available: https: //doi.org/10.1109/ICSE55347.2025.00161

work page doi:10.1109/icse55347.2025.00161 2025

[67] [67]

Adaptive proof refinement with llm-guided strategy selection,

M. Luet al., “Adaptive proof refinement with llm-guided strategy selection,”arXiv preprint arXiv:2510.25103, 2025

arXiv 2025

[68] [68]

Magnushammer: A transformer-based approach to premise selection,

M. Mikułaet al., “Magnushammer: A transformer-based approach to premise selection,” inThe 12th Int’l Conf. on Learning Rep., 2024. [Online]. Available: https://openreview.net/forum?id=oYjPk8mqA V

2024

[69] [69]

Robertsonet al.,The probabilistic relevance framework: BM25 and beyond

S. Robertsonet al.,The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009, vol. 4

2009

[70] [70]

Dense passage retrieval for open-domain question answering

V . Karpukhinet al., “Dense passage retrieval for open-domain question answering.” inEMNLP (1), 2020, pp. 6769–6781

2020

[71] [71]

Verifast 25.11,

B. Jacobset al., “Verifast 25.11,” https://github.com/verifast/verifast, 2025

2025

[72] [72]

Annotation inference for separation logic based verifiers,

F. V ogelset al., “Annotation inference for separation logic based verifiers,” inInternational Conference on Formal Methods for Open Object-Based Distributed Systems. Springer, 2011, pp. 319–333

2011

[73] [73]

Z3: An efficient smt solver,

L. De Mouraet al., “Z3: An efficient smt solver,” inInt’l Conf. on Tools & Algo. for the Const. and Anal. of Syst.Springer, 2008, pp. 337–340

2008

[74] [74]

Evaluating large language models in class-level code generation,

X. Duet al., “Evaluating large language models in class-level code generation,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639219

work page doi:10.1145/3597503.3639219 2024

[75] [75]

Understanding interobserver agreement: the kappa statistic,

A. J. Vieraet al., “Understanding interobserver agreement: the kappa statistic,”Fam med, vol. 37, no. 5, pp. 360–363, 2005

2005

[76] [76]

Can large language models transform natural language intent into formal method postconditions?

M. Endreset al., “Can large language models transform natural language intent into formal method postconditions?”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1889–1912, 2024

1912

[77] [77]

Automated lemma synth. in symbolic-heap sep. logic,

Q.-T. Taet al., “Automated lemma synth. in symbolic-heap sep. logic,” Proc. of the ACM on Prog. Langs., vol. 2, no. POPL, pp. 1–29, 2017

2017

[78] [78]

Compositional shape analysis by means of bi- abduction,

C. Calcagnoet al., “Compositional shape analysis by means of bi- abduction,” inProc. of the 36th annual ACM SIGPLAN-SIGACT symp. on Princ. of prog. langs., 2009, pp. 289–300

2009

[79] [79]

Compositional shape analysis by means of bi-abduction,

——, “Compositional shape analysis by means of bi-abduction,”Journal of the ACM (JACM), vol. 58, no. 6, pp. 1–66, 2011

2011

[80] [80]

Inferring loop invariants using postconditions

C. A. Furiaet al., “Inferring loop invariants using postconditions.”Fields of logic and computation, vol. 6300, pp. 277–300, 2010

2010