pith. sign in

arxiv: 2606.26490 · v1 · pith:X5VVVKIWnew · submitted 2026-06-25 · 💻 cs.SE · cs.AI· cs.LO· cs.PL

An Empirical Study of LLM-Generated Specifications for VeriFast

Pith reviewed 2026-06-26 04:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.LOcs.PL
keywords LLM-generated specificationsVeriFastseparation logicstatic verificationempirical evaluationfunctional preservationverification successC programs
0
0 comments X

The pith

LLMs generate functionally preserving specifications for VeriFast over 91 percent of the time but verify successfully only 31.4 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how effectively large language models can create the complex auxiliary specifications needed for separation logic verifiers like VeriFast to check heap-manipulating C programs. It applies ten models using eight different prompting strategies and three kinds of input to 303 functions, then measures whether the generated code and specs keep the original behavior and whether VeriFast can prove them. Results indicate high rates of functional preservation but only modest verification rates, with the majority of failures due to incorrect use of VeriFast-specific features. The work identifies concrete factors such as model choice and input type that improve outcomes and highlights the need for better domain knowledge in LLMs.

Core claim

Through systematic testing, the authors establish that LLMs preserve functional behavior in both the source code and the generated specifications at rates exceeding 91 percent, yet only 31.4 percent of the cases lead to successful verification by VeriFast. Models such as Gemini 2.5 Pro and the inclusion of formal contracts as prompts yield higher success rates. The dominant source of errors, accounting for 94 percent, is the LLMs' incorrect handling of the domain-specific knowledge required by separation logic verifiers.

What carries the argument

The multi-stage empirical evaluation using eight prompting approaches on ten LLMs with three input types to generate and check specifications for 303 C functions, focusing on functional equivalence and verifiability.

If this is right

  • Higher success rates occur when using Gemini 2.5 Pro compared to other models.
  • Including formal contracts in the prompt improves verification outcomes.
  • The bulk of verification failures stem from errors in applying separation logic concepts rather than from altering program semantics.
  • Functional preservation remains high across most configurations, indicating LLMs reliably maintain observable behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fine-tuning or retrieval-augmented generation focused on separation logic rules could raise verification success rates.
  • The same limitations may affect LLM use with other separation logic tools beyond VeriFast.
  • Scaling the method to full programs rather than individual functions might expose additional challenges in maintaining consistency across specifications.

Load-bearing premise

The 303 C functions along with the eight prompting approaches and three input types adequately represent practical VeriFast usage scenarios, and the automated checks for functional preservation and verifiability measure real-world utility without systematic bias.

What would settle it

Applying the identical prompting and evaluation protocol to a fresh collection of C functions from industrial codebases or to a different separation logic verifier would produce markedly different functional preservation or verification success percentages.

Figures

Figures reproduced from arXiv: 2606.26490 by Danning Xie, Jenna DiVincenzo, Lin Tan, Marilyn Rego, Minh Tran, Sanya Dod, Wen Fan, Xin Hu.

Figure 1
Figure 1. Figure 1: Static verification in VeriFast of the append_tail function for singly-linked lists. 1, verifying append_tail requires 34 lines of auxiliary specification code compared to 16 lines of program code2 . In SL verifiers, predicates enable the expression of heap properties for recursive heap data structures, such as lseg and llist on lines 4-9 in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The overview of the study III. STUDY DESIGN To evaluate LLMs on generating VeriFast specifications, we aim to answer the following research questions: RQ1 How well do LLMs preserve functional behavior when generating specifications for proofs of correctness in VeriFast across different prompts, input types, and LLMs? RQ2 How successful are LLMs at generating specifications that result in verified code with… view at source ↗
read the original abstract

Static verification tools can assure industrial scale software, but require significant human labor to write specifications. This is particularly true of static verifiers based on separation logic (SL verifiers), which excel at verifying heapmanipulating programs, but require many complex auxiliary specifications to reason about heap structure. Recent work applies large language models (LLMs) to generate code, tests, and proofs, including specifications for verifiers, but mostly targeting non-SL verifiers. To address this gap, this paper thoroughly evaluates how well LLMs perform when prompted to generate specifications for verifying 303 C functions with the SL verifier VeriFast. We explored eight prompting approaches, ten LLMs, and three input types in two stages. Quantitative and qualitative analyses are used to assess the LLM-generated code and specifications for functional behavior, verifiability and errors. The results show that LLMs preserve functional behavior in source code and specifications (both over 91%), but achieve modest verification success (31.4%). Using Gemini 2.5 Pro and providing formal contracts lead to higher success rates in our setting. Moreover, most errors (94%) come from LLMs' mistakes in the domainspecific knowledge of SL verifiers such as VeriFast. These findings provide guidance for optimizing LLM-generated specifications for SL verifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical evaluation of LLMs for generating specifications for the separation logic verifier VeriFast on 303 C functions. It examines eight prompting approaches, ten LLMs, and three input types across two stages, finding that generated code and specifications preserve functional behavior in over 91% of cases, achieve 31.4% verification success (higher with Gemini 2.5 Pro and formal contracts), and that 94% of errors arise from domain-specific knowledge gaps in SL verifiers.

Significance. If the measurements prove robust, the work fills a gap in LLM-assisted specification generation for heap-manipulating programs under separation logic, offering quantitative guidance on prompting strategies and highlighting domain knowledge as the dominant failure mode. The scale of the study (303 functions, multiple models and inputs) strengthens its potential utility for tool builders and practitioners.

major comments (3)
  1. [§4] §4 (Experimental Setup): The selection criteria and sampling method for the 303 C functions are not described in sufficient detail to evaluate selection bias or representativeness, which directly affects the reliability of the reported 31.4% verification success rate and the generalizability claim.
  2. [§5] §5 (Results, functional preservation): The automated procedure used to measure functional behavior preservation (claimed >91% for both code and specifications) is not specified, including whether it relies on test suites, differential execution, or manual review; without this, the central quantitative claims cannot be verified.
  3. [§5.3] §5.3 (Error Analysis): The classification process that attributes 94% of errors to domain-specific knowledge of VeriFast/SL verifiers lacks an explicit coding scheme, inter-rater reliability measure, or example annotations, making the dominant-error conclusion load-bearing yet unverifiable from the reported methodology.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'two stages' is used without defining what the stages consist of or how they differ in prompting or evaluation.
  2. [Figures/Tables] Table captions and axis labels in the results figures should explicitly state the exact success metric (e.g., 'VeriFast verification success rate') to avoid ambiguity with functional-preservation percentages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving methodological transparency, and we will revise the paper accordingly to address them.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): The selection criteria and sampling method for the 303 C functions are not described in sufficient detail to evaluate selection bias or representativeness, which directly affects the reliability of the reported 31.4% verification success rate and the generalizability claim.

    Authors: We agree that §4 requires additional detail on function selection. The 303 functions were drawn from open-source C repositories and prior VeriFast benchmarks, with a focus on heap-manipulating code; we will expand the section to explicitly describe the sources, inclusion criteria, sampling procedure, and any steps taken to mitigate bias. revision: yes

  2. Referee: [§5] §5 (Results, functional preservation): The automated procedure used to measure functional behavior preservation (claimed >91% for both code and specifications) is not specified, including whether it relies on test suites, differential execution, or manual review; without this, the central quantitative claims cannot be verified.

    Authors: The preservation metric combined differential execution against available test suites with manual behavioral comparison for functions lacking tests. We will add a precise description of this procedure, including any automation scripts and decision rules, to the revised §5. revision: yes

  3. Referee: [§5.3] §5.3 (Error Analysis): The classification process that attributes 94% of errors to domain-specific knowledge of VeriFast/SL verifiers lacks an explicit coding scheme, inter-rater reliability measure, or example annotations, making the dominant-error conclusion load-bearing yet unverifiable from the reported methodology.

    Authors: We will augment §5.3 with the full coding scheme (distinguishing functional, syntactic, and domain-knowledge errors), inter-rater agreement statistics, and additional annotated examples to make the 94% attribution fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This paper is an empirical evaluation reporting measured outcomes (functional preservation rates, verification success, error distributions) from direct experiments on 303 C functions across LLMs, prompting strategies, and input types. No derivation chain, equations, fitted parameters presented as predictions, or self-referential definitions exist in the work. All reported metrics are computed from the experimental runs themselves without reduction to prior fitted values or self-citations that bear the central claim. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study. It introduces no free parameters, mathematical axioms, or invented entities; all reported figures are observed experimental outcomes.

pith-pipeline@v0.9.1-grok · 5781 in / 1277 out tokens · 37393 ms · 2026-06-26T04:46:42.930464+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

83 extracted references · 5 canonical work pages

  1. [1]

    Chaff: Engineering an efficient sat solver,

    M. W. Moskewiczet al., “Chaff: Engineering an efficient sat solver,” in Proc. of the 38th annual Design Automation Conf., 2001, pp. 530–535

  2. [2]

    An axiomatic basis for computer programming,

    C. A. R. Hoare, “An axiomatic basis for computer programming,” Communications of the ACM, vol. 12, no. 10, pp. 576–580, 1969

  3. [3]

    Dafny: An automatic program verifier for functional correctness,

    K. R. M. Leino, “Dafny: An automatic program verifier for functional correctness,” inInternational conference on logic for programming artificial intelligence and reasoning. Springer, 2010, pp. 348–370

  4. [4]

    Verus: Verifying rust progs. using linear ghost types,

    A. Lattuadaet al., “Verus: Verifying rust progs. using linear ghost types,” Proc. of the ACM on PLs., vol. 7, no. OOPSLA1, pp. 286–315, 2023

  5. [5]

    Formally verified cloud-scale authorization,

    A. Chakarovet al., “Formally verified cloud-scale authorization,” in 2025 IEEE/ACM 47th Int’l. Conf. on Softw. Eng. (ICSE). IEEE Computer Society, 2025, pp. 703–703

  6. [6]

    Software verification with verifast: Industrial case studies,

    P. Philippaertset al., “Software verification with verifast: Industrial case studies,”Science of Computer Programming, vol. 82, pp. 77–97, 2014

  7. [7]

    Codet: Code generation with generated tests,

    B. Chenet al., “Codet: Code generation with generated tests,”arXiv preprint arXiv:2207.10397, 2022

  8. [8]

    Automatic generation of programming exercises and code explanations using large language models,

    S. Sarsaet al., “Automatic generation of programming exercises and code explanations using large language models,” inProc. of the 2022 ACM Conf. on Int’l. Computing Educ. Research-Vol. 1, 2022, pp. 27–43

  9. [9]

    Software testing with large language models: Survey, landscape, and vision,

    J. Wanget al., “Software testing with large language models: Survey, landscape, and vision,”IEEE Trans. Softw. Eng., 2024

  10. [10]

    Fuzz4all: Universal fuzzing with llms,

    C. S. Xiaet al., “Fuzz4all: Universal fuzzing with llms,” inProc. of the IEEE/ACM 46th Int’l Conf. on Softw. Eng., 2024, pp. 1–13

  11. [11]

    The daikon system for dynamic detection of likely invariants,

    M. D. Ernstet al., “The daikon system for dynamic detection of likely invariants,”Science of computer prog., vol. 69, no. 1-3, pp. 35–45, 2007

  12. [12]

    Can large language models reason about program invariants?

    K. Peiet al., “Can large language models reason about program invariants?” inInt’l Conf. on ML. PMLR, 2023, pp. 27 496–27 520

  13. [13]

    Impact of large language models on generating software specifications,

    D. Xieet al., “Impact of large language models on generating software specifications,”arXiv preprint arXiv:2306.03324, 2023

  14. [14]

    Classinvgen: Class invariant synthesis using large language models,

    C. Sunet al., “Classinvgen: Class invariant synthesis using large language models,” inInt’l Symp on AI Verif.Springer, 2025, pp. 64–96

  15. [15]

    Clause2inv: A generate-combine-check framework for loop invariant inference,

    W. Caoet al., “Clause2inv: A generate-combine-check framework for loop invariant inference,”Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1009–1030, 2025

  16. [16]

    Lyra: Orchestrating dual correction in automated theorem proving,

    C. Zhenget al., “Lyra: Orchestrating dual correction in automated theorem proving,”arXiv preprint arXiv:2309.15806, 2023

  17. [17]

    Leandojo: Theorem proving with retrieval-augmented language models,

    K. Yanget al., “Leandojo: Theorem proving with retrieval-augmented language models,”Adv. in Neural Info. Processing Syst., vol. 36, 2024

  18. [18]

    Lisa: Language models of isabelle proofs,

    A. Q. Jianget al., “Lisa: Language models of isabelle proofs,” in6th Conf. on AI and Theorem Proving, 2021, pp. 378–392

  19. [19]

    Llmstep: Llm proofstep suggestions in lean,

    S. Wellecket al., “Llmstep: Llm proofstep suggestions in lean,”arXiv preprint arXiv:2310.18457, 2023

  20. [20]

    Baldur: Whole-proof generation and repair with large language models,

    E. Firstet al., “Baldur: Whole-proof generation and repair with large language models,” inProc. of the 31st ACM Joint European Softw. Eng. Conf. and Symp. on the Found. of Softw. Eng., 2023, pp. 1229–1241

  21. [21]

    Vecogen: Automating generation of formally verified c code with large language models,

    M. Sevenhuijsenet al., “Vecogen: Automating generation of formally verified c code with large language models,” in2025 IEEE/ACM 13th Int’l Conf. on FM in SE (FormaliSE). IEEE, 2025, pp. 101–112

  22. [22]

    Enchanting program specification synthesis by large language models using static analysis and program verification,

    C. Wenet al., “Enchanting program specification synthesis by large language models using static analysis and program verification,” inInt’l Conf. on Computer Aided Verification. Springer, 2024, pp. 302–328

  23. [23]

    Can chatgpt support software verification?

    C. Janßenet al., “Can chatgpt support software verification?” inInt’l Conf. on Fundam. Approaches to SE. Springer, 2024, pp. 266–279

  24. [24]

    Leveraging llms for program verification,

    A. Kamathet al., “Leveraging llms for program verification,” in2024 FM in Computer-Aided Design (FMCAD). IEEE, 2024, pp. 107–118

  25. [25]

    Closing the Gap: A User Study on the Real- world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE,

    L. Maet al.,SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. IEEE Press, 2025, p. 16–28. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00129

  26. [26]

    Towards ai-assisted synthesis of verified dafny methods,

    M. R. H. Misuet al., “Towards ai-assisted synthesis of verified dafny methods,”Proc. of the ACM on SE, vol. 1, no. FSE, pp. 812–835, 2024

  27. [27]

    Laurel: Unblocking automated verification with large language models,

    E. Mugnieret al., “Laurel: Unblocking automated verification with large language models,”Proc. ACM Program. Lang., vol. 9, no. OOPSLA1, Apr. 2025. [Online]. Available: https://doi.org/10.1145/3720499

  28. [28]

    Automatic generation of loop invariants in dafny with large language models,

    J. Pascoal Fariaet al., “Automatic generation of loop invariants in dafny with large language models,” inInternational Conference on Fundamentals of Software Engineering. Springer, 2025, pp. 138–154

  29. [29]

    Leveraging large language models to boost dafny’s developers productivity,

    ´A. F. Silvaet al., “Leveraging large language models to boost dafny’s developers productivity,” inProc. of the 2024 IEEE/ACM 12th Int’l Conf. on FM in SE (FormaliSE), 2024, pp. 138–142

  30. [30]

    Evaluating llm-driven user-intent formalization for verification-aware languages,

    S. K. Lahiri, “Evaluating llm-driven user-intent formalization for verification-aware languages,” inCONFERENCE ON FORMAL METH- ODS IN COMPUTER-AIDED DESIGN–FMCAD 2024, 2024, p. 142

  31. [31]

    dafny-annotator: Ai-assisted verification of dafny programs,

    G. Poesiaet al., “dafny-annotator: Ai-assisted verification of dafny programs,”arXiv preprint arXiv:2411.15143, 2024

  32. [32]

    Leveraging large language models for automated proof synthesis in rust,

    J. Yaoet al., “Leveraging large language models for automated proof synthesis in rust,”arXiv preprint arXiv:2311.03739, 2023

  33. [33]

    Automated proof generation for rust code via self- evolution,

    T. Chenet al., “Automated proof generation for rust code via self- evolution,” inThe 13th Int’l Conf. on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=2NqssmiXLu

  34. [34]

    Autoverus: Automated proof generation for rust code,

    C. Yanget al., “Autoverus: Automated proof generation for rust code,” Proc. ACM Program. Lang., vol. 9, no. OOPSLA2, Oct. 2025. [Online]. Available: https://doi.org/10.1145/3763174

  35. [35]

    Evaluating the ability of gpt-4o to generate verifiable specifications in verifast,

    M. Regoet al., “Evaluating the ability of gpt-4o to generate verifiable specifications in verifast,” in2025 IEEE/ACM 2nd Int’l Conf on AI Foundation Models and Softw. Eng. (Forge), 2025, pp. 246–251

  36. [36]

    Dafnypro: Llm-assisted automated verification for dafny programs,

    D. Banerjeeet al., “Dafnypro: Llm-assisted automated verification for dafny programs,”arXiv preprint arXiv:2601.05385, 2026

  37. [37]

    Inferring multiple helper dafny assertions with llms,

    ´A. Silvaet al., “Inferring multiple helper dafny assertions with llms,” arXiv preprint arXiv:2511.00125, 2025

  38. [38]

    Sep. logic: A logic for shared mutable data structs

    J. C. Reynolds, “Sep. logic: A logic for shared mutable data structs.” in Proc. 17th Ann. IEEE Symp. on Logic in CS. IEEE, 2002, pp. 55–74

  39. [39]

    Sep. logic and abstraction,

    M. Parkinsonet al., “Sep. logic and abstraction,” inProc. of the 32nd ACM SIGPLAN-SIGACT Symp. on Princ. of PLs., 2005, pp. 247–258

  40. [40]

    Viper: A verification infrastructure for permission- based reasoning,

    P. M ¨ulleret al., “Viper: A verification infrastructure for permission- based reasoning,” inVerification, Model Checking, and Abstract In- terpretation: 17th Int’l Conf., VMCAI 2016, St. Petersburg, FL, USA, January 17-19, 2016. Proceedings 17. Springer, 2016, pp. 41–62

  41. [41]

    Gillian, part i: a multi-language platform for symbolic execution,

    J. Fragoso Santoset al., “Gillian, part i: a multi-language platform for symbolic execution,” inProc. of the 41st ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation, 2020, pp. 927–942

  42. [42]

    Verifast: A powerful, sound, predictable, fast verifier for c and java,

    B. Jacobset al., “Verifast: A powerful, sound, predictable, fast verifier for c and java,” inNASA FM Symp.Springer, 2011, pp. 41–55

  43. [43]

    Implicit dynamic frames,

    J. Smanset al., “Implicit dynamic frames,”ACM Transactions on Prog. Langs. and Syst. (TOPLAS), vol. 34, no. 1, pp. 1–58, 2012

  44. [44]

    Attention is all you need,

    A. Vaswaniet al., “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  45. [45]

    Llama: Open and efficient foundation language models,

    H. Touvronet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  46. [46]

    Deepseek-v3 technical report,

    A. Liuet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

  47. [47]

    Codegen: An open large language model for code with multi-turn program synth

    E. Nijkampet al., “Codegen: An open large language model for code with multi-turn program synth.”arXiv preprint arXiv:2203.13474, 2022

  48. [48]

    Starcoder 2 and the stack v2: The next generation,

    A. Lozhkovet al., “Starcoder 2 and the stack v2: The next generation,” arXiv preprint arXiv:2402.19173, 2024

  49. [49]

    Language models are few-shot learners,

    T. Brownet al., “Language models are few-shot learners,”Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  50. [50]

    Palm: Scaling language modeling with pathways,

    A. Chowdheryet al., “Palm: Scaling language modeling with pathways,” J. of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023

  51. [51]

    The llama 3 herd of models,

    A. Grattafioriet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

  52. [52]

    An explanation of in-context learning as implicit bayesian inference,

    S. M. Xieet al., “An explanation of in-context learning as implicit bayesian inference,”arXiv preprint arXiv:2111.02080, 2021

  53. [53]

    Chain-of-thought prompting elicits reasoning in llms,

    J. Weiet al., “Chain-of-thought prompting elicits reasoning in llms,” Adv. in neural info. proc. syst., vol. 35, pp. 24 824–24 837, 2022

  54. [54]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P. Lewiset al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,”Adv. in neural info. proc. syst., vol. 33, pp. 9459–9474, 2020

  55. [55]

    Solving quantitative reasoning probls. with lang. models,

    A. Lewkowyczet al., “Solving quantitative reasoning probls. with lang. models,”Adv. in neural info. proc. syst., vol. 35, pp. 3843–3857, 2022

  56. [56]

    Star: Bootstrapping reasoning with reasoning,

    E. Zelikmanet al., “Star: Bootstrapping reasoning with reasoning,”Adv. in Neural Info. Proc. Syst., vol. 35, pp. 15 476–15 488, 2022

  57. [57]

    Swe-agent: Agent-comptr. interfaces enable autom. softw. eng

    J. Yanget al., “Swe-agent: Agent-comptr. interfaces enable autom. softw. eng.”Adv. in Neural Info. Proc. Syst., vol. 37, pp. 50 528–50 652, 2024

  58. [58]

    Repocoder: Repository-level code completion through iterative retrieval and gen

    F. Zhanget al., “Repocoder: Repository-level code completion through iterative retrieval and gen.”arXiv preprint arXiv:2303.12570, 2023

  59. [59]

    Repoagent: An llm-powered open-source framework for repository-level code documentation generation,

    Q. Luoet al., “Repoagent: An llm-powered open-source framework for repository-level code documentation generation,”arXiv preprint arXiv:2402.16667, 2024

  60. [60]

    Llms for validating network protocol parsers,

    M. Zhenget al., “Llms for validating network protocol parsers,” in2025 IEEE Secur. and Priv. Wkshps. (SPW). IEEE, 2025, pp. 56–64

  61. [61]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,

    Y . Denget al., “Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models,” inProc. of the 32nd ACM SIGSOFT int’l symp. on softw. test. and anal., 2023, pp. 423–435

  62. [62]

    Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,

    C. Lemieuxet al., “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in2023 IEEE/ACM 45th Int’l Conf. on Softw. Eng. (ICSE). IEEE, 2023, pp. 919–931

  63. [63]

    Autocoderover: Autonomous program improvement,

    Y . Zhanget al., “Autocoderover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 1592–1604

  64. [64]

    Swe-bench: Can language models resolve real- world github issues?

    C. E. Jimenezet al., “Swe-bench: Can language models resolve real- world github issues?”arXiv preprint arXiv:2310.06770, 2023

  65. [65]

    Cocomic: Code completion by jointly modeling in-file and cross-file context,

    Y . Dinget al., “Cocomic: Code completion by jointly modeling in-file and cross-file context,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 3433–3445

  66. [66]

    Closing the Gap: A User Study on the Real- world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE,

    K. Thompsonet al., “Rango: Adaptive retrieval-augmented proving for automated software verification,” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 347–359. [Online]. Available: https: //doi.org/10.1109/ICSE55347.2025.00161

  67. [67]

    Adaptive proof refinement with llm-guided strategy selection,

    M. Luet al., “Adaptive proof refinement with llm-guided strategy selection,”arXiv preprint arXiv:2510.25103, 2025

  68. [68]

    Magnushammer: A transformer-based approach to premise selection,

    M. Mikułaet al., “Magnushammer: A transformer-based approach to premise selection,” inThe 12th Int’l Conf. on Learning Rep., 2024. [Online]. Available: https://openreview.net/forum?id=oYjPk8mqA V

  69. [69]

    Robertsonet al.,The probabilistic relevance framework: BM25 and beyond

    S. Robertsonet al.,The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009, vol. 4

  70. [70]

    Dense passage retrieval for open-domain question answering

    V . Karpukhinet al., “Dense passage retrieval for open-domain question answering.” inEMNLP (1), 2020, pp. 6769–6781

  71. [71]

    Verifast 25.11,

    B. Jacobset al., “Verifast 25.11,” https://github.com/verifast/verifast, 2025

  72. [72]

    Annotation inference for separation logic based verifiers,

    F. V ogelset al., “Annotation inference for separation logic based verifiers,” inInternational Conference on Formal Methods for Open Object-Based Distributed Systems. Springer, 2011, pp. 319–333

  73. [73]

    Z3: An efficient smt solver,

    L. De Mouraet al., “Z3: An efficient smt solver,” inInt’l Conf. on Tools & Algo. for the Const. and Anal. of Syst.Springer, 2008, pp. 337–340

  74. [74]

    Evaluating large language models in class-level code generation,

    X. Duet al., “Evaluating large language models in class-level code generation,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639219

  75. [75]

    Understanding interobserver agreement: the kappa statistic,

    A. J. Vieraet al., “Understanding interobserver agreement: the kappa statistic,”Fam med, vol. 37, no. 5, pp. 360–363, 2005

  76. [76]

    Can large language models transform natural language intent into formal method postconditions?

    M. Endreset al., “Can large language models transform natural language intent into formal method postconditions?”Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1889–1912, 2024

  77. [77]

    Automated lemma synth. in symbolic-heap sep. logic,

    Q.-T. Taet al., “Automated lemma synth. in symbolic-heap sep. logic,” Proc. of the ACM on Prog. Langs., vol. 2, no. POPL, pp. 1–29, 2017

  78. [78]

    Compositional shape analysis by means of bi- abduction,

    C. Calcagnoet al., “Compositional shape analysis by means of bi- abduction,” inProc. of the 36th annual ACM SIGPLAN-SIGACT symp. on Princ. of prog. langs., 2009, pp. 289–300

  79. [79]

    Compositional shape analysis by means of bi-abduction,

    ——, “Compositional shape analysis by means of bi-abduction,”Journal of the ACM (JACM), vol. 58, no. 6, pp. 1–66, 2011

  80. [80]

    Inferring loop invariants using postconditions

    C. A. Furiaet al., “Inferring loop invariants using postconditions.”Fields of logic and computation, vol. 6300, pp. 277–300, 2010

Showing first 80 references.