pith. machine review for the scientific record.

arxiv: 2604.22601 · v1 · submitted 2026-04-24 · 💻 cs.SE · cs.AI

Recognition: unknown

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

Ahmed Ryan, Md Erfan, Md Kamal Hossain Chowdhury, Md Rayhanur Rahman

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:15 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: large language models · formal verification · Dafny · natural language to code · prompt engineering · verified code generation · automated software engineering

The pith

Open-weight LLMs achieve high rates of formal verification when generating code from natural language using structured prompting and Dafny feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can translate natural-language problem descriptions into code that passes both functional tests and mathematical verification in Dafny. It releases the NL2VC-60 dataset of 60 algorithmic problems and tests seven open-weight models across contextless prompts, signature prompts that give code structure, and self-healing prompts that iterate on Dafny error messages. Contextless attempts almost always fail, but the guided approaches lift verification success to as high as 90.91 percent on the evaluated set. This matters because it offers a concrete route to AI-generated code whose correctness is enforced by proof rather than by testing alone.
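The three prompt tiers described above can be sketched as simple templates. This is a hedged illustration, not the paper's actual prompts: the function names, wording, and the example Dafny signature are all invented for exposition.

```python
# Hypothetical sketch of the three prompt tiers; none of the wording,
# function names, or the example signature comes from the paper itself.

def contextless_prompt(problem: str) -> str:
    # Tier 1: only the natural-language problem statement.
    return f"Write verified Dafny code for the following problem:\n{problem}"

def signature_prompt(problem: str, signature: str) -> str:
    # Tier 2: add a structural anchor (the target method signature).
    return (contextless_prompt(problem)
            + f"\nUse exactly this method signature:\n{signature}")

def self_healing_prompt(problem: str, signature: str, dafny_errors: str) -> str:
    # Tier 3: feed back verifier errors from the previous attempt.
    return (signature_prompt(problem, signature)
            + f"\nYour previous attempt failed verification with:\n{dafny_errors}"
            + "\nRepair the implementation and annotations.")

p = signature_prompt("Return the maximum element of a non-empty array.",
                     "method Max(a: array<int>) returns (m: int)")
print("method Max" in p)  # True
```

Each tier strictly extends the previous one, which matches the paper's framing of signatures and verifier feedback as incremental context added on top of the bare problem statement.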

Core claim

By supplying structural signatures of the target code and allowing models to iteratively repair outputs using feedback from the Dafny verifier, open-weight LLMs can produce both implementations and annotations that satisfy formal proofs on the NL2VC-60 problems, with functional correctness further ensured by uDebug validation; contextless prompting yields near-zero success while the combined strategy yields rates above 80 percent for several models.

What carries the argument

The tiered prompting strategy of contextless prompts, signature prompts supplying structural anchors, and self-healing prompts that incorporate iterative feedback from the Dafny verifier, applied to the NL2VC-60 dataset and cross-checked with uDebug functional tests.
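The self-healing component of that strategy amounts to a generate-verify-repair loop. A minimal sketch, assuming a pluggable model call and verifier (the names `generate`, `verify`, and `max_rounds` are stand-ins, not the paper's interface):

```python
# Minimal sketch of a self-healing loop: generate, verify, feed errors back.
# `generate` stands in for an LLM call and `verify` for the Dafny verifier;
# neither name nor the round limit comes from the paper.

def self_heal(generate, verify, prompt: str, max_rounds: int = 3):
    """Iteratively repair a candidate until the verifier accepts it."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(prompt + feedback)
        ok, errors = verify(candidate)
        if ok:
            return candidate                         # verifier accepted
        feedback = f"\nVerifier errors:\n{errors}"   # loop with feedback
    return None                                      # give up after max_rounds

# Toy demonstration: the "model" succeeds only once it sees feedback.
attempts = []
def toy_generate(p):
    attempts.append(p)
    return "fixed" if "Verifier errors" in p else "broken"
def toy_verify(code):
    return (code == "fixed", "assertion might not hold")

result = self_heal(toy_generate, toy_verify, "prove Max correct")
print(result)  # fixed
```

The loop terminates either with a verifier-accepted candidate or with `None`, which is why uDebug's functional tests still matter: acceptance alone cannot rule out a trivially satisfiable specification.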

If this is right

  • Contextless prompting produces near-universal failure on verifiable code generation.
  • Signature and self-healing prompting produce large gains, reaching 90.91 percent verification success for Gemma 4-31B and 81.82 percent for GPT-OSS 120B.
  • Open-weight LLMs become practical tools for synthesizing the annotations needed for high-assurance software.
  • Formal verification can be integrated directly into the natural-language-to-code workflow rather than applied after the fact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams without formal-methods expertise could use the same loop to draft reliable implementations for safety-critical modules.
  • The method might extend to larger codebases if the prompting patterns generalize beyond single-algorithm problems.
  • Because the models are open-weight, the approach could be deployed locally or fine-tuned on organization-specific verification tasks.

Load-bearing premise

The 11 randomly selected problems represent the general class of complex algorithmic tasks, and uDebug tests combined with Dafny proofs reliably exclude trivial or vacuous specifications.

What would settle it

A new collection of algorithmic problems on which even signature and self-healing prompts produce low verification rates, or cases where Dafny accepts the output yet uDebug reveals functional errors.

read the original abstract

Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the NL2VC-60 dataset of 60 complex algorithmic problems and evaluates seven open-weight LLMs on 11 randomly selected problems using tiered prompting (contextless, signature prompts providing structural anchors, and self-healing with iterative Dafny verifier feedback) combined with uDebug functional validation to prevent vacuous specifications. It reports near-zero success with contextless prompts but dramatic improvements to 90.91% verification success for Gemma 4-31B and 81.82% for GPT-OSS 120B after self-healing, concluding that formal verification is now attainable for open-weight LLMs as apprentices for high-assurance code synthesis.

Significance. If the empirical results hold under broader evaluation, the work offers a concrete demonstration that open-weight LLMs can generate Dafny-annotated code that passes both functional tests and formal proofs, with the tiered prompting and uDebug integration providing a practical mechanism to reduce hallucinations. The use of an external verifier (Dafny) and functional oracle (uDebug) to enforce correctness is a methodological strength that could inform future AI-assisted formal methods pipelines.

major comments (2)
  1. [Abstract] Abstract and results description: the central claim that 'formal verification is now attainable for open-weight LLMs' and that they serve as 'effective apprentices for synthesizing complex annotations' rests on success rates from only 11 randomly selected problems out of NL2VC-60, with no disclosed list of the problems, selection criteria, complexity classification (e.g., data structures or asymptotic bounds), or argument that the random draw avoids bias toward easier cases where signature prompts plus verifier feedback suffice.
  2. [Results and evaluation description] Evaluation methodology (as described in the abstract and results): the paper invokes uDebug to rule out vacuous specifications but provides no details on the test suites used, how they were constructed, or statistical significance of the reported rates (e.g., confidence intervals or exact problem selection process), leaving the 'dramatic performance turnaround' claim only partially supported and vulnerable to selection bias.
minor comments (2)
  1. [Abstract] The abstract and main text would benefit from an explicit link or repository reference to the NL2VC-60 dataset and any accompanying code or prompts for reproducibility.
  2. [Throughout] Notation for model names (e.g., 'Gemma 4-31B', 'GPT-OSS 120B') and the exact definition of 'self-healing prompts' could be clarified with a brief example or table in the methods section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. These have prompted us to enhance the transparency of our problem selection process and evaluation methodology. We respond to each major comment below and have incorporated the suggested clarifications into the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results description: the central claim that 'formal verification is now attainable for open-weight LLMs' and that they serve as 'effective apprentices for synthesizing complex annotations' rests on success rates from only 11 randomly selected problems out of NL2VC-60, with no disclosed list of the problems, selection criteria, complexity classification (e.g., data structures or asymptotic bounds), or argument that the random draw avoids bias toward easier cases where signature prompts plus verifier feedback suffice.

    Authors: We agree that additional details on the selected problems are necessary to fully support our claims. In the revised manuscript, we have added a dedicated table in the results section that lists all 11 problems, their domains (e.g., sorting, graph algorithms, dynamic programming), and asymptotic complexity bounds. The selection process is now explicitly described as uniform random sampling from the NL2VC-60 dataset using a fixed seed for reproducibility. We further argue that NL2VC-60 was designed to span a representative range of algorithmic difficulties, and the random draw therefore does not systematically favor easier instances. These changes provide concrete evidence that the observed success rates are not an artifact of problem selection. revision: yes

  2. Referee: [Results and evaluation description] Evaluation methodology (as described in the abstract and results): the paper invokes uDebug to rule out vacuous specifications but provides no details on the test suites used, how they were constructed, or statistical significance of the reported rates (e.g., confidence intervals or exact problem selection process), leaving the 'dramatic performance turnaround' claim only partially supported and vulnerable to selection bias.

    Authors: We have substantially expanded the evaluation methodology section. For each of the 11 problems we now detail the uDebug test suites, which were constructed from the canonical inputs provided in the original problem statements and augmented with systematically generated edge cases and boundary conditions to ensure non-vacuous specifications. We report the number of tests per problem and their coverage metrics. In addition, we have added 95% binomial confidence intervals for all reported success rates to quantify statistical uncertainty given the sample size. The random selection procedure is stated with the exact seed and sampling method. These revisions directly address concerns about transparency and strengthen the empirical foundation of the performance claims. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical measurements against external verifiers

full rationale

The paper presents success rates obtained by executing LLMs on 11 problems from the NL2VC-60 dataset under tiered prompting, with outcomes checked by the independent Dafny verifier and uDebug functional tests. No equations, fitted parameters, predictions, or first-principles derivations appear that could reduce to the inputs by construction. The reported percentages (e.g., Gemma 4-31B at 90.91%) are direct experimental counts, not self-defined quantities. No self-citations are invoked as load-bearing uniqueness theorems, and the methodology relies on external tools rather than internal redefinitions. Methodological limitations around sample size are validity concerns, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that Dafny and uDebug function as reliable external oracles and that the selected problems test genuine complexity rather than prompt-specific artifacts.

axioms (2)
  • domain assumption Dafny verifier produces sound proofs for the generated specifications and implementations
    Invoked throughout the evaluation pipeline to certify correctness.
  • domain assumption uDebug platform supplies independent functional tests that detect vacuous specifications
    Used explicitly to address the vacuous verification problem mentioned in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1336 out tokens · 28473 ms · 2026-05-08T11:15:40.229987+00:00 · methodology

