Recognition: no theorem link
Natural Language based Specification and Verification
Pith reviewed 2026-05-13 01:19 UTC · model grok-4.3
The pith
Large language models can generate natural language specifications and use them to verify code implementations in a compositional manner.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs can serve both as generators of natural language specifications that describe desired program behavior and as compositional verifiers that check implementations against those specifications by breaking the verification task into manageable subproblems; early experiments indicate this pipeline produces useful results.
What carries the argument
Compositional verification driven by natural language specifications, where the LLM decomposes a high-level description into checks on individual functions or modules and confirms each satisfies its part of the overall spec.
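The compositional scheme described above can be sketched as a small loop: decompose the natural-language spec into per-unit obligations, ask a verifier model about each unit, and combine the verdicts. Everything below (`ask_llm`, its toy heuristic, the example units) is illustrative and not from the paper; a real pipeline would replace the stub with an actual model call.

```python
def ask_llm(prompt: str) -> str:
    """Stub for an LLM call; a real pipeline would query a model here."""
    # Toy heuristic for illustration only: accept any unit whose code
    # mentions a length check.
    return "PASS" if "len(" in prompt else "FAIL"

def decompose(spec: str, units: dict[str, str]) -> dict[str, str]:
    """Assign each code unit the part of the spec it must satisfy.
    A real system would have the LLM produce these per-unit obligations."""
    return {name: f"{spec} (as it applies to `{name}`)" for name in units}

def verify(spec: str, units: dict[str, str]) -> dict[str, bool]:
    """Check each unit against its obligation; the overall verdict is
    the conjunction of per-unit verdicts."""
    obligations = decompose(spec, units)
    verdicts = {}
    for name, code in units.items():
        prompt = (f"Specification: {obligations[name]}\n"
                  f"Implementation:\n{code}\n"
                  "Answer PASS or FAIL.")
        verdicts[name] = ask_llm(prompt) == "PASS"
    return verdicts

units = {
    "read_input": "def read_input(buf, n):\n    assert len(buf) >= n\n    ...",
    "process":    "def process(data):\n    ...",
}
print(verify("No function may read past the end of its buffer.", units))
```

The point of the sketch is the shape of the argument, not the stubbed verdicts: soundness of the whole rests entirely on the per-unit LLM answers and on the decomposition step.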
If this is right
- Programmers could express security and correctness requirements in everyday language rather than mathematical logic.
- Verification efforts could scale to larger codebases by handling one module or function at a time.
- Vulnerabilities introduced during LLM-assisted code generation could be detected before the code is deployed.
- Development teams might adopt verification more readily because the specification step no longer requires specialized training.
Where Pith is reading between the lines
- This style of verification might eventually be combined with formal tools by translating the natural language spec into a machine-checkable form when higher assurance is required.
- Success would depend on how well the method handles ambiguity that is common in natural language but fatal in verification contexts.
- Further testing on complete open-source projects would reveal whether the compositional breakdown remains reliable as system size grows.
Load-bearing premise
Large language models can produce accurate natural language specifications and carry out sound compositional verification without needing the strict syntax of formal languages or extra human review to correct errors.
What would settle it
A concrete counterexample would be a collection of real programs where an LLM-generated natural language specification leads the verification process to accept code that contains security vulnerabilities or to reject correct implementations.
Original abstract
Recent frontier large language models (LLMs) have shown strong performance in identifying security vulnerabilities in large, mature open-source systems. As LLM-generated code becomes increasingly common, a natural goal is to prevent such models from producing vulnerable implementations in the first place. Formal verification offers a principled route to this objective, but existing verification pipelines typically require specifications written in rigid formal languages. Prior work has explored using LLMs to synthesize such specifications, with limited success. In this paper, we investigate a different approach: using LLMs both to generate specifications and to verify implementations compositionally when the specifications are expressed in natural language. Our preliminary results suggest that this approach is promising.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using LLMs both to generate natural-language specifications for code and to perform compositional verification of implementations against those specifications, as an alternative to formal languages or prior LLM synthesis of rigid specs. It claims that preliminary results indicate the approach is promising for preventing vulnerabilities in LLM-generated code.
Significance. If the approach can be shown to work reliably, it would lower the barrier to verification by avoiding formal specification languages, potentially enabling scalable, automated checks during LLM-assisted development. The idea of compositional natural-language verification is novel in intent but requires substantial empirical grounding to assess its practical significance.
Major comments (2)
- Abstract: The central claim that 'preliminary results suggest that this approach is promising' is unsupported, as the abstract (and by extension the manuscript) provides no data, evaluation metrics, concrete examples of generated specs or verification outcomes, or methodology details. This prevents assessment of whether LLM-generated natural-language specs are accurate or whether compositional verification is sound.
- Abstract / Introduction: The approach treats LLM outputs for both specification generation and verification as authoritative without formal semantics, human oversight, or cross-checks against executable oracles. Natural language ambiguity and potential LLM inconsistencies could silently invalidate any link in the chain (spec generation, decomposition, per-component verdict), yet no mechanism to detect or mitigate unsoundness is described.
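One concrete form the missing cross-check could take is comparing the LLM's verdict against a cheap executable oracle derived from the spec, surfacing disagreement instead of failing silently. The sketch below is hypothetical (the paper describes no such mechanism); the spec "sorting returns a sorted copy" is encoded as fixed test vectors.

```python
# Fixed inputs standing in for an executable reading of the spec.
TEST_VECTORS = [[], [1], [3, 1, 2], [5, 5, 1], [-2, 7, 0, 7]]

def oracle(impl) -> bool:
    """Executable stand-in for the spec: output equals the sorted input."""
    return all(impl(list(xs)) == sorted(xs) for xs in TEST_VECTORS)

def cross_check(llm_verdict: bool, impl) -> str:
    """Trust the LLM verdict only when the oracle agrees; otherwise escalate."""
    if llm_verdict == oracle(impl):
        return "agree"
    return "disagree: escalate to human review"

buggy_sort = lambda xs: xs           # returns input unsorted
good_sort = lambda xs: sorted(xs)

print(cross_check(True, good_sort))   # oracle confirms the acceptance
print(cross_check(True, buggy_sort))  # oracle catches the bad acceptance
```

Such an oracle is far weaker than the spec it approximates, but even a weak executable check bounds how badly a silent verifier failure can go unnoticed.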
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for clarification and improvement in our manuscript on natural language-based specification and verification. We address each major comment point by point below and describe the revisions we will make.
Point-by-point responses
- Referee: Abstract: The central claim that 'preliminary results suggest that this approach is promising' is unsupported, as the abstract (and by extension the manuscript) provides no data, evaluation metrics, concrete examples of generated specs or verification outcomes, or methodology details. This prevents assessment of whether LLM-generated natural-language specs are accurate or whether compositional verification is sound.
Authors: We agree that the abstract is too high-level and does not provide enough concrete information to substantiate the claim. While the full manuscript describes our methodology for LLM-based spec generation and compositional verification along with initial experimental outcomes, these details are not summarized in the abstract. We will revise the abstract to include specific examples of generated natural-language specifications, sample verification verdicts, and key preliminary metrics (such as agreement rates between LLM verifiers and ground-truth checks on small codebases). This will make the support for our claim explicit and allow readers to assess the accuracy and soundness of the approach. revision: yes
- Referee: Abstract / Introduction: The approach treats LLM outputs for both specification generation and verification as authoritative without formal semantics, human oversight, or cross-checks against executable oracles. Natural language ambiguity and potential LLM inconsistencies could silently invalidate any link in the chain (spec generation, decomposition, per-component verdict), yet no mechanism to detect or mitigate unsoundness is described.
Authors: We acknowledge this as a substantive limitation of relying on natural language and LLM reasoning without formal grounding. Our preliminary exploration prioritizes accessibility over formal guarantees, but we did not detail mechanisms to address ambiguity or inconsistency in the current version. In the revision, we will expand the discussion to explicitly address these risks, including potential silent failures in the verification chain. We will also outline mitigation strategies such as cross-verification via multiple independent LLM queries for consensus, optional human review for high-stakes components, and hybrid checks against simple executable oracles where feasible. These additions will clarify the current scope and limitations without overclaiming soundness. revision: yes
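The consensus mitigation proposed in the response above can be sketched as repeated independent queries with a vote threshold. `sample_verdict` is a stub standing in for one LLM call; its behavior is invented purely so the example runs deterministically.

```python
from collections import Counter

def sample_verdict(question: str, seed: int) -> str:
    """Stub for one independent LLM query; varies by seed to mimic sampling."""
    # Invented behavior: three of five simulated runs answer PASS.
    return "PASS" if seed % 3 != 0 else "FAIL"

def consensus(question: str, runs: int = 5, threshold: float = 0.8) -> str:
    """Accept a verdict only when a clear majority of runs agree;
    otherwise escalate (e.g. to human review)."""
    votes = Counter(sample_verdict(question, seed) for seed in range(runs))
    top, count = votes.most_common(1)[0]
    if count / runs >= threshold:
        return top
    return "NO CONSENSUS: escalate"

print(consensus("Does `copy_bytes` respect the stated bounds?", runs=5))
```

A 3-of-5 split falls below the 0.8 threshold and escalates; lowering the threshold trades review burden for more silent acceptances, which is exactly the knob a deployment would have to justify.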
Circularity Check
No circularity: empirical suggestion without derivations or self-referential chains
Full rationale
The paper advances an empirical proposal to use LLMs for natural-language specification generation and compositional verification, supported only by preliminary results. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central claim does not reduce to its own inputs by construction; it rests on external experimental outcomes rather than any definitional or citation-based loop. Self-citation is absent from the abstract and description, and the approach is presented as an investigation rather than a closed-form derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In 63rd Annual Meeting of the Association for Computational Linguistics, pages 3639–3664, 2025.
- [2] Thomas Ball, Byron Cook, Vladimir Levin, and Sriram K. Rajamani. SLAM and Static Driver Verifier: Technology transfer of formal methods inside Microsoft. In International Conference on Integrated Formal Methods, pages 1–20. Springer, 2004.
- [3] Debangshu Banerjee, Olivier Bouissou, and Stefan Zetzsche. DafnyPro: LLM-assisted automated verification for Dafny programs. arXiv preprint arXiv:2601.05385, 2026.
- [4] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala, and Rupak Majumdar. The software model checker BLAST: Applications to software engineering. International Journal on Software Tools for Technology Transfer, 9(5):505–525, 2007.
- [5] Dirk Beyer, Thomas A. Henzinger, and Grégory Théoduloz. Configurable software verification: Concretizing the convergence of model checking and program analysis. In International Conference on Computer Aided Verification, pages 504–518. Springer, 2007.
- [6] Saikat Chakraborty, Shuvendu Lahiri, Sarah Fakhoury, Akash Lal, Madanlal Musuvathi, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, and Nikhil Swamy. Ranking LLM-generated loop invariants for program verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9164–9175, 2023.
- [7] Edmund M. Clarke and Jeannette M. Wing. Formal methods: State of the art and future directions. ACM Computing Surveys (CSUR), 28(4):626–643, 1996.
- [8] Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. Can large language models transform natural language intent into formal method postconditions? Proc. ACM Softw. Eng., 1(FSE), July 2024.
- [9] Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1-3):35–45, 2007.
- [10] Cormac Flanagan and K. Rustan M. Leino. Houdini, an annotation assistant for ESC/Java. In International Symposium of Formal Methods Europe, pages 500–517. Springer, 2001.
- [11] Matthias Heizmann, Jochen Hoenicke, and Andreas Podelski. Refinement of trace abstraction. In Jens Palsberg and Zhendong Su, editors, Static Analysis, 16th International Symposium, SAS 2009, Los Angeles, CA, USA, August 9-11, 2009. Proceedings, volume 5673 of Lecture Notes in Computer Science, pages 69–85. Springer, 2009.
- [12] C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969.
- [13] Adharsh Kamath, Nausheen Mohammed, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma. Leveraging LLMs for program verification. In 2024 Formal Methods in Computer-Aided Design (FMCAD), pages 107–118. IEEE, 2024.
- [14] Daniel Kroening and Michael Tautschnig. CBMC – C Bounded Model Checker (competition contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 389–391. Springer, 2014.
- [15]
- [16] Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. SpecGen: Automated generation of formal program specifications via large language models. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 16–28. IEEE, 2025.
- [17] Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, and Yuanyuan Zhou. Laurel: Unblocking automated verification with large language models. Proceedings of the ACM on Programming Languages, 9(OOPSLA1):1519–1545, 2025.
- [18] João Pascoal Faria, Emanuel Trigo, and Rui Abreu. Automatic generation of loop invariants in Dafny with large language models. In International Conference on Fundamentals of Software Engineering, pages 138–154. Springer, 2025.
- [19] Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. Can large language models reason about program invariants? In International Conference on Machine Learning, pages 27496–27520. PMLR, 2023.
- [20] Patrick M. Rondon, Ming Kawaguchi, and Ranjit Jhala. Liquid types. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 159–169, 2008.
- [21] Álvaro Silva, Alexandra Mendes, and Ruben Martins. Inferring multiple helper Dafny assertions with LLMs, 2025.
- [22] Jiri Slaby, Jan Strejček, and Marek Trtík. Symbiotic: Synergy of instrumentation, slicing, and symbolic execution (competition contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 630–632. Springer, 2013.
- [23] Oren Sultan, Jordi Armengol-Estapé, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, and Peter O'Hearn. LLMs versus the halting problem: Revisiting program termination prediction. arXiv preprint arXiv:2601.18987, 2026.
- [24] Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. Enchanting program specification synthesis by large language models using static analysis and program verification. In International Conference on Computer Aided Verification, pages 302–328. Springer, 2024.
- [25] Haoze Wu, Clark Barrett, and Nina Narodytska. Lemur: Integrating large language models in automated program verification. In The Twelfth International Conference on Learning Representations, 2024.
- [26] Xu Xu, Xin Li, Xingwei Qu, Jie Fu, and Binhang Yuan. Local success does not compose: Benchmarking large language models for compositional formal verification. arXiv preprint arXiv:2509.23061, 2025.