pith. machine review for the scientific record.

arxiv: 2605.11315 · v1 · submitted 2026-05-11 · 💻 cs.SE · cs.AI · cs.CR

Recognition: no theorem link

Natural Language based Specification and Verification

Chengyu Song, Zhaorui Li

Pith reviewed 2026-05-13 01:19 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR
keywords natural language specifications · LLM-based verification · compositional verification · software security · code correctness · specification generation · formal methods alternative

The pith

Large language models can generate natural language specifications and use them to verify code implementations in a compositional manner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores a shift away from rigid formal specification languages by letting LLMs both write requirements in plain English and check whether code satisfies those requirements by verifying smaller pieces separately. This matters because formal verification has been difficult to apply at scale to large, real-world systems, especially to code produced by the LLMs themselves. If the approach holds, it offers a more accessible way to prevent security problems in generated software without requiring developers to master logical notations. The authors present preliminary evidence that the method can work on practical examples.

Core claim

LLMs can serve both as generators of natural language specifications that describe desired program behavior and as compositional verifiers that check implementations against those descriptions by breaking verification tasks into manageable subproblems; early experiments indicate this pipeline produces useful results.

What carries the argument

Compositional verification driven by natural language specifications, where the LLM decomposes a high-level description into checks on individual functions or modules and confirms each satisfies its part of the overall spec.
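The shape of such a pass can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the names (`UnitSpec`, `verify_compositionally`) are invented, and the per-unit `check` oracle stands in for an LLM call, replaced here by a toy string heuristic so the example runs deterministically.

```python
# Sketch: compositional verification driven by per-unit natural language
# specs. Each unit is checked against only its own obligation; the whole
# program is accepted only if every per-unit check passes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class UnitSpec:
    unit: str          # function or module name
    requirement: str   # natural language obligation for this unit

def verify_compositionally(
    specs: list[UnitSpec],
    source: dict[str, str],
    check: Callable[[str, str], bool],  # (code, requirement) -> verdict
) -> dict[str, bool]:
    """Map each unit to the verdict of its own check."""
    return {s.unit: check(source[s.unit], s.requirement) for s in specs}

# Toy oracle standing in for an LLM verifier: accepts code that contains
# a bounds check whenever the requirement mentions one.
def toy_check(code: str, requirement: str) -> bool:
    return "len(" in code if "bounds" in requirement else True

specs = [
    UnitSpec("read_item", "must perform a bounds check before indexing"),
    UnitSpec("format_item", "must return a string"),
]
source = {
    "read_item": "def read_item(xs, i):\n    return xs[i] if i < len(xs) else None",
    "format_item": "def format_item(x):\n    return str(x)",
}
verdicts = verify_compositionally(specs, source, toy_check)
print(verdicts)
```

The decomposition step itself (splitting a high-level spec into `UnitSpec`s) is where the paper places the LLM; the sketch assumes that step has already produced the per-unit requirements.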

If this is right

  • Programmers could express security and correctness requirements in everyday language rather than mathematical logic.
  • Verification efforts could scale to larger codebases by handling one module or function at a time.
  • Vulnerabilities introduced during LLM-assisted code generation could be detected before the code is deployed.
  • Development teams might adopt verification more readily because the specification step no longer requires specialized training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This style of verification might eventually be combined with formal tools by translating the natural language spec into a machine-checkable form when higher assurance is required.
  • Success would depend on how well the method handles ambiguity that is common in natural language but fatal in verification contexts.
  • Further testing on complete open-source projects would reveal whether the compositional breakdown remains reliable as system size grows.

Load-bearing premise

Large language models can produce accurate natural language specifications and carry out sound compositional verification without needing the strict syntax of formal languages or extra human review to correct errors.

What would settle it

A concrete counterexample would be a collection of real programs where an LLM-generated natural language specification leads the verification process to accept code that contains security vulnerabilities or to reject correct implementations.

Figures

Figures reproduced from arXiv: 2605.11315 by Chengyu Song, Zhaorui Li.

Figure 1
Figure 1: Overview of the NLForge pipeline. The system first extracts program structure and then … [figure image omitted]
Original abstract

Recent frontier large language models (LLMs) have shown strong performance in identifying security vulnerabilities in large, mature open-source systems. As LLM-generated code becomes increasingly common, a natural goal is to prevent such models from producing vulnerable implementations in the first place. Formal verification offers a principled route to this objective, but existing verification pipelines typically require specifications written in rigid formal languages. Prior work has explored using LLMs to synthesize such specifications, with limited success. In this paper, we investigate a different approach: using LLMs both to generate specifications and to verify implementations compositionally when the specifications are expressed in natural language. Our preliminary results suggest that this approach is promising.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes using LLMs both to generate natural-language specifications for code and to perform compositional verification of implementations against those specifications, as an alternative to formal languages or prior LLM synthesis of rigid specs. It claims that preliminary results indicate the approach is promising for preventing vulnerabilities in LLM-generated code.

Significance. If the approach can be shown to work reliably, it would lower the barrier to verification by avoiding formal specification languages, potentially enabling scalable, automated checks during LLM-assisted development. The idea of compositional natural-language verification is novel in intent but requires substantial empirical grounding to assess its practical significance.

major comments (2)
  1. Abstract: The central claim that 'preliminary results suggest that this approach is promising' is unsupported, as the abstract (and by extension the manuscript) provides no data, evaluation metrics, concrete examples of generated specs or verification outcomes, or methodology details. This prevents assessment of whether LLM-generated natural-language specs are accurate or whether compositional verification is sound.
  2. Abstract / Introduction: The approach treats LLM outputs for both specification generation and verification as authoritative without formal semantics, human oversight, or cross-checks against executable oracles. Natural language ambiguity and potential LLM inconsistencies could silently invalidate any link in the chain (spec generation, decomposition, per-component verdict), yet no mechanism to detect or mitigate unsoundness is described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and improvement in our manuscript on natural language-based specification and verification. We address each major comment point by point below and describe the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: The central claim that 'preliminary results suggest that this approach is promising' is unsupported, as the abstract (and by extension the manuscript) provides no data, evaluation metrics, concrete examples of generated specs or verification outcomes, or methodology details. This prevents assessment of whether LLM-generated natural-language specs are accurate or whether compositional verification is sound.

    Authors: We agree that the abstract is too high-level and does not provide enough concrete information to substantiate the claim. While the full manuscript describes our methodology for LLM-based spec generation and compositional verification along with initial experimental outcomes, these details are not summarized in the abstract. We will revise the abstract to include specific examples of generated natural-language specifications, sample verification verdicts, and key preliminary metrics (such as agreement rates between LLM verifiers and ground-truth checks on small codebases). This will make the support for our claim explicit and allow readers to assess the accuracy and soundness of the approach. revision: yes

  2. Referee: Abstract / Introduction: The approach treats LLM outputs for both specification generation and verification as authoritative without formal semantics, human oversight, or cross-checks against executable oracles. Natural language ambiguity and potential LLM inconsistencies could silently invalidate any link in the chain (spec generation, decomposition, per-component verdict), yet no mechanism to detect or mitigate unsoundness is described.

    Authors: We acknowledge this as a substantive limitation of relying on natural language and LLM reasoning without formal grounding. Our preliminary exploration prioritizes accessibility over formal guarantees, but we did not detail mechanisms to address ambiguity or inconsistency in the current version. In the revision, we will expand the discussion to explicitly address these risks, including potential silent failures in the verification chain. We will also outline mitigation strategies such as cross-verification via multiple independent LLM queries for consensus, optional human review for high-stakes components, and hybrid checks against simple executable oracles where feasible. These additions will clarify the current scope and limitations without overclaiming soundness. revision: yes
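The consensus mitigation the rebuttal outlines can be made concrete with a small voting scheme. This is a hedged sketch under stated assumptions: `consensus_verdict` and the stub `verifiers` are invented for illustration, with each verifier standing in for an independent LLM query.

```python
# Sketch: accept a verification verdict only when a majority of
# independent verifiers agree; otherwise return None to flag the
# component for human review rather than trusting a split vote.
from collections import Counter

def consensus_verdict(verifiers, code: str, spec: str, quorum: float = 0.5):
    """Return (verdict, agreement): the majority vote and the fraction
    of verifiers that cast it, or (None, agreement) below quorum."""
    votes = [v(code, spec) for v in verifiers]
    verdict, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (verdict, agreement) if agreement > quorum else (None, agreement)

# Stub verifiers with one dissenting vote.
verifiers = [lambda c, s: True, lambda c, s: True, lambda c, s: False]
result = consensus_verdict(verifiers, "some code", "some spec")
print(result)
```

Majority voting reduces variance from any single inconsistent query but cannot, on its own, recover soundness: correlated errors across verifiers (e.g., a shared misreading of an ambiguous spec) pass the quorum unchallenged.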

Circularity Check

0 steps flagged

No circularity: empirical suggestion without derivations or self-referential chains

Full rationale

The paper advances an empirical proposal to use LLMs for natural-language specification generation and compositional verification, supported only by preliminary results. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central claim does not reduce to its own inputs by construction; it rests on external experimental outcomes rather than any definitional or citation-based loop. Self-citation is absent from the abstract and description, and the approach is presented as an investigation rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work rests on unstated assumptions about LLM capabilities for specification and verification tasks.

pith-pipeline@v0.9.0 · 5396 in / 987 out tokens · 28408 ms · 2026-05-13T01:19:22.363462+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In 63rd Annual Meeting of the Association for Computational Linguistics, pages 3639–3664, 2025

  2. [2]

    Slam and static driver verifier: Technology transfer of formal methods inside microsoft

    Thomas Ball, Byron Cook, Vladimir Levin, and Sriram K Rajamani. Slam and static driver verifier: Technology transfer of formal methods inside microsoft. In International Conference on Integrated Formal Methods, pages 1–20. Springer, 2004

  3. [3]

    Dafnypro: Llm-assisted automated verification for dafny programs

    Debangshu Banerjee, Olivier Bouissou, and Stefan Zetzsche. Dafnypro: Llm-assisted automated verification for dafny programs. arXiv preprint arXiv:2601.05385, 2026

  4. [4]

    The software model checker blast: Applications to software engineering

    Dirk Beyer, Thomas A Henzinger, Ranjit Jhala, and Rupak Majumdar. The software model checker blast: Applications to software engineering. International Journal on Software Tools for Technology Transfer, 9(5):505–525, 2007

  5. [5]

    Configurable software verification: Concretizing the convergence of model checking and program analysis

    Dirk Beyer, Thomas A Henzinger, and Grégory Théoduloz. Configurable software verification: Concretizing the convergence of model checking and program analysis. In International Conference on Computer Aided Verification, pages 504–518. Springer, 2007

  6. [6]

    Ranking llm-generated loop invariants for program verification

    Saikat Chakraborty, Shuvendu Lahiri, Sarah Fakhoury, Akash Lal, Madanlal Musuvathi, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, and Nikhil Swamy. Ranking llm-generated loop invariants for program verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9164–9175, 2023

  7. [7]

    Formal methods: State of the art and future directions

    Edmund M Clarke and Jeannette M Wing. Formal methods: State of the art and future directions. ACM Computing Surveys (CSUR), 28(4):626–643, 1996

  8. [8]

    Can large language models transform natural language intent into formal method postconditions?

    Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. Can large language models transform natural language intent into formal method postconditions? Proc. ACM Softw. Eng., 1(FSE), July 2024

  9. [9]

    The daikon system for dynamic detection of likely invariants

    Michael D Ernst, Jeff H Perkins, Philip J Guo, Stephen McCamant, Carlos Pacheco, Matthew S Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants. Science of computer programming, 69(1-3):35–45, 2007

  10. [10]

    Houdini, an annotation assistant for esc/java

    Cormac Flanagan and K Rustan M Leino. Houdini, an annotation assistant for esc/java. In International symposium of formal methods Europe, pages 500–517. Springer, 2001

  11. [11]

    Refinement of trace abstraction

    Matthias Heizmann, Jochen Hoenicke, and Andreas Podelski. Refinement of trace abstraction. In Jens Palsberg and Zhendong Su, editors, Static Analysis, 16th International Symposium, SAS 2009, Los Angeles, CA, USA, August 9-11, 2009. Proceedings, volume 5673 of Lecture Notes in Computer Science, pages 69–85. Springer, 2009

  12. [12]

    An axiomatic basis for computer programming

    C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969

  13. [13]

    Leveraging llms for program verification

    Adharsh Kamath, Nausheen Mohammed, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma. Leveraging llms for program verification. In 2024 Formal Methods in Computer-Aided Design (FMCAD), pages 107–118. IEEE, 2024

  14. [14]

    Cbmc–c bounded model checker: (competition contribution)

    Daniel Kroening and Michael Tautschnig. Cbmc–c bounded model checker: (competition contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 389–391. Springer, 2014

  15. [15]

    Dafny: An automatic program verifier for functional correctness

    K. Rustan M. Leino. Dafny: An automatic program verifier for functional correctness. In International Conference on Logic for Programming Artificial Intelligence and Reasoning, pages 348–370. Springer, 2010

  16. [16]

    Specgen: Automated generation of formal program specifications via large language models

    Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. Specgen: Automated generation of formal program specifications via large language models. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 16–28. IEEE, 2025

  17. [17]

    Laurel: Unblocking automated verification with large language models

    Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, and Zhou Yuanyuan. Laurel: Unblocking automated verification with large language models. Proceedings of the ACM on Programming Languages, 9(OOPSLA1):1519–1545, 2025

  18. [18]

    Automatic generation of loop invariants in dafny with large language models

    João Pascoal Faria, Emanuel Trigo, and Rui Abreu. Automatic generation of loop invariants in dafny with large language models. In International Conference on Fundamentals of Software Engineering, pages 138–154. Springer, 2025

  19. [19]

    Can large language models reason about program invariants?

    Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. Can large language models reason about program invariants? In International Conference on Machine Learning, pages 27496–27520. PMLR, 2023

  20. [20]

    Liquid types

    Patrick M Rondon, Ming Kawaguchi, and Ranjit Jhala. Liquid types. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 159–169, 2008

  21. [21]

    Inferring multiple helper dafny assertions with llms, 2025

    Álvaro Silva, Alexandra Mendes, and Ruben Martins. Inferring multiple helper dafny assertions with llms, 2025

  22. [22]

    Symbiotic: Synergy of instrumentation, slicing, and symbolic execution: (competition contribution)

    Jiri Slaby, Jan Strejček, and Marek Trtík. Symbiotic: Synergy of instrumentation, slicing, and symbolic execution: (competition contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 630–632. Springer, 2013

  23. [23]

    Llms versus the halting problem: Revisiting program termination prediction

    Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, and Peter O’Hearn. Llms versus the halting problem: Revisiting program termination prediction. arXiv preprint arXiv:2601.18987, 2026

  24. [24]

    Enchanting program specification synthesis by large language models using static analysis and program verification

    Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. Enchanting program specification synthesis by large language models using static analysis and program verification. In International Conference on Computer Aided Verification, pages 302–328. Springer, 2024

  25. [25]

    Lemur: Integrating large language models in automated program verification

    Haoze Wu, Clark Barrett, and Nina Narodytska. Lemur: Integrating large language models in automated program verification. In The Twelfth International Conference on Learning Representations, 2024

  26. [26]

    Local success does not compose: Benchmarking large language models for compositional formal verification

    Xu Xu, Xin Li, Xingwei Qu, Jie Fu, and Binhang Yuan. Local success does not compose: Benchmarking large language models for compositional formal verification. arXiv preprint arXiv:2509.23061, 2025