pith. machine review for the scientific record.

arxiv: 2605.11315 · v1 · submitted 2026-05-11 · 💻 cs.SE · cs.AI · cs.CR

Recognition: no theorem link

Natural Language based Specification and Verification

Chengyu Song, Zhaorui Li

Pith reviewed 2026-05-13 01:19 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR
keywords natural language specifications · LLM-based verification · compositional verification · software security · code correctness · specification generation · formal methods alternative

The pith

Large language models can generate natural language specifications and use them to verify code implementations in a compositional manner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores a shift away from rigid formal specification languages by letting LLMs both write requirements in plain English and check whether code satisfies those requirements by verifying smaller pieces separately. This matters because formal verification has been difficult to apply at scale to large, real-world systems, especially to code produced by the LLMs themselves. If the approach holds, it offers a more accessible way to prevent security problems in generated software without requiring developers to master logical notations. The authors present preliminary evidence that the method can work on practical examples.

Core claim

LLMs can serve both as generators of natural language specifications that describe desired program behavior and as compositional verifiers that check implementations against those descriptions by breaking verification tasks into manageable subproblems; early experiments indicate this pipeline produces useful results.

What carries the argument

Compositional verification driven by natural language specifications, where the LLM decomposes a high-level description into checks on individual functions or modules and confirms each satisfies its part of the overall spec.
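The shape of such a pass can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the names (`UnitSpec`, `verify_compositionally`) are invented, and the per-unit `check` oracle stands in for an LLM call, replaced here by a toy string heuristic so the example runs deterministically.

```python
# Sketch: compositional verification driven by per-unit natural language
# specs. Each unit is checked against only its own obligation; the whole
# program is accepted only if every per-unit check passes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class UnitSpec:
    unit: str          # function or module name
    requirement: str   # natural language obligation for this unit

def verify_compositionally(
    specs: list[UnitSpec],
    source: dict[str, str],
    check: Callable[[str, str], bool],  # (code, requirement) -> verdict
) -> dict[str, bool]:
    """Map each unit to the verdict of its own check."""
    return {s.unit: check(source[s.unit], s.requirement) for s in specs}

# Toy oracle standing in for an LLM verifier: accepts code that contains
# a bounds check whenever the requirement mentions one.
def toy_check(code: str, requirement: str) -> bool:
    return "len(" in code if "bounds" in requirement else True

specs = [
    UnitSpec("read_item", "must perform a bounds check before indexing"),
    UnitSpec("format_item", "must return a string"),
]
source = {
    "read_item": "def read_item(xs, i):\n    return xs[i] if i < len(xs) else None",
    "format_item": "def format_item(x):\n    return str(x)",
}
verdicts = verify_compositionally(specs, source, toy_check)
print(verdicts)
```

The decomposition step itself (splitting a high-level spec into `UnitSpec`s) is where the paper places the LLM; the sketch assumes that step has already produced the per-unit requirements.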

If this is right

  • Programmers could express security and correctness requirements in everyday language rather than mathematical logic.
  • Verification efforts could scale to larger codebases by handling one module or function at a time.
  • Vulnerabilities introduced during LLM-assisted code generation could be detected before the code is deployed.
  • Development teams might adopt verification more readily because the specification step no longer requires specialized training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This style of verification might eventually be combined with formal tools by translating the natural language spec into a machine-checkable form when higher assurance is required.
  • Success would depend on how well the method handles ambiguity that is common in natural language but fatal in verification contexts.
  • Further testing on complete open-source projects would reveal whether the compositional breakdown remains reliable as system size grows.

Load-bearing premise

Large language models can produce accurate natural language specifications and carry out sound compositional verification without needing the strict syntax of formal languages or extra human review to correct errors.

What would settle it

A concrete counterexample would be a collection of real programs where an LLM-generated natural language specification leads the verification process to accept code that contains security vulnerabilities or to reject correct implementations.

Figures

Figures reproduced from arXiv: 2605.11315 by Chengyu Song, Zhaorui Li.

Figure 1
Figure 1: Overview of the NLForge pipeline. The system first extracts program structure and then … [figure image omitted]
Original abstract

Recent frontier large language models (LLMs) have shown strong performance in identifying security vulnerabilities in large, mature open-source systems. As LLM-generated code becomes increasingly common, a natural goal is to prevent such models from producing vulnerable implementations in the first place. Formal verification offers a principled route to this objective, but existing verification pipelines typically require specifications written in rigid formal languages. Prior work has explored using LLMs to synthesize such specifications, with limited success. In this paper, we investigate a different approach: using LLMs both to generate specifications and to verify implementations compositionally when the specifications are expressed in natural language. Our preliminary results suggest that this approach is promising.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes using LLMs both to generate natural-language specifications for code and to perform compositional verification of implementations against those specifications, as an alternative to formal languages or prior LLM synthesis of rigid specs. It claims that preliminary results indicate the approach is promising for preventing vulnerabilities in LLM-generated code.

Significance. If the approach can be shown to work reliably, it would lower the barrier to verification by avoiding formal specification languages, potentially enabling scalable, automated checks during LLM-assisted development. The idea of compositional natural-language verification is novel in intent but requires substantial empirical grounding to assess its practical significance.

major comments (2)
  1. Abstract: The central claim that 'preliminary results suggest that this approach is promising' is unsupported, as the abstract (and by extension the manuscript) provides no data, evaluation metrics, concrete examples of generated specs or verification outcomes, or methodology details. This prevents assessment of whether LLM-generated natural-language specs are accurate or whether compositional verification is sound.
  2. Abstract / Introduction: The approach treats LLM outputs for both specification generation and verification as authoritative without formal semantics, human oversight, or cross-checks against executable oracles. Natural language ambiguity and potential LLM inconsistencies could silently invalidate any link in the chain (spec generation, decomposition, per-component verdict), yet no mechanism to detect or mitigate unsoundness is described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and improvement in our manuscript on natural language-based specification and verification. We address each major comment point by point below and describe the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: The central claim that 'preliminary results suggest that this approach is promising' is unsupported, as the abstract (and by extension the manuscript) provides no data, evaluation metrics, concrete examples of generated specs or verification outcomes, or methodology details. This prevents assessment of whether LLM-generated natural-language specs are accurate or whether compositional verification is sound.

    Authors: We agree that the abstract is too high-level and does not provide enough concrete information to substantiate the claim. While the full manuscript describes our methodology for LLM-based spec generation and compositional verification along with initial experimental outcomes, these details are not summarized in the abstract. We will revise the abstract to include specific examples of generated natural-language specifications, sample verification verdicts, and key preliminary metrics (such as agreement rates between LLM verifiers and ground-truth checks on small codebases). This will make the support for our claim explicit and allow readers to assess the accuracy and soundness of the approach. revision: yes

  2. Referee: Abstract / Introduction: The approach treats LLM outputs for both specification generation and verification as authoritative without formal semantics, human oversight, or cross-checks against executable oracles. Natural language ambiguity and potential LLM inconsistencies could silently invalidate any link in the chain (spec generation, decomposition, per-component verdict), yet no mechanism to detect or mitigate unsoundness is described.

    Authors: We acknowledge this as a substantive limitation of relying on natural language and LLM reasoning without formal grounding. Our preliminary exploration prioritizes accessibility over formal guarantees, but we did not detail mechanisms to address ambiguity or inconsistency in the current version. In the revision, we will expand the discussion to explicitly address these risks, including potential silent failures in the verification chain. We will also outline mitigation strategies such as cross-verification via multiple independent LLM queries for consensus, optional human review for high-stakes components, and hybrid checks against simple executable oracles where feasible. These additions will clarify the current scope and limitations without overclaiming soundness. revision: yes
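The consensus mitigation the rebuttal outlines can be made concrete with a small voting scheme. This is a hedged sketch under stated assumptions: `consensus_verdict` and the stub `verifiers` are invented for illustration, with each verifier standing in for an independent LLM query.

```python
# Sketch: accept a verification verdict only when a majority of
# independent verifiers agree; otherwise return None to flag the
# component for human review rather than trusting a split vote.
from collections import Counter

def consensus_verdict(verifiers, code: str, spec: str, quorum: float = 0.5):
    """Return (verdict, agreement): the majority vote and the fraction
    of verifiers that cast it, or (None, agreement) below quorum."""
    votes = [v(code, spec) for v in verifiers]
    verdict, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (verdict, agreement) if agreement > quorum else (None, agreement)

# Stub verifiers with one dissenting vote.
verifiers = [lambda c, s: True, lambda c, s: True, lambda c, s: False]
result = consensus_verdict(verifiers, "some code", "some spec")
print(result)
```

Majority voting reduces variance from any single inconsistent query but cannot, on its own, recover soundness: correlated errors across verifiers (e.g., a shared misreading of an ambiguous spec) pass the quorum unchallenged.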

Circularity Check

0 steps flagged

No circularity: empirical suggestion without derivations or self-referential chains

Full rationale

The paper advances an empirical proposal to use LLMs for natural-language specification generation and compositional verification, supported only by preliminary results. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central claim does not reduce to its own inputs by construction; it rests on external experimental outcomes rather than any definitional or citation-based loop. Self-citation is absent from the abstract and description, and the approach is presented as an investigation rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work rests on unstated assumptions about LLM capabilities for specification and verification tasks.

pith-pipeline@v0.9.0 · 5396 in / 987 out tokens · 28408 ms · 2026-05-13T01:19:22.363462+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In 63rd Annual Meeting of the Association for Computational Linguistics, pages 3639–3664, 2025

  2. [2]

    Slam and static driver verifier: Technology transfer of formal methods inside microsoft

    Thomas Ball, Byron Cook, Vladimir Levin, and Sriram K Rajamani. Slam and static driver verifier: Technology transfer of formal methods inside microsoft. In International Conference on Integrated Formal Methods, pages 1–20. Springer, 2004

  3. [3]

    Dafnypro: Llm-assisted automated verification for dafny programs

    Debangshu Banerjee, Olivier Bouissou, and Stefan Zetzsche. Dafnypro: Llm-assisted automated verification for dafny programs. arXiv preprint arXiv:2601.05385, 2026

  4. [4]

    The software model checker blast: Applications to software engineering

    Dirk Beyer, Thomas A Henzinger, Ranjit Jhala, and Rupak Majumdar. The software model checker blast: Applications to software engineering. International Journal on Software Tools for Technology Transfer, 9(5):505–525, 2007

  5. [5]

    Configurable software verification: Concretizing the convergence of model checking and program analysis

    Dirk Beyer, Thomas A Henzinger, and Grégory Théoduloz. Configurable software verification: Concretizing the convergence of model checking and program analysis. In International Conference on Computer Aided Verification, pages 504–518. Springer, 2007

  6. [6]

    Ranking llm-generated loop invariants for program verification

    Saikat Chakraborty, Shuvendu Lahiri, Sarah Fakhoury, Akash Lal, Madanlal Musuvathi, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, and Nikhil Swamy. Ranking llm-generated loop invariants for program verification. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9164–9175, 2023

  7. [7]

    Formal methods: State of the art and future directions

    Edmund M Clarke and Jeannette M Wing. Formal methods: State of the art and future directions. ACM Computing Surveys (CSUR), 28(4):626–643, 1996

  8. [8]

    Can large language models transform natural language intent into formal method postconditions?

    Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K. Lahiri. Can large language models transform natural language intent into formal method postconditions? Proc. ACM Softw. Eng., 1(FSE), July 2024

  9. [9]

    The daikon system for dynamic detection of likely invariants

    Michael D Ernst, Jeff H Perkins, Philip J Guo, Stephen McCamant, Carlos Pacheco, Matthew S Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants. Science of computer programming, 69(1-3):35–45, 2007

  10. [10]

    Houdini, an annotation assistant for esc/java

    Cormac Flanagan and K Rustan M Leino. Houdini, an annotation assistant for esc/java. In International symposium of formal methods Europe, pages 500–517. Springer, 2001

  11. [11]

    Refinement of trace abstraction

    Matthias Heizmann, Jochen Hoenicke, and Andreas Podelski. Refinement of trace abstraction. In Jens Palsberg and Zhendong Su, editors, Static Analysis, 16th International Symposium, SAS 2009, Los Angeles, CA, USA, August 9-11, 2009. Proceedings, volume 5673 of Lecture Notes in Computer Science, pages 69–85. Springer, 2009

  12. [12]

    An axiomatic basis for computer programming

    C. A. R. Hoare. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969

  13. [13]

    Leveraging llms for program verification

    Adharsh Kamath, Nausheen Mohammed, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, and Rahul Sharma. Leveraging llms for program verification. In 2024 Formal Methods in Computer-Aided Design (FMCAD), pages 107–118. IEEE, 2024

  14. [14]

    Cbmc–c bounded model checker: (competition contribution)

    Daniel Kroening and Michael Tautschnig. Cbmc–c bounded model checker: (competition contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 389–391. Springer, 2014

  15. [15]

    Dafny: An automatic program verifier for functional correctness

    K. Rustan M. Leino. Dafny: An automatic program verifier for functional correctness. In International Conference on Logic for Programming Artificial Intelligence and Reasoning, pages 348–370. Springer, 2010

  16. [16]

    Specgen: Automated generation of formal program specifications via large language models

    Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. Specgen: Automated generation of formal program specifications via large language models. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 16–28. IEEE, 2025

  17. [17]

    Laurel: Unblocking automated verification with large language models

    Eric Mugnier, Emmanuel Anaya Gonzalez, Nadia Polikarpova, Ranjit Jhala, and Zhou Yuanyuan. Laurel: Unblocking automated verification with large language models. Proceedings of the ACM on Programming Languages, 9(OOPSLA1):1519–1545, 2025

  18. [18]

    Automatic generation of loop invariants in dafny with large language models

    João Pascoal Faria, Emanuel Trigo, and Rui Abreu. Automatic generation of loop invariants in dafny with large language models. In International Conference on Fundamentals of Software Engineering, pages 138–154. Springer, 2025

  19. [19]

    Can large language models reason about program invariants?

    Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. Can large language models reason about program invariants? In International Conference on Machine Learning, pages 27496–27520. PMLR, 2023

  20. [20]

    Liquid types

    Patrick M Rondon, Ming Kawaguchi, and Ranjit Jhala. Liquid types. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 159–169, 2008

  21. [21]

    Inferring multiple helper dafny assertions with llms, 2025

    Álvaro Silva, Alexandra Mendes, and Ruben Martins. Inferring multiple helper dafny assertions with llms, 2025

  22. [22]

    Symbiotic: Synergy of instrumentation, slicing, and symbolic execution: (competition contribution)

    Jiri Slaby, Jan Strejček, and Marek Trtík. Symbiotic: Synergy of instrumentation, slicing, and symbolic execution: (competition contribution). In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 630–632. Springer, 2013

  23. [23]

    Llms versus the halting problem: Revisiting program termination prediction

    Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, and Peter O’Hearn. Llms versus the halting problem: Revisiting program termination prediction. arXiv preprint arXiv:2601.18987, 2026

  24. [24]

    Enchanting program specification synthesis by large language models using static analysis and program verification

    Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. Enchanting program specification synthesis by large language models using static analysis and program verification. In International Conference on Computer Aided Verification, pages 302–328. Springer, 2024

  25. [25]

    Lemur: Integrating large language models in automated program verification

    Haoze Wu, Clark Barrett, and Nina Narodytska. Lemur: Integrating large language models in automated program verification. In The Twelfth International Conference on Learning Representations, 2024

  26. [26]

    Local success does not compose: Benchmarking large language models for compositional formal verification

    Xu Xu, Xin Li, Xingwei Qu, Jie Fu, and Binhang Yuan. Local success does not compose: Benchmarking large language models for compositional formal verification. arXiv preprint arXiv:2509.23061, 2025