AI for software engineering: from probable to provable
Pith reviewed 2026-05-17 04:10 UTC · model grok-4.3
The pith
AI code becomes reliable when paired with formal specifications and verification
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vibe coding with AI faces obstacles in goal specification and hallucinations, but these can be overcome by integrating AI creativity with formal specification methods and the power of formal program verification supported by modern proof tools, resulting in programs that are correct or very close to correct.
What carries the argument
The hybrid approach of combining artificial intelligence's creativity with formal specification and verification techniques, which ensures program correctness while preserving innovative potential.
If this is right
- AI-generated programs can achieve high levels of correctness through verification.
- Prompt engineering difficulties are mitigated by formal requirements.
- Hallucinations in code output are reduced or eliminated.
- Modern proof tools make verification practical for AI-assisted development.
Where Pith is reading between the lines
- Such integration could lead to development environments where AI suggests code and formal tools automatically verify it.
- This might bridge the gap between agile AI methods and traditional rigorous engineering practices.
- Testing this in large-scale projects could reveal scalability of the approach.
Load-bearing premise
Formal specification and verification techniques can be practically integrated with current AI code generators without losing the speed and creativity benefits of AI.
What would settle it
A controlled study comparing productivity and error rates in AI-only coding versus AI combined with formal methods on a set of programming tasks.
read the original abstract
Vibe coding, the much-touted use of AI techniques for programming, faces two overwhelming obstacles: the difficulty of specifying goals ("prompt engineering" is a form of requirements engineering, one of the toughest disciplines of software engineering); and the hallucination phenomenon. Programs are only useful if they are correct or very close to correct. The solution? Combine the creativity of artificial intelligence with the rigor of formal specification methods and the power of formal program verification, supported by modern proof tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that AI-assisted 'vibe coding' is limited by the difficulties of prompt engineering (as a form of requirements engineering) and by hallucinations, and proposes that these can be overcome by combining AI creativity with the rigor of formal specification methods and formal program verification supported by modern proof tools.
Significance. If the proposed integration of AI generation with formal methods can be made practical, the work would point to a high-impact direction for reliable software engineering, moving AI outputs from probabilistic to machine-checkable correctness. The manuscript correctly identifies two core obstacles in current practice but provides no supporting evidence or examples.
major comments (2)
- [Abstract] Abstract: the central claim that formal specification and verification techniques can be practically integrated with current AI code generators without losing speed and creativity benefits rests on an unexamined feasibility assumption; no mechanism, workflow, or scaling argument is supplied for deriving or maintaining machine-checkable specifications from natural-language prompts.
- [Main text] Main proposal: the text asserts that modern proof tools will support the approach but contains no discussion of how proof assistants would handle the volume, style, or error patterns typical of LLM-generated code, which is load-bearing for the practicality of the suggested solution.
minor comments (1)
- [Abstract] The term 'vibe coding' is used without definition or citation, which reduces accessibility for readers outside the immediate community.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential high-impact direction of combining AI code generation with formal methods. We agree that the manuscript is a concise position paper that identifies key obstacles but does not yet supply detailed mechanisms, workflows, or discussions of practical challenges. We will perform a major revision to strengthen these aspects while preserving the core vision.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that formal specification and verification techniques can be practically integrated with current AI code generators without losing speed and creativity benefits rests on an unexamined feasibility assumption; no mechanism, workflow, or scaling argument is supplied for deriving or maintaining machine-checkable specifications from natural-language prompts.
Authors: We accept this assessment. The abstract currently states the proposal at a high level without elaboration. In revision we will rewrite the abstract to frame the integration as a proposed research direction rather than an immediately deployable solution. We will also insert a short paragraph describing an initial workflow: (1) LLM-assisted translation of natural-language prompts into candidate formal specifications (e.g., in Dafny or TLA+), (2) iterative refinement and proof generation using AI-augmented proof assistants, and (3) selective human review for safety-critical components. Scaling considerations will be explicitly noted as open questions requiring future empirical study, with pointers to related work on AI-assisted formal methods. revision: yes
-
Referee: [Main text] Main proposal: the text asserts that modern proof tools will support the approach but contains no discussion of how proof assistants would handle the volume, style, or error patterns typical of LLM-generated code, which is load-bearing for the practicality of the suggested solution.
Authors: This observation is correct and points to a genuine gap. The current text does not examine the concrete difficulties of verifying LLM outputs. We will add a dedicated subsection that discusses these issues: common LLM error patterns (e.g., subtle logical inconsistencies, incorrect invariants, or off-by-one mistakes), the challenge of large code volumes, and how modern proof assistants can mitigate them through modular verification, counterexample generation, and AI-generated proof sketches. We will emphasize that full automation remains difficult for large systems but may be viable for critical modules, and we will cite relevant literature on LLM-assisted theorem proving. revision: yes
Circularity Check
No circularity in forward-looking proposal
full rationale
The manuscript is a high-level conceptual proposal advocating the integration of AI code generation with formal specification and verification techniques to mitigate prompt engineering and hallucination issues. It contains no equations, derivations, fitted parameters, or empirical results. The central claim does not reduce to any self-definition, self-citation chain, or renamed known result; it is presented as a suggested direction rather than a derived conclusion from internal inputs. No load-bearing steps exist that could exhibit circularity by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Formal specification methods can capture programming goals more reliably than natural-language prompts.
- domain assumption Modern proof tools are capable of verifying code produced by AI generators.
Forward citations
Cited by 1 Pith paper
-
Making Software Meaningful
Committing to explicit meaning via a domain-grounded vocabulary of individuals, actions, facts, and concepts improves software usability, enables modular LLM code generation, and supports accountable agent behavior.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.