LLM-Assisted Repository-Level Generation with Structured Spec-Driven Engineering
Pith reviewed 2026-05-08 17:43 UTC · model grok-4.3
The pith
Structured specifications let LLMs generate high-quality code for entire software repositories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Structured specifications as LLM inputs make high-quality, repository-level code generation a tangible goal, while at the same time offering superior verifiability, leading to significant potential for improvement. This is examined through a pilot study that generated Model-View-Controller business logic for three software systems using five LLMs.
What carries the argument
Structured spec-driven engineering (SSDE), a paradigm that replaces natural language prompts with structured artifacts to guide LLM code generation at repository scale.
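The paper leaves the concrete artifact format open. As a purely illustrative sketch (the `EndpointSpec` class and its fields are hypothetical, not taken from the paper), a structured specification for one piece of MVC business logic could be rendered deterministically into an LLM prompt, removing the ambiguity of a free-form request:

```python
from dataclasses import dataclass, field

@dataclass
class EndpointSpec:
    """A structured spec for one controller action (illustrative only)."""
    name: str
    inputs: dict[str, str]                      # parameter name -> type
    output: str                                 # return type
    preconditions: list[str] = field(default_factory=list)
    postconditions: list[str] = field(default_factory=list)

    def to_prompt_block(self) -> str:
        """Render the spec as a deterministic text block for an LLM prompt."""
        lines = [f"endpoint: {self.name}",
                 "inputs: " + ", ".join(f"{k}: {v}" for k, v in self.inputs.items()),
                 f"output: {self.output}"]
        lines += [f"requires: {p}" for p in self.preconditions]
        lines += [f"ensures: {p}" for p in self.postconditions]
        return "\n".join(lines)

spec = EndpointSpec(
    name="create_order",
    inputs={"customer_id": "int", "items": "list[LineItem]"},
    output="Order",
    preconditions=["customer_id refers to an existing customer"],
    postconditions=["returned Order.total equals the sum of item prices"],
)
print(spec.to_prompt_block())
```

Because the rendering is deterministic, the same artifact can later be replayed against the generated code for verification, which is where the paradigm's verifiability claim would come from.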
If this is right
- Repository-level code generation becomes feasible with existing LLMs.
- Generated outputs gain superior verifiability over prompt-only workflows.
- Software engineering gains a clearer path to productivity gains through guided LLM use.
- A roadmap emerges for tackling the remaining scaling challenges.
Where Pith is reading between the lines
- SSDE may generalize beyond MVC systems, but its performance in other architectures like microservices would need separate testing.
- Pairing SSDE outputs with automated verification suites could create stronger end-to-end quality guarantees.
- Practical adoption would likely require new tools that help engineers build the required structured specifications quickly.
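The verification-suite pairing suggested above can be sketched minimally as a spec-compliance check that compares a generated module against the specification it was produced from. The spec format and the signature check below are hypothetical illustrations, not the paper's method:

```python
import inspect
import types

def check_compliance(module: types.ModuleType, spec: dict[str, list[str]]) -> list[str]:
    """Return violations: functions missing from the module, or functions
    whose parameter names disagree with the spec (illustrative check only)."""
    violations = []
    for func_name, expected_params in spec.items():
        func = getattr(module, func_name, None)
        if not callable(func):
            violations.append(f"missing function: {func_name}")
            continue
        actual = list(inspect.signature(func).parameters)
        if actual != expected_params:
            violations.append(f"{func_name}: expected {expected_params}, got {actual}")
    return violations

# Simulate a "generated" module with one conforming and one missing function.
generated = types.ModuleType("generated_controller")
exec("def create_order(customer_id, items): return None", generated.__dict__)

spec = {"create_order": ["customer_id", "items"],
        "cancel_order": ["order_id"]}
print(check_compliance(generated, spec))
```

Real compliance metrics would of course go beyond signatures (behavioral tests, pre/postcondition checks), but even this structural layer is only possible because the spec is machine-readable.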
Load-bearing premise
Structured artifacts will consistently reduce ambiguity and improve output quality enough to overcome the inherent limitations of current LLMs when scaling from function-level to repository-level generation.
What would settle it
A side-by-side comparison of repository-level code generated via SSDE versus natural language prompts, measured on correctness, maintainability, and defect rates, that shows no clear quality gain.
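Such a side-by-side comparison could be scored with a simple harness; the run outcomes below are placeholders for illustration, not results from the paper:

```python
def defect_rate(results: list[bool]) -> float:
    """Fraction of generated repositories that fail their acceptance tests."""
    return 1 - sum(results) / len(results)

# Placeholder outcomes: True = repository passed its acceptance suite.
ssde_runs   = [True, True, False, True, True]
prompt_runs = [True, False, False, True, False]

gain = defect_rate(prompt_runs) - defect_rate(ssde_runs)
print(f"SSDE defect rate:   {defect_rate(ssde_runs):.2f}")
print(f"Prompt defect rate: {defect_rate(prompt_runs):.2f}")
print(f"Absolute gain:      {gain:.2f}")
```

A convincing study would report this kind of measurement across many repositories, alongside maintainability and correctness scores, rather than the binary pass/fail used here for brevity.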
Original abstract
State-of-the-art Large Language Models (LLMs) excel in code generation at the function level. However, the output quality significantly declines when scaling to repository-level systems. Current workflows relying only on natural language prompts suffer from inherent ambiguity and a lack of verifiability. To address this, we propose structured spec-driven engineering (SSDE), a paradigm that leverages structured artifacts to guide LLM generation. We argue that structured specifications as LLM inputs make high-quality, repository-level code generation a tangible goal, while at the same time offering superior verifiability, leading to significant potential for improvement. We first investigate the feasibility of this vision through a pilot study generating Model-View-Controller (MVC) business logic for three software systems using five LLMs, and then highlight the potential, challenges, and future roadmap for SSDE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Structured Spec-Driven Engineering (SSDE) as a paradigm that uses structured artifacts (rather than natural-language prompts) to guide LLMs toward higher-quality, more verifiable repository-level code generation. It supports the claim with a pilot study that generates MVC business logic for three software systems using five LLMs, then discusses potential, challenges, and a future roadmap.
Significance. If the central claim holds, SSDE could meaningfully advance automated software engineering by providing a more reliable path from specifications to repository-scale implementations. The proposal is timely and the emphasis on verifiability is a clear strength; the pilot offers initial feasibility evidence, though its narrow scope limits the strength of the extrapolation.
major comments (2)
- [Pilot study section] The pilot study (described after the abstract) generates only MVC business logic for three systems and provides no quantitative results, no comparison baselines against unstructured prompts, and no metrics on verifiability such as spec-compliance rates or error counts. This absence of concrete data makes it impossible to evaluate whether structured inputs actually overcome LLM limitations at repository scale.
- [Introduction and pilot study] The central claim that structured specifications make repository-level generation 'tangible' rests on an untested extrapolation: the pilot omits cross-module dependencies, data consistency across layers, testing harnesses, and build integration, all of which define real repository-level tasks. Without evidence that SSDE scales beyond the narrow MVC case, the feasibility argument remains speculative.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the exact structured artifacts used (e.g., formal schemas, UML diagrams, or domain-specific languages) to allow readers to assess reproducibility.
- [Roadmap section] The future-roadmap section would benefit from concrete milestones or evaluation criteria for the proposed verifiability improvements.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We agree that the pilot study is limited in scope and lacks quantitative evaluation, and that the manuscript's claims require clearer scoping to avoid over-extrapolation. We have revised the paper to explicitly frame it as a vision paper proposing SSDE, with the pilot serving only as an initial qualitative feasibility illustration rather than comprehensive evidence. The revisions include expanded limitations discussion, tempered language in the introduction, and an updated roadmap. Below we respond point by point to the major comments.
Point-by-point responses
-
Referee: [Pilot study section] The pilot study (described after the abstract) generates only MVC business logic for three systems and provides no quantitative results, no comparison baselines against unstructured prompts, and no metrics on verifiability such as spec-compliance rates or error counts. This absence of concrete data makes it impossible to evaluate whether structured inputs actually overcome LLM limitations at repository scale.
Authors: We acknowledge the validity of this observation. The pilot was intentionally designed as a small-scale qualitative demonstration to show how structured specifications can be used as LLM inputs for generating MVC business logic in three systems across five models; it does not include quantitative metrics, baselines, or verifiability counts because those were outside its scope as an initial exploration. In the revised manuscript we have added a dedicated limitations subsection that explicitly states the absence of such data and outlines planned follow-up experiments to measure spec-compliance rates, error counts, and comparisons against natural-language prompting. We cannot, however, supply the missing quantitative results from the existing pilot without conducting new experiments. revision: partial
-
Referee: [Introduction and pilot study] The central claim that structured specifications make repository-level generation 'tangible' rests on an untested extrapolation: the pilot omits cross-module dependencies, data consistency across layers, testing harnesses, and build integration, all of which define real repository-level tasks. Without evidence that SSDE scales beyond the narrow MVC case, the feasibility argument remains speculative.
Authors: We agree that the pilot does not address cross-module dependencies, inter-layer consistency, testing, or build integration, and that these elements are essential for genuine repository-scale work. The original manuscript already contains a challenges section and future-roadmap subsection that flag these gaps, but the introduction's phrasing could be read as overstating the pilot's reach. We have revised the introduction to state clearly that the pilot is restricted to MVC business logic generation and that full repository-level scaling (including the omitted aspects) remains an open research question to be addressed in future work. This change preserves the core vision while removing the untested extrapolation. revision: yes
Circularity Check
No circularity: proposal paper with pilot study contains no derivations, fits, or self-referential predictions
full rationale
The manuscript is a paradigm proposal for structured spec-driven engineering (SSDE) plus a feasibility pilot on MVC business logic in three systems. No equations, fitted parameters, uniqueness theorems, or predictions appear that could reduce to inputs by construction. The core claim is an argument that structured artifacts reduce ambiguity, supported directly by the described experiments rather than any self-citation chain or definitional loop. Self-citations, if present, are not load-bearing for any derivation. The work is therefore self-contained against external benchmarks with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Structured specifications reduce ambiguity compared to natural language prompts for LLM code generation.
invented entities (1)
- Structured Spec-Driven Engineering (SSDE): no independent evidence
Reference graph
Works this paper leans on
- [1] Seyed Moein Abtahi and Akramul Azim. 2025. Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements. In IEEE/ACM 2nd International Conference on AI Foundation Models and Software Engineering (FORGE’25). doi:10.1109/Forge66646.2025.00017
- [2] Anthropic. 2025. Claude Code. https://code.claude.com/
- [3] Anthropic. 2025. Introducing Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5
- [4]
- [5] Severin Bergsmann, Alexander Schmidt, Stefan Fischer, and Rudolf Ramler. 2024. First Experiments on Automated Execution of Gherkin Test Specifications with Collaborating LLM Agents. In Proceedings of the 15th ACM International Workshop on Automating Test Case Design, Selection and Evaluation (A-TEST’24). ACM, 12–15. doi:10.1145/3678719.3685692
- [6] Fatma Bozyigit, Tolgahan Bardakci, Alireza Khalilipour, Moharram Challenger, Guus Ramackers, Önder Babur, and Michel R. V. Chaudron. 2024. Generating domain models from natural language text using NLP: a benchmark dataset and experimental comparison of tools. Software and Systems Modeling 23, 6 (2024). doi:10.1007/s10270-024-01176-y
- [7] Boqi Chen, Aren A. Babikian, Shuzhao Feng, Dániel Varró, and Gunter Mussbacher. 2025. LLM-based Satisfiability Checking of String Requirements by Consistent Data and Checker Generation. In 33rd IEEE International Requirements Engineering Conference (RE’25). IEEE, 231–243. doi:10.1109/RE63999.2025.00030
- [8] Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, and Dániel Varró
- [9] Prompt-to-SQL injections in LLM-integrated web applications: Risks and defenses
The Power of Types: Exploring the Impact of Type Checking on Neural Bug Detection in Dynamically Typed Languages. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE’25). IEEE Press, 489–501. doi:10.1109/ICSE55347.2025.00088
- [10] Kua Chen, Yujing Yang, Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, and Dániel Varró. 2023. Automated Domain Modeling with Large Language Models: A Comparative Study. In ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS’23). IEEE Press. doi:10.1109/MODELS58315.2023.00037
- [11] Mark Chen et al. 2021. Evaluating Large Language Models Trained on Code. Computing Research Repository (2021). arXiv:2107.03374 [cs.LG]
- [12] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL]
- [13] Cucumber. 2025. Gherkin. https://github.com/cucumber/gherkin/
- [14] Den Delimarsky. 2025. Spec-driven development with AI: Get started with a new open source toolkit. GitHub. https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
- [15] Lukas Fruntke and Jens Krinke. 2025. Automatically Fixing Dependency Breaking Changes. Proceedings of the ACM on Software Engineering, Article FSE096 (2025). doi:10.1145/3729366
- [16] Antonio García-Domínguez and Dimitris Kolovos. 2024. EMFatic: A textual syntax for EMF Ecore models. https://eclipse.dev/emfatic/
- [17] GitHub. 2021. GitHub Copilot. https://github.com/features/copilot
- [18] Sean Grove. 2025. The New Code. The AI Engineer World’s Fair 2025. https://www.youtube.com/watch?v=8rABwKRsec4
- [19] Kamil Grzybek et al. 2019. Modular Monolith with DDD. https://github.com/kgrzybek/modular-monolith-with-ddd
- [20] Sirui Hong et al. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The 12th International Conference on Learning Representations (ICLR’24). https://openreview.net/forum?id=VtmBAGCN7o
- [21] Faizan Khan, Boqi Chen, Daniel Varro, and Shane McIntosh. 2022. An Empirical Study of Type-Related Defects in Python Projects. IEEE Transactions on Software Engineering (2022). doi:10.1109/TSE.2021.3082068
- [22] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS’22). Curran Associates Inc., Article 1613, 22199–22213. https://dl.acm.org/doi/10.5555/3600270.3601883
- [23] Timothy C. Lethbridge et al. 2021. Umple: Model-driven development for open source and education. Science of Computer Programming 208 (2021). doi:10.1016/j.scico.2021.102665
- [24] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In The 37th Conference on Neural Information Processing Systems (NeurIPS’23). https://openreview.net/forum?id=1qvx610Cu7
- [25] Meta AI. 2024. llama3.2:3b. https://ollama.com/library/llama3.2:3b/
- [26] Behrooz Omidvar Tehrani, Ishaani M, and Anmol Anubhai. 2024. Evaluating Human-AI Partnership for LLM-based Code Migration. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA’24). ACM. doi:10.1145/3613905.3650896
- [27] OpenAI. 2025. GPT-5 nano Model. https://platform.openai.com/docs/models/gpt-5-nano
- [28] OpenAI. 2025. GPT-5.1: A smarter, more conversational ChatGPT. https://openai.com/index/gpt-5-1/
- [29] Santiago Padron, Julien Valentin, Iwan Olier, Rémi Séguin, Mathieu Guimond, Yejia Shen, and Daniel Yu. 2025. CheECSEManager. GitHub. https://github.com/F2025-ECSE223/ecse223-group-project-p16
- [30] Mike Pagel, Vincent Aranega, and Andreas Schmidl. 2021. pyecoregen. https://github.com/pyecore/pyecoregen/
- [31] PlantUML. 2026. PlantUML at a Glance. https://plantuml.com/
- [32] Alexander Poth, Olsi Rrjolli, Huiyu Wang, and Klaus Schmid. 2026. Baseline Evaluation of LLM-Facilitated UI Test-Case Generation from Gherkin Specifications. In Systems, Software and Services Process Improvement, Murat Yilmaz, Paul Clarke, Andreas Riel, Richard Messnarz, Mikus Zelmenis, and Ivi Anna Buce (Eds.). Springer Nature Switzerland. doi:10.1007/9...
- [33]
- [34] Qwen Team. 2025. Qwen3-Coder. https://github.com/QwenLM/Qwen3-Coder
- [35]
- [36] Sepehr Sharifi, Alireza Parvizimosaed, Daniel Amyot, Luigi Logrippo, and John Mylopoulos. 2020. Symboleo: Towards a Specification Language for Legal Contracts. In IEEE 28th International Requirements Engineering Conference (RE’20). doi:10.1109/RE48521.2020.00049. Artifact URL: https://github.com/Smart-Contract-Modelling-uOttawa/Symboleo-JS-Core
- [37] Jonathan Silva, Qin Ma, Jordi Cabot, Pierre Kelsen, and Henderik A. Proper. 2024. Application of the Tree-of-Thoughts Framework to LLM-Enabled Domain Modeling. In Proceedings of the 43rd International Conference on Conceptual Modeling (ER’24). Springer-Verlag, 94–111. doi:10.1007/978-3-031-75872-0_6
- [38] David Steinberg, Frank Budinsky, Marcelo Paternostro, and Ed Merks. 2009. EMF: Eclipse Modeling Framework 2.0 (2nd ed.). Addison-Wesley Professional. https://dl.acm.org/doi/10.5555/1197540
- [39] Artem Syromiatnikov and Danny Weyns. 2014. A journey through the land of model-view-design patterns. In Proceedings of the 2014 IEEE/IFIP Conference on Software Architecture (ICSA’14). IEEE. doi:10.1109/WICSA.2014.13
- [40] Yujing Yang, Boqi Chen, Kua Chen, Gunter Mussbacher, and Dániel Varró. 2024. Multi-step Iterative Automated Domain Modeling with Large Language Models. In Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems (MODELS’24). ACM, 587–595. doi:10.1145/3652620.3687807
discussion (0)