pith. machine review for the scientific record.

arxiv: 2604.26615 · v1 · submitted 2026-04-29 · 💻 cs.SE · cs.AI

Recognition: unknown

TDD Governance for Multi-Agent Code Generation via Prompt Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 11:37 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords test-driven development · prompt engineering · multi-agent systems · LLM code generation · software engineering governance · workflow orchestration · AI-assisted programming

The pith

Encoding TDD principles as a machine-readable manifesto and distributing them across prompt and workflow layers constrains multi-agent LLM code generation to follow structured discipline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes turning classical test-driven development into an enforceable governance system for large language models that generate code. Rather than treating tests as optional aids, the approach formalizes TDD's Red-Green-Refactor cycle into a manifesto that is injected at planning, generation, repair, and validation stages. A layered architecture keeps creative model proposals separate from deterministic checks that enforce ordering, bounded loops, and gates. This matters because unconstrained LLMs often produce unstable, non-deterministic code while skipping engineering process altogether. If the mechanism works, it shows how software engineering rules can be baked directly into AI orchestration rather than left to chance.

Core claim

The authors present an AI-native TDD framework that extracts classical TDD principles, encodes them in a machine-readable manifesto, and distributes the rules across planning, generation, repair, and validation stages inside a layered architecture. This architecture separates model proposal from deterministic engine authority and uses the manifesto to enforce phase ordering, bounded repair loops, validation gates, and atomic mutation control, with the goal of improving stability and reproducibility in multi-agent LLM code generation.
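The paper's actual manifesto schema is not reproduced on this page; as a minimal sketch of what a machine-readable rule entry could look like (all field names and rule wordings here are hypothetical, not the authors'):

```python
# Hypothetical sketch of machine-readable manifesto entries. The paper's
# real schema is not shown here, so every field name below is invented.
MANIFESTO = [
    {
        "id": "tdd-001",
        "phase": "red",
        "rule": "A failing test must exist before any implementation code is generated.",
        "enforced_by": "engine",        # deterministic check, not the LLM
        "on_violation": "reject_and_retry",
    },
    {
        "id": "tdd-002",
        "phase": "green",
        "rule": "Implementation may only change until the failing test passes.",
        "enforced_by": "engine",
        "on_violation": "rollback_mutation",
    },
    {
        "id": "tdd-003",
        "phase": "refactor",
        "rule": "All tests must stay green across every refactoring mutation.",
        "enforced_by": "engine",
        "on_violation": "rollback_mutation",
    },
]

def rules_for(phase: str) -> list[str]:
    """Select the rules the engine would inject into a given stage's prompt."""
    return [entry["rule"] for entry in MANIFESTO if entry["phase"] == phase]
```

Distributing the same entries across stages then reduces to filtering by phase before each prompt is assembled, which is one plausible reading of "manifesto distribution."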

What carries the argument

A machine-readable TDD manifesto distributed across planning, generation, repair, and validation stages in a layered architecture that separates LLM proposals from deterministic governance authority.

If this is right

  • Phase ordering forces the workflow to follow the Red-Green-Refactor sequence without skipping steps.
  • Bounded repair loops cap the number of iterations at each stage to prevent uncontrolled drift.
  • Validation gates insert deterministic checks that block progress until criteria are met.
  • Atomic mutation control limits each change to small, verifiable units that can be rolled back.
  • The separation of proposal from authority reduces non-determinism by letting the engine override model outputs when rules are violated.
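Taken together, the mechanisms above suggest an orchestration loop along these lines. This is a sketch of the governance pattern, not the paper's implementation; `propose_change` and `validate` are stand-ins for the LLM call and the deterministic engine's checks:

```python
from typing import Callable

PHASES = ["red", "green", "refactor"]  # enforced phase ordering
MAX_REPAIRS = 3                        # bounded repair loop

def governed_cycle(propose_change: Callable[[str], str],
                   validate: Callable[[str, str], bool]) -> dict[str, str]:
    """Run one Red-Green-Refactor cycle under deterministic governance.

    propose_change(phase) -> a candidate mutation (the model's proposal)
    validate(phase, candidate) -> bool (the engine's deterministic gate)
    """
    accepted: dict[str, str] = {}
    for phase in PHASES:                        # ordering: no phase is skipped
        for _attempt in range(MAX_REPAIRS):     # bounded repair, no drift
            candidate = propose_change(phase)
            if validate(phase, candidate):      # validation gate
                accepted[phase] = candidate     # atomic: commit only on pass
                break
        else:
            # Engine authority: halt rather than proceed past the bound.
            raise RuntimeError(f"phase {phase!r} failed after {MAX_REPAIRS} attempts")
    return accepted
```

The key design point the paper argues for is visible even in this toy form: the model only ever proposes; acceptance, ordering, and termination are decided by deterministic code.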

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar manifestos could encode other methodologies such as domain-driven design or security review checklists into the same orchestration layer.
  • The approach might reduce the human oversight needed when LLMs are used for large-scale refactoring tasks.
  • Benchmarking could reveal whether the added constraints increase or decrease overall task completion time compared with free-form prompting.
  • Integration with continuous integration pipelines could make the governance rules executable outside the prompt layer.

Load-bearing premise

That turning TDD principles into prompts and workflow rules will make LLMs reliably follow the discipline without unacceptable loss of code quality or creativity.

What would settle it

Compare code produced by the governed system against an unconstrained multi-agent baseline on identical tasks, measuring the fraction of outputs that complete all TDD phases in order and the variance in functional correctness across repeated runs.
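Both proposed measurements are simple to score once each run records a phase trace and a correctness value. A sketch, assuming those two per-run records exist (the paper reports no such experiment):

```python
from statistics import pvariance

def adherence_fraction(traces: list[list[str]]) -> float:
    """Fraction of runs whose phase trace is exactly Red-Green-Refactor, in order."""
    return sum(trace == ["red", "green", "refactor"] for trace in traces) / len(traces)

def correctness_variance(scores: list[float]) -> float:
    """Population variance of functional correctness across repeated runs of one task."""
    return pvariance(scores)
```

Under the paper's thesis, the governed system should show a higher adherence fraction and a lower correctness variance than the free-form baseline on the same tasks.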

Figures

Figures reproduced from arXiv: 2604.26615 by Bishwash Khanal, Pekka Abrahamsson, Pyry Kotilainen, Shahbaz Siddeeq, Tarlan Hasanli, Tommi Mikkonen.

Figure 1: Illustrative manifesto entries and their governance
Figure 2: Governed AI-native TDD workflow with separation
original abstract

Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows. While test-driven development (TDD) provides a structured Red-Green-Refactor process, existing LLM-based approaches typically use tests as auxiliary inputs rather than enforceable process constraints. We present an AI-native TDD framework that operationalizes classical TDD principles as structured prompt-level and workflow-level governance mechanisms. Extracted principles are formalized in a machine-readable manifesto and distributed across planning, generation, repair, and validation stages within a layered architecture that separates model proposal from deterministic engine authority. The system enforces phase ordering, bounded repair loops, validation gates, and atomic mutation control to improve stability and reproducibility. We describe architecture and discuss encoding software engineering discipline directly into prompt orchestration, which we think offers a promising direction for reliable LLM-assisted development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an AI-native TDD framework for multi-agent code generation that operationalizes classical TDD (Red-Green-Refactor) principles as structured prompt-level and workflow-level governance mechanisms. These principles are extracted and formalized into a machine-readable manifesto, then distributed across planning, generation, repair, and validation stages within a layered architecture that separates LLM model proposals from deterministic engine authority. The system is claimed to enforce phase ordering, bounded repair loops, validation gates, and atomic mutation control, thereby improving stability and reproducibility over unconstrained LLM workflows. The manuscript describes the architecture and discusses encoding software engineering discipline into prompt orchestration.

Significance. If the governance mechanisms prove effective at constraining LLM behavior without unacceptable loss of quality, the work could represent a promising direction for reliable AI-assisted software engineering by directly embedding process discipline into prompt orchestration and workflow design. The explicit separation of proposal generation from deterministic authority is a thoughtful architectural choice that targets known LLM non-determinism. The absence of any empirical validation, however, keeps the significance prospective.

major comments (2)
  1. Abstract: The manuscript asserts that the framework 'enforces phase ordering, bounded repair loops, validation gates, and atomic mutation control to improve stability and reproducibility,' yet supplies no experiments, case studies, metrics (e.g., iteration counts, adherence rates, or output quality measures), or even qualitative walkthroughs. This leaves the central claim—that prompt-level and workflow-level governance will reliably override LLM non-determinism—unsupported and load-bearing for the contribution.
  2. Architecture description (throughout the manuscript): The layered architecture and manifesto distribution are presented at a conceptual level with no concrete prompt templates, pseudocode, implementation artifacts, or runnable prototype. Without these, it is impossible to evaluate whether the TDD principles are encoded in a manner that actually constrains LLM outputs as intended.
minor comments (1)
  1. The manuscript would benefit from an explicit limitations or trade-offs subsection discussing potential impacts on generation creativity, computational cost of the governance layers, or failure modes when LLMs ignore the manifesto.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the distinction between conceptual design and empirical demonstration. Our manuscript is a framework proposal that describes how classical TDD principles can be encoded as prompt-level and workflow-level governance; it does not include experiments. We address each major comment below and indicate the revisions we will make.

point-by-point responses
  1. Referee: Abstract: The manuscript asserts that the framework 'enforces phase ordering, bounded repair loops, validation gates, and atomic mutation control to improve stability and reproducibility,' yet supplies no experiments, case studies, metrics (e.g., iteration counts, adherence rates, or output quality measures), or even qualitative walkthroughs. This leaves the central claim—that prompt-level and workflow-level governance will reliably override LLM non-determinism—unsupported and load-bearing for the contribution.

    Authors: We agree that the current abstract wording presents the governance effects as achieved outcomes rather than as design intentions. The manuscript is explicitly positioned as a conceptual architecture paper (see abstract and Section 1), not an empirical study. The phrase 'enforces' was intended to describe the structural constraints imposed by phase ordering, bounded loops, and deterministic authority separation, but we acknowledge it can be misread as claiming measured reliability. We will revise the abstract to state that the framework 'is designed to enforce' these mechanisms 'with the goal of improving' stability and reproducibility, and we will add a sentence clarifying the scope as a governance proposal without empirical evaluation in this work. revision: yes

  2. Referee: Architecture description (throughout the manuscript): The layered architecture and manifesto distribution are presented at a conceptual level with no concrete prompt templates, pseudocode, implementation artifacts, or runnable prototype. Without these, it is impossible to evaluate whether the TDD principles are encoded in a manner that actually constrains LLM outputs as intended.

    Authors: The manuscript deliberately remains at the architectural and principle-formalization level to emphasize the separation of proposal generation from deterministic authority and the machine-readable manifesto as reusable governance artifacts. Full prompt templates are necessarily model- and task-specific and would have shifted the paper toward an implementation report. We will add a new subsection containing (1) pseudocode for the overall workflow (planning → generation → repair → validation with bounded loops and gates) and (2) high-level prompt skeletons for the manifesto distribution at each stage. These additions will make the encoding of Red-Green-Refactor principles more concrete while preserving the conceptual focus; a complete runnable prototype remains outside the current scope. revision: partial

Circularity Check

0 steps flagged

No circularity: framework is a proposed design artifact, not a derived quantity.

full rationale

The manuscript presents a conceptual architecture for embedding TDD principles into prompt orchestration and layered workflow governance for multi-agent LLM code generation. No equations, parameters, or quantitative predictions appear. The central description (manifesto formalization, phase ordering, repair loops, validation gates) is introduced as an engineering construction rather than derived from prior outputs or self-referential fits. No self-citations are invoked to establish uniqueness or forbid alternatives, and the text supplies no fitted-input-called-prediction pattern. The derivation chain is therefore self-contained as a design proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LLMs are unstable without external governance and that prompt orchestration can enforce TDD without further empirical grounding; no free parameters or additional invented entities beyond the framework itself are introduced.

axioms (1)
  • domain assumption: LLMs exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows.
    Explicitly stated in the opening sentence of the abstract as the motivating problem.
invented entities (1)
  • AI-native TDD framework with a machine-readable manifesto and a layered architecture separating model proposal from deterministic engine authority (no independent evidence)
    purpose: To operationalize TDD principles as enforceable prompt-level and workflow-level governance mechanisms
    The framework and manifesto are introduced by the paper without reference to prior independent implementations or evidence.

pith-pipeline@v0.9.0 · 5465 in / 1404 out tokens · 44628 ms · 2026-05-07T11:37:21.210261+00:00 · methodology

