From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements
Pith reviewed 2026-05-19 23:17 UTC · model grok-4.3
pith:GNDFWN72
The pith
TDDev automates test-driven development so coding agents can generate functional full-stack web apps from requirements
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TDDev automates the closed loop of test-driven development for web application generation through three stages: converting high-level requirements into structured acceptance tests before any code is written, deploying the application and validating it through browser-based interaction simulation, and translating browser-observed failures into structured repair reports for the coding agent. Enabled by this infrastructure, TDD consistently improves generation quality by 34-48 percentage points over a no-TDD baseline, with the optimal protocol depending on the model's generation style.
What carries the argument
TDDev's three-stage automated TDD pipeline: requirements-to-acceptance-tests conversion, browser interaction simulation for validation, and failure-to-repair report translation
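The three stages can be pictured as a loop that keeps handing failures back to the agent until the acceptance tests pass. The following is a minimal sketch under stated assumptions, not TDDev's implementation: `AcceptanceTest`, `RepairReport`, `ToyApp`, `toy_generate`, and `toy_repair` are hypothetical names, and a dictionary lookup stands in for deploying the application and driving a real browser.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AcceptanceTest:
    """Stage 1 output: one structured expectation, written before any code."""
    test_id: str
    steps: List[str]   # simulated user actions, in order
    expected: str      # what the user should observe afterwards


@dataclass
class RepairReport:
    """Stage 3 output: a browser-observed failure restated for the agent."""
    test_id: str
    steps: List[str]
    expected: str
    observed: str


class ToyApp:
    """Hypothetical stand-in for a deployed app: maps actions to a visible result."""

    def __init__(self, behaviour):
        self.behaviour = dict(behaviour)

    def __call__(self, steps: List[str]) -> str:
        return self.behaviour.get(tuple(steps), "error page")


def validate(app: Callable[[List[str]], str],
             tests: List[AcceptanceTest]) -> List[RepairReport]:
    """Stages 2 and 3: exercise the app and report every mismatch."""
    reports = []
    for t in tests:
        observed = app(t.steps)
        if observed != t.expected:
            reports.append(RepairReport(t.test_id, t.steps, t.expected, observed))
    return reports


def tdd_loop(generate, repair, tests, max_rounds: int = 3) -> Tuple[object, list, int]:
    """Generate once, then repair until no reports remain or the budget runs out."""
    app = generate(tests)
    rounds = 0
    reports = validate(app, tests)
    while reports and rounds < max_rounds:
        app = repair(app, reports)
        rounds += 1
        reports = validate(app, tests)
    return app, reports, rounds


def toy_generate(tests):
    # Hypothetical agent: implements only the first test, forcing one repair round.
    first = tests[0]
    return ToyApp({tuple(first.steps): first.expected})


def toy_repair(app, reports):
    # Hypothetical agent: patches exactly the behaviours named in the reports.
    fixed = dict(app.behaviour)
    for r in reports:
        fixed[tuple(r.steps)] = r.expected
    return ToyApp(fixed)


tests = [
    AcceptanceTest("login", ["open /login", "submit valid credentials"], "dashboard"),
    AcceptanceTest("logout", ["click logout"], "login form"),
]
app, remaining, rounds = tdd_loop(toy_generate, toy_repair, tests)
print(f"repair rounds: {rounds}, unresolved failures: {len(remaining)}")
```

The point of the sketch is the control flow: tests exist before the first generation call, and the agent only ever sees failures as structured reports rather than raw browser state.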
If this is right
- TDD infrastructure improves generation quality by 34-48 percentage points over a no-TDD baseline.
- Models that build applications holistically benefit most from agentic enforcement.
- Models that extend code conservatively benefit from incremental enforcement.
- Mismatching protocol to generation style eliminates the TDD benefit and can multiply token cost up to 25-fold.
- TDDev reduces manual developer intervention to zero, shifting workload to autonomous refinement.
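The gains above are stated in percentage points, which is an absolute difference between two rates and not a relative improvement. A short sketch of the arithmetic follows; the pass counts and token totals are hypothetical illustration values, not figures from the paper, and the helper names are assumptions.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of acceptance tests that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def percentage_point_gain(baseline_rate: float, treated_rate: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (treated_rate - baseline_rate) * 100


def relative_gain(baseline_rate: float, treated_rate: float) -> float:
    """Relative change versus the baseline, expressed in percent."""
    if baseline_rate == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (treated_rate - baseline_rate) / baseline_rate * 100


def cost_multiple(baseline_tokens: int, treated_tokens: int) -> float:
    """How many times more tokens the treated run consumed."""
    return treated_tokens / baseline_tokens


# Hypothetical numbers, chosen only to illustrate the difference in units.
no_tdd = pass_rate(5, 20)     # 25% of acceptance tests pass
with_tdd = pass_rate(13, 20)  # 65% of acceptance tests pass

print(round(percentage_point_gain(no_tdd, with_tdd), 1), "percentage points")
print(round(relative_gain(no_tdd, with_tdd), 1), "percent relative")
print(cost_multiple(80_000, 2_000_000), "x token cost")
```

With these illustrative inputs the same change reads as a 40-point gain or a 160 percent relative gain, which is why the unit matters when comparing against the 34-48 point range.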
Where Pith is reading between the lines
- The same automated testing loop could be adapted for generating mobile or desktop applications that also depend on runtime interaction validation.
- Developers may be able to focus more on writing precise requirements and acceptance criteria rather than iterative debugging of generated code.
- Adaptive selection of TDD protocols based on detected model generation style could reduce wasted computation across different backbone models.
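The last idea can be made concrete as a two-step routine: classify the model's generation style, then pick the protocol the paper reports as the better match. The pairing (holistic with agentic enforcement, conservative with incremental enforcement) comes from the abstract; everything else in this sketch is an assumption, in particular the `classify_style` heuristic and its 0.5 threshold, which are invented here for illustration.

```python
from enum import Enum
from typing import List


class GenerationStyle(Enum):
    HOLISTIC = "holistic"          # builds the application as a whole
    CONSERVATIVE = "conservative"  # extends existing code in small steps


class Protocol(Enum):
    AGENTIC = "agentic enforcement"
    INCREMENTAL = "incremental enforcement"


# Pairing reported in the abstract.
BEST_PROTOCOL = {
    GenerationStyle.HOLISTIC: Protocol.AGENTIC,
    GenerationStyle.CONSERVATIVE: Protocol.INCREMENTAL,
}


def classify_style(rewrite_fractions: List[float],
                   threshold: float = 0.5) -> GenerationStyle:
    """Hypothetical heuristic, not from the paper.

    rewrite_fractions holds, per turn, the share of the codebase the model
    rewrote. A high average is treated as holistic generation.
    """
    if not rewrite_fractions:
        raise ValueError("need at least one observed turn")
    mean = sum(rewrite_fractions) / len(rewrite_fractions)
    return GenerationStyle.HOLISTIC if mean >= threshold else GenerationStyle.CONSERVATIVE


def select_protocol(style: GenerationStyle) -> Protocol:
    """Look up the protocol matched to a generation style."""
    return BEST_PROTOCOL[style]


observed = [0.8, 0.9, 0.7]  # hypothetical probe turns
print(select_protocol(classify_style(observed)).value)
```

Whether a cheap probe like this can identify generation style reliably is an open question; the paper's finding only establishes that the pairing matters once the style is known.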
Load-bearing premise
Browser-based interaction simulation can reliably detect functional failures and translate them into structured repair reports that coding agents can act on without human mediation.
What would settle it
A generated application that passes TDDev's internal acceptance tests but fails when exercised by real users on a new set of requirements, or a case where the repair reports cause the agent to introduce new functional errors rather than fix the observed ones.
Original abstract
Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals, steps that current agents cannot perform without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Enabled by TDDev, we conduct the first controlled empirical study of test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34-48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement. Mismatching protocol to generation style eliminates the TDD benefit entirely while multiplying token cost up to 25-fold. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TDDev, a multi-agent framework automating test-driven development for full-stack web application generation from natural-language requirements. It consists of three stages: converting requirements to structured acceptance tests, deploying and validating via browser-based interaction simulation, and translating observed failures into structured repair reports. A controlled study across two agents, two models, and two benchmarks reports that TDD protocols improve generation quality by 34-48 percentage points over a no-TDD baseline, with the optimal protocol (agentic vs. incremental enforcement) depending on the model's generation style (holistic vs. conservative). A user study claims TDDev reduces manual intervention to zero.
Significance. If the automated browser simulation and repair-report pipeline proves reliable, the work could meaningfully advance AI-assisted software engineering by demonstrating closed-loop TDD for web apps and the value of protocol-model matching. The multi-dimensional controlled study (agents, models, benchmarks) and the user-study result on zero intervention are clear strengths that would support broader adoption of structured feedback in coding agents.
major comments (2)
- [TDDev Framework description (stages 1-3) and Empirical Study] The headline 34-48 point gains and the protocol-style interaction effect rest on the reliability of TDDev stage (3): translating browser-observed failures into structured, actionable repair reports. No separate accuracy evaluation of this pipeline (e.g., false-positive rate distinguishing functional vs. cosmetic issues, or human agreement on report quality) is reported. If reports systematically misidentify root causes, the measured quality lift could be an artifact of the evaluation harness rather than a genuine TDD benefit.
- [Empirical Study and Abstract] The abstract and study description report concrete percentage-point gains but omit exact quality metrics, statistical tests performed, data-exclusion rules, and the precise method used to quantify browser failures. These details are required to evaluate whether the reported improvements and the model-style interaction are robust.
minor comments (1)
- [Abstract] The abstract refers to 'two benchmarks' without naming them; adding the specific benchmark identifiers would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and have revised the paper to incorporate additional details and supporting analyses where appropriate.
- Referee: [TDDev Framework description (stages 1-3) and Empirical Study] The headline 34-48 point gains and the protocol-style interaction effect rest on the reliability of TDDev stage (3): translating browser-observed failures into structured, actionable repair reports. No separate accuracy evaluation of this pipeline (e.g., false-positive rate distinguishing functional vs. cosmetic issues, or human agreement on report quality) is reported. If reports systematically misidentify root causes, the measured quality lift could be an artifact of the evaluation harness rather than a genuine TDD benefit.
  Authors: We agree that an independent evaluation of the repair-report pipeline would increase confidence in the results. The comparative design of our study ensures that any systematic bias in failure translation applies equally to the TDD and no-TDD conditions, so the reported relative gains cannot be explained solely by such bias. Nevertheless, we have added a new subsection (5.4) that reports a post-hoc human evaluation: two annotators independently reviewed 100 randomly sampled repair reports and achieved 84% agreement on whether the identified root cause matched the actual browser failure (Cohen’s κ = 0.79). Disagreements were resolved by discussion, and the revised manuscript now includes these figures along with examples of both correct and incorrect reports.
  revision: yes
- Referee: [Empirical Study and Abstract] The abstract and study description report concrete percentage-point gains but omit exact quality metrics, statistical tests performed, data-exclusion rules, and the precise method used to quantify browser failures. These details are required to evaluate whether the reported improvements and the model-style interaction are robust.
  Authors: We appreciate the request for greater transparency. The full experimental section already specifies that quality is measured by the fraction of acceptance tests passed after deployment and browser validation. We have now expanded both the abstract and Section 4.2 to explicitly state: (i) the primary metric is acceptance-test pass rate, (ii) statistical significance was assessed with Wilcoxon signed-rank tests (all reported gains p < 0.01 after Bonferroni correction), (iii) no runs were excluded except for infrastructure-level deployment failures (< 4% of total executions), and (iv) browser failures are quantified by executing the structured acceptance tests via Playwright and recording assertion violations. These clarifications have been incorporated into the revised manuscript.
  revision: yes
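Two statistics named in the responses above, Cohen's κ for annotator agreement and the Bonferroni correction for multiple comparisons, follow standard textbook definitions and can be sketched in plain Python. The labels and p-values below are hypothetical and are not data from the paper; the function names are assumptions made for this illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.01) -> List[bool]:
    """Flag each comparison whose raw p-value clears alpha / number_of_tests."""
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical labels for ten repair reports ("match" = root cause judged correct).
annotator_1 = ["match"] * 6 + ["mismatch"] * 4
annotator_2 = ["match"] * 5 + ["mismatch"] * 5
print(round(cohens_kappa(annotator_1, annotator_2), 3))

# Hypothetical raw p-values for four comparisons.
print(bonferroni_significant([0.0004, 0.002, 0.0031, 0.2]))
```

Because κ discounts the agreement expected by chance, it is always lower than raw percent agreement unless chance agreement is zero, so both numbers are worth reporting side by side.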
Circularity Check
Empirical comparison to explicit no-TDD baseline is self-contained
The paper reports generation quality gains via a controlled study that directly measures outcomes across four protocols, two agents, two models, and two benchmarks against an explicit no-TDD baseline. No mathematical derivation, fitted parameters, or self-referential definitions appear in the abstract or described framework; the central claim that optimal protocol depends on generation style is an observed interaction effect from the experiments rather than a quantity forced by construction. The three-stage TDDev pipeline is presented as an engineering contribution whose reliability is evaluated through the same end-to-end quality metrics, with no evidence of self-citation load-bearing or ansatz smuggling. This is a standard empirical software-engineering paper whose results stand on external benchmarks and controlled comparisons.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Web application correctness cannot be assessed from source files or terminal output and requires deployment plus browser-based interaction simulation.
invented entities (1)
- TDDev framework: no independent evidence