pith. sign in

arxiv: 2605.17242 · v1 · pith:GNDFWN72new · submitted 2026-05-17 · 💻 cs.SE

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

Pith reviewed 2026-05-19 23:17 UTC · model grok-4.3

classification 💻 cs.SE
keywords test-driven developmentcoding agentsweb application generationmulti-agent systemsacceptance testingbrowser simulationfull-stack developmentsoftware quality
0
0 comments X p. Extension
pith:GNDFWN72 Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{GNDFWN72}

Prints a linked pith:GNDFWN72 badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

TDDev automates test-driven development so coding agents can generate functional full-stack web apps from requirements

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Coding agents struggle to produce working web applications because correctness requires deploying the code and testing it through actual browser interactions rather than inspecting source alone. The paper introduces TDDev, a multi-agent framework that closes this loop by first turning requirements into structured acceptance tests, then running simulated browser sessions to validate the deployed app, and finally converting observed failures into repair reports the agents can use. Experiments across agents, models, and benchmarks show TDD raises success rates by 34 to 48 percentage points over baselines that skip these steps. The work also finds that the most effective TDD protocol aligns with how a given model tends to generate code, either holistically or incrementally. A user study confirms the system removes the need for ongoing human prompt engineering.

Core claim

TDDev automates the closed loop of test-driven development for web application generation through three stages: converting high-level requirements into structured acceptance tests before any code is written, deploying the application and validating it through browser-based interaction simulation, and translating browser-observed failures into structured repair reports for the coding agent. Enabled by this infrastructure, TDD consistently improves generation quality by 34--48 percentage points over a no-TDD baseline, with the optimal protocol depending on the model's generation style.

What carries the argument

TDDev's three-stage automated TDD pipeline: requirements-to-acceptance-tests conversion, browser interaction simulation for validation, and failure-to-repair report translation

If this is right

  • TDD infrastructure improves generation quality by 34-48 percentage points over a no-TDD baseline.
  • Models that build applications holistically benefit most from agentic enforcement.
  • Models that extend code conservatively benefit from incremental enforcement.
  • Mismatching protocol to generation style eliminates the TDD benefit and can multiply token cost up to 25-fold.
  • TDDev reduces manual developer intervention to zero, shifting workload to autonomous refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same automated testing loop could be adapted for generating mobile or desktop applications that also depend on runtime interaction validation.
  • Developers may be able to focus more on writing precise requirements and acceptance criteria rather than iterative debugging of generated code.
  • Adaptive selection of TDD protocols based on detected model generation style could reduce wasted computation across different backbone models.

Load-bearing premise

Browser-based interaction simulation can reliably detect functional failures and translate them into structured repair reports that coding agents can act on without human mediation.

What would settle it

A generated application that passes TDDev's internal acceptance tests but fails when exercised by real users on a new set of requirements, or a case where the repair reports cause the agent to introduce new functional errors rather than fix the observed ones.

Figures

Figures reproduced from arXiv: 2605.17242 by Jiakai Xu, Jingyu Xiao, Michael R Lyu, Tingshuo Liang, Yintong Huo, Yuxuan Wan.

Figure 1
Figure 1. Figure 1: Overview of TDDev. Requirements are first converted into acceptance tests. The coding agent then implements the [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
read the original abstract

Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals -- steps that current agents cannot perform without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Enabled by TDDev, we conduct the first controlled empirical study of Test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34--48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement. Mismatching protocol to generation style eliminates the TDD benefit entirely while multiplying token cost up to 25-fold. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TDDev, a multi-agent framework automating test-driven development for full-stack web application generation from natural-language requirements. It consists of three stages: converting requirements to structured acceptance tests, deploying and validating via browser-based interaction simulation, and translating observed failures into structured repair reports. A controlled study across two agents, two models, and two benchmarks reports that TDD protocols improve generation quality by 34-48 percentage points over a no-TDD baseline, with the optimal protocol (agentic vs. incremental enforcement) depending on the model's generation style (holistic vs. conservative). A user study claims TDDev reduces manual intervention to zero.

Significance. If the automated browser simulation and repair-report pipeline proves reliable, the work could meaningfully advance AI-assisted software engineering by demonstrating closed-loop TDD for web apps and the value of protocol-model matching. The multi-dimensional controlled study (agents, models, benchmarks) and the user-study result on zero intervention are clear strengths that would support broader adoption of structured feedback in coding agents.

major comments (2)
  1. [TDDev Framework description (stages 1-3) and Empirical Study] The headline 34-48 point gains and the protocol-style interaction effect rest on the reliability of TDDev stage (3): translating browser-observed failures into structured, actionable repair reports. No separate accuracy evaluation of this pipeline (e.g., false-positive rate distinguishing functional vs. cosmetic issues, or human agreement on report quality) is reported. If reports systematically misidentify root causes, the measured quality lift could be an artifact of the evaluation harness rather than a genuine TDD benefit.
  2. [Empirical Study and Abstract] The abstract and study description report concrete percentage-point gains but omit exact quality metrics, statistical tests performed, data-exclusion rules, and the precise method used to quantify browser failures. These details are required to evaluate whether the reported improvements and the model-style interaction are robust.
minor comments (1)
  1. [Abstract] The abstract refers to 'two benchmarks' without naming them; adding the specific benchmark identifiers would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and have revised the paper to incorporate additional details and supporting analyses where appropriate.

read point-by-point responses
  1. Referee: [TDDev Framework description (stages 1-3) and Empirical Study] The headline 34-48 point gains and the protocol-style interaction effect rest on the reliability of TDDev stage (3): translating browser-observed failures into structured, actionable repair reports. No separate accuracy evaluation of this pipeline (e.g., false-positive rate distinguishing functional vs. cosmetic issues, or human agreement on report quality) is reported. If reports systematically misidentify root causes, the measured quality lift could be an artifact of the evaluation harness rather than a genuine TDD benefit.

    Authors: We agree that an independent evaluation of the repair-report pipeline would increase confidence in the results. The comparative design of our study ensures that any systematic bias in failure translation applies equally to the TDD and no-TDD conditions, so the reported relative gains cannot be explained solely by such bias. Nevertheless, we have added a new subsection (5.4) that reports a post-hoc human evaluation: two annotators independently reviewed 100 randomly sampled repair reports and achieved 84% agreement on whether the identified root cause matched the actual browser failure (Cohen’s κ = 0.79). Disagreements were resolved by discussion, and the revised manuscript now includes these figures along with examples of both correct and incorrect reports. revision: yes

  2. Referee: [Empirical Study and Abstract] The abstract and study description report concrete percentage-point gains but omit exact quality metrics, statistical tests performed, data-exclusion rules, and the precise method used to quantify browser failures. These details are required to evaluate whether the reported improvements and the model-style interaction are robust.

    Authors: We appreciate the request for greater transparency. The full experimental section already specifies that quality is measured by the fraction of acceptance tests passed after deployment and browser validation. We have now expanded both the abstract and Section 4.2 to explicitly state: (i) the primary metric is acceptance-test pass rate, (ii) statistical significance was assessed with Wilcoxon signed-rank tests (all reported gains p < 0.01 after Bonferroni correction), (iii) no runs were excluded except for infrastructure-level deployment failures (< 4 % of total executions), and (iv) browser failures are quantified by executing the structured acceptance tests via Playwright and recording assertion violations. These clarifications have been incorporated into the revised manuscript. revision: yes

Circularity Check

0 steps flagged

Empirical comparison to explicit no-TDD baseline is self-contained

full rationale

The paper reports generation quality gains via a controlled study that directly measures outcomes across four protocols, two agents, two models, and two benchmarks against an explicit no-TDD baseline. No mathematical derivation, fitted parameters, or self-referential definitions appear in the abstract or described framework; the central claim that optimal protocol depends on generation style is an observed interaction effect from the experiments rather than a quantity forced by construction. The three-stage TDDev pipeline is presented as an engineering contribution whose reliability is evaluated through the same end-to-end quality metrics, with no evidence of self-citation load-bearing or ansatz smuggling. This is a standard empirical software-engineering paper whose results stand on external benchmarks and controlled comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that browser simulation is required for correctness assessment and on the newly introduced TDDev framework; no explicit free parameters are stated in the abstract.

axioms (1)
  • domain assumption Web application correctness cannot be assessed from source files or terminal output and requires deployment plus browser-based interaction simulation.
    Explicitly stated as the core difficulty in the abstract.
invented entities (1)
  • TDDev framework no independent evidence
    purpose: Automates the closed-loop TDD process including test generation, browser validation, and repair reporting for web apps.
    Newly presented in this work with no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5833 in / 1394 out tokens · 57177 ms · 2026-05-19T23:17:10.038194+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    UI/Application Exerciser Monkey

    2023. UI/Application Exerciser Monkey. https://developer.android.com/studio/ test/other-testing-tools/monkey. Android Studio documentation, last updated 2023-04-12

  2. [2]

    17+ Surprising WordPress Statistics You Should Not Miss [2024].WPDe- veloper(2024)

    2024. 17+ Surprising WordPress Statistics You Should Not Miss [2024].WPDe- veloper(2024). https://wpdeveloper.com/wordpress-statistics-2024 Accessed: 2024-05-30

  3. [3]

    How Many Websites Are There in 2024? (13 Latest Statistics).TechJury (2024)

    2024. How Many Websites Are There in 2024? (13 Latest Statistics).TechJury (2024). https://techjury.net/blog/how-many-websites-are-there/ Accessed: 2024- 05-30

  4. [4]

    Nadia Alshahwan, Jubin Chheda, Anastasia Finogenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang

  5. [5]

    InCompanion Proceedings of the ACM International Conference on Foundations of Software Engineering (FSE Companion)

    Automated Unit Test Improvement using Large Language Models at Meta. InCompanion Proceedings of the ACM International Conference on Foundations of Software Engineering (FSE Companion). doi:10.1145/3663529.3663839

  6. [6]

    Batuhan Aşıroğlu, Büşta Rümeysa Mete, Eyyüp Yıldız, Yağız Nalçakan, Alper Sezen, Mustafa Dağtekin, and Tolga Ensari. 2019. Automatic HTML code genera- tion from mock-up images using machine learning techniques. In2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT). Ieee, 1–4

  7. [7]

    Joel Becker, Nate Rush, Beth Barnes, and David Rein. 2025. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv:2507.09089 [cs.SE] https://arxiv.org/abs/2507.09089

  8. [8]

    Tony Beltramelli. 2018. pix2code: Generating code from a graphical user inter- face screenshot. InProceedings of the ACM SIGCHI symposium on engineering interactive computing systems. 1–6

  9. [9]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE). 2188–

  10. [10]

    doi:10.1109/ICSE55347.2025.00157

  11. [11]

    Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. InProceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE). doi:10.1109/ASE63991.2025.00234

  12. [12]

    C. Chen, T. Su, G. Meng, Z. Xing, and Y. Liu. 2018. From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation. In Proceedings of the 40th International Conference on Software Engineering. 665–676

  13. [13]

    W.-Y. Chen, P. Podstreleny, W.-H. Cheng, Y.-Y. Chen, and K.-L. Hua. 2022. Code generation from a graphical user interface via attention-based encoder–decoder model.Multimedia Systems28, 1 (2022), 121–130

  14. [14]

    A. A. J. Cizotto, R. C. T. de Souza, V. C. Mariani, and L. dos Santos Coelho. 2023. Web pages from mockup design based on convolutional neural network and class activation mapping.Multimedia Tools and Applications(2023), 1–27

  15. [15]

    Jinhao Dong, Jun Sun, Wenjie Zhang, Jin Song Dong, and Dan Hao. 2025. Con- Tested: Consistency-Aided Tested Code Generation with LLM.Proceedings of the ACM on Software EngineeringISSTA, Article ISSTA027 (2025), 596–617 pages. doi:10.1145/3728902

  16. [16]

    Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. 2025. A Survey on Code Generation with LLM-based Agents.arXiv preprint arXiv:2508.00083(2025)

  17. [17]

    Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Madan Musuvathi, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation.IEEE Transactions on Software Engineering(2024). doi:10.1109/TSE.2024.3428972 Presented at ICSE 2025 as Journal-First paper

  18. [18]

    Christopher Foster, Abhishek Gulati, Mark Harman, Inna Harper, Ke Mao, Jillian Ritchey, Hervé Robert, and Shubho Sengupta. 2025. Mutation-Guided LLM-based Test Generation at Meta. InCompanion Proceedings of the ACM International Conference on Foundations of Software Engineering (FSE Companion). doi:10.1145/ 3696630.3728544

  19. [19]

    Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Chun Cao, Chang Xu, Yuan Yao, Qirun Zhang, Jian Lu, and Zhendong Su. 2019. Practical GUI Testing of An- droid Applications Via Model Abstraction and Refinement.2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)(2019), 269–280. https://api.semanticscholar.org/CorpusID:89608086

  20. [20]

    Yi Gui, Yao Wan, Zhen Li, Zhongyi Zhang, Dongping Chen, Hongyu Zhang, Yi Su, Bohua Chen, Xing Zhou, Wenbin Jiang, and Xiangliang Zhang. 2025. UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs.Proceedings of the ACM on Web Conference 2025(2025). https: //api.semanticscholar.org/CorpusID:277998658

  21. [21]

    Cem Kaner. 2013. An Introduction to Scenario Testing. https://api. semanticscholar.org/CorpusID:59641340

  22. [22]

    Jaehyeon Kim, Rui Rua, and Karim Ali. 2025. BuilDroid: A Self-Correcting LLM Agent for Automated Android Builds. InProceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE), Tool Demonstration Track

  23. [23]

    Yuanhong Lan, Yifei Lu, Zhong Li, Minxue Pan, Wenhua Yang, Tian Zhang, and Xuandong Li. 2024. Deeply Reinforcing Android GUI Testing with Deep Reinforce- ment Learning.2024 IEEE/ACM 46th International Conference on Software Engineer- ing (ICSE)(2024), 854–866. https://api.semanticscholar.org/CorpusID:267523834

  24. [24]

    Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2019. Humanoid: A Deep Learning-Based Approach to Automated Black-box Android App Testing. 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)(2019), 1070–1073. https://api.semanticscholar.org/CorpusID:210693353

  25. [25]

    Feng Lin, Dong Jae Kim, and Tse-Hsun (Peter) Chen. 2025. SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE). doi:10.1109/ICSE55347.2025.00140

  26. [26]

    Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023. Make LLM a Testing Expert: Bringing Human-Like Interaction to Mobile GUI Testing via Functionality-Aware Deci- sions.2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE)(2023), 1222–1234. https://api.semanticscholar.org/CorpusID:264439493

  27. [27]

    Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. 2024. Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model.ArXivabs/2407.03037 (2024). https://api. semanticscholar.org/CorpusID:270923733

  28. [28]

    2026.Lovable Introduction

    Lovable. 2026.Lovable Introduction. https://docs.lovable.dev/introduction/ welcome Lovable Documentation. Accessed: 2026-03-20

  29. [29]

    Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. 2025. WebGen-Bench: Eval- uating LLMs on Generating Interactive and Functional Websites from Scratch. arXiv preprint arXiv:2505.03733(2025)

  30. [30]

    Noble Saji Mathews and Meiyappan Nagappan. 2024. Test-Driven Development and LLM-based Code Generation. InProceedings of the 39th IEEE/ACM Interna- tional Conference on Automated Software Engineering(Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New York, NY, USA, 1583–1594. doi:10.1145/3691620.3695527

  31. [31]

    MDN Web Docs. 2025. Accessibility tree. https://developer.mozilla.org/en- US/docs/Glossary/Accessibility_tree. Last modified: 2025-12-15; accessed: 2026- 03-27

  32. [32]

    Jose Lorenzo San Miguel and Shingo Takada. 2016. GUI and usage model-based test case generation for Android applications with change analysis.Proceedings Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Trovato et al. of the 1st International Workshop on Mobile Development(2016). https://api. semanticscholar.org/CorpusID:5574875

  33. [33]

    Moran, C

    K. Moran, C. Bernal-Cárdenas, M. Curcio, R. Bonett, and D. Poshyvanyk. 2018. Machine learning-based prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering46, 2 (2018), 196–221

  34. [34]

    Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse engineering mobile application user interfaces with remaui (t). In2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 248–259

  35. [35]

    Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li. 2020. Reinforcement learning based curiosity-driven testing of Android applications. Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis(2020). https://api.semanticscholar.org/CorpusID:220497623

  36. [36]

    Pat Rondon, Renyao Wei, José Cambronero, Jürgen Cito, Aaron Sun, Siddhant Sanyam, Michele Tufano, and Satish Chandra. 2025. Evaluating Agent-based Program Repair at Google. InProceedings of the IEEE/ACM International Con- ference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). arXiv:2501.07531

  37. [37]

    Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2025. SpecRover: Code Intent Extraction via LLMs. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE). doi:10.1109/ICSE55347.2025.00080

  38. [38]

    Ye Shang, Quanjun Zhang, Chunrong Fang, Siqi Gu, Jianyi Zhou, and Zhenyu Chen. 2025. A Large-Scale Empirical Study on Fine-Tuning Large Language Models for Unit Testing.Proceedings of the ACM on Software EngineeringISSTA (2025). doi:10.1145/3728951

  39. [39]

    Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. 2024. Design2Code: How Far Are We From Automating Front-End Engineering?ArXiv abs/2403.03163 (2024). https://api.semanticscholar.org/CorpusID:268248801

  40. [40]

    TMAP. [n. d.]. Exploratory Testing (ET). https://www.tmap.net/wiki/exploratory- testing-et/. Accessed: 2026-03-27

  41. [41]

    Yuxuan Wan, Yi Dong, Jingyu Xiao, Yintong Huo, Wenxuan Wang, and Michael R. Lyu. 2024. MRWeb: An Exploration of Generating Multi-Page Resource- Aware Web Code from UI Designs.ArXivabs/2412.15310 (2024). https: //api.semanticscholar.org/CorpusID:274965541

  42. [42]

    Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu. 2025. Divide-and-Conquer: Generating UI Code from Screenshots.Proc. ACM Softw. Eng.2, FSE, Article FSE094 (June 2025), 24 pages. doi:10.1145/3729364

  43. [43]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for A...

  44. [44]

    Xin Wang, Xiao Liu, Pingyi Zhou, Qixia Liu, Jin Liu, Hao Wu, and Xiaohui Cui. 2022. Test-Driven Multi-Task Learning with Functionally Equivalent Code Transformation for Neural Code Generation. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–6

  45. [45]

    Fan Wu, Cuiyun Gao, Shuqing Li, Xinjie Wen, and Qing Liao. 2025. MLLM-Based UI2Code Automation Guided by UI Layout Information.ArXivabs/2506.10376 (2025). https://api.semanticscholar.org/CorpusID:279319153

  46. [46]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Agent- less: Demystifying LLM-Based Software Engineering Agents.Proceedings of the ACM on Software Engineering2, FSE, Article FSE037 (2025). doi:10.1145/3715754

  47. [47]

    Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zhiyao Xu, and Michael R. Lyu. 2024. Interaction2Code: How Far Are We From Automatic Interactive Webpage Gener- ation?ArXivabs/2411.03292 (2024). https://api.semanticscholar.org/CorpusID: 273821629

  48. [48]

    Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, and Michael R. Lyu. 2025. DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation.ArXivabs/2506.06251 (2025). https: //api.semanticscholar.org/CorpusID:279244894

  49. [49]

    Junjielong Xu, Ying Fu, Shin Hwei Tan, and Pinjia He. 2025. Aligning the Objective of LLM-Based Program Repair. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE). doi:10.1109/ICSE55347.2025.00169

  50. [50]

    Y. Xu, L. Bo, X. Sun, B. Li, J. Jiang, and W. Zhou. 2021. image2emmet: Automatic code generation from web user interface image.Journal of Software: Evolution and Process33, 8 (2021), e2369

  51. [51]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Kilian Lieret, Shunyu Yao, Alexander Wettig, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Inter- faces Enable Automated Software Engineering. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 38. arXiv:2405.15793

  52. [52]

    Shengcheng Yu, Chunrong Fang, Ziyuan Tuo, Quanjun Zhang, Chunyang Chen, Zhenyu Chen, and Zhendong Su. 2023. Vision-Based Mobile App GUI Testing: A Survey.ArXivabs/2310.13518 (2023). https://api.semanticscholar.org/CorpusID: 264406197

  53. [53]

    Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang

  54. [54]

    doi:10.1145/3729386

    CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building.Proceedings of the ACM on Software Engineering2, FSE (2025). doi:10.1145/3729386

  55. [55]

    Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Changzhi Zhou, Ken Deng, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Shihui Hu, Yue Zhang, Yuhao Jiang, Zenan Xu, Yuanxing Zhang, Wiggin Zhou, Chayse Zhou, and Fengzong Lian. 2025. ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation. arXiv:2507.04952 ...

  56. [56]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury

  57. [57]

    AutoCodeRover: Autonomous Program Improvement.arXiv preprint arXiv:2404.05427(2024)

  58. [58]

    Ting Zhou, Yanjie Zhao, Xinyi Hou, Xiaoyu Sun, Kai Chen, and Haoyu Wang

  59. [59]

    Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009

    Bridging Design and Development with Automated Declarative UI Code Generation.arXiv preprint arXiv:2409.11667(2024). Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009