EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
Pith reviewed 2026-05-18 01:37 UTC · model grok-4.3
The pith
EvoDev's Feature Map models feature dependencies and propagates context to let LLM agents outperform linear baselines by 56.8 percent on Android tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. Evaluation on challenging Android development tasks shows EvoDev outperforming the best baseline by 56.8 percent while raising single-agent results by 16.0 to 76.6 percent across base LLMs.
What carries the argument
The Feature Map, a directed acyclic graph of features that stores and propagates multi-level information along dependency edges to supply context during iterative development steps.
Load-bearing premise
The reported gains come mainly from the Feature Map's dependency modeling and context propagation rather than from unexamined choices in prompting, metrics, or which tasks were selected.
What would settle it
A controlled test that runs identical LLM agents on the same Android tasks once with the Feature Map enabled and once without it, then checks whether the performance difference disappears.
Figures
read the original abstract
Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EvoDev, an iterative framework for end-to-end software development with LLM-based agents. It decomposes natural language requirements into user-valued features, constructs a directed acyclic graph (Feature Map) to explicitly model inter-feature dependencies, and propagates multi-level context (business logic, design, and code) along dependency edges to inform subsequent iterations. The central empirical claim is that EvoDev outperforms the strongest baseline (Claude Code) by 56.8% and improves single-agent performance by 16.0%-76.6% across base LLMs when evaluated on challenging Android development tasks.
Significance. If the reported gains are shown to arise specifically from the DAG-based dependency modeling and context propagation rather than from unstated prompting or iteration details, the work would be significant for shifting LLM-agent research away from linear waterfall pipelines toward more realistic iterative, dependency-aware workflows. It supplies practical design insights and could inform future LLM fine-tuning for software engineering.
major comments (2)
- [Abstract and §4] Abstract and §4 (Evaluation): The central performance claims (56.8% margin over Claude Code; 16.0%-76.6% single-agent lifts) are stated without any description of the evaluation metrics, task selection criteria, number of trials, baseline implementations, statistical tests, or error analysis. This absence is load-bearing because the attribution of gains to the Feature Map cannot be assessed without these controls.
- [§4] §4 (Experiments): No ablation is reported that disables only the dependency edges and multi-level context propagation while preserving the iterative loop, agent roles, and feature decomposition. Without this isolation, it remains possible that the observed margins stem from differences in total LLM calls, prompt length, or task curation rather than the claimed DAG modeling.
minor comments (2)
- [§2] §2 (Related Work): The positioning against prior feature-driven development literature and existing LLM-agent frameworks (e.g., those using planning or reflection) would benefit from a more explicit comparison table.
- [§3] Notation: The multi-level information maintained at each Feature Map node is described in prose but never formalized (e.g., as a tuple or record type), which reduces clarity when discussing propagation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the clarity and rigor of our evaluation section. We address each major comment below and will revise the manuscript to incorporate the requested details and experiments.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central performance claims (56.8% margin over Claude Code; 16.0%-76.6% single-agent lifts) are stated without any description of the evaluation metrics, task selection criteria, number of trials, baseline implementations, statistical tests, or error analysis. This absence is load-bearing because the attribution of gains to the Feature Map cannot be assessed without these controls.
Authors: We agree that the evaluation protocol requires more explicit and prominent description to support attribution of gains to the Feature Map. While §4 of the manuscript outlines the Android development tasks and overall setup, we acknowledge the need for greater detail on metrics, controls, and analysis. In the revised manuscript we will expand both the abstract and §4 to specify: the primary metric (feature-level task completion rate combining business-logic correctness and executable code), secondary metrics (design consistency and dependency resolution success), task selection criteria (Android apps requiring 5–12 interdependent features drawn from real-world requirement sets), number of trials (five independent runs per condition with reported standard deviation), baseline implementations (exact prompting and iteration limits used for Claude Code and other agents), statistical tests (paired t-tests with p-values), and error analysis (categorization of failures such as missed dependencies versus implementation bugs). These additions will make the evidence for the DAG-based context propagation more transparent. revision: yes
-
Referee: [§4] §4 (Experiments): No ablation is reported that disables only the dependency edges and multi-level context propagation while preserving the iterative loop, agent roles, and feature decomposition. Without this isolation, it remains possible that the observed margins stem from differences in total LLM calls, prompt length, or task curation rather than the claimed DAG modeling.
Authors: The referee correctly notes that our existing comparisons are against external baselines lacking the full EvoDev pipeline, which does not isolate the contribution of the dependency edges and context propagation. We will add a targeted ablation in the revised §4: a control variant that retains the iterative loop, agent roles, and feature decomposition but replaces the DAG with either independent feature processing or a fixed linear order, while matching total LLM calls and prompt lengths as closely as possible. Performance differences between this ablation and the full EvoDev will be reported to demonstrate that the observed margins arise specifically from dependency-aware multi-level context propagation rather than iteration count or task selection alone. revision: yes
Circularity Check
No circularity: empirical framework evaluation with no self-referential derivations
full rationale
The paper presents EvoDev as an iterative framework that decomposes requirements into features, builds a Feature Map DAG, and propagates multi-level context along dependencies. All central claims concern measured performance lifts (56.8% over Claude Code, 16.0%-76.6% single-agent gains) obtained from direct experimental comparison on Android tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described framework. The reported results are therefore independent of the inputs by construction and rest on external benchmarks rather than any reduction to the framework's own definitions or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM agents benefit substantially from explicit dependency modeling and context propagation when performing iterative software development
invented entities (1)
-
Feature Map
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Large Language Model-Based Agents for Software Engineering: A Survey
A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
Reference graph
Works this paper leans on
-
[1]
Andi Fitriah Abdul Kadir, Arash Habibi Lashkari, and Mahdi Daghmehchi Firooz- jaei. 2024.Android Operating System. Springer Nature Switzerland, Cham, 25–42. doi:10.1007/978-3-031-48865-8_2
-
[2]
Azat Abdullin, Pouria Derakhshanfar, and Annibale Panichella. 2025. Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation. In2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 221–232
work page 2025
- [3]
-
[4]
S Akilesh, Rajeev Sekar, Om Kumar CU, D Prakalya, and M Suguna. 2025. Multi- Agent hierarchical workflow for autonomous code generation with Large Lan- guage Models. In2025 IEEE International Students’ Conference on Electrical, Elec- tronics and Computer Science (SCEECS). IEEE, 1–5
work page 2025
-
[5]
as Qwen series is developed by Alibaba Cloud) Alibaba Cloud (implied. 2024. Qwen3 Coder - Agentic Coding Adventure. Web page. https://qwen3lm.com/
work page 2024
-
[6]
Anthropic. 2024. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. Web page. https://www.anthropic.com/news/3-5-models- and-computer-use
work page 2024
- [7]
-
[8]
Anthropic. 2025. Claude Sonnet 4. Web page. https://www.anthropic.com/ claude/sonnet
work page 2025
-
[9]
AntonOsika. 2025.GPT-Engineer. https://github.com/AntonOsika/gpt-engineer
work page 2025
-
[10]
Benoit Baudry, Khashayar Etemadi, Sen Fang, Yogya Gamage, Yi Liu, Yuxin Liu, Martin Monperrus, Javier Ron, André Silva, and Deepika Tiwari. 2024. Generative AI to generate test data generators.IEEE Software41, 6 (2024), 55–64
work page 2024
-
[11]
M Bialy, V Pantelic, J Jaskolka, A Schaap, L Patcas, M Lawford, and A Wassyng
-
[12]
In Handbook of system safety and security
Software engineering for model-based development by domain experts. In Handbook of system safety and security. Elsevier, 39–64
-
[13]
Kua Chen, Yujing Yang, Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, and Dániel Varró. 2023. Automated domain modeling with large language models: A comparative study. In2023 ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS). IEEE, 162–172
work page 2023
-
[14]
Ru Chen, Jingwei Shen, and Xiao He. 2024. A Model Is Not Built By A Single Prompt: LLM-Based Domain Modeling With Question Decomposition.CoRR abs/2410.09854 (2024). arXiv:2410.09854 doi:10.48550/ARXIV.2410.09854
-
[15]
Matteo Ciniselli, Alberto Martin-Lopez, and Gabriele Bavota. 2024. On the generalizability of deep learning-based code completion across programming language versions. InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension. 99–111
work page 2024
-
[16]
Alibaba Cloud. 2024. Tongyi Large Language Models: The First Choice for Enterprises Embracing the AI Era. Web page. https://www.aliyun.com/product/ tongyi
work page 2024
-
[17]
2017.Research design: Qualitative, quan- titative, and mixed methods approaches
John W Creswell and J David Creswell. 2017.Research design: Qualitative, quan- titative, and mixed methods approaches. Sage publications
work page 2017
-
[18]
Leuson Da Silva, Jordan Samhi, and Foutse Khomh. 2025. LLMs and Stack Overflow discussions: Reliability, impact, and challenges.Journal of Systems and Software(2025), 112541
work page 2025
-
[19]
2004.Domain-driven design: tackling complexity in the heart of software
Eric Evans. 2004.Domain-driven design: tackling complexity in the heart of software. Addison-Wesley Professional
work page 2004
-
[20]
Yingqiang Ge, Wenyue Hua, Kai Mei, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang, et al. 2023. Openagi: When llm meets domain experts.Advances in Neural Information Processing Systems36 (2023), 5539–5568
work page 2023
-
[21]
Sadhna Goyal. 2007. Agile techniques for project management and software engineering. InMajor Seminar on Feature Driven Development. 1–19
work page 2007
-
[22]
Renda Han. 2024. Survey: The Evolution and Future of Android Software Devel- opment.Deep Learning and Pattern Recognition1, 1 (2024)
work page 2024
-
[23]
Stefanus A Haryono, Ferdian Thung, David Lo, Lingxiao Jiang, Julia Lawall, Hong Jin Kang, Lucas Serrano, and Gilles Muller. 2021. Androevolve: Automated update for android deprecated-api usages. In2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 1–4
work page 2021
-
[24]
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representatio...
work page 2024
-
[25]
Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, and Siheng Chen. 2025. Self-Evolving Multi-Agent Collaboration Net- works for Software Development. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=4R71pdPBZp
work page 2025
-
[26]
LangChain Inc. 2024. LangGraph: Balance Agent Control with Agency. Web page. https://www.langchain.com/langgraph
work page 2024
- [27]
-
[28]
Junwei Liu, Yixuan Chen, Mingwei Liu, Xin Peng, and Yiling Lou. 2024. STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis. CoRRabs/2406.10018 (2024). arXiv:2406.10018 doi:10.48550/ARXIV.2406.10018
-
[29]
Jie Liu, Guohua Wang, Ronghui Yang, Mengchen Zhao, and Yi Cai. [n. d.]. AltDev: Achieving Real-Time Alignment in Multi-Agent Software Development. ([n. d.])
-
[30]
Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large Language Model-Based Agents for Software Engineering: A Survey.CoRRabs/2409.02977 (2024). arXiv:2409.02977 doi:10. 48550/ARXIV.2409.02977
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [31]
-
[32]
Shriraj Mandulapalli, Emilio Hernandez, Wayne Jordan Hall, Alireza Chakeri, and Luis Jaimes. 2025. Development of Agentic Workflows with LangGraph for Software Development Life Cycle Automation. InNorth American Conference on Industrial Engineering and Operations Management-Computer Science Tracks. Springer, 45–54
work page 2025
-
[33]
Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, and Nghi D. Q. Bui
-
[34]
Lyu, Caiming Xiong, Silvio Savarese, and Doyen Sahoo
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology. InIEEE/ACM Second International Conference on AI Foundation Models and Software Engineering, Forge@ICSE 2025, Ottawa, ON, Canada, April 27-28, 2025. IEEE, 156–167. doi:10.1109/FORGE66646.2025.00026
-
[35]
OpenAI. 2025. Introducing GPT-4.1 in the API. Web page. https://openai.com/ index/gpt-4-1/
work page 2025
-
[36]
Kai Petersen, Claes Wohlin, and Dejan Baca. 2009. The waterfall model in large- scale development. InInternational Conference on Product-Focused Software Process Improvement. Springer, 386–400
work page 2009
-
[37]
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangk...
-
[38]
John Robinson. 2024. Likert scale. InEncyclopedia of quality of life and well-being research. Springer, 3917–3918
work page 2024
-
[39]
Adriana Sejfia, Satyaki Das, Saad Shafiq, and Nenad Medvidović. 2024. Toward improved deep learning-based vulnerability detection. InProceedings of the 46th IEEE/ACM international conference on software engineering. 1–12
work page 2024
-
[40]
Purva Sharma and Jayakumar Kaliappan. 2025. Optimised Intelligent Software Company Management System using Multi-Agent Framework.Grenze Interna- tional Journal of Engineering & Technology (GIJET)11 (2025)
work page 2025
-
[41]
Syed Tauhid Ullah Shah, Mohammad Hussein, Ann Barcomb, and Mohammad Moshirpour. 2025. Explainability as a Compliance Requirement: What Regulated Industries Need from AI Tools for Design Artifact Generation.arXiv e-prints (2025), arXiv–2507
work page 2025
-
[42]
Sergey Titov, Mikhail Evtikhiev, Anton Shapkin, Oleg Smirnov, Sergei Boytsov, Dariia Karaeva, Maksim Sheptyakov, Mikhail Arkhipov, Timofey Bryksin, and Egor Bogomolov. 2024. Kotlin ML Pack: Technical Report.CoRRabs/2405.19250 Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Liu et al. (2024). arXiv:2405.19250 doi:10.48550/ARXIV.2405.19250
-
[43]
Jim Whitehead. 2007. Collaboration in software engineering: A roadmap. In Future of Software Engineering (FOSE’07). IEEE, 214–225
work page 2007
-
[44]
Simiao Zhang, Jiaping Wang, Guoliang Dong, Jun Sun, Yueling Zhang, and Geguang Pu. 2024. Experimenting a New Programming Practice with LLMs. CoRRabs/2401.01062 (2024). arXiv:2401.01062 doi:10.48550/ARXIV.2401.01062 Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.