pith. machine review for the scientific record. sign in

arxiv: 2511.02399 · v2 · submitted 2025-11-04 · 💻 cs.SE · cs.AI

EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents

Pith reviewed 2026-05-18 01:37 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agentssoftware developmentfeature-driven developmentiterative frameworkdependency modelingcontext propagationAndroid development
0
0 comments X

The pith

EvoDev's Feature Map models feature dependencies and propagates context to let LLM agents outperform linear baselines by 56.8 percent on Android tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EvoDev, an iterative framework that decomposes requirements into features connected by a directed acyclic graph called the Feature Map. This graph carries multi-level details such as business logic, design, and code, which are passed along dependency links to give each development step richer context. The authors test the approach on demanding Android projects and report clear gains over existing methods. A sympathetic reader would care because real software work involves repeated revisions and interdependencies that simple sequential pipelines fail to capture.

Core claim

EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. Evaluation on challenging Android development tasks shows EvoDev outperforming the best baseline by 56.8 percent while raising single-agent results by 16.0 to 76.6 percent across base LLMs.

What carries the argument

The Feature Map, a directed acyclic graph of features that stores and propagates multi-level information along dependency edges to supply context during iterative development steps.

Load-bearing premise

The reported gains come mainly from the Feature Map's dependency modeling and context propagation rather than from unexamined choices in prompting, metrics, or which tasks were selected.

What would settle it

A controlled test that runs identical LLM agents on the same Android tasks once with the Feature Map enabled and once without it, then checks whether the performance difference disappears.

Figures

Figures reproduced from arXiv: 2511.02399 by Chen Xu, Chong Wang, Junwei Liu, Kaseng Wong, Tong Bai, Weitong Chen, Xin Peng, Yiling Lou.

Figure 1
Figure 1. Figure 1: Overview of the FDD-inspired EvoDev framework to extract a list of features, which are defined as user-valued func￾tionalities that can be implemented within two weeks. The feature can also be integrated into feature sets, which contain a list of func￾tionally cohesive features that can be treated as a whole. The next step is to plan by feature, where team members carefully consider the dependencies and pr… view at source ↗
Figure 2
Figure 2. Figure 2: The basic activities of Feature Driven Development [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The requirement document for the countdown timer APP Next, we introduce an Architect agent to construct an overall design of the target application. In FDD, the overall model refers to the domain object model. However, our preliminary experiments indicate that LLMs struggle to generate a usable domain object model, which is also reported in prior work [12, 13, 19]. Therefore, we opt to leverage LLMs for da… view at source ↗
Figure 5
Figure 5. Figure 5: The feature list for the countdown timer APP. The [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: The overall design (including both UI and data) for [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: The feature map for the countdown timer APP, with each node containing contexts of the business, design, and [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
read the original abstract

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoDev, an iterative framework for end-to-end software development with LLM-based agents. It decomposes natural language requirements into user-valued features, constructs a directed acyclic graph (Feature Map) to explicitly model inter-feature dependencies, and propagates multi-level context (business logic, design, and code) along dependency edges to inform subsequent iterations. The central empirical claim is that EvoDev outperforms the strongest baseline (Claude Code) by 56.8% and improves single-agent performance by 16.0%-76.6% across base LLMs when evaluated on challenging Android development tasks.

Significance. If the reported gains are shown to arise specifically from the DAG-based dependency modeling and context propagation rather than from unstated prompting or iteration details, the work would be significant for shifting LLM-agent research away from linear waterfall pipelines toward more realistic iterative, dependency-aware workflows. It supplies practical design insights and could inform future LLM fine-tuning for software engineering.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): The central performance claims (56.8% margin over Claude Code; 16.0%-76.6% single-agent lifts) are stated without any description of the evaluation metrics, task selection criteria, number of trials, baseline implementations, statistical tests, or error analysis. This absence is load-bearing because the attribution of gains to the Feature Map cannot be assessed without these controls.
  2. [§4] §4 (Experiments): No ablation is reported that disables only the dependency edges and multi-level context propagation while preserving the iterative loop, agent roles, and feature decomposition. Without this isolation, it remains possible that the observed margins stem from differences in total LLM calls, prompt length, or task curation rather than the claimed DAG modeling.
minor comments (2)
  1. [§2] §2 (Related Work): The positioning against prior feature-driven development literature and existing LLM-agent frameworks (e.g., those using planning or reflection) would benefit from a more explicit comparison table.
  2. [§3] Notation: The multi-level information maintained at each Feature Map node is described in prose but never formalized (e.g., as a tuple or record type), which reduces clarity when discussing propagation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and rigor of our evaluation section. We address each major comment below and will revise the manuscript to incorporate the requested details and experiments.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central performance claims (56.8% margin over Claude Code; 16.0%-76.6% single-agent lifts) are stated without any description of the evaluation metrics, task selection criteria, number of trials, baseline implementations, statistical tests, or error analysis. This absence is load-bearing because the attribution of gains to the Feature Map cannot be assessed without these controls.

    Authors: We agree that the evaluation protocol requires more explicit and prominent description to support attribution of gains to the Feature Map. While §4 of the manuscript outlines the Android development tasks and overall setup, we acknowledge the need for greater detail on metrics, controls, and analysis. In the revised manuscript we will expand both the abstract and §4 to specify: the primary metric (feature-level task completion rate combining business-logic correctness and executable code), secondary metrics (design consistency and dependency resolution success), task selection criteria (Android apps requiring 5–12 interdependent features drawn from real-world requirement sets), number of trials (five independent runs per condition with reported standard deviation), baseline implementations (exact prompting and iteration limits used for Claude Code and other agents), statistical tests (paired t-tests with p-values), and error analysis (categorization of failures such as missed dependencies versus implementation bugs). These additions will make the evidence for the DAG-based context propagation more transparent. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation is reported that disables only the dependency edges and multi-level context propagation while preserving the iterative loop, agent roles, and feature decomposition. Without this isolation, it remains possible that the observed margins stem from differences in total LLM calls, prompt length, or task curation rather than the claimed DAG modeling.

    Authors: The referee correctly notes that our existing comparisons are against external baselines lacking the full EvoDev pipeline, which does not isolate the contribution of the dependency edges and context propagation. We will add a targeted ablation in the revised §4: a control variant that retains the iterative loop, agent roles, and feature decomposition but replaces the DAG with either independent feature processing or a fixed linear order, while matching total LLM calls and prompt lengths as closely as possible. Performance differences between this ablation and the full EvoDev will be reported to demonstrate that the observed margins arise specifically from dependency-aware multi-level context propagation rather than iteration count or task selection alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluation with no self-referential derivations

full rationale

The paper presents EvoDev as an iterative framework that decomposes requirements into features, builds a Feature Map DAG, and propagates multi-level context along dependencies. All central claims concern measured performance lifts (56.8% over Claude Code, 16.0%-76.6% single-agent gains) obtained from direct experimental comparison on Android tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described framework. The reported results are therefore independent of the inputs by construction and rest on external benchmarks rather than any reduction to the framework's own definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the untested assumption that structured dependency modeling and context passing will reliably improve LLM agent outputs on complex tasks; the Feature Map is introduced as a new construct without external validation.

axioms (1)
  • domain assumption LLM agents benefit substantially from explicit dependency modeling and context propagation when performing iterative software development
    Invoked to explain why the Feature Map produces the reported gains over linear baselines.
invented entities (1)
  • Feature Map no independent evidence
    purpose: Directed acyclic graph that stores and propagates multi-level information (business logic, design, code) across feature dependencies
    Newly defined structure central to the iterative workflow; no independent evidence provided outside the paper's evaluation.

pith-pipeline@v0.9.0 · 5763 in / 1323 out tokens · 35808 ms · 2026-05-18T01:37:20.673222+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Model-Based Agents for Software Engineering: A Survey

    cs.SE 2024-09 unverdicted novelty 4.0

    A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    2024.Android Operating System

    Andi Fitriah Abdul Kadir, Arash Habibi Lashkari, and Mahdi Daghmehchi Firooz- jaei. 2024.Android Operating System. Springer Nature Switzerland, Cham, 25–42. doi:10.1007/978-3-031-48865-8_2

  2. [2]

    Azat Abdullin, Pouria Derakhshanfar, and Annibale Panichella. 2025. Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation. In2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 221–232

  3. [3]

    Yasaman Abedini and Abbas Heydarnoori. 2025. Leveraging Large Language Models for Classifying App Users’ Feedback.arXiv preprint arXiv:2507.08250 (2025)

  4. [4]

    S Akilesh, Rajeev Sekar, Om Kumar CU, D Prakalya, and M Suguna. 2025. Multi- Agent hierarchical workflow for autonomous code generation with Large Lan- guage Models. In2025 IEEE International Students’ Conference on Electrical, Elec- tronics and Computer Science (SCEECS). IEEE, 1–5

  5. [5]

    as Qwen series is developed by Alibaba Cloud) Alibaba Cloud (implied. 2024. Qwen3 Coder - Agentic Coding Adventure. Web page. https://qwen3lm.com/

  6. [6]

    Anthropic. 2024. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. Web page. https://www.anthropic.com/news/3-5-models- and-computer-use

  7. [7]

    2025.Claude Code

    Anthropic. 2025.Claude Code. https://www.anthropic.com/claude-code

  8. [8]

    Anthropic. 2025. Claude Sonnet 4. Web page. https://www.anthropic.com/ claude/sonnet

  9. [9]

    2025.GPT-Engineer

    AntonOsika. 2025.GPT-Engineer. https://github.com/AntonOsika/gpt-engineer

  10. [10]

    Benoit Baudry, Khashayar Etemadi, Sen Fang, Yogya Gamage, Yi Liu, Yuxin Liu, Martin Monperrus, Javier Ron, André Silva, and Deepika Tiwari. 2024. Generative AI to generate test data generators.IEEE Software41, 6 (2024), 55–64

  11. [11]

    M Bialy, V Pantelic, J Jaskolka, A Schaap, L Patcas, M Lawford, and A Wassyng

  12. [12]

    In Handbook of system safety and security

    Software engineering for model-based development by domain experts. In Handbook of system safety and security. Elsevier, 39–64

  13. [13]

    Kua Chen, Yujing Yang, Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, and Dániel Varró. 2023. Automated domain modeling with large language models: A comparative study. In2023 ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS). IEEE, 162–172

  14. [14]

    Ru Chen, Jingwei Shen, and Xiao He. 2024. A Model Is Not Built By A Single Prompt: LLM-Based Domain Modeling With Question Decomposition.CoRR abs/2410.09854 (2024). arXiv:2410.09854 doi:10.48550/ARXIV.2410.09854

  15. [15]

    Matteo Ciniselli, Alberto Martin-Lopez, and Gabriele Bavota. 2024. On the generalizability of deep learning-based code completion across programming language versions. InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension. 99–111

  16. [16]

    Alibaba Cloud. 2024. Tongyi Large Language Models: The First Choice for Enterprises Embracing the AI Era. Web page. https://www.aliyun.com/product/ tongyi

  17. [17]

    2017.Research design: Qualitative, quan- titative, and mixed methods approaches

    John W Creswell and J David Creswell. 2017.Research design: Qualitative, quan- titative, and mixed methods approaches. Sage publications

  18. [18]

    Leuson Da Silva, Jordan Samhi, and Foutse Khomh. 2025. LLMs and Stack Overflow discussions: Reliability, impact, and challenges.Journal of Systems and Software(2025), 112541

  19. [19]

    2004.Domain-driven design: tackling complexity in the heart of software

    Eric Evans. 2004.Domain-driven design: tackling complexity in the heart of software. Addison-Wesley Professional

  20. [20]

    Yingqiang Ge, Wenyue Hua, Kai Mei, Juntao Tan, Shuyuan Xu, Zelong Li, Yongfeng Zhang, et al. 2023. Openagi: When llm meets domain experts.Advances in Neural Information Processing Systems36 (2023), 5539–5568

  21. [21]

    Sadhna Goyal. 2007. Agile techniques for project management and software engineering. InMajor Seminar on Feature Driven Development. 1–19

  22. [22]

    Renda Han. 2024. Survey: The Evolution and Future of Android Software Devel- opment.Deep Learning and Pattern Recognition1, 1 (2024)

  23. [23]

    Stefanus A Haryono, Ferdian Thung, David Lo, Lingxiao Jiang, Julia Lawall, Hong Jin Kang, Lucas Serrano, and Gilles Muller. 2021. Androevolve: Automated update for android deprecated-api usages. In2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 1–4

  24. [24]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representatio...

  25. [25]

    Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, and Siheng Chen. 2025. Self-Evolving Multi-Agent Collaboration Net- works for Software Development. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=4R71pdPBZp

  26. [26]

    LangChain Inc. 2024. LangGraph: Balance Agent Control with Agency. Web page. https://www.langchain.com/langgraph

  27. [27]

    Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, and Christoph Treude. 2025. CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models.arXiv preprint arXiv:2503.16167(2025)

  28. [28]

    Junwei Liu, Yixuan Chen, Mingwei Liu, Xin Peng, and Yiling Lou. 2024. STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis. CoRRabs/2406.10018 (2024). arXiv:2406.10018 doi:10.48550/ARXIV.2406.10018

  29. [29]

    Jie Liu, Guohua Wang, Ronghui Yang, Mengchen Zhao, and Yi Cai. [n. d.]. AltDev: Achieving Real-Time Alignment in Multi-Agent Software Development. ([n. d.])

  30. [30]

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large Language Model-Based Agents for Software Engineering: A Survey.CoRRabs/2409.02977 (2024). arXiv:2409.02977 doi:10. 48550/ARXIV.2409.02977

  31. [31]

    2025.Lovable

    Lovable. 2025.Lovable. https://lovable.dev/

  32. [32]

    Shriraj Mandulapalli, Emilio Hernandez, Wayne Jordan Hall, Alireza Chakeri, and Luis Jaimes. 2025. Development of Agentic Workflows with LangGraph for Software Development Life Cycle Automation. InNorth American Conference on Industrial Engineering and Operations Management-Computer Science Tracks. Springer, 45–54

  33. [33]

    Nguyen, and Nghi D

    Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, and Nghi D. Q. Bui

  34. [34]

    Lyu, Caiming Xiong, Silvio Savarese, and Doyen Sahoo

    AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology. InIEEE/ACM Second International Conference on AI Foundation Models and Software Engineering, Forge@ICSE 2025, Ottawa, ON, Canada, April 27-28, 2025. IEEE, 156–167. doi:10.1109/FORGE66646.2025.00026

  35. [35]

    OpenAI. 2025. Introducing GPT-4.1 in the API. Web page. https://openai.com/ index/gpt-4-1/

  36. [36]

    Kai Petersen, Claes Wohlin, and Dejan Baca. 2009. The waterfall model in large- scale development. InInternational Conference on Product-Focused Software Process Improvement. Springer, 386–400

  37. [37]

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangk...

  38. [38]

    John Robinson. 2024. Likert scale. InEncyclopedia of quality of life and well-being research. Springer, 3917–3918

  39. [39]

    Adriana Sejfia, Satyaki Das, Saad Shafiq, and Nenad Medvidović. 2024. Toward improved deep learning-based vulnerability detection. InProceedings of the 46th IEEE/ACM international conference on software engineering. 1–12

  40. [40]

    Purva Sharma and Jayakumar Kaliappan. 2025. Optimised Intelligent Software Company Management System using Multi-Agent Framework.Grenze Interna- tional Journal of Engineering & Technology (GIJET)11 (2025)

  41. [41]

    Syed Tauhid Ullah Shah, Mohammad Hussein, Ann Barcomb, and Mohammad Moshirpour. 2025. Explainability as a Compliance Requirement: What Regulated Industries Need from AI Tools for Design Artifact Generation.arXiv e-prints (2025), arXiv–2507

  42. [42]

    Sergey Titov, Mikhail Evtikhiev, Anton Shapkin, Oleg Smirnov, Sergei Boytsov, Dariia Karaeva, Maksim Sheptyakov, Mikhail Arkhipov, Timofey Bryksin, and Egor Bogomolov. 2024. Kotlin ML Pack: Technical Report.CoRRabs/2405.19250 Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Liu et al. (2024). arXiv:2405.19250 doi:10.48550/ARXIV.2405.19250

  43. [43]

    Jim Whitehead. 2007. Collaboration in software engineering: A roadmap. In Future of Software Engineering (FOSE’07). IEEE, 214–225

  44. [44]

    Simiao Zhang, Jiaping Wang, Guoliang Dong, Jun Sun, Yueling Zhang, and Geguang Pu. 2024. Experimenting a New Programming Practice with LLMs. CoRRabs/2401.01062 (2024). arXiv:2401.01062 doi:10.48550/ARXIV.2401.01062 Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009