pith. sign in

arxiv: 2511.02399 · v3 · pith:56JFTSWGnew · submitted 2025-11-04 · 💻 cs.SE · cs.AI

Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework

Pith reviewed 2026-05-18 01:37 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agentssoftware developmentfeature-driven developmentiterative frameworkdependency modelingcontext propagationAndroid development
0
0 comments X

The pith

EvoDev's Feature Map models feature dependencies and propagates context to let LLM agents outperform linear baselines by 56.8 percent on Android tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes EvoDev, an iterative framework that decomposes requirements into features connected by a directed acyclic graph called the Feature Map. This graph carries multi-level details such as business logic, design, and code, which are passed along dependency links to give each development step richer context. The authors test the approach on demanding Android projects and report clear gains over existing methods. A sympathetic reader would care because real software work involves repeated revisions and interdependencies that simple sequential pipelines fail to capture.

Core claim

EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. Evaluation on challenging Android development tasks shows EvoDev outperforming the best baseline by 56.8 percent while raising single-agent results by 16.0 to 76.6 percent across base LLMs.

What carries the argument

The Feature Map, a directed acyclic graph of features that stores and propagates multi-level information along dependency edges to supply context during iterative development steps.

Load-bearing premise

The reported gains come mainly from the Feature Map's dependency modeling and context propagation rather than from unexamined choices in prompting, metrics, or which tasks were selected.

What would settle it

A controlled test that runs identical LLM agents on the same Android tasks once with the Feature Map enabled and once without it, then checks whether the performance difference disappears.

Figures

Figures reproduced from arXiv: 2511.02399 by Chen Xu, Chong Wang, Junwei Liu, Kaseng Wong, Tong Bai, Weitong Chen, Xin Peng, Yiling Lou.

Figure 1
Figure 1. Figure 1: Overview of the FDD-inspired EvoDev framework to extract a list of features, which are defined as user-valued func￾tionalities that can be implemented within two weeks. The feature can also be integrated into feature sets, which contain a list of func￾tionally cohesive features that can be treated as a whole. The next step is to plan by feature, where team members carefully consider the dependencies and pr… view at source ↗
Figure 2
Figure 2. Figure 2: The basic activities of Feature Driven Development [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The requirement document for the countdown timer APP Next, we introduce an Architect agent to construct an overall design of the target application. In FDD, the overall model refers to the domain object model. However, our preliminary experiments indicate that LLMs struggle to generate a usable domain object model, which is also reported in prior work [12, 13, 19]. Therefore, we opt to leverage LLMs for da… view at source ↗
Figure 5
Figure 5. Figure 5: The feature list for the countdown timer APP. The [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: The overall design (including both UI and data) for [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: The feature map for the countdown timer APP, with each node containing contexts of the business, design, and [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
read the original abstract

Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each feature node in the feature map maintains multi-layer contexts, including business logic, software design, and code implementation, which are propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by 57.3%, while improving single-agent performance by 16.0%-58.5% across different base LLMs, highlighting the importance of feature decomposition, dependency modeling, context propagation, and workflow-aware agent design for end-to-end software development. Moreover, our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EvoDev, an iterative framework for end-to-end software development with LLM-based agents. It decomposes natural language requirements into user-valued features, constructs a directed acyclic graph (Feature Map) to explicitly model inter-feature dependencies, and propagates multi-level context (business logic, design, and code) along dependency edges to inform subsequent iterations. The central empirical claim is that EvoDev outperforms the strongest baseline (Claude Code) by 56.8% and improves single-agent performance by 16.0%-76.6% across base LLMs when evaluated on challenging Android development tasks.

Significance. If the reported gains are shown to arise specifically from the DAG-based dependency modeling and context propagation rather than from unstated prompting or iteration details, the work would be significant for shifting LLM-agent research away from linear waterfall pipelines toward more realistic iterative, dependency-aware workflows. It supplies practical design insights and could inform future LLM fine-tuning for software engineering.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): The central performance claims (56.8% margin over Claude Code; 16.0%-76.6% single-agent lifts) are stated without any description of the evaluation metrics, task selection criteria, number of trials, baseline implementations, statistical tests, or error analysis. This absence is load-bearing because the attribution of gains to the Feature Map cannot be assessed without these controls.
  2. [§4] §4 (Experiments): No ablation is reported that disables only the dependency edges and multi-level context propagation while preserving the iterative loop, agent roles, and feature decomposition. Without this isolation, it remains possible that the observed margins stem from differences in total LLM calls, prompt length, or task curation rather than the claimed DAG modeling.
minor comments (2)
  1. [§2] §2 (Related Work): The positioning against prior feature-driven development literature and existing LLM-agent frameworks (e.g., those using planning or reflection) would benefit from a more explicit comparison table.
  2. [§3] Notation: The multi-level information maintained at each Feature Map node is described in prose but never formalized (e.g., as a tuple or record type), which reduces clarity when discussing propagation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the clarity and rigor of our evaluation section. We address each major comment below and will revise the manuscript to incorporate the requested details and experiments.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central performance claims (56.8% margin over Claude Code; 16.0%-76.6% single-agent lifts) are stated without any description of the evaluation metrics, task selection criteria, number of trials, baseline implementations, statistical tests, or error analysis. This absence is load-bearing because the attribution of gains to the Feature Map cannot be assessed without these controls.

    Authors: We agree that the evaluation protocol requires more explicit and prominent description to support attribution of gains to the Feature Map. While §4 of the manuscript outlines the Android development tasks and overall setup, we acknowledge the need for greater detail on metrics, controls, and analysis. In the revised manuscript we will expand both the abstract and §4 to specify: the primary metric (feature-level task completion rate combining business-logic correctness and executable code), secondary metrics (design consistency and dependency resolution success), task selection criteria (Android apps requiring 5–12 interdependent features drawn from real-world requirement sets), number of trials (five independent runs per condition with reported standard deviation), baseline implementations (exact prompting and iteration limits used for Claude Code and other agents), statistical tests (paired t-tests with p-values), and error analysis (categorization of failures such as missed dependencies versus implementation bugs). These additions will make the evidence for the DAG-based context propagation more transparent. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation is reported that disables only the dependency edges and multi-level context propagation while preserving the iterative loop, agent roles, and feature decomposition. Without this isolation, it remains possible that the observed margins stem from differences in total LLM calls, prompt length, or task curation rather than the claimed DAG modeling.

    Authors: The referee correctly notes that our existing comparisons are against external baselines lacking the full EvoDev pipeline, which does not isolate the contribution of the dependency edges and context propagation. We will add a targeted ablation in the revised §4: a control variant that retains the iterative loop, agent roles, and feature decomposition but replaces the DAG with either independent feature processing or a fixed linear order, while matching total LLM calls and prompt lengths as closely as possible. Performance differences between this ablation and the full EvoDev will be reported to demonstrate that the observed margins arise specifically from dependency-aware multi-level context propagation rather than iteration count or task selection alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluation with no self-referential derivations

full rationale

The paper presents EvoDev as an iterative framework that decomposes requirements into features, builds a Feature Map DAG, and propagates multi-level context along dependencies. All central claims concern measured performance lifts (56.8% over Claude Code, 16.0%-76.6% single-agent gains) obtained from direct experimental comparison on Android tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described framework. The reported results are therefore independent of the inputs by construction and rest on external benchmarks rather than any reduction to the framework's own definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the untested assumption that structured dependency modeling and context passing will reliably improve LLM agent outputs on complex tasks; the Feature Map is introduced as a new construct without external validation.

axioms (1)
  • domain assumption LLM agents benefit substantially from explicit dependency modeling and context propagation when performing iterative software development
    Invoked to explain why the Feature Map produces the reported gains over linear baselines.
invented entities (1)
  • Feature Map no independent evidence
    purpose: Directed acyclic graph that stores and propagates multi-level information (business logic, design, code) across feature dependencies
    Newly defined structure central to the iterative workflow; no independent evidence provided outside the paper's evaluation.

pith-pipeline@v0.9.0 · 5763 in / 1323 out tokens · 35808 ms · 2026-05-18T01:37:20.673222+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Model-Based Agents for Software Engineering: A Survey

    cs.SE 2024-09 unverdicted novelty 4.0

    A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.