Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering
Pith reviewed 2026-05-07 13:21 UTC · model grok-4.3
The pith
Agentic AI systems are changing software engineering by shifting focus from code generation to supervised delegation of entire tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
According to the paper, the arrival of large language models with multi-step reasoning and tool use marks a qualitative change in software engineering. Where earlier code-completion tools operated at the granularity of a line or function, agentic systems take on repository-level or feature-level work. The work proposes a six-layer reference architecture, contrasts the traditional lifecycle with an emerging agentic one, and consolidates empirical evidence of performance and productivity gains. It argues that the central question has become one of delegated execution under human supervision, and identifies five open problems that will determine whether the change is net-positive.
What carries the argument
A six-layer reference architecture that structures agentic software engineering systems by integrating reasoning, tool use, execution, and supervision mechanisms for handling complex development tasks.
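The layered structure can be sketched as a pipeline that a task passes through. The paper's own layer names are not reproduced on this page, so the six labels below are illustrative guesses at what a reasoning, tool-use, execution, and supervision stack might contain; the code is a hypothetical sketch, not the paper's architecture.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative layer names only: the paper's exact six layers are not listed
# in this review, so these labels are assumptions inferred from the summary.
LAYERS = [
    "interface",     # task intake: an issue, feature request, or spec
    "planning",      # multi-step reasoning and task decomposition
    "tooling",       # editors, shells, search, test runners
    "execution",     # sandboxed code changes and runs
    "verification",  # tests, linters, agent self-checks
    "supervision",   # human review and approval gates, applied last
]

@dataclass
class AgentTask:
    description: str
    artifacts: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)

def run_pipeline(task: AgentTask, handlers: Dict[str, Callable]) -> AgentTask:
    """Pass a task through each layer in order, recording the path taken."""
    for layer in LAYERS:
        handler = handlers.get(layer, lambda t: t)  # default: pass-through
        task = handler(task)
        task.trace.append(layer)
    return task

task = run_pipeline(AgentTask("fix failing test in parser"), handlers={})
print(task.trace)  # every layer visited once, supervision last
```

The ordering encodes the review's framing: supervision is the outermost layer, so no artifact leaves the pipeline without passing a human-facing gate.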
If this is right
- Engineers will devote more effort to high-level direction and review of AI-generated work across projects.
- Evaluation techniques need updating to properly assess agents on full-scale software challenges.
- Approaches to handle technical debt from AI contributions will become essential in maintenance.
- Professional skills will redistribute to emphasize AI integration, oversight, and related governance.
- Decisions on the economics of human attention and collaboration will guide how these systems integrate into daily work.
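The evaluation point above can be made concrete. The toy harness below is not SWE-bench's actual code; it only illustrates the all-hidden-tests-must-pass, resolved-rate style of scoring that repository-level benchmarks use. The names `resolved`, `resolve_rate`, and the stand-in repo and patch objects are all hypothetical.

```python
from typing import Callable, Dict, List

Repo = Dict[str, int]  # stand-in for a repository's state

def resolved(patched_repo: Repo, hidden_tests: List[Callable[[Repo], bool]]) -> bool:
    """A task counts as resolved only when every held-out test passes."""
    return all(test(patched_repo) for test in hidden_tests)

def resolve_rate(submissions: List[Callable[[Repo], Repo]], tasks: List[dict]) -> float:
    """Fraction of tasks where the paired submission's patch passes all tests."""
    hits = sum(
        resolved(patch(task["repo"].copy()), task["tests"])
        for patch, task in zip(submissions, tasks)
    )
    return hits / len(tasks)

# Two toy tasks: each repo's "answer" must equal 42 after patching.
tasks = [
    {"repo": {"answer": 41}, "tests": [lambda r: r["answer"] == 42]},
    {"repo": {"answer": 40}, "tests": [lambda r: r["answer"] == 42]},
]
fixes_bug = lambda repo: {**repo, "answer": 42}  # a patch that resolves its task
no_op = lambda repo: repo                        # a patch that changes nothing

print(resolve_rate([fixes_bug, no_op], tasks))  # → 0.5
```

Binary pass/fail scoring of this kind is exactly what the evaluation bullet questions: it says nothing about code quality, maintainability, or how much human supervision each resolution consumed.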
Where Pith is reading between the lines
- Adoption of the proposed architecture might encourage development of common standards for agent interactions in software tools.
- Exploring the open problems could connect software engineering research more closely with fields like human-computer interaction and economics.
- Over time, this transition may prompt updates in computer science education to include training in supervising intelligent systems.
- Further experiments could test whether the reported efficiency gains hold in diverse project types beyond the initial studies.
Load-bearing premise
The performance and productivity data drawn from multiple studies accurately captures the real-world state of agentic AI applications without major biases in selection or reporting.
What would settle it
A comprehensive, unbiased study across many software organizations that finds minimal or no productivity improvements or task success gains from using agentic systems would disprove the central evidence.
Original abstract
The arrival of large language models (LLMs) capable of multi-step reasoning, tool use, and long-horizon planning has produced a qualitative shift in software engineering. Where earlier code-completion tools such as GitHub Copilot operated at the granularity of a line or function, modern agentic systems -- Claude Code, OpenAI Codex CLI, Google Jules, Devin, OpenHands, SWE-agent, MetaGPT, ChatDev, and DeepMind's AlphaEvolve -- operate at the granularity of a repository, a feature, or an algorithm. We synthesize work from Anthropic, OpenAI, Google DeepMind, Microsoft Research, Princeton, Stanford, and the broader academic community to characterize this transition. We propose a six-layer reference architecture for agentic software engineering systems, contrast a traditional Software Development Lifecycle (SDLC) with an emerging Agentic SDLC (A-SDLC), and consolidate empirical evidence on performance (a rise from 1.96% to 78.4% on SWE-bench Verified between October 2023 and April 2026), productivity (13.6%-55.8% time savings across controlled studies), and labor-market impact (49% of jobs sampled by Anthropic in 2026 saw AI used for at least a quarter of their tasks). We argue that the central object of inquiry has shifted from code generation to delegated execution under human supervision, and we identify five open problems -- evaluation, governance, technical debt, skill redistribution, and the economics of attention -- that will determine whether the agentic transition is net-positive for the discipline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript synthesizes advances in LLM-based agentic systems for software engineering, claiming a qualitative shift from line- or function-level code completion (e.g., GitHub Copilot) to repository- or feature-level delegated execution under human supervision. It proposes a six-layer reference architecture for such systems, contrasts the traditional SDLC with an emerging Agentic SDLC (A-SDLC), consolidates empirical evidence including SWE-bench Verified gains from 1.96% (Oct 2023) to 78.4% (Apr 2026), productivity time savings of 13.6–55.8% from controlled studies, and labor-market statistics (e.g., 49% of sampled jobs using AI for ≥25% of tasks), and identifies five open problems: evaluation, governance, technical debt, skill redistribution, and economics of attention.
Significance. If the empirical synthesis holds after addressing potential biases, the work offers a structured reference architecture and a clear framing of the transition to agentic workflows that could guide both research and practice. The explicit contrast between SDLC and A-SDLC, together with the enumerated open problems, provides a useful agenda for the field; the compilation of cross-organizational results (Anthropic, OpenAI, DeepMind, academic groups) adds breadth even if it requires stronger critical analysis.
major comments (1)
- [Empirical Evidence] Empirical Evidence section (consolidation of SWE-bench and productivity results): the central claim that the field has shifted to 'delegated execution under human supervision' rests on the reported performance deltas (1.96% → 78.4% on SWE-bench Verified; 13.6–55.8% productivity gains) and labor statistics, yet the synthesis does not discuss selection criteria for the cited studies, inclusion of null or negative results, or controls for benchmark-specific scaffolding and task ambiguity that may inflate apparent capability; this is load-bearing because the narrative of rapid, representative progress depends on the representativeness of the consolidated data.
minor comments (2)
- [Introduction / Architecture] The abstract lists specific systems (Claude Code, OpenAI Codex CLI, Google Jules, Devin, OpenHands, SWE-agent, MetaGPT, ChatDev, AlphaEvolve) without clarifying in the main text which are production tools versus research prototypes or whether any have been deprecated; this affects readability of the architecture discussion.
- [Reference Architecture] The six-layer reference architecture is introduced but the manuscript would benefit from a table or diagram explicitly mapping each layer to traditional SDLC phases and to the cited agentic systems.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The concern about transparency in the Empirical Evidence section is valid, and we will revise the manuscript to address it directly.
Point-by-point responses
-
Referee: [Empirical Evidence] Empirical Evidence section (consolidation of SWE-bench and productivity results): the central claim that the field has shifted to 'delegated execution under human supervision' rests on the reported performance deltas (1.96% → 78.4% on SWE-bench Verified; 13.6–55.8% productivity gains) and labor statistics, yet the synthesis does not discuss selection criteria for the cited studies, inclusion of null or negative results, or controls for benchmark-specific scaffolding and task ambiguity that may inflate apparent capability; this is load-bearing because the narrative of rapid, representative progress depends on the representativeness of the consolidated data.
Authors: We appreciate the referee's point that greater methodological transparency is needed to support the central narrative. The manuscript is a high-level synthesis of results from peer-reviewed papers and official benchmark reports rather than a formal systematic review or meta-analysis. The cited SWE-bench numbers are taken verbatim from the public leaderboard and associated technical reports (October 2023 to April 2026), while productivity figures come from controlled studies published by the originating organizations. In revision we will add a dedicated 'Data Sources and Selection Criteria' subsection that: (1) states the inclusion criteria (publicly reported results on SWE-bench Verified with accompanying architectural details, plus controlled productivity experiments with reported effect sizes); (2) explicitly notes the absence of published null or negative results from comparable agentic systems during the period and flags this as a limitation potentially attributable to publication bias; and (3) discusses benchmark-specific factors, including scaffolding, human oversight, and task ambiguity as described in the original SWE-bench paper. These additions will qualify the performance claims and make the basis for the shift to delegated execution clearer without changing the reported trends or overall conclusions. revision: yes
Circularity Check
No circularity; synthesis of external evidence and proposed architecture
full rationale
The paper's central claims rest on a synthesis of performance metrics and studies drawn from external organizations (Anthropic, OpenAI, DeepMind, academic groups) rather than any internal derivation, fitted parameters, or self-referential definitions. No equations, predictions, or uniqueness theorems are presented that reduce to the paper's own inputs by construction. The six-layer architecture and A-SDLC contrast are proposed as organizational frameworks, not derived results. Self-citations are absent from the load-bearing sections, and the argument for a shift to delegated execution is framed as an interpretation of cited empirical trends.
Reference graph
Works this paper leans on
-
[1]
SWE-Bench performance with Claude 3.5 Sonnet
Anthropic. SWE-Bench performance with Claude 3.5 Sonnet. https://www.anthropic.com/research/swe-bench-sonnet, 2024
2024
-
[2]
Introducing Claude 3.7 Sonnet and Claude Code
Anthropic. Introducing Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet, 2025
2025
-
[3]
Introducing Claude 4 (Opus 4 and Sonnet 4)
Anthropic. Introducing Claude 4 (Opus 4 and Sonnet 4). https://www.anthropic.com/news/claude-4, 2025
2025
-
[4]
Anthropic Economic Index: Insights from Claude conversations (January 2025 baseline)
Anthropic. Anthropic Economic Index: Insights from Claude conversations (January 2025 baseline), 2025
2025
-
[5]
New building blocks for understanding AI use
Anthropic. New building blocks for understanding AI use. https://www.anthropic.com/research/economic-index-primitives, 2025
2025
-
[6]
What 81,000 people told us about the economics of AI
Anthropic. What 81,000 people told us about the economics of AI. https://www.anthropic.com/research/81k-economics, 2026
2026
-
[7]
Claude Code: Anthropic’s agentic coding system
Anthropic. Claude Code: Anthropic's agentic coding system. https://www.anthropic.com/product/claude-code, 2026
2026
-
[8]
Labor market impacts of AI: A new measure and early evidence
Anthropic. Labor market impacts of AI: A new measure and early evidence. https://www.anthropic.com/research/labor-market-impacts, 2026
2026
-
[9]
Claude Opus 4.7 model card and system report
Anthropic. Claude Opus 4.7 model card and system report. https://www.anthropic.com, 2026
2026
-
[10]
Anthropic Economic Index report: Learning curves (February 2026 data)
Anthropic. Anthropic Economic Index report: Learning curves (February 2026 data). https://www.anthropic.com/research/economic-index-march-2026-report, 2026
2026
-
[11]
Program Synthesis with Large Language Models
Jacob Austin et al. Program synthesis with large language models. arXiv:2108.07732, 2021
2021
-
[12]
S. Bauer et al. AI-assisted programming decreases the productivity of experienced developers by increasing the technical debt and maintenance burden. arXiv:2510.10165, 2025
-
[13]
Charlotte Brandebusemeyer, Tobias Schimmer, and Bert Arnrich. Developers' experience with generative AI: First insights from an empirical mixed-methods field study. In ICSE SEIP 2026; arXiv:2512.19926, 2026
-
[14]
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
S. Casper et al. The 2025 AI Agent Index: Documenting technical and safety features of deployed agentic AI systems. arXiv:2602.17753, 2026
2026
-
[15]
Evaluating Large Language Models Trained on Code
Mark Chen et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021
2021
-
[16]
Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, and Carlos E. Jimenez. Introducing SWE-bench Verified. Technical report, OpenAI, 2024
2024
-
[17]
Agentic AI software development lifecycle: Secure ADLC playbook
Codebridge. Agentic AI software development lifecycle: Secure ADLC playbook. Codebridge Tech, 2026
2026
-
[18]
Introducing Devin, the first AI software engineer
Cognition Labs. Introducing Devin, the first AI software engineer. https://cognition.ai/blog/introducing-devin, 2024
2024
-
[19]
Thomas Dohmke, Marco Iansiti, and Greg Richards. Sea change in software development: Economic and productivity analysis of the AI-powered developer lifecycle. arXiv:2306.15033, 2023
-
[20]
Agentic development lifecycle (ADLC): A new model for AI systems beyond SDLC
EPAM. Agentic development lifecycle (ADLC): A new model for AI systems beyond SDLC. https://www.epam.com/insights, 2026
2026
- [21]
-
[22]
AlphaEvolve: A Gemini- powered coding agent for designing advanced algorithms
Google DeepMind. AlphaEvolve: A Gemini- powered coding agent for designing advanced algorithms. https://deepmind.google/blog, 2025
2025
-
[23]
MetaGPT: Meta programming for a multi-agent collaborative framework
Sirui Hong et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024. Preprint. Under review.
2024
-
[24]
SWE-bench: Can language models resolve real-world GitHub issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024
2024
-
[25]
AgentMesh: A cooperative multi-agent generative AI framework for software development automation
Sourena Khanzadeh. AgentMesh: A cooperative multi-agent generative AI framework for software development automation. arXiv:2507.19902, 2025
- [26]
-
[27]
AgileCoder: Dynamic collaborative agents for software development based on agile methodology
Minh-Hoang Nguyen et al. AgileCoder: Dynamic collaborative agents for software development based on agile methodology. arXiv:2406.11912, 2024
-
[28]
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and ...
2025
-
[29]
GPT-5.4-Codex and Codex CLI 0.120: Technical overview
OpenAI. GPT-5.4-Codex and Codex CLI 0.120: Technical overview. https://openai.com/index/codex-cli, 2026
2026
-
[30]
Modernizing the SDLC process with agentic AI
Shashikanta Parida. Modernizing the SDLC process with agentic AI. Microsoft Data Science Blog, 2025
2025
-
[31]
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. Microsoft Research Technical Report; arXiv:2302.06590, 2023
2023
-
[32]
Huy Nhat Phan et al. HyperAgent: Generalist software engineering agents to solve coding tasks at scale. arXiv:2409.16299, 2024
-
[33]
ChatDev: Communicative Agents for Software Development
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. arXiv:2307.07924, 2024
2024
-
[34]
R. Sapkota et al. Agentic AI: A comprehensive survey of architectures, applications, and future directions. arXiv:2510.25445, 2025
-
[35]
OpenHands: An open platform for AI software developers as generalist agents
Xingyao Wang et al. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations (ICLR), 2025
2025
-
[36]
Agents in software engineering: Survey, landscape, and vision
Yanlin Wang et al. Agents in software engineering: Survey, landscape, and vision. Automated Software Engineering, 32(2):1–36, 2025
2025
-
[37]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[38]
SWE-agent: Agent–computer interfaces enable automated software engineering
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent–computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[39]
SWE-bench Multimodal: Do AI systems generalize to visual software domains?
John Yang et al. SWE-bench Multimodal: Do AI systems generalize to visual software domains? arXiv:2410.03859, 2024
-
[40]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023
2023
-
[41]
AutoCodeRover: Autonomous program improvement
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In ISSTA 2024, 2024
2024
-
[42]
SWE-Compass: Towards unified evaluation of agentic coding abilities for large language models
Y. Zhao et al. SWE-Compass: Towards unified evaluation of agentic coding abilities for large language models. arXiv:2511.05459, 2025. Preprint. Under review
discussion (0)