pith. machine review for the scientific record.

arxiv: 2604.26275 · v1 · submitted 2026-04-29 · 💻 cs.SE

Recognition: unknown

Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering

Happy Bhati


Pith reviewed 2026-05-07 13:21 UTC · model grok-4.3

classification 💻 cs.SE
keywords agentic AI · software development lifecycle · delegated execution · AI supervision · productivity gains · technical debt · evaluation challenges · workforce skills

The pith

Agentic AI systems are changing software engineering by shifting focus from code generation to supervised delegation of entire tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how advanced AI models with reasoning and planning abilities are transforming software development practices. Where previous tools assisted at the level of individual lines or functions, current agentic systems manage larger scopes like features or algorithms with human oversight. It offers a six-layer architecture to model these systems, contrasts it against the standard development lifecycle, and compiles evidence of notable improvements in task success rates and time efficiency. The authors maintain that this evolution redirects attention toward effective delegation and supervision, while flagging several unresolved issues that could affect the overall value to the field. Readers should care because it points to a potential redefinition of engineering roles and workflows in the near term.

Core claim

According to the paper, the arrival of large language models with multi-step reasoning and tool use marks a qualitative change in software engineering. Earlier code tools worked on small scales, but agentic systems handle repository-level or feature-level work. The work proposes a six-layer reference architecture, sets the traditional lifecycle against a new agentic one, and presents consolidated evidence of performance increases and productivity benefits. It asserts that the central question has become one of delegated execution under human supervision, and identifies five open problems that will decide whether the change is advantageous.

What carries the argument

A six-layer reference architecture that structures agentic software engineering systems by integrating reasoning, tool use, execution, and supervision mechanisms for handling complex development tasks.
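
The supervision pattern the architecture describes can be sketched in a few lines. This is an illustrative sketch only, not code from the paper: the class and function names are invented here, and the layer annotations in the comments map loosely onto the paper's L0–L5 stack.

```python
from dataclasses import dataclass

# Hedged sketch of "delegated execution under human supervision".
# All names below are invented for illustration; the paper defines
# the six layers conceptually, not as an API.

@dataclass
class Task:
    description: str
    approved: bool = False
    result: str = ""

class Orchestrator:
    """L4 Orchestration: split a feature into sub-tasks and route them."""

    def __init__(self, agent, approve):
        self.agent = agent      # L0-L3: model, reasoning, agent-computer interface, tools
        self.approve = approve  # L5: governance & safety -- the human approval gate

    def run(self, feature: str, subtasks: list[str]) -> list[Task]:
        tasks = [Task(d) for d in subtasks]
        for task in tasks:
            draft = self.agent(task.description)       # delegated execution
            if self.approve(task.description, draft):  # human supervises at the gate
                task.result, task.approved = draft, True
            # rejected work stays unmerged; feedback flows back down the stack
        return tasks

# Toy stand-ins for the agent and the human reviewer.
agent = lambda desc: f"patch for: {desc}"
approve = lambda desc, draft: "tests" in desc  # e.g. only auto-approve test changes

done = Orchestrator(agent, approve).run(
    "add caching", ["write tests for cache", "implement cache layer"]
)
print([(t.description, t.approved) for t in done])
```

The point of the sketch is the shape of the loop, not the stand-in implementations: the agent produces whole units of work, and the human's role is reduced to the approval gate rather than the keystrokes.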

If this is right

  • Engineers will devote more effort to high-level direction and review of AI-generated work across projects.
  • Evaluation techniques need updating to properly assess agents on full-scale software challenges.
  • Approaches to handle technical debt from AI contributions will become essential in maintenance.
  • Professional skills will redistribute to emphasize AI integration, oversight, and related governance.
  • Decisions on the economics of human attention and collaboration will guide how these systems integrate into daily work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of the proposed architecture might encourage development of common standards for agent interactions in software tools.
  • Exploring the open problems could connect software engineering research more closely with fields like human-computer interaction and economics.
  • Over time, this transition may prompt updates in computer science education to include training in supervising intelligent systems.
  • Further experiments could test whether the reported efficiency gains hold in diverse project types beyond the initial studies.

Load-bearing premise

The performance and productivity data drawn from multiple studies accurately captures the real-world state of agentic AI applications without major biases in selection or reporting.

What would settle it

A comprehensive, unbiased study across many software organizations that finds minimal or no productivity improvements or task-success gains from using agentic systems would undercut the paper's central evidence.

Figures

Figures reproduced from arXiv: 2604.26275 by Happy Bhati.

Figure 1. Issue-resolution rate on SWE-bench Verified, October 2023 to April 2026. Agentic systems (solid) compounded rapid gains; non-agentic RAG-based LLMs (dashed) plateaued near 20%. Data compiled from [7, 9, 16, 24, 29, 38].
Figure 2. Traditional SDLC (a) vs. Agentic SDLC (b). In the A-SDLC, an orchestrator coordinates specialized sub-agents while the human supervises at intent, review, and approval gates.
Figure 3. Six-layer reference architecture: L0 Foundation Model, L1 Reasoning & Memory, L2 Agent–Computer Interface, L3 Tools & Environment, L4 Orchestration, L5 Governance & Safety. Dependency runs upward; feedback flows downward (dashed red).
Figure 4. Productivity gains across four studies of AI-assisted coding [10, 13, 19, 31]. Controlled experiments report the largest effects.
Original abstract

The arrival of large language models (LLMs) capable of multi-step reasoning, tool use, and long-horizon planning has produced a qualitative shift in software engineering. Where earlier code-completion tools such as GitHub Copilot operated at the granularity of a line or function, modern agentic systems -- Claude Code, OpenAI Codex CLI, Google Jules, Devin, OpenHands, SWE-agent, MetaGPT, ChatDev, and DeepMind's AlphaEvolve -- operate at the granularity of a repository, a feature, or an algorithm. We synthesize work from Anthropic, OpenAI, Google DeepMind, Microsoft Research, Princeton, Stanford, and the broader academic community to characterize this transition. We propose a six-layer reference architecture for agentic software engineering systems, contrast a traditional Software Development Lifecycle (SDLC) with an emerging Agentic SDLC (A-SDLC), and consolidate empirical evidence on performance (a rise from 1.96% to 78.4% on SWE-bench Verified between October 2023 and April 2026), productivity (13.6%-55.8% time savings across controlled studies), and labor-market impact (49% of jobs sampled by Anthropic in 2026 saw AI used for at least a quarter of their tasks). We argue that the central object of inquiry has shifted from code generation to delegated execution under human supervision, and we identify five open problems -- evaluation, governance, technical debt, skill redistribution, and the economics of attention -- that will determine whether the agentic transition is net-positive for the discipline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript synthesizes advances in LLM-based agentic systems for software engineering, claiming a qualitative shift from line- or function-level code completion (e.g., GitHub Copilot) to repository- or feature-level delegated execution under human supervision. It proposes a six-layer reference architecture for such systems, contrasts the traditional SDLC with an emerging Agentic SDLC (A-SDLC), consolidates empirical evidence including SWE-bench Verified gains from 1.96% (Oct 2023) to 78.4% (Apr 2026), productivity time savings of 13.6–55.8% from controlled studies, and labor-market statistics (e.g., 49% of sampled jobs using AI for ≥25% of tasks), and identifies five open problems: evaluation, governance, technical debt, skill redistribution, and economics of attention.

Significance. If the empirical synthesis holds after addressing potential biases, the work offers a structured reference architecture and a clear framing of the transition to agentic workflows that could guide both research and practice. The explicit contrast between SDLC and A-SDLC, together with the enumerated open problems, provides a useful agenda for the field; the compilation of cross-organizational results (Anthropic, OpenAI, DeepMind, academic groups) adds breadth even if it requires stronger critical analysis.

major comments (1)
  1. [Empirical Evidence] Empirical Evidence section (consolidation of SWE-bench and productivity results): the central claim that the field has shifted to 'delegated execution under human supervision' rests on the reported performance deltas (1.96% → 78.4% on SWE-bench Verified; 13.6–55.8% productivity gains) and labor statistics, yet the synthesis does not discuss selection criteria for the cited studies, inclusion of null or negative results, or controls for benchmark-specific scaffolding and task ambiguity that may inflate apparent capability; this is load-bearing because the narrative of rapid, representative progress depends on the representativeness of the consolidated data.
minor comments (2)
  1. [Introduction / Architecture] The abstract lists specific systems (Claude Code, OpenAI Codex CLI, Google Jules, Devin, OpenHands, SWE-agent, MetaGPT, ChatDev, AlphaEvolve) without clarifying in the main text which are production tools versus research prototypes or whether any have been deprecated; this affects readability of the architecture discussion.
  2. [Reference Architecture] The six-layer reference architecture is introduced but the manuscript would benefit from a table or diagram explicitly mapping each layer to traditional SDLC phases and to the cited agentic systems.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The concern about transparency in the Empirical Evidence section is valid, and we will revise the manuscript to address it directly.

Point-by-point responses
  1. Referee: [Empirical Evidence] Empirical Evidence section (consolidation of SWE-bench and productivity results): the central claim that the field has shifted to 'delegated execution under human supervision' rests on the reported performance deltas (1.96% → 78.4% on SWE-bench Verified; 13.6–55.8% productivity gains) and labor statistics, yet the synthesis does not discuss selection criteria for the cited studies, inclusion of null or negative results, or controls for benchmark-specific scaffolding and task ambiguity that may inflate apparent capability; this is load-bearing because the narrative of rapid, representative progress depends on the representativeness of the consolidated data.

    Authors: We appreciate the referee's point that greater methodological transparency is needed to support the central narrative. The manuscript is a high-level synthesis of results from peer-reviewed papers and official benchmark reports rather than a formal systematic review or meta-analysis. The cited SWE-bench numbers are taken verbatim from the public leaderboard and associated technical reports (October 2023 to April 2026), while productivity figures come from controlled studies published by the originating organizations. In revision we will add a dedicated 'Data Sources and Selection Criteria' subsection that: (1) states the inclusion criteria (publicly reported results on SWE-bench Verified with accompanying architectural details, plus controlled productivity experiments with reported effect sizes); (2) explicitly notes the absence of published null or negative results from comparable agentic systems during the period and flags this as a limitation potentially attributable to publication bias; and (3) discusses benchmark-specific factors, including scaffolding, human oversight, and task ambiguity as described in the original SWE-bench paper. These additions will qualify the performance claims and make the basis for the shift to delegated execution clearer without changing the reported trends or overall conclusions. revision: yes

Circularity Check

0 steps flagged

No circularity; synthesis of external evidence and proposed architecture

full rationale

The paper's central claims rest on a synthesis of performance metrics and studies drawn from external organizations (Anthropic, OpenAI, DeepMind, academic groups) rather than any internal derivation, fitted parameters, or self-referential definitions. No equations, predictions, or uniqueness theorems are presented that reduce to the paper's own inputs by construction. The six-layer architecture and A-SDLC contrast are proposed as organizational frameworks, not derived results. Self-citations are absent from the load-bearing sections, and the argument for a shift to delegated execution is framed as an interpretation of cited empirical trends.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature synthesis and architectural proposal paper. No new free parameters, mathematical axioms, or invented physical entities are introduced; the central claims rest on the accuracy of the cited external empirical studies and the utility of the proposed conceptual layers.

pith-pipeline@v0.9.0 · 5588 in / 1263 out tokens · 65062 ms · 2026-05-07T13:21:25.704652+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 17 canonical work pages · 6 internal anchors

  [1] Anthropic. SWE-Bench performance with Claude 3.5 Sonnet. https://www.anthropic.com/research/swe-bench-sonnet, 2024.
  [2] Anthropic. Introducing Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet, 2025.
  [3] Anthropic. Introducing Claude 4 (Opus 4 and Sonnet 4). https://www.anthropic.com/news/claude-4, 2025.
  [4] Anthropic. Anthropic Economic Index: Insights from Claude conversations (January 2025 baseline), 2025.
  [5] Anthropic. New building blocks for understanding AI use. https://www.anthropic.com/research/economic-index-primitives, 2025.
  [6] Anthropic. What 81,000 people told us about the economics of AI. https://www.anthropic.com/research/81k-economics, 2026.
  [7] Anthropic. Claude Code: Anthropic's agentic coding system. https://www.anthropic.com/product/claude-code, 2026.
  [8] Anthropic. Labor market impacts of AI: A new measure and early evidence. https://www.anthropic.com/research/labor-market-impacts, 2026.
  [9] Anthropic. Claude Opus 4.7 model card and system report. https://www.anthropic.com, 2026.
  [10] Anthropic. Anthropic Economic Index report: Learning curves (February 2026 data). https://www.anthropic.com/research/economic-index-march-2026-report, 2026.
  [11] Jacob Austin et al. Program synthesis with large language models. arXiv:2108.07732, 2021.
  [12] S. Bauer et al. AI-assisted programming decreases the productivity of experienced developers by increasing the technical debt and maintenance burden. arXiv:2510.10165, 2025.
  [13] Charlotte Brandebusemeyer, Tobias Schimmer, and Bert Arnrich. Developers' experience with generative AI: First insights from an empirical mixed-methods field study. In ICSE SEIP 2026; arXiv:2512.19926, 2026.
  [14] S. Casper et al. The 2025 AI Agent Index: Documenting technical and safety features of deployed agentic AI systems. arXiv:2602.17753, 2026.
  [15] Mark Chen et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021.
  [16] Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, and Carlos E. Jimenez. Introducing SWE-bench Verified. Technical report, OpenAI, 2024.
  [17] Codebridge. Agentic AI software development lifecycle: Secure ADLC playbook. Codebridge Tech, 2026.
  [18] Cognition Labs. Introducing Devin, the first AI software engineer. https://cognition.ai/blog/introducing-devin, 2024.
  [19] Thomas Dohmke, Marco Iansiti, and Greg Richards. Sea change in software development: Economic and productivity analysis of the AI-powered developer lifecycle. arXiv:2306.15033, 2023.
  [20] EPAM. Agentic development lifecycle (ADLC): A new model for AI systems beyond SDLC. https://www.epam.com/insights, 2026.
  [21] X. Gao et al. SWE-Bench-CL: Continual learning for coding agents. arXiv:2507.00014, 2025.
  [22] Google DeepMind. AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms. https://deepmind.google/blog, 2025.
  [23] Sirui Hong et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR), 2024.
  [24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024.
  [25] Sourena Khanzadeh. AgentMesh: A cooperative multi-agent generative AI framework for software development automation. arXiv:2507.19902, 2025.
  [26] Y. Liu et al. A comprehensive survey on benchmarks and solutions in software engineering of LLM-empowered agentic systems. arXiv:2510.09721, 2025.
  [27] Minh-Hoang Nguyen et al. AgileCoder: Dynamic collaborative agents for software development based on agile methodology. arXiv:2406.11912, 2024.
  [28] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and ...
  [29] OpenAI. GPT-5.4-Codex and Codex CLI 0.120: Technical overview. https://openai.com/index/codex-cli, 2026.
  [30] Shashikanta Parida. Modernizing the SDLC process with agentic AI. Microsoft Data Science Blog, 2025.
  [31] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. Microsoft Research Technical Report; arXiv:2302.06590, 2023.
  [32] Huy Nhat Phan et al. HyperAgent: Generalist software engineering agents to solve coding tasks at scale. arXiv:2409.16299, 2024.
  [33] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. arXiv:2307.07924, 2024.
  [34] R. Sapkota et al. Agentic AI: A comprehensive survey of architectures, applications, and future directions. arXiv:2510.25445, 2025.
  [35] Xingyao Wang et al. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representations (ICLR), 2025.
  [36] Yanlin Wang et al. Agents in software engineering: Survey, landscape, and vision. Automated Software Engineering, 32(2):1–36, 2025.
  [37] Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  [38] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent–computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  [39] John Yang et al. SWE-bench Multimodal: Do AI systems generalize to visual software domains? arXiv:2410.03859, 2024.
  [40] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
  [41] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In ISSTA 2024, 2024.
  [42] Y. Zhao et al. SWE-Compass: Towards unified evaluation of agentic coding abilities for large language models. arXiv:2511.05459, 2025.