pith. machine review for the scientific record.

arxiv: 2605.06464 · v2 · submitted 2026-05-07 · 💻 cs.SE

Recognition: 2 theorem links · Lean Theorem

To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

classification: 💻 cs.SE
keywords: AI-generated code · code maintenance · empirical study · autonomous agents · software evolution · GitHub · human involvement

The pith

AI-generated code receives less frequent maintenance than human-authored code, with humans performing most updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks how much ongoing work is needed to keep code written by autonomous LLM agents in good shape after it is first produced. It compares maintenance records for more than one thousand files across one hundred popular open-source projects, measuring how often each file is changed, how large those changes are, what kind of changes occur, and who makes them. The analysis reveals that AI files are updated less often than human files, that the changes are typically small, that feature additions dominate AI maintenance while bug fixes dominate human maintenance, and that people carry out nearly all of the work. A reader would care because these patterns indicate whether adopting agent-generated code actually lowers the total human effort required to sustain a codebase over time.
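
As a concrete, hedged illustration of what such mining involves (not the authors' actual pipeline), the sketch below walks a cloned repository's history and tallies, for each tracked file, how many commits touch it, how many lines those commits add or remove, and whether the committer looks like an agent account. The pydriller library, the repository path, the tracked-file list, and the agent account name are all assumptions made for this example.

```python
# Hypothetical per-file maintenance mining sketch. pydriller, the repo path,
# the tracked files, and the agent account name are illustrative assumptions,
# not the paper's tooling or data.
from collections import defaultdict
from pydriller import Repository

REPO_PATH = "path/to/cloned/repo"            # placeholder
TRACKED_FILES = {"src/example_module.py"}    # e.g. files introduced by agent PRs
AGENT_ACCOUNTS = {"example-agent[bot]"}      # hypothetical agent identity

stats = defaultdict(lambda: {"changes": 0, "lines_touched": 0, "human_changes": 0})

for commit in Repository(REPO_PATH).traverse_commits():
    for mod in commit.modified_files:
        path = mod.new_path or mod.old_path
        if path not in TRACKED_FILES:
            continue
        s = stats[path]
        s["changes"] += 1                                   # includes the introducing commit
        s["lines_touched"] += mod.added_lines + mod.deleted_lines
        if commit.author.name not in AGENT_ACCOUNTS:
            s["human_changes"] += 1                         # maintenance performed by a person

for path, s in stats.items():
    print(path, s)
```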

Core claim

Using the AIDev dataset of AI-generated pull requests together with GitHub histories, the study tracks maintenance on over 1,000 files and roughly 3,200 changes. AI-generated files undergo less frequent maintenance than human-authored files, and the updates that do occur modify only a small fraction of each file. Feature extensions are the most common modification type for AI code, whereas bug fixes predominate for human code. Human developers perform the large majority of all maintenance on both AI-generated and human-authored files.

What carries the argument

Direct comparison of maintenance frequency, change size, modification categories, and author identity between AI-generated files and human-authored files drawn from the same 100 repositories.
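
For the modification-category dimension, the paper's own labels are what carry weight; purely to illustrate what such a categorization operates on, a toy keyword heuristic over commit messages might look like the sketch below. The keyword lists are invented for this example and are not the study's classification procedure.

```python
# Toy illustration of feature-vs-bug-fix labeling from commit messages.
# The keyword lists are invented assumptions, not the study's procedure.
FEATURE_HINTS = ("add", "implement", "support", "introduce", "extend")
BUGFIX_HINTS = ("fix", "bug", "patch", "correct", "resolve")

def classify_change(commit_message: str) -> str:
    msg = commit_message.lower()
    if any(word in msg for word in BUGFIX_HINTS):
        return "bug fix"
    if any(word in msg for word in FEATURE_HINTS):
        return "feature extension"
    return "other"

print(classify_change("Fix off-by-one error in pagination"))    # bug fix
print(classify_change("Add retry support to the HTTP client"))  # feature extension
```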

If this is right

  • AI agents can supply initial code that requires relatively little subsequent modification.
  • Maintenance work on AI code shifts toward adding new capabilities rather than correcting defects.
  • Human developers continue to perform the bulk of upkeep even after agents generate the starting code.
  • The overall maintenance burden on human teams may decrease when agents handle the first implementation pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The lower maintenance rate for AI code could reflect either greater initial correctness or simply less ambitious initial functionality.
  • Similar patterns may or may not hold outside popular public repositories; smaller or internal projects could differ.
  • Training agents to produce more complete initial implementations might further reduce the need for later human feature additions.

Load-bearing premise

The AIDev dataset together with the choice of 100 popular repositories captures representative maintenance patterns without systematic bias from project popularity or from how AI code is typically used.

What would settle it

A replication on a different collection of repositories in which AI-generated files show higher commit frequency, larger average change sizes, or a higher proportion of bug-fix modifications than human-authored files in the same projects.

Figures

Figures reproduced from arXiv: 2605.06464 by Hajimu Iida, Hiroshi Iwata, Ken'ichi Yamaguchi, Shota Sawada, Tatsuya Shirai, Yutaro Kashiwa.

Figure 1: Overview of our data collection process and research questions.
Figure 2: The maintenance frequency and magnitude for agent- and human-generated files.
read the original abstract

LLM-based autonomous coding agents have reshaped software development. While these agents excel at code generation, open questions persist about the long-term maintainability of AI-generated code. This study empirically investigates the maintenance extent, human involvement, and modification types of AI-generated files versus human-authored code. Using the AIDev dataset of AI-generated pull requests and GitHub, we analyzed over 1,000 files and approximately 3,200 changes from 100 popular repositories. Our findings show that: (i) AI-generated files receive less frequent maintenance than human-authored code, with updates affecting only a small fraction of file size; (ii) the most frequent modifications to AI code are feature extensions, whereas human updates focus on bug fixes, and (iii) human developers perform the large majority of this maintenance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper conducts an empirical study on the maintenance of agent-generated code by analyzing data from the AIDev dataset, which includes AI-generated pull requests in 100 popular GitHub repositories. The authors examine over 1,000 files and approximately 3,200 changes, finding that AI-generated files experience less frequent maintenance than human-authored code, with updates affecting only a small fraction of the file size. Additionally, modifications to AI code primarily involve feature extensions, while human updates focus on bug fixes, and human developers are responsible for the majority of maintenance on AI-generated files.

Significance. Should the central findings prove robust after accounting for potential confounds such as file age, this work would make a valuable contribution to software engineering by providing empirical evidence on the maintainability of LLM-generated code in real-world settings. The use of a large-scale dataset from popular repositories lends credibility to the observations about maintenance patterns and human involvement, which could influence best practices for integrating autonomous coding agents into development workflows.

major comments (2)
  1. §4 (Results, finding i): The direct comparison of maintenance frequency and change sizes between AI-generated and human-authored files does not appear to control for or match on file age, creation date, or time since introduction into the repository. AI-generated files are added via specific PRs and are thus systematically newer, meaning lower maintenance rates could be an artifact of shorter observation windows rather than a property of the generated code. This is central to the primary claim and requires either stratification, matching, or explicit discussion of the exposure time distribution.
  2. §3 (Methodology): The description of how maintenance was measured, including definitions of change frequency, file size impact, and classification of modification types (feature extension vs. bug fix), lacks sufficient detail on the process (e.g., manual labeling criteria, automation used, inter-rater reliability if applicable). Additionally, there is no mention of statistical tests for significance or controls for confounding variables such as file complexity or project-specific factors.
minor comments (2)
  1. Abstract: The abstract would benefit from a brief mention of the key methodological approaches or limitations to better contextualize the findings for readers.
  2. Discussion or Limitations section: A more explicit limitations section addressing potential biases from selecting only popular repositories and AI usage patterns would strengthen the paper.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and have revised the paper to strengthen the analysis and methodological transparency.

read point-by-point responses
  1. Referee: [—] §4 (Results, finding i): The direct comparison of maintenance frequency and change sizes between AI-generated and human-authored files does not appear to control for or match on file age, creation date, or time since introduction into the repository. AI-generated files are added via specific PRs and are thus systematically newer, meaning lower maintenance rates could be an artifact of shorter observation windows rather than a property of the generated code. This is central to the primary claim and requires either stratification, matching, or explicit discussion of the exposure time distribution.

    Authors: We agree this is a substantive concern. AI-generated files are introduced at discrete points via PRs and therefore have shorter average observation windows than older human-authored files. In the revised manuscript we add a matched analysis that pairs each AI-generated file with human-authored files of similar age and size within the same repository, plus a stratification by quartiles of file age. The primary finding (lower maintenance frequency for AI files) remains consistent after these controls. We also include a new figure showing the distribution of observation times and report time-normalized maintenance rates. These additions appear in the updated §4 (see the first sketch after this list for the shape of such a matched, time-normalized comparison). revision: yes

  2. Referee: [—] §3 (Methodology): The description of how maintenance was measured, including definitions of change frequency, file size impact, and classification of modification types (feature extension vs. bug fix), lacks sufficient detail on the process (e.g., manual labeling criteria, automation used, inter-rater reliability if applicable). Additionally, there is no mention of statistical tests for significance or controls for confounding variables such as file complexity or project-specific factors.

    Authors: We accept that the original §3 was insufficiently detailed. The revised version expands the section with explicit operational definitions: change frequency is the count of subsequent commits touching the file; file-size impact is the percentage of lines modified relative to the file at introduction. Modification-type classification was performed manually on a random sample of 300 changes using commit messages and diffs; two authors applied a written coding scheme (feature extension = addition of new functionality; bug fix = correction of incorrect behavior) and resolved disagreements by discussion, yielding Cohen’s κ = 0.81. We now report the use of Wilcoxon rank-sum tests for univariate comparisons and multivariate negative-binomial regression models that control for file complexity (LOC), repository, and language. These details and the associated statistical results are added to §3 (see the second sketch after this list for the shape of the agreement computation). revision: yes
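
To make the matched analysis in response 1 concrete, the sketch below pairs each AI-generated file with human-authored files of similar age and size in the same repository, converts raw change counts into time-normalized rates, and compares the groups with a Wilcoxon rank-sum (Mann–Whitney U) test. The column names, matching tolerances, and toy numbers are assumptions for illustration, not the authors' data or exact procedure.

```python
# Hypothetical age-matched, time-normalized comparison; columns, tolerances,
# and values are invented for illustration, not the paper's analysis.
import pandas as pd
from scipy.stats import mannwhitneyu

# One row per file: origin, repo, age in days since introduction,
# size at introduction (LOC), and number of later commits touching it.
files = pd.DataFrame(
    {
        "origin": ["ai", "ai", "human", "human", "human"],
        "repo": ["r1", "r1", "r1", "r1", "r1"],
        "age_days": [120, 90, 130, 85, 400],
        "loc": [200, 150, 210, 160, 900],
        "later_commits": [1, 0, 3, 2, 12],
    }
)

# Time-normalized maintenance rate: commits per year of observation.
files["rate_per_year"] = files["later_commits"] / (files["age_days"] / 365.25)

def matched_humans(ai_row, pool, age_tol=0.25, loc_tol=0.5):
    """Human files in the same repo with similar age and size (tolerances are assumptions)."""
    same_repo = pool[(pool["origin"] == "human") & (pool["repo"] == ai_row["repo"])]
    close_age = (same_repo["age_days"] - ai_row["age_days"]).abs() <= age_tol * ai_row["age_days"]
    close_loc = (same_repo["loc"] - ai_row["loc"]).abs() <= loc_tol * ai_row["loc"]
    return same_repo[close_age & close_loc]

ai_rates, human_rates = [], []
for _, row in files[files["origin"] == "ai"].iterrows():
    matches = matched_humans(row, files)
    if not matches.empty:
        ai_rates.append(row["rate_per_year"])
        human_rates.append(matches["rate_per_year"].mean())

# Mann-Whitney U is the usual implementation of the Wilcoxon rank-sum test.
stat, p = mannwhitneyu(ai_rates, human_rates, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```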
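
The inter-rater agreement figure quoted in response 2 (Cohen's κ) is a standard computation; a minimal sketch with invented labels, assuming scikit-learn is available:

```python
# Minimal Cohen's kappa sketch; the labels are invented, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_a = ["feature", "bugfix", "feature", "other", "bugfix", "feature"]
rater_b = ["feature", "bugfix", "feature", "bugfix", "bugfix", "feature"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # agreement beyond chance between the two raters
```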

Circularity Check

0 steps flagged

No circularity: direct empirical analysis of repository data

full rationale

The paper conducts an empirical study by analyzing over 1,000 files and 3,200 changes from 100 GitHub repositories using the AIDev dataset. It reports observed frequencies of maintenance, modification types, and human involvement through direct counting and comparison. No mathematical derivations, parameter fitting, model predictions, or self-citation chains are present that could reduce claims to inputs by construction. The analysis is grounded in external evidence (GitHub commit histories), with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an observational empirical study relying on existing datasets and standard assumptions about data sources in software repository mining.

axioms (2)
  • domain assumption: The AIDev dataset correctly identifies AI-generated code in pull requests.
    Central to distinguishing AI vs human code.
  • domain assumption: GitHub commit history reflects all maintenance activities on the files.
    Used to measure frequency and types of changes.

pith-pipeline@v0.9.0 · 5450 in / 1277 out tokens · 45949 ms · 2026-05-12T00:56:36.949922+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor
