pith. machine review for the scientific record.

arxiv: 2605.06464 · v2 · submitted 2026-05-07 · 💻 cs.SE

Recognition: 2 theorem links · Lean Theorem

To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

classification: 💻 cs.SE
keywords: AI-generated code · code maintenance · empirical study · autonomous agents · software evolution · GitHub · human involvement

The pith

AI-generated code receives less frequent maintenance than human-authored code, with humans performing most updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks how much ongoing work is needed to keep code written by autonomous LLM agents in good shape after it is first produced. It compares maintenance records for more than one thousand files across one hundred popular open-source projects, measuring how often each file is changed, how large those changes are, what kind of changes occur, and who makes them. The analysis reveals that AI files are updated less often than human files, that the changes are typically small, that feature additions dominate AI maintenance while bug fixes dominate human maintenance, and that people carry out nearly all of the work. A reader would care because these patterns indicate whether adopting agent-generated code actually lowers the total human effort required to sustain a codebase over time.
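
As a concrete, hedged illustration of what such mining involves (not the authors' actual pipeline), the sketch below walks a cloned repository's history and tallies, for each tracked file, how many commits touch it, how many lines those commits add or remove, and whether the committer looks like an agent account. The pydriller library, the repository path, the tracked-file list, and the agent account name are all assumptions made for this example.

```python
# Hypothetical per-file maintenance mining sketch. pydriller, the repo path,
# the tracked files, and the agent account name are illustrative assumptions,
# not the paper's tooling or data.
from collections import defaultdict
from pydriller import Repository

REPO_PATH = "path/to/cloned/repo"            # placeholder
TRACKED_FILES = {"src/example_module.py"}    # e.g. files introduced by agent PRs
AGENT_ACCOUNTS = {"example-agent[bot]"}      # hypothetical agent identity

stats = defaultdict(lambda: {"changes": 0, "lines_touched": 0, "human_changes": 0})

for commit in Repository(REPO_PATH).traverse_commits():
    for mod in commit.modified_files:
        path = mod.new_path or mod.old_path
        if path not in TRACKED_FILES:
            continue
        s = stats[path]
        s["changes"] += 1                                   # includes the introducing commit
        s["lines_touched"] += mod.added_lines + mod.deleted_lines
        if commit.author.name not in AGENT_ACCOUNTS:
            s["human_changes"] += 1                         # maintenance performed by a person

for path, s in stats.items():
    print(path, s)
```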

Core claim

Using the AIDev dataset of AI-generated pull requests together with GitHub histories, the study tracks maintenance on over 1,000 files and roughly 3,200 changes. AI-generated files undergo less frequent maintenance than human-authored files, and the updates that do occur modify only a small fraction of each file. Feature extensions are the most common modification type for AI code, whereas bug fixes predominate for human code. Human developers perform the large majority of all maintenance on both AI-generated and human-authored files.

What carries the argument

Direct comparison of maintenance frequency, change size, modification categories, and author identity between AI-generated files and human-authored files drawn from the same 100 repositories.
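
For the modification-category dimension, the paper's own labels are what carry weight; purely to illustrate what such a categorization operates on, a toy keyword heuristic over commit messages might look like the sketch below. The keyword lists are invented for this example and are not the study's classification procedure.

```python
# Toy illustration of feature-vs-bug-fix labeling from commit messages.
# The keyword lists are invented assumptions, not the study's procedure.
FEATURE_HINTS = ("add", "implement", "support", "introduce", "extend")
BUGFIX_HINTS = ("fix", "bug", "patch", "correct", "resolve")

def classify_change(commit_message: str) -> str:
    msg = commit_message.lower()
    if any(word in msg for word in BUGFIX_HINTS):
        return "bug fix"
    if any(word in msg for word in FEATURE_HINTS):
        return "feature extension"
    return "other"

print(classify_change("Fix off-by-one error in pagination"))    # bug fix
print(classify_change("Add retry support to the HTTP client"))  # feature extension
```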

If this is right

  • AI agents can supply initial code that requires relatively little subsequent modification.
  • Maintenance work on AI code shifts toward adding new capabilities rather than correcting defects.
  • Human developers continue to perform the bulk of upkeep even after agents generate the starting code.
  • The overall maintenance burden on human teams may decrease when agents handle the first implementation pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The lower maintenance rate for AI code could reflect either greater initial correctness or simply less ambitious initial functionality.
  • Similar patterns may or may not hold outside popular public repositories; smaller or internal projects could differ.
  • Training agents to produce more complete initial implementations might further reduce the need for later human feature additions.

Load-bearing premise

The AIDev dataset together with the choice of 100 popular repositories captures representative maintenance patterns without systematic bias from project popularity or from how AI code is typically used.

What would settle it

A replication on a different collection of repositories in which AI-generated files show higher commit frequency, larger average change sizes, or a higher proportion of bug-fix modifications than human-authored files in the same projects.

Figures

Figures reproduced from arXiv: 2605.06464 by Hajimu Iida, Hiroshi Iwata, Ken'ichi Yamaguchi, Shota Sawada, Tatsuya Shirai, Yutaro Kashiwa.

Figure 1: Overview of our data collection process and research questions.
Figure 2: The maintenance frequency and magnitude for agent- and human-generated files.
read the original abstract

LLM-based autonomous coding agents have reshaped software development. While these agents excel at code generation, open questions persist about the long-term maintainability of AI-generated code. This study empirically investigates the maintenance extent, human involvement, and modification types of AI-generated files versus human-authored code. Using the AIDev dataset of AI-generated pull requests and GitHub, we analyzed over 1,000 files and approximately 3,200 changes from 100 popular repositories. Our findings show that: (i) AI-generated files receive less frequent maintenance than human-authored code, with updates affecting only a small fraction of file size; (ii) the most frequent modifications to AI code are feature extensions, whereas human updates focus on bug fixes, and (iii) human developers perform the large majority of this maintenance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper conducts an empirical study on the maintenance of agent-generated code by analyzing data from the AIDev dataset, which includes AI-generated pull requests in 100 popular GitHub repositories. The authors examine over 1,000 files and approximately 3,200 changes, finding that AI-generated files experience less frequent maintenance than human-authored code, with updates affecting only a small fraction of the file size. Additionally, modifications to AI code primarily involve feature extensions, while human updates focus on bug fixes, and human developers are responsible for the majority of maintenance on AI-generated files.

Significance. Should the central findings prove robust after accounting for potential confounds such as file age, this work would make a valuable contribution to software engineering by providing empirical evidence on the maintainability of LLM-generated code in real-world settings. The use of a large-scale dataset from popular repositories lends credibility to the observations about maintenance patterns and human involvement, which could influence best practices for integrating autonomous coding agents into development workflows.

major comments (2)
  1. §4 (Results, finding i): The direct comparison of maintenance frequency and change sizes between AI-generated and human-authored files does not appear to control for or match on file age, creation date, or time since introduction into the repository. AI-generated files are added via specific PRs and are thus systematically newer, meaning lower maintenance rates could be an artifact of shorter observation windows rather than a property of the generated code. This is central to the primary claim and requires either stratification, matching, or explicit discussion of the exposure time distribution.
  2. §3 (Methodology): The description of how maintenance was measured, including definitions of change frequency, file size impact, and classification of modification types (feature extension vs. bug fix), lacks sufficient detail on the process (e.g., manual labeling criteria, automation used, inter-rater reliability if applicable). Additionally, there is no mention of statistical tests for significance or controls for confounding variables such as file complexity or project-specific factors.
minor comments (2)
  1. Abstract: The abstract would benefit from a brief mention of the key methodological approaches or limitations to better contextualize the findings for readers.
  2. Discussion or Limitations section: A more explicit limitations section addressing potential biases from selecting only popular repositories and AI usage patterns would strengthen the paper.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and have revised the paper to strengthen the analysis and methodological transparency.

read point-by-point responses
  1. Referee: [—] §4 (Results, finding i): The direct comparison of maintenance frequency and change sizes between AI-generated and human-authored files does not appear to control for or match on file age, creation date, or time since introduction into the repository. AI-generated files are added via specific PRs and are thus systematically newer, meaning lower maintenance rates could be an artifact of shorter observation windows rather than a property of the generated code. This is central to the primary claim and requires either stratification, matching, or explicit discussion of the exposure time distribution.

    Authors: We agree this is a substantive concern. AI-generated files are introduced at discrete points via PRs and therefore have shorter average observation windows than older human-authored files. In the revised manuscript we add a matched analysis that pairs each AI-generated file with human-authored files of similar age and size within the same repository, plus a stratification by quartiles of file age. The primary finding (lower maintenance frequency for AI files) remains consistent after these controls. We also include a new figure showing the distribution of observation times and report time-normalized maintenance rates. These additions appear in the updated §4 (see the first sketch after this list for the shape of such a matched, time-normalized comparison). revision: yes

  2. Referee: [—] §3 (Methodology): The description of how maintenance was measured, including definitions of change frequency, file size impact, and classification of modification types (feature extension vs. bug fix), lacks sufficient detail on the process (e.g., manual labeling criteria, automation used, inter-rater reliability if applicable). Additionally, there is no mention of statistical tests for significance or controls for confounding variables such as file complexity or project-specific factors.

    Authors: We accept that the original §3 was insufficiently detailed. The revised version expands the section with explicit operational definitions: change frequency is the count of subsequent commits touching the file; file-size impact is the percentage of lines modified relative to the file at introduction. Modification-type classification was performed manually on a random sample of 300 changes using commit messages and diffs; two authors applied a written coding scheme (feature extension = addition of new functionality; bug fix = correction of incorrect behavior) and resolved disagreements by discussion, yielding Cohen’s κ = 0.81. We now report the use of Wilcoxon rank-sum tests for univariate comparisons and multivariate negative-binomial regression models that control for file complexity (LOC), repository, and language. These details and the associated statistical results are added to §3 (see the second sketch after this list for the shape of the agreement computation). revision: yes
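
To make the matched analysis in response 1 concrete, the sketch below pairs each AI-generated file with human-authored files of similar age and size in the same repository, converts raw change counts into time-normalized rates, and compares the groups with a Wilcoxon rank-sum (Mann–Whitney U) test. The column names, matching tolerances, and toy numbers are assumptions for illustration, not the authors' data or exact procedure.

```python
# Hypothetical age-matched, time-normalized comparison; columns, tolerances,
# and values are invented for illustration, not the paper's analysis.
import pandas as pd
from scipy.stats import mannwhitneyu

# One row per file: origin, repo, age in days since introduction,
# size at introduction (LOC), and number of later commits touching it.
files = pd.DataFrame(
    {
        "origin": ["ai", "ai", "human", "human", "human"],
        "repo": ["r1", "r1", "r1", "r1", "r1"],
        "age_days": [120, 90, 130, 85, 400],
        "loc": [200, 150, 210, 160, 900],
        "later_commits": [1, 0, 3, 2, 12],
    }
)

# Time-normalized maintenance rate: commits per year of observation.
files["rate_per_year"] = files["later_commits"] / (files["age_days"] / 365.25)

def matched_humans(ai_row, pool, age_tol=0.25, loc_tol=0.5):
    """Human files in the same repo with similar age and size (tolerances are assumptions)."""
    same_repo = pool[(pool["origin"] == "human") & (pool["repo"] == ai_row["repo"])]
    close_age = (same_repo["age_days"] - ai_row["age_days"]).abs() <= age_tol * ai_row["age_days"]
    close_loc = (same_repo["loc"] - ai_row["loc"]).abs() <= loc_tol * ai_row["loc"]
    return same_repo[close_age & close_loc]

ai_rates, human_rates = [], []
for _, row in files[files["origin"] == "ai"].iterrows():
    matches = matched_humans(row, files)
    if not matches.empty:
        ai_rates.append(row["rate_per_year"])
        human_rates.append(matches["rate_per_year"].mean())

# Mann-Whitney U is the usual implementation of the Wilcoxon rank-sum test.
stat, p = mannwhitneyu(ai_rates, human_rates, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```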
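
The inter-rater agreement figure quoted in response 2 (Cohen's κ) is a standard computation; a minimal sketch with invented labels, assuming scikit-learn is available:

```python
# Minimal Cohen's kappa sketch; the labels are invented, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_a = ["feature", "bugfix", "feature", "other", "bugfix", "feature"]
rater_b = ["feature", "bugfix", "feature", "bugfix", "bugfix", "feature"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # agreement beyond chance between the two raters
```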

Circularity Check

0 steps flagged

No circularity: direct empirical analysis of repository data

full rationale

The paper conducts an empirical study by analyzing over 1,000 files and 3,200 changes from 100 GitHub repositories using the AIDev dataset. It reports observed frequencies of maintenance, modification types, and human involvement through direct counting and comparison. No mathematical derivations, parameter fitting, model predictions, or self-citation chains are present that could reduce claims to inputs by construction. The analysis is grounded in external evidence (GitHub commit histories), with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an observational empirical study relying on existing datasets and standard assumptions about data sources in software repository mining.

axioms (2)
  • domain assumption: The AIDev dataset correctly identifies AI-generated code in pull requests.
    Central to distinguishing AI vs human code.
  • domain assumption: GitHub commit history reflects all maintenance activities on the files.
    Used to measure frequency and types of changes.

pith-pipeline@v0.9.0 · 5450 in / 1277 out tokens · 45949 ms · 2026-05-12T00:56:36.949922+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor
