Recognition: 2 theorem links
To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3
The pith
AI-generated code receives less frequent maintenance than human-authored code, with humans performing most updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the AIDev dataset of AI-generated pull requests together with GitHub histories, the study tracks maintenance on over 1,000 files and roughly 3,200 changes. AI-generated files undergo less frequent maintenance than human-authored files, and the updates that do occur modify only a small fraction of each file. Feature extensions are the most common modification type for AI code, whereas bug fixes predominate for human code. Human developers perform the large majority of all maintenance on both AI-generated and human-authored files.
What carries the argument
Direct comparison of maintenance frequency, change size, modification categories, and author identity between AI-generated files and human-authored files drawn from the same 100 repositories.
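The comparison rests on simple per-file statistics over commit histories: how often a file is touched after introduction, and what fraction of it each change modifies. A minimal sketch of how such metrics could be computed; the record format, file names, and numbers below are illustrative assumptions, not the paper's actual pipeline:

```python
from collections import defaultdict

# Hypothetical follow-up commit records:
# (path, lines_changed, file_size_at_introduction, author_kind)
commits = [
    ("src/agent_util.py", 4, 200, "human"),
    ("src/agent_util.py", 6, 200, "human"),
    ("src/parser.py", 30, 150, "human"),
    ("src/parser.py", 12, 150, "ai"),
]

def maintenance_stats(records):
    """Per file: count of follow-up commits, and the mean fraction of the
    file (relative to its size at introduction) that each change touched."""
    freq = defaultdict(int)
    frac = defaultdict(list)
    for path, changed, size, _author in records:
        freq[path] += 1
        frac[path].append(changed / size)
    return {p: (freq[p], sum(fs) / len(fs)) for p, fs in frac.items()}

stats = maintenance_stats(commits)
```

With metrics in this shape, the paper's comparison reduces to contrasting the two distributions for AI-generated versus human-authored files within each repository.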
If this is right
- AI agents can supply initial code that requires relatively little subsequent modification.
- Maintenance work on AI code shifts toward adding new capabilities rather than correcting defects.
- Human developers continue to perform the bulk of upkeep even after agents generate the starting code.
- The overall maintenance burden on human teams may decrease when agents handle the first implementation pass.
Where Pith is reading between the lines
- The lower maintenance rate for AI code could reflect either greater initial correctness or simply less ambitious initial functionality.
- Similar patterns may or may not hold outside popular public repositories; smaller or internal projects could differ.
- Training agents to produce more complete initial implementations might further reduce the need for later human feature additions.
Load-bearing premise
The AIDev dataset together with the choice of 100 popular repositories captures representative maintenance patterns without systematic bias from project popularity or from how AI code is typically used.
What would settle it
A replication on a different collection of repositories in which AI-generated files show higher commit frequency, larger average change sizes, or a higher proportion of bug-fix modifications than human-authored files in the same projects.
Figures
Original abstract
LLM-based autonomous coding agents have reshaped software development. While these agents excel at code generation, open questions persist about the long-term maintainability of AI-generated code. This study empirically investigates the maintenance extent, human involvement, and modification types of AI-generated files versus human-authored code. Using the AIDev dataset of AI-generated pull requests and GitHub, we analyzed over 1,000 files and approximately 3,200 changes from 100 popular repositories. Our findings show that: (i) AI-generated files receive less frequent maintenance than human-authored code, with updates affecting only a small fraction of file size; (ii) the most frequent modifications to AI code are feature extensions, whereas human updates focus on bug fixes, and (iii) human developers perform the large majority of this maintenance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper conducts an empirical study on the maintenance of agent-generated code by analyzing data from the AIDev dataset, which includes AI-generated pull requests in 100 popular GitHub repositories. The authors examine over 1,000 files and approximately 3,200 changes, finding that AI-generated files experience less frequent maintenance than human-authored code, with updates affecting only a small fraction of the file size. Additionally, modifications to AI code primarily involve feature extensions, while human updates focus on bug fixes, and human developers are responsible for the majority of maintenance on AI-generated files.
Significance. Should the central findings prove robust after accounting for potential confounds such as file age, this work would make a valuable contribution to software engineering by providing empirical evidence on the maintainability of LLM-generated code in real-world settings. The use of a large-scale dataset from popular repositories lends credibility to the observations about maintenance patterns and human involvement, which could influence best practices for integrating autonomous coding agents into development workflows.
major comments (2)
- §4 (Results, finding i): The direct comparison of maintenance frequency and change sizes between AI-generated and human-authored files does not appear to control for or match on file age, creation date, or time since introduction into the repository. AI-generated files are added via specific PRs and are thus systematically newer, meaning lower maintenance rates could be an artifact of shorter observation windows rather than a property of the generated code. This is central to the primary claim and requires either stratification, matching, or explicit discussion of the exposure time distribution.
- §3 (Methodology): The description of how maintenance was measured, including definitions of change frequency, file size impact, and classification of modification types (feature extension vs. bug fix), lacks sufficient detail on the process (e.g., manual labeling criteria, automation used, inter-rater reliability if applicable). Additionally, there is no mention of statistical tests for significance or controls for confounding variables such as file complexity or project-specific factors.
minor comments (2)
- Abstract: The abstract would benefit from a brief mention of the key methodological approaches or limitations to better contextualize the findings for readers.
- Discussion or Limitations section: A more explicit limitations section addressing potential biases from selecting only popular repositories and AI usage patterns would strengthen the paper.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and have revised the paper to strengthen the analysis and methodological transparency.
Point-by-point responses
-
Referee: §4 (Results, finding i): The direct comparison of maintenance frequency and change sizes between AI-generated and human-authored files does not appear to control for or match on file age, creation date, or time since introduction into the repository. AI-generated files are added via specific PRs and are thus systematically newer, meaning lower maintenance rates could be an artifact of shorter observation windows rather than a property of the generated code. This is central to the primary claim and requires either stratification, matching, or explicit discussion of the exposure time distribution.
Authors: We agree this is a substantive concern. AI-generated files are introduced at discrete points via PRs and therefore have shorter average observation windows than older human-authored files. In the revised manuscript we add a matched analysis that pairs each AI-generated file with human-authored files of similar age and size within the same repository, plus a stratification by quartiles of file age. The primary finding (lower maintenance frequency for AI files) remains consistent after these controls. We also include a new figure showing the distribution of observation times and report time-normalized maintenance rates. These additions appear in the updated §4. revision: yes
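The matched analysis the authors describe can be read as nearest-neighbour matching within a repository. A sketch under assumed field names and tolerances (the 30-day and 25% thresholds here are made up for illustration, not taken from the paper):

```python
def match_controls(ai_files, human_files, age_tol_days=30, size_tol=0.25):
    """Pair each AI-generated file with the human-authored file in the same
    repository whose age is closest, subject to age and size tolerances."""
    pairs = []
    for ai in ai_files:
        candidates = [
            h for h in human_files
            if h["repo"] == ai["repo"]
            and abs(h["age_days"] - ai["age_days"]) <= age_tol_days
            and abs(h["loc"] - ai["loc"]) <= size_tol * ai["loc"]
        ]
        if candidates:
            # Break ties on age, the confound the referee flagged.
            best = min(candidates,
                       key=lambda h: abs(h["age_days"] - ai["age_days"]))
            pairs.append((ai["path"], best["path"]))
    return pairs

ai = [{"repo": "r1", "path": "a.py", "age_days": 90, "loc": 120}]
humans = [
    {"repo": "r1", "path": "h1.py", "age_days": 85, "loc": 130},
    {"repo": "r1", "path": "h2.py", "age_days": 400, "loc": 125},
]
pairs = match_controls(ai, humans)  # h2.py is excluded: too old
```

Comparing maintenance rates only within such pairs removes the exposure-window asymmetry, since each AI file is contrasted against a control of similar age.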
-
Referee: §3 (Methodology): The description of how maintenance was measured, including definitions of change frequency, file size impact, and classification of modification types (feature extension vs. bug fix), lacks sufficient detail on the process (e.g., manual labeling criteria, automation used, inter-rater reliability if applicable). Additionally, there is no mention of statistical tests for significance or controls for confounding variables such as file complexity or project-specific factors.
Authors: We accept that the original §3 was insufficiently detailed. The revised version expands the section with explicit operational definitions: change frequency is the count of subsequent commits touching the file; file-size impact is the percentage of lines modified relative to the file at introduction. Modification-type classification was performed manually on a random sample of 300 changes using commit messages and diffs; two authors applied a written coding scheme (feature extension = addition of new functionality; bug fix = correction of incorrect behavior) and resolved disagreements by discussion, yielding Cohen’s κ = 0.81. We now report the use of Wilcoxon rank-sum tests for univariate comparisons and multivariate negative-binomial regression models that control for file complexity (LOC), repository, and language. These details and the associated statistical results are added to §3. revision: yes
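The Cohen's κ the rebuttal reports is computable directly from the two annotators' label sequences. A stdlib-only sketch; the label values below are made-up examples, not the paper's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Inter-rater agreement for two annotators labeling the same items:
    observed agreement corrected for agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

rater1 = ["feat", "fix", "feat", "fix", "feat", "feat"]
rater2 = ["feat", "fix", "fix", "fix", "feat", "feat"]
kappa = cohens_kappa(rater1, rater2)
```

A κ of 0.81, as reported, falls in the range conventionally read as strong agreement; values near 0 indicate agreement no better than chance.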
Circularity Check
No circularity: direct empirical analysis of repository data
full rationale
The paper conducts an empirical study by analyzing over 1,000 files and 3,200 changes from 100 GitHub repositories using the AIDev dataset. It reports observed frequencies of maintenance, modification types, and human involvement through direct counting and comparison. No mathematical derivations, parameter fitting, model predictions, or self-citation chains are present that could reduce claims to inputs by construction. The analysis is self-contained against external benchmarks (GitHub commit histories) with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The AIDev dataset correctly identifies AI-generated code in pull requests.
- domain assumption: GitHub commit history reflects all maintenance activities on the files.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Our findings show that: (i) AI-generated files receive less frequent maintenance than human-authored code... (iii) human developers perform the large majority of this maintenance.
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
We categorized commits modifying AI-generated files... using the Conventional Commits Classification System (CCS)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Stephen R. Schach. Object-Oriented and Classical Software Engineering, volume 6. McGraw-Hill, New York, 2007.
- [3] Sayed Mehdi Hejazi Dehaghani and Nafiseh Hajrahimi. Which factors affect software projects maintenance cost more? Acta Informatica Medica, 21(1):63, 2013.
- [4] Debalina Ghosh Paul, Hong Zhu, and Ian Bayley. Investigating the smells of LLM generated code. CoRR, abs/2510.03029, 2025.
- [5] Mingwei Liu, Juntao Li, Ying Wang, Xueying Du, Zuoyu Ou, Qiuyuan Chen, Bingxu An, Zhao Wei, Yong Xu, Fangming Zou, Xin Peng, and Yiling Lou. Code copycat conundrum: Demystifying repetition in LLM-based code generation. CoRR, abs/2504.12608, 2025.
- [6] Domenico Cotroneo, Cristina Improta, and Pietro Liguori. Human-written vs. AI-generated code: A large-scale study of defects, vulnerabilities, and complexity. CoRR, abs/2508.21634, 2025.
- [7] Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. Does AI-assisted coding deliver? A difference-in-differences study of Cursor's impact on software projects. CoRR, abs/2511.04427, 2025.
- [8] Purvi Sankhe, Neeta Patil, Minakshi Ghorpade, Pratibha Prasad, and Monisha Linkesh. Empirical analysis of AI-assisted code generation tools impact on code quality, security and developer productivity. International Journal For Multidisciplinary Research, 2025.
- [9] Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. The rise of AI teammates in software engineering (SE) 3.0: How autonomous coding agents are reshaping software engineering. CoRR, abs/2507.15003, 2025.
- [10] Xingyao Wang et al. OpenHands: An open platform for AI software developers as generalist agents. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025), 2025.
- [11] Sirui Hong et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024), 2024.
- [12] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the 2022 Conference on Human Factors in Computing Systems (CHI 2022), pages 332:1–332:7, 2022.
- [13] Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMillan, and Toby Jia-Jun Li. A study on developer behaviors for validating and repairing LLM-generated code using eye tracking and IDE actions. CoRR, abs/2405.16081, 2024.
- [14] Umut Cihan, Vahid Haratian, Arda Içöz, Mert Kaan Gül, Ömercan Devran, Emircan Furkan Bayendur, Baykal Mehmet Uçar, and Eray Tüzün. Automated code review in practice. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP 2025), pages 425–436, 2025.
- [15] Joel Becker, Nate Rush, Elizabeth Barnes, and David Rein. Measuring the impact of early-2025 AI on experienced open-source developer productivity. CoRR, abs/2507.09089, 2025.
- [16] Sherlock A. Licorish, Ansh Bajpai, Chetan Arora, Fanyu Wang, and Chakkrit Tantithamthavorn. Comparing human and LLM generated code: The jury is still out! CoRR, abs/2501.16857, 2025.
- [17] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. Do users write more insecure code with AI assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS 2023), pages 2785–2799, 2023.
- [18] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. Commun. ACM, 68(2):96–105, 2025.
- [19] Henrique Gomes Nunes, Eduardo Figueiredo, Larissa Rocha Soares, Sarah Nadi, Fischer Ferreira, and Geanderson E. dos Santos. Evaluating the effectiveness of LLMs in fixing maintainability issues in real-world projects. In Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2025), pages 669–680, 2025.
- [20] Anastasiya Kravchuk-Kirilyuk, Fernanda Graciolli, and Nada Amin. The modular imperative: Rethinking LLMs for maintainable software. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages (LMPL 2025), pages 106–111, 2025.
- [21] Kan Watanabe, Tatsuya Shirai, Yutaro Kashiwa, and Hajimu Iida. What to cut? Predicting unnecessary methods in agentic code generation, 2026.
- [22] Sabrina Haque, Sarvesh Ingale, and Christoph Csallner. Do autonomous agents contribute test code? A study of tests in agentic pull requests. CoRR, abs/2601.03556, 2026.
- [23] Lukas Ottenhof, Daniel Penner, Abram Hindle, and Thibaud Lutellier. How do agents refactor: An empirical study. CoRR, abs/2601.20160, 2026.
- [24] Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Nguyen Dinh Ha Duong, and Truong Bao Tran. Early-stage prediction of review effort in AI-generated pull requests. CoRR, abs/2601.00753, 2026.
- [25] GitHub. About code owners. https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners, 2025. Accessed December 31st, 2025.
- [27] Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, and Hui Liu. A first look at conventional commits classification. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025), pages 2277–2289, 2025.