pith. machine review for the scientific record.

arxiv: 2605.02741 · v1 · submitted 2026-05-04 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links · Lean Theorem

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-generated code · code smells · technical debt · software maintainability · LLM · agent-driven development · architectural decay · Volume-Quality Inverse Law

The pith

AI-generated code shows a Volume-Quality Inverse Law where larger volume nearly perfectly predicts structural degradation, and this holds even when the code is correct or prompted in detail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits technical debt in software produced by large language models and agents, moving beyond functional correctness to examine long-term maintainability. Across single-file tasks and complex multi-file systems, it identifies a Reasoning-Complexity Trade-off in which more capable models produce increasingly bloated and coupled code. This pattern leads the authors to define a Volume-Quality Inverse Law in which code volume serves as a near-perfect predictor of architectural decay. The decay persists regardless of whether the generated code passes functional tests or receives detailed prompts. The work therefore reframes the core challenge of AI-driven development from code generation to the explicit management of architectural complexity.

Core claim

As models become more capable, they generate increasingly bloated and coupled code. This architectural decay is so pronounced that the authors establish a Volume-Quality Inverse Law, where code volume is a near-perfect predictor of structural degradation. Neither functional correctness nor detailed prompting mitigates this decay.

What carries the argument

The Volume-Quality Inverse Law, which treats generated code volume as a near-perfect predictor of coupling and smell-based degradation across both simple algorithmic tasks and agent-driven multi-file systems.
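Concretely, the law amounts to a rank-correlation claim between volume and degradation metrics. A minimal sketch of how such a check could be run, with a hand-rolled Spearman coefficient; all data values here are invented for illustration, not taken from the paper:

```python
# Spearman rank correlation between code volume (LOC) and smell counts.
# The data values are illustrative, not the paper's measurements.

def ranks(xs):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

loc    = [120, 340, 560, 810, 1100, 1500]   # lines of code per sample
smells = [3,   9,   14,  22,  30,   41]     # detected smell count

rho = spearman(loc, smells)
print(round(rho, 2))  # monotone data → 1.0
```

With perfectly monotone data the coefficient sits at its ceiling of 1; a near-perfect predictor in the paper's sense would correspond to values close to that ceiling on real samples.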

If this is right

  • AI-generated software carries a distinct machine signature of defects rather than simply replicating human errors.
  • Current prompt-driven workflows must shift focus from correctness to architectural complexity management.
  • Future agents will require built-in mechanisms for architectural foresight to produce maintainable output.
  • Functional correctness alone is insufficient to guarantee long-term viability of LLM-produced systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Maintainability benchmarks for code generation should incorporate volume and coupling metrics alongside test-pass rates.
  • Developers using these tools may need systematic post-generation refactoring pipelines to counteract the observed decay.
  • Training objectives for future models could include explicit penalties for excessive coupling or smell density.
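The benchmark and refactoring-pipeline suggestions above could be combined into a single acceptance gate; the `Sample` fields and thresholds below are invented for illustration, not drawn from the paper:

```python
# Minimal maintainability gate for generated code: pass/fail on both
# functional tests and structural metrics. All field names and
# thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class Sample:
    name: str
    tests_passed: bool
    loc: int             # generated lines of code
    smells: int          # detected code smells
    avg_coupling: float  # mean efferent coupling per module

def gate(s: Sample, max_smell_density=0.02, max_coupling=8.0) -> bool:
    """Accept only if functionally correct AND structurally clean."""
    if not s.tests_passed:
        return False
    if s.loc and s.smells / s.loc > max_smell_density:
        return False
    return s.avg_coupling <= max_coupling

ok  = Sample("small-clean", True, 200, 2, 4.5)
bad = Sample("big-bloated", True, 2000, 90, 11.0)  # correct but decayed
print(gate(ok), gate(bad))  # True False
```

The second sample is the paper's central failure mode: it passes its tests but fails on smell density and coupling, so a correctness-only benchmark would accept it.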

Load-bearing premise

The chosen code smell metrics and coupling measures accurately reflect long-term maintainability problems rather than surface patterns limited to the generated samples.

What would settle it

A controlled experiment in which agents receive explicit architectural constraints, produce high-volume code that nevertheless shows low coupling and few smells, or in which detailed prompting demonstrably reduces degradation metrics.
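The prompting arm of such an experiment reduces to a two-sample comparison of degradation metrics across prompt-detail conditions. A minimal sketch using a hand-rolled Mann-Whitney U test with a normal approximation; the smell-density values are invented, not the paper's:

```python
# Rank-based check of whether detailed prompting shifts smell density.
# The density values below are illustrative, not the paper's data.
import math

def mann_whitney(a, b):
    """U statistic and a two-sided normal-approximation p-value."""
    # U = number of (x in a, y in b) pairs with x > y (ties count 0.5)
    u = sum(0.5 if x == y else (1.0 if x > y else 0.0)
            for x in a for y in b)
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u, p

terse    = [0.041, 0.038, 0.045, 0.050, 0.039]  # smells per LOC
detailed = [0.040, 0.044, 0.037, 0.049, 0.042]

u, p = mann_whitney(terse, detailed)
print("mitigation detected" if p < 0.05 else "no detectable shift")
```

A non-mitigation finding like the paper's corresponds to a null result here; settling the claim would require reporting such tests with real samples and adequate power.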

Figures

Figures reproduced from arXiv: 2605.02741 by Nikolaos Tsantalis, Peter C. Rigby, Yuecai Zhu.

Figure 1. Distribution of Code Smell Counts. This box plot illustrates the distribution of counts for the most prevalent code smells. (view at source ↗)
read the original abstract

The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long-term maintainability. This paper presents a systematic audit of technical debt in AI-generated software, revealing that AI does not eliminate flaws but rather introduces a distinct machine signature of defects. Our multi-scale analysis, spanning single-file algorithmic tasks and complex, agent-generated systems, identifies a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code. This architectural decay is so pronounced that we establish a Volume-Quality Inverse Law, where code volume is a near-perfect predictor of structural degradation. Crucially, we demonstrate that neither functional correctness nor detailed prompting mitigates this decay. These findings challenge the current paradigm of prompt-driven generation, reframing the central problem of AI-based software engineering from one of code generation to one of architectural complexity management. We conclude that future progress depends on equipping agents with explicit architectural foresight to ensure the software they build is not just functional, but also maintainable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a systematic audit of technical debt in AI-generated software from LLMs and agents across single-file algorithmic tasks and complex agent-driven systems. It identifies a Reasoning-Complexity Trade-off in which more capable models produce increasingly bloated and coupled code, leading to the Volume-Quality Inverse Law (code volume as a near-perfect predictor of structural degradation). The work claims this decay is not mitigated by functional correctness or detailed prompting, reframing AI software engineering around architectural complexity management rather than functional generation alone.

Significance. If the central claims hold after validation, the paper would be significant for AI-assisted software engineering by providing empirical evidence of inherent maintainability limitations in current LLM and agent approaches. The multi-scale analysis and introduction of a predictive law could serve as a useful heuristic for practitioners and tool designers, shifting focus toward architectural foresight. However, the absence of external benchmarks or pre-registered metrics limits its immediate applicability.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Results): The Volume-Quality Inverse Law is described as establishing code volume as a 'near perfect predictor' of degradation, but the manuscript supplies no correlation coefficients, regression details, sample sizes, or controls for task complexity. This makes it impossible to distinguish a robust law from a post-hoc fit to the observed data, which is load-bearing for the central claim.
  2. [§3] §3 (Methodology): The paper relies on code smell and coupling metrics to quantify architectural decay without providing justification for their selection, inter-rater reliability, or external validation against real-world maintainability outcomes (e.g., bug rates or refactoring effort in production systems). If these metrics primarily flag LLM generation artifacts such as verbosity rather than defects that increase long-term costs, the Reasoning-Complexity Trade-off and non-mitigation claims do not follow.
  3. [§4.3] §4.3 (Prompting and Correctness Experiments): The assertion that 'neither functional correctness nor detailed prompting mitigates this decay' requires explicit description of how correctness was measured (e.g., test suites, pass rates), how prompting variations were controlled, and the statistical tests used to support the null effect. Without these, the claim that the trade-off is inherent cannot be assessed.
minor comments (2)
  1. [§2] §2 (Related Work): Expand discussion of prior empirical studies on code quality in LLM outputs to better situate the Volume-Quality Inverse Law against existing findings on maintainability.
  2. [Tables 1-2] Tables 1-2: Include raw data distributions or confidence intervals alongside reported averages for volume and smell counts to improve interpretability of the multi-scale results.
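The request in minor comment 2 is mechanical to satisfy; a percentile-bootstrap sketch for a confidence interval on a mean smell count, on invented data:

```python
# Percentile-bootstrap confidence interval for a mean smell count.
# The counts are illustrative, not taken from the paper's tables.
import random

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=10_000, alpha=0.05, seed=0):
    """Resample with replacement and take percentiles of the statistic."""
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = reps[int(alpha / 2 * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

smell_counts = [3, 9, 14, 22, 30, 41, 7, 12, 18, 25]
lo, hi = bootstrap_ci(smell_counts)
mean = sum(smell_counts) / len(smell_counts)
print(f"mean={mean:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

Reporting intervals like these alongside the tables' averages would let readers judge how much of the multi-scale spread is sampling noise.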

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important areas for strengthening the empirical support and methodological transparency of our claims. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Results): The Volume-Quality Inverse Law is described as establishing code volume as a 'near perfect predictor' of degradation, but the manuscript supplies no correlation coefficients, regression details, sample sizes, or controls for task complexity. This makes it impossible to distinguish a robust law from a post-hoc fit to the observed data, which is load-bearing for the central claim.

    Authors: We agree that explicit statistical details are necessary to substantiate the Volume-Quality Inverse Law and prevent it from appearing as a post-hoc observation. In the revised version, we will expand §4 with the correlation coefficients, regression analyses, exact sample sizes for each scale of analysis, and controls for task complexity through stratified reporting. These additions will provide the quantitative foundation for the law and allow readers to assess its robustness directly from the data. revision: yes

  2. Referee: [§3] §3 (Methodology): The paper relies on code smell and coupling metrics to quantify architectural decay without providing justification for their selection, inter-rater reliability, or external validation against real-world maintainability outcomes (e.g., bug rates or refactoring effort in production systems). If these metrics primarily flag LLM generation artifacts such as verbosity rather than defects that increase long-term costs, the Reasoning-Complexity Trade-off and non-mitigation claims do not follow.

    Authors: The metrics were selected because they are standard in software engineering for measuring structural degradation and maintainability (e.g., coupling and complexity indicators from established suites). We will add an explicit justification subsection in §3 with supporting citations. As the metrics are fully automated via static analysis, inter-rater reliability is not applicable and will be clarified. We acknowledge the absence of direct external validation against production bug rates or refactoring effort in this study; we will add a limitations paragraph discussing this gap and referencing prior literature that links these metrics to maintainability costs, while noting it as an area for future validation. revision: partial

  3. Referee: [§4.3] §4.3 (Prompting and Correctness Experiments): The assertion that 'neither functional correctness nor detailed prompting mitigates this decay' requires explicit description of how correctness was measured (e.g., test suites, pass rates), how prompting variations were controlled, and the statistical tests used to support the null effect. Without these, the claim that the trade-off is inherent cannot be assessed.

    Authors: We will revise §4.3 to provide the requested details: functional correctness was assessed via dedicated test suites with reported pass rates per model and task; prompting variations were controlled through a fixed set of templates with graduated detail levels, which will be included as examples; and we will report the statistical tests (including any null-effect analyses) used to evaluate mitigation. These expansions will make the non-mitigation finding fully evaluable. revision: yes
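Static-analysis metrics of the kind the rebuttal describes in response 2 are deterministic functions of source text, which is why inter-rater reliability does not apply. As a toy illustration, a crude import-counting proxy for efferent coupling (invented here, not the paper's metric suite):

```python
# Crude efferent-coupling proxy via static analysis: count the distinct
# top-level modules a source file imports. A toy stand-in, not the
# paper's actual metric suite.
import ast

def efferent_coupling(source: str) -> int:
    tree = ast.parse(source)
    deps = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            deps.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module.split(".")[0])
    return len(deps)

code = """
import os
import json
from collections import defaultdict
from os import path
"""
print(efferent_coupling(code))  # os, json, collections → 3
```

Because the measurement is a pure function of the code, two runs on the same sample always agree; the open question the referee raises is whether such counts track real maintenance cost.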

Circularity Check

0 steps flagged

No significant circularity in empirical observations or law naming

full rationale

The paper conducts an empirical audit of AI-generated code across scales, applying code smell and coupling metrics to identify patterns of bloat and coupling. The Volume-Quality Inverse Law is explicitly framed as a result established from this analysis (volume as near-perfect predictor of degradation), with additional checks showing that functional correctness and prompting do not mitigate the observed decay. No equations, self-definitions, fitted parameters renamed as predictions, or self-citation chains reduce any claim to its inputs by construction. The metrics are applied to generated samples as described, and the findings are presented as observations rather than derivations that presuppose the law. This is a standard empirical study structure with independent content in the multi-scale comparison and mitigation checks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that standard software engineering metrics for volume, coupling, and smells validly indicate maintainability decay, plus the representativeness of the chosen single-file and agent-generated samples.

axioms (1)
  • domain assumption Standard code smell and coupling metrics accurately reflect long-term maintainability and technical debt.
    Invoked to support the Volume-Quality Inverse Law and the claim that decay is not mitigated by correctness or prompting.

pith-pipeline@v0.9.0 · 5486 in / 1272 out tokens · 68447 ms · 2026-05-08T17:59:40.256144+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "TLoC exhibited a near-perfect positive correlation with architectural smells (rho = 0.94, p < 0.001). ... we establish a 'Volume-Quality Inverse Law'"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

Anonymous. 2026. Replication Package. https://doi.org/10.5281/zenodo.19245562

  2. [2]

    Fraol Batole, David OBrien, Tien Nguyen, Robert Dyer, and Hridesh Rajan

  3. [3]

    An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 637–637

  4. [4]

    G Ann Campbell and Patroklos P Papapetrou. 2013. SonarQube in Action. Manning Publications Co

  5. [5]

    Mark Chen. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  6. [6]

    Jonathan Cordeiro, Shayan Noei, and Ying Zou. 2024. An empirical study on the code refactoring capability of large language models. arXiv preprint arXiv:2411.02320 (2024)

  7. [7]

    Albert Danial. 2021. cloc: v1.92. doi:10.5281/zenodo.5760077

  8. [8]

    Den Delimarsky. 2025. Diving Into Spec-Driven Development With GitHub Spec Kit. developer.microsoft.com (September 2025). https://developer.microsoft.com/blog/spec-driven-development-spec-kit

  9. [9]

    Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, and Ge Li. 2025. A Survey on Code Generation with LLM-based Agents. arXiv preprint arXiv:2508.00083 (2025)

  10. [10]

    Junda He, Christoph Treude, and David Lo. 2025. LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30

  11. [11]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al.

  12. [12]

    MetaGPT: Meta programming for a multi-agent collaborative framework. International Conference on Learning Representations, ICLR

  13. [13]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Sutherland Robson, Pushmeet...

  14. [14]

    Bo Liu, Yanjie Jiang, Yuxia Zhang, Nan Niu, Guangjie Li, and Hui Liu. 2025. Exploring the potential of general purpose LLMs in automated software refactoring: an empirical study. Automated Software Engineering 32, 1 (2025), 26

  15. [15]

    Ran Mo, Yuanfang Cai, Rick Kazman, Lu Xiao, and Qiong Feng. 2019. Architecture anti-patterns: Automatically detectable violations of design principles. IEEE Transactions on Software Engineering 47, 5 (2019), 1008–1028

  16. [16]

    Arpita Naik and Rithika Rajan Shylaja. 2024. Enhancing Code Refactoring in Python: Leveraging Large Language Models. odr.chalmers.se (2024). https://odr.chalmers.se/server/api/core/bitstreams/85d14e4b-13dc-40c7-bcea-1fe230875b45/content

  17. [17]

    Alexander Puma Pucho, Alexandre Mello Ferreira, Elder José Reioli Cirilo, and Bruno BP Cafeo. 2025. Refactoring Python Code with LLM-Based Multi-Agent Systems: An Empirical Study in ML Software Projects. In Simpósio Brasileiro de Engenharia de Software (SBES). SBC, 678–684

  18. [18]

    Vasanth Rajendran, Dinesh Besiahgari, Sachin C Patil, Manjunath Chandrashekaraiah, and Vishnu Challagulla. 2025. A Multi-Agent LLM Environment for Software Design and Refactoring: A Conceptual Framework. In SoutheastCon

  19. [19]

    Alfred Santa Molison, Marcia Moraes, Glaucia Melo, Fabio Santos, and Wesley KG Assunçao. 2025. Is LLM-Generated Code More Maintainable & Reliable than Human-Written Code? arXiv e-prints (2025), arXiv–2508

  20. [20]

    Karthik Shivashankar and Antonio Martini. 2025. PyExamine: A Comprehensive, Un-Opinionated Smell Detection Tool for Python. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 763–774

  21. [21]

    Shahbaz Siddeeq, Muhammad Waseem, Zeeshan Rasheed, Md Mahade Hasan, Jussi Rasku, Mika Saari, Henri Terho, Kalle Makela, Kai-Kristian Kemell, and Pekka Abrahamsson. 2025. LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code. arXiv preprint arXiv:2506.19481 (2025)

  22. [22]

    Nalin Wadhwa, Jui Pradhan, Atharv Sonwane, Surya Prakash Sahu, Nagarajan Natarajan, Aditya Kanade, Suresh Parthasarathy, and Sriram Rajamani. 2024. Core: Resolving code quality issues using LLMs. Proceedings of the ACM on Software Engineering 1, FSE (2024), 789–811

  23. [23]

    Yisen Xu, Feng Lin, Jinqiu Yang, Nikolaos Tsantalis, et al. 2025. Mantra: Enhancing automated method-level refactoring with contextual RAG and multi-agent LLM collaboration. arXiv preprint arXiv:2503.14340 (2025)

  24. [24]

    Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2023. Multilingual code co-evolution using large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 695–707

  25. [25]

    Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023)