arxiv: 2605.20049 · v1 · pith:OVXTJS2Lnew · submitted 2026-05-19 · 💻 cs.SE · cs.AI

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

Pith reviewed 2026-05-20 03:32 UTC · model grok-4.3

keywords: code cleanliness, coding agents, minimal pairs, static analysis, cognitive complexity, agent evaluation, maintainability

The pith

Code cleanliness leaves coding agent success rates unchanged but cuts token consumption and file revisits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether cleaner code helps autonomous coding agents complete tasks more successfully. It does so by building minimal pairs of code repositories that are functionally identical but differ in their level of static analysis violations and cognitive complexity. The study runs the same tasks on both clean and messy versions using the Claude Code agent across 660 trials. While the agents succeed at the same rate on both, they use fewer tokens and revisit files less frequently when working with cleaner code. This indicates that code quality still shapes how efficiently AI agents operate even if it does not change their ultimate performance.

Core claim

The paper reports that code cleanliness does not change the pass rate of coding agents on hidden tests. In experiments involving 33 tasks across six minimal-pair repositories, the agents achieved equivalent success whether the code had high or low cleanliness scores. At the same time, working on cleaner code resulted in 7 to 8% fewer tokens being used and a 34% reduction in file revisitations. The authors argue that principles of code maintainability remain relevant for managing the computational and navigational demands placed on coding agents.
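The two efficiency figures are relative reductions, and a small sketch makes the arithmetic concrete. The page does not say how the paper defines a file revisitation, so the definition below (any access to a file already accessed earlier in the same trial) is an assumption, and the function names and the sample trace are hypothetical.

```python
from collections import Counter


def count_revisits(accessed_paths):
    """Count accesses to files already accessed earlier in the same trial.

    Assumed definition: every access after the first one to a given path
    counts as one revisit.
    """
    counts = Counter(accessed_paths)
    return sum(n - 1 for n in counts.values())


def relative_reduction(messy_mean, clean_mean):
    """Fractional reduction of the clean condition relative to the messy one."""
    if messy_mean <= 0:
        raise ValueError("messy_mean must be positive")
    return (messy_mean - clean_mean) / messy_mean


# Hypothetical trace of file accesses in one trial (not data from the paper).
trace = ["app.py", "db.py", "app.py", "utils.py", "app.py", "db.py"]
revisits = count_revisits(trace)

# Hypothetical per-condition means chosen to illustrate a 34% reduction.
reduction = relative_reduction(messy_mean=100.0, clean_mean=66.0)
```

Here `app.py` is revisited twice and `db.py` once, so `revisits` is 3. A drop from a mean of 100 to a mean of 66 corresponds to a reduction of 0.34.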

What carries the argument

The argument rests on the minimal-pair evaluation protocol: repositories are constructed to be identical in architecture, dependencies, and behavior but to differ in static-analysis violations and cognitive complexity. Pairs are created in both directions, either by degrading clean code or by cleaning messy code.

Load-bearing premise

The minimal pairs isolate cleanliness effects without other unmeasured differences in code structure, naming conventions, or dependencies that could alter agent behavior.

What would settle it

Finding a set of minimal-pair repositories where agents show different success rates on clean versus messy versions, or where the efficiency gains do not appear, would challenge the central claim.
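One way to check for a pass-rate difference on paired outcomes is an exact McNemar test, sketched below. This is an illustration and not the analysis used in the paper: it assumes trials can be paired (for example by task and trial index), and the function name and sample outcomes are hypothetical.

```python
from math import comb


def mcnemar_exact(clean_pass, messy_pass):
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Only discordant pairs (one condition passes, the other fails) carry
    information; under the null they split 50/50.
    """
    if len(clean_pass) != len(messy_pass):
        raise ValueError("outcome lists must have the same length")
    only_clean = sum(1 for c, m in zip(clean_pass, messy_pass) if c and not m)
    only_messy = sum(1 for c, m in zip(clean_pass, messy_pass) if m and not c)
    n = only_clean + only_messy
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(only_clean, only_messy) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical outcomes: five pairs pass only on clean code, one passes on both.
p_skewed = mcnemar_exact([True] * 6, [False] * 5 + [True])

# Hypothetical outcomes: discordant pairs split evenly.
p_balanced = mcnemar_exact([True, False], [False, True])
```

A large p-value only fails to show a difference; supporting a claim of equal pass rates would call for an equivalence-style analysis with a stated margin.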

Figures

Figures reproduced from arXiv: 2605.20049 by Olivier Schmitt (SonarSource), Priyansh Trivedi.

Figure 1: An example task in the benchmark, drawn from the …
Figure 2: Per-trial distributions of input tokens (left) and output tokens (right) across the 27 tasks.
Original abstract

As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or "cleanliness", of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that code cleanliness (static-analysis violations and cognitive complexity) does not affect task pass rates for the Claude Code agent but does reduce operational footprint: across 660 trials on 33 tasks from six minimal-pair repositories (constructed bidirectionally via agent pipelines), cleaner code yields 7-8% fewer tokens and 34% fewer file revisitations while holding architecture, dependencies, and external behavior fixed. Hidden tests at the public surface are used to evaluate success.

Significance. If the minimal-pair isolation holds, the result provides empirical evidence that traditional maintainability metrics remain relevant for AI coding agents by shaping navigation efficiency and token cost without altering completion rates. The controlled design with bidirectional construction, 660 trials, and hidden tests is a strength that allows falsifiable claims about agent behavior rather than post-hoc correlations.

major comments (1)
  1. [Minimal-pair construction] Minimal-pair construction section: the bidirectional agent-pipeline method (degrading clean repos or cleaning messy ones) risks introducing unmeasured differences in identifier names, module organization, comment density, or call-graph shape that are not captured by the reported static-analysis and cognitive-complexity metrics. These incidental changes could explain the 7-8% token reduction and 34% drop in file revisitations rather than cleanliness per se. The unchanged pass rate does not rule out the confound if hidden tests are insensitive to navigation differences. Please report quantitative checks (e.g., diff statistics on naming entropy, dependency graphs, or structural similarity) that confirm the pairs differ only on the intended dimensions.
minor comments (2)
  1. [Results] Table or results section: the 7-8% token reduction and 34% revisitation drop should be accompanied by per-pair breakdowns and confidence intervals to show consistency across the six repositories rather than aggregate only.
  2. [Methods] Methods: clarify whether task selection and pair construction were pre-registered or whether post-hoc adjustments were made after initial agent runs.
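The per-pair confidence intervals requested in the first minor comment could be obtained with a percentile bootstrap, sketched below. This is one possible approach and not the paper's method; the function name, the resampling settings, and the token counts are hypothetical and not taken from the study.

```python
import random
from statistics import mean


def bootstrap_reduction_ci(messy_tokens, clean_tokens,
                           n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the relative token reduction of one pair.

    Trials within each condition are resampled with replacement, and the
    reduction 1 - mean(clean) / mean(messy) is recomputed each time.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        messy = rng.choices(messy_tokens, k=len(messy_tokens))
        clean = rng.choices(clean_tokens, k=len(clean_tokens))
        estimates.append(1.0 - mean(clean) / mean(messy))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = 1.0 - mean(clean_tokens) / mean(messy_tokens)
    return point, lo, hi


# Hypothetical per-trial token counts for a single repository pair.
messy_tokens = [1000, 1100, 950, 1050, 1020]
clean_tokens = [920, 1000, 880, 970, 940]
point, lo, hi = bootstrap_reduction_ci(messy_tokens, clean_tokens)
```

Running this once per repository pair would show whether the aggregate reduction is consistent across all six pairs or driven by a few of them.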

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive comments on our minimal-pair design. We address the major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Minimal-pair construction section: the bidirectional agent-pipeline method (degrading clean repos or cleaning messy ones) risks introducing unmeasured differences in identifier names, module organization, comment density, or call-graph shape that are not captured by the reported static-analysis and cognitive-complexity metrics. These incidental changes could explain the 7-8% token reduction and 34% drop in file revisitations rather than cleanliness per se. The unchanged pass rate does not rule out the confound if hidden tests are insensitive to navigation differences. Please report quantitative checks (e.g., diff statistics on naming entropy, dependency graphs, or structural similarity) that confirm the pairs differ only on the intended dimensions.

    Authors: We agree that this is a valid concern and that the bidirectional construction, while controlling for architecture, dependencies, and external behavior, could introduce incidental differences in naming, organization, or structure not captured by our primary metrics. The unchanged pass rates alone do not fully rule out such confounds for the efficiency measures. In the revised manuscript we will add a dedicated subsection with quantitative checks: (1) identifier naming statistics (average length and Shannon entropy of names), (2) module organization metrics (file count, directory depth, and number of modules), (3) comment density, and (4) structural similarity via AST edit distance and call-graph comparison metrics between each clean/messy pair. These results will be reported in a new table to demonstrate that systematic differences align with the intended cleanliness dimensions. revision: yes
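Two of the proposed checks, identifier naming statistics and comment density, can be sketched as follows. The sketch assumes Python sources and token-level entropy over identifier occurrences; the page does not specify the repositories' language or the exact entropy definition, and the function names and sample snippets are hypothetical.

```python
import ast
import io
import math
import tokenize
from collections import Counter


def collect_identifiers(source):
    """Collect variable, argument, function, and class names from Python source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.append(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
    return names


def mean_length(names):
    """Average identifier length in characters."""
    return sum(len(n) for n in names) / len(names) if names else 0.0


def shannon_entropy(names):
    """Shannon entropy in bits of the identifier-occurrence distribution."""
    counts = Counter(names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def comment_density(source):
    """Lines containing a comment divided by non-blank lines."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    comment_lines = {tok.start[0] for tok in tokens if tok.type == tokenize.COMMENT}
    code_lines = [line for line in source.splitlines() if line.strip()]
    return len(comment_lines) / len(code_lines) if code_lines else 0.0


# Hypothetical clean/messy snippets with identical behavior.
clean_src = "def total_price(items):\n    # sum item prices\n    return sum(items)\n"
messy_src = "def f(a):\n    return sum(a)\n"

clean_names = collect_identifiers(clean_src)
messy_names = collect_identifiers(messy_src)
```

In this toy pair the mean identifier length falls from 6.0 to 1.5 while the token-level entropy stays at 1.5 bits, which suggests that reporting several naming statistics side by side is more informative than any single one.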

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on constructed pairs

Full rationale

The paper reports results from an experimental protocol involving 660 trials on six minimal-pair repositories. No mathematical derivations, equations, fitted parameters, or first-principles predictions are present. The central claims (unchanged pass rate, 7-8% token reduction, 34% fewer revisitations) are direct observational outcomes from running Claude Code on the pairs. The minimal-pair construction is described as a methodological step to isolate cleanliness, not as a self-defining or self-citing reduction. No load-bearing self-citations or ansatzes appear in the provided text. This is a standard empirical study whose findings stand or fall on the experimental controls rather than any definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the assumption that the constructed minimal pairs differ only in the targeted cleanliness attributes and that the chosen tasks are representative of real agent usage. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Minimal-pair repositories can be constructed that match on architecture, dependencies, and external behavior while differing only on static-analysis violations and cognitive complexity.
    This premise is required for the isolation claim and is stated in the abstract description of the evaluation protocol.

pith-pipeline@v0.9.0 · 5750 in / 1264 out tokens · 29257 ms · 2026-05-20T03:32:03.504827+00:00 · methodology

