pith. machine review for the scientific record.

arxiv: 2605.07001 · v2 · submitted 2026-05-07 · 💻 cs.SE · cs.CL

Recognition: no theorem link

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:55 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords LLM agents · architectural code smells · code refactoring · software maintainability · benchmark evaluation · Python · false positive detection

The pith

LLM agents repair under half of architectural code smells while often introducing new ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark called SmellBench to test whether large language model agents can fix architectural code smells, which are design problems that span multiple files and require understanding overall software structure rather than isolated bugs. It evaluates eleven agent setups from different model families on sixty-five validated hard smells in a major Python library, using prompts tuned to each smell type and a scoring system that tracks successful repairs, correct rejection of false alarms, and overall change to the codebase. The results show agents can match expert judgments when spotting false positives but achieve only modest repair rates and tend to degrade quality when they attempt fixes. A sympathetic reader cares because architectural smells hurt long-term maintainability and because automated tools that work only on local changes leave a clear gap for real design-level work. The framework itself supplies reusable infrastructure so future agents can be measured against the same standard.

Core claim

LLM agents excel at localized code transformations yet lack the cross-module architectural understanding needed for reliable smell repair. When run on expert-validated smells from scikit-learn, the strongest configuration resolves 47.7 percent of issues and reaches expert-level agreement on false-positive identification, but more aggressive agents add up to 140 new smells, producing an inverse relationship between repair volume and net codebase quality. SmellBench supplies the orchestration, prompt templates, and multi-metric scoring that make this gap measurable and trackable.

What carries the argument

SmellBench, a task orchestration framework that applies smell-type-specific prompts, supports iterative multi-step execution, and scores agents on repair effectiveness, false-positive identification, and net codebase impact.
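
A minimal sketch of how such a three-part score could be computed from before/after detector scans. This is an editorial illustration, not the released SmellBench code: the smell-identity representation, the helper names (score_agent, cohens_kappa), and the output keys are all assumptions.

```python
# Editorial sketch of a SmellBench-style multi-metric score; not the paper's
# released implementation. Smell identifiers (e.g. (file, smell_type, span))
# are assumed to come from PyExamine-style scans before and after the agent runs.

def cohens_kappa(a, b):
    """Cohen's kappa for two binary label sequences (1 = judged false positive)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def score_agent(baseline, post_repair, targeted, agent_fp_labels, expert_fp_labels):
    """Repair effectiveness, false-positive agreement, and net codebase impact.

    `baseline` and `post_repair` are sets of detected smell identifiers;
    `targeted` is the validated set handed to the agent.
    """
    resolved = {s for s in targeted if s not in post_repair}  # flag gone on re-scan
    introduced = post_repair - baseline                       # detections new to the re-scan
    return {
        "resolution_rate": len(resolved) / len(targeted),
        "fp_kappa": cohens_kappa(agent_fp_labels, expert_fp_labels),
        "new_smells": len(introduced),
        "net_change": len(post_repair) - len(baseline),
    }
```

Separating the three outputs is what lets raw repair counts and net quality move in opposite directions, which is exactly the tradeoff the paper reports.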

If this is right

  • Automated refactoring tools will remain limited to local edits until models acquire better cross-file design reasoning.
  • Benchmarks must separately measure false-positive rejection and net quality change, because raw repair counts can mislead.
  • Progress on architectural smells will require new training signals that emphasize multi-module dependencies rather than single-file edits.
  • Software teams using current agents for smell repair will still need human oversight to avoid quality regressions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the benchmark to other languages or larger projects could reveal whether the observed gap is language-specific or general.
  • If agents improve on this task, they could eventually support continuous architectural maintenance in large codebases without manual intervention.
  • The inverse relationship between aggressiveness and quality suggests training objectives should penalize smell creation, not just reward smell removal.
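
One way to make that last suggestion concrete, as a purely hypothetical objective rather than anything the paper proposes:

```python
# Hypothetical net-quality objective (editorial illustration, not from the paper):
# reward each resolved smell but charge a penalty for every smell an edit introduces.
def net_quality_reward(smells_resolved: int, smells_introduced: int, penalty: float = 1.0) -> float:
    return smells_resolved - penalty * smells_introduced

# Under such an objective, the most aggressive agent in the paper (140 new smells)
# would score poorly even with a high raw repair count.
```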

Load-bearing premise

The sixty-five smells found by the detector in one large Python project, once filtered by expert review, form a representative sample of the architectural repair problem that agents must solve.

What would settle it

An agent configuration that consistently repaired every validated smell without introducing new ones, so that the total smell count only falls, would falsify the claimed gap in cross-module capability.

Figures

Figures reproduced from arXiv: 2605.07001 by Ion George Dinu (1), Marian Cristian Mihăescu (1), and Traian Rebedea (2) ((1) University of Craiova, Craiova, Romania; (2) University Politehnica of Bucharest, Bucharest, Romania).

Figure 1: Overview of the SmellBench agent framework.
Figure 2: Task lifecycle state machine; each task follows a structured lifecycle.
Figure 3: GEPA prompt optimization pipeline.
Original abstract

Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $\kappa = 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces SmellBench, a task orchestration framework for evaluating LLM agents on repairing architectural code smells. It evaluates 11 agent configurations across four model families on 65 hard-severity smells detected by PyExamine in scikit-learn (expert-validated), reporting a best resolution rate of 47.7%, expert agreement on false-positive identification up to κ=0.94, 63.1% false positives among detections, and an inverse relationship between repair aggressiveness and net codebase quality (most aggressive agent introduces 140 new smells). The central claim is that these results expose a gap between current LLM capabilities in localized code transformations and the cross-module architectural reasoning needed for genuine smell repair; the framework and data are released for future benchmarking.

Significance. If the evaluation methodology holds, the work is significant as the first empirical benchmark specifically targeting architectural (as opposed to localized) code smells for LLM agents. It supplies reusable infrastructure, released code/data, and concrete metrics that can track progress on an underexplored dimension of automated software engineering. The separation of repair effectiveness, false-positive identification, and net impact is a useful design choice, and the expert validation step strengthens the test set.

major comments (3)
  1. [abstract and scoring methodology (§3–4)] Scoring methodology (abstract and §3–4): The claim that the 47.7% resolution rate demonstrates a deficit in cross-module architectural reasoning depends on whether repair success is defined by independent expert judgment that the edit preserves or improves design intent across modules, or by the narrower criterion of the original PyExamine flag disappearing. The abstract states that the scoring “separately evaluates repair effectiveness” but does not specify an expert-judged architectural criterion; if effectiveness reduces to re-detection, localized edits that evade the detector without addressing architectural issues would suffice to produce the reported numbers.
  2. [§2 and §4] Test-set construction (§2 and §4): The 65 smells are described as “hard-severity” and “validated against expert judgments,” yet the paper does not report the precise exclusion rules, inter-expert agreement on smell validity, or sampling procedure used to ensure the set is representative rather than biased toward smells that PyExamine flags reliably. This directly affects the generalizability of the gap claim.
  3. [results section] Net-impact measurement (results): The finding that the most aggressive agent introduces 140 new smells is load-bearing for the aggressiveness–quality tradeoff. The paper should clarify whether “new smells” are counted only by PyExamine re-scan or also by expert review of whether the introduced issues are architecturally meaningful; without the latter, the metric risks conflating detector noise with genuine degradation.
minor comments (3)
  1. [abstract] The abstract reports “140 new smells” without stating whether this is an absolute count or normalized by lines of code or number of edits; a normalized figure would aid interpretation.
  2. [results] A table or figure presenting per-agent resolution rates, false-positive identification rates, and net smell change should be added, or clarified if already present, with explicit formulas for each metric (see the sketch after this list).
  3. [§3] The prompt templates and iteration limits for the 11 configurations are mentioned but not reproduced in sufficient detail for replication; the released repository should be cross-referenced explicitly in the text.
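
For orientation, one plausible formalization of the metrics the second minor comment asks for. This is an editorial sketch of what such formulas could look like, not definitions taken from the paper; the symbols below (S_0, S_1, T, a_i, e_i) are introduced here for illustration.

```latex
% Editorial sketch of plausible metric definitions; not taken from the paper.
% S_0: smells detected at baseline; S_1: smells detected after the agent's edits;
% T \subseteq S_0: validated target set; a_i, e_i \in \{0,1\}: agent / expert
% false-positive labels for smell i; \bar{a}, \bar{e}: their means.
\[
  \mathrm{ResolutionRate} = \frac{\lvert \{\, s \in T : s \notin S_1 \,\} \rvert}{\lvert T \rvert},
  \qquad
  \mathrm{NetChange} = \lvert S_1 \rvert - \lvert S_0 \rvert,
\]
\[
  \kappa = \frac{p_o - p_e}{1 - p_e}, \quad
  p_o = \frac{1}{\lvert T \rvert} \sum_i \mathbf{1}[a_i = e_i], \quad
  p_e = \bar{a}\,\bar{e} + (1 - \bar{a})(1 - \bar{e}).
\]
```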

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped clarify key aspects of our methodology and results presentation. We address each major comment below and have revised the manuscript to improve transparency without altering the core findings or data.

Point-by-point responses
  1. Referee: [abstract and scoring methodology (§3–4)] Scoring methodology (abstract and §3–4): The claim that the 47.7% resolution rate demonstrates a deficit in cross-module architectural reasoning depends on whether repair success is defined by independent expert judgment that the edit preserves or improves design intent across modules, or by the narrower criterion of the original PyExamine flag disappearing. The abstract states that the scoring “separately evaluates repair effectiveness” but does not specify an expert-judged architectural criterion; if effectiveness reduces to re-detection, localized edits that evade the detector without addressing architectural issues would suffice to produce the reported numbers.

    Authors: We appreciate this observation and agree that the abstract and scoring sections would benefit from greater precision. The 47.7% resolution rate is computed by re-running PyExamine after the agent's edit and checking whether the original smell flag has disappeared. This definition aligns with the automated, detector-driven nature of SmellBench and measures success against the same criterion used to identify the initial smells. We acknowledge that this metric does not incorporate a separate expert judgment on whether cross-module design intent was preserved or improved, which could allow certain localized changes to count as successes. In the revised manuscript we will update the abstract and §§3–4 to state explicitly that resolution is defined by non-re-detection, and we will add a brief discussion in the limitations section noting this distinction and its implications for interpreting the gap in architectural reasoning. revision: yes

  2. Referee: [§2 and §4] Test-set construction (§2 and §4): The 65 smells are described as “hard-severity” and “validated against expert judgments,” yet the paper does not report the precise exclusion rules, inter-expert agreement on smell validity, or sampling procedure used to ensure the set is representative rather than biased toward smells that PyExamine flags reliably. This directly affects the generalizability of the gap claim.

    Authors: We thank the referee for highlighting this omission. The 65 smells comprise the complete set of hard-severity architectural smells returned by PyExamine on scikit-learn; expert validation was performed to confirm each as a genuine architectural issue before inclusion. We will revise §§2 and 4 to report the precise exclusion rules applied during validation and the inter-expert agreement on validity, and to confirm that the set is exhaustive rather than a sampled subset. These additions will strengthen the description of the test set and support the generalizability of the reported gap. revision: yes

  3. Referee: [results section] Net-impact measurement (results): The finding that the most aggressive agent introduces 140 new smells is load-bearing for the aggressiveness–quality tradeoff. The paper should clarify whether “new smells” are counted only by PyExamine re-scan or also by expert review of whether the introduced issues are architecturally meaningful; without the latter, the metric risks conflating detector noise with genuine degradation.

    Authors: We agree that explicit clarification is warranted. The count of 140 new smells is obtained solely by re-running PyExamine on the post-repair codebase and tallying additional detections. This approach ensures methodological consistency with the original smell identification. We recognize that the metric may therefore include some detector noise. In the revised results section we will state this explicitly and add a short discussion of the limitation, noting that future benchmark extensions could incorporate expert review of introduced smells to further isolate genuine architectural degradation. revision: yes
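
A small illustration of the re-scan bookkeeping these responses describe. The matching rule (identifying a smell by file and smell type rather than exact line numbers, since edits shift lines) is an editorial assumption; the released code may match detections differently.

```python
# Editorial sketch of the re-scan bookkeeping described above. A smell is assumed
# to be identified by (file, smell_type) because line numbers shift after edits;
# the released SmellBench code may use a different matching rule.
def rescan_outcomes(before_scan, after_scan, targeted):
    """before_scan, after_scan, and targeted are sets of (file, smell_type) pairs."""
    resolved = targeted - after_scan        # original flag no longer re-detected
    persisting = targeted & after_scan      # flag still reported after the edit
    introduced = after_scan - before_scan   # detections absent from the baseline scan
    return {"resolved": len(resolved),
            "persisting": len(persisting),
            "introduced": len(introduced)}  # tallied per agent, e.g. 140 at the extreme
```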

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with expert validation and released artifacts

Full rationale

The paper is a self-contained empirical evaluation study that introduces SmellBench, applies PyExamine detection to scikit-learn, performs expert validation of the 65 smells (reporting 63.1% false positives), and measures agent performance via resolution rate, kappa agreement on false-positive identification, and net smell count impact. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The scoring methodology is described as separating repair effectiveness from false-positive identification and net impact, with explicit expert judgments and public code/data release. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claims rest on observable experimental outcomes rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central evaluation rests on the assumption that PyExamine detections plus expert review provide valid ground truth, and that the introduced SmellBench framework and scoring capture meaningful repair quality. No numeric free parameters are described in the abstract.

axioms (2)
  • domain assumption Expert human judgments constitute reliable ground truth for validating detected architectural code smells and agent repairs.
    Used to establish 63.1% false positive rate and agent agreement metrics.
  • domain assumption The 65 hard-severity smells in scikit-learn are representative of architectural smells requiring cross-module reasoning.
    Basis for the evaluation task selection.
invented entities (1)
  • SmellBench framework no independent evidence
    purpose: Task orchestration, smell-type-specific prompts, and multi-step execution scoring for architectural smell repair evaluation
    New infrastructure contributed by the paper to enable the reported experiments.

pith-pipeline@v0.9.0 · 5616 in / 1459 out tokens · 51166 ms · 2026-05-13T05:55:05.526664+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. 2026. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv preprint...

  2. [2]

    Anthropic. 2024. Introducing the Model Context Protocol. Anthropic Blog. https://www.anthropic.com/news/model-context-protocol Accessed: 2026-03-21

  3. [3]

    Anthropic. 2025. Claude Code: Agentic Coding Tool. https://github.com/anthropics/claude-code Accessed: 2026-03-15

  4. [4]

    Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering (Waikiki, Honolulu, HI, USA) (ICSE ’11). Association for Computing Machinery, New York, NY, USA, 1–10. doi:10.1145/1985793.1985795

  5. [5]

    Fraol Batole, David OBrien, Tien N. Nguyen, Robert Dyer, and Hridesh Rajan

  6. [6]

    In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE)

    An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE ’25). IEEE Press, 1320–1332. doi:10.1109/ICSE55347.2025.00100

  7. [7]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair (ICSE ’25). IEEE Press, 2188–2200. doi:10.1109/ICSE55347.2025.00157

  8. [8]

    Jonathan Cordeiro, Shayan Noei, and Ying Zou. 2024. An Empirical Study on the Code Refactoring Capability of Large Language Models. arXiv preprint arXiv:2411.02320 (2024). https://doi.org/10.48550/arXiv.2411.02320

  9. [9]

    Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, and Roshanak Zilouchian Moghaddam. 2025. RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code. arXiv preprint arXiv:2503.07832v1 (2025). https://doi.org/10.48550/arXiv.2503.07832

  10. [10]

    Google. 2025. Gemini CLI: AI Agent for the Command Line. https://github.com/google-gemini/gemini-cli Accessed: 2026-03-15

  11. [11]

    Zhaoqiang Guo, Tingting Tan, Shiran Liu, Xutong Liu, Wei Lai, Yibiao Yang, Yanhui Li, Lin Chen, Wei Dong, and Yuming Zhou. 2023. Mitigating False Positive Static Analysis Warnings: Progress, Challenges, and Opportunities. IEEE Transactions on Software Engineering 49, 12 (2023), 5154–5188. doi:10.1109/TSE.2023.3329667

  12. [12]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66

  13. [13]

    Rodi Jolak, Simon Karlsson, and Felix Dobslaw. 2025. An empirical investigation of the impact of architectural smells on software maintainability. Journal of Systems and Software 225 (2025), 112382. doi:10.1016/j.jss.2025.112382

  14. [14]

    Mistral AI. 2025. Mistral Vibe: Agentic Coding Assistant. https://github.com/mistralai/mistral-vibe Accessed: 2026-03-15

  15. [15]

    Haris Mumtaz, Paramvir Singh, and Kelly Blincoe. 2021. A systematic mapping study on architectural smells detection. Journal of Systems and Software 173 (2021), 110885. doi:10.1016/j.jss.2020.110885

  16. [16]

    Samal Nursapa, Anastassiya Samuilova, Alessio Bucaioni, and Phuong Nguyen

  17. [17]

    ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells. 421–427. doi:10.1109/ESEM64174.2025.00019

  18. [18]

    OpenAI. 2025. Codex CLI: Open-Source Coding Agent. https://github.com/openai/codex Accessed: 2026-03-15

  19. [19]

    Khouloud Oueslati, Maxime Lamothe, and Foutse Khomh. 2026. RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring. arXiv preprint arXiv:2511.03153v2 (2026). https://doi.org/10.48550/arXiv.2511.03153

  20. [20]

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (Nov. 2...

  21. [21]

    Karthik Shivashankar and Antonio Martini. 2025. PyExamine: A Comprehensive, Un-Opinionated Smell Detection Tool for Python. (2025), 763–774. doi:10.1109/MSR66628.2025.00114

  22. [22]

    Claudio Tessa, Matteo Bochicchio, and Francesca Arcelli Fontana. 2025. Exploring Architectural Smells Detection Through LLMs. 90–98. doi:10.1007/978-3-032-02138-0_6

  23. [23]

    Will Tracz. 2015. Refactoring for Software Design Smells: Managing Technical Debt by Girish Suryanarayana, Ganesh Samarthyam, and Tushar Sharma. SIGSOFT Softw. Eng. Notes 40, 6 (Nov. 2015), 36. doi:10.1145/2830719.2830739

  24. [24]

    Alejandro Velasco, Daniel Rodriguez-Cardenas, Dipin Khati, David N. Palacio, Luftar Rahman Alif, and Denys Poshyvanyk. 2026. A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code. arXiv preprint arXiv:2511.15817v5 (2026). https://doi.org/10.48550/arXiv.2511.15817

  25. [25]

    Di Wu, Fangwen Mu, Lin Shi, Zhaoqiang Guo, Kui Liu, Weiguang Zhuang, Yuqi Zhong, and Li Zhang. 2024. iSMELL: Assembling LLMs with Expert Toolsets for Code Smell Detection and Refactoring. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (Sacramento, CA, USA) (ASE ’24). Association for Computing Machinery, New Yor...

  26. [26]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-Based Software Engineering Agents. Proc. ACM Softw. Eng. 2, FSE, Article FSE037 (June 2025), 24 pages. doi:10.1145/3715754

  27. [27]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-Trained Language Models (ICSE ’23). IEEE Press, 1482–1494. doi:10.1109/ICSE48619.2023.00129

  28. [28]

    Yisen Xu, Feng Lin, Jinqiu Yang, Tse-Hsun (Peter) Chen, and Nikolaos Tsantalis

  29. [29]

    MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration. arXiv preprint arXiv:2503.14340v2 (2025). https://doi.org/10.48550/arXiv.2503.14340

  30. [30]

    Zhipeng Xue, Xiaoting Zhang, Zhipeng Gao, Xing Hu, Shan Gao, Xin Xia, and Shanping Li. 2026. Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset. ACM Trans. Softw. Eng. Methodol. (Feb. 2026). doi:10.1145/3793252

  31. [31]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY,...