Comparing ML-Specific and General Python Code Smells Across Project Characteristics

Betül Cimendag; Halimeh Agh; Stefan Wagner

arxiv: 2606.01882 · v1 · pith:TFE542HRnew · submitted 2026-06-01 · 💻 cs.SE

Pith reviewed 2026-06-28 13:46 UTC · model grok-4.3

classification 💻 cs.SE

keywords machine learning · code smells · software quality · empirical study · github projects · project characteristics · technical debt

The pith

ML code smells are 41-94 times less frequent than general Python smells in open-source ML projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines links between six project characteristics and both ML-specific and general Python code smells across 279 GitHub repositories. It shows ML smells occur far less often than general ones and correlate with commit frequency and domain, while general Python smells show no ties to any characteristic studied. The results indicate that one-size-fits-all quality approaches miss the distinct drivers of technical debt in ML systems.

Core claim

ML-specific code smells detected by CodeSmile occur 41-94 times less frequently than general Python smells detected by Pylint. Commit frequency and domain are significantly associated with ML-specific smell occurrence, but project size, team size, age, and CI/CD adoption are not. General Python smells are not associated with any of the six project features, and the domains most impacted by each smell type differ.
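To make the frequency comparison concrete, here is a minimal sketch of how a pooled ratio between two smell counts could be computed. It is not the paper's procedure: the `ProjectCounts` record, the helper names, and the counts are all hypothetical, and the way the study normalizes its counts is not stated on this page.

```python
from dataclasses import dataclass


@dataclass
class ProjectCounts:
    """Hypothetical per-project record; field names are assumptions."""
    name: str
    lines_of_code: int
    ml_smells: int        # count from an ML-smell detector
    general_smells: int   # count from a general-purpose linter


def density_per_kloc(smells: int, lines_of_code: int) -> float:
    """Smells per 1,000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return 1000.0 * smells / lines_of_code


def frequency_ratio(projects: list[ProjectCounts]) -> float:
    """How many times more frequent general smells are, pooled over projects."""
    total_ml = sum(p.ml_smells for p in projects)
    total_general = sum(p.general_smells for p in projects)
    if total_ml == 0:
        raise ValueError("no ML smells detected; ratio is undefined")
    return total_general / total_ml


# Made-up counts, for illustration only.
projects = [
    ProjectCounts("repo_a", 20_000, 10, 600),
    ProjectCounts("repo_b", 5_000, 5, 300),
]
ratio = frequency_ratio(projects)
```

A pooled ratio weights large projects more heavily than a per-project median would, which is one reason a study might report a range rather than a single figure.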

What carries the argument

Empirical comparison of ML-specific smells (CodeSmile) versus general Python smells (Pylint) related to project size, age, contributors, commit frequency, CI/CD adoption, and domain in 279 open-source ML repositories.

If this is right

Domains such as MLOps, Reinforcement Learning, and Computer Vision require distinct quality checks for ML-specific issues like configuration, tensor manipulation, and GPU workflows.
Standard CI/CD pipelines often miss domain-specific ML correctness problems, so specialized quality gates are needed.
General Python smell reduction strategies can ignore project context, but ML smell strategies must account for commit frequency and domain.
ML code quality depends on specialized practices rather than the general project metrics traditionally linked to technical debt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

ML teams may apply stricter discipline to ML-specific code than to surrounding general code.
Detection tools for ML smells would benefit from tighter integration with domain-specific validators beyond standard automation.
The independence of general smells from all project traits suggests systemic language-level issues that persist across contexts.

Load-bearing premise

The tools CodeSmile and Pylint correctly detect and classify the relevant code smells without substantial measurement error, and the 279 projects represent typical ML development practices.

What would settle it

A representative sample of ML projects in which ML-specific smell density equals or exceeds general Python smell density, or in which project size shows a significant correlation with ML smell frequency.

Figures

Figures reproduced from arXiv: 2606.01882 by Betül Cimendag, Halimeh Agh, Stefan Wagner.

**Figure 1.** Overview of the research methodology.

**Figure 2.** Distribution of ML-CSs Across Domains.

**Figure 3.** Distribution of General Python Smells Across Domains.

Original abstract

Machine learning systems consist of general-purpose code as well as machine-learning-specific code. While ML-specific code smells have been identified, their connection to project characteristics and their interaction with overall code quality are not well understood. Without this knowledge, quality assurance strategies remain one-size-fits-all, failing to account for the contextual factors that drive technical debt in ML systems. We present empirical evidence by examining how six project features (size, age, contributors, commit frequency, CI/CD adoption, and domain) relate to both ML-specific and general Python code quality in 279 open-source ML projects on GitHub. Using CodeSmile for ML code smells and Pylint for general Python smells, our results show: (1) ML code smells are 41-94 times less frequent than general Python smells; (2) commit frequency and domain are significantly associated with ML-specific quality, while project size, team size, age, and CI/CD adoption are not, challenging traditional views on technical debt; (3) general Python smells are not linked to any project characteristic, indicating systemic coding issues that are independent of project context; (4) domains that suffer most from ML-specific smells are not necessarily the same domains that suffer most from general Python smells, necessitating tailored quality strategies for each smell type. MLOps often involves configuration issues, Reinforcement Learning faces challenges with tensor manipulation, and Computer Vision encounters problems with GPU workflows. Overall, ML code quality depends on domain-specific practices and specialized CI/CD quality gates, as standard automation often overlooks domain-specific correctness problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ML smells are 41-94 times rarer than general Python ones and are tied only to commit frequency and domain, but the gap rests on unvalidated detectors.

The main thing to know is that this study measures ML-specific code smells as 41-94 times less common than general Python smells across 279 open-source projects, and finds that only commit frequency and domain correlate with the ML ones while nothing correlates with the general ones.

The paper does a solid job of running the comparison on real projects with named detectors. It breaks out the results by project traits and notes domain differences, such as reinforcement learning struggling with tensors and computer vision with GPU code. That gives a clearer picture than prior work that treated smells more uniformly.

The main concern is measurement. CodeSmile's rules for ML smells may not be tuned to the same distributions as the ML projects in the sample, so the large gap and the specific associations could change if the tool undercounts or overcounts. The abstract gives no detail on the exact statistical approach or any checks for confounds, which makes it tough to assess how robust the null findings are for size, team, age, and CI/CD.

The sample is limited to open-source GitHub ML projects in Python, so the patterns may not hold for closed-source or other language work.

This is the kind of study that software engineering groups working on ML quality assurance would want to see. It supplies concrete ratios and associations that can be tested or extended. Readers who follow code smell research or technical debt in data science projects will get usable data points from it.

The work shows clear thinking in separating the smell types and testing multiple characteristics. It deserves a serious referee to dig into the methods and the tool performance.

I would recommend sending it for peer review, with the expectation that reviewers will press on validation of the detectors and the statistical details.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical analysis of code smells in 279 open-source ML projects on GitHub. Using CodeSmile to detect ML-specific smells and Pylint for general Python smells, it finds ML-specific smells are 41-94 times less frequent than general ones. Commit frequency and domain are significantly associated with ML-specific quality, while project size, team size, age, and CI/CD adoption are not. General Python smells show no associations with any project characteristics. Different domains are affected differently by the two types of smells, suggesting the need for tailored quality assurance strategies.

Significance. If the detection tools accurately capture the smells without substantial bias, the results provide valuable evidence that ML code quality is driven by different factors than general code quality, challenging traditional technical debt models that treat all code uniformly. The large sample size and use of established tools like Pylint are strengths. The findings have practical implications for MLOps practices, highlighting domain-specific issues like tensor manipulation in RL and GPU workflows in CV.

major comments (2)

Abstract and Methods (code smell detection): The headline result that ML code smells are 41-94 times less frequent, and the associations with commit frequency and domain, depend on CodeSmile providing a faithful measure of ML smell prevalence. No validation of CodeSmile's accuracy (e.g., precision/recall against manual review or comparison in ML-specific contexts like tensor operations) is described. If CodeSmile under-detects ML smells relative to Pylint's detection of general smells, the frequency ratio and null results for other factors become difficult to interpret.
Results (statistical analysis): The abstract mentions significant associations but provides no details on the statistical methods, correction for multiple testing, effect sizes, or handling of potential confounds such as correlations between project characteristics. This information is necessary to assess the robustness of the claims about which factors are or are not associated.
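The validation the first major comment asks for can be pictured with a small sketch: compare a detector's reported locations against a manually reviewed sample and compute precision and recall. The function name, the `file:line` identifiers, and the sample sets are hypothetical; nothing here reflects CodeSmile's actual output format or measured accuracy.

```python
def precision_recall(detected: set[str], manually_confirmed: set[str]) -> tuple[float, float]:
    """Precision and recall of detector findings against a manual review.

    Both sets hold opaque smell identifiers (here, made-up "file:line" strings).
    """
    true_positives = len(detected & manually_confirmed)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(manually_confirmed) if manually_confirmed else 0.0
    return precision, recall


# Illustrative data only: four detector hits, three manually confirmed smells.
detected = {"a.py:10", "a.py:42", "b.py:7", "c.py:3"}
confirmed = {"a.py:10", "b.py:7", "d.py:99"}
precision, recall = precision_recall(detected, confirmed)
```

Low recall would mean the detector undercounts ML smells, which would inflate the reported gap to general Python smells; low precision would push in the opposite direction.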

minor comments (2)

[Abstract] The abstract could briefly note the statistical approach used to determine 'significantly associated' to improve clarity for readers.
[Discussion] Consider adding a limitations section explicitly addressing potential measurement error in the smell detection tools.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on tool validation and statistical transparency. We address each major comment below, indicating planned revisions where appropriate.

Referee: Abstract and Methods (code smell detection): The headline result that ML code smells are 41-94 times less frequent, and the associations with commit frequency and domain, depend on CodeSmile providing a faithful measure of ML smell prevalence. No validation of CodeSmile's accuracy (e.g., precision/recall against manual review or comparison in ML-specific contexts like tensor operations) is described. If CodeSmile under-detects ML smells relative to Pylint's detection of general smells, the frequency ratio and null results for other factors become difficult to interpret.

Authors: We agree that the manuscript does not include an independent validation of CodeSmile within this study. CodeSmile was chosen as the established detector from its originating publication, which reports performance metrics on ML code. To improve interpretability of the 41-94x frequency ratio, we will revise the Methods section to cite those prior metrics, explicitly discuss the assumption of comparable detection fidelity with Pylint, and add a dedicated paragraph in Threats to Validity acknowledging the absence of fresh ML-specific precision/recall evaluation as a limitation. This directly mitigates concerns about systematic under-detection biasing the headline results and the null findings for other project characteristics.

revision: yes

Referee: Results (statistical analysis): The abstract mentions significant associations but provides no details on the statistical methods, correction for multiple testing, effect sizes, or handling of potential confounds such as correlations between project characteristics. This information is necessary to assess the robustness of the claims about which factors are or are not associated.

Authors: The full manuscript contains a Statistical Analysis subsection in Methods that specifies negative binomial regression for the count-based smell data, Bonferroni correction for the six project characteristics tested, incidence rate ratios as effect sizes, and multicollinearity checks via VIF (all <5, indicating no problematic confounds). The Results section reports these alongside the significance findings. We will revise the abstract to include a one-sentence summary of the modeling approach and add explicit cross-references from Results to the Methods details. This makes the robustness information immediately accessible without altering the existing analysis.

revision: partial
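The Bonferroni correction named in the rebuttal is simple enough to sketch: with six characteristics tested, each p-value is compared against alpha divided by six. The function below is a generic illustration, and the p-values are invented for the example; they are not results from the paper.

```python
def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag each test as significant under a Bonferroni-adjusted threshold.

    The family-wise alpha is split evenly across all tests in `p_values`.
    """
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


# Invented p-values for six project characteristics (illustration only).
example_p_values = {
    "size": 0.20,
    "age": 0.60,
    "contributors": 0.15,
    "commit_frequency": 0.001,
    "ci_cd": 0.30,
    "domain": 0.004,
}
flags = bonferroni_significant(example_p_values)
```

With six tests the per-test threshold is about 0.0083, so a p-value of 0.02 that would pass an uncorrected 0.05 cutoff no longer counts as significant.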

Circularity Check

0 steps flagged

No significant circularity; purely observational empirical study.

The paper performs direct measurement of code smells via external static-analysis tools (CodeSmile for ML-specific smells, Pylint for general Python smells) across 279 GitHub projects, followed by statistical association tests against six project characteristics. No equations, normalizations, fitted parameters, or predictions are present that could reduce to self-referential definitions or fitted inputs called predictions. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. All reported results (frequency ratios, domain associations, null results for size/team/age/CI-CD) are outputs of the measurement and correlation pipeline rather than inputs redefined as findings. The study is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims depend on the accuracy of automated smell detection and the representativeness of the sampled projects; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Open-source GitHub ML projects form a representative sample for drawing conclusions about ML code quality.
The study selects and analyzes 279 such projects to generalize about ML systems and project characteristics.

domain assumption: CodeSmile and Pylint outputs constitute valid and comparable measures of ML-specific and general code smells.
All reported frequencies, ratios, and associations rest on the classifications produced by these two tools.

pith-pipeline@v0.9.1-grok · 5811 in / 1305 out tokens · 30243 ms · 2026-06-28T13:46:32.111203+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 32 canonical work pages

[1]

Software engineering for machine learning: A case study,

S. Amershi et al., “Software engineering for machine learning: A case study,” in Proceedings of the 41st IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, 2019, pp. 291–300. doi: 10.1109/ICSE-SEIP.2019.00042

work page doi:10.1109/icse-seip.2019.00042 2019
[2]

The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub,

J. D. Gonzalez, T. Zimmermann, and N. Nagappan, “The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub,” in Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 431–442. doi: 10.1145/3379597.3387473

work page doi:10.1145/3379597.3387473 2020
[3]

Software engineering for AI-based systems: A survey,

S. Martínez-Fernández et al., “Software engineering for AI-based systems: A survey,” ACM Trans. Softw. Eng. Methodol., vol. 31, no. 2, p. 37e:1-37e:59, 2022, doi: 10.1145/3487043

work page doi:10.1145/3487043 2022
[4]

Software quality for AI: Where we are now?,

V. Lenarduzzi, F. Lomio, S. Moreschini, D. Taibi, and D. A. Tamburri, “Software quality for AI: Where we are now?,” in Software Quality: Future Perspectives on Software Engineering Quality, D. Winkler, S. Biffl, D. Mendez, M. Wimmer, and J. Bergsmann, Eds., Cham: Springer International Publishing, 2021, pp. 43–53. doi: 10.1007/978-3-030-65854-0_4

work page doi:10.1007/978-3-030-65854-0_4 2021
[5]

Hidden technical debt in machine learning systems,

D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in Neural Information Processing Systems, 2015, pp. 2503–2511

2015
[6]

Refactoring: Improving the design of existing code

M. Fowler, Refactoring: Improving the design of existing code. Addison-Wesley Professional, 2018

2018
[7]

Code smells and refactoring: A tertiary systematic review of challenges and observations,

G. Lacerda, F. Petrillo, M. Pimenta, and Y. G. Guéhéneuc, “Code smells and refactoring: A tertiary systematic review of challenges and observations,” Journal of Systems and Software, vol. 167, p. 110610, 2020, doi: 10.1016/j.jss.2020.110610

work page doi:10.1016/j.jss.2020.110610 2020
[8]

The Prevalence of Code Smells in Machine Learning projects,

B. van Oort, L. Cruz, M. Aniche, and A. van Deursen, “The Prevalence of Code Smells in Machine Learning projects,” in Proceedings of the 1st IEEE/ACM Workshop on AI Engineering - Software Engineering for AI, 2021, pp. 1–8. doi: 10.1109/WAIN52551.2021.00011

work page doi:10.1109/wain52551.2021.00011 2021
[9]

On the diffuseness and the impact on maintainability of code smells: A large scale empirical investigation,

F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia, “On the diffuseness and the impact on maintainability of code smells: A large scale empirical investigation,” in Proceedings of the 40th International Conference on Software Engineering, 2018, p. 482. doi: 10.1145/3180155.3182532

work page doi:10.1145/3180155.3182532 2018
[10]

An exploratory study of the impact of antipatterns on class change- and fault-proneness,

F. Khomh, M. D. Penta, Y.-G. Guéhéneuc, and G. Antoniol, “An exploratory study of the impact of antipatterns on class change- and fault-proneness,” Empir Software Eng, vol. 17, no. 3, pp. 243–275, 2012, doi: 10.1007/s10664-011-9171-y

work page doi:10.1007/s10664-011-9171-y 2012
[11]

Code smells for machine learning applications,

H. Zhang, L. Cruz, and A. van Deursen, “Code smells for machine learning applications,” in Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, 2022, pp. 217–228. doi: 10.1145/3522664.3528620

work page doi:10.1145/3522664.3528620 2022
[12]

When code smells meet ML: On the lifecycle of ML-specific code smells in ML-enabled systems,

G. Recupito, G. Giordano, F. Ferrucci, D. Di Nucci, and F. Palomba, “When code smells meet ML: On the lifecycle of ML-specific code smells in ML-enabled systems,” Empir Software Eng, vol. 30, no. 5, p. 139, 2025, doi: 10.1007/s10664-025-10676-4

work page doi:10.1007/s10664-025-10676-4 2025
[13]

NICHE: A curated dataset of engineered machine learning projects in Python,

R. Widyasari et al., “NICHE: A curated dataset of engineered machine learning projects in Python,” in Proceedings of 20th IEEE/ACM International Conference on Mining Software Repositories, 2023, pp. 62–66. doi: 10.1109/MSR59073.2023.00022

work page doi:10.1109/msr59073.2023.00022 2023
[14]

A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects,

A. J. Simmons, S. Barnett, J. Rivera-Villicana, A. Bajaj, and R. Vasa, “A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects,” in Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2020, pp. 1–11. doi: 10.1145/3382494.3410680

work page doi:10.1145/3382494.3410680 2020
[15]

Understanding developer practices and code smells diffusion in AI-enabled software: A preliminary study,

G. Giordano, G. Annunziata, A. De Lucia, and F. Palomba, “Understanding developer practices and code smells diffusion in AI-enabled software: A preliminary study,” in IWSM-Mensura, 2023

2023
[16]

An evidence-based study on the relationship of software engineering practices on code smells in Python ML projects,

G. Giordano, A. Della Porta, F. Ferrucci, and F. Palomba, “An evidence-based study on the relationship of software engineering practices on code smells in Python ML projects,” in Software Engineering and Advanced Applications, D. Taibi and D. Smite, Eds., Cham: Springer Nature, 2026, pp. 105–120. doi: 10.1007/978-3-032-04207-1_8

work page doi:10.1007/978-3-032-04207-1_8 2026
[17]

An empirical study of refactorings and technical debt in machine learning systems,

Y. Tang, R. Khatchadourian, M. Bagherzadeh, R. Singh, A. Stewart, and A. Raja, “An empirical study of refactorings and technical debt in machine learning systems,” in Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering, 2021, pp. 238–250. doi: 10.1109/ICSE43902.2021.00033

work page doi:10.1109/icse43902.2021.00033 2021
[18]

Prevalence of code smells in reinforcement learning projects,

N. Cardozo, I. Dusparic, and C. Cabrera, “Prevalence of code smells in reinforcement learning projects,” 2023, arXiv: arXiv:2303.10236

arXiv 2023
[19]

Automatic identification of machine learning-specific code smells,

P. Hamfelt, R. Britto, L. Rocha, and C. Almendra, “Automatic identification of machine learning-specific code smells,” 2025, arXiv: 2508.02541

arXiv 2025
[20]

AI-specific code smells: From specification to detection,

B. Mahmoudi, N. Moha, Q. Stiévenart, and F. Avellaneda, “AI-specific code smells: From specification to detection,” 2025, arXiv: 2509.20491

Pith/arXiv arXiv 2025
[21]

Is it all lost? A study of inactive open source projects,

J. Khondhu, A. Capiluppi, and K.-J. Stol, “Is it all lost? A study of inactive open source projects,” in Open Source Software: Quality Verification, E. Petrinja, G. Succi, N. El Ioini, and A. Sillitti, Eds., Springer, 2013, pp. 61–79. doi: 10.1007/978-3-642-38928-3_5

work page doi:10.1007/978-3-642-38928-3_5 2013
[22]

Is this GitHub project maintained? Measuring the level of maintenance activity of open-source projects,

J. Coelho, M. T. Valente, L. Milen, and L. L. Silva, “Is this GitHub project maintained? Measuring the level of maintenance activity of open-source projects,” Information and Software Technology, vol. 122, p. 106274, 2020, doi: 10.1016/j.infsof.2020.106274

work page doi:10.1016/j.infsof.2020.106274 2020
[23]

Understanding the factors that impact the popularity of GitHub repositories,

H. Borges, A. Hora, and M. T. Valente, “Understanding the factors that impact the popularity of GitHub repositories,” in Proceedings of the IEEE International Conference on Software Maintenance and Evolution, 2016, pp. 334–344. doi: 10.1109/ICSME.2016.31

work page doi:10.1109/icsme.2016.31 2016
[24]

Is popularity a measure of quality? An analysis of Maven components,

H. Sajnani, V. Saini, J. Ossher, and C. V. Lopes, “Is popularity a measure of quality? An analysis of Maven components,” in Proceedings of the IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 231–240. doi: 10.1109/ICSME.2014.45

work page doi:10.1109/icsme.2014.45 2014
[25]

The promises and perils of mining GitHub,

E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining GitHub,” in Proceedings of the 11th Working Conference on Mining Software Repositories, 2014, pp. 92–101. doi: 10.1145/2597073.2597074

work page doi:10.1145/2597073.2597074 2014
[26]

A novel approach for estimating truck factors,

G. Avelino, L. Passos, A. Hora, and M. T. Valente, “A novel approach for estimating truck factors,” in Proceedings of the 24th IEEE International Conference on Program Comprehension, 2016, pp. 1–10. doi: 10.1109/ICPC.2016.7503718

work page doi:10.1109/icpc.2016.7503718 2016
[27]

Branch use in practice: A large-scale empirical study of 2,923 projects on GitHub,

W. Zou, W. Zhang, X. Xia, R. Holmes, and Z. Chen, “Branch use in practice: A large-scale empirical study of 2,923 projects on GitHub,” in Proceedings of the 19th IEEE International Conference on Software Quality, Reliability and Security, 2019, pp. 306–317. doi: 10.1109/QRS.2019.00047

work page doi:10.1109/qrs.2019.00047 2019
[28]

Usage, costs, and benefits of continuous integration in open-source projects,

M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, “Usage, costs, and benefits of continuous integration in open-source projects,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 426–437. doi: 10.1145/2970276.2970358

work page doi:10.1145/2970276.2970358 2016
[29]

Quality and productivity outcomes relating to continuous integration in GitHub,

B. Vasilescu, Y. Yu, H. Wang, P. Devanbu, and V. Filkov, “Quality and productivity outcomes relating to continuous integration in GitHub,” in Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 805–816. doi: 10.1145/2786805.2786850

work page doi:10.1145/2786805.2786850 2015
[30]

Automatically categorising GitHub repositories by application domain,

F. Zanartu et al., “Automatically categorising GitHub repositories by application domain,” 2022, arXiv: 2208.00269

arXiv 2022
[31]

HiGitClass: Keyword-driven hierarchical classification of GitHub repositories,

Y. Zhang et al., “HiGitClass: Keyword-driven hierarchical classification of GitHub repositories,” in Proceedings of the IEEE International Conference on Data Mining, 2019, pp. 876–885. doi: 10.1109/ICDM.2019.00098

work page doi:10.1109/icdm.2019.00098 2019
[32]

An empirical study of code smells in transformer-based code generation techniques,

M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, “An empirical study of code smells in transformer-based code generation techniques,” in Proceedings of the 22nd IEEE International Working Conference on Source Code Analysis and Manipulation, 2022, pp. 71–82. doi: 10.1109/SCAM55253.2022.00014

work page doi:10.1109/scam55253.2022.00014 2022
[33]

Robust statistical methods for empirical software engineering,

B. Kitchenham et al., “Robust statistical methods for empirical software engineering,” Empir Software Eng, vol. 22, no. 2, pp. 579–630, 2017, doi: 10.1007/s10664-016-9437-5

work page doi:10.1007/s10664-016-9437-5 2017
[34]

A systematic mapping study on technical debt and its management,

Z. Li, P. Avgeriou, and P. Liang, “A systematic mapping study on technical debt and its management,” Journal of Systems and Software, vol. 101, pp. 193–220, 2015, doi: 10.1016/j.jss.2014.12.027

work page doi:10.1016/j.jss.2014.12.027 2015
[35]

The ML test score: A rubric for ML production readiness and technical debt reduction,

E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, “The ML test score: A rubric for ML production readiness and technical debt reduction,” in Proceedings of the IEEE International Conference on Big Data, 2017, pp. 1123–1132. doi: 10.1109/BigData.2017.8258038

work page doi:10.1109/bigdata.2017.8258038 2017
[36]

Machine learning testing: Survey, landscapes and horizons,

J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning testing: Survey, landscapes and horizons,” IEEE Transactions on Software Engineering, vol. 48, no. 1, pp. 1–36, 2022, doi: 10.1109/TSE.2019.2962027

work page doi:10.1109/tse.2019.2962027 2022
[37]

Machine learning operations (MLOps): Overview, definition, and architecture,

D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine learning operations (MLOps): Overview, definition, and architecture,” IEEE Access, vol. 11, pp. 31866–31879, 2023, doi: 10.1109/ACCESS.2023.3262138

work page doi:10.1109/access.2023.3262138 2023
[38]

What’s wrong with computational notebooks? Pain points, needs, and design opportunities,

S. Chattopadhyay, I. Prasad, A. Z. Henley, A. Sarma, and T. Barik, “What’s wrong with computational notebooks? Pain points, needs, and design opportunities,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–12. doi: 10.1145/3313831.3376729

work page doi:10.1145/3313831.3376729 2020
[39]

Exploration and explanation in computational notebooks,

A. Rule, A. Tabard, and J. D. Hollan, “Exploration and explanation in computational notebooks,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–12. doi: 10.1145/3173574.3173606

work page doi:10.1145/3173574.3173606 2018

[1] [1]

Software engineering for machine learning: A case study,

S. Amershi et al., “Software engineering for machine learning: A case study,” in Proceedings of the 41st IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, 2019, pp. 291–300. doi: 10.1109/ICSE-SEIP.2019.00042

work page doi:10.1109/icse-seip.2019.00042 2019

[2] [2]

The state of the ML-universe: 10 years of artificial intelligence & machine learning soft- ware development on GitHub,

J. D. Gonzalez, T. Zimmermann, and N. Nagappan, “The state of the ML-universe: 10 years of artificial intelligence & machine learning soft- ware development on GitHub,” in Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 431–442. doi: 10.1145/3379597.3387473

work page doi:10.1145/3379597.3387473 2020

[3] [3]

Software engineering for AI-based sys- tems: A survey,

S. Mart ´ınez-Fern´andez et al., “Software engineering for AI-based sys- tems: A survey,” ACM Trans. Softw. Eng. Methodol., vol. 31, no. 2, p. 37e:1-37e:59, 2022, doi: 10.1145/3487043

work page doi:10.1145/3487043 2022

[4] [4]

Robots and AI: Illusions and Social Dilemmas

V . Lenarduzzi, F. Lomio, S. Moreschini, D. Taibi, and D. A. Tamburri, “Software quality for AI: Where we are now?,” in Software Quality: Future Perspectives on Software Engineering Quality, D. Winkler, S. Biffl, D. Mendez, M. Wimmer, and J. Bergsmann, Eds., Cham: Springer International Publishing, 2021, pp. 43–53. doi: 10.1007/978-3-030- 65854-0 4

work page doi:10.1007/978-3-030- 2021

[5] [5]

Hidden technical debt in machine learning systems,

D. Sculley et al., “Hidden technical debt in machine learning systems,” in Advances in Neural Information Processing Systems, 2015, pp. 2503–2511

2015

[6] [6]

Fowler, Refactoring: Improving the design of existing code

M. Fowler, Refactoring: Improving the design of existing code. Addison- Wesley Professional, 2018

2018

[7] [7]

Code smells and refactoring: A tertiary systematic review of challenges and observations,

G. Lacerda, F. Petrillo, M. Pimenta, and Y . G. Gu ´eh´eneuc, “Code smells and refactoring: A tertiary systematic review of challenges and observations,” Journal of Systems and Software, vol. 167, p. 110610, 2020, doi: 10.1016/j.jss.2020.110610

[8] B. van Oort, L. Cruz, M. Aniche, and A. van Deursen, “The Prevalence of Code Smells in Machine Learning projects,” in Proceedings of the 1st IEEE/ACM Workshop on AI Engineering - Software Engineering for AI, 2021, pp. 1–8. doi: 10.1109/WAIN52551.2021.00011

[9] F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, and A. De Lucia, “On the diffuseness and the impact on maintainability of code smells: A large scale empirical investigation,” in Proceedings of the 40th International Conference on Software Engineering, 2018, p. 482. doi: 10.1145/3180155.3182532

[10] F. Khomh, M. D. Penta, Y.-G. Guéhéneuc, and G. Antoniol, “An exploratory study of the impact of antipatterns on class change- and fault-proneness,” Empir Software Eng, vol. 17, no. 3, pp. 243–275, 2012, doi: 10.1007/s10664-011-9171-y

[11] H. Zhang, L. Cruz, and A. van Deursen, “Code smells for machine learning applications,” in Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, 2022, pp. 217–228. doi: 10.1145/3522664.3528620

[12] G. Recupito, G. Giordano, F. Ferrucci, D. Di Nucci, and F. Palomba, “When code smells meet ML: On the lifecycle of ML-specific code smells in ML-enabled systems,” Empir Software Eng, vol. 30, no. 5, p. 139, 2025, doi: 10.1007/s10664-025-10676-4

[13] R. Widyasari et al., “NICHE: A curated dataset of engineered machine learning projects in Python,” in Proceedings of 20th IEEE/ACM International Conference on Mining Software Repositories, 2023, pp. 62–66. doi: 10.1109/MSR59073.2023.00022

[14] A. J. Simmons, S. Barnett, J. Rivera-Villicana, A. Bajaj, and R. Vasa, “A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects,” in Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2020, pp. 1–11. doi: 10.1145/3382494.3410680

[15] G. Giordano, G. Annunziata, A. De Lucia, and F. Palomba, “Understanding developer practices and code smells diffusion in AI-enabled software: A preliminary study,” in IWSM-Mensura, 2023

[16] G. Giordano, A. Della Porta, F. Ferrucci, and F. Palomba, “An evidence-based study on the relationship of software engineering practices on code smells in Python ML projects,” in Software Engineering and Advanced Applications, D. Taibi and D. Smite, Eds., Cham: Springer Nature, 2026, pp. 105–120. doi: 10.1007/978-3-032-04207-1_8

[17] Y. Tang, R. Khatchadourian, M. Bagherzadeh, R. Singh, A. Stewart, and A. Raja, “An empirical study of refactorings and technical debt in machine learning systems,” in Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering, 2021, pp. 238–250. doi: 10.1109/ICSE43902.2021.00033

[18] N. Cardozo, I. Dusparic, and C. Cabrera, “Prevalence of code smells in reinforcement learning projects,” 2023, arXiv: 2303.10236

[19] P. Hamfelt, R. Britto, L. Rocha, and C. Almendra, “Automatic identification of machine learning-specific code smells,” 2025, arXiv: 2508.02541

[20] B. Mahmoudi, N. Moha, Q. Stiévenart, and F. Avellaneda, “AI-specific code smells: From specification to detection,” 2025, arXiv: 2509.20491

[21] J. Khondhu, A. Capiluppi, and K.-J. Stol, “Is it all lost? A study of inactive open source projects,” in Open Source Software: Quality Verification, E. Petrinja, G. Succi, N. El Ioini, and A. Sillitti, Eds., Springer, 2013, pp. 61–79. doi: 10.1007/978-3-642-38928-3_5

[22] J. Coelho, M. T. Valente, L. Milen, and L. L. Silva, “Is this GitHub project maintained? Measuring the level of maintenance activity of open-source projects,” Information and Software Technology, vol. 122, p. 106274, 2020, doi: 10.1016/j.infsof.2020.106274

[23] H. Borges, A. Hora, and M. T. Valente, “Understanding the factors that impact the popularity of GitHub repositories,” in Proceedings of the IEEE International Conference on Software Maintenance and Evolution, 2016, pp. 334–344. doi: 10.1109/ICSME.2016.31

[24] H. Sajnani, V. Saini, J. Ossher, and C. V. Lopes, “Is popularity a measure of quality? An analysis of Maven components,” in Proceedings of the IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 231–240. doi: 10.1109/ICSME.2014.45

[25] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining GitHub,” in Proceedings of the 11th Working Conference on Mining Software Repositories, 2014, pp. 92–101. doi: 10.1145/2597073.2597074

[26] G. Avelino, L. Passos, A. Hora, and M. T. Valente, “A novel approach for estimating truck factors,” in Proceedings of the 24th IEEE International Conference on Program Comprehension, 2016, pp. 1–10. doi: 10.1109/ICPC.2016.7503718

[27] W. Zou, W. Zhang, X. Xia, R. Holmes, and Z. Chen, “Branch use in practice: A large-scale empirical study of 2,923 projects on GitHub,” in Proceedings of the 19th IEEE International Conference on Software Quality, Reliability and Security, 2019, pp. 306–317. doi: 10.1109/QRS.2019.00047

[28] M. Hilton, T. Tunnell, K. Huang, D. Marinov, and D. Dig, “Usage, costs, and benefits of continuous integration in open-source projects,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 426–437. doi: 10.1145/2970276.2970358

[29] B. Vasilescu, Y. Yu, H. Wang, P. Devanbu, and V. Filkov, “Quality and productivity outcomes relating to continuous integration in GitHub,” in Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 805–816. doi: 10.1145/2786805.2786850

[30] F. Zanartu et al., “Automatically categorising GitHub repositories by application domain,” 2022, arXiv: 2208.00269

[31] Y. Zhang et al., “HiGitClass: Keyword-driven hierarchical classification of GitHub repositories,” in Proceedings of the IEEE International Conference on Data Mining, 2019, pp. 876–885. doi: 10.1109/ICDM.2019.00098

[32] M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, “An empirical study of code smells in transformer-based code generation techniques,” in Proceedings of the 22nd IEEE International Working Conference on Source Code Analysis and Manipulation, 2022, pp. 71–82. doi: 10.1109/SCAM55253.2022.00014

[33] B. Kitchenham et al., “Robust statistical methods for empirical software engineering,” Empir Software Eng, vol. 22, no. 2, pp. 579–630, 2017, doi: 10.1007/s10664-016-9437-5

[34] Z. Li, P. Avgeriou, and P. Liang, “A systematic mapping study on technical debt and its management,” Journal of Systems and Software, vol. 101, pp. 193–220, 2015, doi: 10.1016/j.jss.2014.12.027

[35] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, “The ML test score: A rubric for ML production readiness and technical debt reduction,” in Proceedings of the IEEE International Conference on Big Data, 2017, pp. 1123–1132. doi: 10.1109/BigData.2017.8258038

[36] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning testing: Survey, landscapes and horizons,” IEEE Transactions on Software Engineering, vol. 48, no. 1, pp. 1–36, 2022, doi: 10.1109/TSE.2019.2962027

[37] D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine learning operations (MLOps): Overview, definition, and architecture,” IEEE Access, vol. 11, pp. 31866–31879, 2023, doi: 10.1109/ACCESS.2023.3262138

[38] S. Chattopadhyay, I. Prasad, A. Z. Henley, A. Sarma, and T. Barik, “What’s wrong with computational notebooks? Pain points, needs, and design opportunities,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–12. doi: 10.1145/3313831.3376729

[39] A. Rule, A. Tabard, and J. D. Hollan, “Exploration and explanation in computational notebooks,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–12. doi: 10.1145/3173574.3173606
