The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions

Alain Abran; Bashar Abdallah; Gustavo Santos; Mohammad Hamdaqa; Rola Al Bataineh

arxiv: 2606.19799 · v1 · pith:V3ISL5SVnew · submitted 2026-06-18 · 💻 cs.SE · cs.LG

Pith reviewed 2026-06-26 16:58 UTC · model grok-4.3

keywords: resource leaks · energy consumption · carbon emissions · TensorFlow · Keras · machine learning · code smells · sustainability

The pith

Two resource-leak smells in TensorFlow and Keras code raise electricity use by about 32 and 46 percent, respectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether two common coding mistakes, improper model reuse and unreleased tensor references, affect the energy draw of machine learning training runs. It compares otherwise identical tasks run with and without each mistake. The tests show consistent rises in electricity consumption and carbon emissions that pass statistical checks for significance. The work treats these increases as direct evidence that resource management in ML code carries measurable environmental costs.
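The text above names the two mistakes without defining them. As a rough, framework-free analogy for the second one, the sketch below assumes that an "unreleased reference" means a strong reference that outlives its usefulness and so keeps a buffer alive. `FakeTensor`, `run_steps`, and `count_alive` are hypothetical names for illustration only; this is not the paper's code and uses no TensorFlow API.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a framework tensor that owns a buffer."""

    def __init__(self, n_bytes):
        self.buffer = bytearray(n_bytes)


def run_steps(n_steps, history=None):
    """Create one FakeTensor per step and return weak probes to each.

    If `history` is a list, every tensor is also appended to it, which
    mimics the leaky pattern: a strong reference that is never released.
    """
    probes = []
    for _ in range(n_steps):
        tensor = FakeTensor(1024)
        probes.append(weakref.ref(tensor))  # weak refs do not keep it alive
        if history is not None:
            history.append(tensor)  # strong reference retained across steps
    return probes


def count_alive(probes):
    """Count how many probed objects are still reachable."""
    gc.collect()  # make the result independent of collector timing
    return sum(1 for probe in probes if probe() is not None)


history = []
leaky_probes = run_steps(5, history)
clean_probes = run_steps(5)

leaky_alive = count_alive(leaky_probes)    # objects held by `history`
clean_alive = count_alive(clean_probes)    # objects with no remaining owner

history.clear()                            # release the retained references
after_release = count_alive(leaky_probes)

print(leaky_alive, clean_alive, after_release)
```

In the leaky run every buffer stays reachable until `history` is cleared, while the clean run leaves nothing behind. Whether and how much such retention raises energy use is exactly what the paper measures; the sketch only illustrates the retention mechanism.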

Core claim

Controlled experiments on identical training tasks show that Improper Model Reuse increases electricity consumption by approximately 32 percent and Unreleased Tensor References by approximately 46 percent, with proportional rises in estimated CO2 emissions; paired statistical tests confirm the differences are systematic and significant.

What carries the argument

Controlled experiments that run identical training tasks against a smell-free baseline to isolate the effects of Improper Model Reuse and Unreleased Tensor References on energy and emissions.

Load-bearing premise

The controlled experiments isolate the effect of the resource leaks alone, with identical training tasks differing only in the presence of IMR or UTR and with accurate energy measurement methods.

What would settle it

Re-running the same training tasks on different hardware or with different models and finding no statistically significant difference in measured electricity use between the versions that contain the smells and the clean baselines.

Figures

Figures reproduced from arXiv: 2606.19799 by Alain Abran, Bashar Abdallah, Gustavo Santos, Mohammad Hamdaqa, Rola Al Bataineh.

**Figure 2.** Illustrates electricity consumption across configurations, and ...

**Figure 3.** CO2 emissions per run (mean ± SD across ten runs). IMR increased emissions by 31.78% and UTR by 45.77% relative to baseline.
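The Figure 3 caption reports a mean ± SD per configuration and a percentage increase relative to baseline. A minimal sketch of how such summary numbers are typically derived is given below. The per-run values are hypothetical placeholders, not the paper's measurements, and the helper names `summarize` and `percent_increase` are assumptions.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) for one configuration."""
    return mean(runs), stdev(runs)


def percent_increase(baseline_mean, variant_mean):
    """Relative increase of the variant over the baseline, in percent."""
    return 100.0 * (variant_mean - baseline_mean) / baseline_mean


# HYPOTHETICAL per-run CO2 values (ten runs each); not taken from the paper.
baseline_runs = [1.0, 1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0]
leaky_runs = [1.3, 1.4, 1.2, 1.3, 1.3, 1.4, 1.2, 1.3, 1.3, 1.3]

baseline_mean, baseline_sd = summarize(baseline_runs)
leaky_mean, leaky_sd = summarize(leaky_runs)
increase_pct = percent_increase(baseline_mean, leaky_mean)

print(f"baseline: {baseline_mean:.3f} ± {baseline_sd:.3f}")
print(f"leaky:    {leaky_mean:.3f} ± {leaky_sd:.3f}")
print(f"increase: {increase_pct:.2f}%")
```

The sample standard deviation (`stdev`, dividing by n - 1) is used here as an assumption; the caption does not say which variant the authors report.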

Original abstract

Efficiency and sustainability are critical considerations in the development and deployment of machine learning (ML) applications. Among the factors influencing sustainability, resource leaks in ML code can introduce hidden inefficiencies that elevate energy consumption and CO2 emissions. Despite this, empirical evidence quantifying their environmental impact remains limited. This emerging results paper presents an initial empirical investigation of two common resource-leak smells, namely Improper Model Reuse (IMR) and Unreleased Tensor References (UTR), and their impact on energy consumption and CO2 emissions in TensorFlow and Keras workloads. Controlled experiments were conducted for each smell by executing identical training tasks while comparing against a smell-free baseline. Our preliminary results show that both smells consistently increase estimated electricity usage and carbon emissions. IMR and UTR increased electricity consumption by approximately 32% and 46%, respectively, with proportional increases in CO2 emissions. Paired statistical tests indicate that these differences are systematic and statistically significant, providing initial empirical evidence that resource-leak smells may degrade ML energy efficiency and environmental sustainability. These findings suggest that resource-leak smells pose measurable risks to both software quality and sustainability, emphasizing the importance of integrating resource-lifecycle management and energy-efficiency considerations into ML development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The reported 32% and 46% energy increases rest on an unspecified estimation method, so the central numbers are not yet reliable evidence even though the topic is worth pursuing.

The main thing to know is that this paper claims two resource-leak smells in TensorFlow and Keras raise electricity consumption by roughly 32% and 46%, with matching CO2 increases, based on controlled runs and paired tests. Those specific percentages are the headline result, but the abstract supplies zero information on the energy measurement or estimation approach.

The work is new in applying smell detection to environmental metrics inside ML frameworks. It identifies Improper Model Reuse and Unreleased Tensor References, runs identical training tasks with and without each smell, and reports consistent differences. That direction extends existing code-smell and green-software literature in a practical way, and the controlled-experiment framing is a reasonable starting point for an emerging-results paper.

The soft spot is exactly where the stress test points: without any description of the tool, sampling, calibration, hardware, or how the estimator handles changes in memory or tensor lifetime, the deltas could be artifacts of the measurement rather than real consumption differences. The abstract calls the usage "estimated," which makes the statistical significance harder to evaluate. No task details, run counts, or variability numbers appear either.

This is aimed at researchers in sustainable software engineering who track energy in AI code. A reader looking for usable numbers to cite would probably hold off until the measurement method is shown. The paper shows clear engagement with the relevant literature on smells and sustainability, so the thinking is coherent on its own terms.

It deserves peer review. The topic is timely enough that referees could usefully push on the methods and help turn the preliminary findings into something more solid.

Referee Report

2 major / 0 minor

Summary. The paper claims that two resource-leak code smells in TensorFlow/Keras applications—Improper Model Reuse (IMR) and Unreleased Tensor References (UTR)—increase estimated electricity consumption by approximately 32% and 46% respectively (with proportional CO2 increases) relative to smell-free baselines. This is based on controlled experiments executing identical training tasks, with paired statistical tests reported as showing the differences are systematic and statistically significant.

Significance. If the energy measurements prove accurate and the experiments properly isolate the smells' effects, the work would provide useful initial empirical evidence connecting common ML coding practices to measurable environmental costs, supporting calls to integrate resource-lifecycle management into ML development workflows.

major comments (2)

[Abstract] The headline claims of 32% (IMR) and 46% (UTR) increases in electricity usage rest on an unspecified 'estimation' method. No description is given of the tool, model, sampling rate, calibration procedure, or handling of secondary effects such as changed GPU/CPU utilization or memory pressure induced by the leaks themselves; without this, the deltas cannot be confirmed as reflecting actual joule differences rather than estimator artifacts.
[Abstract] The controlled-experiment description provides no information on training-task specifics (model architecture, dataset, epochs), hardware platform, number of runs per condition, sample sizes, variability measures, or the exact paired statistical test and its results (p-values, degrees of freedom). These omissions make it impossible to assess whether the experiments isolate the smells or whether the significance claims are reliable.
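To make concrete what a fully reported paired test could look like, here is a minimal sketch of one option: an exact paired sign-flip (permutation) test on per-run differences. The source does not say which paired test the authors used, so this choice is an assumption, and the energy values are hypothetical placeholders. A permutation test has no degrees of freedom to report; a paired t-test would.

```python
from itertools import product


def paired_sign_flip_test(baseline, variant):
    """Exact two-sided sign-flip test on paired differences.

    Under the null hypothesis of no systematic effect, each paired
    difference is equally likely to be positive or negative, so every
    sign assignment is enumerated and compared with the observed sum.
    Returns (mean difference, p-value).
    """
    diffs = [v - b for b, v in zip(baseline, variant)]
    observed = abs(sum(diffs))
    n_total = 0
    n_extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        n_total += 1
        if flipped >= observed:
            n_extreme += 1
    return sum(diffs) / len(diffs), n_extreme / n_total


# HYPOTHETICAL per-run energy in Wh, paired by run index; not the paper's data.
baseline_wh = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
leaky_wh = [131, 135, 129, 133, 130, 132, 136, 128, 131, 133]

mean_diff, p_value = paired_sign_flip_test(baseline_wh, leaky_wh)
print(f"mean paired difference: {mean_diff:.1f} Wh, p = {p_value:.6f}")
```

With ten pairs the enumeration covers 2^10 = 1024 sign assignments, so the smallest attainable two-sided p-value is 2/1024. That floor is one reason the number of runs per condition needs to be stated alongside any significance claim.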

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your review and the recommendation for major revision. We address each of the major comments below and plan to incorporate clarifications and additional details into the revised manuscript.

Referee: [Abstract] The headline claims of 32% (IMR) and 46% (UTR) increases in electricity usage rest on an unspecified 'estimation' method. No description is given of the tool, model, sampling rate, calibration procedure, or handling of secondary effects such as changed GPU/CPU utilization or memory pressure induced by the leaks themselves; without this, the deltas cannot be confirmed as reflecting actual joule differences rather than estimator artifacts.

Authors: We agree that the abstract omits these methodological details. As this is an emerging results paper, the abstract is kept brief. We will revise the manuscript by adding a methods subsection that specifies the energy estimation tool used, the underlying power model, sampling rate, calibration procedure, and how we accounted for secondary effects like utilization changes due to the leaks. This will allow verification that the increases are attributable to the smells rather than artifacts. (Revision: yes)

Referee: [Abstract] The controlled-experiment description provides no information on training-task specifics (model architecture, dataset, epochs), hardware platform, number of runs per condition, sample sizes, variability measures, or the exact paired statistical test and its results (p-values, degrees of freedom). These omissions make it impossible to assess whether the experiments isolate the smells or whether the significance claims are reliable.

Authors: We acknowledge this limitation in the current abstract. We will revise the paper to include a comprehensive experimental setup description covering the training task details (model architecture, dataset, epochs), hardware platform, number of runs per condition, sample sizes, variability measures (e.g., standard deviation across runs), and the statistical test results including p-values and degrees of freedom. This will enable readers to evaluate the isolation of the smells' effects and the reliability of the significance claims. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of controlled runs

The paper reports results from controlled experiments that execute identical training tasks with and without the two smells, then apply paired statistical tests to the measured differences in estimated electricity and CO2. No equations, fitted parameters, self-citations, or derivations are present in the provided text that would reduce the reported 32%/46% deltas to inputs by construction. The central claim rests on the experimental design itself rather than any self-referential definition or prior fitted result from the same authors. This is a standard empirical measurement study whose validity hinges on measurement accuracy and isolation (addressed by the skeptic under correctness, not circularity).

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical measurement study that depends on domain assumptions about experimental isolation and measurement accuracy rather than new theoretical entities or fitted parameters.

axioms (2)

Domain assumption: Energy consumption differences can be accurately attributed to the presence or absence of the two resource leaks in controlled runs. Invoked by the design of comparing identical tasks against a smell-free baseline.
Domain assumption: Electricity usage estimates reliably translate to CO2 emission estimates. Used to claim proportional increases in carbon emissions.
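The second assumption can be illustrated with a short sketch. It assumes the common simplification that emissions equal energy multiplied by a constant grid carbon intensity. Under that assumption the relative increase in CO2 equals the relative increase in energy whatever the intensity value, which fits the Figure 3 caption (31.78% and 45.77% for emissions against roughly 32% and 46% for electricity). All numbers and names below are hypothetical.

```python
def emissions_kg(energy_kwh, intensity_kg_per_kwh):
    """Estimated CO2 in kg, ASSUMING a constant grid carbon intensity."""
    return energy_kwh * intensity_kg_per_kwh


def relative_increase(baseline, variant):
    """Fractional increase of the variant over the baseline."""
    return (variant - baseline) / baseline


# HYPOTHETICAL energy figures in kWh; the variant is 46% above baseline.
baseline_kwh = 0.50
leaky_kwh = 0.73

energy_increase = relative_increase(baseline_kwh, leaky_kwh)

# HYPOTHETICAL carbon intensities (kg CO2 per kWh) for three grids.
co2_increases = []
for intensity in (0.1, 0.4, 0.8):
    base_co2 = emissions_kg(baseline_kwh, intensity)
    leaky_co2 = emissions_kg(leaky_kwh, intensity)
    co2_increases.append(relative_increase(base_co2, leaky_co2))

print(f"energy increase: {energy_increase:.2%}")
print("CO2 increases:  ", ", ".join(f"{x:.2%}" for x in co2_increases))
```

The absolute emissions differ by grid, but the percentage increase does not. A proportional CO2 result therefore adds no independent evidence beyond the energy measurement when a single fixed intensity is used.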

Reference graph

Works this paper leans on

12 extracted references · 8 canonical work pages

[1] Bashar Abdallah, Martyna E. Wojciechowska, Gustavo Santos, Edmand Yu, Maxime Lamothe, Alain Abran, and Mohammad Hamdaqa. 2025. From Code Smells to Best Practices: Tackling Resource Leaks in PyTorch, TensorFlow, and Keras. arXiv preprint arXiv:2511.15229 (2025). https://arxiv.org/abs/2511.15229

[2] Benoit Courty et al. 2024. mlco2/codecarbon: v2.4.1. Software package. doi:10.5281/zenodo.11171501

[3] Hedi Jebnoun, Heba Ben Braiek, Mohammad Masudur Rahman, and Foutse Khomh. 2020. The Scent of Deep Learning Code: An Empirical Study. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR). 420–430. doi:10.1145/3379597.3387479

[4] Ioannis Mavromatis, Konstantinos Katsaros, and Asad Khan. 2024. Computing Within Limits: An Empirical Study of Energy Consumption in ML Training and Inference. arXiv preprint arXiv:2406.14328 (2024). http://arxiv.org/abs/2406.14328

[5] Md Rakib Hasan Misu, Jiajun Li, Akhil Bhattiprolu, Yan Liu, Eduardo Almeida, and Iftekhar Ahmed. 2025. Test Smell: A Parasitic Energy Consumer in Software Testing. Information and Software Technology 175 (2025), 107671. doi:10.1016/j.infsof.2025.107671

[6] David Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon Emissions and Large Neural Network Training. arXiv preprint arXiv:2104.10350 (2021). https://arxiv.org/abs/2104.10350

[7] Alejandro Sánchez-Mompó, Ioannis Mavromatis, Panagiotis Li, Konstantinos Katsaros, and Asad Khan. 2025. Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations. Information 16, 4 (2025). doi:10.3390/info16040281

[8] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2020. Energy and Policy Considerations for Modern Deep Learning Research. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 13693–13701. doi:10.1609/aaai.v34i09.7123

[9] Girish Suryanarayana, Ganesh Samarthyam, and Tushar Sharma. 2014. Refactoring for Software Design Smells: Managing Technical Debt. Morgan Kaufmann. doi:10.1016/C2013-0-23413-9

[10] Claes Wohlin, Per Runeson, Martin Höst, Magnus C. Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer. doi:10.1007/978-3-642-29044-2

[11] Priyanka Singh Yadav, Raghavendra Selvan Rao, Alok Mishra, and Manish Gupta

[12] Machine Learning-Based Methods for Code Smell Detection: A Survey. Applied Sciences 14, 14 (2024), 6149. doi:10.3390/app14146149

A Detailed Experimental Results. This appendix provides the full per-run measurements supporting the results in Section 4. Energy is reported in kWh and CO2 in kg, consistent with CodeCarbon outputs. Table 3: Per-run energy consum...
