pith. sign in

arxiv: 2604.02754 · v1 · submitted 2026-04-03 · 💻 cs.SE

An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models

Pith reviewed 2026-05-13 20:18 UTC · model grok-4.3

classification 💻 cs.SE
keywords sustainabilitysmall language modelstest script generationenergy consumptioncarbon emissionsprompt engineeringunit testingsoftware testing
0
0 comments X

The pith

Small language models for prompt-driven unit test generation display distinct profiles trading energy use and speed against coverage and stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the energy consumption, carbon emissions, and runtime of small language models between 2B and 8B parameters when they generate unit test scripts from prompts on the HumanEval benchmark. It tracks these quantities with CodeCarbon under controlled conditions and uses test coverage as a first proxy for output quality. Different models turn out to have different sustainability profiles, with some running cheaper and faster while others deliver more stable or broader coverage. This matters because prompt-driven generation is spreading in software testing, so knowing which models cost less carbon can guide practical choices.

Core claim

The study empirically examines the environmental and performance tradeoffs of SLMs using the HumanEval benchmark and adaptive prompt variants. The analysis uses CodeCarbon to characterize energy consumption, carbon emissions and duration under controlled conditions, with unit-test script coverage serving as an initial proxy for generated test quality. Our results show that different SLMs exhibit distinct sustainability profiles - some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions.

What carries the argument

CodeCarbon tracking of energy, emissions, and duration during prompt-driven test script generation by 2B-8B parameter SLMs on HumanEval, using adaptive Anthropic-style prompts and coverage as quality proxy.

If this is right

  • Model selection for test generation can be guided by whether a task prioritizes low energy use or higher coverage stability.
  • Prompt structure interacts with model choice to shape both environmental cost and test outcomes.
  • SLMs can serve as lower-impact alternatives to larger models for routine automated testing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Energy-aware test generation pipelines could dynamically switch among SLMs based on current sustainability targets.
  • The observed profiles suggest value in repeating the measurements on full project repositories rather than isolated benchmarks.
  • These tradeoffs could combine with other green software practices to reduce overall emissions in continuous integration.

Load-bearing premise

That coverage of the generated unit tests is a sufficient proxy for their actual quality and that the controlled CodeCarbon measurements generalize to real-world usage.

What would settle it

A follow-up experiment on large real-world codebases that measures actual defect detection rates against energy consumed; results would falsify the claim if coverage shows no reliable link to effectiveness.

Figures

Figures reproduced from arXiv: 2604.02754 by Novarun Deb, Pragati Kumari.

Figure 1
Figure 1. Figure 1: The Empirical Study pipeline of inference, we designed four progressively constrained prompt templates (AP𝑉0 through AP𝑉3 ). This structured approach adheres to industry-standard prompt engineering methodologies, specifically leveraging the steering and constraint techniques proposed by An￾thropic3 . The structural evolution of these variants is summarized in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SCI evaluations for different model-prompt con [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: SVI scores for different quantizations of [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: SVI scores for different model-prompt configura [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
read the original abstract

The increasing use of language models in automated test script generation raises concerns about their environmental impact, yet existing sustainability analyses focus predominantly on large language models. As a result, the energy and carbon characteristics of small language models (SLMs) during prompt-driven unit-test script generation remain largely unexplored. To address this gap, this study empirically examines the environmental and performance tradeoffs of SLMs (in the 2B-8B parameter range) using the HumanEval benchmark and adaptive prompt variants (based on the Anthropic template). The analysis uses CodeCarbon to characterize energy consumption carbon emissions and duration under controlled conditions, with unit-test script coverage serving as an initial proxy for generated test quality. Our results show that different SLMs exhibit distinct sustainability profiles - some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions. Overall, this work provides focused empirical evidence on sustainable SLM-based test script generation, clarifying how prompt structure and model selection jointly shape environmental and performance outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents an empirical study of sustainability in prompt-driven unit-test script generation using small language models (SLMs in the 2B-8B range). It evaluates energy consumption, carbon emissions, and execution duration via CodeCarbon on the HumanEval benchmark with adaptive prompts derived from the Anthropic template, treating unit-test coverage as an initial proxy for generated test quality. The central claim is that different SLMs exhibit distinct sustainability profiles, with some favoring lower energy use and faster execution while others maintain higher stability or coverage under comparable conditions.

Significance. If the empirical measurements hold under scrutiny, the work is significant for providing focused data on the environmental and performance tradeoffs of SLMs in software engineering tasks, a gap left by prior studies centered on larger models. The controlled experimental setup with CodeCarbon supports reproducibility, and the emphasis on prompt structure and model selection offers practical guidance for sustainable test automation. Strengths include the direct measurement of energy/emissions alongside coverage, which could inform model choice in resource-constrained settings.

major comments (1)
  1. [Abstract and results] Abstract and results discussion: The headline claim of distinct sustainability profiles (balancing energy, speed, stability, and coverage) depends on unit-test coverage serving as a sufficient proxy for test quality. Coverage counts executed lines on HumanEval but does not verify semantic correctness, assertion strength, or fault-detection power. If models produce scripts with similar coverage yet differing brittleness or non-assertive tests, the reported coverage or stability advantages cannot be interpreted as quality advantages, weakening the joint claim that prompt structure and model choice shape meaningful environmental-performance outcomes.
minor comments (2)
  1. [Abstract] The abstract states that coverage serves as an 'initial proxy' but provides no error bars, statistical tests, or full methodological details on measurement (e.g., exact CodeCarbon configuration, prompt adaptation steps, or number of runs). Adding these would strengthen the empirical presentation without altering the core claim.
  2. [Methodology] Minor notation issue: The description of 'adaptive prompt variants' lacks a concrete example or table showing the base Anthropic template versus the adapted versions used for each SLM, which would aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps us clarify the scope and limitations of our empirical study. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract and results] Abstract and results discussion: The headline claim of distinct sustainability profiles (balancing energy, speed, stability, and coverage) depends on unit-test coverage serving as a sufficient proxy for test quality. Coverage counts executed lines on HumanEval but does not verify semantic correctness, assertion strength, or fault-detection power. If models produce scripts with similar coverage yet differing brittleness or non-assertive tests, the reported coverage or stability advantages cannot be interpreted as quality advantages, weakening the joint claim that prompt structure and model choice shape meaningful environmental-performance outcomes.

    Authors: We acknowledge that unit-test coverage is an imperfect and preliminary proxy that does not capture semantic correctness, assertion strength, or fault-detection effectiveness. Our manuscript already describes coverage as an 'initial proxy' and frames the sustainability profiles strictly in terms of the measured trade-offs among energy consumption, carbon emissions, execution duration, and coverage rates (with stability defined as consistency across repeated runs). The central claims concern these observable metrics rather than asserting superior test quality. Nevertheless, we agree that the current wording risks over-interpretation. We will revise the abstract, results discussion, and limitations section to more explicitly state the boundaries of coverage as a quality indicator and to emphasize that the reported profiles reflect environmental and performance characteristics under this proxy, not comprehensive test effectiveness. This revision will be incorporated in the next version of the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper conducts controlled experiments measuring energy use, carbon emissions, runtime, and unit-test coverage for SLMs on HumanEval using CodeCarbon and adaptive prompts. No equations, derivations, fitted parameters, or predictions appear in the provided text or abstract. Results are reported as direct observations of distinct sustainability profiles without any self-referential reduction where outputs are defined by or fitted to the same inputs. Coverage serves as an explicit proxy for quality but is not derived from or equated to the sustainability metrics by construction. This matches the reader's 0.0 assessment; the study is self-contained empirical work with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical measurement study; no mathematical derivations, free parameters, or postulated entities are present in the central claim.

pith-pipeline@v0.9.0 · 5471 in / 1028 out tokens · 47720 ms · 2026-05-13T20:18:18.963702+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    Tom Cappendijk, Pepijn de Reus, and Ana Oprescu. 2025. An exploration of prompting LLMs to generate energy-efficient code. InProceedings of the 47th International Conference on Software Engineering: GREENs Track. ACM, Ottawa, Canada. doi:10.48550/arXiv.2405.05097

  2. [2]

    Chaoyang Deng, Yuxuan Liu, Ruochen Wang, Yuxin Wen, Yuxuan Meng, Shuai Li, Huayi Wang, Xin Jin, and Yao Xu. 2024. LLMCarbon: Quantifying and mitigating the carbon footprint of large language models.arXiv preprint arXiv:2402.08776 (2024)

  3. [3]

    Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 416–419

  4. [4]

    Green Software Foundation. 2021. Green software foundation website.Green Software Foundation(2021). Available online: https://greensoftware.foundation

  5. [5]

    Iyer, Ravi Pathak, Achyudh Sridhar, Varsha Apte, Raghuram Gopalakr- ishnan, Deepa Kalathil, and Parthasarathy Ranganathan

    Vishakh H. Iyer, Ravi Pathak, Achyudh Sridhar, Varsha Apte, Raghuram Gopalakr- ishnan, Deepa Kalathil, and Parthasarathy Ranganathan. 2023. Carbon-aware large language model inference in the cloud.arXiv preprint arXiv:2309.08101 (2023)

  6. [6]

    Saurabh Kapoor. 2024. Green software quality: A comprehensive framework for sustainable metrics in software development.International Journal of Computer Trends and Technology72, 10 (2024), 113–120. doi:10.14445/22312803/IJCTT- V72I10P118

  7. [7]

    Patricia Lago, Qing Gu, and Paolo Bozzelli. 2014. A systematic literature re- view of green software metrics.VU University Amsterdam Technical Report (2014). https://research.vu.nl/en/publications/a-systematic-literature-review-of- green-software-metrics

  8. [8]

    Yuhan Li, Rong Chen, Mayank Gupta, and Dinesh Rao. 2023. Energy consumption patterns in automated test script generation. InProceedings of the 45th Interna- tional Conference on Software Engineering (ICSE). IEEE, 845–856

  9. [9]

    Alexandra Sasha Luccioni, Evan Cornblath, Lynn Kaack, Aditi Kapur, and Alexan- dre Lacoste. 2023. ML CO2 emissions tracker: Estimating the carbon footprint of ML models in real time.arXiv preprint arXiv:2302.08476(2023)

  10. [10]

    Mourão, Leila Karita, and Ivan do Carmo Machado

    Brunna C. Mourão, Leila Karita, and Ivan do Carmo Machado. 2018. Green and sustainable software engineering: A systematic mapping study. InProceedings of the SBQS. ACM, 1–10. doi:10.1145/3275245.3275258

  11. [11]

    Ana Oprescu, Pepijn Reus, and Tom Cappendijk. 2023. Prompt engineering for energy efficiency: Lessons from code generation. InProceedings of the 2023 International Conference on Green Software. ACM

  12. [12]

    Hannah Rashkin, Maarten Sap, Maxwell Forbes, and Yejin Choi. 2021. Words to watts: A benchmark for measuring the energy efficiency of NLP models and tasks. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 10000–10015

  13. [13]

    Ritesh Sharma, Neha Batra, and Xin Liu. 2023. Sustainable test automation: Quantifying the environmental impact of code testing with SLMs.Journal of Software Sustainability15, 3 (2023), 225–239

  14. [14]

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 3645–3650

  15. [15]

    Tina Vartziotis, Ippolyti Dellatolas, George Dasoulas, Maximilian Schmidt, Florian Schneider, Tim Hoffmann, Sotirios Kotsopoulos, and Michael Keckeisen. 2024. Learn to code sustainably: An empirical study on green code generation. In Proceedings of the 2024 International Workshop on Large Language Models for Code (LLM4Code). ACM, 1–8. doi:10.1145/3643795.3648394

  16. [16]

    Roberto Verdecchia, Patricia Lago, Christof Ebert, and Carol de Vries. 2021. Green IT and green software.IEEE Software38, 6 (2021), 7–15. doi:10.1109/MS.2021. 3102254

  17. [17]

    Eric Zhilin Wu, Joseph Gonzalez, Yoav Goldberg, and Tatsunori Hashimoto. 2023. Green prompting: Reducing inference-time emissions of language models.arXiv preprint arXiv:2305.17138(2023)