Evaluating the Environmental Impact of Using SLMs and Prompt Engineering for Code Generation
Pith reviewed 2026-05-13 20:09 UTC · model grok-4.3
The pith
Sustainability decouples from accuracy in small language model code generation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The empirical study reveals that sustainability often decouples from accuracy in SLM-based code generation, allowing significant environmental optimizations without sacrificing performance. Chain-of-Thought prompting provides a near-optimal balance between reasoning capability and energy efficiency. Multi-sampling strategies often incur disproportionate costs for marginal gains. Grid carbon intensity is the dominant factor in deployment-time emissions.
What carries the argument
Systematic evaluation of six prompting strategies across eleven SLMs on HumanEval+ and MBPP+ benchmarks, tracking Pass@1 accuracy together with energy consumption, carbon emissions, and inference latency.
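The protocol lends itself to a compact measurement harness. Below is a minimal sketch, assuming the CodeCarbon package for energy and emissions tracking; `generate`, `passes_tests`, and `build_prompt` are hypothetical stand-ins for the model call, the EvalPlus-style test execution, and the prompting strategy, not the authors' code.

```python
# Minimal sketch of one strategy/model measurement loop (illustrative,
# not the authors' harness). Assumes CodeCarbon for energy tracking.
from codecarbon import EmissionsTracker

def evaluate_strategy(model, tasks, build_prompt):
    """Return Pass@1, energy (kWh), and emissions (kgCO2eq) for one configuration."""
    tracker = EmissionsTracker(log_level="error")
    tracker.start()
    solved = 0
    for task in tasks:
        prompt = build_prompt(task)           # e.g. zero-shot vs. Chain-of-Thought
        completion = generate(model, prompt)  # hypothetical model call; one sample -> Pass@1
        solved += passes_tests(completion, task)  # hypothetical test execution
    emissions_kg = tracker.stop()             # CodeCarbon reports total kgCO2eq
    energy_kwh = tracker.final_emissions_data.energy_consumed  # kWh, per CodeCarbon's output schema
    return solved / len(tasks), energy_kwh, emissions_kg
```

Running this once per (model, strategy) pair yields exactly the three quantities the study tabulates, which is what makes the accuracy-energy decoupling directly observable.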
If this is right
- Chain-of-Thought prompting achieves strong reasoning with relatively low energy consumption.
- Multi-sampling techniques should be used sparingly because they add high costs for only small accuracy improvements.
- Practitioners must consider local grid carbon intensity when estimating emissions from model deployment (see the sketch after this list).
- Prompt engineering offers a practical, parameter-free method to improve both performance and sustainability in code generation.
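The grid-intensity point is a straight multiplication: deployment emissions are inference energy times the regional intensity factor. A toy illustration follows; the intensity values are ballpark assumptions for illustration, not the paper's measurements.

```python
# Toy illustration of why grid carbon intensity dominates deployment-time
# emissions: identical inference energy yields very different footprints.
GRID_INTENSITY_KG_PER_KWH = {
    "coal-heavy grid":  0.80,   # illustrative ballpark values, not measured data
    "EU average":       0.25,
    "hydro-heavy grid": 0.03,
}

def emissions_kg(energy_kwh: float, region: str) -> float:
    """kgCO2eq = energy (kWh) x regional grid intensity (kgCO2eq/kWh)."""
    return energy_kwh * GRID_INTENSITY_KG_PER_KWH[region]

energy = 0.5  # hypothetical kWh for one benchmark sweep
for region in GRID_INTENSITY_KG_PER_KWH:
    print(f"{region}: {emissions_kg(energy, region):.3f} kgCO2eq")
```

Because the intensity factor varies by more than an order of magnitude across regions, it swamps the differences between prompting strategies; and a multi-sampling strategy that draws k samples scales energy, and hence emissions, roughly by k, which is why marginal accuracy gains can carry disproportionate costs.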
Where Pith is reading between the lines
- Prompt-selection features could be added to coding tools to automatically favor low-emission strategies based on the user's region.
- The accuracy-energy decoupling observed here may appear in other generative AI tasks that rely on prompting.
- Standardized testing protocols across hardware would help confirm how much the results depend on the specific lab conditions used.
Load-bearing premise
Energy and carbon measurements taken on the authors' specific hardware and under controlled benchmark conditions accurately reflect real-world developer usage patterns and typical prompts.
What would settle it
Direct energy and accuracy measurements collected from typical developer coding sessions on varied hardware and with everyday prompts would show whether the observed trade-offs hold outside the lab setup.
Original abstract
The shift from cloud-hosted Large Language Models (LLMs) to locally deployed open-source Small Language Models (SLMs) has democratized AI-assisted coding; however, it has also decentralized the environmental footprint of AI. While prompting strategies - such as Chain-of-Thought and ReAct - serve as external mechanisms for optimizing code generation without modifying model parameters, their impact on energy consumption and carbon emissions remains largely invisible to developers. This paper presents the first systematic empirical study investigating how different prompt engineering strategies in SLM-based code generation impact code generation accuracy alongside sustainability factors. We evaluate six prominent prompting strategies across 11 open-source models (ranging from 1B to 34B parameters) using the HumanEval+ and MBPP+ benchmarks. By measuring Pass@1 accuracy alongside energy (kWh), carbon emissions (kgCO2eq), and inference latency, we reveal that sustainability often decouples from accuracy, allowing significant environmental optimizations without sacrificing performance. Our findings indicate that Chain-of-Thought, being a simpler prompting technique, can provide a near-optimal balance between reasoning capability and energy efficiency. Conversely, multi-sampling strategies often incur disproportionate costs for marginal gains. Finally, we identify grid carbon intensity as the dominant factor in deployment-time emissions, highlighting the need for practitioners to consider regional energy profiles. This work provides a quantitative foundation for "green" prompt engineering, enabling developers to align high-performance code generation with ecological responsibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to conduct the first systematic empirical study on the environmental impact of prompt engineering strategies in small language model (SLM) based code generation. Using six prompting strategies on 11 open-source models (1B to 34B parameters) with HumanEval+ and MBPP+ benchmarks, it measures Pass@1 accuracy, energy consumption in kWh, carbon emissions in kgCO2eq, and inference latency. The key findings are that sustainability often decouples from accuracy, Chain-of-Thought provides a near-optimal balance of reasoning capability and energy efficiency, multi-sampling strategies incur high costs for marginal gains, and grid carbon intensity is the dominant factor in emissions.
Significance. If the empirical results hold under broader conditions, this work would provide a valuable quantitative foundation for green prompt engineering in AI-assisted coding. It makes the invisible environmental costs of prompting visible to developers and offers practical recommendations for balancing performance and sustainability, particularly by favoring simpler techniques like Chain-of-Thought. The emphasis on regional energy profiles adds an important dimension to deployment decisions in sustainable software engineering.
Major comments (2)
- [Methodology] The paper's central claims about decoupling of sustainability from accuracy and the optimality of Chain-of-Thought rest on energy and carbon measurements collected under controlled, short-context, single-query conditions on the authors' specific hardware. The manuscript should provide more details on the measurement methodology, including hardware specifications, statistical controls, and how these metrics were computed, to substantiate the findings.
- [Results] The generalization of the observed decoupling to real-world developer usage patterns is not addressed. Factors such as longer contexts, multi-turn interactions, batching, CPU/GPU heterogeneity, and varying grid intensities could reorder the relative energy costs of different prompting techniques, which is a load-bearing concern for the practical recommendations.
Minor comments (1)
- [Abstract] The abstract states the findings but does not specify the exact number of models or the benchmarks used, which would help readers quickly assess the scope.
Simulated Authors' Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and scope while preserving the integrity of the empirical findings.
Point-by-point responses
- Referee: [Methodology] The paper's central claims about decoupling of sustainability from accuracy and the optimality of Chain-of-Thought rest on energy and carbon measurements collected under controlled, short-context, single-query conditions on the authors' specific hardware. The manuscript should provide more details on the measurement methodology, including hardware specifications, statistical controls, and how these metrics were computed, to substantiate the findings.
  Authors: We agree that expanded methodological transparency will strengthen the paper. The original manuscript (Section 3) already specifies the use of CodeCarbon for energy tracking, the NVIDIA A100-based server hardware, and regional grid intensity values from Electricity Maps. In the revision we will add: (1) exact hardware specifications (CPU model, GPU power limits, memory configuration), (2) the statistical protocol (five independent runs per configuration, with mean and standard deviation reported for all metrics), and (3) the precise formulas for the kWh-to-kgCO2eq conversion. These additions will be placed in a new Subsection 3.4 to directly support the controlled-condition claims; a sketch of this protocol appears after these responses. revision: yes
- Referee: [Results] The generalization of the observed decoupling to real-world developer usage patterns is not addressed. Factors such as longer contexts, multi-turn interactions, batching, CPU/GPU heterogeneity, and varying grid intensities could reorder the relative energy costs of different prompting techniques, which is a load-bearing concern for the practical recommendations.
  Authors: We acknowledge the limitation of our controlled single-query design. The study deliberately isolates prompting effects under short-context conditions to obtain reproducible measurements; extending to multi-turn or batched workloads would require a substantially larger experimental campaign. In the revised manuscript we will add an explicit Limitations subsection (Section 5.3) discussing how longer contexts, multi-turn interactions, batching, and hardware heterogeneity could alter relative costs, while noting that grid carbon intensity remains the dominant factor regardless of prompting strategy. We will also frame the practical recommendations as applying to the studied regime and flag broader validation as future work. This addresses the concern without overstating generalizability. revision: partial
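To make the promised protocol concrete, here is a minimal sketch of the five-run summary the rebuttal describes; `measure_run` is a hypothetical stand-in for one CodeCarbon-tracked benchmark pass, and the interface is an assumption, not the authors' code.

```python
# Sketch of the statistical protocol from the rebuttal: five independent
# runs per configuration, with mean and standard deviation reported, and
# energy converted to emissions via the regional grid intensity factor.
from statistics import mean, stdev

N_RUNS = 5  # independent runs per configuration, as promised in the revision

def summarize(config, grid_intensity_kg_per_kwh):
    """Aggregate per-run energy (kWh) into mean/std energy and emissions."""
    energies = [measure_run(config) for _ in range(N_RUNS)]        # kWh per run (hypothetical)
    emissions = [e * grid_intensity_kg_per_kwh for e in energies]  # kgCO2eq per run
    return {
        "energy_kwh_mean": mean(energies),
        "energy_kwh_std": stdev(energies),
        "co2_kg_mean": mean(emissions),
        "co2_kg_std": stdev(emissions),
    }
```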
Circularity Check
No circularity; results are direct empirical measurements on public benchmarks
Full rationale
The paper reports Pass@1 accuracy, kWh energy, kgCO2eq emissions, and latency measured directly on HumanEval+ and MBPP+ for 11 SLMs and six prompting strategies. No equations, fitted parameters, self-citations, or uniqueness theorems are used to derive the central claims; the decoupling observation and the Chain-of-Thought recommendation follow from the tabulated measurements themselves. The evaluation is grounded in external public benchmarks, so by construction the results are not derived from the paper's own assumptions.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: HumanEval+ and MBPP+ benchmarks provide a reliable measure of code generation accuracy.
- Domain assumption: Energy and carbon measurements can be accurately obtained from hardware monitoring during inference.