Evaluating the Environmental Impact of Using SLMs and Prompt Engineering for Code Generation
Pith reviewed 2026-05-13 20:09 UTC · model grok-4.3
The pith
Sustainability decouples from accuracy in small language model code generation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The empirical study reveals that sustainability often decouples from accuracy in SLM-based code generation, allowing significant environmental optimizations without sacrificing performance. Chain-of-Thought prompting provides a near-optimal balance between reasoning capability and energy efficiency. Multi-sampling strategies often incur disproportionate costs for marginal gains. Grid carbon intensity is the dominant factor in deployment-time emissions.
What carries the argument
Systematic evaluation of six prompting strategies across eleven SLMs on HumanEval+ and MBPP+ benchmarks, tracking Pass@1 accuracy together with energy consumption, carbon emissions, and inference latency.
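The protocol lends itself to a compact measurement harness. Below is a minimal sketch, assuming the CodeCarbon package for energy and emissions tracking; `generate`, `passes_tests`, and `build_prompt` are hypothetical stand-ins for the model call, the EvalPlus-style test execution, and the prompting strategy, not the authors' code.

```python
# Minimal sketch of one strategy/model measurement loop (illustrative,
# not the authors' harness). Assumes CodeCarbon for energy tracking.
from codecarbon import EmissionsTracker

def evaluate_strategy(model, tasks, build_prompt):
    """Return Pass@1, energy (kWh), and emissions (kgCO2eq) for one configuration."""
    tracker = EmissionsTracker(log_level="error")
    tracker.start()
    solved = 0
    for task in tasks:
        prompt = build_prompt(task)           # e.g. zero-shot vs. Chain-of-Thought
        completion = generate(model, prompt)  # hypothetical model call; one sample -> Pass@1
        solved += passes_tests(completion, task)  # hypothetical test execution
    emissions_kg = tracker.stop()             # CodeCarbon reports total kgCO2eq
    energy_kwh = tracker.final_emissions_data.energy_consumed  # kWh, per CodeCarbon's output schema
    return solved / len(tasks), energy_kwh, emissions_kg
```

Running this once per (model, strategy) pair yields exactly the three quantities the study tabulates, which is what makes the accuracy-energy decoupling directly observable.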
If this is right
- Chain-of-Thought prompting achieves strong reasoning with relatively low energy consumption.
- Multi-sampling techniques should be used sparingly because they add high costs for only small accuracy improvements.
- Practitioners must consider local grid carbon intensity when estimating emissions from model deployment (see the sketch after this list).
- Prompt engineering offers a practical, parameter-free method to improve both performance and sustainability in code generation.
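The grid-intensity point is a straight multiplication: deployment emissions are inference energy times the regional intensity factor. A toy illustration follows; the intensity values are ballpark assumptions for illustration, not the paper's measurements.

```python
# Toy illustration of why grid carbon intensity dominates deployment-time
# emissions: identical inference energy yields very different footprints.
GRID_INTENSITY_KG_PER_KWH = {
    "coal-heavy grid":  0.80,   # illustrative ballpark values, not measured data
    "EU average":       0.25,
    "hydro-heavy grid": 0.03,
}

def emissions_kg(energy_kwh: float, region: str) -> float:
    """kgCO2eq = energy (kWh) x regional grid intensity (kgCO2eq/kWh)."""
    return energy_kwh * GRID_INTENSITY_KG_PER_KWH[region]

energy = 0.5  # hypothetical kWh for one benchmark sweep
for region in GRID_INTENSITY_KG_PER_KWH:
    print(f"{region}: {emissions_kg(energy, region):.3f} kgCO2eq")
```

Because the intensity factor varies by more than an order of magnitude across regions, it swamps the differences between prompting strategies; and a multi-sampling strategy that draws k samples scales energy, and hence emissions, roughly by k, which is why marginal accuracy gains can carry disproportionate costs.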
Where Pith is reading between the lines
- Prompt-selection features could be added to coding tools to automatically favor low-emission strategies based on the user's region.
- The accuracy-energy decoupling observed here may appear in other generative AI tasks that rely on prompting.
- Standardized testing protocols across hardware would help confirm how much the results depend on the specific lab conditions used.
Load-bearing premise
Energy and carbon measurements taken on the authors' specific hardware and under controlled benchmark conditions accurately reflect real-world developer usage patterns and typical prompts.
What would settle it
Direct energy and accuracy measurements collected from typical developer coding sessions on varied hardware and with everyday prompts would show whether the observed trade-offs hold outside the lab setup.
Original abstract
The shift from cloud-hosted Large Language Models (LLMs) to locally deployed open-source Small Language Models (SLMs) has democratized AI-assisted coding; however, it has also decentralized the environmental footprint of AI. While prompting strategies - such as Chain-of-Thought and ReAct - serve as external mechanisms for optimizing code generation without modifying model parameters, their impact on energy consumption and carbon emissions remains largely invisible to developers. This paper presents the first systematic empirical study investigating how different prompt engineering strategies in SLM-based code generation impact code generation accuracy alongside sustainability factors. We evaluate six prominent prompting strategies across 11 open-source models (ranging from 1B to 34B parameters) using the HumanEval+ and MBPP+ benchmarks. By measuring Pass@1 accuracy alongside energy (kWh), carbon emissions (kgCO2eq), and inference latency, we reveal that sustainability often decouples from accuracy, allowing significant environmental optimizations without sacrificing performance. Our findings indicate that Chain-of-Thought, being a simpler prompting technique, can provide a near-optimal balance between reasoning capability and energy efficiency. Conversely, multi-sampling strategies often incur disproportionate costs for marginal gains. Finally, we identify grid carbon intensity as the dominant factor in deployment-time emissions, highlighting the need for practitioners to consider regional energy profiles. This work provides a quantitative foundation for "green" prompt engineering, enabling developers to align high-performance code generation with ecological responsibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to conduct the first systematic empirical study on the environmental impact of prompt engineering strategies in small language model (SLM) based code generation. Using six prompting strategies on 11 open-source models (1B to 34B parameters) with HumanEval+ and MBPP+ benchmarks, it measures Pass@1 accuracy, energy consumption in kWh, carbon emissions in kgCO2eq, and inference latency. The key findings are that sustainability often decouples from accuracy, Chain-of-Thought provides a near-optimal balance of reasoning capability and energy efficiency, multi-sampling strategies incur high costs for marginal gains, and grid carbon intensity is the dominant factor in emissions.
Significance. If the empirical results hold under broader conditions, this work would provide a valuable quantitative foundation for green prompt engineering in AI-assisted coding. It makes the invisible environmental costs of prompting visible to developers and offers practical recommendations for balancing performance and sustainability, particularly by favoring simpler techniques like Chain-of-Thought. The emphasis on regional energy profiles adds an important dimension to deployment decisions in sustainable software engineering.
Major comments (2)
- [Methodology] The paper's central claims about decoupling of sustainability from accuracy and the optimality of Chain-of-Thought rest on energy and carbon measurements collected under controlled, short-context, single-query conditions on the authors' specific hardware. The manuscript should provide more details on the measurement methodology, including hardware specifications, statistical controls, and how these metrics were computed, to substantiate the findings.
- [Results] The generalization of the observed decoupling to real-world developer usage patterns is not addressed. Factors such as longer contexts, multi-turn interactions, batching, CPU/GPU heterogeneity, and varying grid intensities could reorder the relative energy costs of different prompting techniques, which is a load-bearing concern for the practical recommendations.
Minor comments (1)
- [Abstract] The abstract states the findings but does not specify the exact number of models or the benchmarks used, which would help readers quickly assess the scope.
Simulated Authors' Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and scope while preserving the integrity of the empirical findings.
Point-by-point responses
- Referee: [Methodology] The paper's central claims about decoupling of sustainability from accuracy and the optimality of Chain-of-Thought rest on energy and carbon measurements collected under controlled, short-context, single-query conditions on the authors' specific hardware. The manuscript should provide more details on the measurement methodology, including hardware specifications, statistical controls, and how these metrics were computed, to substantiate the findings.
  Authors: We agree that expanded methodological transparency will strengthen the paper. The original manuscript (Section 3) already specifies the use of CodeCarbon for energy tracking, the NVIDIA A100-based server hardware, and regional grid intensity values from Electricity Maps. In the revision we will add: (1) exact hardware specifications (CPU model, GPU power limits, memory configuration), (2) the statistical protocol (five independent runs per configuration, with mean and standard deviation reported for all metrics), and (3) the precise formulas for the kWh-to-kgCO2eq conversion. These additions will be placed in a new Subsection 3.4 to directly support the controlled-condition claims; a sketch of this protocol appears after these responses. revision: yes
- Referee: [Results] The generalization of the observed decoupling to real-world developer usage patterns is not addressed. Factors such as longer contexts, multi-turn interactions, batching, CPU/GPU heterogeneity, and varying grid intensities could reorder the relative energy costs of different prompting techniques, which is a load-bearing concern for the practical recommendations.
  Authors: We acknowledge the limitation of our controlled single-query design. The study deliberately isolates prompting effects under short-context conditions to obtain reproducible measurements; extending to multi-turn or batched workloads would require a substantially larger experimental campaign. In the revised manuscript we will add an explicit Limitations subsection (Section 5.3) discussing how longer contexts, multi-turn interactions, batching, and hardware heterogeneity could alter relative costs, while noting that grid carbon intensity remains the dominant factor regardless of prompting strategy. We will also frame the practical recommendations as applying to the studied regime and flag broader validation as future work. This addresses the concern without overstating generalizability. revision: partial
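To make the promised protocol concrete, here is a minimal sketch of the five-run summary the rebuttal describes; `measure_run` is a hypothetical stand-in for one CodeCarbon-tracked benchmark pass, and the interface is an assumption, not the authors' code.

```python
# Sketch of the statistical protocol from the rebuttal: five independent
# runs per configuration, with mean and standard deviation reported, and
# energy converted to emissions via the regional grid intensity factor.
from statistics import mean, stdev

N_RUNS = 5  # independent runs per configuration, as promised in the revision

def summarize(config, grid_intensity_kg_per_kwh):
    """Aggregate per-run energy (kWh) into mean/std energy and emissions."""
    energies = [measure_run(config) for _ in range(N_RUNS)]        # kWh per run (hypothetical)
    emissions = [e * grid_intensity_kg_per_kwh for e in energies]  # kgCO2eq per run
    return {
        "energy_kwh_mean": mean(energies),
        "energy_kwh_std": stdev(energies),
        "co2_kg_mean": mean(emissions),
        "co2_kg_std": stdev(emissions),
    }
```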
Circularity Check
No circularity; results are direct empirical measurements on public benchmarks
Full rationale
The paper reports Pass@1 accuracy, kWh energy, kgCO2eq emissions, and latency measured directly on HumanEval+ and MBPP+ for 11 SLMs and six prompting strategies. No equations, fitted parameters, self-citations, or uniqueness theorems are used to derive the central claims; the decoupling observation and the Chain-of-Thought recommendation follow from the tabulated measurements themselves. The evaluation is grounded in external public benchmarks, so by construction the results are not derived from the paper's own assumptions.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: HumanEval+ and MBPP+ benchmarks provide a reliable measure of code generation accuracy.
- Domain assumption: Energy and carbon measurements can be accurately obtained from hardware monitoring during inference.