pith. machine review for the scientific record.

arxiv: 2604.02776 · v1 · submitted 2026-04-03 · 💻 cs.SE

Recognition: no theorem link

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords small language models · prompt engineering · code generation · energy consumption · carbon emissions · sustainability · Chain-of-Thought

The pith

Sustainability decouples from accuracy in small language model code generation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how six prompting strategies affect both the correctness and the environmental cost of code generated by small language models. It runs the strategies on eleven open-source models from 1B to 34B parameters using the HumanEval+ and MBPP+ benchmarks while tracking accuracy, energy use in kWh, carbon emissions, and latency. The measurements show that accuracy and energy consumption do not always rise or fall together, so some prompt choices cut emissions without hurting output quality. Chain-of-Thought stands out for delivering strong reasoning at modest energy cost, whereas multi-sampling methods add large energy overhead for small accuracy gains. The work matters because local SLM use is spreading and simple prompt decisions can reduce the overall carbon footprint of AI-assisted coding.

Core claim

The empirical study reveals that sustainability often decouples from accuracy in SLM-based code generation, allowing significant environmental optimizations without sacrificing performance. Chain-of-Thought prompting provides a near-optimal balance between reasoning capability and energy efficiency. Multi-sampling strategies often incur disproportionate costs for marginal gains. Grid carbon intensity is the dominant factor in deployment-time emissions.
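The grid-intensity claim is easy to make concrete: deployment-time emissions are the product of measured energy and the regional carbon intensity of electricity, so the same workload can differ severalfold in emissions by region alone. A minimal sketch of that conversion; the intensity values below are illustrative, not the paper's measurements:

```python
# CO2eq = energy consumed x grid carbon intensity.
# Intensity values are rough illustrative figures, not from the paper.

def co2eq_kg(energy_kwh: float, grid_intensity_kg_per_kwh: float) -> float:
    """CO2-equivalent emissions (kg) for a given energy draw and grid."""
    return energy_kwh * grid_intensity_kg_per_kwh

energy_kwh = 0.5  # hypothetical energy for one benchmark run

grids = {
    "low-carbon grid (hydro-heavy)": 0.03,  # kgCO2eq per kWh, illustrative
    "average grid": 0.40,
    "coal-heavy grid": 0.80,
}

for name, intensity in grids.items():
    print(f"{name}: {co2eq_kg(energy_kwh, intensity):.3f} kgCO2eq")
```

The spread across the three hypothetical grids is more than a factor of twenty, which is why grid intensity can dominate any prompting-level optimization.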

What carries the argument

Systematic evaluation of six prompting strategies across eleven SLMs on HumanEval+ and MBPP+ benchmarks, tracking Pass@1 accuracy together with energy consumption, carbon emissions, and inference latency.
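Pass@1 in this kind of benchmark tracking is presumably the standard unbiased estimator from the HumanEval lineage (Chen et al., reference [6]); a minimal sketch with hypothetical sample counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n = samples generated per task, c = samples passing all tests,
    k = evaluation budget. Returns the probability that at least one
    of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # fewer failures than the budget: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 10 samples for a task, 3 of which pass
print(pass_at_k(10, 3, 1))  # for k=1 this reduces to c/n
```

For k=1 the estimator collapses to the fraction of passing samples, which is why single-sample strategies can report Pass@1 directly from one generation per task.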

If this is right

  • Chain-of-Thought prompting achieves strong reasoning with relatively low energy consumption.
  • Multi-sampling techniques should be used sparingly because they add high costs for only small accuracy improvements.
  • Practitioners must consider local grid carbon intensity when estimating emissions from model deployment.
  • Prompt engineering offers a practical, parameter-free method to improve both performance and sustainability in code generation.
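The multi-sampling point in the list above can be illustrated with a toy cost-benefit calculation: energy grows roughly linearly with the number of samples, while the chance that any sample passes saturates. All numbers here are hypothetical, not the paper's measurements:

```python
# Toy model of the multi-sampling trade-off. Assumes independent samples
# and a fixed per-sample energy cost -- both simplifying assumptions.

def marginal_cost(samples: int, energy_per_sample_kwh: float,
                  acc_single: float) -> tuple[float, float]:
    """Return (total energy in kWh, probability any sample is correct)."""
    energy = samples * energy_per_sample_kwh
    accuracy = 1.0 - (1.0 - acc_single) ** samples
    return energy, accuracy

for k in (1, 5, 10):
    e, a = marginal_cost(k, 0.01, 0.60)
    print(f"k={k:2d}: energy {e:.2f} kWh, accuracy {a:.3f}")
```

Under these invented numbers, going from 1 to 10 samples multiplies energy by ten while accuracy gains less than 40 points and flattens out, which is the shape of the "disproportionate costs for marginal gains" finding.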

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt-selection features could be added to coding tools to automatically favor low-emission strategies based on the user's region.
  • The accuracy-energy decoupling observed here may appear in other generative AI tasks that rely on prompting.
  • Standardized testing protocols across hardware would help confirm how much the results depend on the specific lab conditions used.
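The first extension above could look something like this inside a coding tool: pick the lowest-emission strategy that still clears an accuracy floor, given the user's regional grid intensity. The strategy profiles and numbers below are invented for illustration, not drawn from the paper:

```python
# Hypothetical prompt-strategy selector. The (Pass@1, kWh-per-task)
# profiles are invented placeholders, not measured values.

STRATEGIES = {
    "zero-shot":        (0.55, 0.010),
    "chain-of-thought": (0.63, 0.014),
    "self-consistency": (0.65, 0.070),  # multi-sampling: far more energy
}

def pick_strategy(grid_kg_per_kwh: float, min_accuracy: float) -> str:
    """Lowest-emission strategy meeting the accuracy floor."""
    candidates = [(kwh * grid_kg_per_kwh, name)
                  for name, (acc, kwh) in STRATEGIES.items()
                  if acc >= min_accuracy]
    return min(candidates)[1]

print(pick_strategy(grid_kg_per_kwh=0.4, min_accuracy=0.60))
```

With these placeholder profiles, Chain-of-Thought wins whenever a 0.60 floor suffices; only a stricter floor forces the tool into the costlier multi-sampling strategy.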

Load-bearing premise

Energy and carbon measurements taken on the authors' specific hardware and under controlled benchmark conditions accurately reflect real-world developer usage patterns and typical prompts.

What would settle it

Direct energy and accuracy measurements collected from typical developer coding sessions on varied hardware and with everyday prompts would show whether the observed trade-offs hold outside the lab setup.

Figures

Figures reproduced from arXiv: 2604.02776 by Gias Uddin, Md Afif Al Mamun, Novarun Deb, Sayan Nath.

Figure 1
Figure 1. Evaluation framework to benchmark LLMs. view at source ↗
Figure 2
Figure 2. Comparison of Pass@1 accuracy and CO2 emissions on MBPP+ and HumanEval+ across models. view at source ↗
Figure 3
Figure 3. Bubble chart of mean Pass@1 accuracy versus energy. view at source ↗
Figure 5
Figure 5. Relationship between different sustainability factors across different prompting strategies. view at source ↗
Figure 6
Figure 6. Comparison of average token usage of different… view at source ↗
Figure 7
Figure 7. Comparison of parsing errors in different models. view at source ↗
read the original abstract

The shift from cloud-hosted Large Language Models (LLMs) to locally deployed open-source Small Language Models (SLMs) has democratized AI-assisted coding; however, it has also decentralized the environmental footprint of AI. While prompting strategies - such as Chain-of-Thought and ReAct - serve as external mechanisms for optimizing code generation without modifying model parameters, their impact on energy consumption and carbon emissions remains largely invisible to developers. This paper presents the first systematic empirical study investigating how different prompt engineering strategies in SLM-based code generation impact code generation accuracy alongside sustainability factors. We evaluate six prominent prompting strategies across 11 open-source models (ranging from 1B to 34B parameters) using the HumanEval+ and MBPP+ benchmarks. By measuring Pass@1 accuracy alongside energy (kWh), carbon emissions (kgCO2eq), and inference latency, we reveal that sustainability often decouples from accuracy, allowing significant environmental optimizations without sacrificing performance. Our findings indicate that Chain-of-Thought, being a simpler prompting technique, can provide a near-optimal balance between reasoning capability and energy efficiency. Conversely, multi-sampling strategies often incur disproportionate costs for marginal gains. Finally, we identify grid carbon intensity as the dominant factor in deployment-time emissions, highlighting the need for practitioners to consider regional energy profiles. This work provides a quantitative foundation for "green" prompt engineering, enabling developers to align high-performance code generation with ecological responsibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to conduct the first systematic empirical study on the environmental impact of prompt engineering strategies in small language model (SLM) based code generation. Using six prompting strategies on 11 open-source models (1B to 34B parameters) with HumanEval+ and MBPP+ benchmarks, it measures Pass@1 accuracy, energy consumption in kWh, carbon emissions in kgCO2eq, and inference latency. The key findings are that sustainability often decouples from accuracy, Chain-of-Thought provides a near-optimal balance of reasoning capability and energy efficiency, multi-sampling strategies incur high costs for marginal gains, and grid carbon intensity is the dominant factor in emissions.

Significance. If the empirical results hold under broader conditions, this work would provide a valuable quantitative foundation for green prompt engineering in AI-assisted coding. It makes the invisible environmental costs of prompting visible to developers and offers practical recommendations for balancing performance and sustainability, particularly by favoring simpler techniques like Chain-of-Thought. The emphasis on regional energy profiles adds an important dimension to deployment decisions in sustainable software engineering.

major comments (2)
  1. [Methodology] The paper's central claims about decoupling of sustainability from accuracy and the optimality of Chain-of-Thought rest on energy and carbon measurements collected under controlled, short-context, single-query conditions on the authors' specific hardware. The manuscript should provide more details on the measurement methodology, including hardware specifications, statistical controls, and how these metrics were computed, to substantiate the findings.
  2. [Results] The generalization of the observed decoupling to real-world developer usage patterns is not addressed. Factors such as longer contexts, multi-turn interactions, batching, CPU/GPU heterogeneity, and varying grid intensities could reorder the relative energy costs of different prompting techniques, which is a load-bearing concern for the practical recommendations.
minor comments (1)
  1. [Abstract] The abstract states the findings but does not specify the exact number of models or the benchmarks used, which would help readers quickly assess the scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and scope while preserving the integrity of the empirical findings.

read point-by-point responses
  1. Referee: [Methodology] The paper's central claims about decoupling of sustainability from accuracy and the optimality of Chain-of-Thought rest on energy and carbon measurements collected under controlled, short-context, single-query conditions on the authors' specific hardware. The manuscript should provide more details on the measurement methodology, including hardware specifications, statistical controls, and how these metrics were computed, to substantiate the findings.

    Authors: We agree that expanded methodological transparency will strengthen the paper. The original manuscript (Section 3) already specifies the use of CodeCarbon for energy tracking, the NVIDIA A100-based server hardware, and regional grid intensity values from Electricity Maps. In the revision we will add: (1) exact hardware specifications (CPU model, GPU power limits, memory configuration), (2) statistical protocol (five independent runs per configuration with mean and standard deviation reported for all metrics), and (3) the precise computation formulas for kWh-to-kgCO2eq conversion. These additions will be placed in a new subsection 3.4 to directly support the controlled-condition claims. revision: yes

  2. Referee: [Results] The generalization of the observed decoupling to real-world developer usage patterns is not addressed. Factors such as longer contexts, multi-turn interactions, batching, CPU/GPU heterogeneity, and varying grid intensities could reorder the relative energy costs of different prompting techniques, which is a load-bearing concern for the practical recommendations.

    Authors: We acknowledge the limitation of our controlled single-query design. The study deliberately isolates prompting effects under short-context conditions to obtain reproducible measurements; extending to multi-turn or batched workloads would require a substantially larger experimental campaign. In the revised manuscript we will add an explicit Limitations subsection (Section 5.3) that discusses how longer contexts, multi-turn interactions, batching, and hardware heterogeneity could alter relative costs, while noting that grid carbon intensity remains the dominant factor regardless of prompting strategy. We will also frame the practical recommendations as applying to the studied regime and flag broader validation as future work. This addresses the concern without overstating generalizability. revision: partial
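The statistical protocol promised in response (1) reduces to reporting mean and standard deviation over repeated runs per configuration. A minimal sketch of that reporting step, not the authors' code, with hypothetical energy readings:

```python
from statistics import mean, stdev

def summarize(runs: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over repeated measurements,
    matching a five-independent-runs protocol."""
    return mean(runs), stdev(runs)

# Hypothetical kWh readings from five independent runs of one
# strategy-model configuration
energy_runs = [0.512, 0.498, 0.505, 0.520, 0.495]
m, s = summarize(energy_runs)
print(f"energy: {m:.3f} ± {s:.3f} kWh")
```

Reporting the spread alongside the mean is what lets readers judge whether small accuracy-energy differences between strategies are within measurement noise.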

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements on public benchmarks

full rationale

The paper reports Pass@1 accuracy, kWh energy, kgCO2eq emissions, and latency measured directly on HumanEval+ and MBPP+ for 11 SLMs and six prompting strategies. No equations, fitted parameters, self-citations, or uniqueness theorems are used to derive the central claims; the decoupling observation and the CoT recommendation follow from the tabulated measurements themselves. The study is grounded in external public benchmarks, and by construction its conclusions are not fed back into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on the validity of HumanEval+ and MBPP+ as proxies for code-generation quality and on standard methods for measuring energy and carbon emissions; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption HumanEval+ and MBPP+ benchmarks provide a reliable measure of code generation accuracy
    Standard benchmarks in the field, invoked implicitly when reporting Pass@1 accuracy
  • domain assumption Energy and carbon measurements can be accurately obtained from hardware monitoring during inference
    Required for all reported kWh and kgCO2eq values
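The second assumption amounts to integrating sampled power draw over the inference window and converting joules to kilowatt-hours. A minimal sketch of that computation (the power readings are invented; real tools such as the CodeCarbon tracker the rebuttal mentions do this sampling internally):

```python
def energy_kwh(power_samples_w: list[float], interval_s: float) -> float:
    """Approximate energy by integrating sampled power draw (watts):
    joules = sum(P_i * dt), and 1 kWh = 3.6e6 J."""
    joules = sum(power_samples_w) * interval_s
    return joules / 3_600_000

# Hypothetical GPU power readings, one per second, over a 5 s inference
samples = [250.0, 300.0, 310.0, 305.0, 260.0]
print(energy_kwh(samples, interval_s=1.0))
```

The fidelity of every kWh and kgCO2eq figure in the paper then rests on how accurately the hardware counters report instantaneous power, which is exactly what this axiom asserts.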

pith-pipeline@v0.9.0 · 5566 in / 1311 out tokens · 48409 ms · 2026-05-13T20:09:02.584349+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 13 internal anchors

  1. [1]

    Phi-4 Technical Report

    M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, J. R. Lee, Y. T. Lee, Y. Li, W. Liu, C. C. T. Mendes, A. Nguyen, E. Price, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, X. Wang, R. Ward, Y. Wu, D. Yu, C. Zhang, and Y. Zhang. Phi-4 technical report, 2024

  2. [2]

    gpt-oss-120b & gpt-oss-20b Model Card

    S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  3. [3]

    What is the carbon footprint difference between running local large language models and cloud-based APIs per inference?

    Alibaba Group. What is the carbon footprint difference between running local large language models and cloud-based apis per inference? https://www.alibaba.com/product-insights/, Jan. 2026. Alibaba Product Insights Blog

  4. [4]

    Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming

    H. Ashraf, S. M. Danish, A. Leivadeas, Y. Otoum, and Z. Sattar. Energy-aware code generation with llms: Benchmarking small vs. large language models for sustainable ai programming.arXiv preprint arXiv:2508.08332, 2025

  5. [5]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  6. [6]

    M. Chen. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.arXiv preprint arXiv:2211.12588, 2022

  8. [8]

    Ai’s growing carbon footprint, June 2023

    Columbia Climate School, State of the Planet. Ai’s growing carbon footprint, June 2023. Accessed: 5 Dec 2025

  9. [9]

    R. Cruz, J. Contreras, F. Guerrero, E. Rodriguez, C. Valdez, and C. Carrillo. Prompt engineering and framework: implementation to increase code reliability based guideline for llms.arXiv preprint arXiv:2506.10989, 2025

  10. [10]

    Y. Dong, X. Jiang, J. Qian, T. Wang, K. Zhang, Z. Jin, and G. Li. A survey on code generation with llm-based agents.arXiv preprint arXiv:2508.00083, 2025

  11. [11]

    Z. Fu, F. Chen, S. Zhou, H. Li, and L. Jiang. Llmco2: Advancing accurate carbon footprint prediction for llm inferences.ACM SIGENERGY Energy Informatics Review, 5(2):63–68, 2025

  12. [12]

    Quantizing Large Language Models for Code Generation: A Differentiated Replication

    A. Giagnorio, A. Mastropaolo, S. Afrin, M. Di Penta, and G. Bavota. Quantizing large language models for code generation: A differentiated replication.arXiv preprint arXiv:2503.07103, 2025

  13. [13]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, and et al. The llama 3 herd of models, 2024

  14. [14]

    Software Carbon Intensity (SCI) Specification

    Green Software Foundation. Software Carbon Intensity (SCI) Specification. https://sci.greensoftware.foundation/, n.d. Accessed: 2026-01-23

  15. [15]

    M. M. Hasan, M. Waseem, K.-K. Kemell, J. Rasku, J. Ala-Rantala, and P. Abrahamsson. Assessing small language models for code generation: An empirical study with benchmarks. arXiv preprint arXiv:2507.03160, 2025

  16. [16]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  17. [17]

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  18. [18]

    E. J. Husom, A. Goknil, M. Astekin, L. K. Shar, A. Kåsen, S. Sen, B. A. Mithassel, and A. Soylu. Sustainable llm inference for edge ai: Evaluating quantized llms for energy efficiency, output accuracy, and inference latency. ACM Transactions on Internet of Things, 6(4):1–35, 2025

  19. [19]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

  20. [20]

    Is It Time to Treat Prompts as Code? A Multi-Use Case Study for Prompt Optimization Using DSPy

    F. Lemos, V. Alves, and F. Ferraz. Is it time to treat prompts as code? a multi-use case study for prompt optimization using dspy.arXiv preprint arXiv:2507.03620, 2025

  21. [21]

    B. Li, Y. Jiang, V. Gadepally, and D. Tiwari. Sprout: Green generative ai with carbon-efficient llm inference. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21799–21813, 2024

  22. [22]

    J. Li, G. Li, Y. Li, and Z. Jin. Structured chain-of-thought prompting for code generation.ACM Transactions on Software Engineering and Methodology, 34(2):1– 23, 2025

  23. [23]

    C. Liu, X. Bao, H. Zhang, N. Zhang, H. Hu, X. Zhang, and M. Yan. Improving chatgpt prompt for code generation.arXiv preprint arXiv:2305.08360, 2023

  24. [24]

    F. Liu, Y. Liu, L. Shi, H. Huang, R. Wang, Z. Yang, L. Zhang, Z. Li, and Y. Ma. Exploring and evaluating hallucinations in llm-powered code generation.arXiv preprint arXiv:2404.00971, 2024

  25. [25]

    StarCoder 2 and The Stack v2: The Next Generation

    A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024

  26. [26]

    X. Meng, Z. Ma, P. Gao, and C. Peng. An empirical study on llm-based agents for automated bug fixing.arXiv preprint arXiv:2411.10213, 2024

  27. [27]

    F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang. Clarifygpt: A framework for enhancing llm-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering, 1(FSE):2332–2354, 2024

  28. [28]

    Quantifying the energy consumption and carbon emissions of LLM inference via simulations

    M. Özcan, P. Wiesner, P. Weiß, and O. Kao. Quantifying the energy consumption and carbon emissions of llm inference via simulations. arXiv preprint arXiv:2507.11417, 2025

  29. [29]

    The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink

    D. Patterson, J. Gonzalez, U. Hölzle, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. R. So, M. Texier, and J. Dean. The carbon footprint of machine learning training will plateau, then shrink.Computer, 55(7):18–28, 2022

  30. [30]

    J. Roh, V. Gandhi, S. Anilkumar, and A. Garg. Break-the-chain: Reasoning failures in llms via adversarial prompting in code generation.arXiv preprint arXiv:2506.06971, 2025

  31. [31]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  32. [32]

    Prompt Engineering and Its Implications on the Energy Consumption of Large Language Models

    R. Rubei, A. Moussaid, C. Di Sipio, and D. Di Ruscio. Prompt engineering and its implications on the energy consumption of large language models.arXiv preprint arXiv:2501.05899, 2025

  33. [33]

    Distilled GPT for Source Code Summarization

    C.-Y. Su and C. McMillan. Distilled gpt for source code summarization.Automated Software Engineering, 31(1):22, 2024

  34. [34]

    EPiC: Cost-Effective Search-Based Prompt Engineering of LLMs for Code Generation

    H. Taherkhani, M. Sepindband, H. V. Pham, S. Wang, and H. Hemmati. Epic: Cost-effective search-based prompt engineering of llms for code generation. arXiv preprint arXiv:2408.11198, 2024

  35. [35]

    C. Team, H. Zhao, J. Hui, J. Howland, N. Nguyen, S. Zuo, A. Hu, C. A. Choquette-Choo, J. Shen, J. Kelley, K. Bansal, L. Vilnis, M. Wirth, P. Michel, P. Choy, P. Joshi, R. Kumar, S. Hashmi, S. Agrawal, Z. Gong, J. Fine, T. Warkentin, A. J. Hartman, B. Ni, K. Korevec, K. Schaefer, and S. Huffman. Codegemma: Open code models based on gemma, 2024

  36. [36]

    G. Team, A. Kamath, J. Ferret, et al. Gemma 3 technical report, 2025

  37. [37]

    Learn to Code Sustainably: An Empirical Study on LLM-Based Green Code Generation

    T. Vartziotis, I. Dellatolas, G. Dasoulas, M. Schmidt, F. Schneider, T. Hoffmann, S. Kotsopoulos, and M. Keckeisen. Learn to code sustainably: An empirical study on llm-based green code generation.arXiv preprint arXiv:2403.03344, 2024

  38. [38]

    Carbon Footprint Evaluation of Code Generation Through LLM as a Service

    T. Vartziotis, M. Schmidt, G. Dasoulas, I. Dellatolas, S. Attademo, V. D. Le, A. Wiechmann, T. Hoffmann, M. Keckeisen, and S. Kotsopoulos. Carbon footprint evaluation of code generation through llm as a service. In International Stuttgart Symposium, pages 230–241. Springer, 2024

  39. [39]

    C.-Y. Wang, A. DaghighFarsoodeh, and H. V. Pham. Selection of prompt engineering techniques for code generation through predicting code complexity. arXiv preprint arXiv:2409.16416, 2024

  40. [40]

    F. Wang, Z. Zhang, X. Zhang, Z. Wu, T. Mo, Q. Lu, W. Wang, R. Li, J. Xu, X. Tang, et al. A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness. ACM Transactions on Intelligent Systems and Technology, 16(6):1–87, 2025

  41. [41]

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024

  42. [42]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

  43. [43]

    Z. Wang, S. Liu, Y. Sun, H. Li, and K. Shen. Codecontests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

  44. [44]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  45. [45]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022

  47. [47]

    D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, et al. Least-to-most prompting enables complex reasoning in large language models.arXiv preprint arXiv:2205.10625, 2022

  48. [48]

    B. Zhu. Towards principled training and serving of large language models. 2025

  49. [49]

    Q. Zhu, D. Guo, Z. Shao, D. Yang, P. Wang, R. Xu, Y. Wu, Y. Li, H. Gao, S. Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence.arXiv preprint arXiv:2406.11931, 2024