Babbling Suppression: Making LLMs Greener One Token at a Time
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3
The pith
Babbling Suppression stops LLM code generation once tests pass, cutting energy use by up to 65% without losing accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Babbling Suppression integrates test execution into the LLM generation process by evaluating intermediate outputs and terminating generation once a solution passes all tests. This prevents excessive token generation while having no impact on model accuracy. An empirical study was conducted across two Python and two Java benchmarks, targeting four 3-4B parameter models and six 6-7B parameter models; babbling was observed across all models, with higher frequency in Java than in Python.
What carries the argument
Babbling Suppression (BS), a model-agnostic technique that evaluates intermediate code outputs against test suites during generation to decide when to stop producing more tokens.
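The paper does not publish its decoding harness, but the loop described above can be sketched as a minimal illustration: `next_token` is a hypothetical decoder callback and `run_tests` is a toy assert-based stand-in for whatever harness the authors used (the rebuttal below states that tests are invoked after every token once a minimum prefix length is reached, which the sketch mirrors).

```python
# Minimal sketch of a Babbling Suppression decode loop (illustration, not the
# authors' implementation). A real harness would sandbox execution and enforce
# timeouts, a point the referee's minor comments below also raise.

def run_tests(candidate_src: str, tests_src: str) -> bool:
    """Return True iff the candidate code passes the given assert-based tests."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function(s)
        exec(tests_src, namespace)       # run the asserts against them
        return True
    except Exception:
        return False                     # syntax errors or failed asserts on a partial prefix

def generate_with_bs(next_token, prompt, tests_src, max_tokens=512, min_prefix=16):
    """Decode token by token; stop as soon as the partial program passes all tests."""
    tokens = []
    for step in range(max_tokens):
        token = next_token(prompt, tokens)       # hypothetical decoder callback
        if token is None:                        # model emitted end-of-sequence
            break
        tokens.append(token)
        if step + 1 >= min_prefix and run_tests("".join(tokens), tests_src):
            break                                # all tests pass: further tokens would be babbling
    return "".join(tokens)

# Toy usage: a "model" that emits a correct solution and then keeps babbling.
pieces = ["def add(a, b):\n", "    return a + b\n", "# redundant commentary...\n", None]
fake_decoder = lambda _prompt, toks: pieces[len(toks)]
print(generate_with_bs(fake_decoder, "write add", "assert add(2, 3) == 5", min_prefix=1))
```

In the toy run the loop stops after the second token chunk, so the trailing commentary token is never generated; that skipped tail is exactly what the paper calls babbling.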
Load-bearing premise
That running tests on intermediate outputs adds negligible overhead relative to the token savings and that the chosen benchmarks' test suites are representative enough to detect correct solutions without false positives or negatives that would alter termination behavior.
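One way to make this premise concrete is a per-problem break-even condition: the GPU energy of the tokens avoided must exceed the total energy spent on test invocations. The quantities below are invented for illustration; none are reported in the abstract.

```python
# Break-even check for the premise above (all numbers invented for illustration).
def net_saving_joules(e_per_token, tokens_saved, e_per_test_run, test_runs):
    """Positive result: early stopping saves more energy than the extra test runs cost."""
    return e_per_token * tokens_saved - e_per_test_run * test_runs

# Example: avoiding 400 tokens at 0.5 J each vs. 350 test invocations at 0.2 J each.
print(net_saving_joules(0.5, 400, 0.2, 350))  # 200.0 - 70.0 = 130.0 J net saving
```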
What would settle it
Compare total energy and final accuracy on a new set of problems where the test suites are deliberately weak enough to accept incorrect early outputs or where test execution time exceeds the saved generation time.
Original abstract
Context: Large Language Models (LLMs) are increasingly used in modern software development, aiding in code generation, code completion, and refactoring through AI-powered assistants. While they accelerate development workflows, they often produce extraneous output, referred to as "babbling", which incurs additional cognitive, economic, and energy costs. Objective: This work investigates the problem of babbling in LLM-based code generation and proposes a practical, model-agnostic approach to reduce unnecessary output without compromising solution accuracy. Method: We introduce Babbling Suppression (BS), a method that integrates test execution into the LLM generation process by evaluating intermediate outputs and terminating generation once a solution passes all tests. This prevents excessive token generation while having no impact on model accuracy. An empirical study was conducted across two Python and two Java benchmarks, targeting four 3-4B parameter models and six 6-7B parameter models. Results: Our findings show that babbling occurs across all tested models, with higher frequency in Java than in Python. Applying BS significantly reduces energy consumption by up to 65% for Python and 62% for Java in models prone to babbling. Across 40 model-benchmark pairs, 29 showed reduced mean energy consumption, with reductions exceeding 20% in 22 cases. Generated token count decreased in 35 pairs, while the GPU energy-per-token overhead of BS remained below 10% for 26 pairs, decreased for 2, and reached a maximum of 24%, yielding net energy savings in most cases. Implications: BS can make AI-assisted programming more efficient and sustainable by reducing energy consumption and minimizing cognitive effort by developers. Its model-agnostic design allows easy integration, suggesting broad applicability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Babbling Suppression (BS), a model-agnostic technique that interleaves execution of benchmark test suites on partial LLM-generated code during decoding and terminates generation as soon as an intermediate output passes all tests. It reports that babbling occurs in all tested models (higher in Java), that BS reduces mean energy consumption in 29 of 40 model-benchmark pairs (exceeding 20 % in 22 cases), with peak reductions of 65 % (Python) and 62 % (Java), while claiming no accuracy loss and GPU energy-per-token overhead below 10 % in most pairs.
Significance. If the net energy savings survive full accounting for test-execution overhead and test-suite reliability, the work supplies a practical, immediately deployable intervention that directly addresses the energy footprint of LLM-based code generation. The scale of the study (40 pairs across 3-7 B models and two languages) and the explicit quantification of overhead provide a useful empirical baseline for follow-on work on sustainable code assistants.
major comments (3)
- [Energy Measurement and Results] The central net-energy claim rests on the assumption that repeated test-suite execution adds negligible overhead relative to avoided token generation. The manuscript reports GPU energy-per-token overhead reaching 24 % in some pairs but does not state whether CPU energy consumed by test execution is measured and subtracted from the reported savings, nor does it specify the exact frequency of test invocations (every token, every k tokens, or on EOS). Without these details the headline reductions cannot be verified as net positive.
- [Method and Accuracy Evaluation] The claim of “no impact on model accuracy” requires that test suites never produce false-positive passes on incomplete or incorrect prefixes. The paper does not report any diagnostic on partial-solution false positives, nor does it describe controls (e.g., minimum test coverage thresholds or manual inspection of early-termination cases) that would rule out premature termination on non-solutions.
- [Results] Across the 40 pairs the manuscript states that token count decreased in 35 cases and energy decreased in 29, yet it does not provide a per-pair breakdown showing that the observed energy reduction exceeds the measured overhead in every case where savings are claimed. A single table or figure that nets overhead against gross savings is needed to substantiate the “net energy savings in most cases” statement; a sketch of such a netting computation follows this list.
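A minimal sketch of that netting computation, assuming one record per model-benchmark pair with baseline energy, BS energy, and measured test-execution overhead; the column names and the two example rows are invented, not taken from the paper.

```python
# Per-pair netting of gross savings against overhead (invented example data).
EXAMPLE_ROWS = [
    {"pair": "model-A/py-bench-1",   "e_baseline_j": 900.0, "e_bs_j": 560.0, "e_overhead_j": 60.0},
    {"pair": "model-B/java-bench-1", "e_baseline_j": 700.0, "e_bs_j": 690.0, "e_overhead_j": 40.0},
]

def per_pair_net(rows):
    table = []
    for r in rows:
        gross = r["e_baseline_j"] - r["e_bs_j"]   # gross GPU saving from fewer tokens
        net = gross - r["e_overhead_j"]           # subtract measured test-execution overhead
        table.append((r["pair"], gross, net, net > 0))
    return table

for pair, gross, net, positive in per_pair_net(EXAMPLE_ROWS):
    print(f"{pair}: gross {gross:.0f} J, net {net:.0f} J, net positive: {positive}")
```

In this toy data the second pair shows a small gross saving but a negative net saving once overhead is subtracted, which is exactly the distinction such a table should make visible.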
minor comments (3)
- [Introduction] The term “babbling” is introduced without a precise operational definition (e.g., tokens generated after a passing solution appears). A short formal definition would improve reproducibility; one possible form is sketched after this list.
- [Abstract and Results] The abstract and results text use “up to 65 %” and “up to 62 %” without clarifying whether these maxima are achieved on the same model-benchmark pair or are the single largest observed values; a parenthetical note would remove ambiguity.
- [Method] No mention is made of the exact test-framework harness used (pytest, JUnit, etc.) or of any timeout or resource limits placed on test execution; these parameters affect both overhead and reliability.
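A minimal sketch of the operational definition suggested in the first minor comment: count every token emitted after the earliest prefix that already passes the full test suite. The `passes_tests` argument is a hypothetical harness (e.g., the `run_tests` stub sketched earlier on this page).

```python
# Illustrative operational definition of babbling: tokens generated after the
# earliest prefix that already passes all tests.
def babbling_token_count(tokens, tests_src, passes_tests):
    for i in range(1, len(tokens) + 1):
        if passes_tests("".join(tokens[:i]), tests_src):
            return len(tokens) - i   # everything after the first passing prefix is babbling
    return 0                         # no prefix ever passes, so nothing counts as babbling
```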
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We value the constructive criticism provided, which helps improve the clarity and rigor of our work on Babbling Suppression. Below, we address each major comment in detail, indicating where revisions will be made to the manuscript.
Point-by-point responses
Referee: [Energy Measurement and Results] The central net-energy claim rests on the assumption that repeated test-suite execution adds negligible overhead relative to avoided token generation. The manuscript reports GPU energy-per-token overhead reaching 24 % in some pairs but does not state whether CPU energy consumed by test execution is measured and subtracted from the reported savings, nor does it specify the exact frequency of test invocations (every token, every k tokens, or on EOS). Without these details the headline reductions cannot be verified as net positive.
Authors: We agree that additional details on the energy measurement protocol are required to fully substantiate the net savings claims. The current manuscript emphasizes GPU energy for the decoding process, which constitutes the bulk of the energy use. We will revise the Methods and Results sections to specify that test invocations occur after every token (following an initial minimum length to prevent overhead on trivial prefixes). CPU energy for test execution was not measured or subtracted, as our focus was on GPU metrics for LLM inference; we will explicitly note this limitation and provide an estimate that CPU overhead is small relative to the avoided GPU token generation. A new paragraph will explain the net calculation as total GPU energy with BS versus without, incorporating the per-token overhead. revision: yes
Referee: [Method and Accuracy Evaluation] The claim of “no impact on model accuracy” requires that test suites never produce false-positive passes on incomplete or incorrect prefixes. The paper does not report any diagnostic on partial-solution false positives, nor does it describe controls (e.g., minimum test coverage thresholds or manual inspection of early-termination cases) that would rule out premature termination on non-solutions.
Authors: We clarify that the accuracy metric is the proportion of problems solved correctly, where correctness is defined by passage of the full test suite. Since BS only stops generation upon test passage, the output is always a verified correct solution, preserving the accuracy exactly as in the baseline (no incorrect solutions are accepted). To address potential concerns about false positives, we will add to the revised manuscript a discussion of our verification: experimental logs showed no instances of non-solutions passing tests early, consistent with the design of the benchmarks where tests require complete functionality. We will include a brief description of the test suites' coverage and note that no additional thresholds were applied beyond full passage. revision: yes
Referee: [Results] Across the 40 pairs the manuscript states that token count decreased in 35 cases and energy decreased in 29, yet it does not provide a per-pair breakdown showing that the observed energy reduction exceeds the measured overhead in every case where savings are claimed. A single table or figure that nets overhead against gross savings is needed to substantiate the “net energy savings in most cases” statement.
Authors: We concur that a detailed per-pair net analysis is necessary for transparency. We will incorporate into the revised Results section a supplementary table listing for all 40 model-benchmark pairs the average token count, gross energy consumption, overhead percentage, and net energy savings. This will clearly show that in the 29 pairs with reduced energy, the savings surpass the overhead. The underlying data from our experiments supports this presentation and will be used to create the table. revision: yes
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential reductions
Full rationale
The paper describes an empirical method (Babbling Suppression) that runs tests on intermediate LLM outputs to terminate generation early, then reports measured token counts and GPU energy across 40 model-benchmark pairs. No equations, fitted parameters, or predictions appear; results are direct experimental outcomes on fixed benchmarks. No self-citations are invoked as load-bearing premises, and the central claims rest on observed deltas rather than any definitional or fitted-input reduction. This matches the default case of a non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: test suites provided with the benchmarks are sufficient to determine functional correctness of partial generations.
invented entities (1)
- Babbling: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · “We introduce Babbling Suppression (BS), a method that integrates test execution into the LLM generation process by evaluating intermediate outputs and terminating generation once a solution passes all tests.”
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · relevance unclear · “Applying BS significantly reduces energy consumption by up to 65% for Python and 62% for Java”