To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

David Lo; Junhua Zhu; Li Li; Mingyi Zhou; Renyu Yang; Xin Wang; Zhensu Sun; Zhihao Lin

arxiv: 2606.26978 · v1 · pith:QGYSKZZYnew · submitted 2026-06-25 · 💻 cs.SE

Pith reviewed 2026-06-26 03:50 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM program repair · code execution · cost effectiveness · repair success · execution restrictions · token costs · wall clock time

The pith

Execution restrictions have little effect on repair success but save substantial costs in LLM program repair agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the value of code execution in agents that generate, run, and revise patches for fixing code. It finds that agents run tests frequently but that completely prohibiting execution reduces success rates by only about one percentage point on leading models, with no statistical significance, while cutting token and time expenses. The benefit of execution turns out to be limited to particular cases rather than helping evenly. These results indicate that execution should be viewed as a costly resource whose use needs explicit justification instead of being applied by default.

Core claim

Execution restrictions have little effect on repair success: on commercial agents with state-of-the-art models the resolve-rate gap between prohibited and unrestricted is only 1.25 percentage points and not statistically significant, while prohibited saves substantial token and wall-clock cost. Execution benefit is concentrated rather than uniform.

What carries the argument

Comparison of four execution paradigms in end-to-end repair attempts to quantify differences in success and resource use.

If this is right

Repair success rates remain nearly unchanged when execution is prohibited for top-performing agents.
Token consumption and wall-clock time decrease substantially under restricted execution rules.
Benefits from running code appear only on a subset of tasks instead of across the board.
Agents currently execute tests without regard to whether the action is likely to improve outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repair agents could add logic to skip executions on instances where they are unlikely to help, based on timing or other indicators.
The same selective-use approach might reduce waste in other domains where LLMs call external tools repeatedly.
Large-scale deployment of these agents would become more practical if cost controls like execution bans were adopted by default.
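
The selective-execution idea in the first extension above can be sketched as an expected-value gate. This is a minimal illustration, not anything the paper implements: the `ExecutionEstimate` fields, the `should_execute` function, and the assumption that an agent can estimate these quantities at all are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionEstimate:
    """Hypothetical per-step estimates an agent might maintain."""

    p_changes_outcome: float  # estimated chance that a test run alters the final patch
    value_of_fix: float       # value assigned to a resolved task, in cost units
    run_cost: float           # token plus wall-clock cost of one run, same units


def should_execute(est: ExecutionEstimate) -> bool:
    """Run tests only when the expected gain exceeds the cost of the run."""
    if not 0.0 <= est.p_changes_outcome <= 1.0:
        raise ValueError("p_changes_outcome must be a probability")
    expected_gain = est.p_changes_outcome * est.value_of_fix
    return expected_gain > est.run_cost
```

How such a probability would be estimated (timing within the trajectory, patch size, or other indicators) is left open here, as it is in the extension itself.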

Load-bearing premise

The chosen repair tasks, agents, and execution rules are representative of LLM-based program repair in general.

What would settle it

Finding a statistically significant success rate difference exceeding a few percentage points when testing the same paradigms on a wider collection of tasks or agents would show the claim does not hold.

Figures

Figures reproduced from arXiv: 2606.26978 by David Lo, Junhua Zhu, Li Li, Mingyi Zhou, Renyu Yang, Xin Wang, Zhensu Sun, Zhihao Lin.

Original abstract

LLM-based agents for program repair are increasingly built on a "generate-run-revise" paradigm, iteratively executing tests to evaluate and refine patches. This execution-based approach has become standard practice in state-of-the-art systems. However, executions can be time-consuming and expensive, yet their impact on these agents remains underexplored. In this paper, we conduct a two-stage empirical study over execution behavior in LLM-based program repair. To characterize execution behavior at scale, we first analyze 7,745 agent traces from SWE-bench leaderboard submissions. Second, we evaluate 3,000 end-to-end repair attempts across 200 SWE-bench instances and three agents (Claude Code, Codex, and the open-source OpenCode) under four execution paradigms, which allows for a fine-grained comparison of performance and cost. Our analysis reveals three key observations: (1) Code execution is used across all agents and models analyzed, with an average of 8.8 test runs per task. Execution behavior varies substantially across agents and models, with frequency ranging from 2 to 19 per task, and late-stage executions consistently achieve higher success rates than early-stage ones. (2) Execution restrictions have little effect on repair success: on commercial agents with SOTA models the resolve-rate gap between Prohibited and Unrestricted is only 1.25 percentage points and not statistically significant, while Prohibited saves substantial token and wall-clock cost. (3) Execution benefit is concentrated rather than uniform. These patterns suggest that current agents apply execution indiscriminately, paying its cost on instances where it provides little benefit. Execution, therefore, should be treated as a resource with an explicit cost-benefit tradeoff, not a default capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds that execution can be restricted with little resolve-rate loss on the tested agents, but the 200-instance, three-agent sample limits how far that finding travels.

The main point here is that on commercial agents with strong models, prohibiting code execution entirely costs only about 1.25 percentage points in resolve rate on SWE-bench tasks, a difference that is not statistically significant, while cutting token and wall-clock costs. The first-stage analysis of 7,745 traces adds some scale by showing that agents average 8.8 test runs per task, with frequency ranging from 2 to 19 and later runs succeeding more often.

What the work actually delivers is a controlled comparison across four execution paradigms on 3,000 attempts with three agents. That setup lets them quantify the cost-benefit gap directly rather than just speculating. The observation that benefit is concentrated on certain instances is a useful practical signal for agent designers who currently run tests by default.

The soft spot is the narrow base for the central claim. The 200 instances and three agents (Claude Code, Codex, OpenCode) come from a space where the trace data already shows high sensitivity to agent choice. Without details on how those 200 were picked or whether they balance cases where late execution matters, the small gap and the "indiscriminate use" conclusion may not hold outside this slice. The post-experiment concentration analysis also rests on the same limited runs.

This is for researchers tuning LLM repair agents who need data on when execution pays off. It deserves peer review because the empirical measurement is targeted and the numbers are concrete, even if broader validation would strengthen it.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a two-stage empirical study on code execution in LLM-based program repair. Stage 1 analyzes 7,745 traces from SWE-bench leaderboard submissions, finding an average of 8.8 test runs per task with substantial variation (2–19) across agents/models and higher success for late-stage executions. Stage 2 runs 3,000 end-to-end repair attempts on 200 SWE-bench instances using three agents (Claude Code, Codex, OpenCode) under four execution paradigms. The central claims are that the resolve-rate gap between Prohibited and Unrestricted execution is only 1.25 pp and statistically non-significant on commercial SOTA agents, that Prohibited yields substantial token and wall-clock savings, and that execution benefit is concentrated rather than uniform, implying current agents apply execution indiscriminately.

Significance. If the controlled comparison and concentration finding hold under representative sampling and transparent statistics, the work supplies concrete evidence that execution should be treated as a costly resource with explicit tradeoffs rather than a default capability. The scale of the trace analysis (7,745 traces) and the multi-paradigm, multi-agent design are clear strengths for an empirical measurement study in this domain.

major comments (3)

[Experimental setup] Experimental setup (likely §4): The selection procedure for the 200 SWE-bench instances is not described (random sample from the full benchmark, stratified by difficulty or by execution frequency observed in the 7,745-trace corpus, or convenience sample). Given the first-stage finding that execution frequency already varies from 2 to 19 per task across agents, this choice is load-bearing for the generalizability of the 1.25 pp gap and the “concentrated benefit” conclusion.
[Results] Results on resolve-rate comparison (likely §5): The claim that the 1.25 pp gap between Prohibited and Unrestricted is “not statistically significant” is presented without the test statistic, exact p-value, confidence interval, per-condition sample sizes, or correction for multiple comparisons. These details are required to evaluate whether the non-significance supports the central claim that restrictions have “little effect.”
[Results / Discussion] Analysis of concentrated benefit (likely §5 or discussion): The observation that benefit is concentrated rests on post-experiment partitioning of instances. No pre-specified definition of concentration, robustness checks across alternative partitions, or comparison against a null model of uniform benefit is reported, weakening the inference that agents apply execution “indiscriminately.”

minor comments (2)

[Abstract] Abstract: The absolute resolve rates for Prohibited and Unrestricted conditions should be stated alongside the 1.25 pp difference to allow readers to assess practical magnitude.
[Introduction / Methods] Terminology: The four execution paradigms are introduced in the abstract but their precise definitions (e.g., when execution is allowed or blocked) should be repeated with a short table or bullet list in the main text for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the rigor of our empirical study. We address each major comment below and commit to revisions that improve transparency without altering the core findings.

Referee: [Experimental setup] Experimental setup (likely §4): The selection procedure for the 200 SWE-bench instances is not described (random sample from the full benchmark, stratified by difficulty or by execution frequency observed in the 7,745-trace corpus, or convenience sample). Given the first-stage finding that execution frequency already varies from 2 to 19 per task across agents, this choice is load-bearing for the generalizability of the 1.25 pp gap and the “concentrated benefit” conclusion.

Authors: We acknowledge the omission in the original manuscript. The 200 instances were drawn as a uniform random sample from the SWE-bench test set (with no stratification) to enable direct comparison across agents and paradigms. To address generalizability concerns given the observed variation in execution frequency, we will expand §4 with an explicit description of the sampling method, the rationale for a random (rather than stratified) draw, and a brief discussion of how the first-stage trace analysis informed but did not dictate the second-stage sample. revision: yes
Referee: [Results] Results on resolve-rate comparison (likely §5): The claim that the 1.25 pp gap between Prohibited and Unrestricted is “not statistically significant” is presented without the test statistic, exact p-value, confidence interval, per-condition sample sizes, or correction for multiple comparisons. These details are required to evaluate whether the non-significance supports the central claim that restrictions have “little effect.”

Authors: We agree that the supporting statistical details were missing. The non-significance claim rests on a McNemar test for paired binary outcomes across the 200 instances (n=200 per condition). In the revision we will report the test statistic, exact p-value, 95% CI for the 1.25 pp difference, per-condition sample sizes, and confirm that no multiple-comparison correction applies because this was the pre-planned primary contrast between the two extreme paradigms. These additions will allow readers to assess the strength of the “little effect” interpretation directly. revision: yes
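
For readers unfamiliar with the test named in this response, the following is a minimal sketch of an exact two-sided McNemar test on paired binary outcomes. It is not the authors' analysis code: the function names `discordant_counts` and `mcnemar_exact` are hypothetical, and the choice of the exact binomial form over the chi-square approximation is an assumption, since the response does not say which variant was used.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(unrestricted: Sequence[bool],
                      prohibited: Sequence[bool]) -> Tuple[int, int]:
    """Count instances resolved under exactly one of the two paradigms."""
    if len(unrestricted) != len(prohibited):
        raise ValueError("outcomes must be paired per instance")
    b = sum(u and not p for u, p in zip(unrestricted, prohibited))
    c = sum(p and not u for u, p in zip(unrestricted, prohibited))
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact p-value; under H0 each discordant pair is a fair coin flip."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2.0 * tail)
```

Only the discordant pairs enter the test, so a small resolve-rate gap can be non-significant simply because few instances flip between conditions. That is one reason the referee's request for the counts and a confidence interval matters.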
Referee: [Results / Discussion] Analysis of concentrated benefit (likely §5 or discussion): The observation that benefit is concentrated rests on post-experiment partitioning of instances. No pre-specified definition of concentration, robustness checks across alternative partitions, or comparison against a null model of uniform benefit is reported, weakening the inference that agents apply execution “indiscriminately.”

Authors: The concentration finding is indeed post-hoc. While exploratory analyses are common in measurement studies, we accept that stronger evidentiary standards are warranted. In the revision we will (1) state an explicit, pre-specified definition of concentration (instances in the top quintile of benefit accounting for ≥80% of total benefit), (2) report robustness results under two alternative partitions (quartiles and deciles), and (3) add a simple null-model simulation that redistributes the observed benefit uniformly across instances to quantify how extreme the observed concentration is. These changes will be placed in §5 and the discussion. revision: yes
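
The concentration definition and null-model check proposed in this response can be sketched as follows. This is one possible reading, not the authors' procedure: the functions `top_share` and `null_p_value` are hypothetical, per-instance benefit is assumed to be a non-negative integer count (negative values are clipped to zero), and "redistributing uniformly" is interpreted as assigning each unit of benefit to a uniformly random instance.

```python
import math
import random
from typing import Sequence


def top_share(benefits: Sequence[float], fraction: float = 0.2) -> float:
    """Share of total positive benefit held by the top `fraction` of instances."""
    gains = sorted((max(b, 0.0) for b in benefits), reverse=True)
    total = sum(gains)
    if total == 0:
        return 0.0
    k = max(1, math.ceil(fraction * len(gains)))
    return sum(gains[:k]) / total


def null_p_value(benefits: Sequence[int], fraction: float = 0.2,
                 trials: int = 2000, seed: int = 0) -> float:
    """Fraction of uniform-redistribution trials at least as concentrated as observed."""
    rng = random.Random(seed)
    n = len(benefits)
    units = sum(max(b, 0) for b in benefits)
    observed = top_share(benefits, fraction)
    hits = 0
    for _ in range(trials):
        simulated = [0] * n
        for _ in range(units):
            simulated[rng.randrange(n)] += 1  # each unit lands on a random instance
        if top_share(simulated, fraction) >= observed:
            hits += 1
    return hits / trials
```

Under this reading, the stated criterion corresponds to `top_share(benefits, 0.2) >= 0.8`, and the quartile and decile robustness checks correspond to `fraction=0.25` and `fraction=0.1`.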

Circularity Check

0 steps flagged

No circularity: empirical measurement on external benchmarks

The paper is a two-stage empirical study analyzing 7,745 existing agent traces from SWE-bench and running controlled experiments on 200 SWE-bench instances across three agents under four execution paradigms. No derivations, equations, fitted parameters, or self-citations are used to derive claims; all results are direct measurements against the external SWE-bench benchmark and real traces. The central observations (resolve-rate gaps, cost savings, execution frequency) are computed from the collected data without reduction to inputs by construction. This matches the default case of a self-contained empirical paper with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a pure empirical study that relies on the existing SWE-bench benchmark and standard statistical comparison methods; it introduces no new free parameters, invented entities, or non-standard axioms.

axioms (1)

domain assumption: The selected agents, instances, and execution paradigms are representative for drawing conclusions about LLM-based program repair in general.
The study generalizes from three agents and 200 instances to broader claims about execution cost-effectiveness.


[1] [1]

Anthropic. 2025. Claude Code: An agentic coding tool for the terminal. https://www.anthropic.com/claude-code Accessed: 2026-01-23. To Run or Not to Run 21

2025

[2] [2]

Sangmin Bae. 2025. Accelerating Large Language Model Inference via Early-Exiting Algorithms.arXiv preprint arXiv:2509.05915(2025)

arXiv 2025

[3] [3]

Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). 2188–2200. https://doi.org/10.1109/ICSE55347.2025.00157

work page doi:10.1109/icse55347.2025.00157 2025

[4] [4]

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling.arXiv preprint arXiv:2302.01318(2023)

Pith/arXiv arXiv 2023

[5] [5]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[6] [6]

arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374

Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG] https://arxiv.org/abs/2107.03374

Pith/arXiv arXiv

[7] [7]

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug. InProceedings of the International Conference on Learning Representations (ICLR)

2024

[8] [8]

Viet-Tung Do, Xuan-Quang Nguyen, Van-Khanh Hoang, Duy-Hung Nguyen, Shahab Sabahi, Jeff Yang, Hajime Hotta, Minh-Tien Nguyen, and Hung Le. 2025. Automatic prompt selection for large language models. (2025), 91–102

2025

[9] [9]

Pengfei Gao and Chao Peng. 2025. More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents. arXiv:2510.16786 [cs.SE] https://arxiv.org/abs/2510.16786

arXiv 2025

[10] [10]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence.arXiv preprint arXiv:2401.14196(2024)

Pith/arXiv arXiv 2024

[11] [11]

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber

[12] [12]

InProceedings of the International Conference on Learning Representations (ICLR)

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. InProceedings of the International Conference on Learning Representations (ICLR)

[13] Qiushi Huang, Xubo Liu, Tom Ko Zheng, Zhaocheng Liu, Wenwu Zhao, Mark Sherblom, Yvonne Coady, and Wenwu Wang. 2024. Selective Prompting Tuning for Personalized Conversations with LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.

[14] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL] https://arxiv.org/abs/2409.12186

[15] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. In Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE). 1161–1173.

[16] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In International Conference on Learning Representations.

[17] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In International Conference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 54107–54157. https://proceedings.iclr....

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP).

[19] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2011. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering 38, 1, 54–72.

[20] Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. In International Conference on Machine Learning. PMLR, 19274–19286.

[21] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm... 2022.

[22] Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing Context to Enhance Inference Efficiency of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 6342–6353. https://doi.org/10.18653/v1/2023.emnlp-main.391

[23] Shanchao Liang, Spandan Garg, and Roshanak Zilouchian Moghaddam. 2025. The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason. arXiv:2506.12286 [cs.SE] https://arxiv.org/abs/2506.12286

[24] Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. TBar: Revisiting Template-based Automated Program Repair. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 31–42.

[25] Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. SIGPLAN Not. 51, 1, 298–312. https://doi.org/10.1145/2914770.2837617

[26] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2024. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In Proceedings of the International Conference on Learning Representations (ICLR).

[27] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. CoCoNuT: Combining Context-Aware Neural Translation Models using Ensemble for Program Repair. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 101–114.

[28] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Proceedings of the Conference on Neural...

[29] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 2 (1947), 153–157.

[30] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2024. OctoPack: Instruction Tuning Code Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR).

[31] OpenAI. 2025. OpenAI Codex: The AI Agent for Software Engineering. https://openai.com/codex Accessed: 2026-01-23.

[32] OpenAI. 2026. Why we no longer evaluate SWE-bench Verified. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/ Accessed: 2026-04-28.

[33] OpenCode Contributors. 2025. OpenCode: An Open-Source AI Coding Agent for the Terminal. https://github.com/opencode-ai/opencode v1.4.0, accessed 2026-04-09.

[34] Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Ruhle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. 2024. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. In Findings of the Association for Computational Linguistics ACL 2024. 963–981. https...

[35] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).

[36] Donald J Schuirmann. 1987. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics 15, 6 (1987), 657–680.

[37] Chaofan Tao, Jierun Chen, Yuxin Jiang, Kaiqi Kou, Shaowei Wang, Ruoyu Wang, Xiaohui Li, Sidi Yang, Yiming Du, Jianbo Dai, Zhiming Mao, Xinyu Wang, Lifeng Shang, and Haoli Bai. 2026. SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving. arXiv preprint arXiv:2601.01426 (2026).

[38] Nguyen Phu Vinh, Anh Chung Hoang, Chris Ngo, and Truong-Son Hy. 2025. Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles. arXiv preprint arXiv:2506.08173 (2025).

[39] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable Code Actions Elicit Better LLM Agents. In Proceedings of the International Conference on Machine Learning (ICML).

[40] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).

[41] Yunkun Wang, Yue Zhang, Guochang Li, Chen Zhi, Binhua Li, Fei Huang, Yongbin Li, and Shuiguang Deng. 2025. InspectCoder: Dynamic Analysis-Enabled Self Repair through Interactive LLM-Debugger Collaboration. arXiv preprint arXiv:2510.18327 (2025).

[42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.

[43] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-Based Software Engineering Agents. Proc. ACM Softw. Eng. 2, FSE, Article FSE037 (June 2025), 24 pages. https://doi.org/10.1145/3715754

[44] Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 819–831.

[45] Qi Xin, Haojun Wu, Steven P. Reiss, and Jifeng Xuan. 2024. Towards Practical and Useful Automated Program Repair for Debugging. arXiv preprint arXiv:2407.08958 (2024).

[46] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Curran Associates Inc., Red Hook, NY, ...

[47] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS).

[48] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. 2023. Planning with Large Language Models for Code Generation. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=Lr8cOOtYbfL

[49] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 1592–1604.

[50] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Computing Mac... doi:10.1145/3468264.3468544