pith. machine review for the scientific record.

arxiv: 2604.19305 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging

Dan Hao, Hao Tan, Jia Li, Kainan Li, Kunwu Zheng, Linhao Wu, Pengyu Xue, Xiran Lyu, Yifei Pei, Yizhou Chen, Zhen Yang, Zhonghang Lu

Pith reviewed 2026-05-10 02:19 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated program repair · large language models · self-directed debugging · runtime traces · patch refinement · Defects4J · simulated instrumentation

The pith

By simulating its own debugging steps, DebugRepair lets LLMs gather intermediate runtime states to produce more accurate program patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DebugRepair, a framework that equips large language models with self-directed debugging during automated program repair. It reduces noisy test context through semantic purification, inserts targeted debugging statements via simulated instrumentation with rule-based fallback, and refines patches in a conversation that incorporates prior attempts and fresh runtime observations. Existing feedback methods supply only outcome-level symptoms such as stack traces, leaving models without the intermediate evidence needed for reliable root-cause analysis. Experiments across Java and Python benchmarks demonstrate that this change yields higher repair counts than prior LLM-based approaches. A sympathetic reader would conclude that supplying models with structured runtime traces can close a key gap in how they diagnose and correct defects.

Core claim

DebugRepair improves LLM-based automated program repair by replacing sole reliance on test-failure outcomes with self-directed debugging that collects intermediate runtime states. The framework comprises test semantic purification to strip irrelevant context, simulated instrumentation that adds debugging statements and falls back to rule-based traces when needed, and debugging-driven conversational repair that iteratively updates candidate patches using both earlier attempts and newly observed states. On Defects4J this produces 224 correct fixes with GPT-3.5 (26.2 percent above the prior best LLM-based method) and 295 fixes with DeepSeek-V3 (59 bugs more than the second-best baseline), with similar gains on the other Java and Python benchmarks and an average improvement of 51.3 percent over vanilla settings across five additional backbone models.
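
To make the purification step concrete, a minimal sketch of a statement-level slice over the failing test is given below: keep the statements whose variables can reach the failing assertion, and repeat the backward pass until the set of required variables stops growing. The function and its regex heuristics are hypothetical illustrations, not the authors' implementation, which presumably operates on parsed code (the paper's reference list includes tree-sitter) rather than raw lines.

    import re

    def purify_failing_test(test_lines, failing_index):
        """Keep only test statements whose variables can reach the failing one.

        test_lines    : source lines of the failing test method
        failing_index : index of the statement where the failure is observed
        The backward pass repeats until the required-variable set stops
        growing, so statements with indirect effects are retained as well.
        """
        ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

        def names(line):
            return set(ident.findall(line))

        required = names(test_lines[failing_index])
        kept = {failing_index}
        changed = True
        while changed:
            changed = False
            for i in range(failing_index - 1, -1, -1):
                if i in kept or not (names(test_lines[i]) & required):
                    continue
                kept.add(i)                      # statement touches a required name
                if not names(test_lines[i]) <= required:
                    required |= names(test_lines[i])
                    changed = True               # new names appeared: repeat the pass
        return [test_lines[i] for i in sorted(kept)]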

What carries the argument

Simulated instrumentation combined with debugging-driven conversational repair, which inserts targeted statements to expose intermediate runtime states and feeds those states back into patch refinement.
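
Read as pseudocode, that mechanism might look like the sketch below: ask the model to instrument the suspicious code, run the failing tests to capture the printed intermediate states, then fold those observations and earlier attempts into the next repair prompt. The callables llm_complete and run_tests, and the prompt wording, are hypothetical stand-ins rather than the authors' code.

    def debug_repair(buggy_code, failing_tests, llm_complete, run_tests, max_rounds=5):
        """Hedged sketch of a self-directed debugging repair loop.

        llm_complete(prompt) -> str                : any chat-completion call
        run_tests(code, tests) -> (passed, trace)  : runs the tests and returns
            whether they all pass plus captured stdout / stack traces
        """
        # 1. Ask the model where to print intermediate state (simulated instrumentation).
        instrumented = llm_complete(
            "Insert print statements that expose intermediate values around the "
            "suspected fault. Change nothing else.\n\n" + buggy_code)
        _, runtime_trace = run_tests(instrumented, failing_tests)

        history = []                               # prior patch attempts
        for _ in range(max_rounds):
            # 2. Refine the patch using earlier attempts plus observed runtime states.
            prompt = ("Buggy code:\n" + buggy_code +
                      "\nObserved runtime states:\n" + runtime_trace +
                      "\nPrevious attempts:\n" + "\n---\n".join(history) +
                      "\nReturn a corrected version of the code.")
            patch = llm_complete(prompt)
            passed, trace = run_tests(patch, failing_tests)
            if passed:
                return patch                       # plausible patch: all tests pass
            history.append(patch)
            runtime_trace = trace                  # feed fresh observations forward
        return None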

If this is right

  • LLM repair performance rises by 51.3 percent over vanilla settings across five additional backbone models.
  • The method reaches state-of-the-art results against 15 prior approaches on three benchmarks spanning Java and Python.
  • With GPT-3.5 it fixes 224 Defects4J bugs, exceeding the previous best LLM-based result by 26.2 percent.
  • With DeepSeek-V3 it fixes 295 Defects4J bugs, exceeding the second-best baseline by 59 bugs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same intermediate-state collection strategy could be applied to non-LLM repair systems that currently use only test outcomes.
  • Models may benefit further if the simulation is replaced by actual debugger attachment that records richer execution traces.
  • The conversational refinement loop suggests a general pattern for improving LLM code generation tasks that require diagnosis rather than one-shot synthesis.

Load-bearing premise

The simulated instrumentation and rule-based debugging statements capture relevant intermediate runtime states without altering program semantics or injecting misleading artifacts.
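
One cheap guard on that premise, loosely in the spirit of the line-wise consistency check the paper describes around its instrumentation prompt (Figure 4), is to confirm that the instrumented file reduces back to the original once the inserted print lines are stripped. The sketch below assumes a crude prefix-based print detector and is illustrative only, not the authors' check.

    def only_added_prints(original_src: str, instrumented_src: str) -> bool:
        """True if instrumented_src equals original_src after dropping lines
        that are bare print/log statements (a stand-in for a line-wise
        equivalence check on normalized code)."""
        def normalize(src, drop_prints=False):
            kept = []
            for line in src.splitlines():
                stripped = line.strip()
                if not stripped:
                    continue
                if drop_prints and (stripped.startswith("print(") or
                                    stripped.startswith("System.out.println(")):
                    continue
                kept.append(stripped)
            return kept
        return normalize(original_src) == normalize(instrumented_src, drop_prints=True)

    # The instrumented version below only adds a print, so the check passes.
    orig = "def add(a, b):\n    return a + b\n"
    inst = "def add(a, b):\n    print(('a', a, 'b', b))\n    return a + b\n"
    assert only_added_prints(orig, inst)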

What would settle it

Remove the simulated instrumentation component from DebugRepair and measure whether the number of correctly fixed bugs on Defects4J drops below the full system's count for the same backbone model.
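
A minimal harness for that ablation might look like the following, assuming a hypothetical repair_bug that wraps one end-to-end DebugRepair run with the instrumentation step toggled on or off, and a hypothetical check_correct that compares the resulting patch against the developer fix.

    def ablate_instrumentation(bugs, repair_bug, check_correct):
        """Count correct fixes with and without simulated instrumentation for
        the same backbone model, so the difference isolates that component."""
        full_count, ablated_count = 0, 0
        for bug in bugs:
            if check_correct(bug, repair_bug(bug, use_instrumentation=True)):
                full_count += 1
            if check_correct(bug, repair_bug(bug, use_instrumentation=False)):
                ablated_count += 1
        return full_count, ablated_count   # e.g. over Defects4J; full should exceed ablated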

Figures

Figures reproduced from arXiv: 2604.19305 by Dan Hao, Hao Tan, Jia Li, Kainan Li, Kunwu Zheng, Linhao Wu, Pengyu Xue, Xiran Lyu, Yifei Pei, Yizhou Chen, Zhen Yang, Zhonghang Lu.

Figure 1. Motivation Example of DebugRepair.
Figure 2. Overview of DebugRepair.
Figure 3. An example to illustrate the necessity of repeated traversal.
Figure 4. Instrumentation Prompt Template.
Figure 5. Illustration of the Prompt Construction.
Figure 6. Bug Fix Venn Diagram on Defects4J (DebugRepair, RepairAgent, ThinkRepair, BaseChatGPT).
Figure 7. Bug Fix Venn Diagram on Defects4J (DebugRepair, ChatRepair, ContrastRepair, TSAPR).
Figure 8. An illustration of the self-directed debugging process on the Lang-6 bug.
Figure 9. Impact of hyper-parameter settings on the repair performance of DebugRepair.
read the original abstract

Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DebugRepair, a self-directed debugging framework for LLM-based automated program repair. It comprises test semantic purification to reduce noisy test context, simulated instrumentation to collect intermediate runtime traces via targeted debugging statements with rule-based fallback, and debugging-driven conversational repair to iteratively refine patches using prior attempts and observed runtime states. Evaluations on Defects4J and two additional Java/Python benchmarks against 15 baselines claim SOTA results, including 224 fixed bugs on Defects4J with GPT-3.5 (26.2% over prior LLM-based SOTA), 295 with DeepSeek-V3 (59 more than second-best), and 51.3% average improvement across five other backbone LLMs, supported by ablations.

Significance. If the results hold, this would be a solid contribution to APR by moving beyond outcome-level feedback (e.g., stack traces) to intermediate runtime evidence, enabling better root-cause analysis by LLMs. The consistent gains across multiple LLMs and the ablation support for each component indicate practical utility and generalizability within the current LLM-APR paradigm.

major comments (2)
  1. [§3.2] §3.2 (Simulated Instrumentation): No verification is reported that instrumented executions produce identical test-suite outcomes to the original programs. This check is required to confirm that rule-based debugging-statement insertion neither alters observable behavior nor injects spurious values, which is load-bearing for the claim that the collected states enable correct patches.
  2. [§4] §4 (Evaluation and Ablations): The ablation studies show component contributions but do not test whether the captured runtime states are causally tied to the bug root cause (versus incidental) or whether the conversational refinement would succeed without them; this weakens attribution of the 26.2% and 51.3% gains specifically to the debugging mechanism.
minor comments (2)
  1. The abstract states results on 'three benchmarks' but only names Defects4J explicitly; listing the other two (with sizes and bug counts) would improve clarity.
  2. [Figure 1] Figure 1 (framework overview) would benefit from explicit arrows showing how runtime states feed back into the conversational repair loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, acknowledging valid points where the manuscript can be strengthened and outlining planned revisions.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Simulated Instrumentation): No verification is reported that instrumented executions produce identical test-suite outcomes to the original programs. This check is required to confirm that rule-based debugging-statement insertion neither alters observable behavior nor injects spurious values, which is load-bearing for the claim that the collected states enable correct patches.

    Authors: We agree this verification is important. The manuscript does not report an explicit check confirming that instrumented executions produce identical test-suite outcomes. Our rule-based insertion targets non-intrusive debugging statements (e.g., prints for intermediate values) designed to avoid altering control flow or data, with fallback to original execution on failure. To address this, we will add a verification experiment in the revised manuscript, measuring outcome-equivalence rates across Defects4J and the other benchmarks before and after instrumentation; a sketch of such a check appears after these responses. This will directly support the fidelity of the collected runtime states. revision: yes

  2. Referee: [§4] §4 (Evaluation and Ablations): The ablation studies show component contributions but do not test whether the captured runtime states are causally tied to the bug root cause (versus incidental) or whether the conversational refinement would succeed without them; this weakens attribution of the 26.2% and 51.3% gains specifically to the debugging mechanism.

    Authors: The ablation studies demonstrate the contribution of each component, with the largest gains tied to the debugging-driven conversational repair that incorporates observed runtime states alongside prior attempts. However, we acknowledge the comment's validity: the current ablations do not isolate whether the specific runtime states are causally linked to root causes versus incidental, nor do they directly compare conversational success with versus without those states. We will add a targeted analysis in the revision, such as an additional ablation within the conversational repair phase that masks or removes the runtime state information, to better attribute the reported improvements (26.2% and 51.3%) to the debugging mechanism. revision: yes
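
The verification experiment promised in the first response could be reported as a single outcome-equivalence rate, sketched below. The helpers instrument and run_test are hypothetical stand-ins for the paper's tooling, not part of its artifact.

    def outcome_equivalence_rate(bugs, instrument, run_test):
        """Fraction of bugs whose per-test pass/fail verdicts are identical
        before and after inserting debugging statements.

        instrument(program) -> program with debugging statements added
        run_test(program, test) -> bool, True if the test passes
        """
        equivalent = 0
        for program, tests in bugs:
            before = {t: run_test(program, t) for t in tests}
            instrumented = instrument(program)
            after = {t: run_test(instrumented, t) for t in tests}
            equivalent += before == after
        return equivalent / len(bugs) if bugs else 1.0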

Circularity Check

0 steps flagged

No circularity in empirical claims or derivations

full rationale

The paper is an empirical study proposing a debugging framework for LLM-based APR and reporting success rates on public benchmarks (Defects4J and others) via direct comparison to external baselines. No equations, fitted parameters, or self-referential normalizations appear in the provided text; performance numbers are obtained from standard experimental runs rather than any derivation that reduces outputs to inputs by construction. Self-citations, if present, are not load-bearing for any central premise, and the method description relies on explicit components (test purification, simulated instrumentation, conversational repair) whose validity is assessed externally rather than defined circularly.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the domain assumption that LLMs will produce better patches when given intermediate runtime states, plus the heuristic design of the three components. No numerical free parameters are introduced. The framework itself and its sub-components are newly postulated entities whose value is demonstrated only through the reported experiments.

axioms (1)
  • domain assumption LLMs can leverage intermediate runtime states for more accurate root-cause diagnosis and patch refinement than outcome-level failure symptoms alone
    Invoked throughout the motivation and component design sections of the abstract.
invented entities (4)
  • DebugRepair framework no independent evidence
    purpose: Self-directed debugging pipeline for LLM-based APR
    Newly proposed end-to-end system; independent evidence is the benchmark results.
  • test semantic purification no independent evidence
    purpose: Reduce noisy test context before debugging
    Component invented for this approach.
  • simulated instrumentation no independent evidence
    purpose: Collect runtime traces via targeted debugging statements with rule-based fallback
    Component invented for this approach.
  • debugging-driven conversational repair no independent evidence
    purpose: Progressively refine patches using prior attempts and observed runtime states
    Component invented for this approach.

pith-pipeline@v0.9.0 · 5610 in / 1544 out tokens · 60927 ms · 2026-05-10T02:19:49.111042+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.

Reference graph

Works this paper leans on

51 extracted references · 7 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    [n. d.]. Models | OpenAI API. https://platform.openai.com/docs/models#gpt-3-5-turbo [Online; accessed 2026-01-20]

  2. [2]

    [n. d.]. SiliconFlow – AI Infrastructure for LLMs & Multimodal Models. https://www.siliconflow.com/ [Online; accessed 2026-01-20]

  3. [3]

    [n. d.]. tree-sitter/tree-sitter: An incremental parsing system for programming tools. https://github.com/tree-sitter/tree-sitter [Online; accessed 2026-03-13]

  4. [4]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134 (2024)

  5. [5]

    Mark Chen. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  6. [6]

    Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. 2019. Sequencer: Sequence-to-sequence learning for end-to-end program repair. IEEE Transactions on Software Engineering 47, 9 (2019), 1943–1959

  7. [7]

    Xiang Gao, Bo Wang, Gregory J Duck, Ruyi Ji, Yingfei Xiong, and Abhik Roychoudhury. 2021. Beyond tests: Program vulnerability repair via crash constraint extraction. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–27

  8. [8]

    Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2018. Automatic software repair: A survey. In Proceedings of the 40th International Conference on Software Engineering. 1219–1219

  9. [9]

    Ali Ghanbari, Samuel Benton, and Lingming Zhang. 2019. Practical program repair via bytecode mutation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 19–30

  10. [11]

    Haichuan Hu, Ye Shang, Weifeng Sun, and Quanjun Zhang. 2025. TSAPR: A Tree Search Framework For Automated Program Repair. arXiv preprint arXiv:2507.01827 (2025)

  11. [12]

    Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Sketchfix: a tool for automated program repair approach using lazy candidate generation. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 888–891

  12. [13]

    Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. 298–309

  13. [14]

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1430–1442

  14. [15]

    Nan Jiang, Thibaud Lutellier, Yiling Lou, Lin Tan, Dan Goldwasser, and Xiangyu Zhang. 2023. Knod: Domain knowledge distilled tree decoder for automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1251–1263

  15. [16]

    Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. Cure: Code-aware neural machine translation for automatic program repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1161–1173

  16. [17]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023)

  17. [18]

    René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis. 437–440

  18. [19]

    Jiaolong Kong, Xiaofei Xie, Mingfei Cheng, Shangqing Liu, Xiaoning Du, and Qi Guo. 2025. Contrastrepair: Enhancing conversation-based automated program repair via contrastive test case pairs. ACM Transactions on Software Engineering and Methodology 34, 8 (2025), 1–31

  19. [20]

    Xuan-Bach D Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017. S3: syntax- and semantic-guided repair synthesis via programming by examples. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 593–604

  20. [21]

    Xuan Bach D Le, David Lo, and Claire Le Goues. 2016. History driven program repair. In 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), Vol. 1. IEEE, 213–224

  21. [22]

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2011. Genprog: A generic method for automatic software repair. IEEE Transactions on Software Engineering 38, 1 (2011), 54–72

  22. [23]

    Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65

  23. [24]

    Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. Dlfix: Context-based code transformation learning for automated program repair. In Proceedings of the ACM/IEEE 42nd international conference on software engineering. 602–614

  24. [25]

    Yi Li, Shaohua Wang, and Tien N Nguyen. 2022. Dear: A novel deep learning-based approach for automated program repair. In Proceedings of the 44th international conference on software engineering. 511–523

  25. [26]

    Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56

  26. [27]

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Avatar: Fixing semantic bugs with fix patterns of static analysis violations. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 1–12

  27. [28]

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. TBar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis. 31–42

  28. [29]

    Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 166–178

  29. [30]

    Matias Martinez and Martin Monperrus. 2016. Astor: A program repair library for java. In Proceedings of the 25th international symposium on software testing and analysis. 441–444

  30. [31]

    Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th international conference on software engineering. 691–701

  31. [32]

    Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. 2023. Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 146–158

  32. [33]

    Yunkun Wang, Yue Zhang, Guochang Li, Chen Zhi, Binhua Li, Fei Huang, Yongbin Li, and Shuiguang Deng. 2026. InspectCoder: Dynamic Analysis-Driven Self Repair through Interactive LLM-Debugger Collaboration. Proceedings of the ACM on Programming Languages 10, OOPSLA1 (2026), 1041–1069

  33. [34]

    Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-aware patch generation for better automated program repair. In Proceedings of the 40th international conference on software engineering. 1–11

  34. [35]

    Chunqiu Steven Xia, Yifeng Ding, and Lingming Zhang. 2023. The plastic surgery hypothesis in the era of large language models. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 522–534

  35. [36]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494

  36. [37]

    Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 959–971

  37. [38]

    Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 819–831

  38. [39]

    Pengyu Xue, Linhao Wu, Zhen Yang, Chengyi Wang, Xiang Li, Yuxiang Zhang, Jia Li, Ruikai Jin, Yifei Pei, Zhaoyan Shen, et al. 2025. ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 1421–1444

  39. [40]

    Pengyu Xue, Linhao Wu, Zhen Yang, Zhongxing Yu, Zhi Jin, Ge Li, Yan Xiao, Shuo Liu, Xinyi Li, Hongyi Lin, et al

  40. [41]

    Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing. arXiv preprint arXiv:2410.07516 (2024)

  41. [42]

    Pengyu Xue, Linhao Wu, Zhongxing Yu, Zhi Jin, Zhen Yang, Xinyi Li, Zhenyu Yang, and Yue Tan. 2024. Automated commit message generation with large language models: An empirical study and beyond. IEEE Transactions on Software Engineering (2024)

  42. [43]

    Pengyu Xue, Kunwu Zheng, Zhen Yang, Yifei Pei, Linhao Wu, Jiahui Dong, Xiapu Luo, Yan Xiao, Fei Liu, Yuxuan Zhang, et al. 2025. TransLibEval: Demystify Large Language Models’ Capability in Third-party Library-targeted Code Translation. arXiv preprint arXiv:2509.12087 (2025)

  43. [44]

    Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li

  44. [45]

    Exploring and unleashing the power of large language models in automated code translation. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1585–1608

  45. [46]

    Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. 2024. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing 4, 2 (2024), 100211

  46. [47]

    He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, and Martin Monperrus. 2022. Selfapr: Self-supervised program repair with test execution diagnostics. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–13

  47. [48]

    He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural program repair with execution-based backpropagation. In Proceedings of the 44th international conference on software engineering. 1506–1518

  48. [49]

    Xin Yin, Chao Ni, Shaohua Wang, Zhenhao Li, Limin Zeng, and Xiaohu Yang. 2024. Thinkrepair: Self-directed automated program repair. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1274–1286

  49. [50]

    Jiayi Zhang, Kai Huang, Jian Zhang, Yang Liu, and Chunyang Chen. 2025. Repair Ingredients Are All You Need: Improving Large Language Model-Based Program Repair via Repair Ingredients Search. arXiv preprint arXiv:2506.23100 (2025)

  50. [51]

    Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023. A survey of learning-based automated program repair. ACM Transactions on Software Engineering and Methodology 33, 2 (2023), 1–69

  51. [52]

    Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 341–353