pith. machine review for the scientific record.

arxiv: 2604.24212 · v1 · submitted 2026-04-27 · 💻 cs.SE

Recognition: unknown

Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:16 UTC · model grok-4.3

classification 💻 cs.SE
keywords autonomous debugging agents · program repair · dynamic analysis · LLM agents · SWE-bench · debugging interface · Frame Lifetime Trace · software engineering

The pith

A function-level debugging interface lets basic agents resolve 63.8% of SWE-bench Verified repair tasks at low cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Agent-centric Debugging Interface (ADI) as a way to give autonomous repair agents more effective dynamic feedback. Standard debuggers force line-by-line exchanges that quickly exhaust token budgets and trap agents in loops. ADI replaces them with function-level traces and high-level navigation commands, so agents can inspect state and steer execution without fine-grained, per-line inspection. A basic agent using only ADI resolves 63.8% of SWE-bench Verified tasks, slightly above a heavily engineered baseline, and adding ADI to stronger agents yields consistent gains of 6.2% to 18.5% in resolved tasks.
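
To make the cost asymmetry concrete, here is a minimal, runnable sketch of recording execution at function granularity with Python's sys.settrace. This is an editorial illustration, not the paper's implementation; the cstack stand-in only echoes the astropy-12907 motivating example, and the record layout is invented.

```python
import sys
from collections import defaultdict

# Sketch only: surface one record per function invocation (arguments at entry,
# locals and return value at exit) instead of the per-line exchanges a
# pdb-style REPL would require.
traces = defaultdict(list)

def function_level_tracer(frame, event, arg):
    name = frame.f_code.co_name
    if event == "call":
        traces[name].append({"args": dict(frame.f_locals)})
    elif event == "return":
        traces[name][-1].update(locals=dict(frame.f_locals), returned=arg)
    return function_level_tracer  # stay attached; 'line' events are ignored

def cstack(left, right):
    """Stand-in for the buggy helper in the astropy-12907 example."""
    cright = [1] * len(right)  # the suspicious fill-with-ones
    return [l + r for l, r in zip(left, cright)]

sys.settrace(function_level_tracer)
cstack([1, 2], [3, 4])
sys.settrace(None)

# One function-level record answers what a long transcript of pdb
# next/print round trips would: entry arguments, final locals, return value.
print(traces["cstack"])
```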

Core claim

The paper claims that an agent-centric debugging interface built around a Frame Lifetime Trace data structure and a small set of high-level navigational commands supplies exactly the execution information LLM agents need for program repair. This interface replaces the cost-inefficient line-by-line interaction of conventional debuggers, enabling a simple agent to achieve 63.8% resolution on SWE-bench Verified at an average cost of $1.28 per task and delivering additive gains when plugged into existing state-of-the-art agents.

What carries the argument

Agent-centric Debugging Interface (ADI): a function-level interaction paradigm powered by the Frame Lifetime Trace, a data structure that records stateful execution within each function, together with high-level navigational commands that let the agent move and inspect state without line-by-line requests.
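
The abstract does not spell out the FLT schema, so as a reading aid, here is a hypothetical sketch of what a per-invocation trace with high-level navigation could look like. Every field and method name here (Step, inspect, step_into) is invented for illustration; none is taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    line: int                # source line executed
    changed: dict[str, Any]  # variables whose values changed at this step

@dataclass
class FrameLifetimeTrace:
    qualname: str            # e.g. "_cstack#2", the second invocation of _cstack
    args: dict[str, Any]     # arguments captured at frame entry
    steps: list[Step] = field(default_factory=list)
    returned: Any = None
    callees: list["FrameLifetimeTrace"] = field(default_factory=list)

    # High-level queries replace pdb's breakpoint/next/print loop:
    def inspect(self, name: str) -> list[tuple[int, Any]]:
        """Every value a variable took across the frame's lifetime."""
        return [(s.line, s.changed[name]) for s in self.steps if name in s.changed]

    def step_into(self, callee: str) -> "FrameLifetimeTrace":
        """Descend into a recorded child invocation by qualified name."""
        return next(t for t in self.callees if t.qualname == callee)
```

On the Figure 6 example, an agent would issue a single query like flt.inspect("cright") and receive the variable's whole history for the _cstack#2 invocation, where a conventional debugger would need a breakpoint plus repeated step-and-print round trips.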

If this is right

  • Basic agents equipped only with ADI reach 63.8% resolution on SWE-bench Verified and slightly exceed the performance of the optimized Claude-Tools agent.
  • The same interface added to existing SOTA agents produces consistent gains between 6.2% and 18.5% on resolved tasks.
  • Average per-task cost is $1.28 with Claude-Sonnet-3.7, suggesting that high-level commands cut token consumption relative to traditional debugger interaction.
  • ADI works as a plug-and-play module that can be added to existing agent architectures without changes to the underlying model or workflow.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the trace abstraction proves sufficient for most bugs, similar high-level interfaces could be designed for other agent-driven tasks such as test generation or vulnerability discovery.
  • The approach implies that LLM agents benefit more from curated, function-scoped views of execution state than from exhaustive low-level traces.
  • Developers of repair benchmarks could add new tasks that explicitly test whether function-level traces are enough or whether cross-function data flows still require extra instrumentation.

Load-bearing premise

That the Frame Lifetime Trace and high-level navigational commands together contain all information an agent requires to diagnose and repair complex multi-function bugs without needing line-by-line inspection or extra low-level feedback.
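
To see what could stress that premise, consider an invented bug (not from the paper) whose defect lives in state shared across frames rather than in any one frame's locals; each frame's own trace looks self-consistent, and the diagnosis requires relating a write in one frame lifetime to a read in another.

```python
_cache: dict[str, int] = {}  # module-level state shared across frames

def scale(key: str, factor: int) -> int:
    # BUG: the cache key ignores `factor`, so a later call with a different
    # factor silently returns a stale value computed by an earlier frame.
    if key not in _cache:
        _cache[key] = len(key) * factor
    return _cache[key]

print(scale("abc", 2))  # 6 -- cached under "abc"
print(scale("abc", 5))  # 6 again, expected 15: the second frame's trace is
                        # unremarkable in isolation; the root cause is the
                        # first frame's write, visible only across lifetimes
```

Whether the trace records such cross-frame reads and writes, and whether the navigational commands let an agent correlate them across invocations, is exactly the kind of case that would test this premise.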

What would settle it

A set of repair tasks in which agents using only ADI repeatedly fail while agents given full line-by-line variable inspection or additional low-level traces succeed.

Figures

Figures reproduced from arXiv: 2604.24212 by Hongliang Tian, Jiahong Xiang, Xiaopan Chu, Xiaoyang Xu, Yuqun Zhang.

Figure 1: A real-world automated program repair task from SWE-bench.
Figure 2: BaseAgent's and BaseAgent_pdb's debugging processes on the astropy-12907 task.
Figure 3: An efficient debugging process on the astropy-12907 task. A targeted query over the suspicious code region yields unambiguous evidence of the root cause: the cright array is incorrectly filled with ones. With that confirmation, the agent formulates the correct patch, matching the logic of the official developer-written fix.
Figure 4: Traditional debugger REPL interaction and the function frame.
Figure 5: Agent-centric Debugging Interface framework.
Figure 6: The Frame Lifetime Trace (FLT) of the _cstack#2 invocation from the astropy-12907 task.
Figure 7: The prompt of ADI used by FramePilot, illustrating how the interface is exposed to the agent through function calls.
Figure 8: Invocation rate of ADI on the SWE-bench Lite and Verified benchmarks.
Figure 9: Diagnosing the state mismatch in django-11119 [6]: a subtle, non-crashing bug in which the template Engine fails to apply its autoescape setting, causing template variables to be rendered without proper HTML escaping.
read the original abstract

Autonomous agents for automated program repair represent a promising frontier in software engineering, yet their effectiveness is often hindered by reliance on post-mortem, coarse-grained execution feedback. While integrating traditional interactive debuggers seems a natural solution, their low-level, line-by-line interaction paradigm turns out to be cost-inefficient for LLM-based agents, leading to exhausted budgets and unproductive loops. To mitigate this, we introduce Agent-centric Debugging Interface (ADI), a novel agent-centric debugging interface designed for cost-efficient, end-to-end autonomous interaction. Specifically, Agent-centric Debugging Interface realizes a function-level interaction paradigm, powered by our Frame Lifetime Trace, a comprehensive data structure encapsulating a function's stateful execution trace, and a set of high-level navigational commands. Our extensive evaluation on the SWE-bench benchmark demonstrates the effectiveness and efficiency of ADI. By simply equipping a basic agent with ADI, it successfully resolves 63.8% of the tasks on the SWE-bench Verified set, even slightly outperforming the highly optimized and high-investment Claude-Tools agent, at an average cost of USD 1.28 per task with Claude-Sonnet-3.7. Furthermore, we demonstrate ADI's generality by integrating it as a plug-and-play component into existing SOTA agents, delivering consistent gains ranging from 6.2% to 18.5% on the resolved tasks. These results indicate that Agent-centric Debugging Interface can provide a general and efficient enhancement for existing autonomous agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Agent-centric Debugging Interface (ADI), a function-level debugging paradigm for LLM-based autonomous agents that uses a Frame Lifetime Trace data structure to encapsulate stateful execution and high-level navigational commands instead of line-by-line interaction. On the SWE-bench Verified benchmark, a basic agent equipped with ADI resolves 63.8% of tasks (slightly outperforming the optimized Claude-Tools baseline) at an average cost of USD 1.28 per task with Claude-Sonnet-3.7; integrating ADI as a plug-and-play module into existing SOTA agents yields consistent gains of 6.2% to 18.5% in resolved tasks.

Significance. If the empirical results hold under rigorous validation, the work offers a meaningful advance in software engineering by making dynamic analysis practical and cost-effective for autonomous program repair agents. The plug-and-play integration results and clear cost/resolution figures on a standard benchmark are particular strengths that could influence agent design more broadly.

major comments (2)
  1. [§4 (Evaluation)] The headline 63.8% resolution rate, outperformance of Claude-Tools, and 6.2–18.5% gains are reported without details on experimental controls, number of independent runs, variance due to LLM non-determinism, or statistical significance tests. This is load-bearing for the central empirical claim.
  2. [§3.1 (Frame Lifetime Trace)] No ablation study or failure-case breakdown is provided on whether the trace preserves all information needed for multi-function or side-effect bugs (e.g., untraced globals, I/O, or cross-boundary state). The reported gains rest on the assumption that function-level traces plus navigational commands suffice without low-level feedback; this requires explicit validation.
minor comments (2)
  1. [Abstract] Specify the exact baseline numbers for Claude-Tools and confirm whether the 63.8% figure applies to the full SWE-bench Verified set.
  2. [§3.2] Notation for the high-level navigational commands could be formalized with a small table or grammar to improve reproducibility.
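
For concreteness, the formalization the second minor comment asks for might look like the following sketch; the command names are hypothetical stand-ins, not the paper's actual API.

```python
from enum import Enum

# Hypothetical command surface for illustration; NOT the paper's API.
class Nav(Enum):
    LIST_INVOCATIONS = "list_invocations"  # enumerate recorded frames of a function
    OPEN_FRAME = "open_frame"              # fetch the FLT of one invocation, e.g. _cstack#2
    INSPECT_VAR = "inspect_var"            # history of one variable within a frame
    STEP_INTO = "step_into"                # descend to a recorded callee's FLT
    STEP_OUT = "step_out"                  # return to the caller's FLT

def parse(command: str) -> tuple[Nav, list[str]]:
    """E.g. 'inspect_var _cstack#2 cright' -> (Nav.INSPECT_VAR, ['_cstack#2', 'cright'])."""
    op, *args = command.split()
    return Nav(op), args
```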

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the rigor of our empirical evaluation and the analysis of our proposed data structure. We address each of the major comments below, indicating the revisions we plan to make.

read point-by-point responses
  1. Referee: The headline 63.8% resolution rate, outperformance of Claude-Tools, and 6.2–18.5% gains are reported without details on experimental controls, number of independent runs, variance due to LLM non-determinism, or statistical significance tests. This is load-bearing for the central empirical claim.

    Authors: We agree that the current presentation lacks sufficient details on experimental controls and statistical rigor to fully substantiate the claims given LLM non-determinism. In the revised manuscript, we will expand the evaluation section to report results from multiple independent runs (using different random seeds), including means and standard deviations for the resolution rates and costs. We will also perform and report statistical significance tests (such as t-tests) for the observed gains over baselines (a sketch of such an analysis appears after these responses). Furthermore, we will provide clearer documentation of the experimental setup and controls used. revision: yes

  2. Referee: No ablation study or failure-case breakdown is provided on whether the trace preserves all information needed for multi-function or side-effect bugs (e.g., untraced globals, I/O, or cross-boundary state). The reported gains rest on the assumption that function-level traces plus navigational commands suffice without low-level feedback; this requires explicit validation.

    Authors: We agree that an explicit validation of the Frame Lifetime Trace's coverage for various bug types would strengthen the paper. Although the SWE-bench tasks encompass multi-function and side-effect bugs, we did not include a dedicated ablation or failure breakdown. In the revision, we will incorporate a failure-case analysis that categorizes the tasks based on bug characteristics (e.g., involvement of globals, I/O, cross-function state) and discusses the information preserved by the trace. We will also analyze cases where low-level feedback might have been beneficial. This addresses the need for explicit validation while building on the existing end-to-end results. revision: partial
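
A sketch of the analysis promised in the first response, with invented numbers for illustration only (these are not results from the paper): run-level means and standard deviations plus a paired t-test across matched seeds, with McNemar's exact test on per-task binary outcomes as the stricter alternative.

```python
import statistics
from scipy import stats  # SciPy assumed available; any stats package works

# Invented run-level resolution rates -- NOT data from the paper.
adi_runs  = [0.641, 0.636, 0.640, 0.633, 0.638]  # agent + ADI, five seeded runs
base_runs = [0.602, 0.611, 0.598, 0.607, 0.604]  # baseline agent, same seeds

print(f"ADI      mean={statistics.mean(adi_runs):.3f} sd={statistics.stdev(adi_runs):.4f}")
print(f"baseline mean={statistics.mean(base_runs):.3f} sd={statistics.stdev(base_runs):.4f}")

# Paired t-test across matched seeds, as the response proposes.
res = stats.ttest_rel(adi_runs, base_runs)
print(f"paired t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```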

Circularity Check

0 steps flagged

No circularity; empirical results on external benchmark

full rationale

The paper introduces ADI with Frame Lifetime Trace and high-level commands, then reports direct empirical outcomes on SWE-bench Verified (63.8% resolution, cost figures, and plug-in gains of 6.2–18.5%). No equations, parameter fits, or derivations appear; the central claims are measured performance numbers against an external benchmark rather than quantities that reduce by construction to the paper's own inputs or self-citations. The evaluation is self-contained and falsifiable via the benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new constructs (ADI and Frame Lifetime Trace) whose correctness and completeness are asserted rather than derived from prior independent evidence.

invented entities (2)
  • Agent-centric Debugging Interface (ADI) · no independent evidence
    purpose: Provide cost-efficient function-level interaction for LLM agents.
    New interface proposed in the paper; no prior citation or independent validation mentioned.
  • Frame Lifetime Trace · no independent evidence
    purpose: Encapsulate a function's stateful execution trace for high-level navigation.
    New data structure introduced to support the interface; no external evidence of sufficiency provided in the abstract.

pith-pipeline@v0.9.0 · 5573 in / 1312 out tokens · 31152 ms · 2026-05-08T03:16:34.444874+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1] 2024. Whether using test patch is allowed. GitHub issue #16, SWE-bench/experiments. https://github.com/swe-bench/experiments/issues/16
  2. [2] 2024-02-29. Hugging Face. https://huggingface.co
  3. [3] 2024-02-29. OpenAI API. https://openai.com/api
  4. [4] 2025. astropy-12907 GitHub issue. https://github.com/astropy/astropy/issues/12906
  5. [5] 2025. Astropy GitHub repository. https://github.com/astropy/astropy
  6. [6] 2025. django-11119 task PR. https://github.com/django/django/pull/11119/
  7. [7] 2025. GitHub repository. https://github.com/GhabiX/ADI
  8. [8] Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, et al. 2024. Enigma: Enhanced interactive generative model agent for CTF challenges. arXiv preprint arXiv:2409.16165 (2024).
  9. [9] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007). IEEE, 89–98.
  10. [10] Alibaba Cloud. 2025. Qwen3 API Price. https://www.alibabacloud.com/help/zh/model-studio/models. Accessed: 2025-07-12.
  11. [11] Anthropic. 2024. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2025-07-15.
  12. [12] Anthropic. 2025. Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet. Accessed: 2025-07-15.
  13. [13] Anthropic. 2025. Claude API Documentation. https://docs.anthropic.com/en/home. Accessed: 2025-06-30.
  14. [14] Anthropic. 2025. Raising the Bar on SWE-bench Verified with Claude 3.5 Sonnet. https://www.anthropic.com/engineering/swe-bench-sonnet. Accessed: 2025-09-11.
  15. [15] Yasharth Bajpai, Bhavya Chopra, Param Biyani, Cagri Aslan, Dustin Coleman, Sumit Gulwani, Chris Parnin, Arjun Radhakrishna, and Gustavo Soares. 2024. Let's fix this together: Conversational debugging with GitHub Copilot. In 2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 1–12.
  16. [16] Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
  17. [17] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An autonomous, LLM-based agent for program repair. arXiv preprint arXiv:2403.17134 (2024).
  18. [18] Islem Bouzenia, Yangruibo Ding, Kexin Pei, Baishakhi Ray, and Michael Pradel. 2023. TraceFixer: Execution trace-driven program repair. arXiv preprint arXiv:2304.12743 (2023).
  19. [19] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
  20. [20] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 423–435.
  21. [21] Konstantin Grotov, Artem Borzilov, Maksim Krivobok, Timofey Bryksin, and Yaroslav Zharov. 2024. Debug smarter, not harder: AI agents for error resolution in computational notebooks. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 363–371.
  22. [22] Yirui He, Ziyao He, Syed Fatiul Huq, and Sam Malek. 2026. ReFLAIR: Detecting Responsive Layout Reflow Issues using Multimodal Generative AI. Proceedings of the ACM on Software Engineering 3, FSE (2026). doi:10.1145/3808136
  23. [23] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79.
  24. [24] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770 (2023).
  25. [25] Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2025. Explainable automated debugging via large language model-driven scientific debugging. Empirical Software Engineering 30, 2 (2025), 45.
  26. [26] Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323.
  27. [27] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  28. [28] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931.
  29. [29] Kyla H Levin, Nicolas van Kempen, Emery D Berger, and Stephen N Freund. 2025. ChatDBG: Augmenting Debugging with Large Language Models. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1892–1913.
  30. [30] Hongwei Li, Yuheng Tang, Shiqi Wang, and Wenbo Guo. 2025. PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework. arXiv preprint arXiv:2502.02747 (2025).
  31. [31] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2023. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems 36 (2023), 42330–42357.
  32. [32] Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977 (2024).
  33. [33] Zhengyao Liu, Yunlong Ma, Jingxuan Xu, Junchen Ai, Xiang Gao, Hailong Sun, and Abhik Roychoudhury. 2025. Agent That Debugs: Dynamic State-Guided Vulnerability Repair. arXiv preprint arXiv:2504.07634 (2025).
  34. [34] Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li. 2024. Lingma SWE-GPT: An open development-process-centric language model for automated software improvement. arXiv preprint arXiv:2411.00622 (2024).
  35. [35] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2025-07-15.
  36. [36] OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/. Accessed: 2025-06-23.
  37. [37] Python Docs. 2025. Python ctypes. https://docs.python.org/3/library/ctypes.html
  38. [38] Python Docs. 2025. Python Frame Objects. https://docs.python.org/3/c-api/frame.html
  39. [39] Python Software Foundation. 2023. Python 3 Glossary — qualified name. https://docs.python.org/3/glossary.html#term-qualified-name
  40. [40] Python Software Foundation. 2025. pdb — The Python Debugger. https://docs.python.org/3/library/pdb.html
  41. [41] Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2025. SpecRover: Code intent extraction via LLMs. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 963–974.
  42. [42] Richard Stallman, Roland Pesch, Stan Shebs, et al. 1988. Debugging with GDB. Free Software Foundation 675 (1988).
  43. [43] Hanzhuo Tan, Weihao Li, Xiaolong Tian, Siyi Wang, Jiaming Liu, Jing Li, and Yuqun Zhang. 2025. SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin. arXiv preprint arXiv:2509.22114 (2025).
  44. [44] Hanzhuo Tan, Qi Luo, Ling Jiang, Zizheng Zhan, Jing Li, Haotian Zhang, and Yuqun Zhang. 2024. Prompt-based code completion via multi-retrieval augmented generation. ACM Transactions on Software Engineering and Methodology (2024).
  45. [45] Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. 2024. LLM4Decompile: Decompiling binary code with large language models. 3473–3487.
  46. [46] Hanzhuo Tan, Xiaolong Tian, Hanrui Qi, Jiaming Liu, Siyi Wang, GAO Zuchen, Qi Luo, Jing Li, and Yuqun Zhang. [n. d.]. Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  47. [47] Hanzhuo Tan, Chunpu Xu, Jing Li, Yuqun Zhang, Zeyang Fang, Zeyu Chen, and Baohua Lai. 2024. HICL: Hashtag-driven in-context learning for social media natural language understanding. IEEE Transactions on Neural Networks and Learning Systems 36, 4 (2024), 7037–7050.
  48. [48] Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMillan, and Toby Jia-Jun Li. 2024. A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions. arXiv preprint arXiv:2405.16081 (2024).
  49. [49] The SWE-bench Team. 2024. SWE-bench: A Benchmark for Evaluating Large Language Models on Real World Software Issues. https://www.swebench.com. Accessed: 2025-06-28.
  50. [50] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).
  51. [51] Maurice V. Wilkes, David J. Wheeler, and Stanley Gill. 1951. The Preparation of Programs for an Electronic Digital Computer. Addison-Wesley Press, Cambridge, MA, USA.
  52. [52] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024).
  53. [53] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
  54. [54] Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis.
  55. [55] Jiahong Xiang. 2026. FramePilot-Artifacts. doi:10.5281/zenodo.19728388
  56. [56] Jiahong Xiang, Wenxiao He, Xihua Wang, Hongliang Tian, and Yuqun Zhang. 2026. Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE). IEEE, 816–830. doi:10.1145/3744916.3773108
  57. [57] Jiahong Xiang, Xiaoyang Xu, Fanchu Kong, Mingyuan Wu, Zizheng Zhang, Haotian Zhang, and Yuqun Zhang. 2024. How far can we go with practical function-level program repair? arXiv preprint arXiv:2404.12833 (2024).
  58. [58] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.
  59. [59] John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798 (2025).
  60. [60] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
  61. [61] Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni, et al. 2025. debug-gym: A Text-Based Environment for Interactive Debugging. arXiv preprint arXiv:2503.21557 (2025).
  62. [62] Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. 2022. An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 39–51.