SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents
Pith reviewed 2026-05-10 05:09 UTC · model grok-4.3
The pith
SelfHeal uses two ReAct agents and empirical fix patterns from online sources to repair bugs in LLM agents more effectively than baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first empirical study of bug fix patterns in LLM agents, drawn from three platforms, and introduce AgentDefect, the first benchmark dataset of 37 runtime buggy instances supplied with fixed code and tests. We also propose SelfHeal, a multi-agent system that deploys two independent ReAct agents: the fix agent generates candidate repairs by consulting both internal knowledge of observed fix patterns and external web search, while the critic agent evaluates and refines those candidates. When powered by a strong backbone LLM, the system achieves substantially higher repair success than prior approaches on the collected instances.
What carries the argument
SelfHeal, a two-agent ReAct system in which one agent proposes fixes from empirical patterns and web search while the second agent validates them.
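The propose-and-critique loop can be sketched as a toy, assuming a round budget and a critic that simply executes the candidate; SelfHeal's actual agents are ReAct agents with tool calls, and the rule table, acceptance check, and function names below are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of a fix-agent / critic-agent loop (illustrative assumptions only:
# the real SelfHeal fix agent is a ReAct agent that also consults web search,
# and the real critic reasons over the candidate rather than executing it).

FIX_RULES = {"prnt": "print", "imprt": "import"}  # hypothetical fix patterns

def fix_agent(code, feedback=None):
    """Propose a candidate repair by applying known fix patterns.
    A real agent would condition on critic feedback and web-search results."""
    for wrong, right in FIX_RULES.items():
        code = code.replace(wrong, right)
    return code

def critic_agent(candidate):
    """Accept a candidate only if it runs without raising; otherwise
    return an error message as feedback for the next round."""
    try:
        exec(candidate, {})
        return True, None
    except Exception as e:
        return False, repr(e)

def self_heal(code, max_rounds=3):
    """Alternate proposal and critique until a candidate is accepted
    or the round budget is exhausted."""
    feedback = None
    for _ in range(max_rounds):
        candidate = fix_agent(code, feedback)
        ok, feedback = critic_agent(candidate)
        if ok:
            return candidate
        code = candidate  # refine the rejected candidate next round
    return None  # no accepted repair within the budget

repaired = self_heal('prnt("hello")')
```

The key design choice carried over from the paper is the separation of roles: the proposer never self-certifies, and rejected candidates flow back with feedback rather than being discarded.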
If this is right
- Developers of LLM agents gain an automated tool that proposes and validates repairs without constant manual intervention.
- The AgentDefect dataset supplies a public benchmark that future repair techniques can be measured against directly.
- Combining observed fix patterns as internal knowledge with web search improves the quality of generated repairs over either source alone.
- Runtime bugs arising in tool-using or multi-step LLM agents become addressable through coordinated proposal-and-critique agents.
Where Pith is reading between the lines
- The same separation of proposal and validation roles could be tested on debugging tasks for other autonomous AI systems such as reinforcement-learning agents.
- Collecting additional buggy instances beyond the initial 37 would likely surface further fix patterns that SelfHeal could incorporate.
- The identified patterns might be used proactively during LLM-agent design to reduce the occurrence of common bugs.
- The overall architecture suggests a general template for applying empirical analysis to automate repair in other software domains that involve agent-like behavior.
Load-bearing premise
The 37 runtime buggy instances collected from the three platforms, together with the identified fix patterns, are representative enough for the two ReAct agents to generate and validate correct repairs across a meaningful range of LLM agent bugs.
What would settle it
Testing SelfHeal on a fresh, independently collected set of runtime LLM-agent bugs and finding repair success rates no higher than those of the baselines would falsify the performance claim.
Original abstract
Large Language Models (LLMs) have transformed software development and AI applications. While LLMs are designed for text processing, LLM agents extend this capability by enabling autonomous actions, tool use, and multi-step task completion. As this field grows, developers face new challenges in debugging these complex systems. To address this challenge, we present the first empirical study on bug fix patterns in LLM agents. We study buggy posts and code snippets from three platforms: Stack Overflow, GitHub, and HuggingFace Forums. We examine their fix patterns, the components where fixes are applied, and the programming languages and frameworks involved. Furthermore, we introduce AgentDefect, the first benchmark dataset for bugs in LLM agents. The dataset contains 37 runtime buggy instances along with fixed code and test files. Finally, we present SelfHeal, a multi-agent system designed to fix bugs in LLM agents. The system leverages two independent ReAct agents: the fix agent and the critic agent. These agents use tools that provide both internal knowledge (fix rules) and external knowledge (web search) to propose and validate fixes. Our evaluation shows that SelfHeal with Gemini 3 Pro as the backbone LLM outperforms both baseline and state-of-the-art approaches by a significant margin.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first empirical study of bug fix patterns in LLM agents, analyzing posts and code from Stack Overflow, GitHub, and HuggingFace Forums. It introduces AgentDefect, a benchmark containing 37 runtime buggy LLM agent instances with fixes and tests, and proposes SelfHeal, a multi-agent repair system that employs two independent ReAct agents (a fix agent and a critic agent) augmented with internal fix-rule knowledge and external web-search tools. The central claim is that SelfHeal using Gemini 3 Pro as the backbone LLM significantly outperforms both baselines and state-of-the-art approaches.
Significance. If the evaluation protocol and dataset representativeness can be substantiated, the work would provide the first dedicated benchmark for LLM-agent bugs and a practical multi-agent repair architecture that combines rule-based and search-based knowledge. This could seed follow-on research in automated debugging for autonomous LLM systems. The empirical pattern analysis and the two-agent ReAct design are reasonable starting points, but the small scale of the evaluation limits immediate generalizability.
major comments (3)
- [Abstract] Abstract: the claim that SelfHeal 'outperforms both baseline and state-of-the-art approaches by a significant margin' is presented without any success rates, baseline names, statistical tests, or error analysis, rendering the headline result impossible to assess.
- [AgentDefect / Evaluation] AgentDefect construction and Evaluation sections: the 37 runtime instances are collected from the identical three platforms used for the empirical fix-pattern study, yet the manuscript provides no explicit statement that these instances were held out from pattern extraction; without such separation or a per-category/platform breakdown, the reported performance margin cannot be confidently attributed to the two-ReAct-agent architecture rather than leakage or selection bias.
- [Evaluation] Evaluation section: no description is given of the success metric (e.g., exact code match, test-suite passage, or manual inspection), the selection criteria for the 37 instances, or any difficulty stratification; these omissions are load-bearing because the central claim rests on the benchmark being both representative and fairly evaluated.
minor comments (2)
- [Abstract / Evaluation] Clarify the exact model version referenced as 'Gemini 3 Pro' and confirm whether it is a publicly available checkpoint.
- [Introduction] The manuscript would benefit from an explicit related-work subsection contrasting SelfHeal with prior LLM-based repair systems for conventional software.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the abstract and evaluation sections require additional quantitative details and clarifications to make the results more transparent and to address potential concerns about data separation and evaluation rigor. We will revise the manuscript accordingly and provide point-by-point responses to the major comments below.
Point-by-point responses
Referee: [Abstract] Abstract: the claim that SelfHeal 'outperforms both baseline and state-of-the-art approaches by a significant margin' is presented without any success rates, baseline names, statistical tests, or error analysis, rendering the headline result impossible to assess.
Authors: We acknowledge that the abstract is overly concise and does not include the specific quantitative results needed to evaluate the central claim. The full evaluation section reports success rates (e.g., the percentage of bugs successfully repaired by SelfHeal versus baselines), names the compared approaches, and includes error analysis. In the revised manuscript, we will expand the abstract to incorporate these concrete figures, baseline names, and any statistical tests or significance results, ensuring the headline claim is immediately assessable while remaining within length constraints. revision: yes
Referee: [AgentDefect / Evaluation] AgentDefect construction and Evaluation sections: the 37 runtime instances are collected from the identical three platforms used for the empirical fix-pattern study, yet the manuscript provides no explicit statement that these instances were held out from pattern extraction; without such separation or a per-category/platform breakdown, the reported performance margin cannot be confidently attributed to the two-ReAct-agent architecture rather than leakage or selection bias.
Authors: This is a legitimate concern about potential overlap or bias. The 37 AgentDefect instances were deliberately selected as runtime cases distinct from those used in the initial fix-pattern extraction; however, the manuscript does not explicitly document this separation. We will revise the AgentDefect construction section to state clearly that benchmark instances were held out from pattern analysis. We will also add a per-category and per-platform breakdown of the 37 instances to demonstrate diversity and mitigate concerns about selection bias, allowing readers to better attribute performance gains to the SelfHeal design. revision: yes
Referee: [Evaluation] Evaluation section: no description is given of the success metric (e.g., exact code match, test-suite passage, or manual inspection), the selection criteria for the 37 instances, or any difficulty stratification; these omissions are load-bearing because the central claim rests on the benchmark being both representative and fairly evaluated.
Authors: We agree these details are essential for assessing the evaluation's validity. The success metric combines test-suite passage with manual inspection of fix correctness; selection criteria prioritized runtime bugs across bug types identified in the empirical study; and instances were categorized by difficulty where possible. In the revised evaluation section, we will explicitly define the success metric, detail the selection criteria for the 37 instances, and include any available difficulty stratification or categorization to substantiate representativeness and fairness. revision: yes
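The automated half of the success metric the authors describe (test-suite passage) could be computed roughly as follows; the harness layout, use of a temporary script per instance, and function names are assumptions, not AgentDefect's actual evaluation code, and the manual-inspection half is not modeled.

```python
# Sketch of a test-suite-passage check for one benchmark instance
# (assumed harness: each instance's test is a runnable script whose
# zero exit status signals that the repaired code passes).
import os
import subprocess
import sys
import tempfile

def repair_succeeds(test_source):
    """Write the instance's test to a temp file, run it in a fresh
    interpreter, and treat exit status 0 as test-suite passage."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(test_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def repair_rate(outcomes):
    """Fraction of benchmark instances whose tests pass after repair."""
    return sum(outcomes) / len(outcomes)
```

Reporting a rate this way makes the headline number directly comparable across systems, which is exactly the transparency the referee asks for.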
Circularity Check
No circularity: purely empirical data collection and system evaluation
Full rationale
The paper performs an empirical study: it collects buggy posts from Stack Overflow, GitHub, and HuggingFace Forums, manually identifies fix patterns and affected components, assembles the AgentDefect benchmark of 37 runtime instances with fixes and tests, and evaluates the SelfHeal two-ReAct-agent repair system on that benchmark. No equations, first-principles derivations, parameter fitting, or predictions appear anywhere in the described workflow. Performance claims are direct experimental outcomes on the collected instances rather than quantities forced by construction from the inputs. No self-citation chain or uniqueness theorem is invoked to justify core choices. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Buggy posts and code snippets collected from Stack Overflow, GitHub, and HuggingFace Forums are representative of real-world bugs in LLM agents.
- domain assumption Fix patterns identified in the study can be encoded as rules usable by ReAct agents to generate valid repairs.