SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents
Pith reviewed 2026-05-10 05:09 UTC · model grok-4.3
The pith
SelfHeal uses two ReAct agents and empirical fix patterns from online sources to repair bugs in LLM agents more effectively than baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first empirical study of bug fix patterns in LLM agents, drawn from three platforms, and introduce AgentDefect, the first benchmark dataset of 37 runtime buggy instances supplied with fixed code and tests. We also propose SelfHeal, a multi-agent system that deploys two independent ReAct agents: the fix agent generates candidate repairs by consulting both internal knowledge of observed fix patterns and external web search, while the critic agent evaluates and refines those candidates. When powered by a strong backbone LLM, the system achieves substantially higher repair success than prior approaches on the collected instances.
What carries the argument
SelfHeal, a two-agent ReAct system in which one agent proposes fixes from empirical patterns and web search while the second agent validates them.
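The propose-and-critique loop can be sketched as a toy, assuming a round budget and a critic that simply executes the candidate; SelfHeal's actual agents are ReAct agents with tool calls, and the rule table, acceptance check, and function names below are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of a fix-agent / critic-agent loop (illustrative assumptions only:
# the real SelfHeal fix agent is a ReAct agent that also consults web search,
# and the real critic reasons over the candidate rather than executing it).

FIX_RULES = {"prnt": "print", "imprt": "import"}  # hypothetical fix patterns

def fix_agent(code, feedback=None):
    """Propose a candidate repair by applying known fix patterns.
    A real agent would condition on critic feedback and web-search results."""
    for wrong, right in FIX_RULES.items():
        code = code.replace(wrong, right)
    return code

def critic_agent(candidate):
    """Accept a candidate only if it runs without raising; otherwise
    return an error message as feedback for the next round."""
    try:
        exec(candidate, {})
        return True, None
    except Exception as e:
        return False, repr(e)

def self_heal(code, max_rounds=3):
    """Alternate proposal and critique until a candidate is accepted
    or the round budget is exhausted."""
    feedback = None
    for _ in range(max_rounds):
        candidate = fix_agent(code, feedback)
        ok, feedback = critic_agent(candidate)
        if ok:
            return candidate
        code = candidate  # refine the rejected candidate next round
    return None  # no accepted repair within the budget

repaired = self_heal('prnt("hello")')
```

The key design choice carried over from the paper is the separation of roles: the proposer never self-certifies, and rejected candidates flow back with feedback rather than being discarded.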
If this is right
- Developers of LLM agents gain an automated tool that proposes and validates repairs without constant manual intervention.
- The AgentDefect dataset supplies a public benchmark that future repair techniques can be measured against directly.
- Combining observed fix patterns as internal knowledge with web search improves the quality of generated repairs over either source alone.
- Runtime bugs arising in tool-using or multi-step LLM agents become addressable through coordinated proposal-and-critique agents.
Where Pith is reading between the lines
- The same separation of proposal and validation roles could be tested on debugging tasks for other autonomous AI systems such as reinforcement-learning agents.
- Collecting additional buggy instances beyond the initial 37 would likely surface further fix patterns that SelfHeal could incorporate.
- The identified patterns might be used proactively during LLM-agent design to reduce the occurrence of common bugs.
- The overall architecture suggests a general template for applying empirical analysis to automate repair in other software domains that involve agent-like behavior.
Load-bearing premise
The 37 runtime buggy instances collected from the three platforms, together with the identified fix patterns, are representative enough for the two ReAct agents to generate and validate correct repairs across a meaningful range of LLM agent bugs.
What would settle it
Testing SelfHeal on a fresh, independently collected set of runtime LLM-agent bugs and finding repair success rates no higher than those of the baselines would falsify the performance claim.
Original abstract
Large Language Models (LLMs) have transformed software development and AI applications. While LLMs are designed for text processing, LLM agents extend this capability by enabling autonomous actions, tool use, and multi-step task completion. As this field grows, developers face new challenges in debugging these complex systems. To address this challenge, we present the first empirical study on bug fix patterns in LLM agents. We study buggy posts and code snippets from three platforms: Stack Overflow, GitHub, and HuggingFace Forums. We examine their fix patterns, the components where fixes are applied, and the programming languages and frameworks involved. Furthermore, we introduce AgentDefect, the first benchmark dataset for bugs in LLM agents. The dataset contains 37 runtime buggy instances along with fixed code and test files. Finally, we present SelfHeal, a multi-agent system designed to fix bugs in LLM agents. The system leverages two independent ReAct agents: the fix agent and the critic agent. These agents use tools that provide both internal knowledge (fix rules) and external knowledge (web search) to propose and validate fixes. Our evaluation shows that SelfHeal with Gemini 3 Pro as the backbone LLM outperforms both baseline and state-of-the-art approaches by a significant margin.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first empirical study of bug fix patterns in LLM agents, analyzing posts and code from Stack Overflow, GitHub, and HuggingFace Forums. It introduces AgentDefect, a benchmark containing 37 runtime buggy LLM agent instances with fixes and tests, and proposes SelfHeal, a multi-agent repair system that employs two independent ReAct agents (a fix agent and a critic agent) augmented with internal fix-rule knowledge and external web-search tools. The central claim is that SelfHeal using Gemini 3 Pro as the backbone LLM significantly outperforms both baselines and state-of-the-art approaches.
Significance. If the evaluation protocol and dataset representativeness can be substantiated, the work would provide the first dedicated benchmark for LLM-agent bugs and a practical multi-agent repair architecture that combines rule-based and search-based knowledge. This could seed follow-on research in automated debugging for autonomous LLM systems. The empirical pattern analysis and the two-agent ReAct design are reasonable starting points, but the small scale of the evaluation limits immediate generalizability.
major comments (3)
- [Abstract] Abstract: the claim that SelfHeal 'outperforms both baseline and state-of-the-art approaches by a significant margin' is presented without any success rates, baseline names, statistical tests, or error analysis, rendering the headline result impossible to assess.
- [AgentDefect / Evaluation] AgentDefect construction and Evaluation sections: the 37 runtime instances are collected from the identical three platforms used for the empirical fix-pattern study, yet the manuscript provides no explicit statement that these instances were held out from pattern extraction; without such separation or a per-category/platform breakdown, the reported performance margin cannot be confidently attributed to the two-ReAct-agent architecture rather than leakage or selection bias.
- [Evaluation] Evaluation section: no description is given of the success metric (e.g., exact code match, test-suite passage, or manual inspection), the selection criteria for the 37 instances, or any difficulty stratification; these omissions are load-bearing because the central claim rests on the benchmark being both representative and fairly evaluated.
minor comments (2)
- [Abstract / Evaluation] Clarify the exact model version referenced as 'Gemini 3 Pro' and confirm whether it is a publicly available checkpoint.
- [Introduction] The manuscript would benefit from an explicit related-work subsection contrasting SelfHeal with prior LLM-based repair systems for conventional software.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the abstract and evaluation sections require additional quantitative details and clarifications to make the results more transparent and to address potential concerns about data separation and evaluation rigor. We will revise the manuscript accordingly and provide point-by-point responses to the major comments below.
Point-by-point responses
Referee: [Abstract] Abstract: the claim that SelfHeal 'outperforms both baseline and state-of-the-art approaches by a significant margin' is presented without any success rates, baseline names, statistical tests, or error analysis, rendering the headline result impossible to assess.
Authors: We acknowledge that the abstract is overly concise and does not include the specific quantitative results needed to evaluate the central claim. The full evaluation section reports success rates (e.g., the percentage of bugs successfully repaired by SelfHeal versus baselines), names the compared approaches, and includes error analysis. In the revised manuscript, we will expand the abstract to incorporate these concrete figures, baseline names, and any statistical tests or significance results, ensuring the headline claim is immediately assessable while remaining within length constraints. revision: yes
Referee: [AgentDefect / Evaluation] AgentDefect construction and Evaluation sections: the 37 runtime instances are collected from the identical three platforms used for the empirical fix-pattern study, yet the manuscript provides no explicit statement that these instances were held out from pattern extraction; without such separation or a per-category/platform breakdown, the reported performance margin cannot be confidently attributed to the two-ReAct-agent architecture rather than leakage or selection bias.
Authors: This is a legitimate concern about potential overlap or bias. The 37 AgentDefect instances were deliberately selected as runtime cases distinct from those used in the initial fix-pattern extraction; however, the manuscript does not explicitly document this separation. We will revise the AgentDefect construction section to state clearly that benchmark instances were held out from pattern analysis. We will also add a per-category and per-platform breakdown of the 37 instances to demonstrate diversity and mitigate concerns about selection bias, allowing readers to better attribute performance gains to the SelfHeal design. revision: yes
Referee: [Evaluation] Evaluation section: no description is given of the success metric (e.g., exact code match, test-suite passage, or manual inspection), the selection criteria for the 37 instances, or any difficulty stratification; these omissions are load-bearing because the central claim rests on the benchmark being both representative and fairly evaluated.
Authors: We agree these details are essential for assessing the evaluation's validity. The success metric combines test-suite passage with manual inspection of fix correctness; selection criteria prioritized runtime bugs across bug types identified in the empirical study; and instances were categorized by difficulty where possible. In the revised evaluation section, we will explicitly define the success metric, detail the selection criteria for the 37 instances, and include any available difficulty stratification or categorization to substantiate representativeness and fairness. revision: yes
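The automated half of the success metric the authors describe (test-suite passage) could be computed roughly as follows; the harness layout, use of a temporary script per instance, and function names are assumptions, not AgentDefect's actual evaluation code, and the manual-inspection half is not modeled.

```python
# Sketch of a test-suite-passage check for one benchmark instance
# (assumed harness: each instance's test is a runnable script whose
# zero exit status signals that the repaired code passes).
import os
import subprocess
import sys
import tempfile

def repair_succeeds(test_source):
    """Write the instance's test to a temp file, run it in a fresh
    interpreter, and treat exit status 0 as test-suite passage."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(test_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def repair_rate(outcomes):
    """Fraction of benchmark instances whose tests pass after repair."""
    return sum(outcomes) / len(outcomes)
```

Reporting a rate this way makes the headline number directly comparable across systems, which is exactly the transparency the referee asks for.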
Circularity Check
No circularity: purely empirical data collection and system evaluation
Full rationale
The paper performs an empirical study: it collects buggy posts from Stack Overflow, GitHub, and HuggingFace Forums, manually identifies fix patterns and affected components, assembles the AgentDefect benchmark of 37 runtime instances with fixes and tests, and evaluates the SelfHeal two-ReAct-agent repair system on that benchmark. No equations, first-principles derivations, parameter fitting, or predictions appear anywhere in the described workflow. Performance claims are direct experimental outcomes on the collected instances rather than quantities forced by construction from the inputs. No self-citation chain or uniqueness theorem is invoked to justify core choices. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Buggy posts and code snippets collected from Stack Overflow, GitHub, and HuggingFace Forums are representative of real-world bugs in LLM agents.
- domain assumption Fix patterns identified in the study can be encoded as rules usable by ReAct agents to generate valid repairs.