pith. machine review for the scientific record.

arxiv: 2605.08717 · v1 · submitted 2026-05-09 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links


Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords software engineering agents · failure recovery · structured diagnosis · telemetry analysis · guidance gate · post-failure recovery · AIOps · code repair

The pith

PROBE converts failed-run telemetry into structured evidence, diagnosis, and bounded guidance that agents can execute without changing their policy or tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PROBE as a failure-anchored framework that organizes runtime evidence from software engineering agents into three layers: a Telemetry Layer that keeps fine-grained signals, a Diagnosis Layer that fuses them into grounded diagnoses, and a Guidance Gate that outputs only evidence-based, actionable steps within the agent's existing scope. A sympathetic reader would care because agents currently expose traces or generate loose feedback yet leave many failures unresolved, and this structure offers a systematic way to turn those traces into verifiable next attempts. Evaluation across repository repair, workflow recovery, and AIOps shows clear gains on previously stuck cases, pointing to a diagnosis-recovery gap where knowing the problem is not enough unless it produces bounded, executable guidance.

Core claim

PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. On 257 initially unresolved cases, this yields 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points, while attaching as a non-intrusive sidecar.
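The reported margins can be unpacked arithmetically; a minimal check (baseline values inferred here from the stated gains, not reported directly in this excerpt):

```python
# Sanity-check the reported margins: from PROBE's scores and the stated
# percentage-point gains, the strongest non-PROBE baseline's scores follow.
probe_top1, probe_recovery = 65.37, 21.79  # PROBE, % on 257 unresolved cases
gain_top1, gain_recovery = 43.58, 12.45    # stated lead over best baseline, pp

baseline_top1 = round(probe_top1 - gain_top1, 2)
baseline_recovery = round(probe_recovery - gain_recovery, 2)

print(baseline_top1, baseline_recovery)  # 21.79 9.34
```

Notably, the implied baseline Top-1 accuracy (21.79%) coincides numerically with PROBE's recovery rate.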

What carries the argument

The Guidance Gate, which filters diagnoses to emit only bounded recovery guidance that remains evidence-grounded, actionable, and executable inside the agent's unchanged policy, toolset, and execution budget.
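The gate's three admission conditions can be sketched as a predicate; this is an illustrative reconstruction, with all names and checks hypothetical, inferred from the paper's description rather than its implementation:

```python
from dataclasses import dataclass

@dataclass
class Guidance:
    diagnosis_id: str
    evidence_refs: list   # telemetry anchors the guidance cites
    steps: list           # concrete next actions for the agent
    required_tools: set   # tools the steps would invoke

def guidance_gate(g: Guidance, evidence_index: set, agent_tools: set) -> bool:
    """Emit guidance only if it is evidence-grounded, actionable, and in scope."""
    # Evidence-grounded: every cited anchor exists in the preserved telemetry.
    grounded = bool(g.evidence_refs) and all(r in evidence_index for r in g.evidence_refs)
    # Actionable: at least one concrete step for the next attempt.
    actionable = bool(g.steps)
    # In scope: no capability beyond the agent's unchanged toolset.
    in_scope = g.required_tools <= agent_tools
    return grounded and actionable and in_scope
```

For example, guidance citing a recorded tool-error anchor and requiring only tools the agent already has would pass; guidance with no evidence anchors, or demanding a new tool, would be withheld.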

If this is right

  • Accurate diagnosis proves necessary but insufficient unless converted into bounded guidance that a subsequent agent attempt can execute and verify.
  • Recovery rate on unresolved cases rises by 12.45 percentage points over the strongest baseline.
  • The framework attaches to existing service-diagnosis workflows as a non-intrusive side channel without altering agent policy, tools, or budget.
  • The same three-layer structure applies across repository-level repair, enterprise workflow recovery, and AIOps mitigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the gate works reliably, the same telemetry-to-guidance pattern could be tested in other autonomous agent settings that face post-failure recovery costs.
  • The results imply that modest, evidence-only additions can raise recoverability even when the agent's core behavior and resource limits stay fixed.
  • The diagnosis-recovery gap suggests future agent designs should treat executable guidance as a first-class output rather than an afterthought.

Load-bearing premise

The Guidance Gate can reliably produce diagnosis-derived guidance that is both evidence-grounded and executable within the unchanged scope of the agent's existing policy, toolset, and execution budget.

What would settle it

A controlled test on a fresh set of unresolved failure cases where the guidance output by the Gate either cannot be executed by the original agent or produces no measurable improvement in recovery rate over baselines.

Figures

Figures reproduced from arXiv: 2605.08717 by Chenyu Zhao, Chetan Bansal, Dan Pei, Minghua Ma, Saravan Rajmohan, Shenglin Zhang, Wenwei Gu, Yihang Lin, Yongqian Sun, Zhimin Chen.

Figure 1. Prior work treats observability, diagnosis, and in…
Figure 2. Overview of the PROBE framework. PROBE organizes failed-run telemetry into a typed telemetry bundle (T), constructs structured evidence (E), derives structured diagnosis (D), and produces bounded recovery guidance (G), forming a recovery loop for subsequent attempts.
Figure 3. Detailed workflow of the Diagnosis Layer.
Figure 4. AIOpsLab case illustrating how PROBE preserves failed-run evidence, derives a structured diagnosis, and produces bounded recovery guidance for the subsequent attempt.
Original abstract

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PROBE, a failure-anchored framework for structured recovery in software engineering agents. It organizes failed-run telemetry via a Telemetry Layer, fuses evidence into diagnoses via a Diagnosis Layer, and uses a Guidance Gate to emit only evidence-grounded, actionable, and in-scope recovery guidance. Evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow recovery, and AIOps mitigation, PROBE reports 65.37% Top-1 diagnosis accuracy and 21.79% recovery rate, outperforming the strongest baseline by 43.58 and 12.45 percentage points respectively. A Microsoft IcM prototype demonstrates non-intrusive attachment to existing workflows without altering agent policy, toolset, or budget. The work also identifies a diagnosis-recovery gap.

Significance. If the reported performance gains hold under rigorous validation, the work would be significant for SE agent research by providing a concrete mechanism to convert heterogeneous failure signals into bounded, executable guidance without modifying the underlying agent. The multi-domain evaluation (257 cases) and real-world prototype are positive elements that suggest practical utility. The diagnosis-recovery gap observation is a useful framing. However, the significance is limited by incomplete reporting on the core filtering mechanism that underpins the claimed improvements.

major comments (2)
  1. [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.
  2. [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.
minor comments (2)
  1. The paper would benefit from a dedicated limitations or threats-to-validity subsection that explicitly discusses how scope enforcement was validated and whether any guidance was rejected post-hoc.
  2. [Evaluation] Clarify the exact definitions of 'Top-1 diagnosis accuracy' and 'recovery rate' (e.g., success criteria, verification method) in the evaluation setup.
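One plausible operationalization of the two metrics the referee asks to pin down is sketched below; the paper's exact success criteria and verification method may differ:

```python
# Hypothetical metric definitions, assuming each case records a ranked
# diagnosis list, a labeled true cause, and a verified re-attempt outcome.
def top1_accuracy(cases):
    """Percent of cases whose top-ranked diagnosis matches the labeled root cause."""
    hits = sum(1 for c in cases if c["ranked_diagnoses"][0] == c["true_cause"])
    return 100.0 * hits / len(cases)

def recovery_rate(cases):
    """Percent of initially unresolved cases whose re-attempt passes verification."""
    recovered = sum(1 for c in cases if c["reattempt_passed_verification"])
    return 100.0 * recovered / len(cases)
```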

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional transparency can strengthen the empirical claims. Below we respond point-by-point to the major comments, indicating the revisions we will make.

Point-by-point responses
  1. Referee: [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.

    Authors: We agree that the filtering behavior of the Guidance Gate requires explicit quantification to support the reported gains. The current manuscript describes the gate's three conditions at the design level (Section 3.3) but does not report the fraction of cases filtered nor provide an operational definition of scope. In the revision we will add a new paragraph to the evaluation section that states: (i) the exact number and percentage of cases for which the gate withheld guidance, (ii) the concrete criteria used to operationalize scope (tool whitelist membership, token-budget compliance, and policy-constraint checks as implemented in each domain), and (iii) the results of a post-hoc manual audit confirming that every emitted guidance item was executable with the original agent policy, toolset, and budget. These additions will be placed immediately before the main results tables. revision: yes

  2. Referee: [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.

    Authors: We acknowledge that the current reporting is insufficient for full reproducibility and robustness assessment. The manuscript identifies the strongest non-PROBE baseline but does not enumerate all systems compared, report variance, or detail case provenance. We will revise the evaluation section and results tables to: (i) list every baseline system with its configuration parameters, (ii) include error bars (standard deviation across the three domains or bootstrap estimates), (iii) report statistical significance (paired McNemar tests for diagnosis accuracy and recovery rate), and (iv) add a dedicated paragraph describing how the 257 unresolved cases were collected, any exclusion rules applied, and how they were partitioned across the repository-repair, workflow-recovery, and AIOps settings. These changes will appear in both the main text and an expanded appendix. revision: yes
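The paired McNemar test the rebuttal proposes needs only the discordant pair counts per metric; a minimal exact two-sided version, with illustrative counts not taken from the paper:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts:
    b = cases only the baseline got right, c = cases only PROBE got right.
    Under H0 the discordant pairs split 50/50, so p follows a binomial test."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)  # cap: the two tails overlap when b == c

# Illustrative counts: 2 baseline-only vs. 15 PROBE-only successes.
print(mcnemar_exact_p(2, 15))  # ≈ 0.0023, significant at the 0.05 level
```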

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external tasks with concrete metrics

Full rationale

The paper presents PROBE as a framework with Telemetry Layer, Diagnosis Layer, and Guidance Gate, evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow, and AIOps settings. It reports specific empirical results (65.37% Top-1 diagnosis accuracy, 21.79% recovery rate) outperforming baselines by stated margins. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and reported improvements rather than reducing to inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about the sufficiency of runtime telemetry for diagnosis and the translatability of diagnoses into agent-executable guidance; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Failed-run telemetry contains sufficient fine-grained signals to support grounded diagnoses
    Invoked in the description of the Telemetry Layer and Diagnosis Layer.
  • domain assumption Diagnosis-derived guidance can be produced that is actionable within the agent's existing policy and toolset
    Core premise of the Guidance Gate.

pith-pipeline@v0.9.0 · 5627 in / 1381 out tokens · 41699 ms · 2026-05-12T00:59:42.570340+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 7 internal anchors

  [1] Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1737–1749. doi:10.1109/ICSE48619.2...
  [2] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning. Proc. ACM Softw. Eng. 1, FSE (2024), 675–698. doi:10.1145/3643757
  [3] Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. 2026. AgentRx: Diagnosing AI Agent Failures from Execution Trajectories. arXiv:2602.02475 [cs.AI] https://arxiv.org/abs/2602.02475
  [4] BerriAI. 2024. LiteLLM. https://docs.litellm.ai/. Unified interface for calling multiple large language model providers. Accessed: 2026-04-09
  [5] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv:2403.17134 [cs.SE] https://arxiv.org/abs/2403.17134
  [6] Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. In 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025, Seoul, Korea, Republic of, November 16-20, 2025. IEEE, 2846–2857. doi:10.1109/ASE63991.2025.00234
  [7] Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain. Open-source framework for building LLM applications and agents. Accessed: 2026-04-09
  [8] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL] https://arxiv.org/abs/2304.05128
  [9] Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. 2025. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025, ...
  [10] Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo Ghosh, Xuchao Zhang, Chaoyun Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Tianyin Xu. 2024. Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. In Proceedings of the Nineteenth European Confer...
  [11] Zhi Chen, Wei Ma, and Lingxiao Jiang. 2026. Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios. arXiv:2503.12374 [cs.SE] https://arxiv.org/abs/2503.12374
  [12] Liming Dong, Qinghua Lu, and Liming Zhu. 2024. AgentOps: Enabling Observability of LLM Agents. arXiv:2411.05285 [cs.AI] https://arxiv.org/abs/2411.05285
  [13] Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, and Saravan Rajmohan. 2024. X-Lifecycle Learning for Cloud Incident Management using LLMs. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, Porto de Galinhas, Brazil, July 15-19, 2024, M...
  [14] Antonio Gulli, Lavi Nigam, Julia Wiesinger, Vladimir Vuskovic, Irina Sigler, Ivan Nardini, Nicolas Stroppa, Sokratis Kartakis, Narek Saribekyan, and Alan Bount. 2025. Agents Companion. Technical Report. Google. https://www.kaggle.com/whitepaper-agent-companion Google Whitepaper. Available at: https://cdn.jsdelivr.net/gh/abncharts/abncharts.public.1/abna...
  [15] Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao. 2026. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. In Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advanc...
  [16] Songqiao Han, Xiyang Hu, Hailiang Huang, Mingqi Jiang, and Yue Zhao. 2022. ADBench: Anomaly Detection Benchmark. arXiv:2206.09426 [cs.LG] https://arxiv.org/abs/2206.09426
  [17] Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang
  [18] Xpert: Empowering Incident Management with Query Recommendations via Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 92:1–92:13. doi:10.1145/3597503.3639081
  [19] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66
  [20] LangChain. 2026. LangSmith: A Platform for Observability and Evaluation of LLM-based Systems. https://docs.langchain.com/langsmith/observability. Accessed: 2026-03
  [21] Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, and Jiaheng Liu. 2026. CodeTracer: Towards Traceable Agent States. arXiv:2604.11641 [cs.SE] https://arxiv.org/abs/2604.11641
  [22] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (...
  [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651 [cs.CL] https://arxiv.or...
  [24] Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, and Sai Rajeswar. 2026. EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings. arXiv:2603.13594 [cs.AI] https://arxiv.org/abs/2603.13594
  [25] Model Context Protocol Contributors. 2025. Model Context Protocol Specification. https://modelcontextprotocol.io/specification/2025-06-18. Accessed: 2026-04-09
  [26] Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra. 2026. Wink: Recovering from Misbehaviors in Coding Agents. arXiv:2602.17037 [cs.SE] https://arxiv.org/abs/2602.17037
  [27] Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2024. Is Self-Repair a Silver Bullet for Code Generation? arXiv:2306.09896 [cs.CL] https://arxiv.org/abs/2306.09896
  [28] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez
  [29] Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang ...
  [30] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In The Twelfth International Conference on Le...
  [31] Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Gregoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, and Klaus-Robert Muller
  [32] A Unifying Review of Deep and Shallow Anomaly Detection. Proc. IEEE 109, 5 (May 2021), 756–795. doi:10.1109/jproc.2021.3052449
  [33] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Or...
  [34] Manish Shetty, Chetan Bansal, Sumit Kumar, Nikitha Rao, Nachiappan Nagappan, and Thomas Zimmermann. 2021. Neural Knowledge Extraction From Cloud Service Incidents. In 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2021, Madrid, Spain, May 25-28, 2021. IEEE, 218–227. doi:10.1109/ICSE-SEIP52600...
  [35] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. doi:10.48550/ARXIV.2303.11366
  [36] Gou Tan, Zilong He, Min Li, Haiyu Huang, Yilun Wang, Pengfei Chen, Giuliano Casale, and Chuanfu Zhang. 2026. LLMRCA: Multilevel Root Cause Analysis for LLM Applications Using Multimodal Observability Data. ACM Trans. Softw. Eng. Methodol. (April 2026). doi:10.1145/3806200 Just Accepted
  [37] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian
  [38] AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. arXiv:2407.18901 [cs.SE] https://arxiv.org/abs/2407.18901
  [39] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA,...
  [40] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. CoRR abs/2308.08155 (2023). arXiv:2308.08155 doi:10.48550/ARXIV.2308.08155
  [41] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-Based Software Engineering Agents. Proc. ACM Softw. Eng. 2, FSE (2025), 801–824. doi:10.1145/3715754
  [42] Chunqiu Steven Xia and Lingming Zhang. 2024. Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’24). ACM, 819–831. doi:10.1145/3650212.3680323
  [43] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. OpenAgents: An Open Platform for Language Agents in the Wild. CoRR abs/2310.10634 (2023). arXiv:2310.10634 doi:10.48550/ARXIV.2310.10634
  [44] Zhe Xie, Shenglin Zhang, Yitong Geng, Yao Zhang, Minghua Ma, Xiaohui Nie, Zhenhe Yao, Longlong Xu, Yongqian Sun, Wentao Li, and Dan Pei. 2024. Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024...
  [45] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793 [cs.SE] https://arxiv.org/abs/2405.15793
  [46] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/forum?id=WE_vluYUL-X
  [47] Dylan Zhang, Xuchao Zhang, Chetan Bansal, Pedro Henrique B. Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. 2023. PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. CoRR abs/2309.05833 (2023). arXiv:2309.05833 doi:10.48550/ARXIV.2309.05833
  [48] Dylan Zhang, Xuchao Zhang, Chetan Bansal, Pedro Henrique B. Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. 2024. LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, Porto de Galin...
  [49] Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, and Saravan Rajmohan. 2024. Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. arXiv:2401.13810 [cs.CL] https://arxiv.org/abs/2401.13810
  [50] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.n...
  [51] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview...

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview....