pith. machine review for the scientific record.

arxiv: 2605.08717 · v1 · submitted 2026-05-09 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links


Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords software engineering agents · failure recovery · structured diagnosis · telemetry analysis · guidance gate · post-failure recovery · AIOps · code repair

The pith

PROBE converts failed-run telemetry into structured evidence, diagnosis, and bounded guidance that agents can execute without changing their policy or tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PROBE as a failure-anchored framework that organizes runtime evidence from software engineering agents into three layers: a Telemetry Layer that keeps fine-grained signals, a Diagnosis Layer that fuses them into grounded diagnoses, and a Guidance Gate that outputs only evidence-based, actionable steps within the agent's existing scope. A sympathetic reader would care because agents currently expose traces or generate loose feedback yet leave many failures unresolved, and this structure offers a systematic way to turn those traces into verifiable next attempts. Evaluation across repository repair, workflow recovery, and AIOps shows clear gains on previously stuck cases, pointing to a diagnosis-recovery gap where knowing the problem is not enough unless it produces bounded, executable guidance.

Core claim

PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. On 257 initially unresolved cases, this yields 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points, while attaching as a non-intrusive sidecar.
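The reported margins can be unpacked arithmetically; a minimal check (baseline values inferred here from the stated gains, not reported directly in this excerpt):

```python
# Sanity-check the reported margins: from PROBE's scores and the stated
# percentage-point gains, the strongest non-PROBE baseline's scores follow.
probe_top1, probe_recovery = 65.37, 21.79  # PROBE, % on 257 unresolved cases
gain_top1, gain_recovery = 43.58, 12.45    # stated lead over best baseline, pp

baseline_top1 = round(probe_top1 - gain_top1, 2)
baseline_recovery = round(probe_recovery - gain_recovery, 2)

print(baseline_top1, baseline_recovery)  # 21.79 9.34
```

Notably, the implied baseline Top-1 accuracy (21.79%) coincides numerically with PROBE's recovery rate.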

What carries the argument

The Guidance Gate, which filters diagnoses to emit only bounded recovery guidance that remains evidence-grounded, actionable, and executable inside the agent's unchanged policy, toolset, and execution budget.
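The gate's three admission conditions can be sketched as a predicate; this is an illustrative reconstruction, with all names and checks hypothetical, inferred from the paper's description rather than its implementation:

```python
from dataclasses import dataclass

@dataclass
class Guidance:
    diagnosis_id: str
    evidence_refs: list   # telemetry anchors the guidance cites
    steps: list           # concrete next actions for the agent
    required_tools: set   # tools the steps would invoke

def guidance_gate(g: Guidance, evidence_index: set, agent_tools: set) -> bool:
    """Emit guidance only if it is evidence-grounded, actionable, and in scope."""
    # Evidence-grounded: every cited anchor exists in the preserved telemetry.
    grounded = bool(g.evidence_refs) and all(r in evidence_index for r in g.evidence_refs)
    # Actionable: at least one concrete step for the next attempt.
    actionable = bool(g.steps)
    # In scope: no capability beyond the agent's unchanged toolset.
    in_scope = g.required_tools <= agent_tools
    return grounded and actionable and in_scope
```

For example, guidance citing a recorded tool-error anchor and requiring only tools the agent already has would pass; guidance with no evidence anchors, or demanding a new tool, would be withheld.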

If this is right

  • Accurate diagnosis proves necessary but insufficient unless converted into bounded guidance that a subsequent agent attempt can execute and verify.
  • Recovery rate on unresolved cases rises by 12.45 percentage points over the strongest baseline.
  • The framework attaches to existing service-diagnosis workflows as a non-intrusive side channel without altering agent policy, tools, or budget.
  • The same three-layer structure applies across repository-level repair, enterprise workflow recovery, and AIOps mitigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the gate works reliably, the same telemetry-to-guidance pattern could be tested in other autonomous agent settings that face post-failure recovery costs.
  • The results imply that modest, evidence-only additions can raise recoverability even when the agent's core behavior and resource limits stay fixed.
  • The diagnosis-recovery gap suggests future agent designs should treat executable guidance as a first-class output rather than an afterthought.

Load-bearing premise

The Guidance Gate can reliably produce diagnosis-derived guidance that is both evidence-grounded and executable within the unchanged scope of the agent's existing policy, toolset, and execution budget.

What would settle it

A controlled test on a fresh set of unresolved failure cases where the guidance output by the Gate either cannot be executed by the original agent or produces no measurable improvement in recovery rate over baselines.

Figures

Figures reproduced from arXiv: 2605.08717 by Chenyu Zhao, Chetan Bansal, Dan Pei, Minghua Ma, Saravan Rajmohan, Shenglin Zhang, Wenwei Gu, Yihang Lin, Yongqian Sun, Zhimin Chen.

Figure 1. Prior work treats observability, diagnosis, and in…
Figure 2. Overview of the PROBE framework. PROBE organizes failed-run telemetry into a typed telemetry bundle (T), constructs structured evidence (E), derives structured diagnosis (D), and produces bounded recovery guidance (G), forming a recovery loop for subsequent attempts.
Figure 3. Detailed workflow of the Diagnosis Layer.
Figure 4. AIOpsLab case illustrating how PROBE preserves failed-run evidence, derives a structured diagnosis, and produces bounded recovery guidance for the subsequent attempt.
Original abstract

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PROBE, a failure-anchored framework for structured recovery in software engineering agents. It organizes failed-run telemetry via a Telemetry Layer, fuses evidence into diagnoses via a Diagnosis Layer, and uses a Guidance Gate to emit only evidence-grounded, actionable, and in-scope recovery guidance. Evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow recovery, and AIOps mitigation, PROBE reports 65.37% Top-1 diagnosis accuracy and 21.79% recovery rate, outperforming the strongest baseline by 43.58 and 12.45 percentage points respectively. A Microsoft IcM prototype demonstrates non-intrusive attachment to existing workflows without altering agent policy, toolset, or budget. The work also identifies a diagnosis-recovery gap.

Significance. If the reported performance gains hold under rigorous validation, the work would be significant for SE agent research by providing a concrete mechanism to convert heterogeneous failure signals into bounded, executable guidance without modifying the underlying agent. The multi-domain evaluation (257 cases) and real-world prototype are positive elements that suggest practical utility. The diagnosis-recovery gap observation is a useful framing. However, the significance is limited by incomplete reporting on the core filtering mechanism that underpins the claimed improvements.

major comments (2)
  1. [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.
  2. [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.
minor comments (2)
  1. The paper would benefit from a dedicated limitations or threats-to-validity subsection that explicitly discusses how scope enforcement was validated and whether any guidance was rejected post-hoc.
  2. [Evaluation] Clarify the exact definitions of 'Top-1 diagnosis accuracy' and 'recovery rate' (e.g., success criteria, verification method) in the evaluation setup.
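One plausible operationalization of the two metrics the referee asks to pin down is sketched below; the paper's exact success criteria and verification method may differ:

```python
# Hypothetical metric definitions, assuming each case records a ranked
# diagnosis list, a labeled true cause, and a verified re-attempt outcome.
def top1_accuracy(cases):
    """Percent of cases whose top-ranked diagnosis matches the labeled root cause."""
    hits = sum(1 for c in cases if c["ranked_diagnoses"][0] == c["true_cause"])
    return 100.0 * hits / len(cases)

def recovery_rate(cases):
    """Percent of initially unresolved cases whose re-attempt passes verification."""
    recovered = sum(1 for c in cases if c["reattempt_passed_verification"])
    return 100.0 * recovered / len(cases)
```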

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional transparency can strengthen the empirical claims. Below we respond point-by-point to the major comments, indicating the revisions we will make.

Point-by-point responses
  1. Referee: [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.

    Authors: We agree that the filtering behavior of the Guidance Gate requires explicit quantification to support the reported gains. The current manuscript describes the gate's three conditions at the design level (Section 3.3) but does not report the fraction of cases filtered nor provide an operational definition of scope. In the revision we will add a new paragraph to the evaluation section that states: (i) the exact number and percentage of cases for which the gate withheld guidance, (ii) the concrete criteria used to operationalize scope (tool whitelist membership, token-budget compliance, and policy-constraint checks as implemented in each domain), and (iii) the results of a post-hoc manual audit confirming that every emitted guidance item was executable with the original agent policy, toolset, and budget. These additions will be placed immediately before the main results tables. revision: yes

  2. Referee: [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.

    Authors: We acknowledge that the current reporting is insufficient for full reproducibility and robustness assessment. The manuscript identifies the strongest non-PROBE baseline but does not enumerate all systems compared, report variance, or detail case provenance. We will revise the evaluation section and results tables to: (i) list every baseline system with its configuration parameters, (ii) include error bars (standard deviation across the three domains or bootstrap estimates), (iii) report statistical significance (paired McNemar tests for diagnosis accuracy and recovery rate), and (iv) add a dedicated paragraph describing how the 257 unresolved cases were collected, any exclusion rules applied, and how they were partitioned across the repository-repair, workflow-recovery, and AIOps settings. These changes will appear in both the main text and an expanded appendix. revision: yes
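The paired McNemar test the rebuttal proposes needs only the discordant pair counts per metric; a minimal exact two-sided version, with illustrative counts not taken from the paper:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts:
    b = cases only the baseline got right, c = cases only PROBE got right.
    Under H0 the discordant pairs split 50/50, so p follows a binomial test."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)  # cap: the two tails overlap when b == c

# Illustrative counts: 2 baseline-only vs. 15 PROBE-only successes.
print(mcnemar_exact_p(2, 15))  # ≈ 0.0023, significant at the 0.05 level
```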

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external tasks with concrete metrics

Full rationale

The paper presents PROBE as a framework with Telemetry Layer, Diagnosis Layer, and Guidance Gate, evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow, and AIOps settings. It reports specific empirical results (65.37% Top-1 diagnosis accuracy, 21.79% recovery rate) outperforming baselines by stated margins. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and reported improvements rather than reducing to inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about the sufficiency of runtime telemetry for diagnosis and the translatability of diagnoses into agent-executable guidance; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Failed-run telemetry contains sufficient fine-grained signals to support grounded diagnoses
    Invoked in the description of the Telemetry Layer and Diagnosis Layer.
  • domain assumption Diagnosis-derived guidance can be produced that is actionable within the agent's existing policy and toolset
    Core premise of the Guidance Gate.

pith-pipeline@v0.9.0 · 5627 in / 1381 out tokens · 41699 ms · 2026-05-12T00:59:42.570340+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 7 internal anchors

  [1] Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 1737–1749. doi:10.1109/ICSE48619.2...
  [2] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning. Proc. ACM Softw. Eng. 1, FSE (2024), 675–698. doi:10.1145/3643757
  [3] Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, and Chetan Bansal. 2026. AgentRx: Diagnosing AI Agent Failures from Execution Trajectories. arXiv:2602.02475 [cs.AI] https://arxiv.org/abs/2602.02475
  [4] BerriAI. 2024. LiteLLM. https://docs.litellm.ai/. Unified interface for calling multiple large language model providers. Accessed: 2026-04-09
  [5] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. arXiv:2403.17134 [cs.SE] https://arxiv.org/abs/2403.17134
  [6] Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. In 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025, Seoul, Korea, Republic of, November 16-20, 2025. IEEE, 2846–2857. doi:10.1109/ASE63991.2025.00234
  [7] Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain. Open-source framework for building LLM applications and agents. Accessed: 2026-04-09
  [8] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug. arXiv:2304.05128 [cs.CL] https://arxiv.org/abs/2304.05128
  [9] Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. 2025. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025, ...
  [10] Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo Ghosh, Xuchao Zhang, Chaoyun Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Tianyin Xu. 2024. Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. In Proceedings of the Nineteenth European Confer...
  [11] Zhi Chen, Wei Ma, and Lingxiao Jiang. 2026. Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios. arXiv:2503.12374 [cs.SE] https://arxiv.org/abs/2503.12374
  [12] Liming Dong, Qinghua Lu, and Liming Zhu. 2024. AgentOps: Enabling Observability of LLM Agents. arXiv:2411.05285 [cs.AI] https://arxiv.org/abs/2411.05285
  [13] Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, and Saravan Rajmohan. 2024. X-Lifecycle Learning for Cloud Incident Management using LLMs. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, Porto de Galinhas, Brazil, July 15-19, 2024, M...
  [14] Antonio Gulli, Lavi Nigam, Julia Wiesinger, Vladimir Vuskovic, Irina Sigler, Ivan Nardini, Nicolas Stroppa, Sokratis Kartakis, Narek Saribekyan, and Alan Bount. 2025. Agents Companion. Technical Report. Google. https://www.kaggle.com/whitepaper-agent-companion Google Whitepaper. Available at: https://cdn.jsdelivr.net/gh/abncharts/abncharts.public.1/abna...
  [15] Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao. 2026. MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools. In Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advanc...
  [16] Songqiao Han, Xiyang Hu, Hailiang Huang, Mingqi Jiang, and Yue Zhao. 2022. ADBench: Anomaly Detection Benchmark. arXiv:2206.09426 [cs.LG] https://arxiv.org/abs/2206.09426
  [17] Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang
  [18] Xpert: Empowering Incident Management with Query Recommendations via Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 92:1–92:13. doi:10.1145/3597503.3639081
  [19] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66
  [20] LangChain. 2026. LangSmith: A Platform for Observability and Evaluation of LLM-based Systems. https://docs.langchain.com/langsmith/observability. Accessed: 2026-03
  [21] Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, and Jiaheng Liu. 2026. CodeTracer: Towards Traceable Agent States. arXiv:2604.11641 [cs.SE] https://arxiv.org/abs/2604.11641
  [22] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (...
  [23] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651 [cs.CL] https://arxiv.or...
  [24] Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, and Sai Rajeswar. 2026. EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings. arXiv:2603.13594 [cs.AI] https://arxiv.org/abs/2603.13594
  [25] Model Context Protocol Contributors. 2025. Model Context Protocol Specification. https://modelcontextprotocol.io/specification/2025-06-18. Accessed: 2026-04-09
  [26] Rahul Nanda, Chandra Maddila, Smriti Jha, Euna Mehnaz Khan, Matteo Paltenghi, and Satish Chandra. 2026. Wink: Recovering from Misbehaviors in Coding Agents. arXiv:2602.17037 [cs.SE] https://arxiv.org/abs/2602.17037
  [27] Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2024. Is Self-Repair a Silver Bullet for Code Generation? arXiv:2306.09896 [cs.CL] https://arxiv.org/abs/2306.09896
  [28] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez
  [29] Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang ...
  [30] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In The Twelfth International Conference on Le...
  [31] Lukas Ruff, Jacob R. Kauffmann, Robert A. Vandermeulen, Gregoire Montavon, Wojciech Samek, Marius Kloft, Thomas G. Dietterich, and Klaus-Robert Muller
  [32] A Unifying Review of Deep and Shallow Anomaly Detection. Proc. IEEE 109, 5 (May 2021), 756–795. doi:10.1109/jproc.2021.3052449
  [33] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Or...
  [34] Manish Shetty, Chetan Bansal, Sumit Kumar, Nikitha Rao, Nachiappan Nagappan, and Thomas Zimmermann. 2021. Neural Knowledge Extraction From Cloud Service Incidents. In 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2021, Madrid, Spain, May 25-28, 2021. IEEE, 218–227. doi:10.1109/ICSE-SEIP52600...
  [35] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. doi:10.48550/ARXIV.2303.11366
  [36] Gou Tan, Zilong He, Min Li, Haiyu Huang, Yilun Wang, Pengfei Chen, Giuliano Casale, and Chuanfu Zhang. 2026. LLMRCA: Multilevel Root Cause Analysis for LLM Applications Using Multimodal Observability Data. ACM Trans. Softw. Eng. Methodol. (April 2026). doi:10.1145/3806200 Just Accepted
  [37] Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian
  [38] AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. arXiv:2407.18901 [cs.SE] https://arxiv.org/abs/2407.18901
  [39] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA,...
  [40] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. CoRR abs/2308.08155 (2023). arXiv:2308.08155 doi:10.48550/ARXIV.2308.08155
  [41] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-Based Software Engineering Agents. Proc. ACM Softw. Eng. 2, FSE (2025), 801–824. doi:10.1145/3715754
  [42] Chunqiu Steven Xia and Lingming Zhang. 2024. Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’24). ACM, 819–831. doi:10.1145/3650212.3680323
  [43] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. OpenAgents: An Open Platform for Language Agents in the Wild. CoRR abs/2310.10634 (2023). arXiv:2310.10634 doi:10.48550/ARXIV.2310.10634
  [44] Zhe Xie, Shenglin Zhang, Yitong Geng, Yao Zhang, Minghua Ma, Xiaohui Nie, Zhenhe Yao, Longlong Xu, Yongqian Sun, Wentao Li, and Dan Pei. 2024. Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024...
  [45] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793 [cs.SE] https://arxiv.org/abs/2405.15793
  [46] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/forum?id=WE_vluYUL-X
  [47] Dylan Zhang, Xuchao Zhang, Chetan Bansal, Pedro Henrique B. Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. 2023. PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. CoRR abs/2309.05833 (2023). arXiv:2309.05833 doi:10.48550/ARXIV.2309.05833
  [48] Dylan Zhang, Xuchao Zhang, Chetan Bansal, Pedro Henrique B. Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. 2024. LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, Porto de Galin...
  [49] Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, and Saravan Rajmohan. 2024. Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. arXiv:2401.13810 [cs.CL] https://arxiv.org/abs/2401.13810
  [50] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.n...
  [51] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview...

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview....