Recognition: 2 theorem links · Lean Theorem
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3
The pith
PROBE converts failed-run telemetry into structured evidence, diagnosis, and bounded guidance that agents can execute without changing their policy or tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. On 257 initially unresolved cases, this yields 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points, while attaching as a non-intrusive sidecar.
What carries the argument
The Guidance Gate, which filters diagnoses to emit only bounded recovery guidance that remains evidence-grounded, actionable, and executable inside the agent's unchanged policy, toolset, and execution budget.
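The gate's three conditions lend themselves to a small executable sketch. Everything below is illustrative: the class name, fields, and checks are assumptions for exposition, not PROBE's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    """One candidate recovery-guidance item derived from a diagnosis (hypothetical schema)."""
    steps: list[str]          # concrete actions proposed for the next attempt
    evidence_ids: list[str]   # telemetry records the guidance cites
    tools_required: set[str]  # tools the steps would invoke
    estimated_tokens: int     # rough execution cost of the steps


def gate(g: Guidance,
         telemetry_ids: set[str],
         tool_whitelist: set[str],
         token_budget: int) -> bool:
    """Emit guidance only when all three gate conditions hold."""
    # 1. Evidence-grounded: every cited record exists in the failed run's telemetry.
    grounded = bool(g.evidence_ids) and set(g.evidence_ids) <= telemetry_ids
    # 2. Actionable: at least one concrete step is proposed.
    actionable = bool(g.steps)
    # 3. In scope: only existing tools, within the unchanged execution budget.
    in_scope = (g.tools_required <= tool_whitelist
                and g.estimated_tokens <= token_budget)
    return grounded and actionable and in_scope
```

A gate structured this way can never widen the agent's capabilities: it can only withhold guidance, which is consistent with the claim that policy, toolset, and budget stay unchanged.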
If this is right
- Accurate diagnosis is necessary but insufficient: it must be converted into bounded guidance that a subsequent agent attempt can execute and verify.
- Recovery rate on unresolved cases rises by 12.45 percentage points over the strongest baseline.
- The framework attaches to existing service-diagnosis workflows as a non-intrusive side channel without altering agent policy, tools, or budget.
- The same three-layer structure applies across repository-level repair, enterprise workflow recovery, and AIOps mitigation.
Where Pith is reading between the lines
- If the gate works reliably, the same telemetry-to-guidance pattern could be tested in other autonomous agent settings that face post-failure recovery costs.
- The results imply that modest, evidence-only additions can raise recoverability even when the agent's core behavior and resource limits stay fixed.
- The diagnosis-recovery gap suggests future agent designs should treat executable guidance as a first-class output rather than an afterthought.
Load-bearing premise
The Guidance Gate can reliably produce diagnosis-derived guidance that is both evidence-grounded and executable within the unchanged scope of the agent's existing policy, toolset, and execution budget.
What would settle it
A controlled test on a fresh set of unresolved failure cases: the premise fails if the guidance output by the Gate either cannot be executed by the original agent or produces no measurable improvement in recovery rate over baselines.
Original abstract
Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PROBE, a failure-anchored framework for structured recovery in software engineering agents. It organizes failed-run telemetry via a Telemetry Layer, fuses evidence into diagnoses via a Diagnosis Layer, and uses a Guidance Gate to emit only evidence-grounded, actionable, and in-scope recovery guidance. Evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow recovery, and AIOps mitigation, PROBE reports 65.37% Top-1 diagnosis accuracy and 21.79% recovery rate, outperforming the strongest baseline by 43.58 and 12.45 percentage points respectively. A Microsoft IcM prototype demonstrates non-intrusive attachment to existing workflows without altering agent policy, toolset, or budget. The work also identifies a diagnosis-recovery gap.
Significance. If the reported performance gains hold under rigorous validation, the work would be significant for SE agent research by providing a concrete mechanism to convert heterogeneous failure signals into bounded, executable guidance without modifying the underlying agent. The multi-domain evaluation (257 cases) and real-world prototype are positive elements that suggest practical utility. The diagnosis-recovery gap observation is a useful framing. However, the significance is limited by incomplete reporting on the core filtering mechanism that underpins the claimed improvements.
major comments (2)
- [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.
- [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.
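The referee's first request, reporting the fraction of cases the gate withholds, is simple to operationalize. A hypothetical summary helper (names and schema are my choices, not the paper's):

```python
def gate_report(decisions: list[bool]) -> dict:
    """Summarize gate behavior over a case set.

    decisions[i] is True when the gate emitted guidance for case i,
    False when it withheld guidance.
    """
    total = len(decisions)
    emitted = sum(decisions)
    withheld = total - emitted
    return {
        "total_cases": total,
        "emitted": emitted,
        "withheld": withheld,
        "withheld_pct": round(100.0 * withheld / total, 2) if total else 0.0,
    }
```

Reporting this table per domain (repository repair, workflow recovery, AIOps) would directly address the selective-reporting concern.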
minor comments (2)
- The paper would benefit from a dedicated limitations or threats-to-validity subsection that explicitly discusses how scope enforcement was validated and whether any guidance was rejected post-hoc.
- [Evaluation] Clarify the exact definitions of 'Top-1 diagnosis accuracy' and 'recovery rate' (e.g., success criteria, verification method) in the evaluation setup.
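The two metrics the referee asks to pin down admit straightforward definitions. The sketch below assumes one labeled root cause per case and a binary resolved/unresolved verdict per retry; the paper's actual verification protocol may differ.

```python
def top1_diagnosis_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of cases whose top-ranked diagnosis matches the labeled root cause."""
    if len(predicted) != len(gold):
        raise ValueError("one prediction per labeled case required")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def recovery_rate(resolved_after_retry: list[bool]) -> float:
    """Fraction of initially unresolved cases that the subsequent attempt
    resolves and a verifier (e.g. the test suite) confirms."""
    return sum(resolved_after_retry) / len(resolved_after_retry)
```

Under these definitions the paper's headline numbers would mean roughly 168 of 257 cases correctly diagnosed at rank one and 56 of 257 recovered.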
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where additional transparency can strengthen the empirical claims. Below we respond point-by-point to the major comments, indicating the revisions we will make.
Point-by-point responses
Referee: [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.
Authors: We agree that the filtering behavior of the Guidance Gate requires explicit quantification to support the reported gains. The current manuscript describes the gate's three conditions at the design level (Section 3.3) but does not report the fraction of cases filtered nor provide an operational definition of scope. In the revision we will add a new paragraph to the evaluation section that states: (i) the exact number and percentage of cases for which the gate withheld guidance, (ii) the concrete criteria used to operationalize scope (tool whitelist membership, token-budget compliance, and policy-constraint checks as implemented in each domain), and (iii) the results of a post-hoc manual audit confirming that every emitted guidance item was executable with the original agent policy, toolset, and budget. These additions will be placed immediately before the main results tables. [Revision: yes]
Referee: [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.
Authors: We acknowledge that the current reporting is insufficient for full reproducibility and robustness assessment. The manuscript identifies the strongest non-PROBE baseline but does not enumerate all systems compared, report variance, or detail case provenance. We will revise the evaluation section and results tables to: (i) list every baseline system with its configuration parameters, (ii) include error bars (standard deviation across the three domains or bootstrap estimates), (iii) report statistical significance (paired McNemar tests for diagnosis accuracy and recovery rate), and (iv) add a dedicated paragraph describing how the 257 unresolved cases were collected, any exclusion rules applied, and how they were partitioned across the repository-repair, workflow-recovery, and AIOps settings. These changes will appear in both the main text and an expanded appendix. [Revision: yes]
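The paired McNemar test the authors propose operates on discordant pairs: cases one system resolves and the other does not. A minimal exact two-sided version (the function name and the doubling convention are my choices, not the paper's):

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant-pair counts.

    b: cases the baseline resolves but PROBE does not
    c: cases PROBE resolves but the baseline does not
    Concordant pairs (both resolve, or neither) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    # Binomial(n, 0.5) tail over the smaller discordant count, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With, say, 3 vs 35 discordant cases the p-value is far below 0.05; with 8 vs 12 it is not, which is why reporting the raw discordant counts matters as much as the headline percentages.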
Circularity Check
No circularity: empirical evaluation on external tasks with concrete metrics
full rationale
The paper presents PROBE as a framework with Telemetry Layer, Diagnosis Layer, and Guidance Gate, evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow, and AIOps settings. It reports specific empirical results (65.37% Top-1 diagnosis accuracy, 21.79% recovery rate) outperforming baselines by stated margins. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and reported improvements rather than reducing to inputs by construction, satisfying the self-contained criterion for score 0.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: failed-run telemetry contains sufficient fine-grained signals to support grounded diagnoses.
- Domain assumption: diagnosis-derived guidance can be produced that is actionable within the agent's existing policy and toolset.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "The Guidance Gate acts as a grounding, actionability, and scope filter, converting the diagnosis into bounded recovery guidance only when it is telemetry-supported, actionable, and within the scope of agent-side behavior."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.