Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

Chenyu Zhao; Chetan Bansal; Dan Pei; Minghua Ma; Saravan Rajmohan; Shenglin Zhang; Wenwei Gu; Yihang Lin; Yongqian Sun; Zhimin Chen

arxiv: 2605.08717 · v2 · pith:UHF6IZ5Rnew · submitted 2026-05-09 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-12 00:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords software engineering agents · failure recovery · structured diagnosis · telemetry analysis · guidance gate · post-failure recovery · AIOps · code repair

The pith

PROBE converts failed-run telemetry into structured evidence, diagnosis, and bounded guidance that agents can execute without changing their policy or tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PROBE as a failure-anchored framework that organizes runtime evidence from software engineering agents into three layers: a Telemetry Layer that keeps fine-grained signals, a Diagnosis Layer that fuses them into grounded diagnoses, and a Guidance Gate that outputs only evidence-based, actionable steps within the agent's existing scope. A sympathetic reader would care because agents currently expose traces or generate loose feedback yet leave many failures unresolved, and this structure offers a systematic way to turn those traces into verifiable next attempts. Evaluation across repository repair, workflow recovery, and AIOps mitigation covers 257 initially unresolved cases and reports gains of 43.58 percentage points in Top-1 diagnosis accuracy and 12.45 percentage points in recovery rate over the strongest non-PROBE baseline. The results point to a diagnosis-recovery gap: knowing the problem is not enough unless the diagnosis produces bounded, executable guidance.

Core claim

PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. On 257 initially unresolved cases, this yields 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points, while attaching as a non-intrusive sidecar.
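The percentages above can be sanity-checked against the stated case count. The sketch below assumes every rate is computed over all 257 cases, which the text implies but does not state outright. The implied case counts and baseline rates are derived by arithmetic here; they are not figures quoted from the paper.

```python
# Consistency check on the reported percentages.
# Inputs are taken from the text above; the implied counts and baseline
# rates are DERIVED here and are not stated in the source.
N_CASES = 257

def implied_count(rate_pct, n=N_CASES):
    """Nearest whole number of cases corresponding to a percentage of n."""
    return round(rate_pct / 100 * n)

def as_pct(count, n=N_CASES):
    """Percentage of n represented by count, rounded to two decimals."""
    return round(100 * count / n, 2)

PROBE_TOP1, PROBE_RECOVERY = 65.37, 21.79   # reported PROBE results (%)
GAIN_TOP1, GAIN_RECOVERY = 43.58, 12.45     # reported gains (percentage points)

# Baseline rates implied by "PROBE result minus gain".
BASELINE_TOP1 = round(PROBE_TOP1 - GAIN_TOP1, 2)
BASELINE_RECOVERY = round(PROBE_RECOVERY - GAIN_RECOVERY, 2)

if __name__ == "__main__":
    rows = [
        ("PROBE Top-1", PROBE_TOP1),
        ("PROBE recovery", PROBE_RECOVERY),
        ("baseline Top-1 (implied)", BASELINE_TOP1),
        ("baseline recovery (implied)", BASELINE_RECOVERY),
    ]
    for name, pct in rows:
        k = implied_count(pct)
        print(f"{name}: {pct}% ~ {k}/{N_CASES} cases (recomputed {as_pct(k)}%)")
```

Under that assumption each percentage maps back to a whole number of cases, so the reported figures are at least internally consistent. The implied baseline Top-1 rate also matches one of the baseline accuracies quoted in the Figure 4 excerpt further down.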

What carries the argument

The Guidance Gate carries the argument: it filters diagnoses and emits only bounded recovery guidance that is evidence-grounded, actionable, and executable inside the agent's unchanged policy, toolset, and execution budget.
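One way to picture the gate is as a three-condition filter over candidate guidance steps. The following is a minimal sketch, not the paper's implementation: `GuidanceStep`, `AgentScope`, `gate`, the token estimate, and the reason labels are all hypothetical names introduced for illustration, and scope is reduced here to a tool allowlist plus a token budget.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class GuidanceStep:              # hypothetical candidate guidance item
    action: str                  # instruction for the next attempt
    tool: str                    # tool the step relies on
    evidence_ids: tuple[str, ...]  # telemetry records that back the step
    est_tokens: int              # assumed cost estimate for executing it

@dataclass(frozen=True)
class AgentScope:                # hypothetical description of the fixed agent
    allowed_tools: frozenset[str]
    token_budget: int

def gate(steps, scope, known_evidence):
    """Split steps into (emitted, withheld); withheld entries carry reasons."""
    emitted, withheld = [], []
    remaining = scope.token_budget
    for step in steps:
        reasons = []
        # Evidence-grounded: cites at least one record, and all cited records exist.
        if not step.evidence_ids or not set(step.evidence_ids) <= set(known_evidence):
            reasons.append("ungrounded")
        # Actionable: carries a non-empty instruction.
        if not step.action.strip():
            reasons.append("not_actionable")
        # In scope: uses an existing tool and fits the remaining budget.
        if step.tool not in scope.allowed_tools:
            reasons.append("tool_not_allowed")
        elif step.est_tokens > remaining:
            reasons.append("over_budget")
        if reasons:
            withheld.append((step, reasons))
        else:
            emitted.append(step)
            remaining -= step.est_tokens
    return emitted, withheld
```

Returning the withheld steps together with their reasons makes it straightforward to report how often, and why, guidance was suppressed.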

If this is right

Accurate diagnosis proves necessary but insufficient unless converted into bounded guidance a subsequent agent attempt can execute and verify.
Recovery rate on unresolved cases rises by 12.45 percentage points over the strongest baseline.
The framework attaches to existing service-diagnosis workflows as a non-intrusive side channel without altering agent policy, tools, or budget.
The same three-layer structure applies across repository-level repair, enterprise workflow recovery, and AIOps mitigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the gate works reliably, the same telemetry-to-guidance pattern could be tested in other autonomous agent settings that face post-failure recovery costs.
The results imply that modest, evidence-only additions can raise recoverability even when the agent's core behavior and resource limits stay fixed.
The diagnosis-recovery gap suggests future agent designs should treat executable guidance as a first-class output rather than an afterthought.

Load-bearing premise

The Guidance Gate can reliably produce diagnosis-derived guidance that is both evidence-grounded and executable within the unchanged scope of the agent's existing policy, toolset, and execution budget.

What would settle it

A controlled test on a fresh set of unresolved failure cases where the guidance output by the Gate either cannot be executed by the original agent or produces no measurable improvement in recovery rate over baselines.

Figures

Figures reproduced from arXiv: 2605.08717 by Chenyu Zhao, Chetan Bansal, Dan Pei, Minghua Ma, Saravan Rajmohan, Shenglin Zhang, Wenwei Gu, Yihang Lin, Yongqian Sun, Zhimin Chen.

**Figure 2.** Overview of the PROBE framework. PROBE organizes failed-run telemetry into a typed telemetry bundle (T), constructs structured evidence (E), derives structured diagnosis (D), and produces bounded recovery guidance (G), forming a recovery loop for subsequent attempts. Panel labels recovered from the figure: Step 1. Failure Localization; Repeated Tool Failures; Execution Errors; Unstable Intent Transitions; Inefficient Action Patterns; Claimed vs. Obs…

**Figure 3.** Detailed workflow of the Diagnosis Layer. Adjacent paper text captured with the caption: "… workflow state, and evaluator-side observations. It helps distinguish agent-side behavioral failures from environment-side constraints. Optional external outcome evidence. T_outcome is included when evaluator-side feedback is available, such as benchmark verdicts, test results, incident-resolution checks, or task-specific scoring scripts. It grounds telemetry agai…"

**Figure 4.** AIOpsLab case illustrating how PROBE preserves failed-run evidence, derives a structured diagnosis, and produces bounded recovery guidance for the subsequent attempt. Adjacent paper text captured with the caption: "… tasks overall, but their Top-1 accuracy remains much lower than PROBE, at 14.79% and 21.79%. This suggests that generic trace summaries provide useful runtime context, but often fail to preserve the signal-specific anchors needed to identi…"

Original abstract

Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manual, and ad hoc. Existing systems expose traces or generate follow-up feedback, but they do not convert heterogeneous runtime evidence into grounded, bounded recovery guidance for a subsequent attempt. We present PROBE, a failure-anchored framework for structured recovery in software engineering agents. PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate. The Telemetry Layer preserves fine-grained runtime signals, the Diagnosis Layer fuses cross-signal evidence into grounded diagnoses, and the Guidance Gate produces diagnosis-derived guidance only when it is evidence-grounded, actionable, and within the scope of agent-side behavior. We evaluate PROBE across three settings: repository-level software repair, enterprise workflow recovery, and AIOps service mitigation. On 257 initially unresolved cases, PROBE achieves 65.37% Top-1 diagnosis accuracy and a 21.79% recovery rate, outperforming the strongest non-PROBE baseline by 43.58 and 12.45 percentage points. The results reveal a diagnosis-recovery gap: accurate diagnosis is necessary but insufficient unless translated into bounded guidance that a subsequent attempt can execute and verify. Beyond controlled evaluation, a Microsoft IcM prototype shows that PROBE can attach as a non-intrusive side channel to existing service-diagnosis workflows without changing the agent policy, toolset, or execution budget. These results suggest that telemetry-grounded, failure-anchored recovery can improve post-failure recoverability under realistic engineering constraints.
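The flow the abstract describes, from telemetry to evidence to diagnosis to guidance, can be sketched as a chain of typed records. This is an illustrative toy under stated assumptions, not PROBE's actual schema: the class names, the single "repeated tool failure" signal (one of the categories listed in Figure 2), the `agent_retry_loop` label, and the guidance wording are all invented for the example.

```python
from __future__ import annotations
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:                  # one record in a hypothetical telemetry bundle T
    step: int
    tool: str
    ok: bool
    message: str = ""

@dataclass(frozen=True)
class Evidence:                  # structured evidence E, anchored to step numbers
    kind: str
    detail: str
    steps: tuple[int, ...]

@dataclass(frozen=True)
class Diagnosis:                 # structured diagnosis D, carrying its evidence
    label: str
    evidence: tuple[Evidence, ...]

def build_evidence(bundle: list[ToolCall], min_repeats: int = 2) -> list[Evidence]:
    """Turn repeated identical tool failures into evidence records."""
    failures = Counter((c.tool, c.message) for c in bundle if not c.ok)
    out = []
    for (tool, message), n in failures.items():
        if n >= min_repeats:
            steps = tuple(c.step for c in bundle
                          if not c.ok and c.tool == tool and c.message == message)
            out.append(Evidence("repeated_tool_failure", f"{tool}: {message} (x{n})", steps))
    return out

def diagnose(evidence: list[Evidence]) -> Diagnosis | None:
    """Emit a diagnosis only when there is evidence to ground it."""
    if not evidence:
        return None
    return Diagnosis("agent_retry_loop", tuple(evidence))

def guide(diagnosis: Diagnosis | None) -> list[str]:
    """Derive guidance G from the diagnosis; no diagnosis means no guidance."""
    if diagnosis is None:
        return []
    return [f"Stop repeating the failing call ({e.detail}); re-check its inputs first."
            for e in diagnosis.evidence]
```

Each stage keeps a pointer back to the telemetry that produced it, which is what would let a later filter check that guidance is grounded before it reaches the agent.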

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PROBE, a failure-anchored framework for structured recovery in software engineering agents. It organizes failed-run telemetry via a Telemetry Layer, fuses evidence into diagnoses via a Diagnosis Layer, and uses a Guidance Gate to emit only evidence-grounded, actionable, and in-scope recovery guidance. Evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow recovery, and AIOps mitigation, PROBE reports 65.37% Top-1 diagnosis accuracy and 21.79% recovery rate, outperforming the strongest baseline by 43.58 and 12.45 percentage points respectively. A Microsoft IcM prototype demonstrates non-intrusive attachment to existing workflows without altering agent policy, toolset, or budget. The work also identifies a diagnosis-recovery gap.

Significance. If the reported performance gains hold under rigorous validation, the work would be significant for SE agent research by providing a concrete mechanism to convert heterogeneous failure signals into bounded, executable guidance without modifying the underlying agent. The multi-domain evaluation (257 cases) and real-world prototype are positive elements that suggest practical utility. The diagnosis-recovery gap observation is a useful framing. However, the significance is limited by incomplete reporting on the core filtering mechanism that underpins the claimed improvements.

major comments (2)

[Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.
[Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.

minor comments (2)

The paper would benefit from a dedicated limitations or threats-to-validity subsection that explicitly discusses how scope enforcement was validated and whether any guidance was rejected post-hoc.
[Evaluation] Clarify the exact definitions of 'Top-1 diagnosis accuracy' and 'recovery rate' (e.g., success criteria, verification method) in the evaluation setup.
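To make the request for metric definitions concrete, here is one plausible reading of the two metrics, written out as code. These definitions are an assumption for illustration and are not confirmed by the paper: `CaseResult` and its fields are hypothetical, Top-1 accuracy is taken as exact match between the top-ranked diagnosis and a reference label, and recovery is taken as a boolean verdict on the subsequent attempt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseResult:                # hypothetical per-case evaluation record
    gold_label: str              # reference diagnosis for the failed run
    ranked_diagnoses: tuple      # predicted diagnoses, best first
    recovered: bool              # did the subsequent attempt pass verification?

def top1_accuracy(cases):
    """Fraction of cases whose top-ranked diagnosis equals the reference label."""
    hits = sum(1 for c in cases
               if c.ranked_diagnoses and c.ranked_diagnoses[0] == c.gold_label)
    return hits / len(cases)

def recovery_rate(cases):
    """Fraction of initially unresolved cases resolved on the subsequent attempt."""
    return sum(1 for c in cases if c.recovered) / len(cases)

def diagnosed_but_not_recovered(cases):
    """Cases with a correct top-1 diagnosis that still failed to recover."""
    return [c for c in cases
            if c.ranked_diagnoses and c.ranked_diagnoses[0] == c.gold_label
            and not c.recovered]
```

Under this reading, the diagnosis-recovery gap corresponds to the cases returned by `diagnosed_but_not_recovered`: the diagnosis is right, yet the next attempt still fails.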

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional transparency can strengthen the empirical claims. Below we respond point-by-point to the major comments, indicating the revisions we will make.

Referee: [Evaluation (across three settings)] The central empirical claims (65.37% Top-1 diagnosis accuracy and 21.79% recovery rate on 257 cases, with +43.58 / +12.45 pp gains) rest on the Guidance Gate emitting only guidance that is executable within the agent's original policy, toolset, and budget. The abstract asserts the gate enforces 'evidence-grounded, actionable, and within the scope' conditions, yet the evaluation provides no fraction of cases filtered by the gate, no operational definition of scope (e.g., tool whitelist, token budget, or policy constraints), and no explicit check that accepted guidance never required new capabilities. This is load-bearing for interpreting whether the lift is reliable or an artifact of selective reporting.

Authors: We agree that the filtering behavior of the Guidance Gate requires explicit quantification to support the reported gains. The current manuscript describes the gate's three conditions at the design level (Section 3.3) but does not report the fraction of cases filtered nor provide an operational definition of scope. In the revision we will add a new paragraph to the evaluation section that states: (i) the exact number and percentage of cases for which the gate withheld guidance, (ii) the concrete criteria used to operationalize scope (tool whitelist membership, token-budget compliance, and policy-constraint checks as implemented in each domain), and (iii) the results of a post-hoc manual audit confirming that every emitted guidance item was executable with the original agent policy, toolset, and budget. These additions will be placed immediately before the main results tables.

Revision: yes

Referee: [Abstract and Evaluation] The abstract and results report concrete accuracy/recovery numbers on 257 cases but omit details on the baselines (beyond 'strongest non-PROBE'), error bars, data exclusion rules, statistical significance tests, or how the 257 cases were selected/partitioned. These omissions leave the performance claims only partially supported and make it difficult to assess robustness.

Authors: We acknowledge that the current reporting is insufficient for full reproducibility and robustness assessment. The manuscript identifies the strongest non-PROBE baseline but does not enumerate all systems compared, report variance, or detail case provenance. We will revise the evaluation section and results tables to: (i) list every baseline system with its configuration parameters, (ii) include error bars (standard deviation across the three domains or bootstrap estimates), (iii) report statistical significance (paired McNemar tests for diagnosis accuracy and recovery rate), and (iv) add a dedicated paragraph describing how the 257 unresolved cases were collected, any exclusion rules applied, and how they were partitioned across the repository-repair, workflow-recovery, and AIOps settings. These changes will appear in both the main text and an expanded appendix.

Revision: yes
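For reference, the paired McNemar test mentioned in the response compares two systems on the same cases and uses only the discordant pairs. The sketch below implements the exact (binomial) two-sided form using the standard library. The function names are illustrative, and no per-case outcomes from the paper are used, since none are given here.

```python
from math import comb

def discordant_counts(system_a, system_b):
    """Count paired cases where exactly one system succeeded.

    system_a and system_b are equal-length sequences of booleans,
    one entry per case, in the same case order.
    """
    if len(system_a) != len(system_b):
        raise ValueError("paired outcomes must have the same length")
    b = sum(1 for x, y in zip(system_a, system_b) if x and not y)  # only A succeeded
    c = sum(1 for x, y in zip(system_a, system_b) if y and not x)  # only B succeeded
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis, each discordant pair is equally likely to
    favor either system, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the test depends on per-case pairing, it needs the paired outcome table for each system pair; the aggregate percentages alone are not sufficient to compute it.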

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external tasks with concrete metrics

The paper presents PROBE as a framework with Telemetry Layer, Diagnosis Layer, and Guidance Gate, evaluated on 257 initially unresolved cases across repository-level repair, enterprise workflow, and AIOps settings. It reports specific empirical results (65.37% Top-1 diagnosis accuracy, 21.79% recovery rate) outperforming baselines by stated margins. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmarks and reported improvements rather than reducing to inputs by construction, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about the sufficiency of runtime telemetry for diagnosis and the translatability of diagnoses into agent-executable guidance; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

Domain assumption: Failed-run telemetry contains sufficient fine-grained signals to support grounded diagnoses. Invoked in the description of the Telemetry Layer and Diagnosis Layer.
Domain assumption: Diagnosis-derived guidance can be produced that is actionable within the agent's existing policy and toolset. Core premise of the Guidance Gate.

pith-pipeline@v0.9.0 · 5627 in / 1381 out tokens · 41699 ms · 2026-05-12T00:59:42.570340+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: The Guidance Gate acts as a grounding, actionability, and scope filter, converting the diagnosis into bounded recovery guidance only when it is telemetry-supported, actionable, and within the scope of agent-side behavior.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: PROBE organizes failed-run telemetry into structured evidence, structured diagnosis, and bounded recovery guidance through a Telemetry Layer, a Diagnosis Layer, and a Guidance Gate.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.