Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Amos Storkey; Hongru Wang; Jeff Z. Pan; Rishabh Sabharwal

arxiv: 2606.09748 · v1 · pith:GISE5V3Rnew · submitted 2026-06-08 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-06-27 16:17 UTC · model grok-4.3

keywords: deep research agents · multi-turn evaluation · process-level feedback · self-reflection · research gap inference · rubric criteria · agent improvement

The pith

Even with targeted process-level feedback, deep research agents fail to achieve reliable multi-turn improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether deep research agents can revise reports across multiple turns when given feedback. It compares self-reflection, where agents revise without external signals, to process-level feedback that uses Research Gap Inference to target specific gaps in research strategy based on rubric patterns. Self-reflection produces no net gain because agents incorporate and drop criteria at similar rates. One round of process-level feedback raises scores by 8-15 points with 35-40 percent incorporation, but further turns produce regression on up to 24 percent of previously satisfied criteria when the full report is rewritten. The central result is that reliable compounding improvement stays out of reach for the architectures tested.

Core claim

A single round of process-level feedback using Research Gap Inference yields substantial gains, raising normalized scores by approximately 8-15 points with a 35-40 percent incorporation rate of previously unsatisfied criteria. These gains do not compound over subsequent turns because agents regress on up to 24 percent of previously satisfied criteria when rewriting the full report to address remaining gaps. Under self-reflection alone, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement.
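The two rates in this claim can be made concrete with a minimal sketch. It assumes each turn is summarized as the set of rubric criteria it satisfies, that incorporation is measured against previously unsatisfied criteria, and that regression is measured against previously satisfied ones, as the wording above suggests. The function name and criterion ids are hypothetical, not taken from the paper's code.

```python
def turn_transition_rates(prev_satisfied, curr_satisfied, all_criteria):
    """Return (incorporation_rate, regression_rate) between two consecutive turns."""
    prev_satisfied = set(prev_satisfied)
    curr_satisfied = set(curr_satisfied)
    prev_unsatisfied = set(all_criteria) - prev_satisfied

    # Incorporated: unsatisfied at the previous turn, satisfied now.
    incorporated = prev_unsatisfied & curr_satisfied
    # Regressed: satisfied at the previous turn, no longer satisfied.
    regressed = prev_satisfied - curr_satisfied

    inc_rate = len(incorporated) / len(prev_unsatisfied) if prev_unsatisfied else 0.0
    reg_rate = len(regressed) / len(prev_satisfied) if prev_satisfied else 0.0
    return inc_rate, reg_rate


# Hypothetical rubric with ten criteria.
criteria = [f"c{i}" for i in range(1, 11)]
turn1 = {"c1", "c2", "c3", "c4", "c5"}
turn2 = {"c1", "c2", "c3", "c4", "c6", "c7"}  # gained c6 and c7, lost c5

inc, reg = turn_transition_rates(turn1, turn2, criteria)
print(inc, reg)  # 0.4 0.2
```

In this toy case the agent incorporates 2 of 5 previously unsatisfied criteria and regresses on 1 of 5 previously satisfied ones.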

What carries the argument

Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer specific research-process gaps and supply targeted guidance.
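As a rough illustration of the input side of RGI, the sketch below groups rubric results by axis and outcome while leaving out the PQ axis, which Figure 1 says is excluded. It covers only the bookkeeping step. How the grouped patterns become feedback is not reproduced here (Figure 6 indicates a dedicated prompt), and the tuple layout, function name, and example criteria are assumptions made for illustration.

```python
FEEDBACK_AXES = ("FA", "BD", "CQ")  # PQ is excluded, following Figure 1


def group_rubric_results(results):
    """Group criterion texts by axis and outcome.

    `results` is an iterable of (axis, criterion_text, satisfied) tuples;
    this layout is a hypothetical choice, not the paper's data format.
    """
    grouped = {axis: {"satisfied": [], "unsatisfied": []} for axis in FEEDBACK_AXES}
    for axis, text, satisfied in results:
        if axis not in grouped:
            continue  # skip axes that do not feed the gap inference (e.g. PQ)
        key = "satisfied" if satisfied else "unsatisfied"
        grouped[axis][key].append(text)
    return grouped


# Hypothetical rubric results for one report.
example = [
    ("FA", "States the Q1 2025 operating cash flow", True),
    ("FA", "Uses period-specific quarterly figures", False),
    ("BD", "Links working capital changes to cash conversion", False),
    ("PQ", "Report is clearly structured", True),
]
patterns = group_rubric_results(example)
print(patterns["FA"]["unsatisfied"])
```

A pattern such as several unsatisfied criteria that all depend on quarterly filings is the kind of signal the method is described as turning into process-level guidance.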

If this is right

Self-reflection alone produces negligible net improvement because incorporation and regression rates are nearly equal.
One round of process-level feedback produces clear gains of 8-15 normalized points and 35-40 percent incorporation of new criteria.
Subsequent turns after the first feedback round fail to compound gains because agents regress on up to 24 percent of previously satisfied criteria.
The architectures tested cannot reliably convert repeated targeted guidance into sustained report improvement.
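A back-of-envelope sketch can show why a healthy incorporation rate does not guarantee net progress. It assumes unweighted criteria and treats the two rates as applying uniformly to the unsatisfied and satisfied pools; the paper's normalized score may weight criteria differently. The starting fractions are hypothetical, 0.375 is simply the midpoint of the reported 35-40 percent range, and 0.24 is the reported upper bound on regression.

```python
def expected_net_change(satisfied_frac, incorporation_rate, regression_rate):
    """Expected change in the fraction of satisfied criteria after one revision."""
    gained = (1.0 - satisfied_frac) * incorporation_rate  # newly satisfied
    lost = satisfied_frac * regression_rate               # previously satisfied, now lost
    return gained - lost


# Hypothetical starting points: a weaker and a stronger report.
for start in (0.4, 0.6):
    delta = expected_net_change(start, incorporation_rate=0.375, regression_rate=0.24)
    print(f"start={start:.1f}  net change={delta:+.3f}")
```

Under these assumptions a report that already satisfies 60 percent of its criteria gains almost nothing (about +0.006), while one at 40 percent still gains about +0.129. The more a report already satisfies, the more regression eats into what is gained.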

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures may need explicit mechanisms to preserve already-satisfied criteria during revisions rather than regenerating the full report.
The observed regression pattern suggests a possible tension between addressing new gaps and retaining earlier quality that could be tested by tracking criterion stability per turn.
Extending the evaluation to agents with external memory of prior feedback turns could reveal whether regression stems from loss of context during rewriting.
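Tracking criterion stability per turn, as suggested above, could look like the following sketch. It assumes each turn is recorded as the set of satisfied criterion ids and counts how often each criterion changes state between consecutive turns. The function name and ids are hypothetical, and criteria that are never satisfied in any turn do not appear in the output.

```python
def criterion_flip_counts(history):
    """Count state changes per criterion across consecutive turns.

    `history` is a list of sets of satisfied criterion ids, one set per turn.
    """
    seen = set().union(*history)
    flips = {cid: 0 for cid in seen}
    for prev, curr in zip(history, history[1:]):
        # The symmetric difference holds every criterion whose state changed.
        for cid in prev ^ curr:
            flips[cid] += 1
    return flips


# Hypothetical three-turn history.
history = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(sorted(criterion_flip_counts(history).items()))
# [('a', 0), ('b', 2), ('c', 1)]
```

Criteria with high flip counts are the ones lost and regained during full rewrites, which makes them natural candidates for an explicit preservation mechanism.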

Load-bearing premise

The evaluation assumes that Research Gap Inference accurately identifies meaningful gaps in the agent's research process from the rubric patterns.

What would settle it

A demonstration that agents maintain or increase the fraction of satisfied criteria across three or more successive turns without regressing on prior gains would falsify the claim that reliable multi-turn improvement is out of reach.
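One way to operationalize this condition is sketched below, under the assumption that each turn is recorded as the set of satisfied criterion ids and that "without regressing" means every earlier set is contained in the next one. The function name and the `min_turns` default are illustrative choices.

```python
def shows_compounding_improvement(history, min_turns=3):
    """True if no satisfied criterion is ever lost across at least `min_turns` turns."""
    if len(history) < min_turns:
        return False
    # prev <= curr is the subset test: everything satisfied before stays satisfied,
    # so the satisfied fraction can only hold steady or grow.
    return all(prev <= curr for prev, curr in zip(history, history[1:]))


stable = [{"a"}, {"a", "b"}, {"a", "b", "c"}]
regressing = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]  # loses "b" at the last turn
print(shows_compounding_improvement(stable), shows_compounding_improvement(regressing))
# True False
```

A looser variant could tolerate a small regression rate per turn; the strict subset test is used here because the statement above asks for no regression on prior gains.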

Figures

Figures reproduced from arXiv: 2606.09748 by Amos Storkey, Hongru Wang, Jeff Z. Pan, Rishabh Sabharwal.

**Figure 1.** Process-level feedback generation. Given a report $r_{t-1}$ evaluated against the DRACO rubric, RGI analyzes patterns of satisfied and unsatisfied criteria from FA, BD, and CQ (excluding PQ) to infer research-process gaps and generate process-level feedback $f_{t-1}$ for the next turn. Example criteria shown for illustration; negative-weight criteria excluded for simplicity.

**Figure 2.** Task-level Turn 3 headroom analysis. Each point represents one task, with the T2 normalized score on the x-axis and the change in T2 → T3 score on the y-axis. Blue circles denote tasks where T3 improved; orange crosses denote tasks where T3 did not improve over T2. T3 gains are concentrated among low-scoring T2 tasks, while degradations cluster with stronger T2 performance.

**Figure 3.** Case studies illustrating contrasting outcomes of process-level feedback. Each panel shows a summary of the process-level feedback (left; full text in Appendix C.5), per-axis normalized scores at Turn 1 and Turn 2 (center), and representative criteria incorporated and regressed (right). Top: Process-level feedback drives recovery (Task 021); overall normalized score improves from 50.0 to 79.0 (+29.0). Bott…

**Figure 4.** Task query and RGI feedback for Case 1 (Task 021).

**Figure 5.** Task query and RGI feedback for Case 2 (Task 004).

**Figure 6.** RGI feedback generation prompt.

D.4. Agent Revision Prompt. At each revision turn, the agent receives the original query, its previous report, and the feedback concatenated into the following prompt template:

User Prompt
You previously wrote a research report on the following query:
--- ORIGINAL QUERY ---
{original query}
--- YOUR PREVIOUS REPORT ---
{prev report}
--- USER FEEDBACK ---
{feedback}
Please rev…

**Figure 7.** Agent revision prompt template. Placeholders are filled with the original query, previous report, and process-level feedback at each turn.
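The revision prompt quoted alongside Figure 6 can be assembled as in the sketch below. Only the visible part of the template is used. The exact line breaks and the underscore-style placeholder names are assumptions, and the closing instruction is truncated in the source, so it is passed in as a parameter rather than guessed.

```python
REVISION_TEMPLATE = (
    "You previously wrote a research report on the following query:\n"
    "--- ORIGINAL QUERY ---\n{original_query}\n"
    "--- YOUR PREVIOUS REPORT ---\n{prev_report}\n"
    "--- USER FEEDBACK ---\n{feedback}\n"
    "{closing_instruction}"
)


def build_revision_prompt(original_query, prev_report, feedback, closing_instruction):
    """Fill the revision template for one turn (hypothetical helper)."""
    return REVISION_TEMPLATE.format(
        original_query=original_query,
        prev_report=prev_report,
        feedback=feedback,
        closing_instruction=closing_instruction,
    )


prompt = build_revision_prompt(
    original_query="<task query>",
    prev_report="<report from the previous turn>",
    feedback="<process-level feedback from RGI, or empty for self-reflection>",
    closing_instruction="<closing instruction, truncated in the source>",
)
print(prompt)
```

Because the full previous report is passed back in and the agent is asked to produce a revised report, each turn is a rewrite rather than a patch, which is the setting in which the regression on previously satisfied criteria is observed.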

Original abstract

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately $8$-$15$ points and yielding a roughly $35$-$40\%$ incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Process-level feedback lifts DRA scores in one turn but regressions on prior criteria block compounding gains, with RGI's accuracy unvalidated.

The main takeaway is that even targeted process-level feedback fails to produce reliable multi-turn improvement in the DRAs tested, because agents regress on up to 24% of previously satisfied rubric criteria when rewriting to fix remaining gaps. Self-reflection yields no net progress at all.

The paper does something useful by shifting evaluation from single-shot outputs to multi-turn revision under two feedback regimes. The Research Gap Inference method is a concrete way to turn rubric patterns into guidance, and the reported incorporation rates (35-40%) plus the regression numbers give a measurable picture of where the agents fall short. Public code and results make the claims checkable.

The soft spot is that RGI itself has no reported validation—no human agreement scores, no ablation on inference quality, no error analysis. If the inferred gaps are often off-target, the non-compounding result could reflect noisy feedback rather than an intrinsic limit in the agent architectures. The abstract gives effect sizes but the full paper needs to show the number of agents, runs, and statistical handling to make the numbers convincing.

This is for people working on research agents or agent benchmarks. Anyone thinking about iterative improvement will find the setup and the negative multi-turn result worth seeing. It deserves peer review because the question matters and the experiment is a clear step past existing single-shot work, even if RGI needs more grounding.

Referee Report

1 major / 2 minor

Summary. The paper claims that deep research agents (DRAs) cannot achieve reliable multi-turn improvement even under targeted process-level feedback. It introduces Research Gap Inference (RGI) to generate feedback by analyzing patterns of satisfied/unsatisfied rubric criteria, then reports three findings from multi-turn evaluations: (i) self-reflection produces negligible net gains due to equal incorporation and regression rates; (ii) one round of process-level feedback raises normalized scores by 8-15 points with 35-40% incorporation; (iii) gains fail to compound because agents regress on up to 24% of previously satisfied criteria in later turns. The conclusion is that reliable iterative improvement remains out of reach for the evaluated DRA architectures. Code and results are released publicly.

Significance. If the results hold, the work supplies concrete evidence that current DRA architectures struggle with maintaining and building on progress across multiple turns, even when given diagnostic feedback on research-process gaps. This has implications for the design of iterative research agents. The public GitHub release of code and results is a clear strength, directly supporting reproducibility and follow-up experiments.

major comments (1)

[Abstract and RGI method description] Abstract and description of RGI: The central claim—that targeted process-level feedback still fails to produce reliable multi-turn improvement—depends on RGI accurately mapping rubric patterns to meaningful, actionable research-process gaps. The manuscript supplies no validation of RGI (human agreement rates, ablation against oracle or random feedback, or error analysis), so the reported 24% regression rate and non-compounding gains could be artifacts of mis-targeted or noisy feedback rather than intrinsic DRA limitations.

minor comments (2)

[Abstract] Abstract: Quantitative claims (8-15 point gains, 35-40% incorporation, 24% regression) are presented without any mention of the number of trials, agents evaluated, statistical tests, or variance, which weakens assessment of robustness.
[Evaluation metrics] The definition of 'normalized score' and the exact rubric criteria used should be stated explicitly in the main text (not only supplementary) to allow independent verification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback highlighting the need for validation of the Research Gap Inference (RGI) method. We address this point directly below and outline planned revisions.

Referee: [Abstract and RGI method description] Abstract and description of RGI: The central claim—that targeted process-level feedback still fails to produce reliable multi-turn improvement—depends on RGI accurately mapping rubric patterns to meaningful, actionable research-process gaps. The manuscript supplies no validation of RGI (human agreement rates, ablation against oracle or random feedback, or error analysis), so the reported 24% regression rate and non-compounding gains could be artifacts of mis-targeted or noisy feedback rather than intrinsic DRA limitations.

Authors: We acknowledge that the current manuscript does not report quantitative validation of RGI, such as inter-annotator agreement with human experts, ablations against random or oracle feedback, or a dedicated error analysis. RGI operates by deterministically mapping patterns of unsatisfied rubric criteria to inferred research-process gaps (e.g., repeated failure on 'source diversity' criteria triggers a gap in source acquisition strategy). The public code and result release enables independent verification and extension. The regression phenomenon appears consistently in both self-reflection and process-level conditions, which would be unlikely if driven purely by RGI noise. Nevertheless, we agree that explicit validation would strengthen the central claim and will add a new subsection with (i) a small-scale human agreement study on RGI outputs and (ii) an ablation comparing RGI-derived feedback against random feedback baselines in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation self-contained

The paper is an empirical study that compares DRA performance across feedback conditions using experimental runs on rubric-based metrics. RGI is introduced as a heuristic method for inferring gaps from rubric patterns, but the central claims (incorporation/regression rates, non-compounding gains) are measured outcomes from agent executions rather than derived predictions or fitted parameters. No equations, self-citations, or ansatzes are invoked as load-bearing steps that reduce the results to inputs by construction. The work reports public code for reproducibility and makes no uniqueness theorems or renamings of known results. This matches the default expectation for non-circular empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's contribution centers on the empirical evaluation and the RGI method; the central claim depends on the assumption that the rubric is a reliable proxy for research quality and that the tested agents represent typical DRA architectures.

axioms (1)

domain assumption: Rubric criteria provide a valid and comprehensive measure of research report quality.
The evaluation of improvement and gaps relies on this to quantify performance changes.

invented entities (1)

Research Gap Inference (RGI) · no independent evidence
purpose: To infer research-process gaps from patterns of satisfied and unsatisfied rubric criteria for providing process-level feedback.
Introduced as a new method to enable the process-level feedback setting.

pith-pipeline@v0.9.1-grok · 5789 in / 1143 out tokens · 41521 ms · 2026-06-27T16:17:56.179772+00:00 · methodology
