Learning Agent-Compatible Context Management for Long-Horizon Tasks

Jian-Yun Nie; Liuyi Yao; Lu Yi; Runlin Lei; Wenhao Zhang; Yaliang Li; Yuexiang Xie; Yuyang Li; Zhewei Wei

arxiv: 2605.30785 · v1 · pith:XDRHXY64new · submitted 2026-05-29 · 💻 cs.AI

Pith reviewed 2026-06-28 22:33 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · context management · reinforcement learning · long-horizon tasks · Fidelity-Reliability Trade-off · frozen agents · web search benchmarks

The pith

An external LLM trained with reinforcement learning can manage context for frozen agents on long-horizon tasks without retraining them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents struggle with long tasks because accumulated context causes degradation and reasoning failures. Prior approaches require training the agent itself or apply fixed strategies like summarization, which limit use with closed-source agents and ignore agent differences. AdaCoM instead trains a separate LLM to edit the agent's context through actions such as pruning stale parts while keeping task constraints and progress. Training uses end-to-end reinforcement learning based solely on the agent's final task outcomes. Experiments across agents on web search and deep research benchmarks show performance gains, along with a Fidelity-Reliability Trade-off where stronger agents keep more context and weaker ones need more compression.

Core claim

AdaCoM trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance).

What carries the argument

AdaCoM, an external LLM that applies modification actions to agent context and is trained via end-to-end RL on the agent's task outcomes.
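The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_episode`, `Rollout`, `keep_task_and_recent`, and `make_toy_agent` are hypothetical names, the context is simplified to a list of strings, and the toy manager is a fixed rule standing in for the trained LLM. The sketch also assumes, as this page describes, that the final task outcome is the only feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Context = List[str]  # one string per turn; a simplification for illustration


@dataclass
class Rollout:
    contexts_seen: List[Context] = field(default_factory=list)
    outcome: float = 0.0


def run_episode(
    task: str,
    manager: Callable[[Context], Context],  # trainable component, edits the context
    agent_step: Callable[[Context], str],   # frozen agent, never updated
    is_done: Callable[[str], bool],
    score: Callable[[str], float],
    max_steps: int = 30,
) -> Rollout:
    context: Context = [task]
    rollout = Rollout()
    last = ""
    for _ in range(max_steps):
        context = manager(context)                   # manager acts before each agent step
        rollout.contexts_seen.append(list(context))  # what the agent actually saw
        last = agent_step(context)
        context.append(last)
        if is_done(last):
            break
    rollout.outcome = score(last)                    # the only feedback signal in this sketch
    return rollout


def keep_task_and_recent(context: Context, window: int = 2) -> Context:
    """Toy stand-in for the manager: keep the task plus the latest turns."""
    return context[:1] + context[1:][-window:]


def make_toy_agent(finish_after: int) -> Callable[[Context], str]:
    """Toy stand-in for a frozen agent that answers after a fixed number of calls."""
    calls = {"n": 0}

    def step(context: Context) -> str:
        calls["n"] += 1
        return "FINAL: done" if calls["n"] >= finish_after else f"obs {calls['n']}"

    return step
```

The agent only ever receives the edited context, so the manager can be trained or swapped without touching the agent, which is the property the page highlights for closed-source models.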

Load-bearing premise

An external LLM can learn generalizable context modification strategies from only the agent's final performance outcomes without access to its internal state or gradients.

What would settle it

Testing AdaCoM on new long-horizon benchmarks with agents of varying capabilities would settle it: the claim fails if there are no consistent performance gains, or if the Fidelity-Reliability pattern reverses.

Figures

Figures reproduced from arXiv: 2605.30785 by Jian-Yun Nie, Liuyi Yao, Lu Yi, Runlin Lei, Wenhao Zhang, Yaliang Li, Yuexiang Xie, Yuyang Li, Zhewei Wei.

**Figure 1.** Overview of Adaptive Context Management (AdaCoM). Before each agent step, an external LLM manages the context presented to the frozen agent. Task feedback updates only the manager, enabling AdaCoM to discover agent-compatible context management strategies without training the underlying agent.

**Figure 2.** Per-agent trajectory outcome distribution shifts (%) on BrowseComp-Plus, computed as AdaCoM minus ReAct. [Plot text recovered from the page: agents DeepSeek-V3, Kimi-K2-Instruct, Qwen3-max, GLM-4.5-Air; axes "AdaCoM step" and "Context length (tokens)"; series "mean" and "median".]

**Figure 4.** Cross-agent transfer of AdaCoM on BrowseComp-Plus. Each group on the …

**Figure 5.** Cross-agent transfer of AdaCoM on MCP-Bench-Wiki. Each group on the x-axis is a target agent; bars within a group show MCP-Bench-Wiki mean@3 under ReAct and the two trained AdaCoMs.

Original abstract

LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation - making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaCoM puts an external RL-trained manager on top of frozen agents and reports a capability-linked fidelity trade-off, but the abstract supplies no numbers or ablations to judge whether the RL actually learns usable policies.

The paper trains an external LLM to edit context for a frozen agent using end-to-end RL on final task reward. The manager can insert, delete, or summarize turns, and the only training signal is whether the whole long-horizon run succeeds. They test this on web search and deep research benchmarks across several agents and claim clear gains plus a pattern: stronger base agents do better with higher-fidelity preservation while weaker ones need heavier pruning.
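To make the insert, delete, and summarize edits concrete, here is a hedged sketch of applying a manager's output to a list of turns. The JSON schema (`op`, `index`, `start`, `end`, `text`) and the function name `apply_modifications` are assumptions made for illustration; the page does not specify the paper's action format. The rule that turn 0 (the task statement) is never removed is also an assumption, meant to mirror the stated goal of preserving task constraints.

```python
import json
from typing import List


def apply_modifications(context: List[str], manager_output: str) -> List[str]:
    """Apply hypothetical edit operations to a copy of the context.

    Operations are assumed not to overlap; indices refer to the original context.
    """
    mods = json.loads(manager_output).get("modifications", [])
    edited = list(context)

    def position(m: dict) -> int:
        return m["start"] if m["op"] == "summarize" else m["index"]

    # Apply from the highest position down so earlier indices stay valid.
    for m in sorted(mods, key=position, reverse=True):
        if m["op"] == "delete":
            if m["index"] == 0:
                continue  # assumed rule: never drop the task statement
            del edited[m["index"]]
        elif m["op"] == "summarize":
            start = max(m["start"], 1)  # assumed rule: the task statement stays verbatim
            edited[start:m["end"] + 1] = [m["text"]]
        elif m["op"] == "insert":
            edited.insert(m["index"], m["text"])
        else:
            raise ValueError(f"unknown op: {m['op']}")
    return edited


context = [
    "task: find the author",
    "search A",
    "result A (stale)",
    "search B",
    "result B",
]
output = json.dumps({"modifications": [
    {"op": "summarize", "start": 1, "end": 2, "text": "A was a dead end"},
    {"op": "insert", "index": 5, "text": "hint: verify the year"},
]})
edited = apply_modifications(context, output)
```

An empty `modifications` list leaves the context unchanged, which gives the manager an explicit no-op when nothing is stale.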

What is actually new is the external-manager setup itself. Prior work either fine-tunes the agent or applies fixed summarization; this keeps the agent untouched, which matters for closed-source models. The transfer result across agents of similar vanilla ReAct strength is also a concrete observation worth checking.

The soft spots are straightforward. The abstract gives no numbers, no baselines, no error bars, and no ablation on the RL objective or action space, so the size of the improvement cannot be assessed. More critically, the reward is a single scalar at the end of a long trajectory. Credit assignment to individual context edits is therefore extremely sparse. Without any access to the agent's hidden state or per-step reasoning quality, it is not obvious that the manager can learn stable, non-overfit policies rather than brittle heuristics that happen to work on the training distribution. The stress-test concern about delayed sparse signals therefore lands directly on the central claim.
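The credit-assignment concern can be illustrated with a small sketch. The page does not name the RL algorithm, so the group-mean baseline below is a generic choice rather than the paper's method, and `edit_level_advantages` is a hypothetical name. The point is only that, with a single terminal reward, every edit in a trajectory receives the same learning signal.

```python
from statistics import mean
from typing import List


def edit_level_advantages(
    outcomes: List[float],         # one terminal reward per rollout of the same task
    edits_per_rollout: List[int],  # number of manager edits in each rollout
) -> List[List[float]]:
    """Broadcast (outcome - group mean) to every edit in the rollout."""
    if len(outcomes) != len(edits_per_rollout):
        raise ValueError("need one edit count per outcome")
    baseline = mean(outcomes)
    return [[r - baseline] * n for r, n in zip(outcomes, edits_per_rollout)]


# Hypothetical group of four rollouts: two successes, two failures.
advantages = edit_level_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
```

A helpful prune and a harmful one inside the same failed rollout get the same negative value, and a group in which every rollout succeeds or every rollout fails yields all-zero advantages, so no signal at all.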

This is for people who deploy agents in production and need a reusable context layer. It is worth sending to referees so the experiments can be examined, but the current write-up does not yet show that the RL component works reliably.

Referee Report

3 major / 2 minor

Summary. The paper introduces Adaptive Context Management (AdaCoM), an external LLM trained via end-to-end reinforcement learning to perform flexible context modifications (preserve constraints/progress, prune stale content) on frozen LLM agents for long-horizon tasks such as web search and deep research. It claims substantial performance gains across diverse agents, identifies a Fidelity-Reliability Trade-off (higher vanilla ReAct agents benefit from high-fidelity preservation; lower-performing agents need aggressive compression), and shows that learned managers transfer best across agents of similar capability.

Significance. If the empirical claims hold with rigorous validation, AdaCoM would provide a practical route to context management for closed-source agents without retraining them and could yield reusable managers for similar-capability agents. The trade-off observation, if reproducible, would be a useful empirical regularity for agent system design.

major comments (3)

[Abstract and §4 (Experiments)] The central claim of 'substantially improves performance' across agents and benchmarks is stated without any quantitative numbers, error bars, baseline comparisons, or ablation results in the abstract; the reader's report confirms the same absence of verifiable data, so the magnitude and reliability of the gains cannot be assessed.

[§3 (Method) and skeptic note] The end-to-end RL objective optimizes only against the scalar final-task outcome of a long-horizon trajectory; because the manager has no access to the frozen agent's internal state, hidden activations, or per-step reasoning quality, credit assignment for individual context edits is extremely sparse and delayed, directly undermining the claim that the learned policies reliably discover 'preserve constraints + prune stale' strategies.

[§4.3 (Transfer experiments)] The reported generalization 'most effectively across agents with similar capability' is presented as an observed pattern, but without details on how capability similarity was quantified, how many agents were tested, or controls for task difficulty, it is impossible to determine whether the transfer result is robust or an artifact of the chosen agent set.

minor comments (2)

[§3] Notation for the modification actions and the RL reward formulation should be introduced with explicit equations rather than prose descriptions.
[§4] Figure captions and axis labels for any performance or trade-off plots need to state the exact metrics, number of runs, and confidence intervals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the abstract to include quantitative results and update §4.3 with additional experimental details. For the RL credit assignment concern, we will add further analysis while noting the limitations of sparse rewards. Point-by-point responses are below.

Referee: [Abstract and §4 (Experiments)] The central claim of 'substantially improves performance' across agents and benchmarks is stated without any quantitative numbers, error bars, baseline comparisons, or ablation results in the abstract; the reader's report confirms the same absence of verifiable data, so the magnitude and reliability of the gains cannot be assessed.

Authors: We agree the abstract should report specific numbers to allow immediate assessment of gains. The revised abstract will include key quantitative improvements (e.g., average success rate deltas with standard deviations across agents), baseline comparisons, and references to ablations and error bars already detailed in §4 tables. This directly addresses verifiability without altering the underlying results.

revision: yes

Referee: [§3 (Method) and skeptic note] The end-to-end RL objective optimizes only against the scalar final-task outcome of a long-horizon trajectory; because the manager has no access to the frozen agent's internal state, hidden activations, or per-step reasoning quality, credit assignment for individual context edits is extremely sparse and delayed, directly undermining the claim that the learned policies reliably discover 'preserve constraints + prune stale' strategies.

Authors: The sparse and delayed nature of the final-outcome reward is inherent to the end-to-end setup and limits direct causal attribution of individual edits. We identify the preserve/prune behaviors via post-hoc inspection of manager actions on held-out trajectories rather than claiming the RL process alone guarantees their discovery. In revision we will add action-frequency statistics and note this limitation explicitly; the cross-agent performance gains remain the primary empirical support.

revision: partial

Referee: [§4.3 (Transfer experiments)] The reported generalization 'most effectively across agents with similar capability' is presented as an observed pattern, but without details on how capability similarity was quantified, how many agents were tested, or controls for task difficulty, it is impossible to determine whether the transfer result is robust or an artifact of the chosen agent set.

Authors: We will expand §4.3 to state that capability similarity is quantified by each agent's vanilla ReAct success rate on the identical benchmark tasks. Transfer was evaluated across five agents. All transfer pairs use the same web-search and deep-research task distributions to control difficulty; we will also report a random-pairing control to show the similarity-based pattern exceeds chance.

revision: yes
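One way the promised random-pairing control could be computed is sketched below. Everything here is an assumption for illustration: the function names, the agent labels, and the rule of picking the source agent whose vanilla ReAct success rate is closest to the target's. No numbers from the paper are used.

```python
import random
from statistics import mean
from typing import Dict, Tuple


def nearest_source(target: str, react_score: Dict[str, float]) -> str:
    """Source agent whose vanilla ReAct score is closest to the target's."""
    others = [a for a in react_score if a != target]
    return min(others, key=lambda a: abs(react_score[a] - react_score[target]))


def similarity_vs_random(
    react_score: Dict[str, float],
    transfer_gain: Dict[Tuple[str, str], float],  # (source, target) -> gain over ReAct
    trials: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean gain under capability matching, mean gain under random pairing,
    and the fraction of random pairings that do at least as well."""
    rng = random.Random(seed)
    agents = list(react_score)
    matched = mean(transfer_gain[(nearest_source(t, react_score), t)] for t in agents)
    random_means = []
    for _ in range(trials):
        random_means.append(mean(
            transfer_gain[(rng.choice([a for a in agents if a != t]), t)]
            for t in agents
        ))
    at_least_as_good = mean(1.0 if r >= matched else 0.0 for r in random_means)
    return matched, mean(random_means), at_least_as_good
```

If the last returned value is small, randomly chosen sources rarely match capability-matched ones, which is the kind of evidence the referee asks for; with only five agents the estimate would still be coarse.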

Circularity Check

0 steps flagged

No significant circularity; empirical RL method with observed trade-off

The paper presents an empirical approach: training an external LLM manager via end-to-end RL on downstream agent success for context modifications. No equations, fitted parameters, or derivation chain are described in the provided text. The Fidelity-Reliability Trade-off is reported as an observed pattern across agents, not derived from or equivalent to any input by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on experimental results rather than reducing to self-defined quantities or prior author work by definition. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The method implicitly assumes RL can optimize context actions from outcome rewards alone.

pith-pipeline@v0.9.1-grok · 5747 in / 1067 out tokens · 13662 ms · 2026-06-28T22:33:29.796583+00:00 · methodology

