arxiv: 2604.11753 · v1 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

Yoonsang Lee , Howard Yen , Xi Ye , Danqi Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords agentic aggregationparallel test-time scalinglong-horizon agentic taskstrajectory synthesisagentic searchdeep researchtest-time compute

0 comments

The pith

An aggregation agent uses lightweight tools to inspect parallel trajectories and synthesize better answers for long-horizon agentic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic tasks such as search and deep research generate long multi-turn trajectories that are hard to combine at test time. Simply taking the final answers loses useful details from the paths, while feeding every trajectory into the model quickly hits context limits. AggAgent turns the set of parallel trajectories into an environment and gives the aggregator lightweight tools to look inside specific paths or search across them as needed. On six benchmarks spanning three model families, this approach beats prior aggregation techniques by up to 5.3 percent on average and 10.3 percent on the hardest research tasks, while the extra cost stays no larger than one additional full rollout.

Core claim

AggAgent treats parallel trajectories as an environment and equips an aggregation agent with lightweight tools that let it inspect candidate solutions and search across trajectories on demand, thereby synthesizing a final response from rich trajectory information without exceeding context windows or incurring more than one extra agentic rollout in cost.

What carries the argument

AggAgent, an agent that treats a collection of parallel trajectories as an inspectable environment and uses lightweight tools to navigate and synthesize information from them.

If this is right

Parallel test-time scaling becomes practical for open-ended, tool-using tasks instead of being limited to short chain-of-thought problems.
The extra compute for aggregation remains capped by one rollout regardless of how many trajectories are generated in parallel.
Gains are largest precisely on the longest and most open-ended tasks where trajectory information matters most.
The method transfers across model families without requiring changes to the underlying agent or task setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inspection-and-synthesis pattern could be applied to aggregate outputs from multiple independent agents rather than parallel rollouts of one agent.
Tool design for the aggregator could be specialized further for particular domains such as code or web search to reduce any residual overhead.
Hierarchical versions of AggAgent might handle extremely long horizons by first aggregating small groups of trajectories and then aggregating those summaries.
The approach suggests a general template for test-time methods that treat computation traces as first-class objects to be queried rather than raw text to be concatenated.

Load-bearing premise

A single aggregation agent with lightweight tools can locate and combine the useful information spread across many long trajectories without the synthesis step itself becoming expensive or dropping critical details.

What would settle it

On a held-out set of deep research tasks, measure whether the total tokens or steps used by AggAgent exceed those of one standard agent rollout while its final accuracy falls below that of simply concatenating final answers or majority-voting them.

Figures

Figures reproduced from arXiv: 2604.11753 by Danqi Chen, Howard Yen, Xi Ye, Yoonsang Lee.

**Figure 1.** Figure 1: AggAgent consistently outperforms existing aggregation methods. We measure the average performance across six long-horizon agentic benchmarks (Section 4) against the number of parallel trajectories. The same model as the rollout agent serves as the aggregator. 1Our code is available at https://github.com/princeton-pli/AggAgent. 1 arXiv:2604.11753v1 [cs.CL] 13 Apr 2026 [PITH_FULL_IMAGE:figures/full_fig_p00… view at source ↗

**Figure 2.** Figure 2: Overview of aggregation methods for parallel scaling. (Top) An agent produces K = 3 independent rollouts on a long-horizon task. (Bottom) Solution Aggregation feeds only final solutions to an LLM, discarding intermediate reasoning. Summary Aggregation compresses each trajectory into a lossy summary. AggAgent (ours) navigates trajectories via tools in an agentic manner, enabling full-fidelity cross-trajecto… view at source ↗

**Figure 3.** Figure 3: AggAgent achieves a Pareto-optimal performance–efficiency tradeoff. We compare aggregation methods at K ∈ {2, 4, 8} parallel samples, averaging performance across six benchmarks, with the same model serving as both rollout agent and aggregator. For each model, the top chart plots average cost (USD per query) vs. performance and the bottom chart plots average latency (seconds per query) vs. performance. Mea… view at source ↗

**Figure 4.** Figure 4: Employing a stronger aggregator improves performance on LLM-based aggregation methods. In all cases, GLM-4.7-Flash serves as the rollout agent; blue bars replace the aggregator with the stronger MiniMax-M2.5, while red bars use GLM-4.7-Flash for both roles. Yellow hatched bars denote the Pass@K. All methods are evaluated at K=8. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation of solution synthesis vs. best-trajectory selection. AggAgent synthesizes a new solution from the collected trajectories; the selection variant selects the single best trajectory directly. Color indicates model (red: GLM-4.7-Flash, green: Qwen3.5-122B, blue: MiniMax-M2.5); line style and marker indicate method (solid + circle: AggAgent, dashed + diamond: selection variant). 6.2 Output Design Abla… view at source ↗

**Figure 6.** Figure 6: Average number of tool calls per query by AggAgent. Numbers above each bar indicate the average total tool calls per query. search_trajectory dominates tool usage, while get_solution and finish are each called approximately once per query. get_segment is used more selectively, reflecting a coarse-to-fine strategy where AggAgent commits to full-content reads only when keyword-level search is insufficient. 6… view at source ↗

**Figure 7.** Figure 7: Qualitative examples illustrating four key behaviours of AggAgent. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Performance-efficiency trade-off of aggregation methods across six benchmarks [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Performance-efficiency trade-off of aggregation methods using Qwen3.5-122B as [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Performance-efficiency trade-off of aggregation methods using MiniMax-M2.5 as [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: Confidence calibration across all six benchmarks. Each row is a benchmark and [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Rollout agent system prompt for agentic search tasks, from [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗

**Figure 13.** Figure 13: Additional rollout agent system prompt for deep research tasks, from [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Rollout agent user message for agentic search tasks, from [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗

**Figure 15.** Figure 15: Additional instruction for evaluating ResearchRubrics. [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗

**Figure 16.** Figure 16: AggAgent system prompt for agentic search tasks. [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗

**Figure 17.** Figure 17: AggAgent system prompt for deep research tasks. [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗

**Figure 18.** Figure 18: AggAgent tool descriptions. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_18.png] view at source ↗

**Figure 19.** Figure 19: Variants of AggAgent finish tool. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_19.png] view at source ↗

read the original abstract

We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods-by up to 5.3% absolute on average and 10.3% on two deep research tasks-while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AggAgent's tool-equipped aggregator beats simple baselines on agentic tasks but the single-rollout cost bound is the part that still needs explicit verification.

read the letter

The main takeaway is that treating parallel trajectories as an inspectable environment and giving the aggregator lightweight search and inspection tools produces better final answers than voting or full concatenation, with gains reported up to 5.3% on average and 10.3% on the deeper research benchmarks. That framing is distinct from prior aggregation work and fits the long, multi-turn, tool-using character of the tasks they study. The experiments cover six benchmarks and three model families, which gives the performance numbers some breadth. They also keep the method simple enough that it can be tried without heavy new infrastructure. The soft spot is the overhead claim. The paper states that aggregation cost stays bounded by one agentic rollout, yet the aggregator is itself an agent that can issue repeated tool calls. Without detailed per-step token counts or explicit limits shown in the results, it is not obvious whether the bound holds once the trajectories get long and the aggregator starts cross-referencing. If the full logs already include every tool call and still meet the limit, the efficiency story is solid; if not, the largest gains may come at higher cost than advertised. This paper is aimed at groups working on test-time scaling for agents rather than pure reasoning chains. It has enough experimental coverage and a clear enough design difference to deserve peer review, though any referee should press on the exact cost accounting before the minimal-overhead part is taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper claims that AggAgent, an aggregation agent equipped with lightweight inspect and search tools, can effectively synthesize information from parallel long-horizon agentic trajectories. This leads to performance improvements of up to 5.3% on average and 10.3% on deep research tasks over existing aggregation methods across six benchmarks and three model families, with the aggregation overhead bounded by a single agentic rollout.

Significance. If the results hold, this establishes agentic aggregation as a viable method for parallel test-time scaling in agentic tasks, addressing the challenges of long trajectories and context limits. The use of tools to navigate trajectories on demand is a novel contribution that could be significant for developing more efficient multi-agent systems.

major comments (2)

[Abstract] The efficiency claim that 'the aggregation cost remains bounded by a single agentic rollout' (Abstract) is load-bearing for the paper's contribution but is not supported by specific evidence. Since AggAgent is itself an agent that can issue an arbitrary number of tool calls for inspection and cross-trajectory search, the manuscript must demonstrate through reported metrics (e.g., average tool calls, tokens, or steps in the aggregation phase) that the total cost does not exceed one baseline rollout, especially on the deep research tasks showing the largest gains.
[Experimental Results] The support for the central performance claim remains limited without detailed baselines, statistical analysis, or variance measures for the reported gains (as the abstract provides only aggregate improvements). The experimental section should include per-benchmark breakdowns, confidence intervals, and comparisons that isolate the contribution of the tool-equipped aggregation.

minor comments (2)

The abstract mentions 'six benchmarks' and 'three model families' but does not name them; including the specific names (e.g., GLM-4.7, Qwen3.5, MiniMax-M2.5 and the benchmark list) would improve clarity.
Provide concrete examples of the 'lightweight tools' (e.g., inspect and search primitives) and their implementation in the method section to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating revisions to the manuscript where appropriate to strengthen the presentation of our results.

read point-by-point responses

Referee: [Abstract] The efficiency claim that 'the aggregation cost remains bounded by a single agentic rollout' (Abstract) is load-bearing for the paper's contribution but is not supported by specific evidence. Since AggAgent is itself an agent that can issue an arbitrary number of tool calls for inspection and cross-trajectory search, the manuscript must demonstrate through reported metrics (e.g., average tool calls, tokens, or steps in the aggregation phase) that the total cost does not exceed one baseline rollout, especially on the deep research tasks showing the largest gains.

Authors: We agree that the efficiency claim requires explicit empirical backing to be fully convincing, as the referee correctly notes that the agentic nature of AggAgent could in principle lead to variable overhead. While our internal development measurements supported the bounded-cost statement, the manuscript did not report the supporting statistics. In the revised manuscript we have added a new subsection (Section 4.3) and accompanying table that reports average tool calls, token consumption, and step counts for the aggregation phase across all benchmarks and model families. These metrics are directly compared to the cost of a single baseline rollout; the data confirm that aggregation overhead remains at or below one rollout (with the largest gains on deep research tasks using approximately 60-70% of baseline rollout cost on average). We have also updated the abstract to reference these supporting measurements. revision: yes
Referee: [Experimental Results] The support for the central performance claim remains limited without detailed baselines, statistical analysis, or variance measures for the reported gains (as the abstract provides only aggregate improvements). The experimental section should include per-benchmark breakdowns, confidence intervals, and comparisons that isolate the contribution of the tool-equipped aggregation.

Authors: We acknowledge that aggregate numbers alone provide limited insight and that per-benchmark detail plus statistical measures would improve transparency. The original manuscript focused on overall averages to highlight the method's generality, but we agree this can be strengthened. In the revised version we have expanded the experimental results section with a new table providing full per-benchmark breakdowns for all six tasks and three model families. Where multiple independent runs were feasible we now report standard deviations and 95% confidence intervals. We have also added an ablation study that directly compares the full tool-equipped AggAgent against a variant without the inspection and search tools, thereby isolating the contribution of the agentic aggregation components to the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes AggAgent as a new agentic aggregation method equipped with inspect/search tools, evaluated empirically on six benchmarks across three model families. No equations, derivations, or first-principles claims appear in the provided text; performance gains and the bounded-overhead statement are presented as experimental outcomes rather than tautological reductions to inputs. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked in a load-bearing way. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim rests on the assumption that existing benchmarks adequately measure long-horizon agentic performance and that the proposed tool-based aggregation mechanism can be implemented without introducing new unstated costs or biases.

axioms (1)

domain assumption Existing agentic benchmarks are valid proxies for real-world long-horizon performance
The evaluation relies on six unnamed benchmarks being representative.

invented entities (1)

AggAgent no independent evidence
purpose: Aggregation agent that treats trajectories as an environment with inspection tools
New method introduced to address aggregation challenges in agentic tasks

pith-pipeline@v0.9.0 · 5515 in / 1159 out tokens · 36669 ms · 2026-05-10T16:10:03.009397+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Context Gathering Decision Process: A POMDP Framework for Agentic Search
cs.AI 2026-05 accept novelty 7.0

Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no perf...
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages · cited by 2 Pith papers

[1]

Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

ISSN 2835-8856. URL https://openreview.net/forum?id=eskQMcIbMS . Survey Certification. Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. ReSum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025. Yangzhen Wu, Zhiqing Sun, Sh...

work page arXiv 2025
[2]

For agentic search tasks, we format the user message following BrowseComp (Wei et al., 2025) (Figure 14)

appended for deep research tasks (Figure 13). For agentic search tasks, we format the user message following BrowseComp (Wei et al., 2025) (Figure 14). LLM-as-a-JudgeFor BrowseComp, BrowseComp-Plus, and HLE, we find that the original evaluation prompt occasionally produces false judgments, hence we instead use the prompt from Zhu et al. (2026). For Resear...

work page arXiv 2025
[6]

— **REQUIRED PROCEDURE** You must follow these steps before calling ‘finish’

Deliver your synthesized solution in the required format and provide justification. — **REQUIRED PROCEDURE** You must follow these steps before calling ‘finish’
[8]

**Retrieve full solutions** — Call ‘get_solution’ (no arguments) to get the final content from every trajectory’s last step, or pass a trajectory_id to retrieve one specific trajectory
[9]

**Verify with tool observations** — Do not rely solely on final solutions or a trajectory’s own reasoning. For key claims or divergences, go back and inspect what the tools actually returned: - Use **search_trajectory**(trajectory_id, query) to locate steps where a specific term or claim appears. Use role=‘tool’ to restrict to actual tool responses when v...
[10]

trajectory 1

**Cross-check** — Confirm: (a) tool observations in the log match what the agent claims, (b) reasoning is not circular, (c) arithmetic and logic are correct. — **OPERATIONAL GUIDELINES** - **Tool results are ground truth; agent reasoning is not.** Within each trajectory, what a tool *returned* is an objective observation. What the agent *concluded* from i...

2020
[11]

Evaluate tool results and reasoning quality across all candidate trajectories
[12]

Identify the most reliable final solution based on verifiable tool observations, logical consistency, and correct tool application
[13]

If no single trajectory is fully reliable, synthesize a corrected solution using only verified components from across trajectories
[14]

— **REQUIRED PROCEDURE** You must follow these steps before calling ’finish’

Deliver your synthesized solution in the required format and provide justification. — **REQUIRED PROCEDURE** You must follow these steps before calling ’finish’
[15]

Identify which trajectories are worth inspecting based on step counts and patterns

**Survey the landscape** — Read the TRAJECTORY METADATA in the user message. Identify which trajectories are worth inspecting based on step counts and patterns
[16]

**Retrieve full solutions** — Call ’get_solution’ (no arguments) to get the final content from every trajectory’s last step, or pass a trajectory_id to retrieve one specific trajectory
[17]

**Verify with tool observations** — Do not rely solely on final solutions or a trajectory’s own reasoning. For key claims or divergences, go back and inspect what the tools actually returned: - Use **search_trajectory**(trajectory_id, query) to locate steps where a specific term or claim appears. Use role=’tool’ to restrict to tool responses when verifyin...
[18]

**Cross-check** — Confirm: (a) tool observations in the log match what the agent claims, (b) reasoning is not circular, (c) arithmetic and logic are correct
[19]

type ":

**Synthesize** — Write a unified response that: - Covers every important aspect addressed by any candidate - Takes the highest-quality treatment of each aspect (not just the most common) - Resolves contradictions by preferring more specific, better-supported, or more precise content - Reads as a single coherent response, not a patchwork — **QUALITY CRITER...