Plan Before Search: Search Agents Need Plan

Ben Chen; Chenyi Lei; Huangyu Dai; Jiayi Ji; Qibin Hou; Wenwu Ou; Xiaoshuai Sun; Yufei Ma; Zhipeng Qian; Zihan Liang

arXiv: 2605.28354 · v1 · pith:2MKXQZQGnew · submitted 2026-05-27 · 💻 cs.AI



Pith reviewed 2026-06-29 12:29 UTC · model grok-4.3

classification: 💻 cs.AI

keywords: multi-hop QA, retrieval-augmented agents, reinforcement learning, self-bootstrapping, plan before search, agentic behavior, distillation alternatives, sub-question decomposition


The pith

A small-scale seed model generates filtered trajectories that activate Plan behavior in target models for multi-hop retrieval without external distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines training retrieval-augmented reasoning agents for multi-hop question answering using reinforcement learning. It observes that the same reward signal produces different failure modes across model families from 3B to 14B parameters, tied to conditions like initial entropy and prerequisite skills. In response, the work introduces a self-bootstrapping method where trajectories from a small seed model teach the Plan behavior to any target model. Plan decomposes a question into ordered sub-questions before retrieval begins, anchoring each search step to a pre-set sub-question. This pipeline succeeds in activating Plan across tested models and outperforms baselines on multi-hop QA tasks while removing the requirement for distillation from a stronger external model.

Core claim

The paper claims that Plan, defined as decomposing a question into ordered sub-questions before any retrieval occurs, can be activated in target models through a self-bootstrapping paradigm that uses filtered trajectories generated by a small-scale seed model, and that this approach works across model families without distillation from an external stronger model while delivering consistent gains over competitive baselines on multi-hop QA benchmarks.

What carries the argument

Plan, the structured agentic behavior that decomposes a question into ordered sub-questions before retrieval to anchor search steps and prevent drift from partially relevant documents.
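A minimal sketch of the control flow this definition implies, not the authors' implementation: the plan is produced once, before any retrieval, and each search query is built from a planned sub-question rather than from whatever earlier documents happened to suggest. The names `run_plan_then_search`, `make_plan`, `search`, `answer_step`, and the `#k` placeholder convention for "the finding of step k" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    question: str
    plan: List[str]  # ordered sub-questions, fixed before any search
    steps: List[Dict] = field(default_factory=list)
    answer: str = ""


def fill_placeholders(sub_question: str, findings: List[str]) -> str:
    """Replace '#k' with the finding of step k (hypothetical convention)."""
    # Go from the highest index down so '#12' is not clobbered by '#1'.
    for k in range(len(findings), 0, -1):
        sub_question = sub_question.replace(f"#{k}", findings[k - 1])
    return sub_question


def run_plan_then_search(
    question: str,
    make_plan: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    answer_step: Callable[[str, List[str]], str],
) -> Trajectory:
    """Plan first, then run one search per sub-question in planned order."""
    plan = make_plan(question)  # no retrieval has happened yet
    traj = Trajectory(question=question, plan=list(plan))
    findings: List[str] = []
    for sub_question in plan:
        # The query is anchored to the planned sub-question; earlier
        # findings only fill its placeholders, they never rewrite the plan.
        query = fill_placeholders(sub_question, findings)
        docs = search(query)
        finding = answer_step(query, docs)
        findings.append(finding)
        traj.steps.append({"query": query, "docs": docs, "finding": finding})
    traj.answer = findings[-1] if findings else ""
    return traj
```

In a real agent, `make_plan` and `answer_step` would be calls to the policy model and `search` a call to the retriever; here they are left as injected callables so the ordering constraint is the only thing the sketch commits to.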

If this is right

The pipeline activates Plan across every tested model from 3B to 14B parameters.
It consistently outperforms competitive baselines on multi-hop QA benchmarks.
Successful training depends on model-specific feasibility conditions including sufficient initial entropy, training stability, and prerequisite sub-skills.
An identical reward signal induces qualitatively different RL failure modes across model families.
The self-bootstrapping method eliminates the need for distillation from an external stronger model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach implies that internal trajectory filtering can substitute for external teacher models in acquiring structured agent behaviors.
Similar bootstrapping may apply to other agentic skills such as verification or tool use beyond planning.
Model-specific initial conditions could become a standard diagnostic step before RL training of search agents.
The method might reduce overall compute by avoiding repeated calls to larger teacher models during capability acquisition.
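The diagnostic step suggested above could be as simple as measuring the policy's mean next-token entropy before RL starts. The sketch below assumes you can obtain a probability distribution per generated token; the function names and the `min_entropy` threshold are hypothetical, and the source gives no numeric threshold.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over all generated tokens in a batch of rollouts."""
    if not step_distributions:
        return 0.0
    total = sum(token_entropy(p) for p in step_distributions)
    return total / len(step_distributions)


def passes_entropy_check(
    step_distributions: Sequence[Sequence[float]], min_entropy: float
) -> bool:
    """True if the policy still explores enough to make RL worthwhile.

    `min_entropy` is an assumed, user-chosen threshold, not a value
    reported in the source.
    """
    return mean_policy_entropy(step_distributions) >= min_entropy
```

A near-zero value after the SFT cold start would indicate a policy that has little left to explore, which is one of the feasibility conditions the paper names.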

Load-bearing premise

That filtered trajectories generated by a small-scale seed model are sufficient to activate the Plan behavior in any target model without requiring distillation from an external stronger model.
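To make the premise concrete, here is one way a trajectory filter could look. The source does not state the filtering criteria, so both rules below (the plan must come first with one search per sub-question, and the final answer must match the gold answer) are assumptions, as are the `events`, `answer`, and `gold` field names.

```python
from typing import Dict, List, Tuple


def follows_plan_format(traj: Dict) -> bool:
    """Assumed rule: exactly one plan, emitted first, one search per sub-question."""
    events = traj.get("events", [])
    if not events or events[0][0] != "plan":
        return False
    plan = events[0][1]
    n_plans = sum(1 for kind, _ in events if kind == "plan")
    n_searches = sum(1 for kind, _ in events if kind == "search")
    return n_plans == 1 and len(plan) >= 2 and n_searches == len(plan)


def is_correct(traj: Dict) -> bool:
    """Assumed rule: case-insensitive match against the gold answer."""
    return traj.get("answer", "").strip().lower() == traj.get("gold", "").strip().lower()


def filter_seed_trajectories(trajs: List[Dict]) -> Tuple[List[Dict], float]:
    """Return the kept trajectories and the fraction retained."""
    kept = [t for t in trajs if follows_plan_format(t) and is_correct(t)]
    retained = len(kept) / len(trajs) if trajs else 0.0
    return kept, retained
```

Returning the retained fraction alongside the kept set makes it cheap to report how aggressive the filter is, which matters when judging whether the surviving data still covers the prerequisite sub-skills.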

What would settle it

A direct test applying the seed-model trajectories to a target model outside the 3B-14B range or lacking prerequisite sub-skills and measuring whether Plan activation and benchmark gains still occur.

Figures

Figures reproduced from arXiv: 2605.28354 by Ben Chen, Chenyi Lei, Huangyu Dai, Jiayi Ji, Qibin Hou, Wenwu Ou, Xiaoshuai Sun, Yufei Ma, Zhipeng Qian, Zihan Liang.

**Figure 2.** Overview of our framework. Given a user question, the policy model first generates a global plan …

**Figure 3.** Training stability comparison between thresh…

**Figure 4.** Failure modes of direct RL on the plan task …

**Figure 5.** Initial RL entropy after SFT cold start. All the …

**Figure 6.** Training dynamics of direct joint RL on Qwen2.5-7B-Base. (a) Validation EM rises to 0.405 at step 100, then collapses to 0.35 and fails to recover. (b) The critic score declines over the same window, suggesting the reward signal detects the collapse.

**Figure 7.** Prompt template for plan

**Figure 8.** Per-component token distribution across train…
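The Figure 6 caption reports validation EM (exact match). For readers unfamiliar with the metric, the sketch below follows the common SQuAD-style convention of lowercasing, stripping punctuation and articles, and collapsing whitespace before comparing strings. Whether the paper uses exactly this normalization is an assumption; the function names are illustrative.

```python
import re
import string
from typing import List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))


def mean_em(predictions: List[str], gold_lists: List[Sequence[str]]) -> float:
    """Average exact match over a validation set."""
    if not predictions:
        return 0.0
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return sum(scores) / len(scores)
```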

Original abstract

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Self-bootstrapping planning via filtered trajectories from a small seed model is a sensible practical alternative to distillation, but the abstract's performance claims rest on unshown details.


The main point is that this work identifies why standard RL for retrieval agents often fails across model sizes and proposes fixing it by generating training trajectories from a small seed model instead of distilling from a stronger one. The authors define Plan as the behavior of breaking a question into ordered sub-questions before any search happens, which they argue keeps later search steps from drifting under the influence of partially relevant documents retrieved earlier.

What is actually new is the emphasis on model-specific feasibility conditions (initial entropy, training stability, and prerequisite sub-skills) in addition to reward design. They report that the same reward produces qualitatively different failure modes in different model families, which is a useful observation for anyone training agents. The self-bootstrapping route, in which the seed model's filtered outputs activate Plan in target models from 3B to 14B, is presented as removing the external distillation step.

The paper does a reasonable job laying out the dependency structure among sub-skills and why planning first matters for multi-hop retrieval. That framing is clear and directly tied to the agent setting.

The soft spots are in the evidence. The abstract states consistent outperformance and activation across every tested model, yet supplies no numbers, named baselines, or description of the filtering criteria. The obvious stress test is uncomfortable for the method: if the small seed model lacks the prerequisite sub-skills, its trajectories are unlikely to supply them, and the method could simply replay the instability the authors themselves document. Without the actual results, ablations, or controls, it is impossible to judge whether filtering compensates for this or whether the gains are real.

This is for researchers working on retrieval-augmented agents and multi-hop QA training pipelines. A reader who wants concrete alternatives to distillation would find the setup worth examining once the experiments are available.

It deserves peer review because the problem is practical and the proposed fix is testable, even if the current write-up leaves the central claims unverified.

Referee Report

2 major / 2 minor

Summary. The paper observes that training retrieval-augmented agents typically combines RL with an SFT cold start distilled from a stronger model, and argues that this paradigm overlooks sub-skill dependencies and alternative routes to capability acquisition. It defines 'Plan' as a structured behavior that decomposes a multi-hop question into an ordered list of sub-questions before any retrieval occurs. Across three model families (3B–14B), identical reward signals produce qualitatively different RL failure modes, which the authors attribute to model-specific conditions including initial entropy, training stability, and prerequisite sub-skills. They therefore introduce a self-bootstrapping pipeline in which a small-scale seed model generates filtered trajectories that are used to activate Plan behavior in any target model, eliminating the need for stronger-model distillation. The pipeline is stated to activate Plan in every tested model and to outperform competitive baselines on multi-hop QA benchmarks.

Significance. If the empirical claims hold, the work would be significant for showing that a self-bootstrapping route using only a small seed model can replace distillation from a stronger model when training agentic retrieval behaviors. The observation that identical rewards induce different failure modes across model families supplies a concrete diagnostic for why RL succeeds or fails on agent tasks and underscores the role of prerequisite sub-skills. The approach is falsifiable via the reported multi-hop QA results and the cross-family activation experiments.

major comments (2)

[Abstract] Abstract: the claims that the pipeline 'consistently outperforms competitive baselines on multi-hop QA benchmarks' and 'activates Plan across every tested model' are presented without any quantitative results, error bars, baseline specifications, or controls, so the central empirical claim cannot be evaluated from the supplied text.
[Method] Method / self-bootstrapping section: the central claim that filtered trajectories from the small-scale seed model suffice to activate Plan 'in any target model' without stronger-model distillation rests on the unverified assumption that the seed already possesses (or filtering reliably extracts) the prerequisite sub-skills. The paper itself notes that identical rewards produce different failure modes precisely because of missing sub-skills; no experiment is described that tests whether the 3B-scale seed trajectories are high-entropy and on-distribution enough to avoid reproducing the documented RL instabilities.

minor comments (2)

Clarify the precise parameter counts of the seed model and each target model, and state whether the seed is from the same family as the targets.
Specify the exact filtering criteria applied to the seed trajectories and report the fraction of trajectories retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below. We agree the abstract requires quantitative support and will revise it. For the self-bootstrapping method, we provide clarification on the role of filtering while acknowledging the need for additional discussion of trajectory properties.


Referee: [Abstract] Abstract: the claims that the pipeline 'consistently outperforms competitive baselines on multi-hop QA benchmarks' and 'activates Plan across every tested model' are presented without any quantitative results, error bars, baseline specifications, or controls, so the central empirical claim cannot be evaluated from the supplied text.

Authors: We agree that the abstract would be strengthened by including quantitative details. In the revised manuscript we will update the abstract to report specific performance metrics on the multi-hop QA benchmarks, note the consistency of gains across the three model families, and identify the competitive baselines used. This change will make the central empirical claims directly evaluable from the abstract.
Revision: yes

Referee: [Method] Method / self-bootstrapping section: the central claim that filtered trajectories from the small-scale seed model suffice to activate Plan 'in any target model' without stronger-model distillation rests on the unverified assumption that the seed already possesses (or filtering reliably extracts) the prerequisite sub-skills. The paper itself notes that identical rewards produce different failure modes precisely because of missing sub-skills; no experiment is described that tests whether the 3B-scale seed trajectories are high-entropy and on-distribution enough to avoid reproducing the documented RL instabilities.

Authors: We recognize the logical connection to our own observations on model-specific RL failure modes. The pipeline's filtering step selects trajectories that already exhibit ordered sub-question decomposition, which our cross-family experiments show is sufficient to activate Plan in target models without triggering the documented instabilities. While a dedicated entropy or distribution analysis of the raw seed trajectories is not reported, the consistent activation success across 3B–14B models serves as empirical evidence that the filtered data avoids the problematic conditions. We will revise the method section to expand the description of the filtering criteria and their relation to preserving prerequisite sub-skills and training stability.
Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical self-bootstrapping claim stands on experimental results


The paper presents an empirical pipeline in which a small seed model produces filtered trajectories to activate Plan behavior in target models, with success demonstrated across three model families on multi-hop QA benchmarks. No derivation reduces by construction to fitted inputs, self-citations, or definitional loops; the central claim is framed as an observed outcome of the proposed self-bootstrapping method rather than a mathematical identity or renamed known result. The text supplies no load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into parameters or axioms; the core method rests on stated feasibility conditions for RL.

axioms (1)

domain assumption: Successful RL training of Plan requires sufficient initial entropy, training stability, and prerequisite sub-skills that vary by model.
Explicitly stated in the abstract as the reason identical reward signals produce different failure modes.

invented entities (1)

Plan (no independent evidence)
purpose: Structured agentic behavior that decomposes a question into ordered sub-questions before retrieval to anchor search steps.
Introduced as the core proposed behavior for multi-hop retrieval.


