
arxiv: 2605.28354 · v1 · pith:2MKXQZQGnew · submitted 2026-05-27 · 💻 cs.AI

Plan Before Search: Search Agents Need Plan

Pith reviewed 2026-06-29 12:29 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-hop QA · retrieval-augmented agents · reinforcement learning · self-bootstrapping · plan before search · agentic behavior · distillation alternatives · sub-question decomposition

The pith

A small-scale seed model generates filtered trajectories that activate Plan behavior in target models for multi-hop retrieval without external distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines reinforcement-learning training of retrieval-augmented reasoning agents for multi-hop question answering. It observes that an identical reward signal produces qualitatively different failure modes across three model families spanning 3B to 14B parameters, and attributes this to model-specific conditions: initial entropy, training stability, and prerequisite sub-skills. In response, the work introduces a self-bootstrapping method in which filtered trajectories from a small seed model teach the Plan behavior to a target model. Plan decomposes a question into ordered sub-questions before retrieval begins, anchoring each search step to a pre-set sub-question. The paper reports that this pipeline activates Plan in every tested model and outperforms competitive baselines on multi-hop QA benchmarks, while removing the requirement for distillation from a stronger external model.

Core claim

The paper claims that Plan, defined as decomposing a question into ordered sub-questions before any retrieval occurs, can be activated in target models through a self-bootstrapping paradigm that uses filtered trajectories generated by a small-scale seed model, and that this approach works across model families without distillation from an external stronger model while delivering consistent gains over competitive baselines on multi-hop QA benchmarks.

What carries the argument

Plan, the structured agentic behavior that decomposes a question into ordered sub-questions before retrieval to anchor search steps and prevent drift from partially relevant documents.
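
As a rough illustration of the control flow this describes, the sketch below fixes the whole plan before issuing any search, then runs one retrieval step per sub-question. It is not the paper's implementation: `plan_then_search`, `toy_planner`, `toy_search`, and `TOY_INDEX` are hypothetical names, and the planner and search engine are hard-coded stand-ins for a policy model and a retriever.

```python
from typing import Callable, Dict, List


def plan_then_search(
    question: str,
    make_plan: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[Dict[str, object]]:
    """Decompose the question first, then run one retrieval step per sub-question."""
    plan = make_plan(question)  # the full plan is fixed before any retrieval happens
    steps: List[Dict[str, object]] = []
    for index, sub_question in enumerate(plan, start=1):
        # The query comes from the plan, not from whatever earlier documents mentioned.
        documents = search(sub_question)
        steps.append({"step": index, "sub_question": sub_question, "documents": documents})
    return steps


# Hypothetical stand-ins: a hard-coded planner and a dictionary acting as a search engine.
def toy_planner(question: str) -> List[str]:
    return ["Who directed Inception?", "In which city was Christopher Nolan born?"]


TOY_INDEX = {
    "Who directed Inception?": ["Inception was directed by Christopher Nolan."],
    "In which city was Christopher Nolan born?": ["Christopher Nolan was born in London."],
}


def toy_search(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


trace = plan_then_search(
    "In which city was the director of Inception born?", toy_planner, toy_search
)
for step in trace:
    print(step["step"], step["sub_question"], "->", step["documents"])
```

The toy planner already knows the answer to the first hop, which a real planner would not; how later sub-questions refer to earlier answers is left out here because the summary above does not specify it.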

If this is right

  • The pipeline activates Plan across every tested model from 3B to 14B parameters.
  • It consistently outperforms competitive baselines on multi-hop QA benchmarks.
  • Successful training depends on model-specific feasibility conditions including sufficient initial entropy, training stability, and prerequisite sub-skills.
  • An identical reward signal induces qualitatively different RL failure modes across model families.
  • The self-bootstrapping method eliminates the need for distillation from an external stronger model.
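
One of the feasibility conditions listed above is sufficient initial entropy. The page does not say how the paper measures it, so the following is only a sketch using the standard Shannon entropy of next-token distributions, averaged over positions. The function names and the toy distributions are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: List[Sequence[float]]) -> float:
    """Average entropy over the token positions of one sampled response."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# Toy distributions over a 4-token vocabulary (hypothetical values).
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3   # a policy that has nearly collapsed
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3  # a maximally exploratory policy

print(f"peaked:  {mean_entropy(peaked):.3f} nats")
print(f"uniform: {mean_entropy(uniform):.3f} nats")
```

A policy whose mean entropy is already close to zero at the start of RL has little room to explore alternative behaviors, which is one plausible reading of why this condition matters.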

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach implies that internal trajectory filtering can substitute for external teacher models in acquiring structured agent behaviors.
  • Similar bootstrapping may apply to other agentic skills such as verification or tool use beyond planning.
  • Model-specific initial conditions could become a standard diagnostic step before RL training of search agents.
  • The method might reduce overall compute by avoiding repeated calls to larger teacher models during capability acquisition.

Load-bearing premise

That filtered trajectories generated by a small-scale seed model are sufficient to activate the Plan behavior in any target model without requiring distillation from an external stronger model.
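
The exact filtering criteria are not given on this page, so the sketch below uses hypothetical ones to make the premise concrete: keep a seed trajectory only if it wrote a multi-step plan, issued one search per planned sub-question, and reached the gold answer. The `Trajectory` fields, `keep`, and `filter_trajectories` are illustrative names, not the paper's.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    plan: List[str]            # sub-questions written before any search call
    search_queries: List[str]  # queries actually issued, in order
    answer: str
    gold_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def keep(traj: Trajectory, min_sub_questions: int = 2) -> bool:
    """Hypothetical criteria: a real plan, one search per step, correct answer."""
    has_plan = len(traj.plan) >= min_sub_questions
    one_search_per_step = len(traj.search_queries) == len(traj.plan)
    correct = normalize(traj.answer) == normalize(traj.gold_answer)
    return has_plan and one_search_per_step and correct


def filter_trajectories(trajs: List[Trajectory]) -> Tuple[List[Trajectory], float]:
    """Return the kept trajectories and the fraction retained."""
    kept = [t for t in trajs if keep(t)]
    fraction = len(kept) / len(trajs) if trajs else 0.0
    return kept, fraction


samples = [
    Trajectory(["q1", "q2"], ["q1", "q2"], "Paris", "paris"),  # kept
    Trajectory([], ["q"], "Paris", "Paris"),                   # dropped: no plan
    Trajectory(["q1", "q2"], ["q1", "q2"], "Rome", "Paris"),   # dropped: wrong answer
    Trajectory(["q1", "q2"], ["q1"], "Paris", "Paris"),        # dropped: skipped a step
]
kept, fraction = filter_trajectories(samples)
print(len(kept), fraction)
```

Reporting the retained fraction alongside the kept set is cheap and makes it easy to see how selective a given set of criteria is.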

What would settle it

A direct test applying the seed-model trajectories to a target model outside the 3B-14B range or lacking prerequisite sub-skills and measuring whether Plan activation and benchmark gains still occur.

Figures

Figures reproduced from arXiv: 2605.28354 by Ben Chen, Chenyi Lei, Huangyu Dai, Jiayi Ji, Qibin Hou, Wenwu Ou, Xiaoshuai Sun, Yufei Ma, Zhipeng Qian, Zihan Liang.

Figure 1: Comparison with traditional search agents.
Figure 2: Overview of our framework. Given a user question, the policy model first generates a global plan …
Figure 3: Training stability comparison between thresh…
Figure 5: Initial RL entropy after SFT cold start. All the …
Figure 4: Failure modes of direct RL on the plan task …
Figure 6: Training dynamics of direct joint RL on Qwen2.5-7B-Base. (a) Validation EM rises to 0.405 at step 100, then collapses to 0.35 and fails to recover. (b) The critic score declines over the same window, suggesting the reward signal detects the collapse.
Figure 7: Prompt template for plan.
Figure 8: Per-component token distribution across train…

Original abstract

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper observes that RL training of retrieval-augmented agents typically relies on an SFT cold start distilled from a stronger model, and argues that this paradigm overlooks sub-skill dependencies and alternative routes to capability acquisition. It defines 'Plan' as a structured behavior that decomposes a multi-hop question into an ordered list of sub-questions before any retrieval occurs. Across three model families (3B–14B), identical reward signals produce qualitatively different RL failure modes, which the authors attribute to model-specific conditions: initial entropy, training stability, and prerequisite sub-skills. They therefore introduce a self-bootstrapping pipeline in which a small-scale seed model generates filtered trajectories that are used to activate Plan behavior in any target model, eliminating the need for stronger-model distillation. The pipeline is stated to activate Plan in every tested model and to outperform competitive baselines on multi-hop QA benchmarks.

Significance. If the empirical claims hold, the work would be significant for showing that a self-bootstrapping route using only a small seed model can replace distillation from a stronger model when training agentic retrieval behaviors. The observation that identical rewards induce different failure modes across model families supplies a concrete diagnostic for why RL succeeds or fails on agent tasks and underscores the role of prerequisite sub-skills. The approach is falsifiable via the reported multi-hop QA results and the cross-family activation experiments.

major comments (2)
  1. [Abstract] Abstract: the claims that the pipeline 'activates Plan across every tested model' and 'consistently outperforms competitive baselines' are presented without any quantitative results, error bars, baseline specifications, or controls, so the central empirical claim cannot be evaluated from the supplied text.
  2. [Method] Method / self-bootstrapping section: the central claim that filtered trajectories from the small-scale seed model suffice to activate Plan 'in any target model' without stronger-model distillation rests on the unverified assumption that the seed already possesses (or filtering reliably extracts) the prerequisite sub-skills. The paper itself notes that identical rewards produce different failure modes precisely because of missing sub-skills; no experiment is described that tests whether the small-scale seed trajectories are high-entropy and on-distribution enough to avoid reproducing the documented RL instabilities.
minor comments (2)
  1. Clarify the precise parameter counts of the seed model and each target model, and state whether the seed is from the same family as the targets.
  2. Specify the exact filtering criteria applied to the seed trajectories and report the fraction of trajectories retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below. We agree the abstract requires quantitative support and will revise it. For the self-bootstrapping method, we provide clarification on the role of filtering while acknowledging the need for additional discussion of trajectory properties.

  1. Referee: [Abstract] Abstract: the claims that the pipeline 'activates Plan across every tested model' and 'consistently outperforms competitive baselines' are presented without any quantitative results, error bars, baseline specifications, or controls, so the central empirical claim cannot be evaluated from the supplied text.

    Authors: We agree that the abstract would be strengthened by including quantitative details. In the revised manuscript we will update the abstract to report specific performance metrics on the multi-hop QA benchmarks, note the consistency of gains across the three model families, and identify the competitive baselines used. This change will make the central empirical claims directly evaluable from the abstract. revision: yes

  2. Referee: [Method] Method / self-bootstrapping section: the central claim that filtered trajectories from the small-scale seed model suffice to activate Plan 'in any target model' without stronger-model distillation rests on the unverified assumption that the seed already possesses (or filtering reliably extracts) the prerequisite sub-skills. The paper itself notes that identical rewards produce different failure modes precisely because of missing sub-skills; no experiment is described that tests whether the small-scale seed trajectories are high-entropy and on-distribution enough to avoid reproducing the documented RL instabilities.

    Authors: We recognize the logical connection to our own observations on model-specific RL failure modes. The pipeline's filtering step selects trajectories that already exhibit ordered sub-question decomposition, which our cross-family experiments show is sufficient to activate Plan in target models without triggering the documented instabilities. While a dedicated entropy or distribution analysis of the raw seed trajectories is not reported, the consistent activation success across 3B–14B models serves as empirical evidence that the filtered data avoids the problematic conditions. We will revise the method section to expand the description of the filtering criteria and their relation to preserving prerequisite sub-skills and training stability. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical self-bootstrapping claim stands on experimental results


The paper presents an empirical pipeline in which a small seed model produces filtered trajectories to activate Plan behavior in target models, with success demonstrated across three model families on multi-hop QA benchmarks. No derivation reduces by construction to fitted inputs, self-citations, or definitional loops; the central claim is framed as an observed outcome of the proposed self-bootstrapping method rather than a mathematical identity or renamed known result. The text supplies no load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into parameters or axioms; the core method rests on stated feasibility conditions for RL.

axioms (1)
  • Domain assumption: Successful RL training of Plan requires sufficient initial entropy, training stability, and prerequisite sub-skills that vary by model.
    Explicitly stated in abstract as the reason identical reward signals produce different failure modes.
invented entities (1)
  • Plan (no independent evidence)
    Purpose: Structured agentic behavior that decomposes a question into ordered sub-questions before retrieval to anchor search steps.
    Introduced as the core proposed behavior for multi-hop retrieval.

pith-pipeline@v0.9.1-grok · 5767 in / 1224 out tokens · 35366 ms · 2026-06-29T12:29:24.672509+00:00 · methodology

