arxiv: 2605.30824 · v1 · pith:EUKM5X6Cnew · submitted 2026-05-29 · 💻 cs.AI

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

Pith reviewed 2026-06-28 22:28 UTC · model grok-4.3

classification 💻 cs.AI
keywords planner-centric reinforcement learning · DAG-structured plans · deep research tasks · two-stage RL training · long-form benchmarks · LLM planning · structure-aware rewards

The pith

DecomposeR represents research plans as typed DAGs and trains the planner and then the answerer in separate RL stages to improve LLM performance on long-form research tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that modeling research plans explicitly as typed directed acyclic graphs and applying reinforcement learning first to the planner and then to the answerer produces better planning and synthesis than monolithic trajectory optimization. A sympathetic reader would care because current methods for deep research tasks entangle planning with execution, which makes it hard to assign credit for good plans and yields weaker models on complex multi-branch queries. The paper claims that this separation, combined with rewards on structured plan components rather than flat sequences, reduces training ambiguity and delivers measurable gains on long-form benchmarks.

Core claim

DecomposeR trains a Qwen3-8B model by first using planner reinforcement learning to learn graph structure and query decomposition on typed DAG representations of research plans, then using answerer reinforcement learning to optimize branch-level execution and final synthesis conditioned on those plans. By assigning rewards directly to explicit planner tokens and structured DAG components instead of end-to-end flat trajectories, the method enables finer-grained optimization of planning while reducing ambiguity, resulting in 5.1-8.0 point gains over comparable open baselines on popular long-form benchmarks.

What carries the argument

Typed directed acyclic graphs (DAGs) that represent research plans, enabling explicit structure for reward assignment and two-stage separation of planner RL from answerer RL.
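
As a concrete picture of what a typed DAG plan might look like, here is a minimal Python sketch. The node types `search`, `aggregate`, and `answer` are the ones named in the planner payload quoted in the Figure 6 caption; the class name, the `parents` field, and the validation rules (including "exactly one answer node") are assumptions made for illustration, not the paper's schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Node types named in the planner payload; everything else here is hypothetical.
NODE_TYPES = {"search", "aggregate", "answer"}

@dataclass
class PlanNode:
    node_id: str
    node_type: str
    need: str                                          # what the node should find or produce
    parents: list[str] = field(default_factory=list)   # assumed edge encoding

def validate_plan(nodes: list[PlanNode]) -> list[str]:
    """Return a list of problems; an empty list means the plan passed these checks."""
    problems = []
    by_id = {n.node_id: n for n in nodes}
    if len(by_id) != len(nodes):
        problems.append("duplicate node ids")
    for n in nodes:
        if n.node_type not in NODE_TYPES:
            problems.append(f"{n.node_id}: unknown type {n.node_type!r}")
        for p in n.parents:
            if p not in by_id:
                problems.append(f"{n.node_id}: missing parent {p}")

    # Depth-first search with visit states: 1 = on the current path, 2 = fully explored.
    state: dict[str, int] = {}

    def has_cycle(nid: str) -> bool:
        if state.get(nid) == 1:
            return True
        if state.get(nid) == 2:
            return False
        state[nid] = 1
        for p in by_id[nid].parents:
            if p in by_id and has_cycle(p):
                return True
        state[nid] = 2
        return False

    if any(has_cycle(nid) for nid in by_id):
        problems.append("graph contains a cycle")
    # Assumption: a plan ends in a single answer node.
    if sum(n.node_type == "answer" for n in nodes) != 1:
        problems.append("expected exactly one answer node")
    return problems

plan = [
    PlanNode("N1", "search", "background on the topic"),
    PlanNode("N2", "search", "recent results"),
    PlanNode("N3", "aggregate", "compare findings", parents=["N1", "N2"]),
    PlanNode("N4", "answer", "final report", parents=["N3"]),
]
print(validate_plan(plan))  # prints []
```

Because node types and edges are explicit fields rather than free text, a structural check of this kind could in principle feed a reward signal, which is the property the pith attributes to the typed representation.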

If this is right

  • Planner RL stage improves query decomposition and graph structure learning before answerer training begins.
  • Rewards on structured plan components yield better credit assignment for planning decisions than end-to-end optimization.
  • Conditioning answerer RL on learned plans improves branch-level execution and final synthesis quality.
  • The resulting DecomposeR-8B model outperforms strong open baselines by 5.1-8.0 points on long-form benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The DAG representation could make intermediate plans more inspectable and editable by users or downstream systems.
  • The two-stage separation might scale to other multi-step agent tasks that currently suffer from poor planning credit assignment.
  • Extending the typed DAG nodes to include explicit evidence retrieval actions could further tighten the planning-execution loop.

Load-bearing premise

Assigning rewards to explicit planner tokens and structured DAG components produces finer-grained optimization of planning and reduces end-to-end training ambiguity compared to flat trajectories.
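
To make the premise concrete, the toy Python sketch below contrasts the two credit-assignment schemes. The paper's reward formulation is not given on this page, so the role labels, the reward values, and both function names are hypothetical. The sketch also collapses the paper's two training stages into one rollout for brevity; it only illustrates why role-specific rewards are less ambiguous than a single trajectory-level scalar.

```python
from typing import Dict, List

def flat_credit(roles: List[str], trajectory_reward: float) -> List[float]:
    """Flat scheme: every model-generated token shares one end-to-end scalar.
    Environment tokens (e.g. search results) are masked out with 0.0."""
    return [0.0 if r == "env" else trajectory_reward for r in roles]

def structured_credit(roles: List[str], role_rewards: Dict[str, float]) -> List[float]:
    """Structured scheme: each token is credited with the reward of its own role.
    Roles without a reward entry (here: the environment) receive 0.0."""
    return [role_rewards.get(r, 0.0) for r in roles]

# Hypothetical rollout: a good plan followed by a weak final synthesis.
roles = ["planner", "planner", "env", "answerer", "answerer"]

print(flat_credit(roles, trajectory_reward=0.4))
# prints [0.4, 0.4, 0.0, 0.4, 0.4]
print(structured_credit(roles, {"planner": 0.9, "answerer": 0.2}))
# prints [0.9, 0.9, 0.0, 0.2, 0.2]
```

Under the flat scheme the planner tokens receive the same mediocre signal as the answerer tokens, even though the plan was good. Under the structured scheme the planner is still credited for the plan, which is the kind of disentangling the premise relies on.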

What would settle it

Running the same Qwen3-8B base model on the same long-form benchmarks with standard monolithic trajectory RL instead of the two-stage DAG planner approach and measuring whether the 5.1-8.0 point gap disappears or reverses.

Figures

Figures reproduced from arXiv: 2605.30824 by Mustafa Anis Hussain, Xinle Wu, Yao Lu.

Figure 1: DecomposeR rollout structure. The planner emits an initial typed DAG, receives search results, revises the …
Figure 2: Planner RL reward dynamics. The black curve …
Figure 3: Answerer RL reward dynamics. The plot shows overall answerer reward together with its execution and synthesis contributions. Length diagnostics are reported in Appendix A.6. … multi-domain research tasks that require web exploration, citation-backed synthesis, and report-quality judgment. ResearchQA-Mini (ResQA-Mini) evaluates scholarly question answering with survey-derived research questions and rubric i…
Figure 4: Planner graph size and typed component counts over RL.
Figure 5: Answerer response length over RL
Figure 6: Planner system prompt. { "phase": "planner_after_search", "query": "{user query}", "current_graph": "{initial graph, with search/aggregate/answer nodes}", "max_fetches": 4, "search_results": "{Jina search results keyed by source search node}", "instruction": "Review the evidence. Keep or revise the graph, then choose the few URLs most …
Figure 7: Planner revision-turn user payload. Answerer prompts. You are the answerer and executor for a graph-based deep research system. The planner role is already complete. You do not revise the graph. You do not choose fetches. You do not open new searches. Your job is to execute the already-planned graph in topological order. Execution model: - the environment gives you one execution wave at a time - each execut…
Figure 8: Answerer system prompt. { "phase": "answer_execution", "payload_variant": "compact", "query": "{user query}", "graph": "{revised graph}", "execution_wave": { "index": 1, "target_node_ids": ["N7", "N8"] }, "shared_dependencies": [ { "parent_id": "N1", "parent_type": "search", "provision": "inline|history_reference", "available_citation_ids": ["N1-R1", "N1-F1"], "evidence_items": "{included only when provisio…
Figure 9: Answerer aggregate-wave user payload. { "phase": "answer_final", "query": "{user query}", "graph": "{revised graph}", "target_node": { "id": "N11", "type": "answer", "need": "{final answer need}" }, "branch_reports": [ { "node_id": "N7", "need": "{upstream branch need}", "report_snippet": "{short branch-report preview}", "citation_ids_present": ["N1-R1", "N3-F1"], "full_report_available_in_conversation": tr…
Figure 10: Answerer final-answer user payload. SFT trajectory generation. The same planner and answerer prompts above are used for cold-start SFT trajectory generation. Teacher rollouts follow the production sequence: the planner first receives the raw query and emits an initial graph; the environment executes the initial search nodes; the planner then receives the after-search payload, revises the graph, and selects…
Figure 11: Rubric generation system prompt. You are grading a deep research answer against query-specific rubrics. You will receive: - the user query - the final answer - a list of rubrics Score each rubric independently using exactly one of these five scores: - `0`: not covered -- the rubric is absent, wrong, or unsupported - `0.25`: minimally covered -- the rubric is touched on but mostly inadequate - `0.5`: partia…
Figure 12: Answer judge system prompt
Figure 13: User query. Abbreviation policy. This example keeps the control flow and graph structure intact while excerpting long evidence payloads and repetitive answer prose. The important structural property is that the revised graph has two aggregate execution waves before the final answer: N7–N9 are executed first, N10 is executed second, and the final answer consumes N7, N8, and N10. N9 therefore shapes the fina…
Figure 14: Planner initial output (abbreviated)
Figure 15: Initial agroforestry nitrogen planner graph before search-result revision. Red nodes are search nodes, …
Figure 16: Planner revision output (abbreviated)
Figure 17: Revised graph after search-result revision.
Figure 18: Answerer execution payload excerpt (abbreviated)
Figure 19: Answerer turn 1 excerpt: answer_execution (N7, N8, N9)
Figure 20: Answerer turn 2 excerpt: answer_execution (N10)
Figure 21: Answerer final output: answer_final (N11)
Figure 22: Citation block appended to final answer
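
The wave-based execution mentioned in the Figure 7 and Figure 13 captions (the answerer executes the planned graph in topological order, one execution wave at a time) can be sketched with a layered topological sort. The edges from search nodes below are invented for illustration, as is the function name. Only the aggregate and answer node ids and the resulting order (N7–N9 first, then N10, then the final answer N11 consuming N7, N8, and N10) follow the captions; the edge from N9 to N10 is inferred from the statement that N9 shapes the final answer indirectly.

```python
from typing import Dict, List, Set

def execution_waves(parents: Dict[str, List[str]], done: Set[str]) -> List[List[str]]:
    """Group nodes into waves: a node becomes ready once all its parents have finished."""
    finished = set(done)
    pending = {n for n in parents if n not in finished}
    waves: List[List[str]] = []
    while pending:
        ready = sorted(n for n in pending if all(p in finished for p in parents[n]))
        if not ready:
            # Nothing can run: the remaining nodes form a cycle or wait on unknown parents.
            raise ValueError("cycle or missing dependency among: " + ", ".join(sorted(pending)))
        waves.append(ready)
        finished.update(ready)
        pending.difference_update(ready)
    return waves

# Hypothetical dependencies; search nodes N1..N6 are assumed to be executed already.
parents = {
    "N7": ["N1", "N2"],
    "N8": ["N3", "N4"],
    "N9": ["N5", "N6"],
    "N10": ["N9"],
    "N11": ["N7", "N8", "N10"],
}
searched = {"N1", "N2", "N3", "N4", "N5", "N6"}

print(execution_waves(parents, searched))
# prints [['N7', 'N8', 'N9'], ['N10'], ['N11']]
```

Nodes within one wave have no dependencies on each other, so an environment could hand them to the answerer together, as the `target_node_ids` field in the Figure 8 caption suggests.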
Original abstract

Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes DecomposeR, a planner-centric RL framework for deep research tasks. Research plans are represented as typed DAGs to make planning explicit and rewardable. A Qwen3-8B model is trained in two stages: planner RL to learn graph structure and decomposition, followed by answerer RL for branch execution and synthesis. The abstract claims this yields 5.1-8.0 point gains over strong open baselines on long-form benchmarks by providing finer-grained credit assignment via rewards on planner tokens and DAG components rather than flat trajectories.

Significance. If the performance delta can be isolated to the typed-DAG reward structure, the approach would provide a concrete mechanism for disentangling planning from execution in long-horizon LLM RL, potentially improving optimization of multi-branch research workflows.

major comments (1)
  1. [Abstract] The central claim attributes the 5.1-8.0 point gains specifically to 'assigning rewards to explicit planner tokens and structured DAG components rather than to a flat trajectory.' This requires a control experiment (two-stage flat-trajectory RL vs. two-stage typed-DAG RL) to isolate the contribution of the DAG representation and component-level rewards from the mere separation into planner and answerer stages. No such ablation is described, so the core premise remains unverified and the attribution is not load-bearing.
minor comments (1)
  1. The abstract supplies no details on reward formulations, baseline descriptions, statistical significance, or experimental setup, making it impossible to assess whether reported gains are reproducible or attributable to the claimed mechanism.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on strengthening the attribution of our results. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim attributes the 5.1-8.0 point gains specifically to 'assigning rewards to explicit planner tokens and structured DAG components rather than to a flat trajectory.' This requires a control experiment (two-stage flat-trajectory RL vs. two-stage typed-DAG RL) to isolate the contribution of the DAG representation and component-level rewards from the mere separation into planner and answerer stages. No such ablation is described, so the core premise remains unverified and the attribution is not load-bearing.

    Authors: We agree that the manuscript does not contain a direct ablation isolating two-stage flat-trajectory RL from two-stage typed-DAG RL, and that such a control would more rigorously separate the contribution of the DAG structure and component-level rewards from the planner-answerer stage separation alone. Our existing experiments compare DecomposeR against strong open baselines that use neither the two-stage approach nor typed DAGs, and the framework is explicitly designed to enable rewards on planner tokens and DAG components. To address the referee's concern, we will add the requested control experiment in the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with benchmark deltas

The paper describes a two-stage RL training procedure on typed DAG plans and reports 5.1-8.0 point gains on long-form benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The central claim is an observed performance delta from an explicit training recipe; it does not reduce to a definitional identity or to a prior result supplied only by the same authors. The absence of any load-bearing derivation chain makes the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the text does not introduce or fit any quantities beyond the high-level claim of benchmark improvement.

pith-pipeline@v0.9.1-grok · 5732 in / 1105 out tokens · 20243 ms · 2026-06-28T22:28:05.433708+00:00 · methodology
