pith. machine review for the scientific record.

arxiv: 2604.14054 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.CL

Recognition: unknown

π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Dongbin Zhao, Guojun Yin, Jiajun Chai, Qichao Zhang, Songjun Tu, Wei Lin, Wenyue Chong, Xiaohan Wang, Yaocheng Zhang, Yuanheng Zhu

Pith reviewed 2026-05-10 13:40 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords self-play · self-distillation · privileged information · multi-agent learning · data-free training · search agents · reinforcement learning

The pith

Self-play generates its own privileged context through question construction paths to enable dense self-distillation without external data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that conventional self-play for search agents is limited by sparse rewards and weak credit assignment. During task generation, self-play naturally produces a question construction path that encodes the reverse solution process and can serve as high-quality privileged information. Using this path, a teacher model provides dense supervision to the student via self-distillation in a multi-agent loop called π-Play. This converts sparse outcome rewards into continuous feedback, allowing fully data-free training. The approach is shown to exceed fully supervised agents while accelerating evolution by two to three times.

Core claim

Self-play naturally produces a question construction path during examiner task generation; this path captures the reverse solution process and supplies privileged context that lets a teacher model densely supervise a student through self-distillation, turning sparse-reward self-play into a dense-feedback self-evolution framework that requires no external data or human feedback.

What carries the argument

The question construction path (QCP), an intermediate artifact generated alongside tasks that encodes the reverse solution process, used as privileged context for self-distillation inside the multi-agent π-Play framework.
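
A minimal sketch of how a QCP might be represented and routed to the two roles, assuming the path is a list of search steps recorded by the examiner. The names `Hop`, `Task`, `student_context`, and `teacher_context`, and the prompt layout, are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    """One examiner step: the search issued and the fact it surfaced."""
    query: str
    evidence: str


@dataclass
class Task:
    """A synthesized QA pair plus the construction path behind it."""
    question: str
    answer: str
    qcp: List[Hop] = field(default_factory=list)


def student_context(task: Task) -> str:
    # The student is prompted with the question only.
    return f"Question: {task.question}"


def teacher_context(task: Task) -> str:
    # The teacher gets the same question, preceded by the QCP as privileged context.
    steps = "\n".join(
        f"{i}. search: {hop.query} -> {hop.evidence}"
        for i, hop in enumerate(task.qcp, start=1)
    )
    return f"Privileged construction path:\n{steps}\n\n{student_context(task)}"


if __name__ == "__main__":
    # Placeholder content, not an example from the paper.
    task = Task(
        question="Which entity is linked to B through A?",
        answer="C",
        qcp=[Hop("A", "A is linked to B"), Hop("B", "B is linked to C")],
    )
    print(teacher_context(task))
```

The only asymmetry between the two roles in this sketch is the context string: the weights can be shared, which is what makes the setup self-distillation rather than distillation from a stronger external model.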

If this is right

  • Data-free π-Play surpasses the performance of fully supervised search agents on information-seeking tasks.
  • Evolutionary efficiency increases by a factor of two to three compared with conventional self-play.
  • Dense supervision from QCP improves credit assignment over sparse outcome rewards alone.
  • The framework scales multi-agent self-evolution without any dependence on curated datasets or human feedback.
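
The credit-assignment point in the list above can be illustrated with a toy contrast between the two signal shapes. This is a sketch under stated assumptions: the per-token divergence is taken to be KL(student || teacher), which is a common choice in on-policy distillation but is not confirmed by the text reviewed here, and the function names and logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sparse_signal(num_tokens: int, outcome_reward: float) -> List[float]:
    # Outcome-only training: every token in the rollout shares one scalar.
    return [outcome_reward] * num_tokens


def dense_signal(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> List[float]:
    # Distillation: one divergence per token position along the student's rollout.
    # Assumed direction: KL(student || teacher).
    return [
        kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary for a 2-token rollout.
    student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # teacher disagrees at position 2
    print(sparse_signal(2, 1.0))
    print(dense_signal(student, teacher))
```

With the sparse signal, both positions receive the same value regardless of where the rollout went right or wrong. With the dense signal, only the position where the QCP-conditioned teacher disagrees with the student carries a non-zero term.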

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Generation artifacts like QCP may be exploitable as internal supervision signals in other self-play or generative agent domains beyond search tasks.
  • The method could be extended by searching for analogous construction paths in non-reasoning generation settings to broaden its applicability.
  • Iterative refinement of how QCP is extracted and used might further reduce variance in the self-evolution loop over multiple rounds.

Load-bearing premise

The question construction path that arises naturally during self-play task generation supplies high-quality privileged context that supports effective dense self-distillation without introducing bias or requiring external validation.

What would settle it

An experiment in which student models trained with QCP-based distillation show no gain or a loss in performance relative to standard self-play, or in which QCP supervision introduces measurable bias that harms final agent capability.

Figures

Figures reproduced from arXiv: 2604.14054 by Dongbin Zhao, Guojun Yin, Jiajun Chai, Qichao Zhang, Songjun Tu, Wei Lin, Wenyue Chong, Xiaohan Wang, Yaocheng Zhang, Yuanheng Zhu.

Figure 1: Overview of QCP-guided self-distillation in π-Play. The examiner is equipped with search tools and interacts with the search engine to obtain factual information, ensuring the correctness of both the synthesized QA pairs and their construction paths $c$. The teacher policy $\pi^{T}_{\psi}$ leverages QCP as additional context to provide token-level supervision to the student policy $\pi^{S}_{\theta}$ along the student’s rollout $y$, b…
Figure 2: π-Play outperforms self-play methods (Dr.Zero [45]) across seven QA benchmarks with Qwen3-4B-Instruct-2507. A single iteration of π-Play achieves gains that match or even exceed those of three iterations of self-play, demonstrating its superior evolutionary efficiency. Another line of self-evolution, self-distillation, addresses the credit assignment problem by employing high-quality privileged informatio…
Figure 3: Comparison of π-Play with other self-evolution frameworks. All models (examiner, teacher, and student) in π-Play are initialized from the same base LLM and function as search agents. π-Play uses alternating optimization to evolve multiple agents in a closed loop. Compared to self-play, it overcomes the sparse-reward problem of the student and enables the student to be optimized under the joint effect of ou…
Figure 4: Iterative reward and entropy dynamics of the examiner and student in π-Play with Qwen3-4B-Instruct-2507. Both reward and entropy reach a converged state by Iteration 3.
Figure 5: Side-by-side trajectories of Dr.Zero (left) and …
Figure 6: QCP example with hop = 1 provided by the examiner. Random Document Now, generate a question and its answer with n = 2 hops starting from the following source document: (Title: "Luna 19") Luna 19 (a.k.a. Lunik 19) (E-8-LS series), was an unmanned space mission of the Luna program. "Luna 19" extended the systematic study of lunar gravitational fields and location of mascons (mass concentrations). It also stu…
Figure 7: QCP example with hop = 2 provided by the examiner. Random Document Now, generate a question and its answer with n = 3 hops starting from the following source document: (Title: "Chris Charsley") season, including the test match against Darwen which won them promotion to the First Division. He also played for Aston Villa as a guest in 1886. Charsley had a brief spell with West Bromwich Albion, whom he joined…
Figure 8: QCP example with hop = 3 provided by the examiner.
Figure 9: System prompt for the examiner, teacher and student in π-Play. They use the same system prompt. D.2 Examiner Prompts. User Prompt for the Examiner: You are an expert in question generation. Craft one challenging, deterministic question and its single, unambiguous answer based on the provided source document. The logical path must start from the document and require exactly n hops (i.e., n-1 searches) to reac…
Figure 10: Initial instructions for the examiner in π-Play. Our prompt for the examiner is developed based on Dr.Zero [45]. D.3 Teacher Prompts. User Prompt for the Teacher (Qwen3-4B-Instruct-2507): You are a helpful assistant. You will be given privileged information about the reverse solution process of the question (i.e., construction process of the question). Please pretend not to know the source document used to const…
Figure 11: Initial instructions for the teacher in π-Play. D.4 Student Prompts. User Prompt for the Student (Qwen3-4B-Instruct-2507): Answer the given question. If you find you lack some knowledge, you can call a search engine by <tool_call> query </tool_call> and it will return the top searched results between <tool_response> and </tool_response>. You can search as many times as your want. If you find no further exte…
Figure 12: Initial instructions for the student in π-Play.
Original abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($\pi$-Play), a multi-agent self-evolution framework. In $\pi$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $\pi$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.
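
The loop described in the abstract can be sketched as plain data flow for one pass over a batch of source documents. This is a hedged outline, not the authors' implementation: `pi_play_iteration` and its four callables are hypothetical stand-ins for the examiner, the student rollout, the outcome check, and the QCP-conditioned teacher, and the parameter updates themselves are deliberately left out.

```python
from typing import Callable, Dict, List, Tuple

# (question, answer, qcp) as produced by the examiner; a flat tuple is an assumption.
Task = Tuple[str, str, str]


def pi_play_iteration(
    documents: List[str],
    generate_task: Callable[[str], Task],         # examiner: document -> task with QCP
    rollout: Callable[[str], str],                # student: question -> response, no QCP
    is_correct: Callable[[str, str], bool],       # outcome check: (response, answer)
    distill: Callable[[str, str, str], float],    # teacher: (question, qcp, response) -> loss
) -> List[Dict[str, object]]:
    """Collect, per task, the sparse outcome reward and the dense distillation loss."""
    records: List[Dict[str, object]] = []
    for doc in documents:
        question, answer, qcp = generate_task(doc)
        response = rollout(question)
        reward = 1.0 if is_correct(response, answer) else 0.0
        dense_loss = distill(question, qcp, response)
        records.append(
            {"question": question, "outcome_reward": reward, "dense_loss": dense_loss}
        )
    return records


if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end; none of these reflect real model behavior.
    out = pi_play_iteration(
        documents=["source document"],
        generate_task=lambda d: ("q?", "a", "construction path"),
        rollout=lambda q: "a",
        is_correct=lambda r, a: r == a,
        distill=lambda q, c, r: 0.5,
    )
    print(out)
```

Keeping both signals in each record mirrors the abstract's framing: the outcome reward is still computed, and the QCP-conditioned teacher adds a second, denser term on the same rollout.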

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes π-Play, a multi-agent self-play framework in which an examiner generates tasks along with their question construction paths (QCPs); a teacher then uses the QCP as privileged context to perform dense self-distillation on a student. This converts conventional sparse-reward self-play into a dense-feedback self-evolution loop that is claimed to be entirely data-free. The central empirical claims are that data-free π-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2–3× relative to standard self-play.

Significance. If the core claims are substantiated, the work offers a scalable route to dense supervision in search-agent training without curated data or human feedback. The identification of QCP as a naturally occurring privileged artifact is a creative observation that could generalize beyond the reported domain.

major comments (2)
  1. [§3.2] §3.2 (Privileged Self-Distillation): The assertion that examiner-generated QCPs supply high-quality, low-bias privileged supervision is load-bearing for the superiority claim, yet the manuscript provides no external correctness signal or validation step; because the examiner and student belong to the same model family, any systematic error in reverse reasoning is directly distilled, creating a closed-loop bias risk that conventional sparse self-play avoids.
  2. [§5] §5 (Experiments): The reported 2–3× efficiency gain and outperformance of fully supervised agents are stated without ablations that isolate the contribution of QCP-based dense distillation versus sparse outcome rewards alone, without reported statistical significance, run counts, or variance, and without explicit baselines for the supervised agents, rendering the central performance claims unverifiable from the presented evidence.
minor comments (1)
  1. [Abstract] The abstract asserts 'extensive experiments' but supplies no summary of metrics, dataset sizes, or statistical tests; a one-sentence overview of the evaluation protocol would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Privileged Self-Distillation): The assertion that examiner-generated QCPs supply high-quality, low-bias privileged supervision is load-bearing for the superiority claim, yet the manuscript provides no external correctness signal or validation step; because the examiner and student belong to the same model family, any systematic error in reverse reasoning is directly distilled, creating a closed-loop bias risk that conventional sparse self-play avoids.

    Authors: We acknowledge the potential for bias propagation when distilling QCPs within the same model family, as systematic errors in reverse reasoning could indeed be reinforced without an external correctness signal. The QCP is generated as an intrinsic byproduct of the examiner's task construction process rather than an independently verified artifact, which distinguishes it from curated privileged information but does not eliminate the closed-loop risk. We will revise the manuscript by adding a dedicated limitations paragraph in §3.2 and §6 that explicitly discusses this bias concern, contrasts it with sparse self-play, and outlines mitigation strategies such as periodic sparse-reward anchoring or cross-model distillation in future extensions. This addition clarifies the scope of the claims without altering the core method.

    Revision: partial

  2. Referee: [§5] §5 (Experiments): The reported 2–3× efficiency gain and outperformance of fully supervised agents are stated without ablations that isolate the contribution of QCP-based dense distillation versus sparse outcome rewards alone, without reported statistical significance, run counts, or variance, and without explicit baselines for the supervised agents, rendering the central performance claims unverifiable from the presented evidence.

    Authors: We agree that the experimental evidence must be strengthened to make the efficiency and performance claims verifiable. In the revised manuscript we will expand §5 with: (i) explicit ablations that isolate QCP-based dense distillation from sparse outcome rewards alone; (ii) results reported as means and standard deviations over at least five independent runs together with statistical significance tests; and (iii) detailed specifications of the fully supervised search-agent baselines, including their training regimes, data sources, and hyper-parameters. These revisions will directly address the gaps noted by the referee.

    Revision: yes
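
The reporting promised in item (ii) could take a form like the following sketch, which summarizes per-run scores and compares two methods with a two-sided permutation test on the difference of means. The choice of a permutation test is an assumption made here for illustration; the rebuttal does not name a specific test, and the function names and any scores passed in are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided Monte Carlo permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabeling of runs under the null hypothesis
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed - 1e-12:  # small tolerance for float round-off
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

With only five runs per method there are 252 distinct ways to split ten scores into two groups of five, so the smallest two-sided p-value an exact test can return is 2/252, roughly 0.008. That floor is worth keeping in mind when reading significance claims based on five seeds.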

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical framework in which self-play generates the QCP as an observed artifact that is then used for dense self-distillation. The central performance claims (surpassing supervised agents, 2-3× efficiency gains) rest on experiments rather than on mathematical derivations that reduce to their inputs by construction. The abstract and description contain no equations, no fitted parameters renamed as predictions, no load-bearing self-cited uniqueness theorems, and no ansatzes smuggled in via prior work. The loop is grounded in task outcomes and external benchmarks, so none of the enumerated circularity patterns applies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are detailed in the provided text. QCP is described as an observed natural byproduct rather than a postulated entity.

pith-pipeline@v0.9.0 · 5563 in / 1220 out tokens · 34661 ms · 2026-05-10T13:40:14.851378+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI · 2026-04 · unverdicted · novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

Reference graph

Works this paper leans on

70 extracted references · 29 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, pages 1–18, Vienna, Austria, 2024. OpenReview.net

  2. [2]

    Toolforge: A data synthesis pipeline for multi-hop search without real-world apis

    Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, and Zhuofeng Zhao. Toolforge: A data synthesis pipeline for multi-hop search without real-world apis. arXiv preprint arXiv:2512.16149, 2025

  3. [3]

    Self-questioning language models

    Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Self-questioning language models. arXiv preprint arXiv:2508.03682, 2025

  4. [4]

    Self-play fine-tuning converts weak language models to strong language models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024

  5. [5]

    Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning

    Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767, 2025

  6. [6]

    Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl

    Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl. arXiv preprint arXiv:2508.07976, 2025

  7. [7]

    Q-prop: Sample-efficient policy gradient with an off-policy critic

    Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016

  8. [8]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, pages 1–24, Vienna, Austria, 2024. OpenReview.net

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  10. [10]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics

  11. [11]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2026

  12. [12]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  13. [13]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025

  14. [14]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics

  15. [15]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, Online, 2020. Association for Computational Linguistics

  16. [16]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transact...

  17. [17]

    Websailor: Navigating super-human reasoning for web agent

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025

  18. [18]

    Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning

    Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques. Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning. arXiv preprint arXiv:2506.24119, 2026

  19. [19]

    Spice: Self-play in corpus environments improves reasoning

    Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning. arXiv preprint arXiv:2510.24684, 2025

  20. [20]

    Search self-play: Pushing the frontier of agent capability without supervision

    Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, and Guanjun Jiang. Search self-play: Pushing the frontier of agent capability without supervision. arXiv preprint arXiv:2510.18821, 2025

  21. [21]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025

  22. [22]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 9802–9822, Toronto, Canada, 2023. Association for Computa...

  23. [23]

    OpenAI, Matthias Plappert, Raul Sampedro, Tao Xu, Ilge Akkaya, Vineet Kosaraju, Peter Welinder, Ruben D’Sa, Arthur Petron, Henrique P. d. O. Pinto, Alex Paino, Hyeonwoo Noh, Lilian Weng, Qiming Yuan, Casey Chu, and Wojciech Zaremba. Asymmetric self-play for automatic goal discovery in robotic manipulation. arXiv preprint arXiv:2101.04882, 2021

  24. [24]

    Privileged information distillation for language models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942, 2026

  25. [25]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore, 2023. Association for Computational Linguistics

  26. [26]

    CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

    Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. Crisp: Compressed reasoning via iterative self-policy distillation. arXiv preprint arXiv:2603.05433, 2026

  27. [27]

    High-dimensional continuous control using generalized advantage estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In The Fourth International Conference on Learning Representations, pages 1–14, San Juan, Puerto Rico, 2016. OpenReview.net

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  29. [29]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  30. [30]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592, 2025

  31. [31]

    Zerosearch: Incentivize the search capability of llms without searching

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588, 2025

  32. [32]

    Musique: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  33. [33]

    Dynamic dual-granularity skill bank for agentic rl

    Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, and Dongbin Zhao. Dynamic dual-granularity skill bank for agentic rl. arXiv preprint arXiv:2603.28716, 2026

  34. [34]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  35. [35]

    OpenClaw-RL: Train Any Agent Simply by Talking

    Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026

  36. [36]

    Stepsearch: Igniting llms search ability via step-wise proximal policy optimization

    Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. arXiv preprint arXiv:2505.15107, 2025

  37. [37]

    Webdancer: Towards autonomous information seeking agency

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding-Chu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, pages 1–29, San Diego, USA, 2025...

  38. [38]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  39. [39]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium, 2018. Association for Computational ...

  40. [40]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, pages 1–33, Kigali, Rwanda, 2023. OpenReview.net

  41. [41]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

  42. [42]

    Self-rewarding language models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. In Proceedings of the 41st International Conference on Machine Learning, pages 57905–57923, Vienna, Austria, 2024. PMLR

  43. [43]

    Promoting efficient reasoning with verifiable stepwise reward

    Chuhuai Yue, Chengqi Dong, Yinan Gao, Hang He, Jiajun Chai, Guojun Yin, and Wei Lin. Promoting efficient reasoning with verifiable stepwise reward. arXiv preprint arXiv:2508.10293, 2025

  44. [44]

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, pages 1–36, San Diego, USA, 2025. Curran Associates Inc

  45. [45]

    Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data. arXiv preprint arXiv:2601.07055, 2026

  46. [46]

    Criticsearch: Fine-grained credit assignment for search agents via a retrospective critic

    Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, and Dongbin Zhao. Criticsearch: Fine-grained credit assignment for search agents via a retrospective critic. arXiv preprint arXiv:2511.12159, 2025

  47. [47]

    Offline goal-conditioned reinforcement learning with elastic-subgoal diffused policy learning

    Yaocheng Zhang, Yuanheng Zhu, Yuqian Fu, Songjun Tu, and Dongbin Zhao. Offline goal-conditioned reinforcement learning with elastic-subgoal diffused policy learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 2336–2344, Richland, SC, 2025. International Foundation for Autonomous Agents and Multia-...

  48. [48]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

  49. [49]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 414–431, Suzhou, China, 2025. Association for Computational Linguistics. ...

  50. [50]

    FC Barcelona Rugby League

    Compared with SQLM* and Dr.Zero, π-Play demonstrates fewer search actions and lower query redundancy.

              Avg Accuracy↑   Search Count↓   Query Redundancy↓
    SQLM*     35.7            2.3             0.43
    Dr.Zero   38.0            2.4             0.48
    π-Play    39.6            1.9             0.37

    B.4 Case Study

    We present QCP examples generated by the examiner during the π-Play training process. The QCPs for different hop settings are pres...

  51. [51]

    Hop 1 is the starting entity found in the document

    Hop: A node in the reasoning chain. Hop 1 is the starting entity found in the document. Hop n is the final answer.

    ### Inputs

  52. [52]

    n: the exact number of hops in the reasoning chain (requiring n-1 searches)

  53. [53]

    ### Process & Tools

    Source document: the full source text.

    ### Process & Tools

  54. [54]

    - Select a specific entity, event or detail explicitly mentioned in the text

    Analyze the Document and Select the Starting Point
    - Read and analyze the source document.
    - Select a specific entity, event or detail explicitly mentioned in the text. This entity becomes Hop 1 (the initial clue).

  55. [55]

    The result is Hop 2

    Design the Chain Forwards
    - From Hop 1 to Hop 2: Identify a factual attribute or relation of Hop 1 that is NOT in the text but can be found via search. The result is Hop 2.
    - Iterate: Continue connecting the current Hop i to the next Hop i+1 using a deterministic, verifiable relation found via search.
    - Stop at Hop n: Continue this process until you have ex...

  56. [56]

    </think>`when you plan connections or receive new information

    Reasoning & Search Protocol
    - Always reason inside `<think> ... </think>` when you plan connections or receive new information.
    - For each hop transition that requires external information, issue a search query using `<tool_call> ... </tool_call>`.
    - Search results will be provided between `<tool_response> ... </tool_response>` by the system.
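The tag protocol above can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: it assumes tags are well-formed and not nested, and the function name `parse_segments` and the sample transcript are hypothetical. Only the tag names come from the prompt text.

```python
import re
from typing import List, Tuple

# Tag names are taken from the prompt text; everything else is illustrative.
TAG_PATTERN = re.compile(
    r"<(think|tool_call|tool_response|question|answer)>(.*?)</\1>",
    re.DOTALL,
)


def parse_segments(transcript: str) -> List[Tuple[str, str]]:
    """Return (tag, content) pairs in the order they appear in the transcript."""
    return [
        (match.group(1), match.group(2).strip())
        for match in TAG_PATTERN.finditer(transcript)
    ]


# Hypothetical 2-hop transcript (one search), used only to exercise the parser.
example = (
    "<think> Hop 1 is Acme Corp; look up its founder. </think>"
    "<tool_call> founder of Acme Corp </tool_call>"
    "<tool_response> Acme Corp was founded by Jane Doe. </tool_response>"
    "<think> Hop 2 is Jane Doe. </think>"
    "<question> Who founded Acme Corp? </question>"
    "<answer> Jane Doe </answer>"
)

segments = parse_segments(example)
for tag, content in segments:
    print(f"{tag}: {content}")
```

Splitting the transcript into ordered (tag, content) pairs keeps the reasoning, search, and response turns separable for later checks.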

  57. [57]

    Output Format
    - Emit a numbered sequence of EXACTLY n-1 search steps. For each search i (1 to n-1), produce:
    `<think> Reasoning step i: Identify Hop i in document/search results, formulate query to reach Hop i+1 </think>`
    `<tool_call> Query to search Hop i+1 </tool_call>`
    `[Wait for search results in <tool_response> from system]`
    - After completing all se...

  58. [58]

    Example template for Hop n = 1, i.e. no search:
    `<think> [Explain how Hop 1 is selected from the source document and how the question is formulated] </think>`
    `<question> [Question based solely on the text entity Hop 1] </question>`
    `<answer> [Answer (Hop 1)] </answer>`

  59. [59]

    Example template for Hop n = 3, i.e. 2 searches:
    `<think> [Reasoning step 1: Find Hop 1 in the source document, formulate the query to reach Hop 2] </think>`
    `<tool_call> [Search query to find Hop 2 based on Hop 1] </tool_call>`
    `[Wait for search results in <tool_response> from system]`
    `<think> [Reasoning step 2: Reason on search results to identify Hop ...

  60. [60]

    Every subsequent hop must be supported by the corresponding search results

    Start in Document: Hop 1 must be explicitly present in the source text. Every subsequent hop must be supported by the corresponding search results

  61. [61]

    Search is mandatory for n > 1: Each link between hops beyond Hop 1 must use the search engine

  62. [62]

    Exact search count: Emit exactly (n-1) `<tool_call>` entries, no more, no fewer

  63. [63]

    No spoilers: The question must mention only Hop 1; do not include or hint at intermediate hops

  64. [64]

    Clarity: The question is self-contained; the answer is concise and direct (no extra commentary, formatting or explanation)

  65. [65]

    No hop should be skippable or derivable without its immediate predecessor

    Chain integrity: Each hop must depend strictly on the previous hop. No hop should be skippable or derivable without its immediate predecessor.

    Now, generate a question and its answer with n = {hops} hops starting from the following source document: {document}

    Figure 10: Initial instructions for the examiner in π-Play. Our prompt for examiner is developed ba...
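Several of the examiner constraints listed above (exact search count, no spoilers, Hop n as the answer) are mechanical enough to check automatically. The sketch below is an assumption-laden illustration, not part of the paper: `check_examiner_output` and its `hops` argument (the intended chain, Hop 1 to Hop n) are hypothetical, and spoiler detection by case-insensitive substring match is a crude stand-in for whatever filtering the authors actually use.

```python
import re
from typing import List, Optional


def extract_tag(output: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> block, if any."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def count_tag(output: str, tag: str) -> int:
    """Count complete <tag>...</tag> blocks in the output."""
    return len(re.findall(rf"<{tag}>.*?</{tag}>", output, flags=re.DOTALL))


def check_examiner_output(output: str, n: int, hops: List[str]) -> List[str]:
    """Return descriptions of violated constraints (empty list if none found)."""
    problems = []
    if len(hops) != n:
        problems.append(f"hop chain has {len(hops)} entries, expected {n}")
        return problems
    # Exact search count: exactly n-1 <tool_call> entries.
    searches = count_tag(output, "tool_call")
    if searches != n - 1:
        problems.append(f"search count: expected {n - 1}, found {searches}")
    question = extract_tag(output, "question")
    answer = extract_tag(output, "answer")
    if question is None or answer is None:
        problems.append("missing <question> or <answer>")
        return problems
    # No spoilers: the question must not name any hop after Hop 1.
    if any(hop.lower() in question.lower() for hop in hops[1:]):
        problems.append("spoiler: question mentions a later hop")
    # Hop n is the final answer.
    if answer.lower() != hops[-1].lower():
        problems.append("answer does not match Hop n")
    return problems


# Hypothetical outputs for a 2-hop chain (Acme Corp -> Jane Doe).
good = (
    "<think> Pick Acme Corp as Hop 1. </think>"
    "<tool_call> founder of Acme Corp </tool_call>"
    "<tool_response> Acme Corp was founded by Jane Doe. </tool_response>"
    "<think> Hop 2 is Jane Doe. </think>"
    "<question> Who founded Acme Corp? </question>"
    "<answer> Jane Doe </answer>"
)
bad = (
    "<think> Skip the search. </think>"
    "<question> Who is Jane Doe, the founder of Acme Corp? </question>"
    "<answer> Jane Doe </answer>"
)

good_problems = check_examiner_output(good, n=2, hops=["Acme Corp", "Jane Doe"])
bad_problems = check_examiner_output(bad, n=2, hops=["Acme Corp", "Jane Doe"])
print(good_problems)
print(bad_problems)
```

Chain integrity and clarity are semantic properties and are left out here, since a string-level check cannot judge them reliably.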

  66. [66]

    Precise Searching: When generating search queries, ensure they are specific, semantically complete, and directly target the key information you are missing

  67. [67]

    Context Retention: Remember prior conversations and search results, maintaining logical consistency across multiple rounds of searching

  68. [68]

    Termination Judgment: When information is sufficient to determine the answer, immediately stop searching and output the answer

  69. [69]

    Reference to Privileged Information: When outputting search actions and answers, refer to the question construction process in privileged information, but do not directly output the ground-truth in the first turn or directly use extra information from the source document

  70. [70]

    Steven Febey

    Search Boundary Constraint: When searching, do not introduce information that is not contained in the question or search results, even if it exists in the source document.

    ### Example (Only showcase the logical style, please do not directly imitate the specific content)

    #### User

    This is the reverse solution process of the question (i.e., the process of ...