Are Full Rollouts Necessary for On-Policy Distillation?

Dongbin Zhao; Guojun Yin; Jiajun Chai; Qichao Zhang; Songjun Tu; Wei Lin; Xiaohan Wang; Yaocheng Zhang; Yuanheng Zhu; Yuqian Fu

arxiv: 2605.31490 · v1 · pith:GEDPD23Onew · submitted 2026-05-29 · 💻 cs.CL

Pith reviewed 2026-06-28 22:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords: on-policy distillation, rollout horizon, mathematical reasoning, training efficiency, truncated rollouts, progressive training, large language models, distillation methods

The pith

Truncated or progressively growing rollouts suffice for effective on-policy distillation on math reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether on-policy distillation requires generating complete student trajectories to supply useful teacher signals. It notes that OPD delivers dense feedback at every step along the rollout and, unlike reward-based methods, needs neither a finished path nor a terminal answer to produce learning gradients. This leads to two horizon-limiting methods: one that starts short and lengthens the rollout over training, and one that keeps the rollout permanently short at reliable early positions. Experiments on mathematical reasoning show the first method speeds training by up to three times and the second reaches the same final performance with one-tenth the horizon length, cutting both time and memory use. The work therefore treats rollout length as a controllable variable rather than a fixed requirement.

Core claim

Standard on-policy distillation is bottlenecked by the need to generate full rollouts, which is costly and can expose the student to unreliable late-stage teacher feedback. Because OPD supplies learning signals throughout the sequence without requiring a complete trajectory or final reward, the authors introduce Progressive OPD, which gradually increases rollout length during training, and Truncated OPD, which fixes distillation on shorter, more reliable prefixes. On mathematical reasoning tasks, Progressive OPD improves training efficiency by up to 3× while Truncated OPD matches full-horizon performance using only 10% of the rollout length, producing large reductions in wall-clock time and memory consumption.

What carries the argument

Rollout horizon control via progressive expansion (POPD) or permanent truncation (TOPD) inside on-policy distillation loops.
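The two horizon controls can be sketched as small scheduling functions. This is a minimal illustration, not the paper's implementation: the function names, the `start_ratio` parameter, and the linear growth rule for POPD are assumptions, since the summary does not state the schedule shape. Only the truncation ratio ρ corresponds to a quantity named in the paper.

```python
def topd_horizon(max_len: int, rho: float) -> int:
    """Fixed horizon for Truncated OPD: a constant fraction rho of the full length."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    return max(1, int(round(rho * max_len)))


def popd_horizon(step: int, total_steps: int, max_len: int,
                 start_ratio: float = 0.1) -> int:
    """Growing horizon for Progressive OPD.

    ASSUMPTION: linear growth from start_ratio * max_len up to max_len.
    The schedule actually used in the paper may differ.
    """
    if total_steps <= 1:
        return max_len
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    ratio = start_ratio + (1.0 - start_ratio) * frac
    return max(1, int(round(ratio * max_len)))


if __name__ == "__main__":
    max_len = 1000  # hypothetical full rollout budget in tokens
    print("TOPD horizon:", topd_horizon(max_len, rho=0.1))
    print("POPD horizons:", [popd_horizon(s, 5, max_len) for s in range(5)])
```

In a training loop, the returned value would cap the number of tokens the student generates for each rollout at that step, so shorter horizons reduce both generation time and the sequence length held in memory.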

If this is right

Progressive expansion of the rollout horizon improves OPD training efficiency by up to 3×.
Fixed truncation to 10% of the horizon matches the performance of full-horizon OPD on mathematical reasoning.
Both horizon-control methods produce substantial reductions in wall-clock time and memory consumption.
The rollout horizon itself is a primary controllable factor in OPD training cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Early truncation may shield the student from noisy or unreliable teacher signals that appear only in long rollouts during early training.
The same horizon-limiting logic could be tested on other long-horizon domains such as code generation where dense intermediate feedback is available.
Dynamic adjustment of horizon length based on measured feedback reliability could further reduce wasted computation.
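The last extension above, a horizon chosen from measured feedback reliability, could look roughly like the following. This is a speculative sketch: the function name, the threshold rule, and the use of per-position teacher-student KL as a reliability proxy are all assumptions made for illustration, loosely motivated by the paper's observation that KL divergence grows with rollout position.

```python
from typing import Sequence


def pick_horizon(per_position_kl: Sequence[float], threshold: float,
                 min_horizon: int = 1) -> int:
    """Return how many leading rollout positions to distill on.

    ASSUMPTION: a position counts as "unreliable" once its measured
    teacher-student KL exceeds `threshold`; the horizon stops just before
    the first such position, but never drops below `min_horizon`.
    """
    n = len(per_position_kl)
    for pos, kl in enumerate(per_position_kl):
        if kl > threshold:
            return min(max(min_horizon, pos), n)
    return n  # no position exceeded the threshold: keep the full rollout


if __name__ == "__main__":
    # Hypothetical KL profile that rises with position.
    profile = [0.1, 0.2, 0.3, 0.9, 1.2]
    print(pick_horizon(profile, threshold=0.5))
```

A real system would probably smooth the KL profile over many rollouts before thresholding, since a single noisy position would otherwise cut the horizon too early.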

Load-bearing premise

That teacher feedback on truncated or early-stage rollouts remains reliable and sufficient for effective learning without requiring the complete trajectory or a final answer reward.

What would settle it

A side-by-side run on the same mathematical reasoning benchmarks in which either Progressive OPD or Truncated OPD at the reported horizons produces lower final accuracy or slower convergence than standard full-rollout OPD would disprove the efficiency claims.

Figures

Figures reproduced from arXiv: 2605.31490 by Dongbin Zhao, Guojun Yin, Jiajun Chai, Qichao Zhang, Songjun Tu, Wei Lin, Xiaohan Wang, Yaocheng Zhang, Yuanheng Zhu, Yuqian Fu.

**Figure 1.** TOPD achieves comparable reasoning performance with substantially lower cost. Left: AIME24 accuracy curve for OPD and TOPD with different truncation ratios. Truncated variants, ρ = 0.1 and ρ = 0.5, achieve comparable or better performance than full-rollout OPD, while requiring much less training cost. Right: Theoretical minimum GPU memory requirement and total training time for two epochs under different m…

**Figure 2.** Overview of horizon control for efficient OPD. Standard OPD distills full rollouts throughout training. In contrast, POPD progressively expands the rollout horizon, while TOPD restricts distillation to truncated rollouts.

**Figure 3.** Comparison between token-level OPD and sequence-level OPD under different degrees of teacher-student mismatch. When the mismatch is small (30°, 45°), both methods can distill the student toward the teacher's target. As the mismatch increases (60°, 75°, 90°), later rollout positions are more likely to enter regions with unreliable teacher feedback, causing sequence-level OPD to propagate noisy log-ratio sign…

**Figure 4.** Teacher reliability in the simple navigation task. The teacher and initial student trajectories point to different targets. Teacher guidance is reliable near its trajectory but becomes less reliable with distance.

**Figure 5.** Training efficiency comparison between OPD and POPD. The student is R1-Distill-1.5B, and the teacher is JustRL-R1-1.5B.

**Figure 6.** Performance of TOPD with different rollout ratios on AIME24. The student is R1-Distill-1.5B, and the teacher is JustRL-R1-1.5B. Moderate truncation matches or even surpasses the performance of standard OPD while substantially reducing training cost.

**Figure 7.** Autoregressive control task. The teacher and initial student trajectories point to different targets. The goal is to distill the teacher policy into the student so that trajectories generated by the student head toward the teacher's target. See Appendix D of the paper for task details.

**Figure 8.** Training efficiency comparison between OPD and POPD in the autoregressive control task. POPD reaches a high success rate substantially faster than standard OPD, showing that progressive horizon expansion improves long-horizon distillation efficiency.

**Figure 9.** Prefix-continuation analysis on AIME24. For each problem in AIME24, the teacher generates 256 truncated prefixes, and the student (without any OPD training) continues generation from each teacher-generated prefix. The x-axis denotes the rollout ratio of teacher-generated prefixes in the prefix-continuation setting, or the truncation ratio used in the TOPD setting.

**Figure 10.** Prefix-continuation analysis before and after TOPD. The student model before and after TOPD (ρ = 0.1) training is used to continue generation from teacher-generated prefixes. The x-axis is the rollout ratio of the teacher-generated prefixes.

**Figure 11.** Reverse distillation on AIME24. We use the weaker R1-Distill-1.5B as the teacher and the stronger JustRL-R1-1.5B as the student. TOPD pulls the student toward the weaker teacher even with truncated rollouts, indicating that truncated rollouts provide a strong optimization signal.

**Figure 13.** We randomly sampled 500 questions from DAPO-Math-17K and measured the KL divergence between the teacher and the student at each position within the student rollout. As the rollout position moves deeper, the KL divergence increases, indicating that the end of the student's rollout gradually shifts to a distribution unfamiliar to the teacher.

**Figure 12.** Ablation study on distillation from different rollout segments. We split each complete rollout into several segments according to token positions and distill the student using one segment at a time.

**Figure 14.** Sequence-level OPD under an analytic optimal teacher policy. When the teacher is optimal, future log-ratio signals remain reliable despite student mismatch, allowing sequence-level OPD to learn an accurate policy. However, sequence-level OPD still aggregates future signals into updates of early tokens, which may introduce additional variance and lead to slower convergence than token-level OPD (Fu et al., …

read the original abstract

On-policy distillation (OPD) provides dense teacher feedback along rollouts generated by the student and has emerged as a promising post-training paradigm for long-horizon reasoning. However, standard OPD typically generates full rollouts during training, which is computationally expensive and may expose the student to unreliable teacher feedback at late rollout positions, especially during early training. We identify the rollout horizon as a key bottleneck in OPD that substantially impacts training efficiency. Unlike Reinforcement Learning with Verifiable Rewards (RLVR), OPD does not require a complete trajectory or a final answer reward to provide learning signals. This observation suggests that full rollouts may not always be necessary for effective OPD. Motivated by this insight, we propose two simple horizon-control strategies: Progressive OPD (POPD), which gradually expands the rollout horizon during training, and Truncated OPD (TOPD), which permanently performs distillation on reliable truncated rollouts. Experiments on mathematical reasoning show that POPD improves the training efficiency of OPD by up to 3×, while TOPD matches OPD performance using only 10% of the rollout horizon, leading to substantial wall-clock and memory reductions. These results demonstrate that controlling the rollout horizon offers a simple and practical path to more efficient OPD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows truncated or growing rollouts can match full OPD performance on math reasoning with big efficiency gains, but the abstract leaves the experimental details thin.

The main thing to know is that this work tests whether on-policy distillation needs complete trajectories. It proposes two horizon controls: POPD ramps the length up over training, and TOPD sticks to short fixed ones. On math reasoning tasks the abstract reports POPD reaching 3x efficiency and TOPD matching full OPD at 10% horizon length, with lower wall-clock and memory use.

The paper does a clean job of spelling out why OPD differs from RLVR here. Because the teacher gives dense token-level feedback along the rollout, the final answer is not required for a learning signal. That observation is straightforward and leads directly to the two simple strategies. The framing is practical and targets a real cost in current setups.
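The contrast with RLVR can be made concrete with a toy per-position loss. The sketch below assumes a reverse-KL objective, KL(student ‖ teacher), evaluated independently at each token position; the function names and the toy logits are hypothetical, and the paper's exact loss may differ. The point it illustrates is that every position inside the truncated horizon yields a training signal, with no need for the tokens beyond it or for a final answer.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(student_logits: Sequence[Sequence[float]],
                         teacher_logits: Sequence[Sequence[float]],
                         horizon: int) -> List[float]:
    """KL(student || teacher) at each of the first `horizon` positions.

    ASSUMPTION: both models were scored on the same student-generated
    tokens, so row i of each input is the next-token distribution at
    rollout position i.
    """
    losses = []
    for s_row, t_row in zip(student_logits[:horizon], teacher_logits[:horizon]):
        s_lp = log_softmax(s_row)
        t_lp = log_softmax(t_row)
        losses.append(sum(math.exp(s) * (s - t) for s, t in zip(s_lp, t_lp)))
    return losses


if __name__ == "__main__":
    # Toy rollout: 4 positions, vocabulary of 3 tokens (made-up logits).
    student = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
    teacher = [[1.0, 2.0, 3.0], [0.0, 1.0, 2.0], [2.0, 0.0, -1.0], [1.0, 0.0, 0.0]]
    # Distill on the first half only; positions 2 and 3 are never scored.
    print(per_token_reverse_kl(student, teacher, horizon=2))
```

A reward-based method would have nothing to backpropagate from such a half-finished rollout, whereas the list returned here already contains one loss term per retained position.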

The soft spot is the lack of visible experimental detail. The abstract mentions positive outcomes but gives no baselines, dataset sizes, statistical tests, or ablation numbers. Without those it is hard to judge whether the gains are robust or depend on particular model scales or task difficulties. The core assumption that truncated teacher feedback stays reliable also gets limited discussion.

This is for people already running OPD on reasoning models who want to cut training cost. A reader in that niche could pick up a usable trick if the numbers hold. The claim is concrete enough and the efficiency angle matters enough that it deserves a serious referee rather than a desk reject.

I would send it to review and ask specifically for the full experimental protocol and controls.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that full rollouts are unnecessary for on-policy distillation (OPD) in long-horizon reasoning because dense teacher feedback does not require complete trajectories or final-answer rewards. It introduces two horizon-control methods—Progressive OPD (POPD), which gradually expands the rollout length during training, and Truncated OPD (TOPD), which uses fixed short horizons—and reports that on mathematical reasoning tasks POPD yields up to 3× training-efficiency gains while TOPD matches standard OPD performance with only 10 % of the rollout horizon, producing wall-clock and memory savings.

Significance. If the reported efficiency gains and performance equivalence hold under rigorous controls, the work offers a practical route to lower the computational cost of OPD-based post-training by showing that truncated or progressively expanding horizons suffice when teacher signals are dense. The direct empirical tests of the two proposed strategies constitute a concrete, falsifiable contribution to efficient distillation methods.

major comments (1)

Experiments section: the claims of up to 3× efficiency improvement for POPD and performance matching for TOPD at 10% horizon are presented without reported details on the number of random seeds, variance or statistical significance tests, or the precise baseline implementations and dataset splits; these omissions prevent verification that the measured speed-ups are robust and not artifacts of single-run variability.

minor comments (1)

The abstract would be clearer if it named the specific mathematical reasoning benchmarks and teacher model used in the reported experiments.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the recommendation for minor revision. The single major comment concerns missing experimental details on seeds, variance, and baselines; we address this directly below and will incorporate the requested information in the revised manuscript.

Referee: Experiments section: the claims of up to 3× efficiency improvement for POPD and performance matching for TOPD at 10% horizon are presented without reported details on the number of random seeds, variance or statistical significance tests, or the precise baseline implementations and dataset splits; these omissions prevent verification that the measured speed-ups are robust and not artifacts of single-run variability.

Authors: We agree that these details are necessary for verifying robustness. In the revised manuscript we will (i) report all main results as means over at least three random seeds with standard deviations, (ii) include statistical significance tests (paired t-tests or Wilcoxon) between methods where performance differences are claimed, and (iii) expand the experimental section with exact baseline code references, dataset splits, and hyper-parameter tables. These additions will be placed in a new “Experimental Details” subsection and will not alter the core claims.

Revision: yes
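The reporting promised in items (i) and (ii) could be computed along these lines. This is a standard-library sketch with hypothetical function names; the numbers in the usage block are placeholders chosen for illustration and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(xs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over per-seed scores."""
    return statistics.mean(xs), statistics.stdev(xs)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for two methods evaluated on the same seeds.

    The statistic has len(a) - 1 degrees of freedom; turning it into a
    p-value needs the Student t distribution, which is not computed here.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # PLACEHOLDER per-seed accuracies, not taken from the paper.
    full_opd = [0.30, 0.32, 0.31]
    truncated = [0.31, 0.31, 0.33]
    print("full OPD  mean/std:", mean_std(full_opd))
    print("truncated mean/std:", mean_std(truncated))
    print("paired t:", paired_t_statistic(truncated, full_opd))
```

With only three seeds the test has two degrees of freedom and little power, so a non-significant difference would not by itself establish that the truncated and full-horizon methods perform equally. For p-values, a library routine such as `scipy.stats.ttest_rel` could replace the hand-rolled statistic.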

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical study that proposes two horizon-control strategies (POPD and TOPD) motivated by the observation that OPD supplies dense teacher feedback without requiring complete trajectories or final-answer rewards (unlike RLVR). These strategies are directly tested via experiments on mathematical reasoning tasks, reporting measured speed-ups (up to 3×) and performance equivalence at 10% horizon. No derivations, equations, fitted parameters renamed as predictions, or self-citations appear in the provided text; the central claims rest on external experimental outcomes rather than internal self-definition or load-bearing citations. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that partial rollouts supply usable teacher signals; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: teacher feedback on truncated rollouts is reliable and sufficient for learning. Invoked to justify why full trajectories are unnecessary.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Blockwise Policy-Drift Gating for On-Policy Distillation
cs.LG 2026-06 unverdicted novelty 5.0

Blockwise policy-drift gating raises mean pass@8 from 0.4978 to 0.5160 on four math benchmarks by reweighting OPD losses with detached mean-normalized gates from student policy drift over 64-token blocks.

A Formula-Driven Survey and Research Agenda for On-Policy Distillation
cs.AI 2026-06 unverdicted novelty 4.0

A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.

Reference graph

Works this paper leans on

6 extracted references · 5 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1]

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

Learning to foresee: Unveiling the unlocking efficiency of on-policy distillation. arXiv preprint arXiv:2605.11739. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 181 others

[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Wei Du, Shubham Toshniwal, Branislav Kisacanin, Sadegh Mahdavi, Ivan Moshkov, George Armstrong, Stephen Ge, Edgar Minasyan, Feng Chen, and Igor Gitman. 2025. Nemotron-math: Efficient long-context distillation of mathematical reasoning fr...

[3]

Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562. GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, and 168 others...

[4]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016. Kevin Lu and Thinking Machines Lab. 2025. On-policy distillation. Thinking Machines Lab: Connectionism. https://thinkingmachines.ai/blog/on-policy-distillation. Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Ale...

[5]

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

π-play: Multi-agent self-play via privileged self-distillation without external data. arXiv preprint arXiv:2604.14054. A Related Work. MiniLLM first formalized OPD for LLMs under a reverse KL objective optimized via policy gradient (Gu et al., 2024; Yue et al., 2025). Unlike offline distillation (Kim and Rush, 2016), which aligns the student with teacher...

[6]

In contrast, our work studies OPD efficiency from the perspective of rollout horizon control

analyze efficient OPD from the perspective of parameter dynamics and optimization behavior. In contrast, our work studies OPD efficiency from the perspective of rollout horizon control. We show that full rollouts are not always necessary for effective OPD, and that prioritizing reliable rollout segments can substantially reduce training cost while pre...

2017
