pith. machine review for the scientific record.

arxiv: 2602.12125 · v2 · submitted 2026-02-12 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 3 theorem links

· Lean Theorem

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation


Pith reviewed 2026-05-16 05:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords on-policy distillation · knowledge distillation · reinforcement learning · reward extrapolation · math reasoning · code generation

The pith

Generalized on-policy distillation using reward extrapolation enables students to surpass their teachers when merging domain experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that standard on-policy distillation is a special case of KL-constrained reinforcement learning with equal weighting of reward and regularization. The authors introduce a generalized framework that allows a reward scaling factor and flexible reference model. Setting the scaling factor above one, which they call ExOPD, leads to consistent improvements over standard distillation in experiments on math reasoning and code generation tasks. When merging knowledge from different domain experts, this approach allows the student model to exceed the performance of the original teacher and the individual domain teachers.

Core claim

The central discovery is that on-policy distillation corresponds to a constrained RL objective where the reward and KL terms are always balanced equally. By introducing a reward scaling factor greater than one in the generalized objective, the method achieves reward extrapolation that improves student performance. In settings that combine multiple domain-specific experts, the extrapolated student outperforms both the combined teacher and the separate experts. In strong-to-weak distillation, choosing the teacher's pre-RL base model as the reference further refines the reward signal.
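The equal-weighting claim can be sketched with a standard identity. This is a reconstruction from the summary above, not the paper's own derivation; the symbols $\pi^*$ (teacher), $\pi_{\mathrm{ref}}$ (reference model), and $\lambda$ (reward scaling factor), as well as the sequence-level form, are notational assumptions.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi^*\big)
  &= \mathbb{E}_{y \sim \pi_\theta}\big[\log \pi_\theta(y|x) - \log \pi^*(y|x)\big] \\
  &= \mathbb{E}_{y \sim \pi_\theta}\Big[\log \tfrac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}\Big]
   - \mathbb{E}_{y \sim \pi_\theta}\Big[\log \tfrac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}\Big] \\
  &= D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
   - \mathbb{E}_{y \sim \pi_\theta}\big[r(x,y)\big],
  \qquad r(x,y) = \log \tfrac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} . \\[4pt]
J_\lambda(\theta)
  &= \lambda\, \mathbb{E}_{y \sim \pi_\theta}\big[r(x,y)\big]
   - D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \\
\pi_\lambda(y|x)
  &\propto \pi_{\mathrm{ref}}(y|x)\, e^{\lambda r(x,y)}
   = \pi_{\mathrm{ref}}(y|x)^{\,1-\lambda}\, \pi^*(y|x)^{\lambda} .
\end{aligned}
```

Under this sketch, minimizing the OPD divergence is the same as maximizing $J_1$ for any reference with matching support, so $\lambda = 1$ returns the teacher. For $\lambda > 1$, and provided the expression is normalizable, the maximizer moves past the teacher in the direction away from the reference.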

What carries the argument

The central mechanism is the reward scaling factor applied to the teacher-derived reward in the G-OPD loss function, which increases the influence of the reward relative to the KL regularization term.
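A minimal numerical sketch of that mechanism on a single next-token distribution, assuming the reward is the teacher-to-reference log-ratio. The function names and the toy probabilities are hypothetical and not taken from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gopd_objective(student, teacher, ref, lam):
    """Assumed G-OPD-style objective for one token position:
    lam * E_student[log teacher/ref] - KL(student || ref)."""
    reward = sum(s * (math.log(t) - math.log(r))
                 for s, t, r in zip(student, teacher, ref))
    return lam * reward - kl(student, ref)

# Hypothetical toy distributions over a 3-token vocabulary.
student = [0.5, 0.3, 0.2]
teacher = [0.7, 0.2, 0.1]
ref_a = [0.4, 0.4, 0.2]
ref_b = [0.2, 0.3, 0.5]

for ref in (ref_a, ref_b):
    # With lam = 1 the reference cancels and the value is -KL(student || teacher).
    print(gopd_objective(student, teacher, ref, lam=1.0), -kl(student, teacher))
    # With lam > 1 the value depends on which reference is used.
    print(gopd_objective(student, teacher, ref, lam=2.0))
```

With `lam=1.0` both references give the same value, which is the negative reverse KL to the teacher; with `lam=2.0` the choice of reference starts to matter, which is why the reference model becomes a design decision once the weights are no longer equal.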

If this is right

  • ExOPD outperforms standard OPD across different teacher-student size combinations on math and code tasks.
  • Merging domain experts with ExOPD lets the student surpass the teacher's performance boundary and beat the domain teachers.
  • Using the teacher's pre-RL base model as reference in strong-to-weak distillation provides a more accurate reward and boosts performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar reward scaling could potentially improve other forms of distillation where the teacher is a fine-tuned model.
  • Exploring reference models other than the pre-RL base might reduce the access requirement while retaining benefits.
  • The results suggest that the standard equal weighting in OPD may under-emphasize the task reward in some cases.

Load-bearing premise

The claim depends on the assumption that scaling up the reward weight produces better policies without causing the student to diverge from useful behavior.
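To make the premise concrete, the sketch below evaluates the idealized closed-form maximizer of a reward-scaled, KL-regularized objective, which is proportional to teacher^lam * ref^(1 - lam). This is an assumption derived from the standard KL-constrained RL solution, not a result reported by the paper, and the distributions are hypothetical.

```python
import math

def extrapolated_policy(teacher, ref, lam):
    """Normalize teacher**lam * ref**(1 - lam), computed in log space."""
    logits = [lam * math.log(t) + (1.0 - lam) * math.log(r)
              for t, r in zip(teacher, ref)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
ref = [0.4, 0.4, 0.2]

for lam in (1.0, 1.5, 3.0):
    print(lam, [round(p, 3) for p in extrapolated_policy(teacher, ref, lam)])
```

At `lam = 1.0` the result is the teacher itself. As `lam` grows, mass concentrates on the tokens the teacher prefers more strongly than the reference does, which illustrates both the intended extrapolation and the risk of over-sharpening that the premise refers to.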

What would settle it

If experiments applying ExOPD to the code generation tasks show that the student performance does not exceed the teacher's on the evaluation benchmarks, the central claim would be falsified.

Original abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that standard on-policy distillation (OPD) is a special case of dense KL-constrained RL (with equal reward-KL weighting and arbitrary reference), introduces the Generalized On-Policy Distillation (G-OPD) framework that adds a flexible reference model and a tunable reward scaling factor, and reports that setting the scaling factor >1 (ExOPD) yields consistent gains over OPD on math reasoning and code generation. In particular, ExOPD enables merging of domain-expert knowledge back into the base student such that the student surpasses the original teacher, and in strong-to-weak settings choosing the teacher's pre-RL checkpoint as reference further improves the reward signal.

Significance. If the empirical claims hold under rigorous testing, the work supplies a simple, theoretically motivated knob (reward extrapolation) that can push distillation performance past the teacher in multi-expert merging scenarios and clarifies the role of the reference model in strong-to-weak transfer. The explicit reduction of OPD to a special case of KL-constrained RL is a useful organizing insight for the field.

major comments (3)
  1. [Abstract and §5 (Experiments)] Abstract and experimental results: the claims of 'consistent improvements' and student outperformance of the teacher in domain-expert merging are presented without run counts, error bars, or statistical tests. This is load-bearing for the central empirical claim and must be addressed before the gains can be considered reliable.
  2. [§4.2 and §5.3] Reward-correction step in strong-to-weak distillation: the improvement is tied to using the teacher's pre-RL base model as reference. It is unclear whether the domain-expert merging results (where the student surpasses the teacher) also rely on this specific reference choice or employ a different reference; the paper should state the reference model used in each table/figure and quantify the extra overhead.
  3. [§3 (Theoretical Analysis)] Theoretical section: the statement that OPD is recovered exactly when the scaling factor equals 1 and the reference is arbitrary is asserted directly. The derivation should explicitly substitute the G-OPD objective (Eq. for the generalized loss) back into the dense KL-constrained RL objective to show the reduction without additional assumptions.
minor comments (2)
  1. [Throughout] Notation for the reward scaling factor should be introduced once and used uniformly; occasional switches between symbols or implicit definitions reduce readability.
  2. [Figures 2-4 and Tables 1-3] Figure captions and table footnotes should explicitly list the reference model and scaling factor used in each row/column so that the ExOPD vs. OPD comparison is immediately verifiable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped improve the clarity and rigor of our work. We address each major comment below and have revised the manuscript to incorporate the suggested changes.

Point-by-point responses
  1. Referee: [Abstract and §5 (Experiments)] Abstract and experimental results: the claims of 'consistent improvements' and student outperformance of the teacher in domain-expert merging are presented without run counts, error bars, or statistical tests. This is load-bearing for the central empirical claim and must be addressed before the gains can be considered reliable.

    Authors: We agree that statistical details are necessary to substantiate the empirical claims. In the revised manuscript, we have rerun all key experiments across 5 random seeds, reporting means with standard deviations in the tables and adding error bars to the figures. We have also added paired t-test results confirming that the improvements of ExOPD over OPD are statistically significant (p < 0.05) in the math and code merging settings. revision: yes
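A paired comparison across seeds of the kind described in this response can be sketched as follows. The scores are HYPOTHETICAL placeholders, not results from the paper, and the critical value 2.776 is the standard two-sided 5% threshold for a t distribution with 4 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched runs."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# HYPOTHETICAL per-seed accuracies for 5 seeds, for illustration only.
exopd = [51.2, 50.8, 52.1, 51.5, 50.9]
opd = [49.8, 50.1, 50.6, 50.2, 49.5]

t_stat, dof = paired_t_statistic(exopd, opd)
T_CRIT_DF4 = 2.776  # two-sided critical value at alpha = 0.05, df = 4
print(t_stat, dof, abs(t_stat) > T_CRIT_DF4)
```

Pairing by seed removes seed-to-seed variance shared by both methods, which matters when only a handful of runs are available.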

  2. Referee: [§4.2 and §5.3] Reward-correction step in strong-to-weak distillation: the improvement is tied to using the teacher's pre-RL base model as reference. It is unclear whether the domain-expert merging results (where the student surpasses the teacher) also rely on this specific reference choice or employ a different reference; the paper should state the reference model used in each table/figure and quantify the extra overhead.

    Authors: We clarify that domain-expert merging experiments use the original base student model (pre-domain-RL) as the reference, whereas strong-to-weak distillation uses the teacher's pre-RL checkpoint. We have updated all table and figure captions to explicitly state the reference model for each setting. The additional overhead of a non-default reference is one extra forward pass per update step, increasing wall-clock training time by approximately 12% on our hardware setup. revision: yes

  3. Referee: [§3 (Theoretical Analysis)] Theoretical section: the statement that OPD is recovered exactly when the scaling factor equals 1 and the reference is arbitrary is asserted directly. The derivation should explicitly substitute the G-OPD objective (Eq. for the generalized loss) back into the dense KL-constrained RL objective to show the reduction without additional assumptions.

    Authors: We have expanded the theoretical section with an explicit derivation. Starting from the G-OPD objective L = E[λ r(s,a) - KL(π_θ || π_ref)], where λ is the reward scaling factor, we substitute λ = 1 to recover the dense KL-constrained RL objective with equal reward-KL weighting; the reduction holds for any reference model without further assumptions. The revised §3 now includes this step-by-step substitution. revision: yes

Circularity Check

0 steps flagged

No significant circularity; G-OPD and ExOPD are explicit design extensions validated empirically.

Full rationale

The paper first proves OPD is a special case of dense KL-constrained RL (equal reward/KL weights, arbitrary reference). It then defines G-OPD by adding a free reference model and explicit reward scaling factor (set >1 for ExOPD). All performance claims (surpassing teacher in domain-expert merging, gains in strong-to-weak) come from direct experiments on math/code tasks, not from any parameter fitted to the target metric and renamed as prediction. No self-citation chains, uniqueness theorems, or ansatzes reduce the central results to inputs by construction. The scaling factor is presented as a tunable hyperparameter whose effect is measured, not derived tautologically.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on one tunable design choice (reward scaling) and one domain assumption about reference models; no new physical entities or unstated fitted constants are introduced.

free parameters (1)
  • reward scaling factor
    Hyperparameter that controls the relative weight of the reward term versus KL regularization; set greater than 1 to enable extrapolation.
axioms (1)
  • domain assumption: On-policy distillation is a special case of dense KL-constrained RL with equal weighting of reward and regularization and an arbitrary reference model
    Stated as the first theoretical result in the abstract.

pith-pipeline@v0.9.0 · 5638 in / 1285 out tokens · 87056 ms · 2026-05-16T05:06:43.878258+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD

  • Foundation.LawOfExistence nothing_cannot_exist contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    Relation between the paper passage and the cited Recognition theorem.

    OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  3. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  4. Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

    cs.LG 2026-05 unverdicted novelty 7.0

    Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

  5. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  6. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  7. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 conditional novelty 7.0

    Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...

  8. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  9. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  10. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 7.0

    AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...

  11. MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

    cs.CL 2026-05 unverdicted novelty 7.0

    MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

  12. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  13. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

  14. Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

  15. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  16. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  17. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  18. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  19. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  20. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  21. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

  22. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.

  23. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...

  24. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

  25. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  26. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 18 Pith papers · 16 internal anchors

  1. [1]

    AIME 2024

    AI-MO. AIME 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-aime,

  2. [2]

    InternLM2 Technical Report

    URL https://matharena.ai/. Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. InternLM2 technical report. arXiv preprint arXiv:2403.17297,

  3. [3]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456,

  4. [4]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3029–3051,

  5. [5]

    RLHF workflow: From reward modeling to online RLHF

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863, 2024.

  6. [6]

    OpenThoughts: Data Recipes for Reasoning Models

    URL https://openreview.net/forum?id=5h0qf7IBZZ. Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178,

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025a. Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, and Yankai Lin. Learning to focus: Causal attention...

  8. [8]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

  9. [9]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,

  10. [10]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802,

  11. [11]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

  12. [12]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327,

  13. [13]

    ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

    Kun Liang, Clive Bai, Xin Xu, Chenming Tang, Sanwoo Lee, Weijie Liu, Saiyong Yang, and Yunfang Wu. ORBIT: On-policy exploration-exploitation for controllable multi-budget reasoning. arXiv preprint arXiv:2601.08310,

  14. [14]

    Skywork-Reward-V2: Scaling preference data curation via human-AI synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al. Skywork-Reward-V2: Scaling preference data curation via human-AI synergy. arXiv preprint arXiv:2507.01352, 2025a. Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystify- ...

  15. [15]

    Agentic reinforcement learning with implicit step rewards

    URL https://openreview.net/forum?id=1qvx610Cu7. Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, and Jianbin Jiao. Agentic reinforcement learning with implicit step rewards. arXiv preprint arXiv:2509.19199, 2025c. Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism,

  16. [16]

    https://thinkingmachines.ai/blog/on-policy-distillation

    doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. OpenCompass. AIME 2025. https://huggingface.co/datasets/opencompass/AIME2025,

  17. [17]

    Privileged information distillation for language models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942,

  18. [18]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

  19. [19]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  20. [20]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897,

  21. [21]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256,

  22. [22]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7,

  23. [23]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780,

  24. [24]

    Qwen3 Technical Report

    URL https://openreview.net/forum?id=2QdsjiNXgj. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Wenkai Yang, Jingwen Chen, Yankai Lin, and Ji-Rong Wen. DeepCritic: Deliberate critique with large language models. arXiv...

  25. [25]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734,

  26. [26]

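    A Detailed Math Derivations

    Here, we make mathematical derivations to calculate the expected gradients of the OPD objective in Eq. (4). Since

    $$J_{\mathrm{OPD}}(\theta) = \min_{\theta} \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)} \Big[ D_{\mathrm{KL}}\big(\pi_\theta(y|x) \,\|\, \pi^*(y|x)\big) \Big] = \min_{\theta} \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)} \Big[ \log \pi_\theta(y|x) - \log \pi^*(y|x) \Big], \tag{15}$$

    we can get

    $$\nabla_\theta J_{\mathrm{OPD}}(\theta) = \nabla_\theta \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)} \Big[ \log \pi_\theta(y|x) - \log \pi^*(y|x) \Big] = \nabla_\theta \mathbb{E}_{x} \Big[ \sum_{y} \pi_\theta(y|x) \big( \log \pi_\theta(y|x) - \dots$$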