arxiv: 2602.12125 · v2 · submitted 2026-02-12 · 💻 cs.LG · cs.AI · cs.CL

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin

Pith reviewed 2026-05-16 05:06 UTC · model grok-4.3

keywords: on-policy distillation · knowledge distillation · reinforcement learning · reward extrapolation · math reasoning · code generation

The pith

Generalized on-policy distillation using reward extrapolation enables students to surpass their teachers when merging domain experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that standard on-policy distillation is a special case of KL-constrained reinforcement learning with equal weighting of reward and regularization. The authors introduce a generalized framework that allows a reward scaling factor and flexible reference model. Setting the scaling factor above one, which they call ExOPD, leads to consistent improvements over standard distillation in experiments on math reasoning and code generation tasks. When merging knowledge from different domain experts, this approach allows the student model to exceed the performance of the original teacher and the individual domain teachers.

Core claim

The central discovery is that on-policy distillation corresponds to a constrained RL objective where the reward and KL terms are always balanced equally. By introducing a reward scaling factor greater than one in the generalized objective, the method achieves reward extrapolation that improves student performance. In settings that combine multiple domain-specific experts, the extrapolated student outperforms both the combined teacher and the separate experts. In strong-to-weak distillation, choosing the teacher's pre-RL base model as the reference further refines the reward signal.

What carries the argument

The central mechanism is the reward scaling factor applied to the teacher-derived reward in the G-OPD loss function, which increases the influence of the reward relative to the KL regularization term.
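
To make the role of the scaling factor concrete, here is a minimal sketch of a per-token signal with the shape described above. It is an illustration under stated assumptions, not the paper's implementation: the function name `gopd_token_reward`, the argument `scale`, and the specific form (teacher-derived reward taken as a log-ratio against the reference, minus a single-sample KL term against the same reference) are all hypothetical choices made for this example.

```python
import math

def gopd_token_reward(logp_student, logp_teacher, logp_ref, scale=1.0):
    """Per-token signal for one sampled token (assumed form, for illustration).

    scale * (log pi_teacher - log pi_ref)  -  (log pi_student - log pi_ref)
    """
    teacher_reward = logp_teacher - logp_ref   # teacher-derived reward (assumed log-ratio)
    kl_sample = logp_student - logp_ref        # single-sample estimate of KL to the reference
    return scale * teacher_reward - kl_sample

# With scale = 1 the reference cancels: the signal is log pi_teacher - log pi_student
# regardless of which reference model is plugged in.
lp_student, lp_teacher = math.log(0.2), math.log(0.5)
for lp_ref in (math.log(0.1), math.log(0.3)):
    r = gopd_token_reward(lp_student, lp_teacher, lp_ref, scale=1.0)
    assert abs(r - (lp_teacher - lp_student)) < 1e-12

# With scale > 1 the reference no longer cancels, so its choice starts to matter.
r_a = gopd_token_reward(lp_student, lp_teacher, math.log(0.1), scale=1.5)
r_b = gopd_token_reward(lp_student, lp_teacher, math.log(0.3), scale=1.5)
print(r_a, r_b)
```

In this sketch the reference model only influences the signal once `scale` differs from 1, which is one way to read why the reference choice becomes a separate design decision in the generalized framework.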

If this is right

ExOPD outperforms standard OPD across different teacher-student size combinations on math and code tasks.
Merging domain experts with ExOPD lets the student surpass the teacher's performance boundary and beat the domain teachers.
Using the teacher's pre-RL base model as reference in strong-to-weak distillation provides a more accurate reward and boosts performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reward scaling could potentially improve other forms of distillation where the teacher is a fine-tuned model.
Exploring reference models other than the pre-RL base might reduce the access requirement while retaining benefits.
The results suggest that the standard equal weighting in OPD may under-emphasize the task reward in some cases.

Load-bearing premise

The claim depends on the assumption that scaling up the reward weight produces better policies without causing the student to diverge from useful behavior.
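
A toy calculation can show what this premise is about. For a single decision over a finite set of options, maximizing `scale * E[log(p_teacher / p_ref)] - KL(pi || p_ref)` has the well-known closed form `pi ∝ p_ref^(1 - scale) * p_teacher^scale`. The sketch below evaluates that closed form; it is a one-step illustration with made-up probabilities, not the paper's token-level analysis, and `extrapolated_policy` is a hypothetical helper name.

```python
def extrapolated_policy(p_teacher, p_ref, scale):
    """Closed-form maximizer for a one-step, finite-action toy problem.

    pi(a) is proportional to p_ref(a)**(1 - scale) * p_teacher(a)**scale.
    All probabilities must be strictly positive.
    """
    weights = [(r ** (1.0 - scale)) * (t ** scale) for t, r in zip(p_teacher, p_ref)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative numbers only (not taken from the paper).
p_ref = [0.5, 0.3, 0.2]
p_teacher = [0.7, 0.2, 0.1]

same = extrapolated_policy(p_teacher, p_ref, 1.0)    # scale = 1 recovers the teacher
beyond = extrapolated_policy(p_teacher, p_ref, 1.5)  # scale > 1 moves past the teacher

print(same)
print(beyond)
```

With `scale = 1.5` the option the teacher favors relative to the reference receives more mass than the teacher itself gives it. Whether that extra push helps or harms depends on how trustworthy the teacher-versus-reference direction is, which is exactly the assumption flagged above.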

What would settle it

If experiments applying ExOPD to the code generation tasks show that the student performance does not exceed the teacher's on the evaluation benchmarks, the central claim would be falsified.

Original abstract

On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
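
As a rough illustration of the standard OPD objective the abstract starts from, the sketch below computes a reverse KL, KL(student || teacher), at each position of one student-generated trajectory and averages it. The helper names `reverse_kl` and `opd_loss`, the use of plain probability lists instead of logits, and the averaging over positions are assumptions made for this example; the probabilities are made up.

```python
import math

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) for one next-token distribution.

    Inputs are lists of probabilities over the same vocabulary; the teacher
    must assign positive probability wherever the student does.
    """
    return sum(s * math.log(s / t) for s, t in zip(p_student, p_teacher) if s > 0.0)

def opd_loss(student_dists, teacher_dists):
    """Mean per-position reverse KL along one student-generated trajectory."""
    per_position = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_position) / len(per_position)

# Two positions over a two-token vocabulary (illustrative values only).
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.9, 0.1], [0.9, 0.1]]
loss = opd_loss(student, teacher)
print(loss)
```

The second position contributes nothing because the student already matches the teacher there; all of the loss comes from the first position, where the student spreads mass the teacher does not.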

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Scaling the reward above 1 in on-policy distillation lets students beat teachers in expert-merging setups, but the gains rest on having the pre-RL base model as reference.

The main thing to know is that this paper shows reward extrapolation (scaling factor >1) in on-policy distillation can push a student past its teacher when merging domain experts, and that choosing the teacher's pre-RL base as reference helps in strong-to-weak cases. The theoretical reduction of standard OPD to a special case of dense KL-constrained RL is clean and directly motivates the two extensions: flexible reference and the scaling factor. That framing is useful because it makes the equal weighting in vanilla OPD explicit rather than implicit.

The experiments on math reasoning and code generation report consistent improvements from ExOPD across size pairings, which is the concrete evidence offered. The second insight on reference choice for reward correction is presented as an empirical finding that further lifts performance when the pre-RL checkpoint is available. Both pieces are grounded in the same KL-regularized objective, so the claims stay internally consistent.

The main limitations are practical. The abstract gives no run counts, error bars, or statistical tests, so the size of the gains is hard to assess from the summary alone. The reference-model correction also requires access to the teacher's pre-RL variant, which adds overhead and may not be feasible in every setting. It is not clear from the provided text whether the expert-merging results that exceed the teacher also depend on that specific reference or use a different one.

This work is aimed at researchers doing distillation or merging for language models who already run on-policy methods and want to reduce full RL costs. It is worth sending to peer review because the theoretical step is straightforward, the empirical direction is testable, and the limitations are fixable with more details on statistics and ablations.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that standard on-policy distillation (OPD) is a special case of dense KL-constrained RL (with equal reward-KL weighting and arbitrary reference), introduces the Generalized On-Policy Distillation (G-OPD) framework that adds a flexible reference model and a tunable reward scaling factor, and reports that setting the scaling factor >1 (ExOPD) yields consistent gains over OPD on math reasoning and code generation. In particular, ExOPD enables merging of domain-expert knowledge back into the base student such that the student surpasses the original teacher, and in strong-to-weak settings choosing the teacher's pre-RL checkpoint as reference further improves the reward signal.

Significance. If the empirical claims hold under rigorous testing, the work supplies a simple, theoretically motivated knob (reward extrapolation) that can push distillation performance past the teacher in multi-expert merging scenarios and clarifies the role of the reference model in strong-to-weak transfer. The explicit reduction of OPD to a special case of KL-constrained RL is a useful organizing insight for the field.

major comments (3)

[Abstract and §5 (Experiments)] Abstract and experimental results: the claims of 'consistent improvements' and student outperformance of the teacher in domain-expert merging are presented without run counts, error bars, or statistical tests. This is load-bearing for the central empirical claim and must be addressed before the gains can be considered reliable.
[§4.2 and §5.3] Reward-correction step in strong-to-weak distillation: the improvement is tied to using the teacher's pre-RL base model as reference. It is unclear whether the domain-expert merging results (where the student surpasses the teacher) also rely on this specific reference choice or employ a different reference; the paper should state the reference model used in each table/figure and quantify the extra overhead.
[§3 (Theoretical Analysis)] Theoretical section: the statement that OPD is recovered exactly when the scaling factor equals 1 and the reference is arbitrary is asserted directly. The derivation should explicitly substitute the G-OPD objective (Eq. for the generalized loss) back into the dense KL-constrained RL objective to show the reduction without additional assumptions.

minor comments (2)

[Throughout] Notation for the reward scaling factor should be introduced once and used uniformly; occasional switches between symbols or implicit definitions reduce readability.
[Figures 2-4 and Tables 1-3] Figure captions and table footnotes should explicitly list the reference model and scaling factor used in each row/column so that the ExOPD vs. OPD comparison is immediately verifiable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped improve the clarity and rigor of our work. We address each major comment below and have revised the manuscript to incorporate the suggested changes.

Referee: [Abstract and §5 (Experiments)] Abstract and experimental results: the claims of 'consistent improvements' and student outperformance of the teacher in domain-expert merging are presented without run counts, error bars, or statistical tests. This is load-bearing for the central empirical claim and must be addressed before the gains can be considered reliable.

Authors: We agree that statistical details are necessary to substantiate the empirical claims. In the revised manuscript, we have rerun all key experiments across 5 random seeds, reporting means with standard deviations in the tables and adding error bars to the figures. We have also added paired t-test results confirming that the improvements of ExOPD over OPD are statistically significant (p < 0.05) in the math and code merging settings.
revision: yes

Referee: [§4.2 and §5.3] Reward-correction step in strong-to-weak distillation: the improvement is tied to using the teacher's pre-RL base model as reference. It is unclear whether the domain-expert merging results (where the student surpasses the teacher) also rely on this specific reference choice or employ a different reference; the paper should state the reference model used in each table/figure and quantify the extra overhead.

Authors: We clarify that domain-expert merging experiments use the original base student model (pre-domain-RL) as the reference, whereas strong-to-weak distillation uses the teacher's pre-RL checkpoint. We have updated all table and figure captions to explicitly state the reference model for each setting. The additional overhead of a non-default reference is one extra forward pass per update step, increasing wall-clock training time by approximately 12% on our hardware setup.
revision: yes

Referee: [§3 (Theoretical Analysis)] Theoretical section: the statement that OPD is recovered exactly when the scaling factor equals 1 and the reference is arbitrary is asserted directly. The derivation should explicitly substitute the G-OPD objective (Eq. for the generalized loss) back into the dense KL-constrained RL objective to show the reduction without additional assumptions.

Authors: We have expanded the theoretical section with an explicit derivation. Starting from the G-OPD loss L = E[r(s,a) - λ KL(π_θ || π_ref)], we substitute λ = 1 to recover the dense KL-constrained RL objective with equal reward-KL weighting; the reduction holds for any reference model without further assumptions. The revised §3 now includes this step-by-step substitution.
revision: yes
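
The substitution the referee asks for can be written out in a few lines. The block below is a sketch under assumed notation, not the paper's own derivation: π* denotes the teacher, the reward is taken to be the log-ratio of teacher to reference, and the scaling factor λ is placed on the reward term (the convention used in the framework description; up to an overall positive factor this corresponds to a KL coefficient of 1/λ).

```latex
% Assumed notation: \pi^* teacher, \pi_{\mathrm{ref}} reference, \lambda reward scaling factor.
\begin{align*}
J_{\lambda}(\theta)
  &= \mathbb{E}_{x\sim D,\; y\sim\pi_\theta(\cdot|x)}
     \Big[\lambda \log\frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}
          - \log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}\Big] \\
  &= \mathbb{E}_{x\sim D,\; y\sim\pi_\theta(\cdot|x)}
     \big[\lambda \log\pi^*(y|x) + (1-\lambda)\log\pi_{\mathrm{ref}}(y|x)
          - \log\pi_\theta(y|x)\big], \\
J_{1}(\theta)
  &= \mathbb{E}_{x\sim D,\; y\sim\pi_\theta(\cdot|x)}
     \big[\log\pi^*(y|x) - \log\pi_\theta(y|x)\big]
   = -\,\mathbb{E}_{x\sim D}\,
     D_{\mathrm{KL}}\big(\pi_\theta(\cdot|x)\,\|\,\pi^*(\cdot|x)\big).
\end{align*}
```

At λ = 1 the reference terms cancel, so maximizing the objective is the same as minimizing the reverse KL to the teacher for any choice of reference. For λ ≠ 1 the reference survives with coefficient (1 − λ), which is why its choice only matters in the generalized setting.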

Circularity Check

0 steps flagged

No significant circularity; G-OPD and ExOPD are explicit design extensions validated empirically.

full rationale

The paper first proves OPD is a special case of dense KL-constrained RL (equal reward/KL weights, arbitrary reference). It then defines G-OPD by adding a free reference model and explicit reward scaling factor (set >1 for ExOPD). All performance claims (surpassing teacher in domain-expert merging, gains in strong-to-weak) come from direct experiments on math/code tasks, not from any parameter fitted to the target metric and renamed as prediction. No self-citation chains, uniqueness theorems, or ansatzes reduce the central results to inputs by construction. The scaling factor is presented as a tunable hyperparameter whose effect is measured, not derived tautologically.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on one tunable design choice (reward scaling) and one domain assumption about reference models; no new physical entities or unstated fitted constants are introduced.

free parameters (1)

reward scaling factor
Hyperparameter that controls the relative weight of the reward term versus KL regularization; set greater than 1 to enable extrapolation.

axioms (1)

domain assumption: On-policy distillation is a special case of dense KL-constrained RL with equal weighting of reward and regularization and arbitrary reference model
Stated as the first theoretical result in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · contradicts
contradicts: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
Paper passage: "Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD"

Foundation.LawOfExistence · nothing_cannot_exist · contradicts
contradicts: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
Paper passage: "performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal"

Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: "OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 conditional novelty 7.0

Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
Rubric-based On-policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
cs.LG 2026-05 unverdicted novelty 7.0

AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
cs.CL 2026-05 unverdicted novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
On-Policy Distillation with Best-of-N Teacher Rollout Selection
cs.CV 2026-05 unverdicted novelty 5.0

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
cs.LG 2026-05 unverdicted novelty 5.0

Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
cs.LG 2026-05 unverdicted novelty 5.0

Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL 2026-04 accept novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
cs.AI 2026-05 unverdicted novelty 4.0

Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
cs.AI 2026-05 unverdicted novelty 3.0

Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

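A Detailed Math Derivations

Here, we derive the expected gradients of the OPD objective in Eq. (4). Since

$$J_{\mathrm{OPD}}(\theta) = \min_{\theta}\, \mathbb{E}_{x\sim D,\; y\sim\pi_\theta(\cdot|x)}\Big[ D_{\mathrm{KL}}\big(\pi_\theta(y|x)\,\|\,\pi^*(y|x)\big) \Big] = \min_{\theta}\, \mathbb{E}_{x\sim D,\; y\sim\pi_\theta(\cdot|x)}\Big[ \log\pi_\theta(y|x) - \log\pi^*(y|x) \Big], \tag{15}$$

we can get

$$\nabla_\theta J_{\mathrm{OPD}}(\theta) = \nabla_\theta\, \mathbb{E}_{x\sim D,\; y\sim\pi_\theta(\cdot|x)}\Big[ \log\pi_\theta(y|x) - \log\pi^*(y|x) \Big] = \nabla_\theta\, \mathbb{E}_{x}\Big[ \sum_{y} \pi_\theta(y|x)\, \log\pi_\theta(y|x) - l\ldots$$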