Pith · machine review for the scientific record

arxiv: 2604.13010 · v2 · submitted 2026-04-14 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Hai Cai, Song Han, Yecheng Wu

Pith reviewed 2026-05-11 01:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · offline distillation · large language models · post-training · reasoning models · teacher consistency · gradient bias · policy drift

The pith

Enforcing teacher consistency allows offline on-policy distillation to match the optimum and performance of live OPD at 4x training efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that on-policy distillation for large reasoning models can be performed entirely offline by precomputing teacher log-probabilities once over supervised fine-tuning rollouts, provided the identical teacher model is used throughout. This Lightning OPD approach eliminates the infrastructure cost of maintaining a live teacher server while preserving the same training optimum as standard online OPD. The authors prove bounded gradient discrepancy and an implicit regularization effect under this consistency condition, which prevents policy drift. Experiments on math reasoning and code generation tasks confirm comparable final performance with substantially reduced training time, including scaling to mixture-of-experts models on a single 8xH100 node.
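
To make the data flow concrete, below is a minimal sketch of one offline update step in this style. It assumes an HF-style causal LM, a batch of fixed SFT rollouts whose teacher log-probabilities were precomputed once by the same teacher that produced the SFT data, and a simple log-prob-matching penalty as a stand-in for the paper's exact distillation loss; the tensor names and the loss are illustrative, not the released implementation.

```python
# Minimal sketch of an offline (Lightning-OPD-style) update step.
# Assumptions, not taken from the paper: an HF-style causal LM, fixed SFT
# rollouts, teacher log-probs precomputed once by the SAME teacher used for
# SFT, and a log-prob-matching penalty standing in for the exact loss.
import torch
import torch.nn.functional as F

def offline_distill_step(student, batch, optimizer):
    """One gradient step on fixed SFT rollouts; no live teacher call."""
    input_ids = batch["input_ids"]          # (B, T) tokens of an SFT rollout
    loss_mask = batch["loss_mask"]          # (B, T) 1 on response tokens
    teacher_logps = batch["teacher_logps"]  # (B, T) frozen, precomputed offline

    logits = student(input_ids).logits                 # (B, T, V)
    logps = F.log_softmax(logits[:, :-1], dim=-1)      # predicts token t+1
    student_logps = logps.gather(
        -1, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)                                      # (B, T-1)

    mask = loss_mask[:, 1:].float()
    # Stand-in objective: pull per-token student log-probs toward the stored
    # teacher log-probs on the fixed rollouts (placeholder, not the paper's loss).
    gap = student_logps - teacher_logps[:, 1:]
    loss = (gap.pow(2) * mask).sum() / mask.sum().clamp(min=1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is the data flow: teacher_logps are read from disk, so the teacher never has to be resident during training.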

Core claim

Under the teacher consistency condition, Lightning OPD shares the same optimum as standard OPD, exhibits bounded gradient discrepancy, and introduces an implicit regularization effect that helps prevent policy drift. By precomputing log-probabilities with the same teacher used for SFT, the method removes the need for a live teacher server while delivering performance comparable to online distillation on math and code benchmarks at 4.0x higher training efficiency.
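
Schematically (our notation, not the paper's exact formulation), the two regimes differ only in where the expectation is taken: standard OPD samples from the current student, Lightning OPD from the fixed SFT policy.

```latex
% Schematic objectives; D is a per-token divergence, \pi_T the teacher,
% \pi_\theta the student, \pi_{\mathrm{SFT}} the frozen policy that produced the rollouts.
\[
\mathcal{L}_{\mathrm{online}}(\theta)
  = \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}
    \Big[ D\big(\pi_{T}(\cdot \mid x, y) \,\|\, \pi_{\theta}(\cdot \mid x, y)\big) \Big],
\qquad
\mathcal{L}_{\mathrm{offline}}(\theta)
  = \mathbb{E}_{y \sim \pi_{\mathrm{SFT}}(\cdot \mid x)}
    \Big[ D\big(\pi_{T}(\cdot \mid x, y) \,\|\, \pi_{\theta}(\cdot \mid x, y)\big) \Big].
\]
```

Both losses vanish when the student matches the teacher on the support of the respective sampling distribution, which is the sense in which the optima coincide under teacher consistency; the interesting part of the claim is that the gradients stay close while the student drifts away from π_SFT.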

What carries the argument

Teacher consistency, the requirement that the identical teacher model be used for both SFT and OPD to eliminate gradient bias when precomputing log-probabilities for offline training.
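
In pipeline terms, the condition is a provenance check: rollouts must be scored by the same checkpoint that generated the SFT data. A minimal sketch, with field names that are ours rather than the paper's:

```python
# Hypothetical teacher-consistency guard for the one-time precomputation pass.
# The "teacher_id" field and score_fn callable are illustrative assumptions.
import json

def precompute_teacher_logps(rollout_path, teacher_id, score_fn):
    """Attach teacher log-probs to SFT rollouts, refusing to score rollouts
    that were generated by a different teacher."""
    scored = []
    with open(rollout_path) as f:
        for line in f:
            r = json.loads(line)
            if r.get("teacher_id") != teacher_id:
                raise ValueError(
                    f"teacher consistency violated: rollout from "
                    f"{r.get('teacher_id')!r}, scoring with {teacher_id!r}"
                )
            r["teacher_logps"] = score_fn(r["prompt"], r["response"])
            scored.append(r)
    return scored
```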

If this is right

  • Lightning OPD removes the requirement for a live teacher server throughout training.
  • The method achieves accuracy comparable to standard OPD on AIME 2024 and code generation while using roughly a quarter of the wall-clock training time.
  • The implicit regularization from teacher consistency helps stabilize training and reduce policy drift.
  • The approach scales to larger MoE models, enabling 71.0% on AIME 2024 for a 30B model on a single 8xH100 node.
  • Post-training of large reasoning models becomes feasible with far lower infrastructure overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same consistency principle might stabilize other offline reinforcement learning or distillation pipelines that currently suffer from teacher-student mismatch.
  • Academic groups could now replicate advanced post-training results using only modest GPU clusters instead of dedicated inference servers.
  • The bounded discrepancy result suggests Lightning OPD could serve as a drop-in replacement in existing on-policy pipelines with minimal hyperparameter retuning.
  • Extending the precomputation step to include multiple teachers or curriculum schedules might further improve sample efficiency.

Load-bearing premise

That precomputing log-probabilities with the identical teacher used for SFT is sufficient to eliminate gradient bias and produce an offline procedure whose optimum and dynamics match those of live on-policy distillation.

What would settle it

A direct comparison in which Lightning OPD and standard live OPD, started from the same SFT checkpoint and trained on the same data, are evaluated on a held-out reasoning benchmark: final policies whose performance or loss values differ beyond the claimed bounded discrepancy would refute the equivalence, while agreement within the bound would support it.
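
A minimal harness for that comparison might look like the following; every callable is a placeholder supplied by the experimenter, and nothing here comes from the paper's released code.

```python
# Hypothetical harness for the settling experiment: identical SFT checkpoint,
# identical data, then compare held-out results against the claimed bound.
def settling_experiment(sft_ckpt, data, benchmark, claimed_bound,
                        train_live_opd, train_lightning_opd, evaluate):
    policy_live = train_live_opd(sft_ckpt, data)          # needs a live teacher server
    policy_offline = train_lightning_opd(sft_ckpt, data)  # uses precomputed teacher logps
    gap = abs(evaluate(policy_live, benchmark) - evaluate(policy_offline, benchmark))
    return {"gap": gap, "exceeds_claimed_bound": gap > claimed_bound}
```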

Original abstract

On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lightning OPD, an offline on-policy distillation method for post-training large reasoning LLMs. It identifies teacher consistency (using the identical teacher for SFT and distillation) as necessary to avoid gradient bias when precomputing teacher log-probabilities on fixed SFT rollouts. Under this condition, the authors prove that Lightning OPD shares the same optimum as standard live OPD, with bounded gradient discrepancy and an implicit regularization effect against policy drift. Experiments on math reasoning and code generation tasks show performance parity with standard OPD at 4x higher training efficiency, including scaling results for Qwen3-8B and a 30B MoE model on limited hardware.

Significance. If the shared-optimum result and bounded-discrepancy guarantee hold with the stated assumptions, Lightning OPD would meaningfully reduce infrastructure costs for LLM post-training by removing the live teacher server requirement. The released code, the 4x efficiency claim, and the scaling demonstration (Qwen3-30B-A3B reaching 71.0% on AIME 2024 on a single 8xH100 node) are concrete strengths that could broaden access to large-scale reasoning model training in academic settings.

major comments (2)
  1. [§4] §4 (Proof of equivalence): The claim that Lightning OPD shares the same optimum as standard OPD under teacher consistency is plausible because both losses reach their minimum when the student matches the teacher. However, the subsequent statement of 'bounded gradient discrepancy' is load-bearing for the dynamics claim yet lacks an explicit bound expressed in terms of a divergence measure such as KL(π_student || π_SFT) or total variation distance between the fixed SFT distribution and the evolving student distribution. Without this, it remains unclear whether the bound stays small as training proceeds and the student policy drifts.
  2. [§5.3] §5.3 (Experimental setup and ablations): The reported performance parity (e.g., 69.9% on AIME 2024) and 4.0x efficiency are encouraging, but the evaluation uses relatively short training horizons. The paper should add an ablation that tracks policy divergence (e.g., KL or win-rate against the SFT policy) over longer training to test whether the claimed implicit regularization actually prevents accumulating mismatch, as the offline expectation is taken over a stale distribution while standard OPD samples from the current student.
minor comments (2)
  1. [§3] The definition of teacher consistency is introduced in the abstract and §3 but would benefit from a formal statement (e.g., an equation specifying that the teacher used for precomputing log p_teacher is identical to the one used in the preceding SFT stage) before the proof.
  2. [Table 1] Table 1 and Figure 2: clarify whether the efficiency numbers include the one-time cost of precomputing teacher log-probabilities or only the subsequent training loop; also state the exact hardware configuration used for the standard OPD baseline.
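
The accounting question raised in the second minor comment is easy to make precise. Under the (hypothetical) convention that the one-time precomputation is charged to the offline method, the honest speedup is the ratio below; all inputs are wall-clock hours measured by the experimenter, and no numbers here come from the paper.

```python
# Hypothetical accounting for the 4x efficiency claim: charge the one-time
# teacher log-prob precomputation to Lightning OPD.
def effective_speedup(t_online_train, t_offline_train, t_precompute):
    """Speedup of Lightning OPD over live OPD, including precompute cost."""
    return t_online_train / (t_offline_train + t_precompute)
```

Reporting both this effective ratio and the loop-only ratio t_online_train / t_offline_train would resolve the ambiguity flagged in the comment.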

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential infrastructure benefits of Lightning OPD. We address each major comment below, providing clarifications from the manuscript and outlining targeted revisions that strengthen the theoretical and empirical claims without altering the core contributions.

Point-by-point responses
  1. Referee: [§4] §4 (Proof of equivalence): The claim that Lightning OPD shares the same optimum as standard OPD under teacher consistency is plausible because both losses reach their minimum when the student matches the teacher. However, the subsequent statement of 'bounded gradient discrepancy' is load-bearing for the dynamics claim yet lacks an explicit bound expressed in terms of a divergence measure such as KL(π_student || π_SFT) or total variation distance between the fixed SFT distribution and the evolving student distribution. Without this, it remains unclear whether the bound stays small as training proceeds and the student policy drifts.

    Authors: We agree that an explicit bound expressed via a standard divergence would improve clarity. The manuscript already shows that, under teacher consistency, the gradient of Lightning OPD differs from standard OPD by a term whose magnitude is controlled by the total variation distance between the fixed SFT rollout distribution and the current student distribution (multiplied by a constant depending on the teacher’s maximum log-probability gap). We will add a corollary in the revised §4 that invokes Pinsker’s inequality to restate this bound directly in terms of KL(π_student || π_SFT). We will also include a short discussion of how the implicit regularization term derived in the proof keeps the KL term from growing unboundedly, thereby addressing the concern about drift during extended training. revision: yes

  2. Referee: [§5.3] §5.3 (Experimental setup and ablations): The reported performance parity (e.g., 69.9% on AIME 2024) and 4.0x efficiency are encouraging, but the evaluation uses relatively short training horizons. The paper should add an ablation that tracks policy divergence (e.g., KL or win-rate against the SFT policy) over longer training to test whether the claimed implicit regularization actually prevents accumulating mismatch, as the offline expectation is taken over a stale distribution while standard OPD samples from the current student.

    Authors: We concur that longer-horizon tracking would provide stronger empirical support for the regularization claim. Our current results already demonstrate parity at the reported compute budgets, but we will extend the training runs for both Lightning OPD and standard OPD on the math reasoning tasks and add new figures in §5.3 that plot (i) KL(π_student || π_SFT) and (ii) win-rate against the SFT policy as functions of training steps. These ablations will directly compare the divergence trajectories and confirm whether the offline formulation maintains lower drift than would be expected from a stale distribution. revision: yes
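
The corollary promised in response 1 can be stated schematically; the notation is ours and the constant is left abstract, so this is a sketch of the claimed shape, not the authors' result.

```latex
% Schematic form of the promised bound: the gradient gap between offline and
% online OPD is controlled by total variation and, via Pinsker's inequality,
% by the KL divergence the referee asks for. C depends on the teacher's
% maximum per-token log-probability gap.
\[
\big\| \nabla_{\theta}\mathcal{L}_{\mathrm{offline}}(\theta)
      - \nabla_{\theta}\mathcal{L}_{\mathrm{online}}(\theta) \big\|
  \;\le\; C \cdot \mathrm{TV}\!\big(\pi_{\mathrm{SFT}},\, \pi_{\theta}\big)
  \;\le\; C \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\pi_{\theta} \,\|\, \pi_{\mathrm{SFT}}\big)}.
\]
```

The ablation promised in response 2, tracking KL(π_student || π_SFT) over training, could be estimated as below; the sampling and scoring callables are placeholders, not the authors' code.

```python
# Hypothetical drift tracker: Monte-Carlo estimate of KL(student || SFT) by
# sampling from the current student and scoring the samples under both
# policies. Log the value at each checkpoint and plot it against steps.
import torch

@torch.no_grad()
def estimate_drift(student, sft_policy, prompts, sample_fn, seq_logprob_fn):
    kls = []
    for p in prompts:
        y = sample_fn(student, p)                        # rollout from current student
        kls.append(seq_logprob_fn(student, p, y)
                   - seq_logprob_fn(sft_policy, p, y))   # log pi_student - log pi_SFT
    return sum(kls) / len(kls)
```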

Circularity Check

0 steps flagged

No circularity: the proof of the shared optimum under teacher consistency is presented as an independent mathematical result.

full rationale

The paper introduces teacher consistency as an explicit condition (same teacher for SFT and OPD) and claims a separate proof that Lightning OPD then shares the optimum with standard OPD, plus bounded gradient discrepancy. No equations are supplied in the abstract or description that reduce the claimed optimum or bound to a fitted quantity, a self-referential definition, or a prior self-citation chain. The derivation is therefore treated as self-contained external reasoning rather than a renaming or construction that forces the result by the inputs alone. Experiments are reported separately and do not substitute for the proof.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that teacher consistency eliminates gradient bias and on the unverified mathematical proof that the offline procedure shares the same optimum as live OPD.

axioms (1)
  • domain assumption Teacher consistency: the identical teacher model must be used for both the initial SFT stage and the precomputation of log-probabilities for OPD.
    Cited as the root cause of failure when naively performing offline OPD and as the condition under which equivalence holds.

pith-pipeline@v0.9.0 · 5613 in / 1476 out tokens · 49025 ms · 2026-05-11T01:00:37.728795+00:00 · methodology

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rubric-based On-policy Distillation

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  2. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  3. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  4. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Reference graph

Works this paper leans on

66 extracted references · 47 canonical work pages · cited by 4 Pith papers · 27 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  2. [2]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  3. [3]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  4. [4]

    Nvidia nemotron 3: Efficient and open intelligence, 2025

    NVIDIA. Nvidia nemotron 3: Efficient and open intelligence.arXiv preprint arXiv:2512.20856, 2025

  5. [5]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

  6. [6]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, volume 35, 2022

  7. [7]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  8. [8]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024

  9. [9]

    On-policy distillation.Thinking Machines Lab: Connectionism, 2025

    Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

  10. [10]

    Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

  11. [11]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  12. [12]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  13. [13]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  14. [14]

    Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation

    Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation. 2026

  15. [15]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  17. [17]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  18. [18]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  19. [19]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025

  20. [20]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025

  21. [21]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, 2022

  22. [22]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. REINFORCE++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

  23. [23]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs.arXiv preprint arXiv:2402.14740, 2024

  24. [24]

    Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505, 2023

    Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505, 2023

  25. [25]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  26. [26]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  27. [27]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, Aonian Shan, Bangwei Gong, Bo Yang, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  28. [28]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456, 2025

  29. [29]

    What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind PPO’s collapse in long-CoT? value optimization holds the secret.arXiv preprint arXiv:2503.01491, 2025

  30. [30]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118, 2025

  31. [31]

    VinePPO: Refining Credit Assignment in RL Training of LLMs

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs.arXiv preprint arXiv:2410.01679, 2024

  32. [32]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chengqi Zhao, Chenggang Deng, Chengpeng Zhang, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

  33. [33]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, et al. Kimi k1.5: Scaling reinforcement learning with LLMs.arXiv preprint arXiv:2501.12599, 2025

  34. [34]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025

  35. [35]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

  36. [36]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance.arXiv preprint arXiv:2504.14945, 2025

  37. [37]

    Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.arXiv preprint arXiv:2506.07527, 2025

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.arXiv preprint arXiv:2506.07527, 2026

  38. [38]

    On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting.arXiv preprint arXiv:2508.11408, 2026

    Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting.arXiv preprint arXiv:2508.11408, 2025

  39. [39]

    Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

    Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative SFT and RL for LLM reasoning.arXiv preprint arXiv:2509.06948, 2025

  40. [40]

    UFT: Unifying supervised and reinforcement fine-tuning

    Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar. UFT: Unifying supervised and reinforcement fine-tuning. arXiv preprint arXiv:2505.16984, 2025

  41. [41]

    Blending supervised and reinforcement fine-tuning with prefix sampling.arXiv preprint arXiv:2507.01679,

    Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, and Ivan Titov. Blending supervised and reinforcement fine-tuning with prefix sampling.arXiv preprint arXiv:2507.01679, 2025

  42. [42]

    Distilling the knowledge in a neural network.NIPS Deep Learning Workshop, 2015

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.NIPS Deep Learning Workshop, 2015

  43. [43]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016

  44. [44]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  45. [45]

    Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643,

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643, 2025

  46. [46]

    ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

    Kun Liang, Clive Bai, Xin Xu, Chenming Tang, Sanwoo Lee, Weijie Liu, Saiyong Yang, and Yunfang Wu. Orbit: On-policy exploration-exploitation for controllable multi-budget reasoning. arXiv preprint arXiv:2601.08310, 2026

  47. [47]

    Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

  48. [48]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

  49. [49]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992

  50. [50]

    Conservative Q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. InAdvances in Neural Information Processing Systems, volume 33, 2020

  51. [51]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022

  52. [52]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019

  53. [53]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, 2023

  54. [54]

    Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

  55. [55]

    RAFT: Reward ranked finetuning for generative foundation model alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023

  56. [56]

    Llms can learn to reason via off-policy rl.arXiv preprint arXiv:2602.19362, 2026

    Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, and Wen Sun. LLMs can learn to reason via off-policy RL.arXiv preprint arXiv:2602.19362, 2026

  57. [57]

    PCL-Reasoner-V1.5: Advancing math reasoning with offline reinforcement learning.arXiv preprint arXiv:2601.14716, 2026

    Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, and Yonghong Tian. PCL-Reasoner-V1.5: Advancing math reasoning with offline reinforcement learning.arXiv preprint arXiv:2601.14716, 2026

  58. [58]

    Encompassing diversity and complexity in code generation

    Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, and Scarlett Li. Encompassing diversity and complexity in code generation. arXiv preprint arXiv:2501.04694, 2025

  59. [59]

    AIME 2024.https://huggingface.co/datasets/AI-MO/aimo-validation-aime, 2024

    AI-MO. AIME 2024.https://huggingface.co/datasets/AI-MO/aimo-validation-aime, 2024

  60. [60]

    AIME 2025.https://github.com/open-compass/opencompass, 2025

    OpenCompass. AIME 2025.https://github.com/open-compass/opencompass, 2025

  61. [61]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions.arXiv preprint arXiv:2505.23281, 2025

  62. [62]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

  63. [63]

    LlamaFactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (System Demonstrations), 2024

  64. [64]

    slime: An SGLang-native post-training framework for RL scaling.https://github.com/THUDM/slime, 2025

    THUDM. slime: An SGLang-native post-training framework for RL scaling.https://github.com/THUDM/slime, 2025

  65. [65]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  66. [66]

    Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models.arXiv preprint arXiv:2512.13607, 2025

    Yang Chen, Zhuolin Yang, Zihan Liu, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models. arXiv preprint arXiv:2512.13607, 2025