Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Pith reviewed 2026-05-11 01:00 UTC · model grok-4.3
The pith
Enforcing teacher consistency lets offline on-policy distillation match the optimum and performance of live OPD at 4x training efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the teacher consistency condition, Lightning OPD shares the same optimum as standard OPD, exhibits bounded gradient discrepancy, and introduces an implicit regularization effect that helps prevent policy drift. By precomputing log-probabilities with the same teacher used for SFT, the method removes the need for a live teacher server while delivering performance comparable to online distillation on math and code benchmarks at 4.0x higher training efficiency.
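As a concrete reading of this pipeline, here is a minimal PyTorch-style sketch. The per-token loss on a cached top-k support, the HF-style model interface, and every name in it are assumptions for illustration, not the paper's released implementation:

```python
# Hypothetical sketch of an offline OPD pipeline in the Lightning OPD style.
# Assumptions: HF-style models returning .logits, a per-token loss computed
# against cached top-k teacher log-probs; all names are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_teacher_logprobs(teacher, sft_rollouts, k=64):
    """One-time pass: cache top-k teacher log-probs at every token position
    of every SFT rollout, after which the teacher server can be retired."""
    cache = []
    for input_ids in sft_rollouts:                  # (seq_len,) token ids
        logits = teacher(input_ids.unsqueeze(0)).logits[0]
        logp = F.log_softmax(logits, dim=-1)        # (seq_len, vocab)
        top_logp, top_ids = logp.topk(k, dim=-1)    # cached support
        cache.append((input_ids, top_ids, top_logp))
    return cache

def lightning_opd_step(student, batch, optimizer):
    """Offline distillation step: fit the student to the cached teacher
    distribution on fixed SFT rollouts -- no live teacher is queried."""
    input_ids, top_ids, top_logp = batch
    logits = student(input_ids.unsqueeze(0)).logits[0]
    student_logp = F.log_softmax(logits, dim=-1)
    # Student log-probs gathered on the teacher's cached top-k support.
    s = student_logp.gather(-1, top_ids)
    # Forward-KL-style surrogate on the cached support (an assumption):
    # sum_v p_T(v) * (log p_T(v) - log p_S(v)).
    loss = (top_logp.exp() * (top_logp - s)).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The structural point of the sketch is the control flow: the teacher is queried exactly once, during the precompute pass, after which training needs no teacher server at all.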
What carries the argument
Teacher consistency: the requirement that the identical teacher model be used for both SFT and OPD, which eliminates the gradient bias that otherwise arises when log-probabilities are precomputed for offline training.
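A hedged formalization of the condition (the notation is assumed here, since the paper's own symbols are not reproduced on this page):

```latex
% Teacher consistency, one plausible formalization (notation assumed).
% T_SFT: teacher that generated/supervised the SFT rollouts.
% T_OPD: teacher whose log-probabilities are precomputed for OPD.
\[
  \textbf{Teacher consistency:}\qquad
  \pi_{T_{\mathrm{SFT}}} \;=\; \pi_{T_{\mathrm{OPD}}} \;=\; \pi_T ,
\]
\[
  \text{so that every cached score satisfies}\quad
  \log \pi_T\!\left(y_t \mid x,\, y_{<t}\right)
  = \log \pi_{T_{\mathrm{SFT}}}\!\left(y_t \mid x,\, y_{<t}\right)
  \quad \forall\, (x, y) \in \mathcal{D}_{\mathrm{SFT}} .
\]
```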
If this is right
- Lightning OPD removes the requirement for a live teacher server throughout training.
- The method matches standard OPD accuracy on AIME 2024 and code generation in roughly a quarter of the wall-clock training time.
- The implicit regularization from teacher consistency helps stabilize training and reduce policy drift.
- The approach scales to larger MoE models, enabling 71.0% on AIME 2024 for a 30B model on a single 8xH100 node.
- Post-training of large reasoning models becomes feasible with far lower infrastructure overhead.
Where Pith is reading between the lines
- The same consistency principle might stabilize other offline reinforcement learning or distillation pipelines that currently suffer from teacher-student mismatch.
- Academic groups could now replicate advanced post-training results using only modest GPU clusters instead of dedicated inference servers.
- The bounded discrepancy result suggests Lightning OPD could serve as a drop-in replacement in existing on-policy pipelines with minimal hyperparameter retuning.
- Extending the precomputation step to include multiple teachers or curriculum schedules might further improve sample efficiency.
Load-bearing premise
That precomputing log-probabilities with the identical teacher used for SFT is sufficient to eliminate gradient bias and produce an offline procedure whose optimum and dynamics match those of live on-policy distillation.
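To make the premise concrete, the two objectives can be written side by side. The per-token reverse-KL form below is an assumption (the paper's exact loss is not quoted on this page); the only difference between them is the distribution the rollouts are drawn from:

```latex
% Schematic objectives (reverse-KL form assumed). Standard OPD samples
% rollouts from the current student; Lightning OPD reuses fixed SFT rollouts.
\[
  \mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \textstyle\sum_t
      \mathrm{KL}\big( \pi_\theta(\cdot \mid x, y_{<t})
      \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \big) \Big],
\]
\[
  \mathcal{L}_{\mathrm{Lightning}}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_{\mathrm{SFT}}(\cdot \mid x)}
    \Big[ \textstyle\sum_t
      \mathrm{KL}\big( \pi_\theta(\cdot \mid x, y_{<t})
      \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \big) \Big].
\]
% Each integrand vanishes iff pi_theta = pi_T at that context, so both
% objectives are minimized by the teacher on the support of their sampling
% distribution; teacher consistency is what aligns those supports.
```

If the SFT and OPD teachers differ, the cached log-probabilities no longer score the distribution the rollouts were trained toward, which is one way the gradient bias the paper describes can enter.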
What would settle it
A direct comparison in which Lightning OPD and standard live OPD, started from the same SFT checkpoint and trained on the same data, reach final policies whose performance or loss values differ by more than the claimed bounded discrepancy on a held-out reasoning benchmark.
Original abstract
On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lightning OPD, an offline on-policy distillation method for post-training large reasoning LLMs. It identifies teacher consistency (using the identical teacher for SFT and distillation) as necessary to avoid gradient bias when precomputing teacher log-probabilities on fixed SFT rollouts. Under this condition, the authors prove that Lightning OPD shares the same optimum as standard live OPD, with bounded gradient discrepancy and an implicit regularization effect against policy drift. Experiments on math reasoning and code generation tasks show performance parity with standard OPD at 4x higher training efficiency, including scaling results for Qwen3-8B and a 30B MoE model on limited hardware.
Significance. If the shared-optimum result and bounded-discrepancy guarantee hold with the stated assumptions, Lightning OPD would meaningfully reduce infrastructure costs for LLM post-training by removing the live teacher server requirement. The released code, the 4x efficiency claim, and the scaling demonstration (Qwen3-30B-A3B reaching 71.0% on AIME 2024 on a single 8xH100 node) are concrete strengths that could broaden access to large-scale reasoning model training in academic settings.
Major comments (2)
- §4 (Proof of equivalence): The claim that Lightning OPD shares the same optimum as standard OPD under teacher consistency is plausible because both losses reach their minimum when the student matches the teacher. However, the subsequent statement of 'bounded gradient discrepancy' is load-bearing for the dynamics claim yet lacks an explicit bound expressed in terms of a divergence measure such as KL(π_student || π_SFT) or the total variation distance between the fixed SFT distribution and the evolving student distribution. Without this, it remains unclear whether the bound stays small as training proceeds and the student policy drifts.
- §5.3 (Experimental setup and ablations): The reported performance parity (e.g., 69.9% on AIME 2024) and 4.0x efficiency are encouraging, but the evaluation uses relatively short training horizons. The paper should add an ablation that tracks policy divergence (e.g., KL or win rate against the SFT policy) over longer training to test whether the claimed implicit regularization actually prevents accumulating mismatch, since the offline expectation is taken over a stale distribution while standard OPD samples from the current student.
Minor comments (2)
- §3: The definition of teacher consistency is introduced in the abstract and §3 but would benefit from a formal statement (e.g., an equation specifying that the teacher used for precomputing log p_teacher is identical to the one used in the preceding SFT stage) before the proof.
- Table 1 and Figure 2: clarify whether the efficiency numbers include the one-time cost of precomputing teacher log-probabilities or only the subsequent training loop; also state the exact hardware configuration used for the standard OPD baseline.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential infrastructure benefits of Lightning OPD. We address each major comment below, providing clarifications from the manuscript and outlining targeted revisions that strengthen the theoretical and empirical claims without altering the core contributions.
Point-by-point responses
- Re: major comment 1 (§4, explicit bound for the gradient discrepancy).
Authors: We agree that an explicit bound expressed via a standard divergence would improve clarity. The manuscript already shows that, under teacher consistency, the gradient of Lightning OPD differs from that of standard OPD by a term whose magnitude is controlled by the total variation distance between the fixed SFT rollout distribution and the current student distribution (multiplied by a constant depending on the teacher's maximum log-probability gap). We will add a corollary in the revised §4 that invokes Pinsker's inequality to restate this bound directly in terms of KL(π_student || π_SFT); a schematic version is sketched after these responses. We will also include a short discussion of how the implicit regularization term derived in the proof keeps the KL term from growing unboundedly, thereby addressing the concern about drift during extended training. Revision: yes.
- Re: major comment 2 (§5.3, longer-horizon divergence tracking).
Authors: We concur that longer-horizon tracking would provide stronger empirical support for the regularization claim. Our current results already demonstrate parity at the reported compute budgets, but we will extend the training runs for both Lightning OPD and standard OPD on the math reasoning tasks and add new figures in §5.3 that plot (i) KL(π_student || π_SFT) and (ii) win rate against the SFT policy as functions of training steps. These ablations will directly compare the divergence trajectories and confirm whether the offline formulation maintains lower drift than would be expected from a stale distribution. Revision: yes.
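The corollary promised in the first response can be sketched in one line. The constant C and the choice of norm are assumptions, but the shape follows the stated TV-then-Pinsker argument:

```latex
% Sketch of the promised bound (C and the norm are assumed).
\[
  \big\| \nabla_\theta \mathcal{L}_{\mathrm{Lightning}}(\theta)
       - \nabla_\theta \mathcal{L}_{\mathrm{OPD}}(\theta) \big\|
  \;\le\; C \cdot \mathrm{TV}\!\big( \pi_\theta,\, \pi_{\mathrm{SFT}} \big)
  \;\le\; C \sqrt{ \tfrac{1}{2}\,
    \mathrm{KL}\!\big( \pi_\theta \,\big\|\, \pi_{\mathrm{SFT}} \big) } ,
\]
% where C scales with the teacher's maximum per-token log-probability gap
% (per the rebuttal) and the last step is Pinsker's inequality.
```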
Circularity Check
No circularity: the proof of the shared optimum under teacher consistency is presented as an independent mathematical result.
Full rationale
The paper introduces teacher consistency as an explicit condition (same teacher for SFT and OPD) and claims a separate proof that Lightning OPD then shares the optimum with standard OPD, plus bounded gradient discrepancy. No equations are supplied in the abstract or description that reduce the claimed optimum or bound to a fitted quantity, a self-referential definition, or a prior self-citation chain. The derivation is therefore treated as self-contained external reasoning rather than a renaming or construction that forces the result by the inputs alone. Experiments are reported separately and do not substitute for the proof.
Axiom & Free-Parameter Ledger
Axioms (1)
- Teacher consistency (domain assumption): the identical teacher model must be used for both the initial SFT stage and the precomputation of log-probabilities for OPD.
Forward citations
Cited by 4 Pith papers
- Rubric-based On-policy Distillation. Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
- SOD: Step-wise On-policy Distillation for Small Language Model Agents. SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation. SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
- Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe. Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.