Pith · machine review for the scientific record

arxiv: 2604.13010 · v2 · submitted 2026-04-14 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Hai Cai, Song Han, Yecheng Wu

Pith reviewed 2026-05-11 01:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · offline distillation · large language models · post-training · reasoning models · teacher consistency · gradient bias · policy drift

The pith

Enforcing teacher consistency allows offline on-policy distillation to match the optimum and performance of live OPD at 4x training efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that on-policy distillation for large reasoning models can be performed entirely offline by precomputing teacher log-probabilities once over supervised fine-tuning rollouts, provided the identical teacher model is used throughout. This Lightning OPD approach eliminates the infrastructure cost of maintaining a live teacher server while preserving the same training optimum as standard online OPD. The authors prove bounded gradient discrepancy and an implicit regularization effect under this consistency condition, which prevents policy drift. Experiments on math reasoning and code generation tasks confirm comparable final performance with substantially reduced training time, including scaling to mixture-of-experts models on a single 8xH100 node.
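
To make the data flow concrete, below is a minimal sketch of one offline update step in this style. It assumes an HF-style causal LM, a batch of fixed SFT rollouts whose teacher log-probabilities were precomputed once by the same teacher that produced the SFT data, and a simple log-prob-matching penalty as a stand-in for the paper's exact distillation loss; the tensor names and the loss are illustrative, not the released implementation.

```python
# Minimal sketch of an offline (Lightning-OPD-style) update step.
# Assumptions, not taken from the paper: an HF-style causal LM, fixed SFT
# rollouts, teacher log-probs precomputed once by the SAME teacher used for
# SFT, and a log-prob-matching penalty standing in for the exact loss.
import torch
import torch.nn.functional as F

def offline_distill_step(student, batch, optimizer):
    """One gradient step on fixed SFT rollouts; no live teacher call."""
    input_ids = batch["input_ids"]          # (B, T) tokens of an SFT rollout
    loss_mask = batch["loss_mask"]          # (B, T) 1 on response tokens
    teacher_logps = batch["teacher_logps"]  # (B, T) frozen, precomputed offline

    logits = student(input_ids).logits                 # (B, T, V)
    logps = F.log_softmax(logits[:, :-1], dim=-1)      # predicts token t+1
    student_logps = logps.gather(
        -1, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)                                      # (B, T-1)

    mask = loss_mask[:, 1:].float()
    # Stand-in objective: pull per-token student log-probs toward the stored
    # teacher log-probs on the fixed rollouts (placeholder, not the paper's loss).
    gap = student_logps - teacher_logps[:, 1:]
    loss = (gap.pow(2) * mask).sum() / mask.sum().clamp(min=1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is the data flow: teacher_logps are read from disk, so the teacher never has to be resident during training.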

Core claim

Under the teacher consistency condition, Lightning OPD shares the same optimum as standard OPD, exhibits bounded gradient discrepancy, and introduces an implicit regularization effect that helps prevent policy drift. By precomputing log-probabilities with the same teacher used for SFT, the method removes the need for a live teacher server while delivering performance comparable to online distillation on math and code benchmarks at 4.0x higher training efficiency.
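
Schematically (our notation, not the paper's exact formulation), the two regimes differ only in where the expectation is taken: standard OPD samples from the current student, Lightning OPD from the fixed SFT policy.

```latex
% Schematic objectives; D is a per-token divergence, \pi_T the teacher,
% \pi_\theta the student, \pi_{\mathrm{SFT}} the frozen policy that produced the rollouts.
\[
\mathcal{L}_{\mathrm{online}}(\theta)
  = \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}
    \Big[ D\big(\pi_{T}(\cdot \mid x, y) \,\|\, \pi_{\theta}(\cdot \mid x, y)\big) \Big],
\qquad
\mathcal{L}_{\mathrm{offline}}(\theta)
  = \mathbb{E}_{y \sim \pi_{\mathrm{SFT}}(\cdot \mid x)}
    \Big[ D\big(\pi_{T}(\cdot \mid x, y) \,\|\, \pi_{\theta}(\cdot \mid x, y)\big) \Big].
\]
```

Both losses vanish when the student matches the teacher on the support of the respective sampling distribution, which is the sense in which the optima coincide under teacher consistency; the interesting part of the claim is that the gradients stay close while the student drifts away from π_SFT.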

What carries the argument

Teacher consistency, the requirement that the identical teacher model be used for both SFT and OPD to eliminate gradient bias when precomputing log-probabilities for offline training.
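
In pipeline terms, the condition is a provenance check: rollouts must be scored by the same checkpoint that generated the SFT data. A minimal sketch, with field names that are ours rather than the paper's:

```python
# Hypothetical teacher-consistency guard for the one-time precomputation pass.
# The "teacher_id" field and score_fn callable are illustrative assumptions.
import json

def precompute_teacher_logps(rollout_path, teacher_id, score_fn):
    """Attach teacher log-probs to SFT rollouts, refusing to score rollouts
    that were generated by a different teacher."""
    scored = []
    with open(rollout_path) as f:
        for line in f:
            r = json.loads(line)
            if r.get("teacher_id") != teacher_id:
                raise ValueError(
                    f"teacher consistency violated: rollout from "
                    f"{r.get('teacher_id')!r}, scoring with {teacher_id!r}"
                )
            r["teacher_logps"] = score_fn(r["prompt"], r["response"])
            scored.append(r)
    return scored
```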

If this is right

  • Lightning OPD removes the requirement for a live teacher server throughout training.
  • The method achieves accuracy comparable to standard OPD on AIME 2024 and code generation while using roughly a quarter of the wall-clock training time.
  • The implicit regularization from teacher consistency helps stabilize training and reduce policy drift.
  • The approach scales to larger MoE models, enabling 71.0% on AIME 2024 for a 30B model on a single 8xH100 node.
  • Post-training of large reasoning models becomes feasible with far lower infrastructure overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same consistency principle might stabilize other offline reinforcement learning or distillation pipelines that currently suffer from teacher-student mismatch.
  • Academic groups could now replicate advanced post-training results using only modest GPU clusters instead of dedicated inference servers.
  • The bounded discrepancy result suggests Lightning OPD could serve as a drop-in replacement in existing on-policy pipelines with minimal hyperparameter retuning.
  • Extending the precomputation step to include multiple teachers or curriculum schedules might further improve sample efficiency.

Load-bearing premise

That precomputing log-probabilities with the identical teacher used for SFT is sufficient to eliminate gradient bias and produce an offline procedure whose optimum and dynamics match those of live on-policy distillation.

What would settle it

A direct comparison in which Lightning OPD and standard live OPD, started from the same SFT checkpoint and trained on the same data, are evaluated on a held-out reasoning benchmark: final policies whose performance or loss values differ beyond the claimed bounded discrepancy would refute the equivalence, while agreement within the bound would support it.
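
A minimal harness for that comparison might look like the following; every callable is a placeholder supplied by the experimenter, and nothing here comes from the paper's released code.

```python
# Hypothetical harness for the settling experiment: identical SFT checkpoint,
# identical data, then compare held-out results against the claimed bound.
def settling_experiment(sft_ckpt, data, benchmark, claimed_bound,
                        train_live_opd, train_lightning_opd, evaluate):
    policy_live = train_live_opd(sft_ckpt, data)          # needs a live teacher server
    policy_offline = train_lightning_opd(sft_ckpt, data)  # uses precomputed teacher logps
    gap = abs(evaluate(policy_live, benchmark) - evaluate(policy_offline, benchmark))
    return {"gap": gap, "exceeds_claimed_bound": gap > claimed_bound}
```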

Original abstract

On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lightning OPD, an offline on-policy distillation method for post-training large reasoning LLMs. It identifies teacher consistency (using the identical teacher for SFT and distillation) as necessary to avoid gradient bias when precomputing teacher log-probabilities on fixed SFT rollouts. Under this condition, the authors prove that Lightning OPD shares the same optimum as standard live OPD, with bounded gradient discrepancy and an implicit regularization effect against policy drift. Experiments on math reasoning and code generation tasks show performance parity with standard OPD at 4x higher training efficiency, including scaling results for Qwen3-8B and a 30B MoE model on limited hardware.

Significance. If the shared-optimum result and bounded-discrepancy guarantee hold with the stated assumptions, Lightning OPD would meaningfully reduce infrastructure costs for LLM post-training by removing the live teacher server requirement. The released code, the 4x efficiency claim, and the scaling demonstration (Qwen3-30B-A3B reaching 71.0% on AIME 2024 on a single 8xH100 node) are concrete strengths that could broaden access to large-scale reasoning model training in academic settings.

major comments (2)
  1. [§4] §4 (Proof of equivalence): The claim that Lightning OPD shares the same optimum as standard OPD under teacher consistency is plausible because both losses reach their minimum when the student matches the teacher. However, the subsequent statement of 'bounded gradient discrepancy' is load-bearing for the dynamics claim yet lacks an explicit bound expressed in terms of a divergence measure such as KL(π_student || π_SFT) or total variation distance between the fixed SFT distribution and the evolving student distribution. Without this, it remains unclear whether the bound stays small as training proceeds and the student policy drifts.
  2. [§5.3] §5.3 (Experimental setup and ablations): The reported performance parity (e.g., 69.9% on AIME 2024) and 4.0x efficiency are encouraging, but the evaluation uses relatively short training horizons. The paper should add an ablation that tracks policy divergence (e.g., KL or win-rate against the SFT policy) over longer training to test whether the claimed implicit regularization actually prevents accumulating mismatch, as the offline expectation is taken over a stale distribution while standard OPD samples from the current student.
minor comments (2)
  1. [§3] The definition of teacher consistency is introduced in the abstract and §3 but would benefit from a formal statement (e.g., an equation specifying that the teacher used for precomputing log p_teacher is identical to the one used in the preceding SFT stage) before the proof.
  2. [Table 1] Table 1 and Figure 2: clarify whether the efficiency numbers include the one-time cost of precomputing teacher log-probabilities or only the subsequent training loop; also state the exact hardware configuration used for the standard OPD baseline.
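
The accounting question raised in the second minor comment is easy to make precise. Under the (hypothetical) convention that the one-time precomputation is charged to the offline method, the honest speedup is the ratio below; all inputs are wall-clock hours measured by the experimenter, and no numbers here come from the paper.

```python
# Hypothetical accounting for the 4x efficiency claim: charge the one-time
# teacher log-prob precomputation to Lightning OPD.
def effective_speedup(t_online_train, t_offline_train, t_precompute):
    """Speedup of Lightning OPD over live OPD, including precompute cost."""
    return t_online_train / (t_offline_train + t_precompute)
```

Reporting both this effective ratio and the loop-only ratio t_online_train / t_offline_train would resolve the ambiguity flagged in the comment.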

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential infrastructure benefits of Lightning OPD. We address each major comment below, providing clarifications from the manuscript and outlining targeted revisions that strengthen the theoretical and empirical claims without altering the core contributions.

Point-by-point responses
  1. Referee: [§4] §4 (Proof of equivalence): The claim that Lightning OPD shares the same optimum as standard OPD under teacher consistency is plausible because both losses reach their minimum when the student matches the teacher. However, the subsequent statement of 'bounded gradient discrepancy' is load-bearing for the dynamics claim yet lacks an explicit bound expressed in terms of a divergence measure such as KL(π_student || π_SFT) or total variation distance between the fixed SFT distribution and the evolving student distribution. Without this, it remains unclear whether the bound stays small as training proceeds and the student policy drifts.

    Authors: We agree that an explicit bound expressed via a standard divergence would improve clarity. The manuscript already shows that, under teacher consistency, the gradient of Lightning OPD differs from standard OPD by a term whose magnitude is controlled by the total variation distance between the fixed SFT rollout distribution and the current student distribution (multiplied by a constant depending on the teacher’s maximum log-probability gap). We will add a corollary in the revised §4 that invokes Pinsker’s inequality to restate this bound directly in terms of KL(π_student || π_SFT). We will also include a short discussion of how the implicit regularization term derived in the proof keeps the KL term from growing unboundedly, thereby addressing the concern about drift during extended training. revision: yes

  2. Referee: [§5.3] §5.3 (Experimental setup and ablations): The reported performance parity (e.g., 69.9% on AIME 2024) and 4.0x efficiency are encouraging, but the evaluation uses relatively short training horizons. The paper should add an ablation that tracks policy divergence (e.g., KL or win-rate against the SFT policy) over longer training to test whether the claimed implicit regularization actually prevents accumulating mismatch, as the offline expectation is taken over a stale distribution while standard OPD samples from the current student.

    Authors: We concur that longer-horizon tracking would provide stronger empirical support for the regularization claim. Our current results already demonstrate parity at the reported compute budgets, but we will extend the training runs for both Lightning OPD and standard OPD on the math reasoning tasks and add new figures in §5.3 that plot (i) KL(π_student || π_SFT) and (ii) win-rate against the SFT policy as functions of training steps. These ablations will directly compare the divergence trajectories and confirm whether the offline formulation maintains lower drift than would be expected from a stale distribution. revision: yes
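
The corollary promised in response 1 can be stated schematically; the notation is ours and the constant is left abstract, so this is a sketch of the claimed shape, not the authors' result.

```latex
% Schematic form of the promised bound: the gradient gap between offline and
% online OPD is controlled by total variation and, via Pinsker's inequality,
% by the KL divergence the referee asks for. C depends on the teacher's
% maximum per-token log-probability gap.
\[
\big\| \nabla_{\theta}\mathcal{L}_{\mathrm{offline}}(\theta)
      - \nabla_{\theta}\mathcal{L}_{\mathrm{online}}(\theta) \big\|
  \;\le\; C \cdot \mathrm{TV}\!\big(\pi_{\mathrm{SFT}},\, \pi_{\theta}\big)
  \;\le\; C \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\pi_{\theta} \,\|\, \pi_{\mathrm{SFT}}\big)}.
\]
```

The ablation promised in response 2, tracking KL(π_student || π_SFT) over training, could be estimated as below; the sampling and scoring callables are placeholders, not the authors' code.

```python
# Hypothetical drift tracker: Monte-Carlo estimate of KL(student || SFT) by
# sampling from the current student and scoring the samples under both
# policies. Log the value at each checkpoint and plot it against steps.
import torch

@torch.no_grad()
def estimate_drift(student, sft_policy, prompts, sample_fn, seq_logprob_fn):
    kls = []
    for p in prompts:
        y = sample_fn(student, p)                        # rollout from current student
        kls.append(seq_logprob_fn(student, p, y)
                   - seq_logprob_fn(sft_policy, p, y))   # log pi_student - log pi_SFT
    return sum(kls) / len(kls)
```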

Circularity Check

0 steps flagged

No circularity: the proof of the shared optimum under teacher consistency is presented as an independent mathematical result.

full rationale

The paper introduces teacher consistency as an explicit condition (same teacher for SFT and OPD) and claims a separate proof that Lightning OPD then shares the optimum with standard OPD, plus bounded gradient discrepancy. No equations are supplied in the abstract or description that reduce the claimed optimum or bound to a fitted quantity, a self-referential definition, or a prior self-citation chain. The derivation is therefore treated as self-contained external reasoning rather than a renaming or construction that forces the result by the inputs alone. Experiments are reported separately and do not substitute for the proof.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that teacher consistency eliminates gradient bias and on the unverified mathematical proof that the offline procedure shares the same optimum as live OPD.

axioms (1)
  • domain assumption Teacher consistency: the identical teacher model must be used for both the initial SFT stage and the precomputation of log-probabilities for OPD.
    Cited as the root cause of failure when naively performing offline OPD and as the condition under which equivalence holds.

pith-pipeline@v0.9.0 · 5613 in / 1476 out tokens · 49025 ms · 2026-05-11T01:00:37.728795+00:00 · methodology

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rubric-based On-policy Distillation

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  2. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  3. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  4. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

Reference graph

Works this paper leans on

66 extracted references · 47 canonical work pages · cited by 4 Pith papers · 27 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  2. [2]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  3. [3]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  4. [4]

    Nvidia nemotron 3: Efficient and open intelligence, 2025

    NVIDIA. Nvidia nemotron 3: Efficient and open intelligence.arXiv preprint arXiv:2512.20856, 2025

  5. [5]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

  6. [6]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, volume 35, 2022

  7. [7]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  8. [8]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024

  9. [9]

    On-policy distillation.Thinking Machines Lab: Connectionism, 2025

    Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

  10. [10]

    Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

  11. [11]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

  12. [12]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  13. [13]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  14. [14]

    Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation

    Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade 2: Post-training llms with cascade rl and multi-domain on-policy distillation. 2026

  15. [15]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  16. [16]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  17. [17]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  18. [18]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  19. [19]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025

  20. [20]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025

  21. [21]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 35, 2022

  22. [22]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu. REINFORCE++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

  23. [23]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs.arXiv preprint arXiv:2402.14740, 2024

  24. [24]

    Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505, 2023

    Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505, 2023

  25. [25]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  26. [26]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  27. [27]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, Aonian Shan, Bangwei Gong, Bo Yang, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  28. [28]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456, 2025

  29. [29]

    What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret

    Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind PPO’s collapse in long-CoT? value optimization holds the secret.arXiv preprint arXiv:2503.01491, 2025

  30. [30]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, Tiantian Fan, Zhengyin Du, Xiangpeng Wei, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118, 2025

  31. [31]

    VinePPO: Refining Credit Assignment in RL Training of LLMs

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs.arXiv preprint arXiv:2410.01679, 2024

  32. [32]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chengqi Zhao, Chenggang Deng, Chengpeng Zhang, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

  33. [33]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, et al. Kimi k1.5: Scaling reinforcement learning with LLMs.arXiv preprint arXiv:2501.12599, 2025

  34. [34]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290, 2025

  35. [35]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

  36. [36]

    Learning to Reason under Off-Policy Guidance

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance.arXiv preprint arXiv:2504.14945, 2025

  37. [37]

    Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.arXiv preprint arXiv:2506.07527, 2025

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions.arXiv preprint arXiv:2506.07527, 2026

  38. [38]

    On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting.arXiv preprint arXiv:2508.11408, 2026

    Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning and reinforcement learning via dynamic weighting.arXiv preprint arXiv:2508.11408, 2025

  39. [39]

    Beyond two-stage training: Cooperative sft and rl for llm reasoning.arXiv preprint arXiv:2509.06948, 2025

    Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative SFT and RL for LLM reasoning.arXiv preprint arXiv:2509.06948, 2025

  40. [40]

    UFT: Unifying supervised and reinforcement fine-tuning

    Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar. UFT: Unifying supervised and reinforcement fine-tuning. arXiv preprint arXiv:2505.16984, 2025

  41. [41]

    Blending supervised and reinforcement fine-tuning with prefix sampling.arXiv preprint arXiv:2507.01679,

    Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, and Ivan Titov. Blending supervised and reinforcement fine-tuning with prefix sampling.arXiv preprint arXiv:2507.01679, 2025

  42. [42]

    Distilling the knowledge in a neural network.NIPS Deep Learning Workshop, 2015

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.NIPS Deep Learning Workshop, 2015

  43. [43]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016

  44. [44]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  45. [45]

    Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643,

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643, 2025

  46. [46]

    ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

    Kun Liang, Clive Bai, Xin Xu, Chenming Tang, Sanwoo Lee, Weijie Liu, Saiyong Yang, and Yunfang Wu. Orbit: On-policy exploration-exploitation for controllable multi-budget reasoning. arXiv preprint arXiv:2601.08310, 2026

  47. [47]

    Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models.arXiv preprint arXiv:2602.04942, 2026

  48. [48]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

  49. [49]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992

  50. [50]

    Conservative Q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. InAdvances in Neural Information Processing Systems, volume 33, 2020

  51. [51]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022

  52. [52]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019

  53. [53]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, 2023

  54. [54]

    Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

  55. [55]

    RAFT: Reward ranked finetuning for generative foundation model alignment

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023

  56. [56]

    Llms can learn to reason via off-policy rl.arXiv preprint arXiv:2602.19362, 2026

    Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, and Wen Sun. LLMs can learn to reason via off-policy RL.arXiv preprint arXiv:2602.19362, 2026

  57. [57]

    PCL-Reasoner-V1.5: Advancing math reasoning with offline reinforcement learning.arXiv preprint arXiv:2601.14716, 2026

    Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, and Yonghong Tian. PCL-Reasoner-V1.5: Advancing math reasoning with offline reinforcement learning.arXiv preprint arXiv:2601.14716, 2026

  58. [58]

    Encompassing diversity and complexity in code generation

    Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, and Scarlett Li. Encompassing diversity and complexity in code generation. arXiv preprint arXiv:2501.04694, 2025

  59. [59]

    AIME 2024.https://huggingface.co/datasets/AI-MO/aimo-validation-aime, 2024

    AI-MO. AIME 2024.https://huggingface.co/datasets/AI-MO/aimo-validation-aime, 2024

  60. [60]

    AIME 2025.https://github.com/open-compass/opencompass, 2025

    OpenCompass. AIME 2025.https://github.com/open-compass/opencompass, 2025

  61. [61]

    MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions.arXiv preprint arXiv:2505.23281, 2025

  62. [62]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

  63. [63]

    LlamaFactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (System Demonstrations), 2024

  64. [64]

    slime: An SGLang-native post-training framework for RL scaling.https://github.com/THUDM/slime, 2025

    THUDM. slime: An SGLang-native post-training framework for RL scaling.https://github.com/THUDM/slime, 2025

  65. [65]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  66. [66]

    Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models.arXiv preprint arXiv:2512.13607, 2025

    Yang Chen, Zhuolin Yang, Zihan Liu, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models. arXiv preprint arXiv:2512.13607, 2025