Recognition: 3 theorem links · Lean Theorem
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Pith reviewed 2026-05-16 05:06 UTC · model grok-4.3
The pith
Generalized on-policy distillation using reward extrapolation enables students to surpass their teachers when merging domain experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that on-policy distillation corresponds to a constrained RL objective where the reward and KL terms are always balanced equally. By introducing a reward scaling factor greater than one in the generalized objective, the method achieves reward extrapolation that improves student performance. In settings that combine multiple domain-specific experts, the extrapolated student outperforms both the combined teacher and the separate experts. In strong-to-weak distillation, choosing the teacher's pre-RL base model as the reference further refines the reward signal.
What carries the argument
The central mechanism is the reward scaling factor applied to the teacher-derived reward in the G-OPD loss function, which increases the influence of the reward relative to the KL regularization term.
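As a rough illustration of that mechanism, the sketch below computes a per-token training signal on student-sampled tokens. It assumes the generalized objective multiplies the log-ratio reward (teacher versus reference) by a scale and subtracts the student-versus-reference log-ratio as the KL term; this form is a reading of the abstract, and the function name `gopd_token_rewards`, the parameter `scale`, and all numbers are hypothetical rather than taken from the paper.

```python
import math

def gopd_token_rewards(student_lp, teacher_lp, ref_lp, scale=1.0):
    """Per-token signal for tokens sampled from the student (assumed form).

    signal_t = scale * (teacher_lp_t - ref_lp_t) - (student_lp_t - ref_lp_t)
    The first term is the scaled reward, the second the KL penalty sample.
    """
    if not (len(student_lp) == len(teacher_lp) == len(ref_lp)):
        raise ValueError("log-prob sequences must have equal length")
    return [
        scale * (t - r) - (s - r)
        for s, t, r in zip(student_lp, teacher_lp, ref_lp)
    ]

# Hypothetical log-probabilities of three sampled tokens.
student = [math.log(0.5), math.log(0.2), math.log(0.9)]
teacher = [math.log(0.6), math.log(0.1), math.log(0.9)]
ref_a = [math.log(0.3), math.log(0.3), math.log(0.8)]
ref_b = [math.log(0.7), math.log(0.05), math.log(0.4)]

# scale = 1: the reference cancels and the signal is teacher_lp - student_lp.
opd_a = gopd_token_rewards(student, teacher, ref_a, scale=1.0)
opd_b = gopd_token_rewards(student, teacher, ref_b, scale=1.0)

# scale > 1: the reward term is extrapolated and the reference now matters.
ex = gopd_token_rewards(student, teacher, ref_a, scale=1.5)
```

With `scale=1.0` the two reference choices give the same signal, which matches the claim that standard OPD works with any reference model. With `scale=1.5` the signal shifts by half of the teacher-versus-reference log-ratio, so the choice of reference starts to affect training.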
If this is right
- ExOPD outperforms standard OPD across different teacher-student size combinations on math and code tasks.
- Merging domain experts with ExOPD lets the student surpass the teacher's performance boundary and beat the domain teachers.
- Using the teacher's pre-RL base model as reference in strong-to-weak distillation provides a more accurate reward and boosts performance.
Where Pith is reading between the lines
- Similar reward scaling could potentially improve other forms of distillation where the teacher is a fine-tuned model.
- Exploring reference models other than the pre-RL base might reduce the access requirement while retaining benefits.
- The results suggest that the standard equal weighting in OPD may under-emphasize the task reward in some cases.
Load-bearing premise
The claim depends on the assumption that scaling up the reward weight produces better policies without causing the student to diverge from useful behavior.
What would settle it
If experiments applying ExOPD to the code generation tasks show that the student performance does not exceed the teacher's on the evaluation benchmarks, the central claim would be falsified.
original abstract
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that standard on-policy distillation (OPD) is a special case of dense KL-constrained RL (with equal reward-KL weighting and arbitrary reference), introduces the Generalized On-Policy Distillation (G-OPD) framework that adds a flexible reference model and a tunable reward scaling factor, and reports that setting the scaling factor >1 (ExOPD) yields consistent gains over OPD on math reasoning and code generation. In particular, ExOPD enables merging of domain-expert knowledge back into the base student such that the student surpasses the original teacher, and in strong-to-weak settings choosing the teacher's pre-RL checkpoint as reference further improves the reward signal.
Significance. If the empirical claims hold under rigorous testing, the work supplies a simple, theoretically motivated knob (reward extrapolation) that can push distillation performance past the teacher in multi-expert merging scenarios and clarifies the role of the reference model in strong-to-weak transfer. The explicit reduction of OPD to a special case of KL-constrained RL is a useful organizing insight for the field.
major comments (3)
- [Abstract and §5 (Experiments)] Abstract and experimental results: the claims of 'consistent improvements' and student outperformance of the teacher in domain-expert merging are presented without run counts, error bars, or statistical tests. This is load-bearing for the central empirical claim and must be addressed before the gains can be considered reliable.
- [§4.2 and §5.3] Reward-correction step in strong-to-weak distillation: the improvement is tied to using the teacher's pre-RL base model as reference. It is unclear whether the domain-expert merging results (where the student surpasses the teacher) also rely on this specific reference choice or employ a different reference; the paper should state the reference model used in each table/figure and quantify the extra overhead.
- [§3 (Theoretical Analysis)] Theoretical section: the statement that OPD is recovered exactly when the scaling factor equals 1 and the reference is arbitrary is asserted directly. The derivation should explicitly substitute the G-OPD objective (Eq. for the generalized loss) back into the dense KL-constrained RL objective to show the reduction without additional assumptions.
minor comments (2)
- [Throughout] Notation for the reward scaling factor should be introduced once and used uniformly; occasional switches between symbols or implicit definitions reduce readability.
- [Figures 2-4 and Tables 1-3] Figure captions and table footnotes should explicitly list the reference model and scaling factor used in each row/column so that the ExOPD vs. OPD comparison is immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped improve the clarity and rigor of our work. We address each major comment below and have revised the manuscript to incorporate the suggested changes.
point-by-point responses
-
Referee: [Abstract and §5 (Experiments)] Abstract and experimental results: the claims of 'consistent improvements' and student outperformance of the teacher in domain-expert merging are presented without run counts, error bars, or statistical tests. This is load-bearing for the central empirical claim and must be addressed before the gains can be considered reliable.
Authors: We agree that statistical details are necessary to substantiate the empirical claims. In the revised manuscript, we have rerun all key experiments across 5 random seeds, reporting means with standard deviations in the tables and adding error bars to the figures. We have also added paired t-test results confirming that the improvements of ExOPD over OPD are statistically significant (p < 0.05) in the math and code merging settings. revision: yes
-
Referee: [§4.2 and §5.3] Reward-correction step in strong-to-weak distillation: the improvement is tied to using the teacher's pre-RL base model as reference. It is unclear whether the domain-expert merging results (where the student surpasses the teacher) also rely on this specific reference choice or employ a different reference; the paper should state the reference model used in each table/figure and quantify the extra overhead.
Authors: We clarify that domain-expert merging experiments use the original base student model (pre-domain-RL) as the reference, whereas strong-to-weak distillation uses the teacher's pre-RL checkpoint. We have updated all table and figure captions to explicitly state the reference model for each setting. The additional overhead of a non-default reference is one extra forward pass per update step, increasing wall-clock training time by approximately 12% on our hardware setup. revision: yes
-
Referee: [§3 (Theoretical Analysis)] Theoretical section: the statement that OPD is recovered exactly when the scaling factor equals 1 and the reference is arbitrary is asserted directly. The derivation should explicitly substitute the G-OPD objective (Eq. for the generalized loss) back into the dense KL-constrained RL objective to show the reduction without additional assumptions.
Authors: We have expanded the theoretical section with an explicit derivation. Starting from the G-OPD loss L = E[λ r(s,a) - KL(π_θ || π_ref)], where λ is the reward scaling factor, we substitute λ = 1 to recover the dense KL-constrained RL objective with equal reward-KL weighting; the reduction holds for any reference model without further assumptions. The revised §3 now includes this step-by-step substitution. revision: yes
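The statistical reporting described in the first response (per-seed means, standard deviations, and a paired t-test) can be sketched with the Python standard library. This is a minimal illustration only: the helper `paired_t_statistic` is hypothetical, and the scores are placeholder values, not results from the paper or the rebuttal.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a versus b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two paired runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample std, n - 1 denominator
    if sd_diff == 0:
        raise ValueError("zero variance in differences; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder accuracies for five seeds; NOT results from the paper.
exopd_scores = [52.0, 51.0, 53.0, 52.5, 51.5]
opd_scores = [50.0, 50.5, 51.0, 50.0, 50.5]

t_stat, dof = paired_t_statistic(exopd_scores, opd_scores)
print(f"ExOPD: {statistics.mean(exopd_scores):.2f} +/- {statistics.stdev(exopd_scores):.2f}")
print(f"OPD:   {statistics.mean(opd_scores):.2f} +/- {statistics.stdev(opd_scores):.2f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared against the two-sided critical value for the chosen significance level and degrees of freedom; `scipy.stats.ttest_rel` returns the p-value directly if SciPy is available. The runs must be paired by seed for the test to be meaningful.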
Circularity Check
No significant circularity; G-OPD and ExOPD are explicit design extensions validated empirically.
full rationale
The paper first proves OPD is a special case of dense KL-constrained RL (equal reward/KL weights, arbitrary reference). It then defines G-OPD by adding a free reference model and explicit reward scaling factor (set >1 for ExOPD). All performance claims (surpassing teacher in domain-expert merging, gains in strong-to-weak) come from direct experiments on math/code tasks, not from any parameter fitted to the target metric and renamed as prediction. No self-citation chains, uniqueness theorems, or ansatzes reduce the central results to inputs by construction. The scaling factor is presented as a tunable hyperparameter whose effect is measured, not derived tautologically.
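The reduction mentioned at the start of this rationale can be written out in a few lines. The block below is a sketch under assumed notation, which may differ from the paper's: \pi^* is the teacher, \pi_{\mathrm{ref}} is an arbitrary reference with support covering that of \pi_\theta, r is the log-ratio reward, and \lambda is the reward scaling factor.

```latex
\begin{align*}
D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi^*\right)
  &= \mathbb{E}_{y\sim\pi_\theta(\cdot|x)}\!\left[\log\pi_\theta(y|x) - \log\pi^*(y|x)\right] \\
  &= \mathbb{E}_{y\sim\pi_\theta(\cdot|x)}\!\left[\log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right]
   - \mathbb{E}_{y\sim\pi_\theta(\cdot|x)}\!\left[\log\frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right] \\
  &= D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
   - \mathbb{E}_{y\sim\pi_\theta(\cdot|x)}\!\left[r(x,y)\right],
  \qquad r(x,y) := \log\frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}.
\end{align*}
% Minimizing the left-hand side is therefore the same as maximizing
%   E[r] - KL(pi_theta || pi_ref), with reward and KL weighted 1:1.
% Assumed generalized form, with the scaling factor on the reward:
\[
\max_\theta \; \lambda\,\mathbb{E}_{y\sim\pi_\theta(\cdot|x)}\!\left[r(x,y)\right]
  - D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\qquad \lambda = 1 \text{ recovers OPD}, \quad \lambda > 1 \text{ extrapolates the reward.}
\]
```

The reference terms cancel only when \lambda = 1, which is why the reference model is arbitrary for standard OPD but becomes a real design choice once the reward is scaled.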
Axiom & Free-Parameter Ledger
free parameters (1)
- reward scaling factor
axioms (1)
- domain assumption: On-policy distillation is a special case of dense KL-constrained RL with equal weighting of reward and regularization and an arbitrary reference model
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · contradicts?
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD
-
Foundation.LawOfExistence · nothing_cannot_exist · contradicts?
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear?
UNCLEAR: relation between the paper passage and the cited Recognition theorem.
OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
-
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...
-
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
-
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.
-
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.
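A Detailed Math Derivations
Here, we make mathematical derivations to calculate the expected gradients of OPD objective in Eq. (4). Since
$$J_{\mathrm{OPD}}(\theta) = \min_\theta \mathbb{E}_{x\sim D,\, y\sim\pi_\theta(\cdot|x)}\left[ D_{\mathrm{KL}}\left(\pi_\theta(y|x) \,\|\, \pi^*(y|x)\right) \right] = \min_\theta \mathbb{E}_{x\sim D,\, y\sim\pi_\theta(\cdot|x)}\left[ \log\pi_\theta(y|x) - \log\pi^*(y|x) \right]. \tag{15}$$
We can get
$$\nabla_\theta J_{\mathrm{OPD}}(\theta) = \nabla_\theta \mathbb{E}_{x\sim D,\, y\sim\pi_\theta(\cdot|x)}\big[ \log\pi_\theta(y|x) - \log\pi^*(y|x) \big] = \nabla_\theta \mathbb{E}_x\Big[ \sum_y \pi_\theta(y|x) \big( \log\pi_\theta(y|x) - l...$$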