pith. machine review for the scientific record. sign in

arxiv: 2605.03677 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: unknown

Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords on-policy distillationlarge language modelsmultimodal language modelsdata balancingmargin calibrationpost-trainingknowledge distillation
0
0 comments X

The pith

Uni-OPD fixes two core limits in on-policy distillation so a single student model can reliably absorb capabilities from expert LLMs and MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that on-policy distillation often fails because student models do not explore enough useful states they generate and because teacher feedback on those rollouts does not stay consistent with final outcome rewards. To address both issues at once, the authors introduce a dual-perspective recipe: two data-balancing steps that increase exposure to informative student states, plus an outcome-guided margin calibration step that restores order between correct and incorrect trajectories. Experiments across five domains and sixteen benchmarks show the combined approach improves performance in single-teacher, multi-teacher, strong-to-weak, and cross-modal settings for both language and multimodal models. Readers care because the method turns a post-training technique that sometimes works into one that generalizes more predictably when consolidating specialized experts into one model.

Core claim

On-policy distillation improves when viewed through two lenses: from the student, data balancing strategies increase exploration of informative self-generated states; from the teacher, outcome-guided margin calibration ensures token-level guidance remains order-consistent with the final reward. Together these steps form Uni-OPD, a framework shown to work across LLMs and MLLMs in single-teacher, multi-teacher, strong-to-weak, and cross-modal distillation tasks.

What carries the argument

Dual-perspective optimization strategy that pairs student-side data balancing with teacher-side outcome-guided margin calibration.

If this is right

  • Single-teacher and multi-teacher distillation both become more effective across language and multimodal models.
  • Strong-to-weak distillation gains reliability without extra model size.
  • Cross-modal distillation between LLMs and MLLMs works under the same unified recipe.
  • Practical guidelines emerge for choosing data balance ratios and margin values during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-view logic could be tested on non-language sequence tasks such as protein design or code generation where outcome rewards are available.
  • Margin calibration might be combined with other reward-modeling techniques to reduce the need for large outcome verifiers.
  • If the balancing strategies scale with model size, they could lower the compute cost of distilling many experts into one generalist.

Load-bearing premise

That insufficient state exploration and order-inconsistent teacher supervision are the primary bottlenecks in on-policy distillation and that the specific balancing and calibration fixes will improve results without creating new instabilities.

What would settle it

Applying the two data-balancing steps and margin calibration to a new domain or model family and observing no improvement or a performance drop relative to standard on-policy distillation would show the fixes do not reliably solve the stated bottlenecks.

Figures

Figures reproduced from arXiv: 2605.03677 by Chengquan Zhang, Fei Wu, Han Hu, Hehe Fan, Hongming Yang, Kaiqi Wang, Mingqi Gao, Shangpin Peng, Weinong Wang, Wenjin Hou, Yifei Chen, Yi Yang, Yue Zhang, Zhenglin Zhou, Zheng Ruan, Zhuotao Tian.

Figure 1
Figure 1. Figure 1: Overall performance comparisons and convergence behavior. Results are shown for settings including multi-teacher, strong-to-weak, and cross-modal distillation on math reasoning and code generation tasks. Uni-OPD consistently outperforms OPD and converges faster than RL, demonstrating its effectiveness across diverse settings. 1 Introduction Injecting complex reasoning abilities, domain knowledge, and human… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Uni-OPD framework. (Left) Offline difficulty-aware and online correctness-aware data balancing promote student exploration. (Right) Outcome-guided margin calibration mechanism improves the reliability of teacher supervision. (Middle) The resulting student policy merges complementary capabilities from multiple domain-specific teachers more effectively than standard OPD, leading to stronger o… view at source ↗
Figure 3
Figure 3. Figure 3: Data difficulty distribution and its impact on OPD performance. (Left) Training data often exhibits mirrored J-shaped or U-shaped difficulty distributions. (Right) A naive strategy is to filter out overly easy or overly hard samples (i.e., all-correct or all-wrong cases), but this reduces diversity. In contrast, our difficulty-balancing strategy upsamples mid-difficulty samples to preserve a balanced spect… view at source ↗
Figure 4
Figure 4. Figure 4: Impact of online correct and in view at source ↗
Figure 5
Figure 5. Figure 5: Demonstration of unreliable teacher supervision and outcome-guided margin calibration mechanism. (Left) Standard teacher supervision in OPD suffers from misalignment between trajectory-level return and outcome rewards, yielding unreliable supervision signals. (Right) Our method uses outcome rewards as a global anchor to calibrate returns through margin-based adjustment, restoring order consistency and impr… view at source ↗
Figure 6
Figure 6. Figure 6: Heatmap visualization of failure modes in OPD and the effect of margin shift. Left: an incorrect rollout with a high distillation return (top) and a correct rollout with a low one (bottom). Right: the same two rollouts after our margin shift, with the outcome ordering restored. 4.7 Analysis and Takeaways Based on our comprehensive and systematic study on both LLMs and MLLMs across single-teacher, multi-tea… view at source ↗
read the original abstract

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies two fundamental bottlenecks in on-policy distillation (OPD): insufficient exploration of informative states and unreliable teacher supervision for student rollouts. It proposes Uni-OPD, a unified framework generalizing across LLMs and MLLMs via a dual-perspective optimization strategy. From the student perspective, two data balancing strategies promote exploration of informative states; from the teacher perspective, an outcome-guided margin calibration restores order consistency between aggregated token-level guidance and outcome rewards. Extensive experiments are reported across 5 domains and 16 benchmarks in settings including single/multi-teacher distillation, strong-to-weak, and cross-modal distillation, claiming to verify the framework's effectiveness and versatility.

Significance. If the results hold with proper isolation of contributions, the work could provide a practical, generalizable recipe for reliable OPD that bridges language and multimodal models. The dual-perspective design (addressing both student exploration and teacher supervision consistency) offers a coherent way to mitigate stated bottlenecks, and the breadth of evaluated settings (cross-modal, multi-teacher) would be a notable strength if supported by controlled evidence.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts 'extensive experiments on 5 domains and 16 benchmarks' with claims of verifying effectiveness, yet the provided text supplies no quantitative results, baseline comparisons, statistical details, error bars, or ablation evidence. This prevents verification of the central claim that the dual-perspective recipe reliably addresses the two bottlenecks.
  2. [§4] §4 (Experiments): No systematic ablations or controlled comparisons isolate the causal contribution of the two data balancing strategies versus the outcome-guided margin calibration (or their combination). Without such isolation, performance gains cannot be confidently attributed to the proposed fixes rather than unstated factors such as rollout length, reward sparsity, or hyperparameter choices, undermining the claim that these mechanisms restore order consistency and improve OPD.
  3. [§3] §3 (Method): The outcome-guided margin calibration is motivated as restoring order consistency with the outcome reward, but the manuscript provides no formal definition, guarantee, or analysis showing that the calibration reliably achieves this across teacher models or modalities without introducing new instabilities in multi-teacher or cross-modal regimes.
minor comments (2)
  1. [Abstract] The term 'order-consistent' is used repeatedly but lacks an early formal definition or mathematical characterization; adding this in §2 would improve readability.
  2. [Abstract] Consider including at least one concrete performance delta or table reference in the abstract to convey the scale of improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on ensuring that experimental claims are fully supported by visible evidence and that the proposed mechanisms are rigorously isolated and analyzed. We address each major comment below and will revise the manuscript accordingly to strengthen clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The manuscript asserts 'extensive experiments on 5 domains and 16 benchmarks' with claims of verifying effectiveness, yet the provided text supplies no quantitative results, baseline comparisons, statistical details, error bars, or ablation evidence. This prevents verification of the central claim that the dual-perspective recipe reliably addresses the two bottlenecks.

    Authors: We acknowledge that the experimental results need to be presented more explicitly to support the claims. While §4 describes the evaluation across 5 domains and 16 benchmarks in multiple settings (single/multi-teacher, strong-to-weak, cross-modal), we will revise the manuscript to include complete quantitative tables with all baseline comparisons, performance deltas, error bars, and statistical significance tests. A concise summary of key metrics will also be added to the abstract and introduction for immediate verifiability. revision: yes

  2. Referee: [§4] §4 (Experiments): No systematic ablations or controlled comparisons isolate the causal contribution of the two data balancing strategies versus the outcome-guided margin calibration (or their combination). Without such isolation, performance gains cannot be confidently attributed to the proposed fixes rather than unstated factors such as rollout length, reward sparsity, or hyperparameter choices, undermining the claim that these mechanisms restore order consistency and improve OPD.

    Authors: We agree that isolating the contributions of each component is necessary to substantiate the dual-perspective design. In the revised version, we will add a new ablation subsection in §4 that evaluates (i) each data balancing strategy in isolation, (ii) the margin calibration alone, and (iii) their combination, while controlling for rollout length, reward sparsity, and hyperparameter settings. These controlled comparisons will directly attribute gains to the proposed mechanisms. revision: yes

  3. Referee: [§3] §3 (Method): The outcome-guided margin calibration is motivated as restoring order consistency with the outcome reward, but the manuscript provides no formal definition, guarantee, or analysis showing that the calibration reliably achieves this across teacher models or modalities without introducing new instabilities in multi-teacher or cross-modal regimes.

    Authors: The calibration is defined in §3.2 as an adjustment of token-level margins conditioned on the outcome reward to enforce order consistency between aggregated guidance and trajectory-level rewards. We provide empirical validation of improved consistency and stability across the evaluated regimes. However, we recognize the value of a formal treatment. We will add a precise mathematical definition together with a proposition characterizing the conditions under which order consistency is restored, plus a brief discussion of stability considerations in multi-teacher and cross-modal cases. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical proposal without self-referential derivations

full rationale

The paper identifies two bottlenecks in on-policy distillation via empirical observation, then introduces Uni-OPD as a dual-perspective framework consisting of student-side data balancing and teacher-side margin calibration. These are presented as new mechanisms motivated by the bottlenecks, with effectiveness shown through experiments across 16 benchmarks. No equations, parameter-fitting steps, predictions derived from fitted inputs, or self-citations that bear the central load appear in the abstract or described structure. The derivation chain is self-contained as a practical recipe rather than a closed mathematical loop reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that the two bottlenecks dominate OPD failure modes and that the dual strategies directly resolve them; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Reliable teacher supervision requires that aggregated token-level guidance remains order-consistent with the outcome reward
    Presented in the abstract as the hinge condition for the teacher-side mechanism.

pith-pipeline@v0.9.0 · 5587 in / 1202 out tokens · 67282 ms · 2026-05-07T16:57:08.054811+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV 2026-05 unverdicted novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

  2. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.

  3. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 5.0

    Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

Reference graph

Works this paper leans on

65 extracted references · 50 canonical work pages · cited by 2 Pith papers · 29 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    AIME 2024.https://huggingface.co/datasets/AI-MO/aimo-validation-aime,

    AI-MO. AIME 2024.https://huggingface.co/datasets/AI-MO/aimo-validation-aime,

  3. [3]

    Anthropic

    URLhttps://hkunlp.github.io/blog/2025/Polaris. Anthropic. Claude 2, 2023a. URL https://www-files.anthropic.com/production/images/ Model-Card-Claude-2.pdf. Anthropic. Introducing Claude, 2023b. URLhttps://www.anthropic.com/index/introducing-claude. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku,

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025a. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv pr...

  5. [5]

    Honeybee: Data recipes for vision-language reasoners.arXiv preprint arXiv:2510.12225,

    Hritik Bansal, Devandra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, and Ramakanth Pasunuru. Honeybee: Data recipes for vision-language reasoners.arXiv preprint arXiv:2510.12225,

  6. [6]

    V old: Reasoning transfer from llms to vision-language models via on-policy distillation.arXiv preprint arXiv:2510.23497, 2025

    Walid Bousselham, Hilde Kuehne, and Cordelia Schmid. VOLD: Reasoning transfer from LLMs to vision-language models via on-policy distillation.arXiv preprint arXiv:2510.23497,

  7. [7]

    X-opd: Cross-modal on-policy distillation for capability alignment in speech llms.arXiv preprint arXiv:2603.24596, 2026

    Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, and Tao Jin. X-OPD: Cross-modal on-policy distillation for capability alignment in speech llms.arXiv preprint arXiv:2603.24596,

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  9. [9]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456,

  10. [10]

    MiniLLM: On-Policy Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: On-policy distillation of large language models. arXiv preprint arXiv:2306.08543,

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025a. Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, and Yankai Lin. Learning to focus: Causal attentio...

  12. [12]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

  13. [13]

    Seeing is believing? a benchmark for multimodal large language models on visual illusions and anomalies.arXiv preprint arXiv:2602.01816,

    Wenjin Hou, Wei Liu, Han Hu, Xiaoxiao Sun, Serena Yeung-Levy, and Hehe Fan. Seeing is believing? a benchmark for multimodal large language models on visual illusions and anomalies.arXiv preprint arXiv:2602.01816,

  14. [14]

    Reinforcement Learning via Self-Distillation

    Jonas H¨ubotter, Frederike L¨ubeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802,

  15. [15]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276,

  16. [16]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

  17. [17]

    Stable On-Policy Distillation through Adaptive Target Reformulation

    Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation.arXiv preprint arXiv:2601.07155,

  18. [18]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models.arXiv preprint arXiv:2603.07079,

  19. [19]

    Explain in your own words: Improving reasoning via token-selective dual knowledge distillation.arXiv preprint arXiv:2603.13260,

    Minsang Kim and Seung Jun Baek. Explain in your own words: Improving reasoning via token-selective dual knowledge distillation.arXiv preprint arXiv:2603.13260,

  20. [20]

    Distillm-2: A contrastive approach boosts the distillation of llms.arXiv preprint arXiv:2503.07067, 2025

    Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs.arXiv preprint arXiv:2503.07067,

  21. [21]

    arXiv preprint arXiv:2603.11137 , year =

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137,

  22. [22]

    Efficient knowledge injection in LLMs via self-distillation.arXiv preprint arXiv:2412.14964,

    Kalle Kujanp¨a¨a, Pekka Marttinen, Harri Valpola, and Alexander Ilin. Efficient knowledge injection in LLMs via self-distillation.arXiv preprint arXiv:2412.14964,

  23. [23]

    Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation.arXiv preprint arXiv:2602.02994, 2026

    Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, and Jian Luan. Video-OPD: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation.arXiv preprint arXiv:2602.02994, 2026a. Jingyao Li, Senqiao Yang, Sitong Wu, Han Shi, Chuanyang Zheng, Hong Xu, and Jiaya Jia....

  24. [24]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026b. Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic ...

  25. [25]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 2023a. Haotian Liu, Ch...

  26. [26]

    Typicalness- aware learning for failure detection.arXiv preprint arXiv:2411.01981, 2024e

    Yijun Liu, Jiequan Cui, Zhuotao Tian, Senqiao Yang, Qingdong He, Xiaoling Wang, and Jingyong Su. Typicalness- aware learning for failure detection.arXiv preprint arXiv:2411.01981, 2024e. Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connectionism,

  27. [27]

    https://thinkingmachines.ai/blog/ on-policy-distillation/

    doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pp. 2263–2279,

  28. [28]

    AIME 2025.https://huggingface.co/datasets/opencompass/AIME2025,

    OpenCompass. AIME 2025.https://huggingface.co/datasets/opencompass/AIME2025,

  29. [29]

    Does your vision-language model get lost in the long video sampling dilemma?arXiv preprint arXiv:2503.12496,

    Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, and Jiaya Jia. Does your vision-language model get lost in the long video sampling dilemma?arXiv preprint arXiv:2503.12496,

  30. [30]

    Pope: Learning to reason on hard problems via privileged on-policy exploration.arXiv preprint arXiv:2601.18779, 2026

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. POPE: Learning to reason on hard problems via privileged on-policy exploration.arXiv preprint arXiv:2601.18779,

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Tong Shao, Zhuotao Tian, Hang Zhao, and Jingyong Su. Explore the potential of CLIP for training-free open vocabulary semantic segmentation. InEuropean Conference on Computer Vision, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathema...

  32. [32]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053,

  33. [33]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626,

  34. [34]

    Gates: Self-distillation under privileged context with consensus gating.arXiv preprint arXiv:2602.20574, 2026

    Alex Stein, Furong Huang, and Tom Goldstein. GATES: Self-distillation under privileged context with consensus gating.arXiv preprint arXiv:2602.20574,

  35. [35]

    CommonsenseQA: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),

  36. [36]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,

  37. [37]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

  38. [38]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  39. [39]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Systems, 37:95095–95169, 2024a. Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin...

  40. [40]

    Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    Yecheng Wu, Song Han, and Hai Cai. Lightning opd: Efficient post-training for large reasoning models with offline on-policy distillation.arXiv preprint arXiv:2604.13010,

  41. [41]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780,

  42. [42]

    Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973,

  43. [43]

    Ovd: On-policy verbal distillation.arXiv preprint arXiv:2601.21968, 2026

    Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, and Ngai Wong. OVD: On-policy verbal distillation.arXiv preprint arXiv:2601.21968,

  44. [44]

    RedStar: Does Scaling Long- CoT Data Unlock Better Slow-Reasoning Systems?

    Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, et al. Redstar: does scaling long-cot data unlock better slow-reasoning systems?arXiv preprint arXiv:2501.11284, 2025a. Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zh...

  45. [45]

    PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. PACED: Distillation at the frontier of student competence.arXiv preprint arXiv:2603.11178,

  46. [46]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024a. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint...

  47. [47]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR.arXiv preprint arXiv:2604.03128, 2026a. Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, and Shanghang Zhang. LiDAR-LLM: Exploring the potential of large ...

  48. [48]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275,

  49. [49]

    GLM-5: from Vibe Coding to Agentic Engineering

    URLhttps://aclanthology.org/P19-1472. Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

  50. [50]

    Fast and effective on-policy distillation from reasoning prefixes.arXiv preprint arXiv:2602.15260, 2026

    Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes.arXiv preprint arXiv:2602.15260, 2026a. Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang,...

  51. [51]

    Lyra: An efficient and speech-centric framework for omni-cognition.arXiv preprint arXiv:2412.09501,

    Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, et al. Lyra: An efficient and speech-centric framework for omni-cognition.arXiv preprint arXiv:2412.09501,

  52. [52]

    Imperfect Response

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment.Advances in Neural Information Processing Systems, 36: 55006–55021, 2023a. Guorui Zhou, Honghui Bao, Jiaming Huang, Jiaxin Deng, Jinghao Zhang, Junda She, Kuo Cai, Lejian Ren, Lu Ren, Qiang Luo, et a...

  53. [53]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023b. Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language mo...

  54. [54]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic vi- sual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836,

  55. [55]

    The decoding configuration is kept fixed throughout this offline phase: we use temperature =1.0 , top-p=0.95 , top-k=50 , and a maximum response length of 16,384 tokens

    under the same prompt template that will later be used at training time, so that the estimated difficulty reflects the actual input format the student will see. The decoding configuration is kept fixed throughout this offline phase: we use temperature =1.0 , top-p=0.95 , top-k=50 , and a maximum response length of 16,384 tokens. For each instance, we then...

  56. [56]

    Present the code in ‘‘‘python Your code ‘‘‘ at the end

    Training Prompt Template Math Reasoning <|im start|>user {question} Please reason step by step, and put your final answer within\boxed{}.<|im end|> <|im start|>assistant Code Reasoning <|im start|>user {question} Write Python code to solve the problem. Present the code in ‘‘‘python Your code ‘‘‘ at the end. You need to think first then write the Python co...

  57. [57]

    training sets. 6https://huggingface.co/datasets/OpenMMReasoner/OpenMMReasoner-RL-74K 7 0 10 20 30 40 50 Optimization Step 30 40 50 60 70 80Response Correct Ratio (%) 0 10 20 30 40 50 Optimization Step 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42Entropy 0 10 20 30 40 50 Optimization Step 2000 4000 6000 8000 10000 12000Average Response Length OPD Uni-OPD Figure ...

  58. [58]

    per prompt in the worst case (typicallyG≤16in our setup). In contrast, the dominant per-iteration cost of OPD comes from two stages whose complexity scales linearly with the total number of rollout tokens Ttok = ∑BG i=1 |τi| and cubically with the hidden size d: (i) sampling BG on-policy rollouts from the student, and (ii) running a teacher prefill pass o...

  59. [59]

    For math reasoning benchmarks, we sample N=32 solutions per problem, while for code generation benchmarks, we sample N=4 solutions per problem

    We use the vLLM inference engine to perform sampling. For math reasoning benchmarks, we sample N=32 solutions per problem, while for code generation benchmarks, we sample N=4 solutions per problem. For evaluation, we adopt Math-Verify10 as a rule- based verifier for math reasoning tasks. For code generation, we use the EvalPlus11 and LiveCodeBench12 frame...

  60. [60]

    following recent studies (Peng et al., 2026; Meng et al., 2024). Specifically, we report results on MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), TruthfulQA (Lin et al., 2022), Winogrande (Levesque et al., 2012), GSM8K (Cobbe et al., 2021), Common- senseQA (Talmor et al., 2019), and IFEval (Zhou et al., 2023b)....

  61. [61]

    Recently, some works have also combined RLVR with OPD, such as Self-Distilled RLVR (Yang et al., 2026a) and OpenClaw-RL (Wang et al., 2026)

    for sequence-level stable optimization. Recently, some works have also combined RLVR with OPD, such as Self-Distilled RLVR (Yang et al., 2026a) and OpenClaw-RL (Wang et al., 2026). In our work, we use GRPO to obtain stronger domain-specific teachers and use the corresponding reward models as global guidance for return calibration in OPD. 13 E.3 On-Policy ...

  62. [62]

    Recent studies further broaden this paradigm from multiple perspectives

    and GKD (Agarwal et al., 2024), establishes the basic paradigm of using teacher feedback on student-generated trajectories under a reverse KL objective. Recent studies further broaden this paradigm from multiple perspectives. In self-distillation methods, OPSD (Zhao et al., 2026b) uses privileged information; SDFT (Shenfeld et al.,

  63. [63]

    find all such functions,

    further extends OPD to speech through cross-modal alignment. In contrast, our work focuses on developing a unified OPD framework with an open recipe for both LLMs and MLLMs. F Case Studies We provide qualitative case studies ofUni-OPD, standard OPD, and the Student model across both LLM and MLLM benchmarks, covering textual math reasoning, code generation...

  64. [64]

    In 1996, Coal is shown as 35% of total output

    So:\n\nNuclear output in 2006 = 2 × 70 TWh = 140 TWh.\n\nThus, the correct answer is:\n\n<answer>A</answer> Uni-OPD: To solve this, we first need to find the output for Coal in 1996 from the chart. In 1996, Coal is shown as 35% of total output. Since total output in 1996 was 200 TWh, Coal output in 1996 is:\n\n35% of 200 TWh = 0.35 × 200 = 70 TWh.\n\nThe ...

  65. [65]

    \\boxed{}\

    So:\n\nNuclear in 2006 = 2 × 70 TWh = 140 TWh.\n\nLooking at the options, (A) 140TWh matches this value.\n\n<answer>A</answer> logicvista Figure F.4:Example output of LogicVista.All three models correctly answer this multi-step arithmetic reasoning question. OPD andUni-OPDboth reason concisely, withUni-OPDbeing slightly more token-efficient. 38.png Questi...