
arxiv: 2606.30626 · v1 · pith:4NXDGBZLnew · submitted 2026-06-29 · 💻 cs.AI

DOPD: Dual On-policy Distillation

Pith reviewed 2026-06-30 05:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords on-policy distillation · privilege illusion · dual distillation · large language models · vision-language models · advantage-aware routing · token-level supervision

The pith

DOPD routes each token's supervision between privileged teacher and student policies using advantage gaps to reduce privilege illusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that adding privileged information to on-policy distillation creates privilege illusion, where models mix transferable capability gaps with non-replicable information asymmetry, worsened by uneven token importance. DOPD counters this by dynamically assigning each token to either the privileged teacher or privileged student for supervision, chosen according to advantage gap and relative probabilities. This gives tokens different objectives and strengths while transferring real capability and using auxiliary signals for asymmetry. If correct, the method raises the performance of distillation for large models by avoiding the previous failure mode. Readers care because it enables more reliable transfer of capabilities in language and vision-language settings without requiring perfectly symmetric information.
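For orientation, the dense token-level signal in plain on-policy distillation can be sketched as a per-position divergence between the student and teacher distributions, evaluated along a trajectory the student sampled itself. The sketch below uses reverse KL, which is one common choice; the abstract does not name the divergence, so that choice, the function names `reverse_kl` and `opd_token_losses`, and the toy distributions are all assumptions made for illustration.

```python
import math
from typing import List


def reverse_kl(student_probs: List[float], teacher_probs: List[float]) -> float:
    """KL(student || teacher) over one position's vocabulary distribution."""
    eps = 1e-12  # guards against log(0) when the teacher assigns zero mass
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def opd_token_losses(
    student_dists: List[List[float]], teacher_dists: List[List[float]]
) -> List[float]:
    """One loss value per position of a student-sampled trajectory."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]


if __name__ == "__main__":
    # Toy two-token vocabulary, three positions (hypothetical numbers).
    student = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
    teacher = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
    for pos, loss in enumerate(opd_token_losses(student, teacher)):
        print(f"position {pos}: loss = {loss:.4f}")
```

Every position contributes a loss here, which is the uniform supervision that the paper argues is poorly matched to uneven token importance.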

Core claim

DOPD is an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals to alleviate privilege illusion.

What carries the argument

Advantage-aware dual routing that assigns per-token supervision from either privileged teacher or privileged student based on advantage gap and relative probabilities.
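The abstract does not state the routing rule (the simulated rebuttal further down points to Section 3 of the paper), so the following is only a minimal sketch of what a per-token, advantage-aware router could look like. The name `route_tokens`, the threshold `tau`, the specific decision rule, and the input values are hypothetical, chosen purely to illustrate sending each token to one of two supervision sources.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenStats:
    """Per-token quantities for one student-sampled token (hypothetical inputs)."""
    adv_teacher: float  # advantage estimate under the privileged teacher
    adv_student: float  # advantage estimate under the privileged student
    p_teacher: float    # probability the privileged teacher gives the sampled token
    p_student: float    # probability the privileged student gives the sampled token


def route_tokens(tokens: List[TokenStats], tau: float = 0.1) -> List[str]:
    """Assign each token to 'teacher' or 'student' supervision.

    Illustrative rule (an assumption, not the paper's formula): pick the
    teacher when its advantage exceeds the student's by more than `tau` AND
    it is at least as confident in the sampled token; otherwise fall back to
    the privileged student.
    """
    routes = []
    for t in tokens:
        gap = t.adv_teacher - t.adv_student
        if gap > tau and t.p_teacher >= t.p_student:
            routes.append("teacher")
        else:
            routes.append("student")
    return routes


if __name__ == "__main__":
    batch = [
        TokenStats(adv_teacher=0.9, adv_student=0.2, p_teacher=0.8, p_student=0.5),
        TokenStats(adv_teacher=0.3, adv_student=0.3, p_teacher=0.6, p_student=0.6),
        TokenStats(adv_teacher=0.7, adv_student=0.1, p_teacher=0.2, p_student=0.9),
    ]
    print(route_tokens(batch))  # ['teacher', 'student', 'student']
```

In this toy rule the third token has a large advantage gap but is still routed to the student because the teacher assigns it low probability. It shows how the two criteria can disagree, which is where the load-bearing premise would be tested.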

If this is right

  • DOPD outperforms vanilla OPD and other counterparts on LLM and VLM settings.
  • The method yields gains in stability, robustness, continual learning, and out-of-distribution performance.
  • Tokens receive supervision varying in strength, objective, and strategy from either source.
  • Privilege illusion is reduced by separating capability transfer from information asymmetry.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The routing logic may extend to other distillation or imitation settings where one party holds extra context that cannot be replicated.
  • Non-uniform token importance could be used more broadly to focus training on capability-bearing signals rather than uniform dense supervision.
  • The dual-policy approach might lower the requirement for exact capability matching between teacher and student in future distillation work.

Load-bearing premise

That advantage gap and relative probabilities can separate transferable capability signals from non-replicable information asymmetry without creating new biases or instability.

What would settle it

An experiment on a small-scale LLM task where DOPD routing is applied but the resulting model shows no gain over vanilla OPD or where the chosen routes fail to correlate with measured capability transfer on held-out data.
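One hedged way to operationalize the second half of this test is to correlate the routing decision with a per-token measure of capability transfer on held-out data. The sketch below computes a Pearson correlation between a binary route indicator and such a score. The helper name `pearson`, the notion of a per-token transfer score, and the toy numbers are assumptions for illustration, not quantities defined by the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # 1 = token routed to the teacher, 0 = routed to the student (toy data).
    routed_to_teacher = [1, 1, 0, 0, 1, 0]
    # Hypothetical held-out transfer score per token (higher = more transfer).
    transfer_score = [0.8, 0.7, 0.1, 0.2, 0.9, 0.3]
    r = pearson(routed_to_teacher, transfer_score)
    print(f"route/transfer correlation: {r:.3f}")
```

A correlation near zero on real held-out data would match the failure case described above. A high correlation would be supportive but would not by itself show that routing causes the transfer.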

Original abstract

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DOPD, an advantage-aware dual on-policy distillation paradigm for LLMs and VLMs. It dynamically routes token-level supervision between privileged teacher and privileged student policies using advantage gap and relative probabilities to address 'privilege illusion' (conflation of transferable capability gaps with non-replicable information asymmetry). The paper claims DOPD outperforms Vanilla OPD and other methods, with further benefits on stability, robustness, continual learning, and OOD tasks.

Significance. If the routing rule reliably isolates replicable capability signals from privilege-only asymmetry without introducing new biases or instability, the approach could meaningfully advance on-policy distillation by leveraging dual privileged policies and token-level adaptivity. The emphasis on non-uniform token supervision is a relevant direction for large-model training.

major comments (2)
  1. [Abstract] The claim that 'DOPD consistently outperforms Vanilla OPD and other counterparts' is load-bearing for the central experimental claim, yet the abstract supplies no metrics, baselines, statistical details, ablation results, or experimental setup to support it.
  2. [Abstract] The routing mechanism is described only at a conceptual level ('dynamically routes token-level supervision ... based on their advantage gap and relative probabilities') with no equations, algorithm, or pseudocode, which prevents assessing whether the heuristic correctly separates capability-bearing tokens from information-asymmetry tokens.
minor comments (1)
  1. [Abstract] The newly introduced term 'privilege illusion' is not formally defined or illustrated with an example, which reduces clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract can be strengthened for clarity and will revise it to better support the central claims while respecting length constraints. Below we address each point.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'DOPD consistently outperforms Vanilla OPD and other counterparts' is load-bearing for the central experimental claim, yet the abstract supplies no metrics, baselines, statistical details, ablation results, or experimental setup to support it.

    Authors: We agree the abstract claim would be more informative with supporting details. In the revision we will add concise quantitative indicators (e.g., average relative gains on the primary benchmarks) and name the main baselines and settings, while keeping the statement within abstract length limits. revision: yes

  2. Referee: [Abstract] The routing mechanism is described only at a conceptual level ('dynamically routes token-level supervision ... based on their advantage gap and relative probabilities') with no equations, algorithm, or pseudocode, which prevents assessing whether the heuristic correctly separates capability-bearing tokens from information-asymmetry tokens.

    Authors: The abstract intentionally remains high-level. The full routing rule, advantage-gap formula, probability-based selection, and Algorithm 1 appear in Section 3. To address the concern we will insert a compact inline expression for the token-level routing decision in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: method defined conceptually without self-referential derivations or fitted predictions

Full rationale

The provided abstract and description introduce DOPD as a routing heuristic based on advantage gap and relative probabilities to address privilege illusion, but contain no equations, derivations, or first-principles claims that reduce to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked. The central contribution is presented as an empirical method validated on LLM/VLM tasks rather than a mathematical chain that collapses to its own definitions. This is the common case of a self-contained algorithmic proposal without detectable circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the approach is described conceptually without mathematical or implementation details.

pith-pipeline@v0.9.1-grok · 5804 in / 974 out tokens · 31751 ms · 2026-06-30T05:49:15.551150+00:00 · methodology

