ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Kanghui Tian; Sheng Xia; Shuai Dong; Siyuan Liu; Yi Wang; Ziang Yan

arXiv: 2606.05718 · v1 · pith:GETBTJJRnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI · cs.LG

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

Pith reviewed 2026-06-28 01:57 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords multimodal on-policy distillation · visual cues · recoverable privilege · sink-token cross-attention · train-test mismatch · grounded reasoning · Qwen3-VL


The pith

Replacing answer-side privilege with recoverable visual cues from the input improves multimodal on-policy distillation by avoiding train-test mismatch and shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard answer-based supervision in on-policy distillation creates a mismatch because the teacher relies on signals unavailable to the student at inference, which encourages imitation of shortcuts instead of visually grounded reasoning. ViCuR substitutes this with visual cues consisting of query-related evidence already present in the input image, making the teacher's signals recoverable by the student. It adds a lightweight cue recovery module that uses dedicated sink-token cross-attention only during prefill to gather relevant visual evidence into an internal state. Experiments across seven benchmarks with Qwen3-VL 2B and 8B students show consistent gains over answer-based baselines and further improvements when extending to stronger teachers. This demonstrates that the form of teacher privilege itself affects whether distillation produces grounded multimodal reasoning.

Core claim

ViCuR shows that visual cues derived from the same input available at inference can replace answer-side privilege as supervision in multimodal on-policy distillation. A sink-token cross-attention module recovers the cues into the student's representation during prefill, without any inference-time change or auxiliary losses. The result is an average gain of 1.19 and 1.24 points over answer-based self-distillation for the 2B and 8B models, with additional gains when combined with stronger teachers.

What carries the argument

The cue recovery module, which applies dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence from the input into an internal representation usable for the student's reasoning.
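To make the mechanism concrete, here is a minimal sketch of what sink-token cross-attention pooling could look like. It is an illustration under stated assumptions, not the paper's formulation: the text above gives no equations, so the single-head scaled dot-product form, the function name `sink_cross_attention`, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sink_cross_attention(sink_query, visual_keys, visual_values):
    """Pool visual token values into one vector for a single sink token.

    Assumed form (hypothetical, single head):
        weights = softmax(q . k_i / sqrt(d)),  pooled = sum_i weights_i * v_i
    """
    d = len(sink_query)
    scores = [
        sum(q * k for q, k in zip(sink_query, key)) / math.sqrt(d)
        for key in visual_keys
    ]
    weights = softmax(scores)
    dim_v = len(visual_values[0])
    pooled = [
        sum(w * v[j] for w, v in zip(weights, visual_values))
        for j in range(dim_v)
    ]
    return pooled, weights


# Toy prefill with three visual tokens; the second key aligns best with the sink query.
sink_query = [1.0, 0.0]
visual_keys = [[0.0, 1.0], [4.0, 0.0], [0.0, -1.0]]
visual_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pooled, weights = sink_cross_attention(sink_query, visual_keys, visual_values)
```

In this toy setup the pooled vector is dominated by the second visual token, which is the behavior the description attributes to the module: query-relevant evidence is gathered into one internal state during prefill. How that state is consumed by later decoding steps is not specified in the text above and is left out here.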

If this is right

ViCuR raises average benchmark scores by 1.19 points for 2B students and 1.24 points for 8B students relative to answer-based on-policy self-distillation.
The same visual-cue approach further improves stronger-teacher on-policy distillation by 0.64 points (2B) and 1.08 points (8B).
Gains remain consistent on out-of-domain tasks at the 8B scale.
The design choice of teacher privilege proves comparable in importance to the choice of teacher strength for multimodal on-policy distillation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Privilege design focused on input-recoverable signals may apply to other distillation or alignment settings where output-side supervision risks encouraging non-grounded behavior.
The sink-token cross-attention pattern could be tested as a general mechanism for injecting auxiliary input-derived information into language-model prefill without architectural changes at inference.
If the recovery module scales, it opens a route to curriculum-style cue provision that varies with query difficulty while keeping the same inference interface.

Load-bearing premise

The cue recovery module aggregates task-relevant visual evidence into a form the student can actually use for grounded reasoning without introducing new shortcuts or requiring any change to the inference interface.

What would settle it

A controlled run that removes the cue recovery module or supplies the same visual cues to the teacher but makes them unavailable to the student at inference, checking whether the reported performance gains over answer-based distillation disappear.

Original abstract

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.
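The on-policy distillation setup described in the abstract can be sketched as a per-token divergence between student and teacher next-token distributions, evaluated along a trajectory the student sampled itself. This is a rough sketch under assumptions: the abstract does not state the loss, so the reverse KL used here is one common choice rather than a confirmed detail, and the names `reverse_kl` and `opd_loss` are hypothetical.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def opd_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-sampled trajectory.

    student_steps[t] and teacher_steps[t] are next-token distributions at
    position t of the same student-generated prefix. The teacher may be
    conditioned on extra privileged context (a reference answer in the
    baseline, a visual cue in the ViCuR setting); that conditioning happens
    upstream and is not modeled in this sketch.
    """
    assert len(student_steps) == len(teacher_steps) and student_steps
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy example: 3-token vocabulary, 2-token trajectory (values are illustrative only).
student = [[0.7, 0.2, 0.1], [0.5, 0.25, 0.25]]
teacher_same = [[0.7, 0.2, 0.1], [0.5, 0.25, 0.25]]
teacher_diff = [[0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
loss_same = opd_loss(student, teacher_same)
loss_diff = opd_loss(student, teacher_diff)
```

The sketch shows where the train-test mismatch enters: the only thing that changes between an answer-privileged teacher and a cue-privileged teacher is what the teacher distributions were conditioned on, while the student side of the loss stays the same.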

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ViCuR swaps answer-side privilege for recoverable visual cues in multimodal on-policy distillation and reports modest consistent gains, but the sink-token module's actual contribution remains lightly supported.


The core move here is replacing answer-based teacher signals with visual cues drawn from the input image, on the grounds that this keeps the supervision recoverable at inference and reduces shortcut imitation. They implement this with a lightweight cue recovery module that adds sink-token cross-attention only during prefill, leaving the inference path unchanged. That framing is the clearest novelty relative to prior OPD work.

The reported results show steady lifts: +1.19 and +1.24 average over answer-based self-distillation on seven benchmarks for the 2B and 8B Qwen3-VL students, plus further gains when paired with a stronger teacher. Out-of-domain improvements at the 8B scale are also noted. These numbers are small but directionally consistent, which is the main positive.

The soft spot is the lack of detail on what the module actually does. The abstract gives no equations for the sink token, no specification of which visual tokens are attended, and no ablation isolating the recovery component from other training changes. The central assumption—that the internal representation produced by this module supports grounded reasoning rather than a new shortcut—therefore sits on thin evidence. If the full experiments do not include controls for this, the gains could trace to any number of unmentioned factors.

This is a targeted methods paper for people already working on efficient distillation for vision-language models. It engages the mismatch problem directly and ships empirical comparisons, so it clears the bar for peer review even though the mechanism needs tighter validation.

Referee Report

3 major / 1 minor

Summary. The paper proposes ViCuR, a visually grounded privileged-teacher distillation framework for multimodal reasoning that replaces answer-side privilege with visual cues derived from the input image. It introduces a lightweight cue recovery module using dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation usable by the student at inference time, without altering the inference interface or adding auxiliary losses. Experiments on seven benchmarks with Qwen3-VL-2B and 8B students report consistent gains over answer-based on-policy self-distillation (+1.19 and +1.24 average) and further improvements when extending to stronger-teacher OPD (+0.64 and +1.08), including out-of-domain gains at the 8B scale.

Significance. If the cue recovery module enables recoverable visual evidence for grounded reasoning without new shortcuts or train-test mismatches, the result would be significant for multimodal on-policy distillation. It empirically demonstrates that the form of teacher privilege matters as much as teacher strength, which could influence design choices in vision-language model distillation. The multi-benchmark evaluation and out-of-domain results provide a concrete basis for assessing impact if the mechanism is validated.

major comments (3)

[Abstract] The reported gains of +1.19 and +1.24 are presented without details on experimental controls, statistical significance, ablation of the cue recovery module, or how visual cues are selected. This makes it impossible to attribute improvements specifically to recoverable visual privilege rather than other factors.
[Method, cue recovery module description] The sink-token cross-attention mechanism is described at a high level with no equations for initialization of the sink token, attention computation, selection of visual tokens, or how the aggregated representation is consumed by the student at inference. This leaves the load-bearing assumption that the module produces usable internal representations for grounded reasoning unverified.
[Experiments] No ablation studies isolating the cue recovery module's contribution are mentioned, so the central claim that gains arise from visual-cue privilege (vs. unmentioned training dynamics changes) cannot be assessed. The absence of such controls directly affects the soundness of the +1.19/+1.24 and +0.64/+1.08 results.
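One way the significance concern in the comments above could be probed is a paired bootstrap over per-benchmark deltas. The following is a minimal sketch, assuming per-benchmark scores for the baseline and for ViCuR are available; the function name `paired_bootstrap_ci` is hypothetical, and the scores in the example are placeholders, not values from the paper.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def paired_bootstrap_ci(baseline, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-benchmark gain.

    Resamples benchmarks with replacement, keeping each baseline/treatment
    pair together, and returns (mean_gain, ci_low, ci_high).
    """
    assert len(baseline) == len(treatment) and baseline
    deltas = [t - b for b, t in zip(baseline, treatment)]
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        mean([deltas[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(deltas), lo, hi


# Placeholder scores for seven benchmarks; NOT results reported in the paper.
baseline = [50.0, 60.0, 40.0, 70.0, 55.0, 65.0, 45.0]
treatment = [51.0, 61.5, 40.5, 71.0, 56.0, 66.5, 46.0]
gain, lo, hi = paired_bootstrap_ci(baseline, treatment)
```

With only seven benchmarks the interval reflects variation across benchmarks, not run-to-run variance, so it would complement rather than replace repeated training seeds and the module ablation the report asks for.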

minor comments (1)

[Abstract] The abstract would be clearer with an explicit list of the seven benchmarks and a one-sentence statement of the overall average metric used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies opportunities to improve clarity around experimental details, the cue recovery module, and supporting ablations. We address each major comment point by point below and will incorporate revisions to strengthen the manuscript.


Referee: [Abstract] The reported gains of +1.19 and +1.24 are presented without details on experimental controls, statistical significance, ablation of the cue recovery module, or how visual cues are selected. This makes it impossible to attribute improvements specifically to recoverable visual privilege rather than other factors.

Authors: We agree the abstract is concise and omits granular details due to length constraints. In revision we will update the abstract to note that gains reflect controlled on-policy comparisons averaged across seven benchmarks with Qwen3-VL-2B/8B students, that visual cues are query-related evidence extracted from the input image, and that full controls, significance testing, and module ablations appear in the experiments section. This will better support attribution to recoverable visual privilege.
Revision: yes

Referee: [Method, cue recovery module description] The sink-token cross-attention mechanism is described at a high level with no equations for initialization of the sink token, attention computation, selection of visual tokens, or how the aggregated representation is consumed by the student at inference. This leaves the load-bearing assumption that the module produces usable internal representations for grounded reasoning unverified.

Authors: The full method section provides a textual description of the sink-token cross-attention. To address the request for precision, the revised manuscript will add explicit equations covering sink-token initialization, the cross-attention formulation, criteria for selecting and aggregating visual tokens, and the mechanism by which the resulting representation is made available to the student during inference without altering the interface. These additions will make the load-bearing assumption directly verifiable.
Revision: yes

Referee: [Experiments] No ablation studies isolating the cue recovery module's contribution are mentioned, so the central claim that gains arise from visual-cue privilege (vs. unmentioned training dynamics changes) cannot be assessed. The absence of such controls directly affects the soundness of the +1.19/+1.24 and +0.64/+1.08 results.

Authors: We acknowledge that isolating the cue recovery module is essential for the central claim. The revised manuscript will add dedicated ablation studies comparing the full ViCuR setup against variants without the module and against controls that hold other training dynamics constant. These results will be reported alongside the main tables to demonstrate that performance gains are attributable to the recoverable visual-cue privilege rather than ancillary factors.
Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with benchmark gains


The paper presents an empirical method (ViCuR) for multimodal on-policy distillation using visual cues and a sink-token cross-attention module, with reported performance improvements on seven benchmarks. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the provided text. The central claims rest on experimental comparisons rather than any reduction of outputs to inputs by construction. The load-bearing assumption about the cue recovery module is a modeling choice open to empirical test, not a definitional or self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract relies on standard assumptions of on-policy distillation and multimodal model training. No free parameters, ad-hoc axioms, or invented entities are explicitly introduced beyond the cue recovery module itself.

axioms (1)

Domain assumption: On-policy distillation improves reasoning when teacher supervision is aligned with student-accessible information.
Basis: implicit in the motivation for replacing answer-side privilege.

invented entities (1)

Cue recovery module with sink-token cross-attention (no independent evidence)
Purpose: aggregate task-relevant visual evidence during prefill without changing the inference interface.
Note: new component introduced to support recoverable visual cues.

pith-pipeline@v0.9.1-grok · 5815 in / 1404 out tokens · 18173 ms · 2026-06-28T01:57:10.408591+00:00 · methodology



[1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024.

[2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan L... Qwen3-vl technical report. arXiv preprint arXiv:2511.21631,

[3] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021.

[4] Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling. arXiv preprint arXiv:2512.05665,

[5] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024.

[6] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025.

[7] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2017.

[8] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In International Conference on Learning Representations, 2024.

[9] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[11] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, 2023.

[12] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. In International Conference on Learning Representations, 2026.

[13] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.

[14] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.

[15] Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. In Advances in Neural Information Processing Systems, 2023.

[16] Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, and Jian Luan. Video-opd: Efficient post-training of multimodal large language models for temporal video grounding via on-policy distillation. arXiv preprint arXiv:2602.02994, 2026.

[17] Yang Li, Erik Nijkamp, Semih Yavuz, and Shafiq Joty. Learning from language feedback via variational policy distillation. arXiv preprint arXiv:2605.15113, 2026.

[18] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016,

[19] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. In Advances in Neural Information Processing Systems, 2024.

[20] Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, and Hongbo Jin. Visd: Enhancing video reasoning via structured self-distillation. arXiv preprint arXiv:2605.06094, 2026.

[21] Xu Liu, Guikun Chen, and Wenguan Wang. Sinktrack: Attention sink based context anchoring for large language models. In International Conference on Learning Representations, 2026.

[22] Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation.

[23] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations,

[24] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Pro... 2021.

[25] Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying opd: Length inflation and stabilization strategies for large language models. arXiv preprint arXiv:2604.08527,

[26] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, 2022.

[27] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284,

[28] Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, and Jiachen Sun. Crisp: Compressed reasoning via iterative self-policy distillation. arXiv preprint arXiv:2603.05433,

[29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[30] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[31] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.

[32] Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, et al. Llava-mod: Making llava tiny via moe knowledge distillation. arXiv preprint arXiv:2408.15881, 2024.

[33] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. The Journal of Machine Learning Research, 2015.

[34] Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025.

[35] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations,

[36] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr. arXiv preprint arXiv:2604.03128,

[37] Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026.

[38] Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, and Yaojie Lu. Vision-opd: Learning to see fine details for multimodal llms via on-policy self-distillation. arXiv preprint arXiv:2605.18740, 2026.

[39] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. ... 2024.

[40] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, 2024.

[41] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.

[42] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. In International Conference on Learning Representations, 2025.

A. Training and Eval...

- The other chord is divided into segments of length 𝑥 and 8.

According to the Intersecting Chords Theorem: (4)(6) = (𝑥)(8). Now, solve for 𝑥: 24 = 8𝑥, so 𝑥 = 24/8 = 3. So, the value of 𝑥 is 3.

E.2. Case 2: MathVista — Generalization Beyond Geometry

Qualitative observations. This example from MathVista examines whether cue recovery extends beyond structured geometry d...

1. **Locate the Pretrain Loss:** Find the value `1.80` on the horizontal axis (x-axis), labeled "Pretrain Loss".
2. **Identify the corresponding Acc scores:** Look at the vertical axis (y-axis), labeled "Acc", to see the accuracy for each model at this point.
3. **Compare the values:** The chart shows the following accuracy values at `Pretrain Loss = 1.80`: ...

1. **Locate the Pretrain Loss:** Find the value 1.80 on the horizontal axis (Pretrain Loss).
2. **Identify the corresponding Accuracy:** Look at the vertical positions of the data points on the chart for each model at this specific Pretrain Loss value.
3. **Compare the scores:** The chart shows the following accuracy values at Pretrain Loss = 1.80: - **LLaMA...

1. **Locate the Pretrain Loss value:** Find the value `1.80` on the horizontal axis (x-axis), labeled "Pretrain Loss".
2. **Identify the corresponding accuracy:** Look at the vertical position (y-axis, labeled "Acc") for each model's line at this specific point.
3. **Compare the values:** The y-axis represents accuracy, with higher values indicating better p...