pith. machine review for the scientific record.

arxiv: 2604.27083 · v1 · submitted 2026-04-29 · 💻 cs.LG

Recognition: unknown

Co-Evolving Policy Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords Co-Evolving Policy Distillation · policy distillation · RLVR · multimodal reasoning · expert integration · bidirectional distillation · parallel training · capability consolidation

The pith

Co-Evolving Policy Distillation integrates text, image, and video reasoning into one model by having experts train and distill bidirectionally during RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines two common ways to merge expert capabilities into a single model after initial training: mixing all tasks in one reinforcement learning run or training experts separately then distilling afterward. Mixing produces interference between skills, while the sequential route leaves transfer incomplete because the finished experts behave too differently from the student model. The proposed fix runs the experts in parallel and inserts bidirectional distillation steps while each is still learning, so the models stay aligned in behavior yet retain distinct strengths. Experiments show the resulting single model handles text, image, and video reasoning at once, beats both mixing and sequential baselines, and sometimes exceeds specialists trained on one domain only. A reader would care because the approach suggests a practical route to versatile post-trained models without the usual capability trade-offs.

Core claim

Co-Evolving Policy Distillation enables all-in-one integration of text, image, and video reasoning capabilities by encouraging parallel training of experts and introducing bidirectional off-policy distillation during each expert's ongoing RLVR training, with experts serving as mutual teachers to co-evolve. This produces more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout, leading to performance that significantly outperforms mixed RLVR and sequential OPD baselines and even surpasses domain-specific experts.

What carries the argument

Co-Evolving Policy Distillation (CoPD): a training procedure that runs multiple expert RLVR trainings in parallel and inserts bidirectional off-policy distillation steps at intervals during training rather than only after completion, so experts act as mutual teachers and adjust together.
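
To make the mechanics concrete, here is a minimal, hypothetical sketch of that loop in PyTorch. The tiny policies, the stubbed RLVR update, the interval k, and the temperature tau are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the CoPD loop described above: two experts run RLVR
# in parallel, and every k steps each distills toward a frozen snapshot of
# the other (bidirectional off-policy distillation). All hyperparameters and
# the stubbed RLVR objective are stand-ins, not the authors' implementation.
import torch
import torch.nn.functional as F

class TinyPolicy(torch.nn.Module):
    def __init__(self, dim=16, vocab=32):
        super().__init__()
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, x):  # (batch, dim) -> next-token logits
        return self.head(x)

def rlvr_step(policy, opt, batch):
    """Stand-in for one RLVR update (e.g., GRPO with verifiable rewards)."""
    loss = -policy(batch).logsumexp(dim=-1).mean()  # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()

def bidirectional_opd_step(pol_a, pol_b, opt_a, opt_b, batch, tau=1.0):
    """Each expert distills toward a frozen mid-training snapshot of the other."""
    with torch.no_grad():
        teach_a = F.softmax(pol_a(batch) / tau, dim=-1)
        teach_b = F.softmax(pol_b(batch) / tau, dim=-1)
    for student, teacher, opt in ((pol_a, teach_b, opt_a),
                                  (pol_b, teach_a, opt_b)):
        log_student = F.log_softmax(student(batch) / tau, dim=-1)
        loss = F.kl_div(log_student, teacher, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

experts = [TinyPolicy(), TinyPolicy()]  # e.g., image expert and video expert
opts = [torch.optim.SGD(m.parameters(), lr=1e-2) for m in experts]
k = 4  # distillation interval (assumed hyperparameter)
for step in range(20):
    for expert, opt in zip(experts, opts):
        domain_batch = torch.randn(8, 16)  # stand-in for that expert's domain data
        rlvr_step(expert, opt, domain_batch)
    if (step + 1) % k == 0:
        shared = torch.randn(8, 16)  # stand-in for shared off-policy prompts
        bidirectional_opd_step(experts[0], experts[1], opts[0], opts[1], shared)
```

The load-bearing difference from sequential OPD is visible in the loop structure: teacher snapshots are taken mid-training, so student and teacher distributions never drift far apart before each distillation step.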

If this is right

  • A single model can deliver strong performance across text, image, and video reasoning without the interference seen when all tasks train together.
  • Bidirectional distillation during ongoing training shrinks the behavioral gaps that limit knowledge transfer in sequential pipelines.
  • Mutual teaching while experts adapt preserves complementary knowledge that would otherwise be lost to divergence.
  • The parallel training pattern may open a new route to scaling post-training by co-evolving multiple experts instead of merging them after the fact.
  • The unified model can exceed the accuracy of models trained for only one reasoning domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parallel co-evolution idea could extend to additional modalities or larger sets of experts, potentially lowering total compute by reducing redundant sequential stages.
  • Bidirectional mid-training distillation might improve other knowledge-sharing settings such as supervised fine-tuning or alignment where behavioral consistency matters.
  • If the pattern holds at larger scale, training pipelines may shift toward simultaneous expert development rather than post-training merging or mixing.
  • One could measure whether the frequency of bidirectional steps can be tuned to optimize the trade-off between alignment and retained specialization.

Load-bearing premise

That inserting bidirectional distillation while experts are still running RLVR will align their behavioral patterns enough to avoid interference costs without erasing the distinct knowledge each expert holds.

What would settle it

A direct comparison experiment in which a CoPD-trained model performs no better than, or worse than, the strongest mixed RLVR run or sequential OPD run on the combined text-image-video tasks, or falls below separately trained domain experts on their individual tasks.

Original abstract

RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mixed RLVR suffers from inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though avoiding divergence, fails to fully absorb teacher capabilities due to large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which encourages parallel training of experts and introduces OPD during each expert's ongoing RLVR training rather than after complete expert training, with experts serving as mutual teachers (making OPD bidirectional) to co-evolve. This enables more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model parallel training pattern offered by CoPD may inspire a novel training scaling paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper provides a unified analysis of RLVR and OPD for consolidating multiple expert capabilities (text, image, video reasoning) into one model. It identifies two failure modes: mixed RLVR incurs inter-capability divergence costs, while sequential OPD (train experts fully then distill) suffers from large behavioral pattern gaps that prevent full absorption of teacher capabilities. The proposed Co-Evolving Policy Distillation (CoPD) trains experts in parallel and inserts bidirectional OPD (mutual teaching) during each expert's ongoing RLVR training rather than after completion. This is claimed to produce more consistent behavioral patterns while preserving complementary knowledge. Experiments reportedly show CoPD outperforming mixed RLVR and MOPD baselines and even surpassing the original domain-specific experts.

Significance. If the empirical claims hold, CoPD would represent a meaningful advance in post-training methods for multi-modal reasoning models by enabling co-evolution of experts rather than sequential or mixed approaches. The parallel training pattern could open a new scaling direction for RLVR-style methods, particularly if the bidirectional distillation reliably trades off consistency against complementarity without collapse.

major comments (2)
  1. [Abstract / Experiments] The central claim that CoPD surpasses domain-specific experts rests on the assertion that bidirectional OPD during ongoing RLVR produces sufficiently consistent behavioral patterns to avoid mixed-RLVR divergence while still retaining complementary knowledge. The abstract states this occurs but supplies no supporting quantitative evidence such as per-modality reward trajectories, KL or action-distribution divergence between experts before/after each OPD step, or an ablation that removes the bidirectional component. Without these measurements it is impossible to verify that the claimed trade-off is achieved rather than one side dominating (e.g., video expert collapsing toward text-like patterns). A hypothetical sketch of such a divergence probe follows this list.
  2. [Method] The method description is high-level: it is unclear how the OPD loss is scheduled relative to the RLVR objective, what temperature or weighting is used for the bidirectional distillation, or whether any auxiliary regularization is added to enforce pattern consistency. These details are load-bearing for reproducibility and for understanding why CoPD avoids the behavioral-gap problem identified for sequential OPD.
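
For concreteness, the divergence measurement requested in major comment 1 could be probed with something like the following sketch; the policy interface (inputs to next-token logits) is an assumption for illustration, not the paper's interface.

```python
# Hypothetical divergence probe: mean KL between two experts' next-token
# distributions on shared inputs, to be logged immediately before and after
# each bidirectional OPD step. The policy signature (inputs -> logits) is
# assumed for illustration; it is not taken from the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def expert_kl(policy_a, policy_b, inputs) -> float:
    """Mean KL(policy_a || policy_b) over the batch's output distributions."""
    log_p = F.log_softmax(policy_a(inputs), dim=-1)
    log_q = F.log_softmax(policy_b(inputs), dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(log_q, log_p, reduction="batchmean", log_target=True).item()
```
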
minor comments (2)
  1. [Abstract] The baseline 'MOPD' is referenced without expansion; the acronym should be defined on first use.
  2. [Conclusion] The final sentence claims the parallel training pattern 'may inspire a novel training scaling paradigm.' This is an interesting forward-looking statement but would be strengthened by a short discussion of potential limitations (e.g., communication overhead in parallel training, sensitivity to expert initialization).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas where additional evidence and implementation details would strengthen the manuscript. We address each major comment below and have revised the paper to incorporate the requested quantitative analyses, ablations, and methodological clarifications.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim that CoPD surpasses domain-specific experts rests on the assertion that bidirectional OPD during ongoing RLVR produces sufficiently consistent behavioral patterns to avoid mixed-RLVR divergence while still retaining complementary knowledge. The abstract states this occurs but supplies no supporting quantitative evidence such as per-modality reward trajectories, KL or action-distribution divergence between experts before/after each OPD step, or an ablation that removes the bidirectional component. Without these measurements it is impossible to verify that the claimed trade-off is achieved rather than one side dominating (e.g., video expert collapsing toward text-like patterns).

    Authors: We agree that intermediate quantitative diagnostics would make the mechanism more transparent. The submitted manuscript reports only final-task performance, where CoPD exceeds both mixed RLVR and the original domain-specific experts on text, image, and video reasoning benchmarks. This outcome is consistent with successful retention of complementary knowledge without collapse or divergence, but we did not include the requested per-modality reward trajectories, policy KL divergences, or a bidirectional ablation. In the revised version we will add (i) training curves of per-expert rewards, (ii) KL and action-distribution divergence statistics measured immediately before and after each bidirectional OPD step, and (iii) an ablation that disables the bidirectional component while keeping parallel RLVR. These additions will allow direct verification of the claimed consistency-complementarity balance. revision: yes

  2. Referee: [Method] The method description is high-level: it is unclear how the OPD loss is scheduled relative to the RLVR objective, what temperature or weighting is used for the bidirectional distillation, or whether any auxiliary regularization is added to enforce pattern consistency. These details are load-bearing for reproducibility and for understanding why CoPD avoids the behavioral-gap problem identified for sequential OPD.

    Authors: We acknowledge that the method section in the original submission remained at a conceptual level. The revised manuscript will expand the description with a detailed algorithm box that specifies: the interleaving schedule (OPD loss applied every k RLVR gradient steps), the temperature used for the bidirectional distillation softmax, the scalar weighting coefficient balancing the RLVR and OPD objectives, and any auxiliary consistency regularizer (if present). We will also clarify how the mutual-teacher structure is realized in the parallel training loop, thereby addressing why the online bidirectional setting mitigates the large behavioral-gap issue observed in sequential OPD. revision: yes
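
As a reading aid, the interleaved objective this response describes might take a shape like the following; lambda (weighting), tau (temperature), and the indicator schedule are illustrative guesses, not the paper's stated form.

```latex
% Hypothetical shape of the interleaved objective: expert i's RLVR loss plus
% a distillation term toward expert j, applied every k gradient steps.
% lambda, tau, and the schedule are illustrative; the paper's form may differ.
\mathcal{L}^{(i)}_t \;=\; \mathcal{L}^{(i)}_{\mathrm{RLVR}}
\;+\; \lambda\,\mathbb{1}\!\left[t \bmod k = 0\right]\,
\mathrm{KL}\!\left(\pi^{\tau}_{\theta_i}(\cdot \mid x)\,\middle\|\,
\pi^{\tau}_{\theta_j}(\cdot \mid x)\right), \qquad j \neq i .
```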

Circularity Check

0 steps flagged

No circularity: empirical training proposal with no derivation chain

Full rationale

The paper proposes Co-Evolving Policy Distillation (CoPD) as a practical, empirical training procedure that runs bidirectional OPD in parallel with each expert's ongoing RLVR. It offers a descriptive analysis of limitations in mixed RLVR (inter-capability divergence) and sequential OPD (behavioral pattern gaps), then introduces the co-evolution method to address them. No mathematical equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims of all-in-one integration and outperformance rest on experimental validation rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. The central assumption about consistent behavioral patterns is an empirical hypothesis tested in experiments, not a quantity defined by construction from the inputs. This is a standard non-circular empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Based solely on the abstract, the central claim rests on standard RL and distillation assumptions plus the new training pattern. No explicit free parameters or invented physical entities are described. The method itself is the primary addition.

axioms (2)
  • domain assumption: RLVR and OPD are effective base paradigms for post-training expert capabilities into models.
    The unified analysis and proposed improvement presuppose these paradigms work as described in prior work.
  • ad hoc to paper: Behavioral pattern consistency can be improved via mutual distillation without losing complementary knowledge.
    This is the core premise enabling the co-evolution benefit.
invented entities (1)
  • Co-Evolving Policy Distillation (CoPD) · no independent evidence
    purpose: A training procedure that runs parallel RLVR with bidirectional OPD to co-evolve experts.
    This is the novel method introduced to address the identified limitations.

pith-pipeline@v0.9.0 · 5504 in / 1510 out tokens · 84937 ms · 2026-05-07T08:18:14.590663+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

Reference graph

Works this paper leans on

56 extracted references · 44 canonical work pages · cited by 1 Pith paper · 29 internal anchors

  1. [1]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  2. [2]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783

  3. [3]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  4. [4]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  5. [5]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026. URL https://arxiv.org/abs/2503.06749

  6. [6]

    R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

    Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and Jiaxing Huang. R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo, 2025. URL https://arxiv.org/abs/2505.16673

  7. [7]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  8. [8]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  9. [9]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  10. [10]

    Time-r1: Post-training large vision language model for temporal video grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, and Qin Jin. Time-r1: Post-training large vision language model for temporal video grounding, 2025. URL https://arxiv.org/abs/2503.13377

  11. [11]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen...

  12. [12]

    URL https://arxiv.org/abs/2602.02276

  13. [13]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zho...

  14. [14]

    Scaling laws for optimal data mixtures

    Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures, 2025. URL https://arxiv.org/abs/2507.09404

  15. [15]

    On-Policy Distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

  16. [16]

    Mimo-v2-flash technical report

    Core Team, Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang,...

  17. [17]

    URL https://arxiv.org/abs/2601.02780

  18. [18]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  19. [19]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris

  20. [20]

    DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

  21. [21]

    Areal: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025. URL https://arxiv.org/abs/2505.24298

  22. [22]

    Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026

    Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, and Lijun Wu. Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods. arXiv preprint arXiv:2601.21821, 2026

  23. [23]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video. arXiv preprint arXiv:2512.03043, 2025

  24. [24]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning, 2025. URL https://arxiv.org/abs/2504.06958

  25. [25]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,...

  26. [26]

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025. URL https://arxiv.org/abs/2409.02813

  27. [27]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

  28. [28]

    Measuring multimodal mathematical reasoning with MATH-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-vision dataset. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=QWTCcxMpPA

  29. [29]

    Zerobench: An impossible visual benchmark for contemporary large multimodal models. arXiv preprint arXiv:2502.09696, 2025

    Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexa...

  30. [30]

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, YiFan Zhang, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Xiao Zong, Yida Xu, Peiqing Yang, Zhimin Bao, Muxi Diao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Wan...

  31. [31]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024. URL https://arxiv.org/abs/2403.14624

  32. [32]

    Aime problems and solutions

    MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

  33. [33]

    Matharena: Evaluating llms on uncontaminated math competitions, February 2025

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  34. [34]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874

  35. [35]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858

  36. [36]

    Video-Holmes: Can MLLM think like Holmes for complex video reasoning? CoRR, abs/2505.21374, 2025

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025. URL https://arxiv.org/abs/2505.21374

  37. [37]

    Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. URL https://arxiv.org/abs/2311.17005

  38. [38]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding,

  39. [39]

    URL https://arxiv.org/abs/2501.12380

  40. [40]

    Videomathqa: Benchmarking mathematical reasoning via multimodal understanding in videos, 2025

    Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, and Fahad Shahbaz Khan. Videomathqa: Benchmarking mathematical reasoning via multimodal understanding in videos, 2025. URL https://arxiv.org/abs/2506.05349

  41. [41]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  42. [42]

    EasyVideoR1: Easier RL for Video Understanding

    Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, and Jiaqi Wang. Easyvideor1: Easier rl for video understanding, 2026. URL https://arxiv.org/abs/2604.16893

  43. [43]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  44. [44]

    Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025

  45. [45]

    OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...

  46. [46]

    URL https://arxiv.org/abs/2412.16720

  47. [47]

    KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

    Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, and Hua Wu. Knowrl: Boosting llm reasoning via reinforcement learning with minimal-sufficient knowledge guidance, 2026. URL https://arxiv.org/abs/2604.12627

  48. [48]

    S-grpo: Early exit via reinforcement learning in reasoning models. arXiv preprint arXiv:2505.07686, 2025

    Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models, 2025. URL https://arxiv.org/abs/2505.07686

  49. [49]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW

  50. [50]

    MiniLLM: On-Policy Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: On-policy distillation of large language models, 2026. URL https://arxiv.org/abs/2306.08543

  51. [51]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026. URL https://arxiv.org/abs/2601.18734

  52. [52]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802

  53. [53]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe, 2026. URL https://arxiv.org/abs/2604.13016

  54. [54]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr, 2026. URL https://arxiv.org/abs/2604.03128

  55. [55]

    Near-Future Policy Optimization

    Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, and Jiaqi Wang. Near-future policy optimization, 2026. URL https://arxiv.org/abs/2604.20733

  56. [56]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347