EasyVideoR1: Easier RL for Video Understanding
Pith reviewed 2026-05-10 07:36 UTC · model grok-4.3
The pith
EasyVideoR1 makes reinforcement learning for video understanding practical by caching preprocessed video tensors for 1.47× faster training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EasyVideoR1 is a complete and efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement. It also includes a comprehensive task-aware reward system for 11 distinct video and image problem types, a mixed offline-online data training paradigm, joint image-video training with configurable pixel budgets, and an asynchronous multi-benchmark evaluation framework covering 22 benchmarks where reproduced accuracy aligns with official scores.
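For scale (our arithmetic, not a figure from the paper): if caching alone accounts for the 1.47× speedup, repeated decoding must have consumed roughly a third of each baseline training step:

$$ f_{\text{decode}} = 1 - \frac{T_{\text{cached}}}{T_{\text{baseline}}} = 1 - \frac{1}{1.47} \approx 0.32 $$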
What carries the argument
The offline preprocessing and tensor caching in the video RL training pipeline, which removes repeated decoding of high-dimensional visual inputs during training.
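To make the mechanism concrete, here is a minimal sketch of decode-once tensor caching, assuming deterministic preprocessing; `decode_and_sample`, `cache_key`, and the on-disk layout are illustrative names, not EasyVideoR1's actual API.

```python
import hashlib
from pathlib import Path

import torch

CACHE_DIR = Path("video_tensor_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(video_path: str, num_frames: int, size: int) -> Path:
    # Key on the file and the preprocessing parameters, so changing the
    # frame budget or resolution invalidates stale cache entries.
    digest = hashlib.sha1(f"{video_path}:{num_frames}:{size}".encode()).hexdigest()
    return CACHE_DIR / f"{digest}.pt"

def load_or_preprocess(video_path: str, num_frames: int = 16, size: int = 336) -> torch.Tensor:
    path = cache_key(video_path, num_frames, size)
    if path.exists():
        # Hot path during RL rollouts: no video decoding at all.
        return torch.load(path)
    # decode_and_sample is a hypothetical stand-in for the expensive
    # decode -> subsample -> resize -> normalize pipeline.
    frames = decode_and_sample(video_path, num_frames, size)
    torch.save(frames, path)
    return frames
```

Because RL revisits the same videos across many rollouts and epochs, the decode cost is paid once per video rather than once per sample draw, which is where a throughput gain of this kind would come from.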
If this is right
- Higher throughput allows more efficient policy learning on video tasks.
- The mixed training paradigm improves performance on challenging tasks.
- Joint image-video training allows the two modalities to reinforce each other (a config sketch follows this list).
- Reliable evaluation across many benchmarks enables better comparison and development.
- The reward system handles diverse video problem types uniformly.
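On the joint-training point above, a minimal config sketch of what independently configurable pixel budgets could look like; the field names and values are assumptions loosely modeled on patch-budget conventions in open VLMs, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PixelBudget:
    # Caps on pixels per image (or per video frame) after resizing.
    max_pixels: int
    min_pixels: int

@dataclass
class JointTrainingConfig:
    # Independent budgets keep video frames cheap while single images
    # retain higher resolution; the numbers here are illustrative.
    image_budget: PixelBudget = PixelBudget(max_pixels=1280 * 28 * 28, min_pixels=56 * 56)
    video_budget: PixelBudget = PixelBudget(max_pixels=320 * 28 * 28, min_pixels=56 * 56)
    video_max_frames: int = 16
```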
Where Pith is reading between the lines
- This approach could be extended to other modalities like audio or 3D data by similar caching strategies.
- Longer video sequences might become trainable if the caching scales well.
- The framework might reduce the barrier for researchers to experiment with RL on video VLMs.
- It could lead to models with better temporal reasoning if the rewards capture time-based aspects effectively.
Load-bearing premise
That offline preprocessing and tensor caching preserve all information needed for effective policy learning without measurable degradation on downstream video tasks.
What would settle it
Running the same training with and without the caching on a small model and comparing final benchmark accuracies; a large drop in the cached version would falsify the claim.
Original abstract
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EasyVideoR1, a specialized RL framework for training large vision-language models on video understanding tasks. It claims five main contributions: (1) a video RL pipeline using offline preprocessing and tensor caching for a 1.47× throughput gain, (2) a task-aware reward router covering 11 video/image problem types, (3) mixed offline-online trajectory training, (4) joint image-video training with independent pixel budgets, and (5) an asynchronous evaluation harness spanning 22 benchmarks with reproduced accuracies aligned to official reports.
Significance. If the throughput and performance claims hold, the framework would meaningfully lower the barrier to applying RLVR-style training to video modalities by addressing repeated decoding overhead and evaluation fragmentation. The concrete 1.47× throughput number, the 22-benchmark coverage, and the emphasis on reproducible alignment with official scores are practical strengths that could accelerate follow-on work in multimodal RL.
major comments (2)
- [§3] §3 (Training Pipeline, offline preprocessing subsection): The 1.47× throughput improvement is attributed to offline video preprocessing and tensor caching, yet no ablation isolates the caching step's effect on downstream RL policy performance. Without a direct comparison of reward signals, on-policy exploration quality, and final benchmark scores between cached tensors and fully online decoding, it remains possible that irreversible transforms (e.g., frame subsampling or resolution normalization) shift the input distribution and degrade learning, undermining the central claim that efficiency gains come without performance trade-offs. (A minimal sketch of such a comparison follows these comments.)
- [§4] §4 (Experiments): The statement that 'reproduced accuracy closely aligned with officially reported scores' is presented without clarifying whether the alignment was measured on base models only or after RL optimization. Because the mixed offline-online paradigm and the 11-type reward router are active during training, an explicit post-RL ablation confirming that the cached inputs do not cause measurable degradation relative to an online baseline is required to support the overall claim of effective RLVR for video.
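A minimal sketch of the head-to-head comparison the first comment asks for, assuming hypothetical `run_rl_training` and `evaluate` entry points; the essential property is that the two runs differ only in how frames reach the model.

```python
def ablate_caching(benchmarks: list[str], seed: int = 0) -> dict:
    # Identical RL runs that differ only in the input path:
    # precomputed cached tensors vs. decoding inside the training loop.
    cached_model = run_rl_training(input_mode="cached_tensors", seed=seed)
    online_model = run_rl_training(input_mode="online_decode", seed=seed)
    # Compare final scores; per the comment, reward curves and rollout
    # statistics from both runs should be logged and compared as well.
    return {
        bench: {
            "cached": evaluate(cached_model, bench),
            "online": evaluate(online_model, bench),
        }
        for bench in benchmarks
    }
```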
minor comments (2)
- The description of the 'unified routing' mechanism for the 11 problem types would benefit from a short pseudocode or diagram showing how task type is detected and routed to the appropriate reward function; a minimal router sketch follows these comments.
- [§5] Clarify whether the asynchronous multi-benchmark evaluator runs in parallel with training or only post-training, and report wall-clock overhead relative to synchronous evaluation.
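In the spirit of the first minor comment, one way such unified routing could work is a registry keyed by task type, sketched below. The two example types and their toy parsers are illustrative; the paper's actual 11 types and detection logic may differ.

```python
import re
from typing import Callable

RewardFn = Callable[[str, dict], float]
REWARD_REGISTRY: dict[str, RewardFn] = {}

def register_reward(task_type: str):
    # Decorator-based registration keeps the router open to the
    # "modular extension" the abstract mentions.
    def wrap(fn: RewardFn) -> RewardFn:
        REWARD_REGISTRY[task_type] = fn
        return fn
    return wrap

@register_reward("multiple_choice")
def mc_reward(response: str, sample: dict) -> float:
    # Toy parser: take the last standalone option letter in the response.
    letters = re.findall(r"\b([A-D])\b", response)
    return 1.0 if letters and letters[-1] == sample["answer"] else 0.0

@register_reward("temporal_grounding")
def grounding_reward(response: str, sample: dict) -> float:
    # Toy parser: first "start-end" pair of seconds in the response,
    # rewarded by temporal IoU against the gold interval.
    m = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", response)
    if not m:
        return 0.0
    ps, pe = float(m.group(1)), float(m.group(2))
    if pe <= ps:
        return 0.0
    gs, ge = sample["gt_interval"]
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def route_reward(sample: dict, response: str) -> float:
    # Task type comes from sample metadata; unknown types fail loudly
    # rather than silently returning zero reward.
    fn = REWARD_REGISTRY.get(sample["task_type"])
    if fn is None:
        raise KeyError(f"no reward registered for task type {sample['task_type']!r}")
    return fn(response, sample)
```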
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Training Pipeline, offline preprocessing subsection): The 1.47× throughput improvement is attributed to offline video preprocessing and tensor caching, yet no ablation isolates the caching step's effect on downstream RL policy performance. Without a direct comparison of reward signals, on-policy exploration quality, and final benchmark scores between cached tensors and fully online decoding, it remains possible that irreversible transforms (e.g., frame subsampling or resolution normalization) shift the input distribution and degrade learning, undermining the central claim that efficiency gains come without performance trade-offs.
Authors: We appreciate the referee's point that an explicit ablation isolating the effect of tensor caching on RL-specific metrics would provide stronger support for the claim. The preprocessing steps (frame extraction, subsampling, and normalization) are deterministic and identical to the online path, so the input distribution is preserved by design (a minimal fidelity check along these lines is sketched after these responses). The reported 1.47× gain comes from eliminating repeated decoding inside the RL loop. We will add a targeted ablation in the revised §3 (or a new appendix) that compares reward signals, on-policy exploration statistics, and final benchmark scores on a representative subset of tasks between the cached and fully online pipelines. This will directly address the concern about potential degradation. revision: yes
- Referee: [§4] §4 (Experiments): The statement that 'reproduced accuracy closely aligned with officially reported scores' is presented without clarifying whether the alignment was measured on base models only or after RL optimization. Because the mixed offline-online paradigm and the 11-type reward router are active during training, an explicit post-RL ablation confirming that the cached inputs do not cause measurable degradation relative to an online baseline is required to support the overall claim of effective RLVR for video.
Authors: We thank the referee for noting the ambiguity. The alignment statement refers to validation of the asynchronous evaluation harness on base models to confirm faithful reproduction of published numbers. To address the post-RL concern, we will revise §4 to explicitly state this scope and add a post-training comparison on a subset of the 22 benchmarks. This experiment will compare final model performance (after mixed offline-online RL with the reward router) using cached tensors versus an online-decoding baseline, confirming no measurable degradation from the caching approach. revision: yes
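A cheap fidelity check backing the determinism argument above, reusing the illustrative names from the caching sketch: re-run a few samples through the online path and require exactly equal tensors.

```python
import torch

def check_cache_fidelity(video_paths: list[str], num_frames: int = 16, size: int = 336) -> None:
    for vp in video_paths:
        cached = load_or_preprocess(vp, num_frames, size)  # served from cache
        fresh = decode_and_sample(vp, num_frames, size)    # hypothetical online path
        # Deterministic preprocessing should reproduce cached tensors exactly.
        assert torch.equal(cached, fresh), f"cache/online mismatch for {vp}"
```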
Circularity Check
No significant circularity detected in EasyVideoR1 framework claims
Full rationale
The paper describes an engineering framework with practical contributions: offline preprocessing for measured 1.47× throughput gains, a task-aware reward router, mixed data training, joint image-video pixel budgets, and an async evaluation suite. These are presented as implemented features with empirical results (throughput improvement and benchmark alignment) rather than derived predictions or first-principles results. No equations, fitted parameters, or self-referential definitions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The throughput claim is an external measurement against baselines, not a tautology. The evaluation framework is a tooling contribution, not a renaming or reduction of prior results. The central claims remain independent of any circular reduction to inputs.
Forward citations
Cited by 2 Pith papers
- Near-Future Policy Optimization: NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
- Co-Evolving Policy Distillation: CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...