EasyVideoR1: Easier RL for Video Understanding
Pith reviewed 2026-05-10 07:36 UTC · model grok-4.3
The pith
EasyVideoR1 makes reinforcement learning for video understanding practical by caching preprocessed video tensors for 1.47× faster training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EasyVideoR1 is a complete and efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement. It also includes a comprehensive task-aware reward system for 11 distinct video and image problem types, a mixed offline-online data training paradigm, joint image-video training with configurable pixel budgets, and an asynchronous multi-benchmark evaluation framework covering 22 benchmarks where reproduced accuracy aligns with official scores.
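For scale (our arithmetic, not a figure from the paper): if caching alone accounts for the 1.47× speedup, repeated decoding must have consumed roughly a third of each baseline training step:

$$ f_{\text{decode}} = 1 - \frac{T_{\text{cached}}}{T_{\text{baseline}}} = 1 - \frac{1}{1.47} \approx 0.32 $$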
What carries the argument
The offline preprocessing and tensor caching in the video RL training pipeline, which removes repeated decoding of high-dimensional visual inputs during training.
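To make the mechanism concrete, here is a minimal sketch of decode-once tensor caching, assuming deterministic preprocessing; `decode_and_sample`, `cache_key`, and the on-disk layout are illustrative names, not EasyVideoR1's actual API.

```python
import hashlib
from pathlib import Path

import torch

CACHE_DIR = Path("video_tensor_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(video_path: str, num_frames: int, size: int) -> Path:
    # Key on the file and the preprocessing parameters, so changing the
    # frame budget or resolution invalidates stale cache entries.
    digest = hashlib.sha1(f"{video_path}:{num_frames}:{size}".encode()).hexdigest()
    return CACHE_DIR / f"{digest}.pt"

def load_or_preprocess(video_path: str, num_frames: int = 16, size: int = 336) -> torch.Tensor:
    path = cache_key(video_path, num_frames, size)
    if path.exists():
        # Hot path during RL rollouts: no video decoding at all.
        return torch.load(path)
    # decode_and_sample is a hypothetical stand-in for the expensive
    # decode -> subsample -> resize -> normalize pipeline.
    frames = decode_and_sample(video_path, num_frames, size)
    torch.save(frames, path)
    return frames
```

Because RL revisits the same videos across many rollouts and epochs, the decode cost is paid once per video rather than once per sample draw, which is where a throughput gain of this kind would come from.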
If this is right
- Higher throughput allows more efficient policy learning on video tasks.
- The mixed training paradigm improves performance on challenging tasks.
- Joint image-video training allows the two modalities to reinforce each other (a config sketch follows this list).
- Reliable evaluation across many benchmarks enables better comparison and development.
- The reward system handles diverse video problem types uniformly.
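On the joint-training point above, a minimal config sketch of what independently configurable pixel budgets could look like; the field names and values are assumptions loosely modeled on patch-budget conventions in open VLMs, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PixelBudget:
    # Caps on pixels per image (or per video frame) after resizing.
    max_pixels: int
    min_pixels: int

@dataclass
class JointTrainingConfig:
    # Independent budgets keep video frames cheap while single images
    # retain higher resolution; the numbers here are illustrative.
    image_budget: PixelBudget = PixelBudget(max_pixels=1280 * 28 * 28, min_pixels=56 * 56)
    video_budget: PixelBudget = PixelBudget(max_pixels=320 * 28 * 28, min_pixels=56 * 56)
    video_max_frames: int = 16
```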
Where Pith is reading between the lines
- This approach could be extended to other modalities like audio or 3D data by similar caching strategies.
- Longer video sequences might become trainable if the caching scales well.
- The framework might reduce the barrier for researchers to experiment with RL on video VLMs.
- It could lead to models with better temporal reasoning if the rewards capture time-based aspects effectively.
Load-bearing premise
That offline preprocessing and tensor caching preserve all information needed for effective policy learning without measurable degradation on downstream video tasks.
What would settle it
Running the same training with and without the caching on a small model and comparing final benchmark accuracies; a large drop in the cached version would falsify the claim.
Original abstract
Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents EasyVideoR1, a specialized RL framework for training large vision-language models on video understanding tasks. It claims five main contributions: (1) a video RL pipeline using offline preprocessing and tensor caching for a 1.47× throughput gain, (2) a task-aware reward router covering 11 video/image problem types, (3) mixed offline-online trajectory training, (4) joint image-video training with independent pixel budgets, and (5) an asynchronous evaluation harness spanning 22 benchmarks with reproduced accuracies aligned to official reports.
Significance. If the throughput and performance claims hold, the framework would meaningfully lower the barrier to applying RLVR-style training to video modalities by addressing repeated decoding overhead and evaluation fragmentation. The concrete 1.47× throughput number, the 22-benchmark coverage, and the emphasis on reproducible alignment with official scores are practical strengths that could accelerate follow-on work in multimodal RL.
major comments (2)
- [§3] §3 (Training Pipeline, offline preprocessing subsection): The 1.47× throughput improvement is attributed to offline video preprocessing and tensor caching, yet no ablation isolates the caching step's effect on downstream RL policy performance. Without a direct comparison of reward signals, on-policy exploration quality, and final benchmark scores between cached tensors and fully online decoding, it remains possible that irreversible transforms (e.g., frame subsampling or resolution normalization) shift the input distribution and degrade learning, undermining the central claim that efficiency gains come without performance trade-offs. (A minimal sketch of such a comparison follows these comments.)
- [§4] §4 (Experiments): The statement that 'reproduced accuracy closely aligned with officially reported scores' is presented without clarifying whether the alignment was measured on base models only or after RL optimization. Because the mixed offline-online paradigm and the 11-type reward router are active during training, an explicit post-RL ablation confirming that the cached inputs do not cause measurable degradation relative to an online baseline is required to support the overall claim of effective RLVR for video.
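A minimal sketch of the head-to-head comparison the first comment asks for, assuming hypothetical `run_rl_training` and `evaluate` entry points; the essential property is that the two runs differ only in how frames reach the model.

```python
def ablate_caching(benchmarks: list[str], seed: int = 0) -> dict:
    # Identical RL runs that differ only in the input path:
    # precomputed cached tensors vs. decoding inside the training loop.
    cached_model = run_rl_training(input_mode="cached_tensors", seed=seed)
    online_model = run_rl_training(input_mode="online_decode", seed=seed)
    # Compare final scores; per the comment, reward curves and rollout
    # statistics from both runs should be logged and compared as well.
    return {
        bench: {
            "cached": evaluate(cached_model, bench),
            "online": evaluate(online_model, bench),
        }
        for bench in benchmarks
    }
```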
minor comments (2)
- The description of the 'unified routing' mechanism for the 11 problem types would benefit from a short pseudocode or diagram showing how task type is detected and routed to the appropriate reward function; a minimal router sketch follows these comments.
- [§5] Clarify whether the asynchronous multi-benchmark evaluator runs in parallel with training or only post-training, and report wall-clock overhead relative to synchronous evaluation.
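In the spirit of the first minor comment, one way such unified routing could work is a registry keyed by task type, sketched below. The two example types and their toy parsers are illustrative; the paper's actual 11 types and detection logic may differ.

```python
import re
from typing import Callable

RewardFn = Callable[[str, dict], float]
REWARD_REGISTRY: dict[str, RewardFn] = {}

def register_reward(task_type: str):
    # Decorator-based registration keeps the router open to the
    # "modular extension" the abstract mentions.
    def wrap(fn: RewardFn) -> RewardFn:
        REWARD_REGISTRY[task_type] = fn
        return fn
    return wrap

@register_reward("multiple_choice")
def mc_reward(response: str, sample: dict) -> float:
    # Toy parser: take the last standalone option letter in the response.
    letters = re.findall(r"\b([A-D])\b", response)
    return 1.0 if letters and letters[-1] == sample["answer"] else 0.0

@register_reward("temporal_grounding")
def grounding_reward(response: str, sample: dict) -> float:
    # Toy parser: first "start-end" pair of seconds in the response,
    # rewarded by temporal IoU against the gold interval.
    m = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", response)
    if not m:
        return 0.0
    ps, pe = float(m.group(1)), float(m.group(2))
    if pe <= ps:
        return 0.0
    gs, ge = sample["gt_interval"]
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def route_reward(sample: dict, response: str) -> float:
    # Task type comes from sample metadata; unknown types fail loudly
    # rather than silently returning zero reward.
    fn = REWARD_REGISTRY.get(sample["task_type"])
    if fn is None:
        raise KeyError(f"no reward registered for task type {sample['task_type']!r}")
    return fn(response, sample)
```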
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Training Pipeline, offline preprocessing subsection): The 1.47× throughput improvement is attributed to offline video preprocessing and tensor caching, yet no ablation isolates the caching step's effect on downstream RL policy performance. Without a direct comparison of reward signals, on-policy exploration quality, and final benchmark scores between cached tensors and fully online decoding, it remains possible that irreversible transforms (e.g., frame subsampling or resolution normalization) shift the input distribution and degrade learning, undermining the central claim that efficiency gains come without performance trade-offs.
Authors: We appreciate the referee's point that an explicit ablation isolating the effect of tensor caching on RL-specific metrics would provide stronger support for the claim. The preprocessing steps (frame extraction, subsampling, and normalization) are deterministic and identical to the online path, so the input distribution is preserved by design (a minimal fidelity check along these lines is sketched after these responses). The reported 1.47× gain comes from eliminating repeated decoding inside the RL loop. We will add a targeted ablation in the revised §3 (or a new appendix) that compares reward signals, on-policy exploration statistics, and final benchmark scores on a representative subset of tasks between the cached and fully online pipelines. This will directly address the concern about potential degradation. revision: yes
- Referee: [§4] §4 (Experiments): The statement that 'reproduced accuracy closely aligned with officially reported scores' is presented without clarifying whether the alignment was measured on base models only or after RL optimization. Because the mixed offline-online paradigm and the 11-type reward router are active during training, an explicit post-RL ablation confirming that the cached inputs do not cause measurable degradation relative to an online baseline is required to support the overall claim of effective RLVR for video.
Authors: We thank the referee for noting the ambiguity. The alignment statement refers to validation of the asynchronous evaluation harness on base models to confirm faithful reproduction of published numbers. To address the post-RL concern, we will revise §4 to explicitly state this scope and add a post-training comparison on a subset of the 22 benchmarks. This experiment will compare final model performance (after mixed offline-online RL with the reward router) using cached tensors versus an online-decoding baseline, confirming no measurable degradation from the caching approach. revision: yes
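A cheap fidelity check backing the determinism argument above, reusing the illustrative names from the caching sketch: re-run a few samples through the online path and require exactly equal tensors.

```python
import torch

def check_cache_fidelity(video_paths: list[str], num_frames: int = 16, size: int = 336) -> None:
    for vp in video_paths:
        cached = load_or_preprocess(vp, num_frames, size)  # served from cache
        fresh = decode_and_sample(vp, num_frames, size)    # hypothetical online path
        # Deterministic preprocessing should reproduce cached tensors exactly.
        assert torch.equal(cached, fresh), f"cache/online mismatch for {vp}"
```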
Circularity Check
No significant circularity detected in EasyVideoR1 framework claims
Full rationale
The paper describes an engineering framework with practical contributions: offline preprocessing for measured 1.47× throughput gains, a task-aware reward router, mixed data training, joint image-video pixel budgets, and an async evaluation suite. These are presented as implemented features with empirical results (throughput improvement and benchmark alignment) rather than derived predictions or first-principles results. No equations, fitted parameters, or self-referential definitions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The throughput claim is an external measurement against baselines, not a tautology. The evaluation framework is a tooling contribution, not a renaming or reduction of prior results. The central claims remain independent of any circular reduction to inputs.
Forward citations
Cited by 2 Pith papers
- Near-Future Policy Optimization: NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
- Co-Evolving Policy Distillation: CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...