pith. machine review for the scientific record.

arxiv: 2503.21776 · v4 · submitted 2025-03-27 · 💻 cs.CV

Recognition: 2 theorem links

Video-R1: Reinforcing Video Reasoning in MLLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 09:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords video reasoning · multimodal large language models · reinforcement learning · GRPO · temporal modeling · VSI-Bench · VideoMMMU

The pith

Temporal reinforcement learning with mixed image-video data enables a 7B multimodal model to reach 37.1 percent accuracy on the VSI-Bench video spatial reasoning benchmark, exceeding GPT-4o.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to adapt rule-based reinforcement learning techniques that have worked for text reasoning to the more complex setting of video reasoning inside multimodal large language models. It identifies two main obstacles: missing temporal awareness and limited high-quality video data. The authors respond by designing a temporal extension of the GRPO algorithm and by training on combined image and video reasoning examples rather than video alone. If successful, this would mean smaller open models can deliver strong video understanding without depending on proprietary closed systems. The approach yields measurable gains on dedicated video reasoning benchmarks as well as on general video understanding tasks.

Core claim

Video-R1 introduces the T-GRPO algorithm, which encourages models to use temporal information across video frames during rule-based reinforcement learning, and constructs two mixed image-video datasets for the supervised fine-tuning and RL stages. This produces clear accuracy increases on video reasoning benchmarks including VideoMMMU and VSI-Bench, as well as on general video benchmarks such as MVBench and TempCompass, with the resulting 7B model attaining 37.1 percent accuracy on VSI-Bench.

What carries the argument

T-GRPO algorithm, an extension of GRPO that incorporates explicit temporal modeling to guide reasoning over video sequences.
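
The review describes T-GRPO only at this high level, so the sketch below shows one way a temporal reward could wrap standard GRPO-style group rewards, consistent with the abstract's statement that T-GRPO "encourages models to utilize temporal information." The rollout and is_correct callables, the ordered-versus-shuffled control scheme, and the bonus weight ALPHA are editorial assumptions, not the authors' published formulation.

    import random
    from statistics import mean, pstdev

    ALPHA = 0.5  # hypothetical temporal-bonus weight, not taken from the paper

    def t_grpo_advantages(model, frames, question, answer,
                          rollout, is_correct, group_size=8):
        """Sketch: score rollouts on temporally ordered frames, adding a bonus
        when the ordered group beats a frame-shuffled control group."""
        shuffled = frames[:]
        random.shuffle(shuffled)  # destroy temporal order for the control

        ordered = [rollout(model, frames, question) for _ in range(group_size)]
        control = [rollout(model, shuffled, question) for _ in range(group_size)]

        acc_ordered = mean(float(is_correct(o, answer)) for o in ordered)
        acc_control = mean(float(is_correct(o, answer)) for o in control)

        rewards = []
        for out in ordered:
            r = float(is_correct(out, answer))  # rule-based accuracy reward
            if r > 0 and acc_ordered > acc_control:
                r += ALPHA  # pay extra only when temporal order seems to matter
            rewards.append(r)

        # GRPO-style group-relative advantage: normalize within the group
        mu, sigma = mean(rewards), pstdev(rewards) or 1.0
        return [(r - mu) / sigma for r in rewards]

The point of an ordered-versus-shuffled comparison would be to make temporal structure, rather than single-frame appearance, the signal the policy is rewarded for.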

If this is right

  • Accuracy rises on video reasoning benchmarks VideoMMMU and VSI-Bench.
  • Performance improves on general video benchmarks MVBench and TempCompass.
  • The 7B model surpasses the proprietary GPT-4o system on the VSI-Bench spatial reasoning task.
  • Public release of the Video-R1-CoT-165k and Video-R1-260k datasets and trained models allows direct replication and extension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The strategy of mixing high-quality image reasoning data with video data may reduce the data bottleneck for other multimodal reasoning domains where pure video examples are scarce.
  • Temporal modeling additions developed here could transfer to non-reasoning video tasks such as temporal action localization or event prediction.
  • Scaling the same RL procedure to larger base models might produce further gains and narrow the gap with closed-source systems even more.
  • The datasets and training recipe open the possibility of combining this method with other alignment techniques to improve robustness on out-of-distribution video inputs.

Load-bearing premise

The accuracy improvements arise chiefly from the addition of temporal modeling in T-GRPO and the mixing of image and video reasoning data rather than from differences in base model scale or total training compute.

What would settle it

An ablation experiment that applies standard GRPO without the temporal component to the same base model and data mixture and measures whether VSI-Bench accuracy falls below 37.1 percent or loses the reported margin over GPT-4o.
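
A minimal harness for that experiment might look like the sketch below; train and evaluate_vsi_bench are hypothetical stand-ins for the released training and evaluation code, with the temporal term as the only varied factor.

    def run_ablation(base_model, sft_data, rl_data, train, evaluate_vsi_bench):
        """Matched controls: same base model and data, temporal term toggled."""
        conditions = {
            "sft_only":   train(base_model, sft=sft_data),
            "sft_grpo":   train(base_model, sft=sft_data, rl=rl_data, temporal=False),
            "sft_t_grpo": train(base_model, sft=sft_data, rl=rl_data, temporal=True),
        }
        scores = {name: evaluate_vsi_bench(m) for name, m in conditions.items()}
        # The load-bearing premise holds only if sft_t_grpo retains the reported
        # 37.1 percent while sft_grpo falls measurably below it on the same budget.
        return scores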

read the original abstract

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Video-R1 as the first systematic application of the DeepSeek-R1 paradigm (rule-based RL) to video reasoning in MLLMs. It proposes the T-GRPO algorithm to incorporate temporal modeling, constructs Video-R1-CoT-165k (for SFT cold-start) and Video-R1-260k (for RL) datasets that mix high-quality image and video reasoning data, and reports empirical gains on video reasoning benchmarks (VideoMMMU, VSI-Bench) as well as general video understanding benchmarks (MVBench, TempCompass). Notably, the 7B model reaches 37.1% on VSI-Bench, surpassing GPT-4o. All code, models, and data are released publicly.

Significance. If the performance lifts are shown to arise specifically from T-GRPO and the image-video mixture rather than base-model scale or training volume, the work would constitute a useful first step in adapting R1-style RL to multimodal video reasoning. The public release of code, models, and datasets is a clear strength that supports reproducibility and community follow-up.

major comments (3)
  1. [§4 Experiments] In §4 and the associated tables, no results are reported for the identical base 7B model after (a) SFT on Video-R1-CoT-165k alone or (b) standard GRPO (without the temporal reward term) on Video-R1-260k. These matched controls are required to isolate the incremental contribution of T-GRPO from the effects of additional data curation and RL compute.
  2. [VSI-Bench results table] The 37.1% accuracy for Video-R1-7B is presented as evidence that the proposed method surpasses GPT-4o, yet the table omits the base model (e.g., Qwen2-VL-7B) after equivalent SFT or conventional RL training on the same data volume. This omission prevents attribution of the lift to the temporal modeling component.
  3. [§3.2 T-GRPO] The temporal reward is described at a high level as encouraging use of temporal information, but the exact mathematical formulation (how the temporal term modifies the GRPO advantage or reward) is not given as an equation. Without this, it is impossible to verify that T-GRPO differs meaningfully from standard GRPO or to reproduce the claimed temporal modeling effect.
minor comments (2)
  1. [Abstract] The abstract ends the benchmark list with 'etc.'; replace with explicit additional benchmarks or remove the phrase for precision.
  2. [§3.3 Datasets] Notation for the image-video data mixture ratio is introduced without a clear definition or sensitivity analysis; add a short paragraph or table entry clarifying the mixing hyperparameter.
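
For concreteness, the mixing hyperparameter the referee asks to see defined usually reduces to a batch-level sampling ratio, as in the illustrative sketch below; the name rho and its default value are assumptions, since this review does not state the paper's actual proportion.

    import random

    def mixed_batch(image_pool, video_pool, batch_size=32, rho=0.5):
        """Sample a batch with about rho image-reasoning examples, rest video.

        rho is a hypothetical mixing hyperparameter; the paper's true
        image-video ratio is not specified in this review."""
        n_img = round(rho * batch_size)
        batch = random.sample(image_pool, n_img)
        batch += random.sample(video_pool, batch_size - n_img)
        random.shuffle(batch)
        return batch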

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and agree that the requested additions will improve the clarity and rigor of the work. We will incorporate all suggested revisions in the updated manuscript.

read point-by-point responses
  1. Referee: [§4 Experiments] In §4 and the associated tables, no results are reported for the identical base 7B model after (a) SFT on Video-R1-CoT-165k alone or (b) standard GRPO (without the temporal reward term) on Video-R1-260k. These matched controls are required to isolate the incremental contribution of T-GRPO from the effects of additional data curation and RL compute.

    Authors: We agree that these matched ablations are necessary to isolate the contribution of T-GRPO. Our current comparisons are against published baselines, but we acknowledge the value of the requested controls. In the revised manuscript we will report the 7B base model performance after SFT on Video-R1-CoT-165k alone and after standard GRPO (without the temporal term) on Video-R1-260k. revision: yes

  2. Referee: [VSI-Bench results table] The 37.1% accuracy for Video-R1-7B is presented as evidence that the proposed method surpasses GPT-4o, yet the table omits the base model (e.g., Qwen2-VL-7B) after equivalent SFT or conventional RL training on the same data volume. This omission prevents attribution of the lift to the temporal modeling component.

    Authors: We accept this point. To enable direct attribution, we will expand the VSI-Bench table (and the corresponding tables in §4) to include the base Qwen2-VL-7B results after SFT on Video-R1-CoT-165k and after standard GRPO on Video-R1-260k, allowing readers to evaluate the incremental effect of the temporal reward term. revision: yes

  3. Referee: [§3.2 T-GRPO] The temporal reward is described at a high level as encouraging use of temporal information, but the exact mathematical formulation (how the temporal term modifies the GRPO advantage or reward) is not given as an equation. Without this, it is impossible to verify that T-GRPO differs meaningfully from standard GRPO or to reproduce the claimed temporal modeling effect.

    Authors: We thank the referee for highlighting this omission. While the temporal component is described conceptually in the current text, we agree that an explicit equation is required for verification and reproducibility. In the revised §3.2 we will insert the precise mathematical definition of the temporal reward term and its integration into the GRPO advantage computation. revision: yes
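
As an editorial aid, one plausible formalization consistent with the abstract's description (a hedged reconstruction, not the authors' equation) is:

    % Hedged reconstruction of a temporal reward term; the paper's actual
    % definition may differ. r_i is the rule-based reward for response i.
    r_i^{\mathrm{T}} =
    \begin{cases}
      r_i + \alpha, & \text{if } r_i > 0 \text{ and } \bar{p}_{\mathrm{ordered}} > \bar{p}_{\mathrm{shuffled}}, \\
      r_i, & \text{otherwise,}
    \end{cases}

where the p-bar terms denote the fractions of correct rollouts on temporally ordered versus shuffled frames, and alpha is a bonus weight.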

Circularity Check

0 steps flagged

No circularity: purely empirical RL training and benchmark reporting

full rationale

The manuscript describes dataset construction (Video-R1-CoT-165k, Video-R1-260k), the T-GRPO algorithm extension, SFT+RL training, and direct benchmark evaluation (VSI-Bench 37.1%, VideoMMMU, etc.). No equations, uniqueness theorems, or first-principles derivations are presented whose outputs are forced by their own inputs. Reported gains are measured accuracies after training; they do not reduce to fitted parameters renamed as predictions or to self-citation chains. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The work rests on standard assumptions from prior RL and MLLM literature plus the effectiveness of the proposed T-GRPO modification.

invented entities (1)
  • T-GRPO algorithm · no independent evidence
    purpose: Encourage temporal information use in video reasoning during RL
    Newly proposed variant of GRPO without independent validation outside the paper's experiments.

pith-pipeline@v0.9.0 · 5589 in / 1030 out tokens · 39032 ms · 2026-05-12T09:37:02.886168+00:00 · methodology

discussion (0)


Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

  3. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  4. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  5. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  6. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.

  7. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 conditional novelty 7.0

    TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

  8. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  9. Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.

  10. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  11. Act2See: Emergent Active Visual Perception for Video Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

  12. Don't Pause! Every prediction matters in a streaming video

    cs.CV 2026-04 unverdicted novelty 7.0

    SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

  13. Towards Temporal Compositional Reasoning in Long-Form Sports Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

  14. Motion-o: Trajectory-Grounded Video Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    Motion-o extends VLMs with Motion Chain of Thought (MCoT) using <motion/> tags and perturbation rewards to make object trajectories explicit and supervised in video reasoning.

  15. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  16. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  17. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  18. Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

  19. BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...

  20. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV 2026-05 unverdicted novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  21. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  22. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  23. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  24. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  25. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  26. Watch Before You Answer: Learning from Visually Grounded Post-Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

  27. Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

  28. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  29. Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

  30. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  31. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  32. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  33. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  34. Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

    cs.CV 2026-05 unverdicted novelty 5.0

    MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

  35. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  36. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  37. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  38. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 34 Pith papers · 18 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  2. [2]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

  3. [3]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

  4. [4]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

  5. [5]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

  6. [6]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

  7. [7]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv preprint arXiv:2504.11536, 2025

  8. [8]

    Keypoint-Based Progressive Chain-of-Thought Distillation for LLMs

    Kaituo Feng, Changsheng Li, Xiaolu Zhang, Jun Zhou, Ye Yuan, and Guoren Wang. Keypoint-based progressive chain-of-thought distillation for llms. arXiv preprint arXiv:2405.16064, 2024

  9. [9]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024

  10. [10]

    On Designing Effective RL Reward at Training Time for LLM Reasoning

    Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning. arXiv preprint arXiv:2410.15115, 2024

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [12]

    Open-reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open-reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025

  13. [13]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  14. [14]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  15. [15]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  16. [16]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  17. [17]

    Reinforcement Learning: A Survey

    Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996

  18. [18]

    Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

    Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, and Xiaofeng Yang. Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939, 2025

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  20. [20]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  21. [21]

    Process reward model with q-value rankings.arXiv preprint arXiv:2410.11287, 2024

    Wendi Li and Yixuan Li. Process reward model with q-value rankings.arXiv preprint arXiv:2410.11287, 2024

  22. [22]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  23. [23]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

  24. [24]

    Star-R1: Spatial Transformation Reasoning by Reinforcing Multimodal LLMs

    Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms. arXiv preprint arXiv:2505.15804, 2025

  25. [25]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

  26. [26]

    Kangaroo: A Powerful Video-Language Model Supporting Long-Context Video Input

    Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542, 2024

  27. [27]

    Tempcompass: Do video llms really understand videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024

  28. [28]

    Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

    Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, et al. Fin-r1: A large language model for financial reasoning through reinforcement learning. arXiv preprint arXiv:2503.16252, 2025

  29. [29]

    Ui-r1: Enhancing action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  31. [31]

    Audio-visual llm for video understanding

    Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. arXiv preprint arXiv:2312.06720, 2023

  32. [32]

    Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

  33. [33]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  34. [34]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  35. [35]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025

  36. [36]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025

  37. [37]

    Visa: Reasoning video object segmentation via large language models

    Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision, pages 98–115. Springer, 2024

  38. [38]

    Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv preprint arXiv:2412.14171, 2024

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv preprint arXiv:2412.14171, 2024

  39. [39]

    Skywork r1v: Pioneering multimodal reasoning with chain-of-thought, 2025

    Chris Yi Peng et al. Skywork r1v: Pioneering multimodal reasoning with chain-of-thought, 2025

  40. [40]

    Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025

    En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025

  41. [41]

    Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

    Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

  42. [42]

    Advancing LLM Reasoning Generalists with Preference Trees

    Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. Advancing llm reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078, 2024

  43. [43]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

  44. [44]

    Will pre-training ever end? a first step toward next-generation foundation mllms via self-improving systematic cognition.arXiv preprint arXiv:2503.12303, 2025

    Xiaoying Zhang, Da Peng, Yipeng Zhang, Zonghao Guo, Chengyue Wu, Chi Chen, Wei Ke, Helen Meng, and Maosong Sun. Will pre-training ever end? a first step toward next-generation foundation mllms via self-improving systematic cognition.arXiv preprint arXiv:2503.12303, 2025

  45. [45]

    Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

    Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025

  46. [46]

    Llava-uhd v2: an mllm integrating high-resolution feature pyramid via hierarchical window transformer.arXiv preprint arXiv:2412.13871, 2024

    Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, et al. Llava-uhd v2: an mllm integrating high-resolution feature pyramid via hierarchical window transformer. arXiv preprint arXiv:2412.13871, 2024

  47. [47]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  48. [48]

    Mmvu: Measuring expert-level multi-discipline video understanding.arXiv preprint arXiv:2501.12380, 2025

    Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, et al. Mmvu: Measuring expert-level multi-discipline video understanding.arXiv preprint arXiv:2501.12380, 2025

  49. [49]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022
