{"work":{"id":"0ce88332-564c-4361-8e2a-3850eb1ace9c","openalex_id":null,"doi":null,"arxiv_id":"2503.21776","raw_key":null,"title":"Video-R1: Reinforcing Video Reasoning in MLLMs","authors":null,"authors_text":"Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng","year":2025,"venue":"cs.CV","abstract":"Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.","external_url":"https://arxiv.org/abs/2503.21776","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:35:25.047994+00:00","pith_arxiv_id":"2503.21776","created_at":"2026-05-08T21:34:15.417945+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Video-R1: Reinforcing Video Reasoning in MLLMs","render_title":"Video-R1: Reinforcing Video Reasoning in MLLMs"},"hub":{"state":{"work_id":"0ce88332-564c-4361-8e2a-3850eb1ace9c","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":73,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2025-03-12T08:33:46+00:00","last_pith_cited_at":"2026-05-22T04:19:29+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-19T17:27:24.486384+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":13},{"context_role":"baseline","n":8},{"context_role":"dataset","n":2},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"baseline","n":8},{"context_polarity":"use_dataset","n":2},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-17T03:19:31.936327+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":29},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":21},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":19},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":19},{"title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","work_id":"7be17d59-6cde-455a-99c3-06e28659839f","shared_citers":17},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":14},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":14},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":14},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":13},{"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","work_id":"38998646-34ee-4605-b661-ab356f16d6e5","shared_citers":13},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":11},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding","work_id":"ef2c21b6-ae25-436a-bac3-f8d625541320","shared_citers":11},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":10},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":10},{"title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","work_id":"d36889cb-edb6-448f-9a50-36df8b1623e5","shared_citers":10},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":8},{"title":"OneThinker: All-in-one Reasoning Model for Image and Video","work_id":"d0ffbcf9-210d-436c-b6f0-cde5bcdbae97","shared_citers":8},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":8},{"title":"Videochat-r1","work_id":"8d2957eb-c24e-46b9-8bef-c978f7c5f0e6","shared_citers":8},{"title":"Video-holmes: Can mllm think like holmes for complex video reasoning?","work_id":"6d26f54b-33e5-4b18-86b0-7202ab41b867","shared_citers":8},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":7}],"time_series":[{"n":8,"year":2025},{"n":38,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"use_method","paper_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","primary_cat":"cs.MM","context_text":"video-text portion combines four open-source video corpora: Video-R1-data [37], VideoAuto-R1- Data [38], ShareGPT4Video [39], and LLaV A-Video-178K [15]. Because these corpora partially overlap, we deduplicate exact matches at the video-query level while retaining multiple distinct 9 StepFun-Audio Team queries for the same video when appropriate. We then rewrite the video CoTs with Qwen2.5-VL- 235B [40], add dense full-video captions derived from 30-second segments, and discard examples that the 235B model still cannot answer correctly. This construction gives each source the same output-token budget so that the comparison focuses on modality composition rather than simple data imbalance. Training setup.We train this SFT stage for 1 epoch with a global batch size of 64.","citing_arxiv_id":"2605.12034"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","context_text":"3 69.6 47.8 49.2 11.4 34.1 21.8 36.3 28.7 28.4 Mimo-V2-Omni [45] 32 31.6 47.5 4.7 69.6 57.5 47.6 7.6 30.3 26.9 42.1 27.0 35.1 Open-source thinking / reasoning models Qwen3-VL-8B-Thinking [1] 32 32.1 35.6 14.0 100.0 35.1 50.8 10.5 34.9 24.7 40.6 30.3 43.8 VideoChat-R1.5-7B [49] 32 29.4 45.2 7.0 100.0 47.0 47.1 9.5 29.3 17.0 40.6 23.6 33.1 Video-R1-7B [12] 32 25.1 34.2 0.0 100.0 32.8 46.5 14.3 25.6 15.1 35.3 24.7 32.5 Open-source standard models Qwen2.5-VL-72B [2] 32 34.0 47.5 3.0 95.7 49.3 50.3 15.2 38.2 30.1 40.6 27.0 42.6 InternVL3-8B [59] 32 30.1 42.5 18.7 91.3 42.5 43.3 5.7 32.0 20.2 31.7 26.4 27.8 LLaV A-Video-72B [58] 32 28.9 49.3 15.0 26.1 47.0 45.5 14.3 25.9 21.5 31.7 24.2 24.5 VideoLLaMA3-7B [56] 32 28.","citing_arxiv_id":"2605.09904"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs","primary_cat":"cs.CV","context_text":"1 46.2 - Gemini-2.0-Flash [11] 49.0 60.4 54.7 51.9 50.7 51.3 53.0 - Open-Source Models InternVL-2.5-8B [7] 28.3 56.3 42.3 37.3 47.5 42.4 42.3 - Kimi-VL-A3B [37] 25.7 44.9 35.3 23.3 57.6 40.5 37.9 - LLaV A-OneVision-7B [29] 25.4 41.0 33.2 20.6 49.6 35.1 34.2 - Qwen2.5-VL-7B [3] 36.3 60.5 48.4 28.9 49.8 39.3 43.9 53.7 Spatial Reasoning Models Video-R1 [14] 27.7 62.0 44.9 32.5 53.0 42.8 43.8 - SpaceR-7B [41] 35.7 61.5 48.6 63.2 53.7 58.5 53.5 56.5 VILASR-7B [62] 36.6 63.7 50.2 56.2 59.6 57.9 54.0 56.1 Spatial-MLLM-4B [61] 38.1 49.3 43.7 63.7 58.9 61.3 52.5 44.0 SpaceMind [72] 66.353.2 59.7 76.2 70.5 73.8 67.3 - EgoMind [9] - - - - - - 55.0 58.0 SpaceMind++ (Ours) 56.7 65.6 61.1 76.5 82.3 78.9 70.0 61.","citing_arxiv_id":"2605.09449"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","context_text":"2 39.653.849.0 42.270.370.7 21.0 46.2 VISD (Ours) 39.9 39.4 40.4 40.5 53.8 49.7 42.369.373.5 24.8 48.0∆vs. Qwen2.5-VL-7B +4.4 +6.5 +2.0 +3.5 +2.6 - - +4.6 +14.2 +8.5 +7.4 4.2 Main Results Spatio-Temporal Grounding on V-STAR[ 7]. Table 1 compares VISD with closed-source mod- els [19, 8], general video LLMs [2], and recent reasoning-focused frameworks [9, 30, 20]. VISD improves answer accuracy over Qwen2.5-VL-7B by +28.4 points and obtains the best overall V-STAR scores, improving mAM/mLGM over VisionCoach from 34.3/47.5 to 35.1/48.9. The gains appear not only in final answering, but also in temporal localization and spatial grounding, which is the setting where sparse sequence-level rewards are least informative.","citing_arxiv_id":"2605.06094"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"Towards Temporal Compositional Reasoning in Long-Form Sports Videos","primary_cat":"cs.CV","context_text":"77 27.45 43.85 44.77 46.93 40.72Gemini-2.5-Pro 0.2 fps39.66 19.78 29.49 39.41 13.71 29.37 Open-sourceQwen3-VL-8B-Instruct [2] 76824.27 14.31 33.26 23.08 44.47 27.45Qwen3-VL-4B-Instruct 76823.26 13.14 31.11 22.92 34.17 25.15VideoLLaMA-7B [50] 76820.06 10.98 16.92 14.00 28.54 17.96InternVideo2.5-8B [4] 51218.12 10.39 19.83 9.85 26.21 16.68Video-R1-7B [8] 76819.26 8.04 23.25 18.62 33.01 20.40MiniCPM-V4.5-8B [47] 51216.38 10.14 13.50 16.18 27.27 16.48GLM-4.6v-Flash-9B [30] 5122.27 1.18 3.76 3.54 13.20 4.62 5.1 Experimental setting Implementation detailsWe sample 128 frames per video for training and up to 768 frames for inference. We use Qwen3-VL-4B as the backbone model. All experiments are conducted on 2×NVIDIA H100 (80GB).","citing_arxiv_id":"2604.22226"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"EasyVideoR1: Easier RL for Video Understanding","primary_cat":"cs.CV","context_text":"We select Qwen3-VL-8B-Instruct [1] as the representative base model for our experiments, as it employs the DeepStack architecture with interleaved M-RoPE positional encoding and represents one of the strongest and most widely adopted open-source video-language models at this scale. We train on approximately 100K video samples assembled from publicly available video RL datasets such as OneThinker [10], Video-R1 [9], and VideoChat-R1 [20]. To ensure that training samples lie within the model's learning frontier, we apply a pass-rate-based filtering strategy: for each candidate sample, we performk=8rollouts using the base model and retain only samples with partial success (0<pass rate<1), removing trivially solved instances. Training Configuration.We train with GRPO using the DAPO clipping variant [47] (asymmetric clip ratios","citing_arxiv_id":"2604.16893"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","primary_cat":"cs.CV","context_text":"58 72.34 ↑1.64 46.50 ↑2.70 48.29 ↑1.69 62.75 ↑3.99 69.30 ↑1.40 73.46 ↑2.48 TUNA-Bench [12]), high-level cognitive reasoning (Video-Holmes [3], Video- TT [53]), and holistic multi-task comprehension (Video-MME [7], MLVU [58], MLVU-Test [58]). Beyond the Qwen3-Omni-30B-A3B-Instruct [42] baseline, we furtherassessOmni-R1[56],HumanOmniV2[47],andVideo-R1[6].Experiments employ two inference modes (w/ audioandw/o audio), with videos downsam- pled to 200 frames (128×28×28pixels) and audio symmetrically truncated to 600s. As presented in Table 1, OmniJigsaw yields substantial gains (+4.38 on MLVU-Test) across nearly all benchmarks, with CMM consistently outper- forming VideoJigsaw despite auxiliary audio attention allocation.","citing_arxiv_id":"2604.08209"}]},"error":null,"updated_at":"2026-05-17T03:19:26.009788+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-17T03:19:29.552258+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Video-R1: Reinforcing Video Reasoning in MLLMs","claims":[{"claim_text":"Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models t","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"77 27.45 43.85 44.77 46.93 40.72Gemini-2.5-Pro 0.2 fps39.66 19.78 29.49 39.41 13.71 29.37 Open-sourceQwen3-VL-8B-Instruct [2] 76824.27 14.31 33.26 23.08 44.47 27.45Qwen3-VL-4B-Instruct 76823.26 13.14 31.11 22.92 34.17 25.15VideoLLaMA-7B [50] 76820.06 10.98 16.92 14.00 28.54 17.96InternVideo2.5-8B [4] 51218.12 10.39 19.83 9.85 26.21 16.68Video-R1-7B [8] 76819.26 8.04 23.25 18.62 33.01 20.40MiniCPM-V4.5-8B [47] 51216.38 10.14 13.50 16.18 27.27 16.48GLM-4.6v-Flash-9B [30] 5122.27 1.18 3.76 3.54 13.","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We select Qwen3-VL-8B-Instruct [1] as the representative base model for our experiments, as it employs the DeepStack architecture with interleaved M-RoPE positional encoding and represents one of the strongest and most widely adopted open-source video-language models at this scale. We train on approximately 100K video samples assembled from publicly available video RL datasets such as OneThinker [10], Video-R1 [9], and VideoChat-R1 [20]. To ensure that training samples lie within the model's lea","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"not mask refinement after a correct box, but the earlier path from search planning to entity resolution and visual instance verification. WebEyes provides benchmark infrastructure for this direction, and Pixel-Searcher offers a simple starting workflow for studying how agentic search can identify the right entity and bind it to the right visual instance. References [1] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiang","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025. [180] Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025. [181] Xiachong Feng, Longxu Dou, and Lingpeng Kong. Reasoning does not necessarily improve role-playing ability. arXiv preprint arXiv:2502.16940, 2025. [182] Xueyang Feng, Bo Lan, Quanyu Dai, Lei Wang, Jiakai Tang, Xu Chen, Zhenhua Dong, and Ji-Rong Wen. Improving retrospective language agents via jo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"58 72.34 ↑1.64 46.50 ↑2.70 48.29 ↑1.69 62.75 ↑3.99 69.30 ↑1.40 73.46 ↑2.48 TUNA-Bench [12]), high-level cognitive reasoning (Video-Holmes [3], Video- TT [53]), and holistic multi-task comprehension (Video-MME [7], MLVU [58], MLVU-Test [58]). Beyond the Qwen3-Omni-30B-A3B-Instruct [42] baseline, we furtherassessOmni-R1[56],HumanOmniV2[47],andVideo-R1[6].Experiments employ two inference modes (w/ audioandw/o audio), with videos downsam- pled to 200 frames (128×28×28pixels) and audio symmetrically ","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"2 39.653.849.0 42.270.370.7 21.0 46.2 VISD (Ours) 39.9 39.4 40.4 40.5 53.8 49.7 42.369.373.5 24.8 48.0∆vs. Qwen2.5-VL-7B +4.4 +6.5 +2.0 +3.5 +2.6 - - +4.6 +14.2 +8.5 +7.4 4.2 Main Results Spatio-Temporal Grounding on V-STAR[ 7]. Table 1 compares VISD with closed-source mod- els [19, 8], general video LLMs [2], and recent reasoning-focused frameworks [9, 30, 20]. VISD improves answer accuracy over Qwen2.5-VL-7B by +28.4 points and obtains the best overall V-STAR scores, improving mAM/mLGM over Vi","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Video-R1: Reinforcing Video Reasoning in MLLMs because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (5 contexts).","role_counts":[{"n":5,"context_role":"background"},{"n":5,"context_role":"baseline"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-17T03:19:26.016707+00:00"}},"summary":{"title":"Video-R1: Reinforcing Video Reasoning in MLLMs","claims":[{"claim_text":"Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models t","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"77 27.45 43.85 44.77 46.93 40.72Gemini-2.5-Pro 0.2 fps39.66 19.78 29.49 39.41 13.71 29.37 Open-sourceQwen3-VL-8B-Instruct [2] 76824.27 14.31 33.26 23.08 44.47 27.45Qwen3-VL-4B-Instruct 76823.26 13.14 31.11 22.92 34.17 25.15VideoLLaMA-7B [50] 76820.06 10.98 16.92 14.00 28.54 17.96InternVideo2.5-8B [4] 51218.12 10.39 19.83 9.85 26.21 16.68Video-R1-7B [8] 76819.26 8.04 23.25 18.62 33.01 20.40MiniCPM-V4.5-8B [47] 51216.38 10.14 13.50 16.18 27.27 16.48GLM-4.6v-Flash-9B [30] 5122.27 1.18 3.76 3.54 13.","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We select Qwen3-VL-8B-Instruct [1] as the representative base model for our experiments, as it employs the DeepStack architecture with interleaved M-RoPE positional encoding and represents one of the strongest and most widely adopted open-source video-language models at this scale. We train on approximately 100K video samples assembled from publicly available video RL datasets such as OneThinker [10], Video-R1 [9], and VideoChat-R1 [20]. To ensure that training samples lie within the model's lea","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"not mask refinement after a correct box, but the earlier path from search planning to entity resolution and visual instance verification. WebEyes provides benchmark infrastructure for this direction, and Pixel-Searcher offers a simple starting workflow for studying how agentic search can identify the right entity and bind it to the right visual instance. References [1] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiang","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025. [180] Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903, 2025. [181] Xiachong Feng, Longxu Dou, and Lingpeng Kong. Reasoning does not necessarily improve role-playing ability. arXiv preprint arXiv:2502.16940, 2025. [182] Xueyang Feng, Bo Lan, Quanyu Dai, Lei Wang, Jiakai Tang, Xu Chen, Zhenhua Dong, and Ji-Rong Wen. Improving retrospective language agents via jo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"58 72.34 ↑1.64 46.50 ↑2.70 48.29 ↑1.69 62.75 ↑3.99 69.30 ↑1.40 73.46 ↑2.48 TUNA-Bench [12]), high-level cognitive reasoning (Video-Holmes [3], Video- TT [53]), and holistic multi-task comprehension (Video-MME [7], MLVU [58], MLVU-Test [58]). Beyond the Qwen3-Omni-30B-A3B-Instruct [42] baseline, we furtherassessOmni-R1[56],HumanOmniV2[47],andVideo-R1[6].Experiments employ two inference modes (w/ audioandw/o audio), with videos downsam- pled to 200 frames (128×28×28pixels) and audio symmetrically ","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"2 39.653.849.0 42.270.370.7 21.0 46.2 VISD (Ours) 39.9 39.4 40.4 40.5 53.8 49.7 42.369.373.5 24.8 48.0∆vs. Qwen2.5-VL-7B +4.4 +6.5 +2.0 +3.5 +2.6 - - +4.6 +14.2 +8.5 +7.4 4.2 Main Results Spatio-Temporal Grounding on V-STAR[ 7]. Table 1 compares VISD with closed-source mod- els [19, 8], general video LLMs [2], and recent reasoning-focused frameworks [9, 30, 20]. VISD improves answer accuracy over Qwen2.5-VL-7B by +28.4 points and obtains the best overall V-STAR scores, improving mAM/mLGM over Vi","claim_type":"baseline","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Video-R1: Reinforcing Video Reasoning in MLLMs because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (5 contexts).","role_counts":[{"n":5,"context_role":"background"},{"n":5,"context_role":"baseline"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"method"}]},"graph":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":29},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":21},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":19},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":19},{"title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","work_id":"7be17d59-6cde-455a-99c3-06e28659839f","shared_citers":17},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":14},{"title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","work_id":"e598f516-d992-449a-ab6d-6c788b3a1d7b","shared_citers":14},{"title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","work_id":"38f52461-37fd-4266-bc46-9dea31be2824","shared_citers":14},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":13},{"title":"Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models","work_id":"38998646-34ee-4605-b661-ab356f16d6e5","shared_citers":13},{"title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","work_id":"ee70bdc8-4656-4849-ada7-ce42a2278d70","shared_citers":11},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":11},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":11},{"title":"Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding","work_id":"ef2c21b6-ae25-436a-bac3-f8d625541320","shared_citers":11},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":10},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":10},{"title":"VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model","work_id":"d36889cb-edb6-448f-9a50-36df8b1623e5","shared_citers":10},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":8},{"title":"OneThinker: All-in-one Reasoning Model for Image and Video","work_id":"d0ffbcf9-210d-436c-b6f0-cde5bcdbae97","shared_citers":8},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":8},{"title":"Videochat-r1","work_id":"8d2957eb-c24e-46b9-8bef-c978f7c5f0e6","shared_citers":8},{"title":"Video-holmes: Can mllm think like holmes for complex video reasoning?","work_id":"6d26f54b-33e5-4b18-86b0-7202ab41b867","shared_citers":8},{"title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","work_id":"0f06e436-0c76-4e3c-be5e-6168f6bc4336","shared_citers":7}],"time_series":[{"n":8,"year":2025},{"n":38,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"use_method","paper_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","primary_cat":"cs.MM","context_text":"video-text portion combines four open-source video corpora: Video-R1-data [37], VideoAuto-R1- Data [38], ShareGPT4Video [39], and LLaV A-Video-178K [15]. Because these corpora partially overlap, we deduplicate exact matches at the video-query level while retaining multiple distinct 9 StepFun-Audio Team queries for the same video when appropriate. We then rewrite the video CoTs with Qwen2.5-VL- 235B [40], add dense full-video captions derived from 30-second segments, and discard examples that the 235B model still cannot answer correctly. This construction gives each source the same output-token budget so that the comparison focuses on modality composition rather than simple data imbalance. Training setup.We train this SFT stage for 1 epoch with a global batch size of 64.","citing_arxiv_id":"2605.12034"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models","primary_cat":"cs.CV","context_text":"3 69.6 47.8 49.2 11.4 34.1 21.8 36.3 28.7 28.4 Mimo-V2-Omni [45] 32 31.6 47.5 4.7 69.6 57.5 47.6 7.6 30.3 26.9 42.1 27.0 35.1 Open-source thinking / reasoning models Qwen3-VL-8B-Thinking [1] 32 32.1 35.6 14.0 100.0 35.1 50.8 10.5 34.9 24.7 40.6 30.3 43.8 VideoChat-R1.5-7B [49] 32 29.4 45.2 7.0 100.0 47.0 47.1 9.5 29.3 17.0 40.6 23.6 33.1 Video-R1-7B [12] 32 25.1 34.2 0.0 100.0 32.8 46.5 14.3 25.6 15.1 35.3 24.7 32.5 Open-source standard models Qwen2.5-VL-72B [2] 32 34.0 47.5 3.0 95.7 49.3 50.3 15.2 38.2 30.1 40.6 27.0 42.6 InternVL3-8B [59] 32 30.1 42.5 18.7 91.3 42.5 43.3 5.7 32.0 20.2 31.7 26.4 27.8 LLaV A-Video-72B [58] 32 28.9 49.3 15.0 26.1 47.0 45.5 14.3 25.9 21.5 31.7 24.2 24.5 VideoLLaMA3-7B [56] 32 28.","citing_arxiv_id":"2605.09904"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs","primary_cat":"cs.CV","context_text":"1 46.2 - Gemini-2.0-Flash [11] 49.0 60.4 54.7 51.9 50.7 51.3 53.0 - Open-Source Models InternVL-2.5-8B [7] 28.3 56.3 42.3 37.3 47.5 42.4 42.3 - Kimi-VL-A3B [37] 25.7 44.9 35.3 23.3 57.6 40.5 37.9 - LLaV A-OneVision-7B [29] 25.4 41.0 33.2 20.6 49.6 35.1 34.2 - Qwen2.5-VL-7B [3] 36.3 60.5 48.4 28.9 49.8 39.3 43.9 53.7 Spatial Reasoning Models Video-R1 [14] 27.7 62.0 44.9 32.5 53.0 42.8 43.8 - SpaceR-7B [41] 35.7 61.5 48.6 63.2 53.7 58.5 53.5 56.5 VILASR-7B [62] 36.6 63.7 50.2 56.2 59.6 57.9 54.0 56.1 Spatial-MLLM-4B [61] 38.1 49.3 43.7 63.7 58.9 61.3 52.5 44.0 SpaceMind [72] 66.353.2 59.7 76.2 70.5 73.8 67.3 - EgoMind [9] - - - - - - 55.0 58.0 SpaceMind++ (Ours) 56.7 65.6 61.1 76.5 82.3 78.9 70.0 61.","citing_arxiv_id":"2605.09449"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","primary_cat":"cs.CV","context_text":"2 39.653.849.0 42.270.370.7 21.0 46.2 VISD (Ours) 39.9 39.4 40.4 40.5 53.8 49.7 42.369.373.5 24.8 48.0∆vs. Qwen2.5-VL-7B +4.4 +6.5 +2.0 +3.5 +2.6 - - +4.6 +14.2 +8.5 +7.4 4.2 Main Results Spatio-Temporal Grounding on V-STAR[ 7]. Table 1 compares VISD with closed-source mod- els [19, 8], general video LLMs [2], and recent reasoning-focused frameworks [9, 30, 20]. VISD improves answer accuracy over Qwen2.5-VL-7B by +28.4 points and obtains the best overall V-STAR scores, improving mAM/mLGM over VisionCoach from 34.3/47.5 to 35.1/48.9. The gains appear not only in final answering, but also in temporal localization and spatial grounding, which is the setting where sparse sequence-level rewards are least informative.","citing_arxiv_id":"2605.06094"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"Towards Temporal Compositional Reasoning in Long-Form Sports Videos","primary_cat":"cs.CV","context_text":"77 27.45 43.85 44.77 46.93 40.72Gemini-2.5-Pro 0.2 fps39.66 19.78 29.49 39.41 13.71 29.37 Open-sourceQwen3-VL-8B-Instruct [2] 76824.27 14.31 33.26 23.08 44.47 27.45Qwen3-VL-4B-Instruct 76823.26 13.14 31.11 22.92 34.17 25.15VideoLLaMA-7B [50] 76820.06 10.98 16.92 14.00 28.54 17.96InternVideo2.5-8B [4] 51218.12 10.39 19.83 9.85 26.21 16.68Video-R1-7B [8] 76819.26 8.04 23.25 18.62 33.01 20.40MiniCPM-V4.5-8B [47] 51216.38 10.14 13.50 16.18 27.27 16.48GLM-4.6v-Flash-9B [30] 5122.27 1.18 3.76 3.54 13.20 4.62 5.1 Experimental setting Implementation detailsWe sample 128 frames per video for training and up to 768 frames for inference. We use Qwen3-VL-4B as the backbone model. All experiments are conducted on 2×NVIDIA H100 (80GB).","citing_arxiv_id":"2604.22226"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"EasyVideoR1: Easier RL for Video Understanding","primary_cat":"cs.CV","context_text":"We select Qwen3-VL-8B-Instruct [1] as the representative base model for our experiments, as it employs the DeepStack architecture with interleaved M-RoPE positional encoding and represents one of the strongest and most widely adopted open-source video-language models at this scale. We train on approximately 100K video samples assembled from publicly available video RL datasets such as OneThinker [10], Video-R1 [9], and VideoChat-R1 [20]. To ensure that training samples lie within the model's learning frontier, we apply a pass-rate-based filtering strategy: for each candidate sample, we performk=8rollouts using the base model and retain only samples with partial success (0<pass rate<1), removing trivially solved instances. Training Configuration.We train with GRPO using the DAPO clipping variant [47] (asymmetric clip ratios","citing_arxiv_id":"2604.16893"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","primary_cat":"cs.CV","context_text":"58 72.34 ↑1.64 46.50 ↑2.70 48.29 ↑1.69 62.75 ↑3.99 69.30 ↑1.40 73.46 ↑2.48 TUNA-Bench [12]), high-level cognitive reasoning (Video-Holmes [3], Video- TT [53]), and holistic multi-task comprehension (Video-MME [7], MLVU [58], MLVU-Test [58]). Beyond the Qwen3-Omni-30B-A3B-Instruct [42] baseline, we furtherassessOmni-R1[56],HumanOmniV2[47],andVideo-R1[6].Experiments employ two inference modes (w/ audioandw/o audio), with videos downsam- pled to 200 frames (128×28×28pixels) and audio symmetrically truncated to 600s. As presented in Table 1, OmniJigsaw yields substantial gains (+4.38 on MLVU-Test) across nearly all benchmarks, with CMM consistently outper- forming VideoJigsaw despite auxiliary audio attention allocation.","citing_arxiv_id":"2604.08209"}]},"authors":[]}}