DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Guanghao Zhang; Haobing Tang; Hao Jiang; Jinlong Liu; Le Zhang; Longxiang Zhang; Mushui Liu; Peng Zhang; Pipei Huang; Wanggui He

arxiv: 2605.26680 · v1 · pith:FECFVZXInew · submitted 2026-05-26 · 💻 cs.CV · cs.AI

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Peng Zhang , Guanghao Zhang , Wanggui He , Longxiang Zhang , Mushui Liu , Yan Xia , Zhenhao Peng , Weilong Dai

show 5 more authors

Jinlong Liu Haobing Tang Le Zhang Hao Jiang Pipei Huang

This is my paper

Pith reviewed 2026-06-29 17:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords dynamic frame samplingvideo multimodal LLMsadaptive retrievalsegment-decoupled GRPOtemporal window tokensmulti-granularity evidencevideo reasoning

0 comments

The pith

A video multimodal model emits both the chosen time window and its sampling density as native tokens inside one autoregressive generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing video reasoning systems let models decide where to look but keep the number of frames per window fixed, so fine detail requires repeated retrieval calls that lengthen context and mix credit between retrieval choices and answers. DynFrame turns the temporal window and the per-window frame density into tokens the model produces alongside its answer in a single pass, allowing one retrieval step to return evidence at multiple scales. A new reinforcement-learning rule, Segment-Decoupled GRPO, splits each rollout at the retrieval boundary and gives separate advantage signals to the sampling tokens and the answer tokens. The resulting 4B and 8B models reach or exceed strong fixed-density baselines on six video benchmarks after training on the authors' curated DM-CoT-74k and DM-RL-45k datasets.

Core claim

DynFrame emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Segment-Decoupled GRPO splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer.

What carries the argument

Learnable span-density retrieval, in which the model produces both the temporal window and the sampling density as tokens in the same generation sequence used for the final answer.

If this is right

Multi-granularity evidence can be obtained in one retrieval step instead of repeated calls.
Role-specific advantages allow independent optimization of sampling decisions and answer generation.
4B-scale models become competitive with 7B-8B fixed-density systems across six video benchmarks.
8B-scale models reach new state-of-the-art numbers on most of the same benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tokenized retrieval interface could be applied to other sequential inputs such as long documents or audio streams where inspection granularity varies.
If density tokens prove stable, agentic systems that currently loop over retrieval steps might collapse those loops into single generations.
Curating data that explicitly rewards varied density choices may become necessary to prevent the model from ignoring the new token.
The approach implicitly assumes that density decisions benefit from the same gradient signal as answer tokens once credit is decoupled.

Load-bearing premise

A single autoregressive pass can reliably output both accurate retrieval tokens for window and density and the final answer without the model collapsing to trivial or fixed-density strategies.

What would settle it

Train the model on the same data but remove the density token from the output vocabulary and measure whether benchmark scores fall back to the level of fixed-density baselines.

Figures

Figures reproduced from arXiv: 2605.26680 by Guanghao Zhang, Haobing Tang, Hao Jiang, Jinlong Liu, Le Zhang, Longxiang Zhang, Mushui Liu, Peng Zhang, Pipei Huang, Wanggui He, Weilong Dai, Yan Xia, Zhenhao Peng.

**Figure 1.** Figure 1: Textual CoT vs. DynFrame. Textual CoT (left) reasons over a fixed sparse frame set and misses the airborne segment, yielding a wrong rotation count. DynFrame (right) emits <span> and <fps> tokens within its reasoning to retrieve a denser, temporally focused frame set, and then continues reasoning over the augmented visual context to reach the correct answer. A growing thinking-with-video line of work addre… view at source ↗

**Figure 2.** Figure 2: Overview of DynFrame. The model interleaves tokenized temporal retrieval (<span>, <fps>) with on-the-fly frame injection inside a single autoregressive pass. SD-GRPO splits each rollout at the retrieval boundary and applies segment-specific advantages so that the sampling decision and the answer reasoning are credited separately. every token in the rollout, entangling the credit for committing a retrieval … view at source ↗

**Figure 3.** Figure 3: Data curation pipeline for DM-CoT-74k and DM-RL-45k. (a) Sources: VideoQA, grounded VideoQA, and temporal grounding benchmarks. (b) For VideoQA without temporal annotations, Gemini selects the evidence window and sampling rate, then answers under a “clip-only” constraint enforced at the prompt level. (c) For temporal grounding, ground-truth windows are reused; Gemini only selects an activity-adaptive FPS. … view at source ↗

**Figure 4.** Figure 4: Reward dynamics during RL. SD-GRPO lifts both the sampling reward Rsamp and the answer reward Rans over vanilla GRPO. (a) Masking strategy. During SFT, only the visual placeholder tokens (<|video_pad|>) are excluded from the loss because they encode raw vision features rather than predicted text, while timestamps and vision-boundary markers (<|vision_start|>, <|vision_end|>) are kept as supervision targets… view at source ↗

read the original abstract

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DynFrame makes sampling density a learnable token and adds segment-decoupled GRPO, but the abstract supplies no evidence that densities actually vary instead of defaulting high.

read the letter

The main takeaway is that DynFrame treats the temporal window and per-window sampling density as native tokens generated in one autoregressive pass, then uses Segment-Decoupled GRPO to split rollouts at the retrieval boundary and assign separate token-level advantages to the retrieval choice versus the answer. This directly targets the two gaps the abstract names: fixed density forcing repeated calls, and single-trajectory credit mixing retrieval and answer quality.

The paper does a clean job naming those limitations and offering a simple tokenized interface plus the split-advantage rule. The 4B model stays competitive with 7-8B baselines on the six listed benchmarks, and the 8B version claims new state-of-the-art on most; code is released, which helps.

The soft spot is exactly the one the stress-test flags. Nothing shown indicates the model produces query-dependent densities rather than always emitting the highest setting to protect answer reward. Without density histograms, ablation on the decoupling, or training curves, it is impossible to tell whether SD-GRPO actually separates the signals or whether the advantage for density tokens remains entangled with downstream accuracy. The abstract mentions curated datasets but gives no further detail on how the density decision is trained or verified.

This is for researchers working on video MLLMs who care about inference cost and credit assignment in reasoning traces. A reader who wants to test whether learnable density works in practice would find the idea and the code worth examining. The architectural move is straightforward enough that the paper deserves a serious referee even though the current evidence leaves the central empirical question open.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DynFrame, a framework for video multimodal large language models that tokenizes both the temporal window and the sampling density as native tokens in a single autoregressive pass to enable learnable multi-granularity evidence retrieval. It further proposes Segment-Decoupled GRPO (SD-GRPO) to split rollouts at the retrieval boundary and assign separate token-level advantages to the sampling decision and the answer generation. The models are trained on DM-CoT-74k and DM-RL-45k datasets, with DynFrame-4B reported as competitive with larger baselines and DynFrame-8B achieving new state-of-the-art results on most of six evaluated benchmarks.

Significance. If the results hold and the proposed mechanisms are shown to function as described without collapse to trivial strategies, this would be a significant contribution to the field of video understanding in MLLMs by addressing fixed sampling density and entangled credit assignment. The SOTA performance on multiple benchmarks would indicate practical utility, and the open code supports further research.

major comments (2)

[Abstract] Abstract: The central claim that the learnable span-density retrieval acquires multi-granularity evidence in a single step relies on the density tokens being query-dependent and non-trivial; however, the abstract provides no ablation studies, density histograms, or training curves to demonstrate that the sampling density varies meaningfully rather than defaulting to the maximum setting.
[Abstract] Abstract: The description of SD-GRPO as providing role-specific token-level advantages is presented without verification that the advantage signal for density tokens is disentangled from downstream answer quality; this is critical because if entangled, the mechanism may not differ from standard trajectory-level optimization.

minor comments (1)

The abstract could specify the exact benchmark metrics where SOTA is achieved and where it is not, for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer support of our central claims within the abstract. We address each major comment point by point below, drawing on evidence already present in the full manuscript while proposing targeted revisions to improve clarity.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the learnable span-density retrieval acquires multi-granularity evidence in a single step relies on the density tokens being query-dependent and non-trivial; however, the abstract provides no ablation studies, density histograms, or training curves to demonstrate that the sampling density varies meaningfully rather than defaulting to the maximum setting.

Authors: The abstract is a high-level summary and cannot accommodate full experimental details. Sections 4.2 and 4.3 of the manuscript present ablation studies, density histograms across query types, and training curves that confirm sampling density varies meaningfully in a query-dependent manner and does not collapse to the maximum setting. We will revise the abstract to add a concise clause referencing these empirical validations. revision: partial
Referee: [Abstract] Abstract: The description of SD-GRPO as providing role-specific token-level advantages is presented without verification that the advantage signal for density tokens is disentangled from downstream answer quality; this is critical because if entangled, the mechanism may not differ from standard trajectory-level optimization.

Authors: Section 3.2 details the segment-decoupled rollout splitting, and Section 5.2 plus Appendix B provide comparative experiments against standard GRPO that demonstrate performance gains attributable to the disentangled token-level advantages. These results indicate the advantage signals for density tokens are not fully entangled with answer quality. We will update the abstract to briefly note that this disentanglement is empirically supported. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation is self-contained with external benchmarks

full rationale

The paper's central claims rest on a tokenized retrieval interface and SD-GRPO advantage assignment, trained on explicitly curated external datasets (DM-CoT-74k, DM-RL-45k) and evaluated on six independent benchmarks. No equations, fitted parameters, or self-citations are presented that reduce the claimed multi-granularity retrieval or decoupled advantages to a quantity defined by the result itself. The abstract and description supply no self-referential definitions or renamings that collapse the output to the input by construction. This is the normal case of an independent empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework description implies standard autoregressive generation and GRPO-style RL but does not detail any new fitted constants or unstated assumptions beyond the two gaps identified.

pith-pipeline@v0.9.1-grok · 5894 in / 1273 out tokens · 46605 ms · 2026-06-29T17:47:32.790995+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 34 canonical work pages · 22 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Keqin Chen, Wenbin Ge, Xuejing Liu, Jialin Wang, Sibo Song, Kai Dang, Shijie Wang, Peng Wang, Jun Tang, et al. Qwen3-vl: Advancing multimodal perception across arbitrarily-resolution visual inputs.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Rextime: A benchmark suite for reasoning-across-time in videos.arXiv preprint arXiv:2406.19392, 2024

Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, and Yu-Chiang Frank Wang. Rextime: A benchmark suite for reasoning-across-time in videos.arXiv preprint arXiv:2406.19392, 2024

work page arXiv 2024
[4]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Videozoomer: Reinforcement- learned temporal focusing for long video reasoning.arXiv preprint arXiv:2512.22315, 2025

Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, and Yujiu Yang. Videozoomer: Reinforcement- learned temporal focusing for long video reasoning.arXiv preprint arXiv:2512.22315, 2025. URL https://arxiv.org/abs/2512.22315

work page arXiv 2025
[6]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025

2025
[8]

Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning.arXiv preprint arXiv:2509.24786, 2025

Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning.arXiv preprint arXiv:2509.24786, 2025

work page arXiv 2025
[9]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017

2017
[10]

Gemini 3 Pro Model Card

Google DeepMind. Gemini 3 Pro Model Card. https://deepmind.google/models/model-cards/ gemini-3-pro, May 2026. Model release: November 2025; last updated: May 2026. Accessed: 2026-05- 25

2026
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025. To appear in ICLR 2026

work page arXiv 2025
[13]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenbo Huang, Bingyi Jia, Zhenbang Zhai, et al. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[16]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017

2017
[18]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

arXiv preprint arXiv:2506.01908 , year=

Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025. URLhttps://arxiv.org/abs/2506.01908

work page arXiv 2025
[20]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, and Xuemeng Song. VideoTemp-o3: Harmonizing temporal grounding and video understanding in agentic thinking-with- videos.arXiv preprint arXiv:2602.07801, 2026. URLhttps://...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning.arXiv preprint arXiv:2507.06485, 2025

Zuyan Liu et al. Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning.arXiv preprint arXiv:2507.06485, 2025

work page arXiv 2025
[23]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, and Zhuochen Wang. Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. URL https://arxiv.org/ abs/2510.20579

work page arXiv 2025
[25]

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhi Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

2024
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Temporal grounding of activities using multimodal large language models

Young Chol Song. Temporal grounding of activities using multimodal large language models. arXiv preprint arXiv:2407.06157, 2024. URLhttps://arxiv.org/abs/2407.06157

work page arXiv 2024
[28]

Adaptive keyframe sampling for long video understanding

Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[29]

Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

GRPO-CARE Team. Grpo-care: Consistency-aware reinforcement learning for video mllms.arXiv preprint arXiv:2506.16141, 2025

work page arXiv 2025
[30]

S3 agent: Unlocking the power of vllm for zero-shot multi-modal sarcasm de- tection.ACM Transactions on Multimedia Computing, Communications and Applications, 21(11):1–16, 2025a

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025

work page arXiv 2025
[31]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. LVBench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022

2022
[34]

Can I trust your answer? Visually grounded video question answering

Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13204–13214, 2024

2024
[35]

Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

Ziang Yan, Xinhao Li, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[36]

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, and Lidong Bing. Longvt: Incentivizing “thinking with long videos” via native tool calling.arXiv preprint arXiv:2511.20785, 2025. To appear in CVPR 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

Huanjin Yao et al. Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

work page arXiv 2025
[38]

Frame-voyager: Learning to query frames for video large language models

Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Frame-voyager: Learning to query frames for video large language models. InAdvances in Neural Information Processing Systems, 2024

2024
[39]

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, et al. Video-o3: Native interleaved clue seeking for long video multi-hop reasoning.arXiv preprint arXiv:2601.23224, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2023
[41]

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

Haoji Zhang, Xin Gu, Jiawei Liu, Mingyang Li, Quan Wang, Zhibo Yang, Hongqing Yang, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. To appear in CVPR 2026

work page arXiv 2025
[42]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Kaituo Zhang et al. R1-vl: Learning to reason with multimodal large language models via reinforcement learning.arXiv preprint arXiv:2503.12937, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing “thinking with images” via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13 Supplementary Material A Data Generation Prompts We provide the complete prompts used to query G...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Read the user’s question and infer the precise visual evidence required
[48]

Explain why this portion is critical

Describe, step by step, how you would locate the segment that likely contains this evidence. Explain why this portion is critical
[49]

Based on the nature and speed of the target activity, recommend an appropriate FPS for analysis
[50]

zoom_in_cot

Conclude your “zoom_in_cot” reasoning by appending the time span and FPS tags.Do NOTprovide or guess the answer in this field
[51]

Using the visual information in the segment, reason step-by-step to reach the answer

After that, imagine you have now watchedONLYthat segment at the specified FPS. Using the visual information in the segment, reason step-by-step to reach the answer
[52]

answer_cot

This reasoning plus the final answer (wrapped in<answer></answer>) goes into “answer_cot”. FPS Selection Guideline • 1–2 fps: Static or quasi-static scenes with minimal temporal variation (e.g., reading text/signs, appearance attributes, object identification, colors, a person standing still). • 3–4 fps: Moderately dynamic scenarios with clear temporal pr...
[53]

Starting from the original temporal span in the input, expand the start and end boundaries by a random margin of 0.5–2.0 seconds on each side to provide additional temporal context
[54]

Move the expanded temporal span to the front of the thinking process, formatted as START - END (seconds with two decimal places, e.g.,29.50 - 74.00)
[55]

Keep the reasoning concise and non-repetitive

Remove redundant or repeated temporal descriptions in the thinking process. Keep the reasoning concise and non-repetitive
[56]

Based on the duration of theexpanded spanand the nature of the activity described, append an FPS tag <fps>N</fps> immediately after thetag
[57]

Max retrieval / injection

Extract only theoriginalstart and end timestamps as the final answer (e.g.,30.00 - 72.00). FPS Selection Guideline • 1–2 fps: Static or quasi-static scenes with minimal temporal variation (e.g., reading text/signs, appearance attributes, object identification, colors, a person standing still). • 3–4 fps: Moderately dynamic scenarios with clear temporal pr...

2021

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Keqin Chen, Wenbin Ge, Xuejing Liu, Jialin Wang, Sibo Song, Kai Dang, Shijie Wang, Peng Wang, Jun Tang, et al. Qwen3-vl: Advancing multimodal perception across arbitrarily-resolution visual inputs.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Rextime: A benchmark suite for reasoning-across-time in videos.arXiv preprint arXiv:2406.19392, 2024

Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, and Yu-Chiang Frank Wang. Rextime: A benchmark suite for reasoning-across-time in videos.arXiv preprint arXiv:2406.19392, 2024

work page arXiv 2024

[4] [4]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Videozoomer: Reinforcement- learned temporal focusing for long video reasoning.arXiv preprint arXiv:2512.22315, 2025

Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, and Yujiu Yang. Videozoomer: Reinforcement- learned temporal focusing for long video reasoning.arXiv preprint arXiv:2512.22315, 2025. URL https://arxiv.org/abs/2512.22315

work page arXiv 2025

[6] [6]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025

2025

[8] [8]

Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning.arXiv preprint arXiv:2509.24786, 2025

Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning.arXiv preprint arXiv:2509.24786, 2025

work page arXiv 2025

[9] [9]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017

2017

[10] [10]

Gemini 3 Pro Model Card

Google DeepMind. Gemini 3 Pro Model Card. https://deepmind.google/models/model-cards/ gemini-3-pro, May 2026. Model release: November 2025; last updated: May 2026. Accessed: 2026-05- 25

2026

[11] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025. To appear in ICLR 2026

work page arXiv 2025

[13] [13]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenbo Huang, Bingyi Jia, Zhenbang Zhai, et al. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Chat-univi: Unified visual representation empowers large language models with image and video understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[16] [16]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017

2017

[18] [18]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

arXiv preprint arXiv:2506.01908 , year=

Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025. URLhttps://arxiv.org/abs/2506.01908

work page arXiv 2025

[20] [20]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, and Xuemeng Song. VideoTemp-o3: Harmonizing temporal grounding and video understanding in agentic thinking-with- videos.arXiv preprint arXiv:2602.07801, 2026. URLhttps://...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[22] [22]

Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning.arXiv preprint arXiv:2507.06485, 2025

Zuyan Liu et al. Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning.arXiv preprint arXiv:2507.06485, 2025

work page arXiv 2025

[23] [23]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, and Zhuochen Wang. Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. URL https://arxiv.org/ abs/2510.20579

work page arXiv 2025

[25] [25]

Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhi Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

2024

[26] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

Temporal grounding of activities using multimodal large language models

Young Chol Song. Temporal grounding of activities using multimodal large language models. arXiv preprint arXiv:2407.06157, 2024. URLhttps://arxiv.org/abs/2407.06157

work page arXiv 2024

[28] [28]

Adaptive keyframe sampling for long video understanding

Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[29] [29]

Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

GRPO-CARE Team. Grpo-care: Consistency-aware reinforcement learning for video mllms.arXiv preprint arXiv:2506.16141, 2025

work page arXiv 2025

[30] [30]

S3 agent: Unlocking the power of vllm for zero-shot multi-modal sarcasm de- tection.ACM Transactions on Multimedia Computing, Communications and Applications, 21(11):1–16, 2025a

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025

work page arXiv 2025

[31] [31]

LVBench: An Extreme Long Video Understanding Benchmark

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. LVBench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022

2022

[34] [34]

Can I trust your answer? Visually grounded video question answering

Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13204–13214, 2024

2024

[35] [35]

Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

Ziang Yan, Xinhao Li, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[36] [36]

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, and Lidong Bing. Longvt: Incentivizing “thinking with long videos” via native tool calling.arXiv preprint arXiv:2511.20785, 2025. To appear in CVPR 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

Huanjin Yao et al. Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

work page arXiv 2025

[38] [38]

Frame-voyager: Learning to query frames for video large language models

Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Frame-voyager: Learning to query frames for video large language models. InAdvances in Neural Information Processing Systems, 2024

2024

[39] [39]

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, et al. Video-o3: Native interleaved clue seeking for long video multi-hop reasoning.arXiv preprint arXiv:2601.23224, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2023

[41] [41]

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

Haoji Zhang, Xin Gu, Jiawei Liu, Mingyang Li, Quan Wang, Zhibo Yang, Hongqing Yang, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. To appear in CVPR 2026

work page arXiv 2025

[42] [42]

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Kaituo Zhang et al. R1-vl: Learning to reason with multimodal large language models via reinforcement learning.arXiv preprint arXiv:2503.12937, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing “thinking with images” via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13 Supplementary Material A Data Generation Prompts We provide the complete prompts used to query G...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Read the user’s question and infer the precise visual evidence required

[48] [48]

Explain why this portion is critical

Describe, step by step, how you would locate the segment that likely contains this evidence. Explain why this portion is critical

[49] [49]

Based on the nature and speed of the target activity, recommend an appropriate FPS for analysis

[50] [50]

zoom_in_cot

Conclude your “zoom_in_cot” reasoning by appending the time span and FPS tags.Do NOTprovide or guess the answer in this field

[51] [51]

Using the visual information in the segment, reason step-by-step to reach the answer

After that, imagine you have now watchedONLYthat segment at the specified FPS. Using the visual information in the segment, reason step-by-step to reach the answer

[52] [52]

answer_cot

This reasoning plus the final answer (wrapped in<answer></answer>) goes into “answer_cot”. FPS Selection Guideline • 1–2 fps: Static or quasi-static scenes with minimal temporal variation (e.g., reading text/signs, appearance attributes, object identification, colors, a person standing still). • 3–4 fps: Moderately dynamic scenarios with clear temporal pr...

[53] [53]

Starting from the original temporal span in the input, expand the start and end boundaries by a random margin of 0.5–2.0 seconds on each side to provide additional temporal context

[54] [54]

Move the expanded temporal span to the front of the thinking process, formatted as <span>START - END</span> (seconds with two decimal places, e.g.,<span>29.50 - 74.00</span>)

[55] [55]

Keep the reasoning concise and non-repetitive

Remove redundant or repeated temporal descriptions in the thinking process. Keep the reasoning concise and non-repetitive

[56] [56]

Based on the duration of theexpanded spanand the nature of the activity described, append an FPS tag <fps>N</fps> immediately after the<span>tag

[57] [57]

Max retrieval / injection

Extract only theoriginalstart and end timestamps as the final answer (e.g.,30.00 - 72.00). FPS Selection Guideline • 1–2 fps: Static or quasi-static scenes with minimal temporal variation (e.g., reading text/signs, appearance attributes, object identification, colors, a person standing still). • 3–4 fps: Moderately dynamic scenarios with clear temporal pr...

2021