pith. sign in

arxiv: 2605.26680 · v1 · pith:FECFVZXInew · submitted 2026-05-26 · 💻 cs.CV · cs.AI

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Pith reviewed 2026-06-29 17:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords dynamic frame samplingvideo multimodal LLMsadaptive retrievalsegment-decoupled GRPOtemporal window tokensmulti-granularity evidencevideo reasoning
0
0 comments X

The pith

A video multimodal model emits both the chosen time window and its sampling density as native tokens inside one autoregressive generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing video reasoning systems let models decide where to look but keep the number of frames per window fixed, so fine detail requires repeated retrieval calls that lengthen context and mix credit between retrieval choices and answers. DynFrame turns the temporal window and the per-window frame density into tokens the model produces alongside its answer in a single pass, allowing one retrieval step to return evidence at multiple scales. A new reinforcement-learning rule, Segment-Decoupled GRPO, splits each rollout at the retrieval boundary and gives separate advantage signals to the sampling tokens and the answer tokens. The resulting 4B and 8B models reach or exceed strong fixed-density baselines on six video benchmarks after training on the authors' curated DM-CoT-74k and DM-RL-45k datasets.

Core claim

DynFrame emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Segment-Decoupled GRPO splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer.

What carries the argument

Learnable span-density retrieval, in which the model produces both the temporal window and the sampling density as tokens in the same generation sequence used for the final answer.

If this is right

  • Multi-granularity evidence can be obtained in one retrieval step instead of repeated calls.
  • Role-specific advantages allow independent optimization of sampling decisions and answer generation.
  • 4B-scale models become competitive with 7B-8B fixed-density systems across six video benchmarks.
  • 8B-scale models reach new state-of-the-art numbers on most of the same benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenized retrieval interface could be applied to other sequential inputs such as long documents or audio streams where inspection granularity varies.
  • If density tokens prove stable, agentic systems that currently loop over retrieval steps might collapse those loops into single generations.
  • Curating data that explicitly rewards varied density choices may become necessary to prevent the model from ignoring the new token.
  • The approach implicitly assumes that density decisions benefit from the same gradient signal as answer tokens once credit is decoupled.

Load-bearing premise

A single autoregressive pass can reliably output both accurate retrieval tokens for window and density and the final answer without the model collapsing to trivial or fixed-density strategies.

What would settle it

Train the model on the same data but remove the density token from the output vocabulary and measure whether benchmark scores fall back to the level of fixed-density baselines.

Figures

Figures reproduced from arXiv: 2605.26680 by Guanghao Zhang, Haobing Tang, Hao Jiang, Jinlong Liu, Le Zhang, Longxiang Zhang, Mushui Liu, Peng Zhang, Pipei Huang, Wanggui He, Weilong Dai, Yan Xia, Zhenhao Peng.

Figure 1
Figure 1. Figure 1: Textual CoT vs. DynFrame. Textual CoT (left) reasons over a fixed sparse frame set and misses the airborne segment, yielding a wrong rotation count. DynFrame (right) emits <span> and <fps> tokens within its reasoning to retrieve a denser, temporally focused frame set, and then continues reasoning over the augmented visual context to reach the correct answer. A growing thinking-with-video line of work addre… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of DynFrame. The model interleaves tokenized temporal retrieval (<span>, <fps>) with on-the-fly frame injection inside a single autoregressive pass. SD-GRPO splits each rollout at the retrieval boundary and applies segment-specific advantages so that the sampling decision and the answer reasoning are credited separately. every token in the rollout, entangling the credit for committing a retrieval … view at source ↗
Figure 3
Figure 3. Figure 3: Data curation pipeline for DM-CoT-74k and DM-RL-45k. (a) Sources: VideoQA, grounded VideoQA, and temporal grounding benchmarks. (b) For VideoQA without temporal annotations, Gemini selects the evidence window and sampling rate, then answers under a “clip-only” constraint enforced at the prompt level. (c) For temporal grounding, ground-truth windows are reused; Gemini only selects an activity-adaptive FPS. … view at source ↗
Figure 4
Figure 4. Figure 4: Reward dynamics during RL. SD-GRPO lifts both the sampling reward Rsamp and the answer reward Rans over vanilla GRPO. (a) Masking strategy. During SFT, only the visual placeholder tokens (<|video_pad|>) are excluded from the loss because they encode raw vision features rather than predicted text, while timestamps and vision-boundary markers (<|vision_start|>, <|vision_end|>) are kept as supervision targets… view at source ↗
read the original abstract

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DynFrame, a framework for video multimodal large language models that tokenizes both the temporal window and the sampling density as native tokens in a single autoregressive pass to enable learnable multi-granularity evidence retrieval. It further proposes Segment-Decoupled GRPO (SD-GRPO) to split rollouts at the retrieval boundary and assign separate token-level advantages to the sampling decision and the answer generation. The models are trained on DM-CoT-74k and DM-RL-45k datasets, with DynFrame-4B reported as competitive with larger baselines and DynFrame-8B achieving new state-of-the-art results on most of six evaluated benchmarks.

Significance. If the results hold and the proposed mechanisms are shown to function as described without collapse to trivial strategies, this would be a significant contribution to the field of video understanding in MLLMs by addressing fixed sampling density and entangled credit assignment. The SOTA performance on multiple benchmarks would indicate practical utility, and the open code supports further research.

major comments (2)
  1. [Abstract] Abstract: The central claim that the learnable span-density retrieval acquires multi-granularity evidence in a single step relies on the density tokens being query-dependent and non-trivial; however, the abstract provides no ablation studies, density histograms, or training curves to demonstrate that the sampling density varies meaningfully rather than defaulting to the maximum setting.
  2. [Abstract] Abstract: The description of SD-GRPO as providing role-specific token-level advantages is presented without verification that the advantage signal for density tokens is disentangled from downstream answer quality; this is critical because if entangled, the mechanism may not differ from standard trajectory-level optimization.
minor comments (1)
  1. The abstract could specify the exact benchmark metrics where SOTA is achieved and where it is not, for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer support of our central claims within the abstract. We address each major comment point by point below, drawing on evidence already present in the full manuscript while proposing targeted revisions to improve clarity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the learnable span-density retrieval acquires multi-granularity evidence in a single step relies on the density tokens being query-dependent and non-trivial; however, the abstract provides no ablation studies, density histograms, or training curves to demonstrate that the sampling density varies meaningfully rather than defaulting to the maximum setting.

    Authors: The abstract is a high-level summary and cannot accommodate full experimental details. Sections 4.2 and 4.3 of the manuscript present ablation studies, density histograms across query types, and training curves that confirm sampling density varies meaningfully in a query-dependent manner and does not collapse to the maximum setting. We will revise the abstract to add a concise clause referencing these empirical validations. revision: partial

  2. Referee: [Abstract] Abstract: The description of SD-GRPO as providing role-specific token-level advantages is presented without verification that the advantage signal for density tokens is disentangled from downstream answer quality; this is critical because if entangled, the mechanism may not differ from standard trajectory-level optimization.

    Authors: Section 3.2 details the segment-decoupled rollout splitting, and Section 5.2 plus Appendix B provide comparative experiments against standard GRPO that demonstrate performance gains attributable to the disentangled token-level advantages. These results indicate the advantage signals for density tokens are not fully entangled with answer quality. We will update the abstract to briefly note that this disentanglement is empirically supported. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation is self-contained with external benchmarks

full rationale

The paper's central claims rest on a tokenized retrieval interface and SD-GRPO advantage assignment, trained on explicitly curated external datasets (DM-CoT-74k, DM-RL-45k) and evaluated on six independent benchmarks. No equations, fitted parameters, or self-citations are presented that reduce the claimed multi-granularity retrieval or decoupled advantages to a quantity defined by the result itself. The abstract and description supply no self-referential definitions or renamings that collapse the output to the input by construction. This is the normal case of an independent empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework description implies standard autoregressive generation and GRPO-style RL but does not detail any new fitted constants or unstated assumptions beyond the two gaps identified.

pith-pipeline@v0.9.1-grok · 5894 in / 1273 out tokens · 46605 ms · 2026-06-29T17:47:32.790995+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 34 canonical work pages · 22 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Keqin Chen, Wenbin Ge, Xuejing Liu, Jialin Wang, Sibo Song, Kai Dang, Shijie Wang, Peng Wang, Jun Tang, et al. Qwen3-vl: Advancing multimodal perception across arbitrarily-resolution visual inputs.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Rextime: A benchmark suite for reasoning-across-time in videos.arXiv preprint arXiv:2406.19392, 2024

    Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, and Yu-Chiang Frank Wang. Rextime: A benchmark suite for reasoning-across-time in videos.arXiv preprint arXiv:2406.19392, 2024

  4. [4]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

  5. [5]

    Videozoomer: Reinforcement- learned temporal focusing for long video reasoning.arXiv preprint arXiv:2512.22315, 2025

    Yang Ding, Yizhen Zhang, Xin Lai, Ruihang Chu, and Yujiu Yang. Videozoomer: Reinforcement- learned temporal focusing for long video reasoning.arXiv preprint arXiv:2512.22315, 2025. URL https://arxiv.org/abs/2512.22315

  6. [6]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  7. [7]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025

  8. [8]

    Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning.arXiv preprint arXiv:2509.24786, 2025

    Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning.arXiv preprint arXiv:2509.24786, 2025

  9. [9]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. InProceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017

  10. [10]

    Gemini 3 Pro Model Card

    Google DeepMind. Gemini 3 Pro Model Card. https://deepmind.google/models/model-cards/ gemini-3-pro, May 2026. Model release: November 2025; last updated: May 2026. Accessed: 2026-05- 25

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [12]

    Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025

    Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025. To appear in ICLR 2026

  13. [13]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenbo Huang, Bingyi Jia, Zhenbang Zhai, et al. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  14. [14]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  15. [15]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  16. [16]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  17. [17]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017

  18. [18]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 11

  19. [19]

    arXiv preprint arXiv:2506.01908 , year=

    Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025. URLhttps://arxiv.org/abs/2506.01908

  20. [20]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025

  21. [21]

    VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

    Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, and Xuemeng Song. VideoTemp-o3: Harmonizing temporal grounding and video understanding in agentic thinking-with- videos.arXiv preprint arXiv:2602.07801, 2026. URLhttps://...

  22. [22]

    Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning.arXiv preprint arXiv:2507.06485, 2025

    Zuyan Liu et al. Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning.arXiv preprint arXiv:2507.06485, 2025

  23. [23]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023

  24. [24]

    Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, and Zhuochen Wang. Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. URL https://arxiv.org/ abs/2510.20579

  25. [25]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhi Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. InThe Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  27. [27]

    Temporal grounding of activities using multimodal large language models

    Young Chol Song. Temporal grounding of activities using multimodal large language models. arXiv preprint arXiv:2407.06157, 2024. URLhttps://arxiv.org/abs/2407.06157

  28. [28]

    Adaptive keyframe sampling for long video understanding

    Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  29. [29]

    Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

    GRPO-CARE Team. Grpo-care: Consistency-aware reinforcement learning for video mllms.arXiv preprint arXiv:2506.16141, 2025

  30. [30]

    S3 agent: Unlocking the power of vllm for zero-shot multi-modal sarcasm de- tection.ACM Transactions on Multimedia Computing, Communications and Applications, 21(11):1–16, 2025a

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025

  31. [31]

    LVBench: An Extreme Long Video Understanding Benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. LVBench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024

  32. [32]

    Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025

  33. [33]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  34. [34]

    Can I trust your answer? Visually grounded video question answering

    Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can I trust your answer? Visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13204–13214, 2024

  35. [35]

    Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

    Ziang Yan, Xinhao Li, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  36. [36]

    LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

    Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, and Lidong Bing. Longvt: Incentivizing “thinking with long videos” via native tool calling.arXiv preprint arXiv:2511.20785, 2025. To appear in CVPR 2026. 12

  37. [37]

    Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

    Huanjin Yao et al. Focus: Efficient keyframe selection for long video understanding.arXiv preprint arXiv:2510.27280, 2025

  38. [38]

    Frame-voyager: Learning to query frames for video large language models

    Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Frame-voyager: Learning to query frames for video large language models. InAdvances in Neural Information Processing Systems, 2024

  39. [39]

    Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

    Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, et al. Video-o3: Native interleaved clue seeking for long video multi-hop reasoning.arXiv preprint arXiv:2601.23224, 2026

  40. [40]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  41. [41]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

    Haoji Zhang, Xin Gu, Jiawei Liu, Mingyang Li, Quan Wang, Zhibo Yang, Hongqing Yang, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. To appear in CVPR 2026

  42. [42]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Kaituo Zhang et al. R1-vl: Learning to reason with multimodal large language models via reinforcement learning.arXiv preprint arXiv:2503.12937, 2025

  43. [43]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  44. [44]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing “thinking with images” via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

  45. [45]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024

  46. [46]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 13 Supplementary Material A Data Generation Prompts We provide the complete prompts used to query G...

  47. [47]

    Read the user’s question and infer the precise visual evidence required

  48. [48]

    Explain why this portion is critical

    Describe, step by step, how you would locate the segment that likely contains this evidence. Explain why this portion is critical

  49. [49]

    Based on the nature and speed of the target activity, recommend an appropriate FPS for analysis

  50. [50]

    zoom_in_cot

    Conclude your “zoom_in_cot” reasoning by appending the time span and FPS tags.Do NOTprovide or guess the answer in this field

  51. [51]

    Using the visual information in the segment, reason step-by-step to reach the answer

    After that, imagine you have now watchedONLYthat segment at the specified FPS. Using the visual information in the segment, reason step-by-step to reach the answer

  52. [52]

    answer_cot

    This reasoning plus the final answer (wrapped in<answer></answer>) goes into “answer_cot”. FPS Selection Guideline • 1–2 fps: Static or quasi-static scenes with minimal temporal variation (e.g., reading text/signs, appearance attributes, object identification, colors, a person standing still). • 3–4 fps: Moderately dynamic scenarios with clear temporal pr...

  53. [53]

    Starting from the original temporal span in the input, expand the start and end boundaries by a random margin of 0.5–2.0 seconds on each side to provide additional temporal context

  54. [54]

    Move the expanded temporal span to the front of the thinking process, formatted as <span>START - END</span> (seconds with two decimal places, e.g.,<span>29.50 - 74.00</span>)

  55. [55]

    Keep the reasoning concise and non-repetitive

    Remove redundant or repeated temporal descriptions in the thinking process. Keep the reasoning concise and non-repetitive

  56. [56]

    Based on the duration of theexpanded spanand the nature of the activity described, append an FPS tag <fps>N</fps> immediately after the<span>tag

  57. [57]

    Max retrieval / injection

    Extract only theoriginalstart and end timestamps as the final answer (e.g.,30.00 - 72.00). FPS Selection Guideline • 1–2 fps: Static or quasi-static scenes with minimal temporal variation (e.g., reading text/signs, appearance attributes, object identification, colors, a person standing still). • 3–4 fps: Moderately dynamic scenarios with clear temporal pr...