pith. machine review for the scientific record.

arxiv: 2605.05848 · v2 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long video understanding · token compression · query-adaptive routing · video large multimodal models · frame relevance scoring · efficient inference · budgeted evidence allocation

The pith

VideoRouter uses dual query-adaptive routers to allocate visual tokens efficiently in long videos, cutting token usage by up to 67.9 percent while improving accuracy over the InternVL base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long videos produce excessively long visual-token sequences that raise memory and latency costs in large multimodal models. Existing compression techniques tend to apply fixed or only weakly query-aware policies, wasting tokens on irrelevant frames and under-allocating to critical evidence. VideoRouter trains a Semantic Router to select between broad temporal coverage and high-resolution preservation, and an Image Router that scores frame relevance from early LLM layers. Both routers are supervised on the newly built Video-QTR-10K and Video-FLR-200K datasets. On VideoMME, MLVU, and LongVideoBench the approach delivers consistent gains under equal or lower token budgets.

Core claim

The central claim is that a query-adaptive dual-router framework, with the Semantic Router predicting the dominant allocation policy and the Image Router using early layers to score frame relevance, enables budgeted evidence allocation that preserves detail on critical frames and compresses the rest, producing up to 67.9 percent token reduction and higher accuracy than the InternVL baseline on standard long-video benchmarks.

What carries the argument

The query-adaptive dual-router framework: Semantic Router selects allocation policy between coverage and high-resolution modes, while Image Router scores frame relevance via early LLM layers.
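A minimal sketch of that dual-routing control flow, assuming hypothetical module shapes and a simple normalized-score budget rule; the paper's actual router architectures, layer choices, and allocation rules are not specified at this level of detail.

```python
import torch
import torch.nn as nn

class SemanticRouter(nn.Module):
    """Predicts the dominant allocation policy from a query embedding:
    0 = broad temporal coverage, 1 = adaptive high-resolution preservation."""
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.head = nn.Linear(dim, 2)

    def forward(self, query_emb: torch.Tensor) -> int:
        return int(self.head(query_emb).argmax(dim=-1))

class ImageRouter(nn.Module):
    """Scores per-frame relevance from early-LLM-layer frame features."""
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, dim) -> relevance score in [0, 1] per frame
        return torch.sigmoid(self.head(frame_feats)).squeeze(-1)

def route(query_emb, frame_feats, semantic_router, image_router):
    """Return each frame's share of the visual-token budget."""
    num_frames = frame_feats.shape[0]
    if semantic_router(query_emb) == 0:
        # Coverage mode: spread the budget uniformly across sampled frames.
        return torch.full((num_frames,), 1.0 / num_frames)
    # Adaptive mode: concentrate the budget on query-relevant frames.
    scores = image_router(frame_feats)
    return scores / scores.sum()
```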

If this is right

  • Longer videos become feasible under fixed memory and latency limits by concentrating tokens on query-relevant segments.
  • Aggressive compression can be applied to low-relevance frames without uniform quality loss across the entire video (see the allocation sketch after this list).
  • Separate supervision of policy selection and frame scoring allows targeted training that improves both components.
  • The method outperforms fixed-policy baselines on VideoMME, MLVU, and LongVideoBench under comparable budgets.
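A toy allocation under a fixed budget, illustrating the first two implications above; the proportional rule and the floor/cap values are illustrative assumptions, not the paper's scheme.

```python
import torch

def allocate_tokens(relevance: torch.Tensor, total_budget: int,
                    floor: int = 4, cap: int = 256) -> torch.Tensor:
    """Split a fixed visual-token budget across frames in proportion to
    relevance, with a floor so no frame is dropped outright and a cap so
    no single frame exhausts the budget. After clamping, the sum may
    deviate slightly from total_budget."""
    weights = relevance / relevance.sum()
    raw = (weights * total_budget).long()
    return raw.clamp(min=floor, max=cap)

# Example: 8 frames, one highly relevant segment, a 512-token budget.
rel = torch.tensor([0.02, 0.03, 0.05, 0.40, 0.35, 0.05, 0.05, 0.05])
print(allocate_tokens(rel, total_budget=512))
# ~ tensor([ 10,  15,  25, 204, 179,  25,  25,  25]): evidence frames keep
# detail; low-relevance frames are compressed hard.
```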

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dual-routing logic could be tested on other sequence-heavy inputs such as long audio or multi-page documents.
  • Dynamic per-query token budgets might enable real-time video chat systems that scale detail according to question complexity.
  • The routing pattern could be ported to other base multimodal models to measure whether the efficiency gains hold beyond InternVL.

Load-bearing premise

Routers trained on the Video-QTR-10K and Video-FLR-200K datasets generalize to arbitrary unseen queries and long videos without substantial performance degradation.

What would settle it

Running VideoRouter on a fresh long-video benchmark containing query types absent from the training sets and checking whether accuracy falls below the InternVL baseline at the same token budget.
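A schematic of that decisive comparison; the model callables and benchmark loader here are hypothetical stand-ins, not tooling shipped with the paper.

```python
from typing import Callable, Iterable, Tuple

def accuracy_at_budget(model: Callable, benchmark: Iterable[Tuple[str, str, str]],
                       token_budget: int) -> float:
    """benchmark yields (video_path, question, answer) triples; `model` is a
    callable that answers a question about a video under a visual-token cap."""
    correct, total = 0, 0
    for video, question, answer in benchmark:
        prediction = model(video, question, max_visual_tokens=token_budget)
        correct += int(prediction == answer)
        total += 1
    return correct / max(total, 1)

# The generalization claim survives only if, on held-out query types,
# accuracy_at_budget(videorouter, bench, B) stays at or above
# accuracy_at_budget(internvl_baseline, bench, B) at the same budget B.
```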

Figures

Figures reproduced from arXiv: 2605.05848 by Ge Li, Kuanwei Lin, Wenhao Zhang.

Figure 1. Left: Frame-level dynamic token allocation by VideoRouter compared to dense and uniform baselines. Right: Accuracy vs. efficiency trade-off on Video-MME, demonstrating higher performance with significantly fewer visual tokens. view at source ↗
Figure 2. Overview of the VideoRouter framework. The Semantic Router predicts the dominant… view at source ↗
Figure 3. Overview of the data construction pipeline. Stage 1 generates minimal sufficient frame-level… view at source ↗
Figure 4. (a) Unified analysis on LongVideoBench: The interplay between Token Budget, Sampled… view at source ↗
Figure 5. Empirical A/B probing study for Semantic Router supervision. The left panel shows… view at source ↗
Figure 6. Training curves for the Semantic Router and Image Router. view at source ↗
Figure 7. Prompt template for dense captioning, designed to extract objective visual details from… view at source ↗
Figure 8. Prompt template for reverse-QA generation, encouraging questions that rely on localized… view at source ↗
Figure 9. Prompt template for rule-guided QA generation based on statistically selected allocation… view at source ↗
Figure 10. Prompt template for consistency and counterfactual validation, ensuring visual sufficiency… view at source ↗
Figure 11. Prompt template for logical-coherence filtering, identifying physical contradictions and… view at source ↗
read the original abstract

Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VideoRouter, a query-adaptive dual-router framework built on InternVL for efficient long-video understanding. The Semantic Router selects between broad temporal coverage and adaptive high-resolution preservation policies, while the Image Router scores frame relevance using early LLM layers to enable aggressive compression on less relevant frames. Both routers are trained on newly introduced datasets Video-QTR-10K (for allocation-policy supervision) and Video-FLR-200K (for frame-relevance supervision). Experiments on VideoMME, MLVU, and LongVideoBench report consistent improvements over the InternVL baseline under comparable or lower token budgets, with up to 67.9% token reduction.

Significance. If the empirical results hold under rigorous validation, the work addresses a key scalability issue in video multimodal models by introducing query-dependent routing that avoids uniform compression policies. The dual-router design and creation of targeted supervision datasets for policy and relevance prediction represent a concrete step toward budgeted evidence allocation. Strengths include the explicit construction of Video-QTR-10K and Video-FLR-200K for reproducible router training and the reported token savings across three long-video benchmarks.

major comments (2)
  1. [§4] §4 (Experiments): The central performance claim of consistent gains and up to 67.9% token reduction over InternVL rests on the Semantic and Image Routers generalizing from Video-QTR-10K and Video-FLR-200K. No explicit out-of-distribution testing, dataset diversity statistics, or ablation on query/video distribution shift is provided, leaving open whether the adaptive allocation is robust or an artifact of in-distribution performance on the test benchmarks.
  2. [§3.2] §3.2 (Image Router): The description states that early LLM layers score frame relevance, but the precise layer selection, supervision signal from Video-FLR-200K, and how relevance scores translate into per-frame token budgets are not formalized with equations or pseudocode. This makes it difficult to assess whether the reported compression is parameter-free or depends on tuned thresholds.
minor comments (2)
  1. [Abstract] The abstract and §1 claim 'up to 67.9% token reduction' without specifying the exact configuration (e.g., which benchmark, budget level, or router setting) that achieves this maximum; a table or figure reference would improve clarity.
  2. [§2] Missing reference to prior query-aware compression methods (e.g., in related work) that also use early-layer features for relevance; adding 1-2 citations would better situate the Image Router contribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central performance claim of consistent gains and up to 67.9% token reduction over InternVL rests on the Semantic and Image Routers generalizing from Video-QTR-10K and Video-FLR-200K. No explicit out-of-distribution testing, dataset diversity statistics, or ablation on query/video distribution shift is provided, leaving open whether the adaptive allocation is robust or an artifact of in-distribution performance on the test benchmarks.

    Authors: We acknowledge the value of explicit OOD testing for robustness claims. Video-QTR-10K and Video-FLR-200K were constructed from diverse video sources and query types drawn from the same distribution families as the evaluation benchmarks to promote generalization. To directly address the concern, we will add dataset diversity statistics (e.g., distributions of video durations, query categories, and scene complexity) and a new ablation on controlled distribution shifts in the revised §4. The consistent gains across VideoMME, MLVU, and LongVideoBench already provide supporting evidence, but the additional analysis will clarify limitations and strengthen the results. revision: partial

  2. Referee: [§3.2] §3.2 (Image Router): The description states that early LLM layers score frame relevance, but the precise layer selection, supervision signal from Video-FLR-200K, and how relevance scores translate into per-frame token budgets are not formalized with equations or pseudocode. This makes it difficult to assess whether the reported compression is parameter-free or depends on tuned thresholds.

    Authors: We agree that greater formalization will improve reproducibility and clarity. In the revised manuscript, we will add equations in §3.2 specifying the exact early LLM layer used for scoring (the 5th transformer layer), the supervision signal (binary cross-entropy on frame-relevance labels from Video-FLR-200K), and the mapping of relevance scores to per-frame token budgets via an adaptive threshold function. We will also include pseudocode for the full routing procedure in the appendix. The approach is not parameter-free, as it relies on learned routers and a small set of allocation hyperparameters; these will be explicitly listed and justified. revision: yes
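A hedged sketch of the formalization promised in the response above: frame features from an early transformer layer are scored and trained with binary cross-entropy against Video-FLR-200K-style labels, then mapped to per-frame budgets by a threshold. The layer index and loss follow the rebuttal text; the threshold rule, feature dimension, and budget values are illustrative assumptions.

```python
import torch
import torch.nn as nn

scorer = nn.Linear(4096, 1)   # relevance head over early-layer frame features
bce = nn.BCEWithLogitsLoss()

def training_step(frame_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """frame_feats: (num_frames, 4096) features taken from the 5th transformer
    layer; labels: (num_frames,) binary frame-relevance supervision."""
    logits = scorer(frame_feats).squeeze(-1)
    return bce(logits, labels.float())

def budgets_from_scores(logits: torch.Tensor, hi: int = 64, lo: int = 4,
                        tau: float = 0.5) -> torch.Tensor:
    """Adaptive thresholding: frames whose relevance probability exceeds tau
    keep a high-resolution token budget; the rest are compressed aggressively."""
    probs = torch.sigmoid(logits)
    return torch.where(probs > tau, torch.tensor(hi), torch.tensor(lo))
```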

Circularity Check

0 steps flagged

No significant circularity in VideoRouter derivation or claims

full rationale

The paper presents VideoRouter as a new query-adaptive dual-router architecture built on InternVL, with routers trained on two newly introduced supervision datasets (Video-QTR-10K and Video-FLR-200K). Central performance claims rest on empirical results from independent external benchmarks (VideoMME, MLVU, LongVideoBench) that compare against the InternVL baseline under matched token budgets. No equations, derivations, or fitted parameters reduce by construction to the quantities they are meant to predict; no self-citation chains justify uniqueness or load-bearing premises; and no ansatz or renaming of known results is invoked. The method and evaluation stand against external data, yielding no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of the newly proposed Semantic and Image Routers plus the utility of the two constructed supervision datasets. No explicit free parameters, mathematical axioms, or invented physical entities are stated in the abstract; standard deep-learning training assumptions are implicit.

pith-pipeline@v0.9.0 · 5485 in / 1068 out tokens · 44276 ms · 2026-05-11T01:54:42.301935+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 9 internal anchors

  1. [1]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  2. [2]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024

  3. [3]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024

  4. [4]

    Llava-next-video: Yielding zero-shot video understanding from image-language models

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next-video: Yielding zero-shot video understanding from image-language models. arXiv preprint arXiv:2404.08129, 2024

  5. [5]

    Pllava: Parameter-free llava extension from images to videos for video dense captioning

    Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024

  6. [6]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024

  7. [7]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025

  8. [8]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025

  9. [9]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024

  10. [10]

    Sparsevlm: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024

  11. [11]

    Metok: Multi-stage event-based token compression for efficient long video understanding

    Mengyue Wang, Shuo Chen, Kristian Kersting, Volker Tresp, and Yunpu Ma. Metok: Multi-stage event-based token compression for efficient long video understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18881–18895, 2025

  12. [12]

    Dtoma: Training-free dynamic token manipulation for long video understanding

    Bowen Yuan, Sisi You, and Bing-Kun Bao. Dtoma: Training-free dynamic token manipulation for long video understanding. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 2314–2322, 2025

  13. [13]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  14. [14]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022

  15. [15]

    Not all patches are what you need: Expediting vision transformers via token reorganizations

    Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022

  16. [16]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021

  17. [17]

    Tokenlearner: Adaptive space-time tokenization for videos

    Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. Tokenlearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021

  18. [18]

    Less is more, but where? Dynamic token compression via llm-guided keyframe prior

    Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, and Zhuotao Tian. Less is more, but where? Dynamic token compression via llm-guided keyframe prior. arXiv preprint arXiv:2512.06866, 2025

  19. [19]

    Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024

  20. [20]

    Hidrop: Hierarchical vision token reduction in mllms via late injection, concave pyramid pruning, and early exit

    Hao Wu, Yingqi Fan, Jinyang Dai, Junlong Tong, Yunpu Ma, and Xiaoyu Shen. Hidrop: Hierarchical vision token reduction in mllms via late injection, concave pyramid pruning, and early exit. arXiv preprint arXiv:2602.23699, 2026

  21. [21]

    Seeing more, saying more: Lightweight language experts are dynamic video token compressors

    Xiangchen Wang, Jinrui Zhang, Teng Wang, Haigang Zhang, and Feng Zheng. Seeing more, saying more: Lightweight language experts are dynamic video token compressors. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 541–558, 2025

  22. [22]

    Vico: A training strategy towards semantic aware dynamic high-resolution

    Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Yanting Zhang, Yu Qiao, and Wenhai Wang. Vico: A training strategy towards semantic aware dynamic high-resolution. arXiv preprint arXiv:2510.12793, 2025

  23. [23]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025

  24. [24]

    Flexible frame selection for efficient video reasoning

    Shyamal Buch, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. Flexible frame selection for efficient video reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29071–29082, 2025

  25. [25]

    Divide, then ground: Adapting frame selection to query types for long-form video understanding

    Jialuo Li, Bin Li, Jiahao Li, and Yan Lu. Divide, then ground: Adapting frame selection to query types for long-form video understanding. arXiv preprint arXiv:2512.04000, 2025

  26. [26]

    Crop: Contextual region-oriented visual token pruning

    Jiawei Guo, Feifei Zhai, Pu Jian, Qianrun Wei, and Yu Zhou. Crop: Contextual region-oriented visual token pruning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9767–9783, 2025

  27. [27]

    Language-guided temporal token pruning for efficient videollm processing

    Yogesh Kumar. Language-guided temporal token pruning for efficient videollm processing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8935–8942, 2025

  28. [28]

    Prunevid: Visual token pruning for efficient video large language models

    Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19959–19973, 2025

  29. [29]

    Seeing the forest and the trees: Query-aware tokenizer for long-video multimodal language models

    Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wenqing Wu, et al. Seeing the forest and the trees: Query-aware tokenizer for long-video multimodal language models. arXiv preprint arXiv:2511.11910, 2025

  30. [30]

    Generative frame sampler for long video understanding

    Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, and Junnan Li. Generative frame sampler for long video understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17900–17917, 2025

  31. [31]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3272–3283, 2025

  32. [32]

    Long-tailed distribution-aware router for mixture-of-experts in large vision-language model

    Chaoxiang Cai, Longrong Yang, Minghe Weng, Xuewei Li, Zequn Qin, and Xi Li. Long-tailed distribution-aware router for mixture-of-experts in large vision-language model. arXiv preprint arXiv:2507.01351, 2025

  33. [33]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24108–24118, 2025

  34. [34]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

  35. [35]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025

  36. [36]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  37. [37]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  38. [38]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024

  39. [39]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024

  40. [40]

    mplug-owl3: Towards long image-sequence understanding in multi-modal large language models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024

  41. [41]

    Nvila: Efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4122–4134, 2025

  42. [42]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

  43. [43]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024

  44. [44]

    Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024

  45. [45]

    Kangaroo: A powerful video-language model supporting long-context video input

    Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. International Journal of Computer Vision, 134(3):114, 2026

  46. [46]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024

  47. [47]

    Longvila: Scaling long-context visual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024

  48. [48]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024

  49. [49]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  50. [50]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025