pith. machine review for the scientific record.

arxiv: 2501.12386 · v3 · submitted 2025-01-21 · 💻 cs.CV

Recognition: 2 theorem links


InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords video understanding · multimodal large language models · long context modeling · direct preference optimization · token compression · object tracking · video segmentation · temporal structure

The pith

Long and rich context modeling lets video MLLMs process at least six times longer inputs while gaining object tracking and segmentation skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to strengthen video multimodal large language models through long and rich context modeling. It injects dense vision task annotations into the models via direct preference optimization and builds compact spatiotemporal representations with adaptive hierarchical token compression. The goal is to sharpen the models' grasp of fine details and long temporal patterns in video. A sympathetic reader would care because these steps aim to unlock stronger innate focus and memory abilities for practical video understanding tasks.
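
Since direct preference optimization carries the annotation-injection step, a minimal sketch of the standard DPO objective may help orient readers. This is the generic pairwise loss from the DPO literature, not the authors' exact recipe; the beta value, tensor names, and the pairing of "chosen" responses with dense annotations are illustrative assumptions.

```python
# Minimal sketch of the standard DPO objective (generic, not the paper's recipe).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise preference loss over summed per-sequence log-probabilities.

    In this paper's setting, 'chosen' responses would plausibly be outputs
    consistent with the dense vision annotations (e.g. correct tracks or
    masks); that pairing scheme is an assumption, not stated in the review.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimized when the policy ranks chosen above rejected by more than
    # the frozen reference model does.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```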

Core claim

The paper claims that its design for long and rich context modeling, which incorporates dense vision task annotations into MLLMs using direct preference optimization and creates compact spatiotemporal representations through adaptive hierarchical token compression, substantially improves video MLLM performance. This produces stronger results on mainstream short and long video understanding benchmarks, allows the models to memorize and use video inputs at least six times longer than before, and equips them with specialized vision capabilities such as object tracking and segmentation.

What carries the argument

Long and rich context (LRC) modeling, which injects dense vision task annotations via direct preference optimization and forms compact spatiotemporal representations through adaptive hierarchical token compression, supporting finer detail perception and longer temporal capture.
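
To make the compression half concrete, here is a minimal sketch of hierarchical token merging: at each level, the most redundant adjacent visual tokens are averaged, shrinking the sequence. The paper's module is adaptive and certainly more sophisticated; the cosine-similarity merge criterion, the fixed merge fraction, and all function names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def compress_level(tokens: torch.Tensor) -> torch.Tensor:
    """One level: average the most redundant adjacent token pairs.

    tokens: (N, D) flattened spatiotemporal visual tokens.
    Returns roughly 3/4 of the input tokens (half of all pairs merged).
    Merged tokens are moved to the front; a real module would track
    positions so that temporal order information survives.
    """
    n = (tokens.shape[0] // 2) * 2             # even prefix; keep any tail token
    a, b = tokens[0:n:2], tokens[1:n:2]        # adjacent pairs, each (n/2, D)
    sim = F.cosine_similarity(a, b, dim=-1)    # redundancy score per pair
    k = sim.shape[0] // 2                      # merge the most redundant half
    merge_idx = sim.topk(k).indices
    keep = torch.ones(sim.shape[0], dtype=torch.bool)
    keep[merge_idx] = False
    merged = (a[merge_idx] + b[merge_idx]) / 2
    kept = torch.stack([a[keep], b[keep]], dim=1).reshape(-1, tokens.shape[1])
    return torch.cat([merged, kept, tokens[n:]], dim=0)

def hierarchical_compress(tokens: torch.Tensor, levels: int = 3) -> torch.Tensor:
    """Apply the merge repeatedly; 3 levels shrink ~4096 tokens to ~1728."""
    for _ in range(levels):
        tokens = compress_level(tokens)
    return tokens
```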

If this is right

  • Video MLLMs show improved accuracy on both short-form and long-form video understanding benchmarks.
  • Models gain the ability to retain and reason over video inputs at least six times longer than the original design.
  • MLLMs acquire new specialized vision skills including object tracking and segmentation.
  • Context richness in length and fineness directly strengthens the models' focus and memory functions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar preference optimization on dense annotations could extend context handling in other sequential multimodal tasks such as audio streams or time-series data.
  • The token compression method might reduce memory costs enough to allow deployment of long-context video models on resource-limited devices.
  • Future tests on uncurated real-world video sources could show whether the benchmark gains translate to practical applications like surveillance or video editing.

Load-bearing premise

The gains in context length, benchmark scores, and specialized vision tasks arise from the long and rich context modeling components rather than from differences in training data volume, model scale, or benchmark choices.

What would settle it

An ablation that removes the dense vision annotations and the hierarchical token compression while holding training data volume and model size fixed, then checks whether the context-length and benchmark gains persist, would settle whether the central claim holds.
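
A minimal sketch of the ablation grid this implies: four training variants that differ only in whether the DPO stage and the compression module are enabled, with data volume, seed, and base model pinned. Every name here (TrainConfig and its fields) is hypothetical scaffolding, not the authors' code.

```python
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class TrainConfig:
    # Held fixed across all variants, as the proposed test requires.
    base_model: str = "internvideo2-base"  # hypothetical identifier
    data_volume_hours: int = 10_000        # fixed training data volume
    seed: int = 0
    # The two LRC components being toggled.
    use_dpo_dense_annotations: bool = True
    use_hierarchical_compression: bool = True

def ablation_grid():
    """Yield the 2x2 grid: each LRC component on/off, all else fixed."""
    base = TrainConfig()
    for dpo, compress in product([False, True], repeat=2):
        yield replace(base,
                      use_dpo_dense_annotations=dpo,
                      use_hierarchical_compression=compress)

# Each variant would be trained and scored on the same short- and long-video
# benchmarks, then compared on context length and accuracy.
for cfg in ablation_grid():
    print(cfg)
```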

read the original abstract

This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces InternVideo2.5, an updated video multimodal large language model (MLLM) that incorporates long and rich context (LRC) modeling. The core contributions are (1) injecting dense vision task annotations into the MLLM via direct preference optimization (DPO) and (2) producing compact spatiotemporal representations through adaptive hierarchical token compression. The authors report that these changes yield gains on short- and long-form video understanding benchmarks, enable the model to handle at least 6× longer video inputs than the prior InternVideo2, and confer new capabilities such as object tracking and segmentation.

Significance. If the reported gains are shown to stem from the LRC mechanisms rather than uncontrolled differences in training data volume, model scale, or optimizer schedule, the work would meaningfully advance video MLLM research by demonstrating the value of explicit context richness for fine-grained perception and long-term memory. The public release of code and models at the cited GitHub repository is a clear strength that aids reproducibility.

major comments (2)
  1. [Abstract and §4 (Experiments)] The headline claims of benchmark improvements, 6× context extension, and new tracking/segmentation capabilities are presented only as end-to-end results against InternVideo2. No ablation is reported that freezes data volume, training schedule, and base architecture while toggling only the DPO stage and the hierarchical compression module; without such controls the attribution to LRC remains untested and the central causal claim is under-supported.
  2. [§3.2 (Adaptive Hierarchical Token Compression)] The description does not specify the exact compression ratios, the criterion used for adaptive selection, or quantitative evidence that fine-grained spatial details required for tracking and segmentation are preserved after compression; this detail is load-bearing for the claim that the module enables both longer context and specialized vision capabilities.
minor comments (1)
  1. [Abstract] The abstract refers to “mainstream video understanding benchmarks (short & long)” without naming the specific datasets or splits; adding this information would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on strengthening the causal attribution of our results to the proposed LRC mechanisms and on providing more technical details in the method section. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The headline claims of benchmark improvements, 6× context extension, and new tracking/segmentation capabilities are presented only as end-to-end results against InternVideo2. No ablation is reported that freezes data volume, training schedule, and base architecture while toggling only the DPO stage and the hierarchical compression module; without such controls the attribution to LRC remains untested and the central causal claim is under-supported.

    Authors: We agree that an ablation isolating the DPO and adaptive hierarchical token compression components—while strictly controlling data volume, training schedule, optimizer, and base architecture—would provide stronger evidence for the specific contribution of LRC. Our current comparisons are against the publicly released InternVideo2 checkpoint under the same overall training recipe, and the observed gains are consistent across short- and long-form benchmarks as well as the new tracking/segmentation tasks. Nevertheless, to directly address the referee’s concern we will add a controlled ablation study in the revised §4 that trains variants with and without each LRC module under matched conditions. This will clarify the incremental benefit attributable to dense vision annotations via DPO and to the compression module. revision: yes

  2. Referee: [§3.2 (Adaptive Hierarchical Token Compression)] The description does not specify the exact compression ratios, the criterion used for adaptive selection, or quantitative evidence that fine-grained spatial details required for tracking and segmentation are preserved after compression; this detail is load-bearing for the claim that the module enables both longer context and specialized vision capabilities.

    Authors: We thank the referee for highlighting this gap. In the revised manuscript we will expand §3.2 with the exact per-layer compression ratios (spatial and temporal), the adaptive selection criterion (based on token importance scores derived from cross-attention with the text query), and quantitative evidence of detail preservation. Specifically, we will report (i) cosine similarity of visual features before and after compression on a held-out set of tracking/segmentation videos and (ii) downstream performance drop on object-tracking and segmentation probes when the compression module is ablated. These additions will substantiate that fine-grained spatial information is retained while achieving the reported 6× context extension. revision: yes
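
The rebuttal's two promised artifacts, a query-conditioned token-importance criterion and a feature-preservation probe, can be sketched generically. The single-head cross-attention scoring and the mean-pooled cosine check below are stand-ins under stated assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_importance(visual_tokens: torch.Tensor,
                     text_query: torch.Tensor) -> torch.Tensor:
    """Score each visual token by the total cross-attention mass it receives
    from the text query tokens (a generic stand-in selection criterion).

    visual_tokens: (N_v, D), text_query: (N_t, D), shared embedding dim D.
    """
    d = visual_tokens.shape[-1]
    attn = torch.softmax(text_query @ visual_tokens.T / d ** 0.5, dim=-1)
    return attn.sum(dim=0)  # (N_v,); higher = more query-relevant

def keep_top_k(visual_tokens: torch.Tensor,
               text_query: torch.Tensor, k: int) -> torch.Tensor:
    """Adaptive selection: retain the k most query-relevant tokens."""
    idx = token_importance(visual_tokens, text_query).topk(k).indices
    return visual_tokens[idx.sort().values]  # preserve temporal order

def preservation_score(before: torch.Tensor, after: torch.Tensor) -> float:
    """Cosine similarity of mean-pooled features before vs. after compression,
    in the spirit of the rebuttal's proposed probe (i)."""
    return F.cosine_similarity(before.mean(dim=0), after.mean(dim=0), dim=0).item()
```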

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark results rather than self-referential derivations

full rationale

The paper introduces architectural and training modifications—dense vision annotations via direct preference optimization and adaptive hierarchical token compression—to extend context length and improve video understanding in MLLMs. These are presented as novel design choices whose value is demonstrated through end-to-end experimental comparisons on standard benchmarks, not through any closed-form derivation, fitted-parameter prediction, or uniqueness theorem that reduces to prior self-citations by construction. References to the InternVideo lineage are contextual background rather than load-bearing justifications for the reported gains. The work is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard supervised fine-tuning, preference optimization, and token compression methods drawn from prior literature; no new free parameters, axioms, or invented entities are introduced beyond the model architecture itself.

pith-pipeline@v0.9.0 · 5552 in / 1094 out tokens · 51917 ms · 2026-05-17T02:47:18.187982+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

    cs.CV · 2026-05 · unverdicted · novelty 8.0

    EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

  2. CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

  3. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV · 2026-05 · conditional · novelty 7.0

    TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

  4. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.

  5. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  6. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  7. Grounding Video Reasoning in Physical Signals

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...

  8. InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

  9. Adapting MLLMs for Nuanced Video Retrieval

    cs.CV · 2025-12 · unverdicted · novelty 7.0

    Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

  10. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  11. All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video und...

  12. Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.

  13. Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

  14. Streaming Video Instruction Tuning

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

  15. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG · 2025-06 · unverdicted · novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  16. LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

    cs.LG · 2025-05 · conditional · novelty 6.0

    LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

  17. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  18. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  19. High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.

  20. How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.

  21. OneThinker: All-in-one Reasoning Model for Image and Video

    cs.CV · 2025-12 · unverdicted · novelty 5.0

    OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

  22. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV · 2025-04 · unverdicted · novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 19 Pith papers · 16 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575,

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

  3. [3]

    One token to seg them all: Language instructed reasoning segmentation in videos

    Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. arXiv preprint arXiv:2409.19603,

  4. [4]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461,

  5. [5]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,

  6. [6]

    Hourvideo: 1-hour video-language understanding

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. arXiv preprint arXiv:2411.04998,

  7. [7]

    ALLaVA: Harnessing GPT4V-Synthesized Data for a Lite Vision-Language Model

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a. Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tiannin...

  8. [8]

    Usp: A unified sequence parallelism approach for long context generative ai

    Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719,

  9. [9]

    MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515,

  10. [10]

    Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Forty-first International Conference on Machine Learning, 2024a. Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos. CoRR, abs/2408.14023,

  11. [11]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793,

  12. [12]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In CVPR, 2024a. Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/...

  13. [13]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950,

  14. [14]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. CoRR, abs/2408.03326, 2024a. Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multi...

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, pages 22195–22206, 2024c. Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenj...

  16. [16]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889,

  17. [17]

    Videogpt+: Integrating image and video encoders for enhanced video understanding

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding. arXiv preprint arXiv:2406.09418,

  18. [18]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,

  19. [20]

    SAM 2: Segment Anything in Images and Videos

    URL https://arxiv.org/abs/2408.00714. Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813,

  20. [21]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia G...

  21. [22]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. CVPR, abs/2312.02051,

  22. [23]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434,

  23. [24]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. arXiv preprint arXiv:2409.14485,

  24. [25]

    Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models

    Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, and Lifu Huang. Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models. arXiv preprint arXiv:2410.03290, 2024a. Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong...

  25. [26]

    LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

    Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture. CoRR, abs/2409.02889, 2024e. Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via g...

  26. [27]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942,

  27. [28]

    Internvideo2: Scaling video foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. In ECCV, 2024f. Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in vi...

  28. [29]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024a. Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In CVPR, pages 4974–4984,

  29. [30]

    Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks

    Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. arXiv preprint arXiv:2406.08394, 2024b. Yinda Xu et al. Siamfc++: Towards robust and accurate visual tracking with target es...

  30. [31]

    LongVILA: Scaling Long-Context Visual Language Models for Long Videos

    Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188,

  31. [32]

    Task preference optimization: Improving multimodal large language models with vision task alignment

    Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Task preference optimization: Improving multimodal large language models with vision task alignment. arXiv preprint arXiv:2412.19326,

  32. [33]

    Vript: A video is worth thousands of words

    Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words. arXiv preprint arXiv:2406.06040,

  33. [34]

    mplug-owl3: Towards long image-sequence understanding in multi-modal large language models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840,

  34. [35]

    Next-chat: An lmm for chat, detection and segmentation

    Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmentation. arXiv preprint arXiv:2311.04498, 2023a. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. In European Conference on Computer Vision, pages 310–325. Springer,

  35. [36]

    Movqa: A benchmark of versatile question-answering for long-form movie understanding

    Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, and Yu Qiao. Movqa: A benchmark of versatile question-answering for long-form movie understanding. arXiv preprint arXiv:2312.04817, 2023b. Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. L...

  36. [37]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264,

  37. [38]

    Apollo: An exploration of video understanding in large multimodal models

    Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, et al. Apollo: An exploration of video understanding in large multimodal models. arXiv preprint arXiv:2412.10360,