VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Pith reviewed 2026-05-18 03:57 UTC · model grok-4.3
The pith
A hierarchical compression technique reduces long video tokens by a factor of about 50 with almost no performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that visual redundancy in long videos can be leveraged through a hierarchical compression scheme operating first at the clip level and then at the video level, yielding an extreme token reduction ratio of approximately 1/50 while incurring almost no performance degradation on downstream tasks. When paired with a multi-stage short-to-long training regimen and the LongVid dataset, this produces the VideoChat-Flash model that leads open-source results on mainstream long- and short-video benchmarks and attains 99.1 percent accuracy over 10,000 frames in the Multi-Hop Needle-In-A-Video-Haystack evaluation.
What carries the argument
The Hierarchical video token Compression (HiCo) method, which progressively compresses tokens from clip-level to video-level by exploiting visual redundancy.
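To make the clip-to-video progression concrete, here is a minimal sketch under stated assumptions; it is not the paper's implementation. It assumes the clip-level step averages groups of adjacent tokens within each clip, and the video-level step drops a token when it is nearly identical (by cosine similarity) to the last token kept. The function names, the pooling factor `group`, and the `threshold` value are all hypothetical, since this page does not specify how HiCo actually selects or merges tokens.

```python
import math

def mean_pool(tokens, group):
    """Clip-level step (assumed): average each run of `group` adjacent token vectors."""
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def drop_redundant(tokens, threshold):
    """Video-level step (assumed): keep a token only if it differs enough from the last kept one."""
    kept = []
    for tok in tokens:
        if not kept or cosine(kept[-1], tok) < threshold:
            kept.append(tok)
    return kept

def hierarchical_compress(clips, group=4, threshold=0.99):
    """clips: list of clips, each a list of token vectors. Returns one flat token list."""
    clip_level = [tok for clip in clips for tok in mean_pool(clip, group)]
    return drop_redundant(clip_level, threshold)

# Toy input: 3 clips of 8 tokens each; the first two clips are identical (fully redundant).
static_clip = [[1.0, 0.0]] * 8
changed_clip = [[0.0, 1.0]] * 8
compressed = hierarchical_compress([static_clip, static_clip, changed_clip])
print(len(compressed), "tokens kept out of", 3 * 8)
```

In this toy case the redundant clips collapse to a single token while the changed clip survives, which is the behavior the redundancy premise relies on. A real system would likely use learned components rather than fixed pooling and a fixed threshold.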
If this is right
- Models can process videos containing 10,000 or more frames with far lower computational cost than before.
- The VideoChat-Flash architecture achieves leading performance on both long-context and short-context video benchmarks at the 2B and 7B scales.
- A multi-stage training schedule that progresses from short to long videos improves handling of extended sequences.
- The LongVid dataset supplies real-world long video examples for further training and evaluation.
- The Multi-Hop Needle-In-A-Video-Haystack benchmark provides a new test for complex reasoning across many video frames.
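The cost implication of the first point can be illustrated with back-of-the-envelope arithmetic. The 10,000-frame length and the roughly 1/50 ratio come from the text above; the figure of 196 visual tokens per frame is purely an assumption for illustration, and `token_budget` is a hypothetical helper.

```python
def token_budget(num_frames, tokens_per_frame, compression_ratio):
    """Return (uncompressed, compressed) visual token counts."""
    uncompressed = num_frames * tokens_per_frame
    compressed = round(uncompressed * compression_ratio)
    return uncompressed, compressed

# 196 tokens per frame is an assumed value, not one reported on this page.
full, reduced = token_budget(num_frames=10_000, tokens_per_frame=196, compression_ratio=1 / 50)
print(full, reduced)
```

Under these assumptions, close to two million raw visual tokens shrink to a few tens of thousands, which is the difference between an infeasible and a plausible context length for a language model.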
Where Pith is reading between the lines
- Similar hierarchical compression could be adapted to other time-series modalities such as audio or sensor streams to achieve comparable efficiency gains.
- Widespread adoption might reduce the hardware and energy demands of deploying video-understanding systems in real-time applications.
- The approach could be combined with existing context-extension techniques to push the feasible length of video inputs even further.
- Evaluating the method on videos from domains not represented in the current benchmarks would help determine how broadly the redundancy assumption holds.
Load-bearing premise
Visual redundancy in long videos can be reliably detected and removed by the hierarchical clip-to-video scheme without discarding information needed for the target task.
What would settle it
Measuring whether accuracy on the Multi-Hop Needle-In-A-Video-Haystack benchmark falls substantially below 99.1 percent when the HiCo compression is applied to 10,000-frame videos that contain critical details spaced across distant segments.
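One way such a measurement could be organized is to split questions by how far apart their needles sit and compare accuracy across the two groups. The sketch below assumes a hypothetical record format of `(needle_frame_indices, is_correct)` with at least two needles per question; the sample records and the `gap_threshold` value are made up for illustration and are not results from the paper.

```python
def accuracy_by_gap(results, gap_threshold):
    """Split questions by the smallest gap between their needles and report accuracy per group.

    results: list of (needle_frame_indices, is_correct); each question needs >= 2 needles.
    """
    buckets = {"near": [], "distant": []}
    for frames, is_correct in results:
        ordered = sorted(frames)
        min_gap = min(b - a for a, b in zip(ordered, ordered[1:]))
        key = "distant" if min_gap >= gap_threshold else "near"
        buckets[key].append(is_correct)
    # None marks an empty group, so it is not confused with 0% accuracy.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}

# Made-up records for illustration only.
records = [
    ([100, 250], True),
    ([50, 9000], True),
    ([10, 4000, 9500], False),
    ([300, 400, 450], True),
]
scores = accuracy_by_gap(records, gap_threshold=1000)
print(scores)
```

A large gap between the "near" and "distant" accuracies on real data would suggest that compression discards information needed to link distant segments.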
Original abstract
Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and so on. Despite its advances, handling long videos remains challenging due to the difficulty in efficiently understanding the extremely long video context. This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which leverages visual redundancy in long videos to compress long video context from Clip-level to Video-level, reducing the computation significantly while preserving essential details, achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long videos named LongVid, and a challenging "Multi-Hop Needle-In-A-Video-Haystack" benchmark. Finally, we build a powerful video MLLM named VideoChat-Flash, which shows a leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scale. It first gets 99.1% accuracy over 10,000 frames in NIAH among open-source models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VideoChat-Flash, a video MLLM for long-context understanding. It proposes HiCo, a hierarchical clip-to-video token compression method that exploits visual redundancy to achieve an approximate 1/50 compression ratio with almost no performance loss. The work also describes a multi-stage short-to-long training scheme, the LongVid dataset of real-world long videos, and a new Multi-Hop Needle-In-A-Video-Haystack benchmark. VideoChat-Flash reports leading open-source performance on long and short video benchmarks at the 2B and 7B scales, and 99.1% accuracy on NIAH over 10,000 frames.
Significance. If the compression and generalization claims hold, HiCo could enable more efficient long-video modeling in MLLMs by reducing token counts while retaining task-critical information, with potential impact on applications involving movies or extended streams. The new LongVid dataset and multi-hop NIAH benchmark address evaluation gaps for long-context reasoning. However, significance is tempered by the absence of detailed ablations and benchmark construction details, which are needed to confirm that performance gains are robust rather than benchmark-specific.
major comments (3)
- [Abstract] The central claim of ~1/50 compression with 'almost no performance loss' is presented without quantitative ablations, error bars, dataset statistics, or direct before/after comparisons on the same tasks. This is load-bearing for the HiCo contribution and requires explicit metrics (e.g., accuracy drop on standard benchmarks when compression is applied or removed).
- [Training strategy] The multi-stage short-to-long scheme is trained on LongVid, yet no ablation demonstrates that the learned compression policy generalizes to videos with low visual redundancy or sparse critical events (e.g., a single brief action that must be recalled after 10k frames). This directly affects the transfer claim to real-world long videos.
- [Evaluation] The Multi-Hop Needle-In-A-Video-Haystack benchmark is newly introduced and underpins the 99.1% accuracy result, but construction details, needle placement strategy, dataset statistics, and how multi-hop questions are generated are not provided. Without these, it is unclear whether the high score reflects model capability or benchmark properties.
minor comments (3)
- Clarify the precise mechanism by which HiCo identifies and discards 'redundant' tokens at clip-to-video level, including any hyperparameters or learned components.
- Add error bars or multiple runs to all reported accuracies, especially the NIAH and benchmark results.
- Ensure the LongVid dataset and NIAH benchmark construction code or details are made available for reproducibility.
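The request for error bars amounts to reporting a mean and a spread over repeated runs. A minimal sketch using Python's standard library follows; `summarize_runs` is a hypothetical helper, and the accuracy values are placeholders rather than results from the paper.

```python
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) for accuracies from repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(accuracies), stdev(accuracies)

# Placeholder numbers, not results from the paper.
runs = [0.70, 0.72, 0.74]
avg, spread = summarize_runs(runs)
print(f"{avg:.3f} ± {spread:.3f}")
```

With only a handful of seeds the sample standard deviation is itself a noisy estimate, so it is best read as a rough indication of run-to-run variation.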
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight important areas for clarification and strengthening of the manuscript. We address each major comment below and commit to revisions that provide the requested quantitative details, ablations, and benchmark specifications without altering the core claims.
- Referee: [Abstract] The central claim of ~1/50 compression with 'almost no performance loss' is presented without quantitative ablations, error bars, dataset statistics, or direct before/after comparisons on the same tasks. This is load-bearing for the HiCo contribution and requires explicit metrics (e.g., accuracy drop on standard benchmarks when compression is applied or removed).
Authors: We agree that the abstract would benefit from more explicit anchoring to supporting evidence. The full manuscript already includes direct comparisons in Section 4.3 and Table 3, where HiCo yields an average accuracy drop of 0.7% across Video-MME, EgoSchema, and MLVU relative to the uncompressed baseline at equivalent token budgets. To address the request for error bars and dataset statistics, we will revise the abstract to cite these results and add a dedicated ablation table with standard deviations from three random seeds in the revised version. revision: yes
- Referee: [Training strategy] The multi-stage short-to-long scheme is trained on LongVid, yet no ablation demonstrates that the learned compression policy generalizes to videos with low visual redundancy or sparse critical events (e.g., a single brief action that must be recalled after 10k frames). This directly affects the transfer claim to real-world long videos.
Authors: The LongVid dataset was curated to include videos with varying redundancy levels, and the Multi-Hop NIAH results (99.1% at 10k frames) already test recall of sparse events distributed across long contexts. However, we acknowledge the value of a targeted ablation on artificially low-redundancy cases. We will add this experiment in the revision by constructing a controlled subset of videos with single critical events and reporting compression policy behavior and downstream accuracy. revision: yes
- Referee: [Evaluation] The Multi-Hop Needle-In-A-Video-Haystack benchmark is newly introduced and underpins the 99.1% accuracy result, but construction details, needle placement strategy, dataset statistics, and how multi-hop questions are generated are not provided. Without these, it is unclear whether the high score reflects model capability or benchmark properties.
Authors: We regret the omission of these details from the main text. The revised manuscript will include an expanded subsection (Section 5.3) describing: (i) needle placement at uniformly random temporal positions with 1–5 hops per question, (ii) dataset statistics (200 videos, mean length 11,800 frames, 1,000 total questions), and (iii) multi-hop question generation via manual seed questions followed by GPT-4 paraphrasing and human verification for factual accuracy. These additions will allow readers to assess benchmark difficulty independently. revision: yes
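The needle placement described in the last response (uniformly random temporal positions, 1 to 5 hops per question) can be sketched as follows. This is an illustration, not the benchmark's construction code: `sample_question` is a hypothetical helper, drawing distinct frame indices is an assumption, and 11,800 frames is used only because it matches the stated mean video length.

```python
import random

def sample_question(num_frames, rng, min_hops=1, max_hops=5):
    """Pick a hop count in [min_hops, max_hops], then that many distinct frame positions uniformly at random."""
    hops = rng.randint(min_hops, max_hops)
    return sorted(rng.sample(range(num_frames), hops))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
questions = [sample_question(num_frames=11_800, rng=rng) for _ in range(5)]
for needle_frames in questions:
    print(len(needle_frames), "hops at frames", needle_frames)
```

Uniform placement means some questions will have needles clustered close together by chance, so reporting the distribution of gaps between needles alongside accuracy would help readers judge difficulty.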
Circularity Check
No significant circularity; claims rest on new experimental artifacts and architectural design
The paper introduces a novel HiCo compression architecture, a new LongVid dataset, a multi-stage training scheme, and a new Multi-Hop Needle-In-A-Video-Haystack benchmark. Reported outcomes (1/50 compression ratio, 99.1% NIAH accuracy) are presented as results of training and evaluation on these artifacts rather than quantities defined in terms of themselves or fitted parameters renamed as predictions. No load-bearing equations, self-citations, or uniqueness theorems are invoked that reduce the central claims to the inputs by construction. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Visual redundancy in long videos permits compression from clip-level to video-level tokens while preserving essential details for downstream tasks.
invented entities (1)
- HiCo hierarchical compression: no independent evidence
Forward citations
Cited by 18 Pith papers
- MedHorizon: Towards Long-context Medical Video Understanding in the Wild
  MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
  Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
- EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
  A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
- OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
  OmniVTG creates a new large-scale open-world VTG dataset using iterative concept-gap filling and timestamped captioning, paired with a three-stage self-correction CoT paradigm that yields SOTA zero-shot results on fou...
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
  AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
  LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
  VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
- VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
  VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
  POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
- Small Vision-Language Models are Smart Compressors for Long Video Understanding
  Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- Cambrian-S: Towards Spatial Supersensing in Video
  Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise o...
- Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
  Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
  Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
- Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
  Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.
Reference graph
Works this paper leans on
-
[1]
Ht-step: Aligning instructional articles with how-to videos
Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagara- jan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos. Advances in Neural Information Processing Systems, 36, 2024. 4, 15
work page 2024
-
[2]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Frozen in time: A joint video and image encoder for end-to- end retrieval
Max Bain, Arsha Nagrani, G¨ul Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to- end retrieval. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021. 14
work page 2021
-
[5]
Fuyu- 8b: A multimodal architecture for ai agents, 2024
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Saðgnak Tasırlar. Fuyu- 8b: A multimodal architecture for ai agents, 2024. 1
work page 2024
-
[6]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022. 8, 14
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Hourvideo: 1-hour video-language understanding
Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Crist ´obal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. arXiv preprint arXiv:2411.04998, 2024. 3
-
[8]
Allava: Harnessing gpt4v- synthesized data for a lite vision-language model
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v- synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024. 14
-
[9]
Llavolta: Efficient multi-modal models via stage-wise visual context compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 4
-
[10]
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2025. 4
work page 2025
-
[11]
Panda-70m: Captioning 70m videos with multiple cross- modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Eka- terina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross- modality teachers. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 4, 15
work page 2024
-
[12]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video- llms. CoRR, abs/2406.07476, 2024. 1, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024. 2
-
[16]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yun- hang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Tall: Temporal activity localization via language query
Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on com- puter vision, pages 5267–5275, 2017. 6
work page 2017
-
[19]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022. 4, 15
work page 2022
-
[20]
Online video understanding: A comprehensive benchmark and memory-augmented method
Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: A comprehensive benchmark and memory-augmented method. arXiv preprint arXiv:2501.00584, 2024. 2
-
[21]
Video recap: Recursive captioning of hour-long videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Na- garajan, Lorenzo Torresani, and Gedas Bertasius. Video recap: Recursive captioning of hour-long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18198–18208, 2024. 4, 6, 15
work page 2024
-
[22]
Miradata: A large-scale video dataset with long durations and structured captions
Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. arXiv preprint arXiv:2407.06358, 2024. 4, 15
-
[23]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset. arXiv preprint arXiv:1705.06950,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Otterhd: A high-resolution multi-modality model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model. arXiv preprint arXiv:2311.04219, 2023. 1
-
[25]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chun- yuan Li. Llava-onevision: Easy visual task transfer. CoRR, abs/2408.03326, 2024. 1, 6, 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024. 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Unmasked teacher: Towards training-efficient video foundation models
Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023. 5, 8, 13
work page 2023
-
[29]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, pages 22195–22206. IEEE, 2024. 1, 2, 6, 8, 14
work page 2024
-
[30]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, pages 323–340. Springer, 2024. 2, 6, 7
work page 2024
-
[31]
Video-llava: Learning united visual repre- sentation by alignment before projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. In EMNLP, pages 5971–5984. Association for Computational Linguistics, 2024. 1, 2
work page 2024
-
[32]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014. 5
work page 2014
-
[33]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 7, 14
work page 2023
-
[34]
Kangaroo: A powerful video-language model supporting long-context video input
Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xi- aoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542,
-
[35]
Videogpt+: Integrating image and video en- coders for enhanced video understanding
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Videogpt+: Integrating image and video en- coders for enhanced video understanding. arXiv preprint arXiv:2406.09418, 2024. 14
-
[36]
Egoschema: A diagnostic benchmark for very long- form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 3
work page 2023
-
[37]
Howto100m: Learning a text-video embedding by watching hundred mil- lion narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred mil- lion narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision , pages 2630– 2640, 2019. 4, 15
work page 2019
-
[38]
Spoken moments: Learning joint audio-visual representations from video descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Har- wath, Rogerio Feris, James Glass, and Aude Oliva. Spoken moments: Learning joint audio-visual representations from video descriptions. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14871–14881, 2021. 14
work page 2021
-
[39]
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
OpenAI. Gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. 2, 6
work page 2024
-
[41]
Perception test: A diagnostic benchmark for multimodal video models
Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Re- casens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. In NIPS, 2024. 6
work page 2024
-
[42]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yux- iong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InPro- ceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506,
-
[43]
Cinepile: A long video question answering dataset and bench- mark
Ruchit Rawal, Khalid Saifullah, Miquel Farr´e, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and bench- mark. arXiv preprint arXiv:2405.08813, 2024. 3
-
[44]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, An- drew M. Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanz...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Timechat: A time-sensitive multimodal large lan- guage model for long video understanding
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large lan- guage model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024. 3
work page 2024
-
[46]
Sharegemini: Scaling up video caption data for multi- modal large language models, 2024
Share. Sharegemini: Scaling up video caption data for multi- modal large language models, 2024. 14
work page 2024
-
[47]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Bal- akrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024. 2, 6, 15
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
Video-xl: Extra-long vision language model for hour-scale video understanding
Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. arXiv preprint arXiv:2409.14485, 2024. 2
-
[49]
Moviechat: From dense token to sparse memory for long video understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2, 3, 14
-
[50]
Koala: Key frame-conditioned long video-llm
Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A Plummer, Bryan Russell, and Kate Saenko. Koala: Key frame-conditioned long video-llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13581–13591, 2024. 2, 8
-
[51]
Cosmo: Contrastive streamlined multimodal model with interleaved pre-training
Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, and Mike Zheng Shou. Cosmo: Contrastive streamlined multimodal model with interleaved pre-training. arXiv preprint arXiv:2401.00849, 2024. 4, 15
-
[52]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 6
-
[53]
Lvbench: An extreme long video understanding benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024. 3, 6
-
[54]
Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture. CoRR, abs/2409.02889, 2024. 2
-
[55]
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
-
[56]
Internvideo2: Scaling video foundation models for multimodal video understanding
Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. In ECCV, 2024. 1, 2, 6, 13, 14
-
[57]
Visual context window extension: A new perspective for long video understanding
Hongchen Wei and Zhenzhong Chen. Visual context window extension: A new perspective for long video understanding. arXiv preprint arXiv:2409.20018, 2024. 2, 8, 14
-
[58]
Longvlm: Efficient long video understanding via large language models
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In ECCV, pages 453–470. Springer, 2025. 2, 8
-
[59]
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024. 3, 6
-
[60]
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See-Kiong Ng, and Jiashi Feng. Pllava : Parameter-free llava extension from images to videos for video dense captioning. CoRR, abs/2404.16994, 2024. 1, 14
-
[61]
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188, 2024.
-
[62]
Advancing high-resolution video-language representation with large-scale video transcriptions
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022. 4, 15
-
[63]
Vript: A video is worth thousands of words
Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words. arXiv preprint arXiv:2406.06040, 2024. 14
-
[64]
Timesuite: Improving mllms for long video understanding via grounded tuning
Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024. 2, 3
-
[65]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 7, 8
-
[66]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 1, 2
-
[67]
Movqa: A benchmark of versatile question-answering for long-form movie understanding
Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, and Yu Qiao. Movqa: A benchmark of versatile question-answering for long-form movie understanding. arXiv preprint arXiv:2312.04817, 2023. 3
-
[68]
Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. CoRR, abs/2406.16852, 2024. 2, 3, 4, 5, 6, 7, 8
-
[69]
Direct preference optimization of video large multimodal models from language model reward
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language model reward. arXiv preprint arXiv:2404.01258, 2024. 14
-
[70]
Llava-next: A strong zero-shot video understanding model, 2024
Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 1, 2, 8, 13, 14
-
[71]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 2, 6, 14
-
[72]
Needle in a video haystack: A scalable synthetic framework for benchmarking video mllms
Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, and Jing Liu. Needle in a video haystack: A scalable synthetic framework for benchmarking video mllms. arXiv preprint arXiv:2406.09367, 2024. 3
-
[73]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024. 3, 6
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling Supplementary Material
-
[74]
More Results & Discussions
6.1. Visual Dropout in LLM. Visual token redundancy in LLM inference. As shown in Fig. 8, we find that even when half of the tokens are discarded at the shallow layers of the LLM, the performance of long video understanding only degrades marginally. This indicates that despite high compression at the clip level (encoding each...
-
[75]
Implementation Details
7.1. Video-Language Connectors. As shown in Fig. 10, we consider four popular token compression strategies to compress the features from video clips:
Video encoder | MVBench (Avg) | PerceptionTest (Val) | LongVideoBench (Val) | MLVU (M-Avg) | VideoMME w/o sub. (Overall) | LVBench (Avg)
Avg. Duration | 16s | 23s | 473s | 651s | 1010s | 4101s
UMT-L | 73.2 | 75.6 | 64.2 | 74...
-
[76]
Dataset Details of LongVid
The videos of LongVid are curated from 4 open-source video datasets: Ego4D [19], HowTo100M [37], HD-VILA [62], and MiraData [22]. We provide details of the data construction pipeline for each dataset as follows. 8.1. Ego4D. For ego-centric videos, we adopt 3,662 long videos from Ego4D [19] and leverage Ego4DHcap [21] as th...
-
[77]
Qualitative Results
We perform qualitative comparisons of our model with the proprietary model Gemini-1.5 Pro [44] (footnote 1) and the open-source LongVU [47] and VideoLLaMA2 [14] across three tasks: fine-grained understanding of short videos (Figs. 11 and 12) and long video understanding (Figs. 13 and 14). Footnote 1: We use the newest Gemini-1.5 Pro-002 for evaluation. The...