pith. machine review for the scientific record.

arxiv: 2604.14149 · v2 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Shixing Chen, Vimal Bhat, Xiang Hao, Yu-Xiong Wang, Zheyu Zhang, Ziqi Pang

Pith reviewed 2026-05-10 13:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · token compression · frame selection · vision-language models · attention mechanisms · extreme compression · learnable compression

The pith

Learnable token reduction to one per frame plus attention-based selection lets VLMs process far more frames from long videos with higher accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that heuristic token compression in vision-language models for long videos causes avoidable information loss, so it instead trains LLM layers as learnable progressive modules to compress each frame down to a single token. This token-level step is paired with a frame-level step that splits videos into short segments, computes local attention scores inside the LLM, and keeps only the frames most relevant to the query while avoiding the usual start-and-end bias. Together the two steps produce a model called XComp that ingests two to four times as many frames, needs only 2.5 percent of the usual fine-tuning data, and lifts accuracy on LVBench from 42.9 percent to 46.2 percent while also improving other long-video benchmarks. A sympathetic reader cares because current models must skip most frames to fit inside LLM context limits, and recovering dense temporal information without exploding compute is the central practical barrier.
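To make the context-budget pressure concrete, here is a rough back-of-the-envelope sketch. The 32k-token context, the 2k tokens reserved for text, and the 196-tokens-per-frame starting point are illustrative assumptions, not figures taken from the paper.

```python
# Rough illustration of why tokens-per-frame dominates how many frames fit
# into a fixed LLM context. All numbers are assumptions for the example.

CONTEXT_TOKENS = 32_000      # assumed LLM context budget
RESERVED_FOR_TEXT = 2_000    # assumed room for the question and the answer

def max_frames(tokens_per_frame: int) -> int:
    """Frames that fit once the text tokens are reserved."""
    return (CONTEXT_TOKENS - RESERVED_FOR_TEXT) // tokens_per_frame

for tpf in (196, 16, 4, 1):
    print(f"{tpf:>4} tokens/frame -> {max_frames(tpf):>6} frames")
# 196 tokens/frame -> ~153 frames; 1 token/frame -> 30,000 frames, which is
# why pushing compression toward one token per frame changes how densely a
# long video can be sampled.
```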

Core claim

Heuristic compression is prone to information loss, so the authors supervise LLM layers to become learnable progressive modules (LP-Comp) that reduce each frame to one token; they further add question-conditioned frame selection (QC-Comp) that splits long videos into short segments and uses local attention scores to retain only the most relevant frames. The combined system, XComp, reaches a significantly larger compression ratio, supports denser frame sampling, and is obtained by fine-tuning VideoChat-Flash on only 2.5 percent of the supervised data yet raises LVBench accuracy from 42.9 percent to 46.2 percent and improves other long-video benchmarks.
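A minimal sketch of what progressive compression toward one token per frame could look like mechanically: frame tokens are merged stage by stage until a single token remains. The mean-pooling, the stage schedule, and the tensor shapes are assumptions for illustration; in the paper the reduction is performed by LLM layers trained with the supervised compression tuning stage, not by fixed pooling.

```python
import torch

def progressive_compress(frame_tokens: torch.Tensor,
                         schedule=(64, 16, 4, 1)) -> torch.Tensor:
    """Illustrative progressive token reduction for a single frame.

    frame_tokens: (n_tokens, d) visual tokens of one frame.
    schedule: target token counts after successive stages; the final 1
    mirrors the one-token-per-frame extreme. Mean-pooling stands in for
    the learned compression performed inside the LLM layers.
    """
    tokens = frame_tokens
    for target in schedule:
        n, d = tokens.shape
        group = max(n // target, 1)
        # Merge neighbouring tokens into `target` slots (learned in the real model).
        tokens = tokens[: group * target].reshape(target, group, d).mean(dim=1)
    return tokens

frame = torch.randn(256, 1024)             # assumed 256 tokens per frame, hidden size 1024
print(progressive_compress(frame).shape)   # torch.Size([1, 1024])
```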

What carries the argument

LP-Comp and QC-Comp: learnable progressive token compression supervised inside LLM layers to reach one token per frame, combined with question-conditioned frame selection that uses local attention scores after segmenting long videos to avoid position bias.

If this is right

  • The model can digest 2x-4x more frames than prior methods while still fitting inside the LLM context window.
  • Extreme compression is achieved without the information loss typical of heuristic token dropping.
  • Only 2.5 percent of the usual supervised fine-tuning data is required to reach the reported accuracy gains.
  • Accuracy rises on LVBench and on multiple additional long-video understanding benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the local-attention selector generalizes, the same segmentation trick could be applied to other long-sequence tasks such as long-document question answering.
  • The one-token-per-frame regime opens the possibility of feeding entire hour-long videos into a single forward pass once the LLM context is enlarged.
  • Because the compression modules are learned inside the LLM rather than applied as a fixed preprocessor, they may adapt to new video domains with little additional data.

Load-bearing premise

The internal attention scores of the LLM layers, once videos are split into short segments, can identify the frames most relevant to an arbitrary query without losing temporal context that the token compression stage cannot recover.
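A sketch of the kind of selection rule this premise implies: score each frame against the query inside a short local segment and keep only the top-scoring frames per segment, so no single region of the video monopolizes the budget. Using cosine similarity of pooled embeddings as a stand-in for the LLM's internal attention scores, and the particular segment length and keep ratio, are assumptions for illustration only.

```python
import numpy as np

def select_frames(frame_emb: np.ndarray, query_emb: np.ndarray,
                  segment_len: int = 16, keep_per_segment: int = 4):
    """Keep the frames most relevant to the query, scored locally per segment.

    frame_emb: (n_frames, d) pooled per-frame embeddings (stand-in for frame tokens).
    query_emb: (d,) pooled query embedding.
    Scoring within short segments, rather than over the whole video, is the move
    the paper uses to avoid the begin/end position bias of long-context attention.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    kept = []
    for start in range(0, len(frame_emb), segment_len):
        seg = frame_emb[start:start + segment_len]
        scores = seg @ q / (np.linalg.norm(seg, axis=1) + 1e-8)  # local relevance
        top = np.argsort(scores)[::-1][:keep_per_segment]
        kept.extend(start + int(i) for i in top)
    return sorted(kept)

frames = np.random.randn(128, 512)   # assumed 128 frames with 512-d embeddings
query = np.random.randn(512)
print(select_frames(frames, query)[:8])
```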

What would settle it

On a long-video benchmark whose queries depend on information located in the middle of the sequence, the local-segment attention selector would show no gain over random frame selection or over global attention; that outcome would falsify the claim that local attention reliably surfaces the right frames.
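A tiny synthetic harness showing how such a test could be scored: plant the query-relevant evidence in the middle of the video, give each selector the same frame budget, and compare recall of the evidence frames. The synthetic embeddings, the budget, and the recall metric are assumptions; a real run would use benchmark annotations and the model's actual attention scores rather than dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, dim, budget, seg_len = 256, 64, 32, 32
evidence = list(range(120, 136))                 # evidence buried mid-video

frame_emb = rng.normal(size=(n_frames, dim))
query = rng.normal(size=dim)
frame_emb[evidence] += 2.0 * query               # make evidence frames query-aligned
scores = frame_emb @ query

def recall(selected):
    return len(set(selected) & set(evidence)) / len(evidence)

random_sel = rng.choice(n_frames, budget, replace=False)
global_sel = np.argsort(scores)[::-1][:budget]            # global top-k
local_sel = []                                            # top-k per segment
per_seg = budget // (n_frames // seg_len)
for s in range(0, n_frames, seg_len):
    top = np.argsort(scores[s:s + seg_len])[::-1][:per_seg]
    local_sel.extend(s + top)

for name, sel in [("random", random_sel), ("global", global_sel), ("local", local_sel)]:
    print(name, round(recall(sel), 2))
```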

Figures

Figures reproduced from arXiv: 2604.14149 by Shixing Chen, Vimal Bhat, Xiang Hao, Yu-Xiong Wang, Zheyu Zhang, Ziqi Pang.

Figure 1. Left: We present XComp, which explores using the LLM layers to progressively compress the video tokens towards the extreme of one token per frame. Right: Such capabilities enable the model to better improve itself with denser input frames without significantly increasing video tokens.

Figure 2. Overview. Our XComp comprises two parts to achieve extreme compression in long video understanding. (1) At the token level, we propose supervised compression tuning that enables the LLM to compress every video frame into one compact token in a learnable and progressive manner, namely LP-Comp (Sec. 3.2). (2) At the frame level, we utilize the internal attention mechanism to select the frames relevant to the query (QC-Comp, Sec. 3.3).

Figure 3. Learnable and Progressive Compression (LP-Comp). With supervised compression tuning, the LLM layers learn to condense the visual tokens progressively into a concise set of tokens until reaching the extreme of one token per frame.

Figure 4. Question-Conditioned Compression (QC-Comp). To reduce visual redundancy and improve long video understanding performance, we split a video into individual segments and assign question-conditioned relevance scores to video frames. The frames with lower relevance scores are discarded so that the LLM concentrates only on the informative frames.

Figure 5. LongVideoBench case analysis. XComp leverages QC-Comp to divide a long video into short segments, mitigating attention bias. Local attention highlights key frames relevant to the question and the correct answer, enabling effective filtering of irrelevant content.
Original abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named XComp, achieving a significantly larger compression ratio and enabling denser frame sampling. Our XComp is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes XComp, a VLM for long video understanding that achieves extreme compression via two components: LP-Comp, which uses supervised progressive modules to reduce each frame to a single token at the final LLM layer, and QC-Comp, which performs frame-level selection by computing query-conditioned attention scores within locally segmented video windows to mitigate position bias. The model is obtained by fine-tuning VideoChat-Flash on only 2.5% of the usual supervised data and is reported to raise LVBench accuracy from 42.9% to 46.2% while supporting denser frame sampling.

Significance. If the empirical gains are reproducible and the attention-based selection proves reliable, the work would demonstrate a practical route to 2-4x denser frame sampling under fixed context budgets, with the data-efficient supervised tuning stage constituting a clear engineering contribution. The explicit separation of token-level and frame-level compression stages is a useful conceptual distinction.

major comments (2)
  1. [Abstract] The central claim that QC-Comp enables the observed 3.3-point LVBench gain (42.9% to 46.2%) rests on an unreviewed empirical result; no ablation isolating QC-Comp from LP-Comp, no error bars, and no statistical test are supplied, so it is impossible to rule out that the improvement is attributable to the supervised tuning stage alone.
  2. [QC-Comp description] The load-bearing assumption that local attention scores reliably surface the frames most relevant to arbitrary queries (and thereby preserve temporal context that one-token-per-frame compression cannot recover) is asserted without any supporting correlation study, qualitative attention maps, or comparison against query-agnostic frame-sampling baselines.
minor comments (2)
  1. The abstract states that the method achieves a 'significantly larger compression ratio' but supplies no concrete ratio figures or direct numerical comparison against the strongest prior token-compression baselines.
  2. Notation for the progressive compression modules (LP-Comp) and the local-attention windowing procedure would benefit from a compact diagram or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make the requested revisions to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Abstract] The central claim that QC-Comp enables the observed 3.3-point LVBench gain (42.9% to 46.2%) rests on an unreviewed empirical result; no ablation isolating QC-Comp from LP-Comp, no error bars, and no statistical test are supplied, so it is impossible to rule out that the improvement is attributable to the supervised tuning stage alone.

    Authors: We agree that an ablation isolating QC-Comp's contribution is required to substantiate the claim. In the revised manuscript we will add an ablation comparing LVBench accuracy of the LP-Comp-only model against the full XComp model (both trained with the same 2.5% supervised compression tuning data). We will also report standard deviations across multiple runs and include statistical significance tests. revision: yes

  2. Referee: [QC-Comp description] The load-bearing assumption that local attention scores reliably surface the frames most relevant to arbitrary queries (and thereby preserve temporal context that one-token-per-frame compression cannot recover) is asserted without any supporting correlation study, qualitative attention maps, or comparison against query-agnostic frame-sampling baselines.

    Authors: We acknowledge the need for direct evidence. The revised version will include (i) qualitative attention maps visualizing query-conditioned frame selection, (ii) a correlation analysis between selected frames and query-relevant frames (using available annotations where possible), and (iii) quantitative comparisons against query-agnostic baselines such as uniform sampling and motion-based keyframe selection. revision: yes
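For reference, the query-agnostic baselines named in this response are simple to state precisely; a minimal sketch of uniform sampling and frame-difference ("motion") keyframe selection follows. The L1 frame-difference heuristic and the frame budget are assumptions for illustration, not necessarily the baselines the authors will implement.

```python
import numpy as np

def uniform_sample(n_frames: int, budget: int):
    """Query-agnostic baseline: evenly spaced frame indices."""
    return np.linspace(0, n_frames - 1, budget).round().astype(int).tolist()

def motion_keyframes(frames: np.ndarray, budget: int):
    """Query-agnostic baseline: frames with the largest change from their predecessor.

    frames: (n_frames, H, W) grayscale frames; mean absolute frame difference
    is a common stand-in for motion, used here only as an illustrative heuristic.
    """
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    top = np.argsort(diffs)[::-1][:budget] + 1   # diff i refers to frame i + 1
    return sorted(top.tolist())

video = np.random.randint(0, 255, size=(120, 36, 64))   # assumed toy video
print(uniform_sample(120, 8))
print(motion_keyframes(video, 8))
```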

Circularity Check

0 steps flagged

No circularity; empirical supervised tuning result

full rationale

The paper presents an engineering pipeline (LP-Comp token compression via supervised layer modules + QC-Comp frame selection via local LLM attention) whose performance gains are measured on held-out benchmarks after finetuning on 2.5% of SFT data. No equations, self-citations, or fitted parameters are shown to reduce the reported accuracy lift (42.9% → 46.2% on LVBench) to a definitional identity or tautology. The central claims rest on external empirical validation rather than internal re-derivation of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central performance claim rests on the empirical effectiveness of supervised compression tuning and attention-based selection; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5617 in / 1256 out tokens · 19922 ms · 2026-05-10T13:24:39.163447+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

82 extracted references · 43 canonical work pages · 13 internal anchors

  1. [1]

    Ht-step: Aligning instructional articles with how-to videos.NeurIPS, 2023

    Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht-step: Aligning instructional articles with how-to videos.NeurIPS, 2023. 25

  2. [2]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024. 3

  3. [3]

    Goldfish: Vision-language understanding of arbitrarily long videos

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, and Mohamed Elhoseiny. Goldfish: Vision-language understanding of arbitrarily long videos. InECCV, 2024. 2, 3

  4. [4]

    Qwen2.5-VL technical report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  5. [5]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022. 3, 5, 23

  6. [6]

    Revisiting the" video" in video-language understanding

    Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the" video" in video-language understanding. InCVPR, 2022. 6

  7. [7]

    Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Huichi Zhou, Qihui Zhang, Zhigang He, Yilin Bai, Chujie Gao, Liuyi Chen, et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. InThe Thirteenth International Conference on Learning Representations, 2024. 25

  8. [8]

    Eagle 2.5: Boosting long-context post-training for frontier vision-language models

    Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models.arXiv preprint arXiv:2504.15271, 2025. 7

  9. [9]

    Sharegpt4video: Improving video understanding and generation with better captions.NeurIPS, 2024

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. Sharegpt4video: Improving video understanding and generation with better captions.NeurIPS, 2024. 25

  10. [10]

    Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability

    Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024. 7

  11. [11]

    Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024. 2, 3

  12. [12]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 1, 3

  13. [13]

    vid-tldr: Training free token merging for light-weight video transformer

    Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J Kim. vid-tldr: Training free token merging for light-weight video transformer. InCVPR, 2024. 3

  14. [14]

    Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024

    E Cui, Y He, Z Ma, Z Chen, H Tian, W Wang, K Li, Y Wang, W Wang, X Zhu, et al. Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o, 2024. 25 11

  15. [15]

    Oops! Predicting unintentional action in video

    Dave Epstein, Boyuan Chen, and Carl Vondrick. Oops! Predicting unintentional action in video. arXiv preprint arXiv:1911.11206, 2019. 25

  16. [16]

    Videoagent: A memory-augmented multimodal agent for video understanding

    Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. InECCV, 2024. 6

  17. [17]

    Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024. 3

  18. [18]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075,

  19. [19]

    Interleaved-modal chain-of-thought.arXiv preprint arXiv:2411.19488, 2024

    Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought.arXiv preprint arXiv:2411.19488, 2024. 2

  20. [20]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017. 25

  21. [21]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022. 25

  22. [22]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. InCVPR, 2024. 1

  23. [23]

    From image to video, what do we need in multimodal llms?

    Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, and Zengchang Qin. From image to video, what do we need in multimodal llms?arXiv preprint arXiv:2404.11865, 2024. 1

  24. [24]

    Introducing idefics: An open reproduction of state-of-the-art visual language model.https://huggingface.co/blog/idefics, 2023

    IDEFICS Team. Introducing idefics: An open reproduction of state-of-the-art visual language model.https://huggingface.co/blog/idefics, 2023. Accessed 2025-05-12. 3

  25. [25]

    Tgif-qa: Toward spatio-temporal reasoning in visual question answering

    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. InICCV, 2017. 25

  26. [26]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024. 3

  27. [27]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 25

  28. [28]

    Text-conditioned resampler for long form video understanding

    Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, and Federico Tombari. Text-conditioned resampler for long form video understanding. InECCV, 2024. 3

  29. [29]

    Multi-criteria token fusion with one-step-ahead attention for efficient vision transformers

    Sanghyeok Lee, Joonmyung Choi, and Hyunwoo J Kim. Multi-criteria token fusion with one-step-ahead attention for efficient vision transformers. InCVPR, 2024. 3

  30. [30]

    TVQA: Localized, Compositional Video Question Answering

    Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering.arXiv preprint arXiv:1809.01696, 2018. 25

  31. [31]

    Llava-next: Stronger llms supercharge multimodal capabilities in the wild

    Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild. 2024. 2

  32. [32]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 1, 3, 25 12

  33. [33]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024. 25

  34. [34]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355,

  35. [35]

    Videochat-flash: Hierarchical compression for long-context video modeling,

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling.arXiv preprint arXiv:2501.00574, 2024. 1, 2, 3, 4, 5, 7, 8, 9, 23, 24, 25, 26

  36. [36]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024. 3

  37. [37]

    Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention

    Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. InICLR 2025 Workshop on Foundation Models in the Wild. 5

  38. [38]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122,

  39. [39]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. 25

  40. [40]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 4

  41. [41]

    Kangaroo: A powerful video-language model supporting long-context video input

    Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542, 2024. 7

  42. [42]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172, 2023. 2, 6

  43. [43]

    Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection

    Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. InCVPR,

  44. [44]

    Valley: Video assistant with large language model enhanced ability

    Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207, 2023. 3

  45. [45]

    Image as set of points.arXiv preprint arXiv:2303.01494, 2023

    Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, and Yun Fu. Image as set of points.arXiv preprint arXiv:2303.01494, 2023. 3

  46. [46]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 3

  47. [47]

    X-former elucidator: reviving efficient attention for long context language modeling

    Xupeng Miao, Shenhan Zhu, Fangcheng Fu, Ziyu Guo, Zhi Yang, Yaofeng Tu, Zhihao Jia, and Bin Cui. X-former elucidator: reviving efficient attention for long context language modeling. InIJCAI, 2024. 2

  48. [48]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024. 7

  49. [49]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024. 3, 7

  50. [50]

    Too many frames, not all useful: Efficient strategies for long-form video qa,

    Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryu, Donghyun Kim, and Michael S Ryoo. Too many frames, not all useful: Efficient strategies for long-form video qa.arXiv preprint arXiv:2406.09396, 2024. 6 13

  51. [51]

    Occluded video instance segmentation: A benchmark

    Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip HS Torr, and Song Bai. Occluded video instance segmentation: A benchmark. IJCV, 2022. 25

  52. [52]

    Testa: Temporal-spatial token aggregation for long-form video-language understanding.arXiv preprint arXiv:2310.19060,

    Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. Testa: Temporal-spatial token aggregation for long-form video-language understanding.arXiv preprint arXiv:2310.19060,

  53. [53]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024. 2

  54. [54]

    Fine-grained audible video description

    Xuyang Shen, Dong Li, Jinxing Zhou, Zhen Qin, Bowen He, Xiaodong Han, Aixuan Li, Yuchao Dai, Lingpeng Kong, Meng Wang, et al. Fine-grained audible video description. InCVPR,

  55. [55]

    Long-vita: Scaling large multi-modal models to 1 million tokens with leading short-context accuracy.arXiv preprint arXiv:2502.05177, 2025

    Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Xiawu Zheng, Yan Zhang, et al. Long-vita: Scaling large multi-modal models to 1 million tokens with leading short-context accuracy. arXiv preprint arXiv:2502.05177, 2025. 3

  56. [56]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. arXiv preprint arXiv:2409.14485, 2024. 3

  57. [57]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InCVPR, 2024. 1, 3, 25

  58. [58]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

    Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. 3, 7

  59. [59]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035, 2024. 3, 8

  60. [60]

    Unidentified video objects: A benchmark for dense, open-world segmentation

    Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world segmentation. InICCV, 2021. 25

  61. [61]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. InECCV, 2024. 1, 3

  62. [62]

    Internvideo: General video foundation models via generative and discriminative learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning.arXiv preprint arXiv:2212.03191, 2022. 1, 3

  63. [63]

    Videotree: Adaptive tree-based video representation for llm reasoning on long videos

    Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos.arXiv preprint arXiv:2405.19209, 2024. 6

  64. [64]

    Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.NeurIPS, 2024. 8

  65. [65]

    A large cross-modal video retrieval dataset with reading comprehension.Pattern Recognition, 157:110818, 2025

    Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, and Xiang Bai. A large cross-modal video retrieval dataset with reading comprehension.Pattern Recognition, 157:110818, 2025. 25

  66. [66]

    Retrieval head mechanistically explains long-context factuality

    Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574, 2024. 6

  67. [67]

    Next-qa: Next phase of question- answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. InCVPR, 2021. 25 14

  68. [68]

    Groupvit: Semantic segmentation emerges from text supervision

    Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. InCVPR, 2022. 3

  69. [69]

    Pllava: Parameter-free llava extension from images to videos for video dense captioning

    Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning.arXiv preprint arXiv:2404.16994, 2024. 1

  70. [70]

    Slowfast-llava: A strong training-free baseline for video large language models

    Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baseline for video large language models.arXiv preprint arXiv:2407.15841, 2024. 1, 3

  71. [71]

    Qwen2.5 technical report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. 7

  72. [72]

    Vript: A video is worth thousands of words.NeurIPS, 2024

    Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words.NeurIPS, 2024. 25

  73. [73]

    mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024. 7

  74. [74]

    Clevrer: Collision events for video representation and reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019. 25

  75. [75]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 3

  76. [76]

    Efficient long-range transformers: You need to attend more, but not necessarily at every layer.arXiv preprint arXiv:2310.12442, 2023

    Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, and Tuo Zhao. Efficient long-range transformers: You need to attend more, but not necessarily at every layer.arXiv preprint arXiv:2310.12442, 2023. 2

  77. [77]

    Llava-next: A strong zero-shot video understanding model

    Y Zhang, B Li, H Liu, Y Lee, L Gui, D Fu, J Feng, Z Liu, and C Li. Llava-next: A strong zero-shot video understanding model. 2024. 3

  78. [78]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024. 10, 25

  79. [79]

    Rmem: Restricted memory banks improve video object segmentation

    Junbao Zhou, Ziqi Pang, and Yu-Xiong Wang. Rmem: Restricted memory banks improve video object segmentation. InCVPR, 2024. 6

  80. [80]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding.arXiv preprint arXiv:2406.04264, 2024. 8

Showing first 80 references.