pith. machine review for the scientific record.

arxiv: 2604.08077 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords adaptive sparsity · video large language models · long-video understanding · 3D spatio-temporal cubes · entropy-based selection · efficient inference · Video-LLMs

The pith

AdaSpark adaptively sparsifies long-video processing in Video-LLMs to cut computation by up to 57% while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video Large Language Models struggle with long videos because dense processing demands too much computation, often leading to loss of detail or broken temporal modeling. AdaSpark partitions the video into 3D spatio-temporal cubes and uses two adaptive components to select only the most relevant cubes and tokens within them. An entropy-based Top-p mechanism adjusts the sparsity according to the video's complexity. This setup is intended to maintain fine-grained perception and long-range dependencies. If true, it would allow practical use of these models on extended video content without heavy resource requirements.
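To make the partitioning step concrete, here is a minimal sketch, assuming a (T, H, W, D) grid of visual tokens and an illustrative 4×2×2 cube size; it is our reconstruction for intuition, not the authors' released code.

```python
# Hedged sketch: reshape a (T, H, W, D) grid of video tokens into
# non-overlapping 3D spatio-temporal cubes. The 4x2x2 cube size is an
# illustrative assumption, not a value taken from the paper.
import torch

def partition_into_cubes(tokens, ct=4, ch=2, cw=2):
    """tokens: (T, H, W, D) -> (num_cubes, ct*ch*cw, D)."""
    T, H, W, D = tokens.shape
    assert T % ct == 0 and H % ch == 0 and W % cw == 0, "pad the grid first"
    x = tokens.view(T // ct, ct, H // ch, ch, W // cw, cw, D)
    x = x.permute(0, 2, 4, 1, 3, 5, 6).contiguous()  # gather each cube's tokens
    return x.view(-1, ct * ch * cw, D)

video = torch.randn(16, 8, 8, 64)          # 16 frames of 8x8 tokens, dim 64
print(partition_into_cubes(video).shape)   # torch.Size([64, 16, 64])
```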

Core claim

The paper claims that partitioning video inputs into 3D spatio-temporal cubes, applying Adaptive Cube-Selective Attention to select relevant cubes for each query token, and applying Adaptive Token-Selective FFN to process only salient tokens inside those cubes, all governed by an entropy-based Top-p selection that adapts sparsity to input complexity, reduces FLOPs by up to 57% while delivering performance comparable to dense models and retaining fine-grained and long-range information on hour-scale video benchmarks.
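For intuition on how cube- and token-level keep ratios could compose into that headline number, a back-of-envelope sketch; every number here is hypothetical, chosen only to show the arithmetic, not taken from the paper.

```python
# Hypothetical accounting: if sparse attention keeps a fraction `a` of its
# dense cost and the selective FFN keeps a fraction `f`, total FLOPs are the
# keep-ratio-weighted sum of the dense components.
dense_attn, dense_ffn = 40.0, 60.0   # assumed dense FLOP split (arbitrary units)
a, f = 0.35, 0.48                    # assumed keep ratios
sparse = a * dense_attn + f * dense_ffn
print(f"FLOP reduction: {1 - sparse / (dense_attn + dense_ffn):.0%}")  # 57%
```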

What carries the argument

The central mechanism is AdaSpark's entropy-based Top-p selection, applied through Adaptive Cube-Selective Attention to choose relevant 3D cubes and through the Adaptive Token-Selective FFN to process salient tokens within them, allocating computation dynamically according to content complexity.
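As a hedged illustration (a sketch under our assumptions, not the paper's implementation): Top-p selection softmax-normalizes relevance scores and keeps however many cubes are needed to cover cumulative probability mass p. A flat, high-entropy distribution therefore retains many cubes while a peaked one retains few, which is the adaptive behavior described above; the p=0.7 default echoes the cumulative-probability point marked in Figure 1.

```python
# Hedged sketch of Top-p (nucleus-style) cube selection. The scores and the
# p=0.7 default are illustrative assumptions.
import torch

def top_p_select(relevance, p=0.7):
    """Return indices of cubes whose softmax mass first covers probability p."""
    probs = torch.softmax(relevance, dim=-1)
    sorted_probs, order = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    k = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    return order[:k]

peaked = torch.tensor([6.0, 1.0, 0.5, 0.2, 0.1])    # low entropy
flat = torch.tensor([1.0, 0.9, 1.2, 1.05, 0.95])    # high entropy
print(top_p_select(peaked).tolist())   # [0]           -> few cubes kept
print(top_p_select(flat).tolist())     # [2, 3, 0, 4]  -> 4 of 5 cubes kept
```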

Load-bearing premise

That the entropy-driven Top-p selection reliably identifies and retains all information necessary for downstream tasks without irreversible loss of fine-grained or long-range details across arbitrary video content.

What would settle it

A significant drop in accuracy on a challenging hour-scale video benchmark where the model misses a key fine detail or long-range event that a dense baseline captures, caused by the selection skipping relevant cubes or tokens.

Figures

Figures reproduced from arXiv: 2604.08077 by Bo Zheng, Cheng Yu, Chuanyang Zheng, Handong Li, Jing Liu, Jun Song, Longteng Guo, Tongtian Yue, Xinxin Zhu, Yepeng Tang, Zhibin Wang, Zikang Liu, Ziming Wang.

Figure 1
Figure 1: Preliminary analysis. We analyzed internal distributions within the video-LLM layers. The upper figure shows text-to-visual attention score distributions, marking the 0.7 cumulative probability point per layer with a star. The lower figure displays L2 norm changes across modalities after the FFN, quantified as the post-to-pre norm ratio. view at source ↗
Figure 2
Figure 2: Framework illustration of AdaSpark. We process long-duration videos at their native resolution and subsequently apply video cube partitioning. Within the AdaS-Attn layer, each token query performs adaptive selection based on relevance scores computed over preceding Cubes. Upon entering the AdaS-FFN, visual tokens within each Cube are adaptively selected to pass through the FFN, while the transformations fo… view at source ↗
Figure 3
Figure 3: Video Needle in a Haystack results. We compare AdaSpark against existing high-efficiency models and methods … view at source ↗
Figure 4
Figure 4: Analysis of adaptive selection within AdaSpark. The left figure illustrates the number of cubes selected by each query per layer in AdaS-Attn. The middle figure details the average token keep ratio per cube for each layer in AdaS-FFN. The right figure demonstrates the impact of parameter choices on the dynamic selection mechanism. view at source ↗
Figure 5
Figure 5: Illustration of adaptive selection in a case study. AdaSpark adaptively selects visual cubes that exhibit high relevance to the posed query token. view at source ↗
read the original abstract

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces AdaSpark, an adaptive sparsity framework for Video-LLMs processing long-form videos. It partitions video inputs into 3D spatio-temporal cubes and employs two co-designed components—AdaS-Attn for adaptive cube selection per query token and AdaS-FFN for token selection within cubes—using an entropy-based Top-p mechanism to allocate computation based on input complexity. The central claim is that this yields up to 57% FLOP reduction while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies on hour-scale video benchmarks.

Significance. If the experimental validation holds and the selection mechanism proves robust, AdaSpark could meaningfully advance efficient long-video understanding by moving beyond rigid sparse patterns or irreversible pruning. The co-design of context-aware sparsity in both attention and FFN layers represents a practical engineering contribution that could influence downstream Video-LLM deployments, provided the entropy proxy reliably captures task relevance.

major comments (2)
  1. [Abstract] The performance and dependency-preservation claims (57% FLOP reduction with 'comparable performance' and 'preserved fine-grained, long-range dependencies') are stated without any quantitative metrics, ablation results, baseline comparisons, or error analysis. This makes the central claim difficult to evaluate until the full experiments section is examined for concrete evidence that the reductions do not degrade downstream Video-LLM accuracy.
  2. [Method (AdaS-Attn and AdaS-FFN)] The entropy-driven Top-p cube/token selection is presented as reliably retaining all information needed for downstream tasks, yet no worst-case analysis, recovery mechanism, or bound is supplied showing that low-entropy regions never contain task-critical fine-grained or long-range details. This assumption is load-bearing for the claim that 57% FLOP savings coexist with preserved dependencies on arbitrary hour-scale content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, proposing targeted revisions to improve clarity and rigor while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The performance and dependency-preservation claims (57% FLOP reduction with 'comparable performance' and 'preserved fine-grained, long-range dependencies') are stated without any quantitative metrics, ablation results, baseline comparisons, or error analysis. This makes the central claim difficult to evaluate until the full experiments section is examined for concrete evidence that the reductions do not degrade downstream Video-LLM accuracy.

    Authors: We agree that the abstract would be strengthened by incorporating concrete quantitative evidence. In the revised version, we will update the abstract to explicitly state key results, including the peak 57% FLOP reduction, specific accuracy numbers on hour-scale benchmarks (e.g., Video-MME and similar datasets), direct comparisons to dense baselines and prior sparse methods, and a brief note on ablation findings confirming dependency preservation. This will allow readers to evaluate the central claims immediately while still referring to the experiments section for full details, ablations, and error bars. revision: yes

  2. Referee: [Method (AdaS-Attn and AdaS-FFN)] The entropy-driven Top-p cube/token selection is presented as reliably retaining all information needed for downstream tasks, yet no worst-case analysis, recovery mechanism, or bound is supplied showing that low-entropy regions never contain task-critical fine-grained or long-range details. This assumption is load-bearing for the claim that 57% FLOP savings coexist with preserved dependencies on arbitrary hour-scale content.

    Authors: The referee is correct that the manuscript does not include a formal worst-case theoretical bound or recovery mechanism. Our design is a practical, input-dependent heuristic whose validity rests on extensive empirical validation: across multiple hour-scale benchmarks, AdaSpark matches dense-model accuracy on tasks requiring fine-grained perception and long-range temporal reasoning, with ablations showing that low-entropy regions are consistently non-critical under the tested distributions. We will add a dedicated limitations paragraph in the method section that (1) explicitly states the lack of a general theoretical guarantee, (2) summarizes the empirical evidence from our ablations that supports the entropy proxy, and (3) outlines directions for future formal analysis. We maintain that the current empirical results are sufficient to support the practical claims of the paper. revision: partial

Circularity Check

0 steps flagged

No circularity: engineering method with independent experimental validation

full rationale

The paper describes AdaSpark as a practical framework that partitions long videos into 3D spatio-temporal cubes and applies entropy-driven Top-p selection inside AdaS-Attn and AdaS-FFN modules to drop low-complexity subsets. No equations, uniqueness theorems, or derivation steps appear in the provided text that reduce the claimed 57% FLOP reduction or performance preservation to a fitted parameter, self-definition, or self-citation chain. The method is presented as an independent design choice whose correctness is asserted via benchmark experiments rather than by algebraic identity or prior self-referential result. This is the normal case of a non-circular engineering contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework rests on standard video tokenization assumptions plus new selection logic whose hyperparameters are not detailed in the abstract.

free parameters (1)
  • Top-p cutoff parameter
    Entropy-based selection requires a threshold or p-value that controls how many cubes/tokens are retained; its value is not specified and must be chosen or tuned.
axioms (1)
  • domain assumption: Partitioning video into 3D spatio-temporal cubes preserves sufficient structure for downstream attention and FFN operations.
    Invoked by the initial partitioning step before any adaptive selection occurs.
invented entities (2)
  • AdaS-Attn (no independent evidence)
    purpose: Adaptive selection of relevant video cubes for each query token
    New attention variant introduced by the paper.
  • AdaS-FFN (no independent evidence)
    purpose: Selective processing of salient tokens inside each cube
    New feed-forward variant introduced by the paper; a hedged sketch follows this ledger.
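To ground the ledger's one free parameter, the sketch below shows how a Top-p cutoff could gate which tokens inside a cube pass through the FFN, with unselected tokens carried unchanged on the residual path. Norm-based salience and every hyperparameter are our assumptions, loosely motivated by the post-to-pre norm-ratio analysis in Figure 1; this is not the authors' code.

```python
# Hedged AdaS-FFN-style sketch: only Top-p-selected tokens are routed
# through the FFN; the rest pass through unchanged on the residual path.
import torch
import torch.nn as nn

class TokenSelectiveFFN(nn.Module):
    def __init__(self, dim=64, hidden=256, p=0.7):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.p = p

    def forward(self, cube):  # cube: (n_tokens, dim)
        salience = torch.softmax(cube.norm(dim=-1), dim=-1)   # assumed proxy score
        sorted_s, order = torch.sort(salience, descending=True)
        cum = torch.cumsum(sorted_s, dim=-1)
        k = int(torch.searchsorted(cum, torch.tensor(self.p)).item()) + 1
        keep = order[:k]                                      # salient tokens only
        out = cube.clone()
        out[keep] = cube[keep] + self.ffn(cube[keep])         # residual FFN update
        return out

print(TokenSelectiveFFN()(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```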

pith-pipeline@v0.9.0 · 5521 in / 1342 out tokens · 44609 ms · 2026-05-10T17:16:57.683613+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 18 canonical work pages · 8 internal anchors

  1. [1]

    Normxlogit: The head-on-top never lies

    Sina Abbasi, Mohammad Reza Modarres, and Mohammad Taher Pilehvar. Normxlogit: The head-on-top never lies. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34914–34935.

  2. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pages 5803–5812, 2017.

  3. [3]

    Qwen2.5-VL technical report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.

  5. [5]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

  6. [6]

    Cosa: Concatenated sample pretrained vision-language foundation model

    Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, and Jing Liu. Cosa: Concatenated sample pretrained vision-language foundation model. In The Twelfth International Conference on Learning Representations.

  7. [7]

    Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset

    Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems, 36:72842–72866, 2023.

  8. [8]

    Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

  9. [9]

    VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.

  10. [10]

    Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024.

  11. [11]

    Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.

  12. [12]

    Framefusion: Combining similarity and importance for video token reduction on large visual language models

    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large visual language models. arXiv preprint arXiv:2501.01986, 2024.

  13. [13]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.

  14. [14]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024.

  15. [15]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.

  16. [16]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.

  17. [17]

    Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference

    Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations.

  18. [18]

    Lmms-eval: Accelerating the development of large multimodal models

    Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimodal models, 2024.

  19. [19]

    Breaking the encoder barrier for seamless video-language understanding

    Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, and Jing Liu. Breaking the encoder barrier for seamless video-language understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23167–23176, 2025.

  20. [20]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.

  21. [21]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

  22. [22]

    Vidtome: Video token merging for zero-shot video editing

    Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. Vidtome: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024.

  23. [23]

    Videochat-flash: Hierarchical compression for long-context video modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024.

  24. [24]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.

  25. [25]

    Mminference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention

    Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mminference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention. arXiv preprint arXiv:2504.16083.

  26. [26]

    Enhancing vision-language pre-training with jointly learned questioner and dense captioner

    Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, and Jing Liu. Enhancing vision-language pre-training with jointly learned questioner and dense captioner. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5120–5131, 2023.

  27. [27]

    Vrope: Rotary position embedding for video large language models

    Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, and Jing Liu. Vrope: Rotary position embedding for video large language models. arXiv preprint arXiv:2502.11664, 2025.

  28. [28]

    Moba: Mixture of block attention for long-context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025.

  29. [29]

    Vipe: Visual perception in parameter space for efficient video-language understanding

    Shichen Lu, Tongtian Yue, Longteng Guo, Handong Li, Xingjian He, Si Liu, and Jing Liu. Vipe: Visual perception in parameter space for efficient video-language understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17775–17786.

  30. [30]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.

  31. [31]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.

  32. [32]

    Tempme: Video temporal token merging for efficient text-video retrieval

    Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Yongjun Bao, Guiguang Ding, et al. Tempme: Video temporal token merging for efficient text-video retrieval. In The Thirteenth International Conference on Learning Representations.

  33. [33]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. In Forty-second International Conference on Machine Learning.

  34. [34]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169.

  35. [35]

    Adaptive keyframe sampling for long video understanding

    Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025.

  36. [36]

    Divid: Disentangled spatial-temporal modeling within LLMs for temporally grounded video understanding

    Yepeng Tang, Weining Wang, Longteng Guo, Tongtian Yue, Wenxuan Wang, Chunjie Zhang, and Jing Liu. Divid: Disentangled spatial-temporal modeling within LLMs for temporally grounded video understanding. In The Fourteenth International Conference on Learning Representations.

  37. [37]

    Dynamic inference with grounding based vision and language models

    Burak Uzkent, Amanmeet Garg, Wentao Zhu, Keval Doshi, Jingru Yi, Xiaolong Wang, and Mohamed Omar. Dynamic inference with grounding based vision and language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2624–2633, 2023.

  38. [38]

    Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  39. [39]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025.

  40. [40]

    Longllava: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture

    Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024.

  41. [41]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.

  42. [42]

    Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024.

  43. [43]

    Xattention: Block sparse attention with antidiagonal scoring

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring. In Forty-second International Conference on Machine Learning.

  44. [44]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

  45. [45]

    Fit and prune: Fast and training-free visual token pruning for multimodal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22128–22136.

  46. [46]

    Framevoyager: Learning to query frames for video large language models

    Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xu Xiaolei, Zhenbang Sun, Bingni Zhang, Jiawei Wu, et al. Framevoyager: Learning to query frames for video large language models. In The Thirteenth International Conference on Learning Representations.

  47. [47]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.

  48. [48]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024.

  49. [49]

    Long context transfer from language to vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

  50. [50]

    Q-frame: Query-aware frame selection and multi-resolution adaptation for video-LLMs

    Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. Q-frame: Query-aware frame selection and multi-resolution adaptation for video-LLMs. arXiv preprint arXiv:2506.22139, 2025.

  51. [51]

    Sparsevlm: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. In Forty-second International Conference on Machine Learning.

  52. [52]

    LLaVA-Video: Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

  53. [53]

    Learning beyond still frames: Scaling vision-language models with video

    Yiyuan Zhang, Handong Li, Jing Liu, and Xiangyu Yue. Learning beyond still frames: Scaling vision-language models with video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22425–22435.

  54. [54]

    Scaling omni-modal pretraining with multimodal context: Advancing universal representation learning across modalities

    Yiyuan Zhang, Handong Li, Jing Liu, and Xiangyu Yue. Scaling omni-modal pretraining with multimodal context: Advancing universal representation learning across modalities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1336–1348, 2025.

  55. [55]

    Needle in a video haystack: A scalable synthetic evaluator for video MLLMs

    Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Jing Liu, et al. Needle in a video haystack: A scalable synthetic evaluator for video MLLMs. In The Thirteenth International Conference on Learning Representations.

  56. [56]

    Aim: Adaptive inference of multi-modal LLMs via token merging and pruning

    Yiwu Zhong, Zhuoming Liu, Yin Li, and Liwei Wang. Aim: Adaptive inference of multi-modal LLMs via token merging and pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20180–20192, 2025.

  57. [57]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264.

  58. [58]

    AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding, Supplementary Material (internal anchor)

  59. [59]

    As summarized in Table 4, we initially evaluate the most rudimentary approach: uniform sampling

    Effect of Selection Strategy. Adhering to the ablation experimental setup described in the main text, we investigate the influence of various token selection strategies under an identical compression ratio. As summarized in Table 4, we initially evaluate the most rudimentary approach: uniform sampling. This method exhibits the most significant performa...

  60. [60]

    Training Configuration: Table 5 provides a comprehensive summary of the hyperparameters employed in the training of AdaSpark

    Implementation Details, 7.1 Training Configuration. Table 5 provides a comprehensive summary of the hyperparameters employed in the training of AdaSpark. Throughout this process, the visual encoder is maintained in a frozen state, and we implement a cube-based sparse strategy regulated by a top-p threshold. Our training methodology incorporates a mixed data...

  61. [61]

    Table 6 details the frame sampling configurations employed during inference via the lmms-eval framework

    Evaluation Settings. We evaluate AdaSpark on a series of comprehensive video-language benchmarks: 1) Extra Long Video Understanding, using Video Needle in a Haystack [49, 55]; 2) Long Video Understanding, which includes MLVU [57], VideoMME [11], LongVideoBench [41], and LVBench [39]; 3) Short Video Understanding, using MVBench [20]; 4) Spatial Reasoning,...