AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3
The pith
AdaSpark adaptively sparsifies long-video processing in Video-LLMs, cutting FLOPs by up to 57% while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that partitioning video inputs into 3D spatio-temporal cubes, then applying Adaptive Cube-Selective Attention to pick relevant cubes for each query token and Adaptive Token-Selective FFN to process only salient tokens inside those cubes, all governed by an entropy-based Top-p selection that adapts sparsity to input complexity, reduces FLOPs by up to 57% while delivering performance comparable to dense models and retaining fine-grained and long-range information on hour-scale video benchmarks.
What carries the argument
The central mechanism is AdaSpark's entropy-based Top-p selection, which allocates computation dynamically according to content complexity: Adaptive Cube-Selective Attention chooses the relevant 3D cubes, and Adaptive Token-Selective FFN processes the salient tokens within them.
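The Top-p rule at the heart of this mechanism works like nucleus selection in language-model sampling. The sketch below is our own reconstruction, not the authors' code (function and variable names are invented): relevance scores are softmax-normalized and the smallest candidate set covering probability mass p is kept, so flat (high-entropy) score distributions automatically retain more cubes or tokens than peaked ones.

```python
import numpy as np

def top_p_select(scores: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Keep the smallest set of candidates whose softmax mass reaches p.

    Flat (high-entropy) score distributions keep many candidates;
    peaked (low-entropy) ones keep few, so sparsity adapts to input
    complexity without a fixed top-k budget.
    """
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # highest probability first
    cumulative = np.cumsum(probs[order])
    keep = np.searchsorted(cumulative, p) + 1  # smallest prefix covering p
    return np.sort(order[:keep])               # indices of retained candidates

# Peaked scores keep one candidate; near-uniform scores keep all of them.
peaked = top_p_select(np.array([8.0, 1.0, 0.5, 0.2, 0.1]), p=0.9)
flat = top_p_select(np.array([1.0, 1.1, 0.9, 1.05, 0.95]), p=0.9)
```

Under this rule the compute budget is an emergent quantity: the only knob is the cutoff p, which matches the single free parameter flagged in the ledger below.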
Load-bearing premise
That the entropy-driven Top-p selection reliably identifies and retains all information necessary for downstream tasks without irreversible loss of fine-grained or long-range details across arbitrary video content.
What would settle it
A significant accuracy drop on a challenging hour-scale video benchmark, where the model misses a key fine detail or long-range event that a dense baseline captures because the selection skipped relevant cubes or tokens.
Original abstract
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
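The cube-selection path the abstract describes can be sketched end to end. Everything below is a hypothetical reconstruction: the mean-key cube summary and dot-product scoring are our own plausible stand-ins, since the abstract does not specify how cubes are scored, and all names are invented.

```python
import numpy as np

def cube_sparse_attention(q, keys, values, cube_ids, p=0.9):
    """Attend a query only to keys in its Top-p selected cubes.

    Each cube is summarized by the mean of its member keys (one plausible
    proxy; the paper's exact score is not given in the abstract), cubes are
    scored by affinity with the query, and the Top-p rule keeps the smallest
    cube set covering mass p before running dense attention on that subset.
    """
    cubes = np.unique(cube_ids)
    summaries = np.stack([keys[cube_ids == c].mean(axis=0) for c in cubes])
    scores = summaries @ q
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = np.searchsorted(np.cumsum(probs[order]), p) + 1
    selected = set(cubes[order[:keep]].tolist())
    mask = np.array([cid in selected for cid in cube_ids])
    # dense attention restricted to tokens in the selected cubes
    att = keys[mask] @ q / np.sqrt(q.shape[0])
    w = np.exp(att - att.max())
    w /= w.sum()
    return w @ values[mask], selected

rng = np.random.default_rng(0)
keys = rng.normal(size=(24, 8))        # 24 tokens, 8-dim, across 6 cubes
values = rng.normal(size=(24, 8))
cube_ids = np.repeat(np.arange(6), 4)  # 6 spatio-temporal cubes of 4 tokens
out, chosen = cube_sparse_attention(rng.normal(size=8), keys, values, cube_ids)
```

The FFN-side selection would apply the same Top-p rule to a per-token saliency score inside each retained cube; the attention-side sketch above is the structurally interesting half.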
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AdaSpark, an adaptive sparsity framework for Video-LLMs processing long-form videos. It partitions video inputs into 3D spatio-temporal cubes and employs two co-designed components—AdaS-Attn for adaptive cube selection per query token and AdaS-FFN for token selection within cubes—using an entropy-based Top-p mechanism to allocate computation based on input complexity. The central claim is that this yields up to 57% FLOP reduction while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies on hour-scale video benchmarks.
Significance. If the experimental validation holds and the selection mechanism proves robust, AdaSpark could meaningfully advance efficient long-video understanding by moving beyond rigid sparse patterns or irreversible pruning. The co-design of context-aware sparsity in both attention and FFN layers represents a practical engineering contribution that could influence downstream Video-LLM deployments, provided the entropy proxy reliably captures task relevance.
Major comments (2)
- [Abstract] The 57% FLOP-reduction and dependency-preservation claims ('comparable performance', 'preserved fine-grained, long-range dependencies') are stated without quantitative metrics, ablation results, baseline comparisons, or error analysis. This makes the central claim difficult to evaluate until the experiments section supplies concrete evidence that the reductions do not degrade downstream Video-LLM accuracy.
- [Method (AdaS-Attn and AdaS-FFN)] The entropy-driven Top-p cube/token selection is presented as reliably retaining all information needed for downstream tasks, yet no worst-case analysis, recovery mechanism, or bound shows that low-entropy regions never contain task-critical fine-grained or long-range details. This assumption is load-bearing for the claim that 57% FLOP savings coexist with preserved dependencies on arbitrary hour-scale content.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, proposing targeted revisions to improve clarity and rigor while preserving the core contributions.
Point-by-point responses
- Referee: [Abstract] The performance and dependency-preservation claims (57% FLOP reduction with 'comparable performance' and 'preserved fine-grained, long-range dependencies') are stated without quantitative metrics, ablation results, baseline comparisons, or error analysis, making the central claim difficult to evaluate until the experiments section is examined for concrete evidence that the reductions do not degrade downstream Video-LLM accuracy.
  Authors: We agree that the abstract would be strengthened by concrete quantitative evidence. In the revised version we will update the abstract to state key results explicitly, including the peak 57% FLOP reduction, specific accuracy numbers on hour-scale benchmarks (e.g., Video-MME and similar datasets), direct comparisons to dense baselines and prior sparse methods, and a brief note on ablation findings confirming dependency preservation. This lets readers evaluate the central claims immediately while still referring to the experiments section for full details, ablations, and error bars. Revision: yes.
- Referee: [Method (AdaS-Attn and AdaS-FFN)] The entropy-driven Top-p cube/token selection is presented as reliably retaining all information needed for downstream tasks, yet no worst-case analysis, recovery mechanism, or bound shows that low-entropy regions never contain task-critical fine-grained or long-range details. This assumption is load-bearing for the claim that 57% FLOP savings coexist with preserved dependencies on arbitrary hour-scale content.
  Authors: The referee is correct that the manuscript includes no formal worst-case bound or recovery mechanism. Our design is a practical, input-dependent heuristic whose validity rests on extensive empirical validation: across multiple hour-scale benchmarks, AdaSpark matches dense-model accuracy on tasks requiring fine-grained perception and long-range temporal reasoning, with ablations showing that low-entropy regions are consistently non-critical under the tested distributions. We will add a dedicated limitations paragraph in the method section that (1) explicitly states the lack of a general theoretical guarantee, (2) summarizes the empirical evidence from our ablations supporting the entropy proxy, and (3) outlines directions for future formal analysis. We maintain that the current empirical results suffice to support the paper's practical claims. Revision: partial.
Circularity Check
No circularity: engineering method with independent experimental validation
Full rationale
The paper describes AdaSpark as a practical framework that partitions long videos into 3D spatio-temporal cubes and applies entropy-driven Top-p selection inside AdaS-Attn and AdaS-FFN modules to drop low-complexity subsets. No equations, uniqueness theorems, or derivation steps appear in the provided text that reduce the claimed 57% FLOP reduction or performance preservation to a fitted parameter, self-definition, or self-citation chain. The method is presented as an independent design choice whose correctness is asserted via benchmark experiments rather than by algebraic identity or prior self-referential result. This is the normal case of a non-circular engineering contribution.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Top-p cutoff parameter
Axioms (1)
- Domain assumption: partitioning video into 3D spatio-temporal cubes preserves sufficient structure for downstream attention and FFN operations.
Invented entities (2)
- AdaS-Attn (no independent evidence)
- AdaS-FFN (no independent evidence)
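The ledger's single free parameter, the Top-p cutoff, determines how the two sparsity mechanisms compound into an overall FLOP reduction. As a back-of-the-envelope sketch (the cost split and retention fractions below are our own illustrative assumptions, not measured values from the paper), cube-level sparsity discounts the attention share of dense FLOPs while token-level sparsity discounts the FFN share:

```python
# Illustrative only: the fractions below are assumptions, not the paper's
# measured values. Attention cost scales with the fraction of cubes
# attended; FFN cost scales with the fraction of tokens processed.
attn_share, ffn_share = 0.5, 0.5     # assumed split of dense FLOPs
kept_cubes, kept_tokens = 0.4, 0.45  # assumed Top-p retention rates

sparse_cost = attn_share * kept_cubes + ffn_share * kept_tokens
reduction = 1.0 - sparse_cost
print(f"FLOP reduction ≈ {reduction:.0%}")  # ≈ 57% under these assumptions
```

The point of the sketch is that the headline figure is not set directly: it emerges from whatever retention rates the entropy-driven Top-p rule produces on a given input distribution, which is why the cutoff is the only tunable knob in the ledger.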
Reference graph
Works this paper leans on
- [1] Sina Abbasi, Mohammad Reza Modarres, and Mohammad Taher Pilehvar. Normxlogit: The head-on-top never lies. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34914–34935, 2025.
- [2] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pages 5803–5812, 2017.
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [4] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.
- [5] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.
- [6] Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, and Jing Liu. Cosa: Concatenated sample pretrained vision-language foundation model. In The Twelfth International Conference on Learning Representations.
- [7] Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems, 36:72842–72866, 2023.
- [8] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [9] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.
- [10] Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos. arXiv preprint arXiv:2408.14023, 2024.
- [11] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.
- [12] Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. FrameFusion: Combining similarity and importance for video token reduction on large visual language models. arXiv preprint arXiv:2501.01986, 2024.
- [13] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.
- [14] Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. MA-LMM: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024.
- [15] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.
- [16] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.
- [17] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations.
- [18] Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Accelerating the development of large multimodal models, 2024.
- [19] Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, and Jing Liu. Breaking the encoder barrier for seamless video-language understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23167–23176, 2025.
- [20] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.
- [21] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.
- [22] Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024.
- [23] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024.
- [24] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
- [25] Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. MMInference: Accelerating pre-filling for long-context VLMs via modality-aware permutation sparse attention. arXiv preprint arXiv:2504.16083.
- [26] Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, and Jing Liu. Enhancing vision-language pre-training with jointly learned questioner and dense captioner. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5120–5131, 2023.
- [27] Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, and Jing Liu. VRoPE: Rotary position embedding for video large language models. arXiv preprint arXiv:2502.11664, 2025.
- [28] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025.
- [29] Shichen Lu, Tongtian Yue, Longteng Guo, Handong Li, Xingjian He, Si Liu, and Jing Liu. Vipe: Visual perception in parameter space for efficient video-language understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17775–17786, 2025.
- [30] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
- [31] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.
- [32] Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Yongjun Bao, Guiguang Ding, et al. TempMe: Video temporal token merging for efficient text-video retrieval. In The Thirteenth International Conference on Learning Representations.
- [33] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. LongVU: Spatiotemporal adaptive compression for long video-language understanding. In Forty-second International Conference on Machine Learning.
- [34] Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169.
- [35] Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29118–29128, 2025.
- [36] Yepeng Tang, Weining Wang, Longteng Guo, Tongtian Yue, Wenxuan Wang, Chunjie Zhang, and Jing Liu. Divid: Disentangled spatial-temporal modeling within LLMs for temporally grounded video understanding. In The Fourteenth International Conference on Learning Representations.
- [37] Burak Uzkent, Amanmeet Garg, Wentao Zhu, Keval Doshi, Jingru Yi, Xiaolong Wang, and Mohamed Omar. Dynamic inference with grounding based vision and language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2624–2633, 2023.
- [38] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [39] Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. LVBench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025.
- [40] Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. LongLLaVA: Scaling multi-modal LLMs to 1000 images efficiently via a hybrid architecture. arXiv preprint arXiv:2409.02889, 2024.
- [41] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024.
- [42] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024.
- [43] Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. XAttention: Block sparse attention with antidiagonal scoring. In Forty-second International Conference on Machine Learning.
- [44] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
- [45] Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22128–22136.
- [46] Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xu Xiaolei, Zhenbang Sun, Bingni Zhang, Jiawei Wu, et al. Frame-Voyager: Learning to query frames for video large language models. In The Thirteenth International Conference on Learning Representations.
- [47] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025.
- [48] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-Eval: Reality check on the evaluation of large multimodal models, 2024.
- [49] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.
- [50] Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. Q-Frame: Query-aware frame selection and multi-resolution adaptation for video-LLMs. arXiv preprint arXiv:2506.22139, 2025.
- [51] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. In Forty-second International Conference on Machine Learning.
- [52] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.
- [53] Yiyuan Zhang, Handong Li, Jing Liu, and Xiangyu Yue. Learning beyond still frames: Scaling vision-language models with video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22425–22435.
- [54] Yiyuan Zhang, Handong Li, Jing Liu, and Xiangyu Yue. Scaling omni-modal pretraining with multimodal context: Advancing universal representation learning across modalities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1336–1348, 2025.
- [55] Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Jing Liu, et al. Needle in a video haystack: A scalable synthetic evaluator for video MLLMs. In The Thirteenth International Conference on Learning Representations.
- [56] Yiwu Zhong, Zhuoming Liu, Yin Li, and Liwei Wang. AIM: Adaptive inference of multi-modal LLMs via token merging and pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20180–20192, 2025.
- [57] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264.
- [58] AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding, Supplementary Material.
- [59] Effect of Selection Strategy: "Adhering to the ablation experimental setup described in the main text, we investigate the influence of various token selection strategies under an identical compression ratio. As summarized in Table 4, we initially evaluate the most rudimentary approach: uniform sampling. This method exhibits the most significant performa..."
- [60] Implementation Details, Training Configuration: "Table 5 provides a comprehensive summary of the hyperparameters employed in the training of AdaSpark. Throughout this process, the visual encoder is maintained in a frozen state, and we implement a cube-based sparse strategy regulated by a top-p threshold. Our training methodology incorporates a mixed data..."
- [61] Evaluation Settings: "We evaluate AdaSpark on a series of comprehensive video-language benchmarks: 1) Extra Long Video Understanding, using Video Needle in a Haystack [49, 55]; 2) Long Video Understanding, which includes MLVU [57], VideoMME [11], LongVideoBench [41], and LVBench [39]; 3) Short Video Understanding, using MVBench [20]; 4) Spatial Reasoning,..." Table 6 details the frame sampling configurations employed during inference via the lmms-eval framework.