Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Pith reviewed 2026-05-10 00:40 UTC · model grok-4.3
The pith
Sink tokens that attract too much attention cause standard pruning to fail on precise video tasks, but suppressing them with a new score restores performance even at 90% token reduction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sink tokens are semantically uninformative visual tokens that attract excessive attention in Video LLMs, and their survival during pruning distorts visual evidence, leading to poor performance on fine-grained tasks. Sink-Token-aware Pruning (SToP) introduces a sink score to quantify this behavior and applies it within spatial and temporal pruning methods to suppress sink tokens. This plug-and-play approach improves the performance of existing methods on hallucination evaluation, open-ended generation, compositional reasoning, and MCQA, and remains effective even when pruning up to 90% of visual tokens.
What carries the argument
A sink score that measures each visual token's tendency to attract disproportionate attention and is used to suppress such tokens during pruning.
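Neither the pith nor the abstract states the sink score's actual formula. As a toy illustration of the idea only (a token that receives far more attention than the uniform baseline is flagged as sink-like), one plausible form can be sketched; the function name, baseline choice, and aggregation over layers and heads are all assumptions, not the paper's definition:

```python
import numpy as np

def sink_scores(attn, eps=1e-8):
    """Toy sink score: how much more attention each token receives
    than the uniform baseline, averaged over layers, heads, queries.

    attn: array of shape (layers, heads, queries, keys) whose rows
    sum to 1 (softmax attention). Returns one score per key token.
    Hypothetical form -- the paper's exact definition may differ.
    """
    received = attn.mean(axis=(0, 1, 2))   # mean attention each key receives
    baseline = 1.0 / attn.shape[-1]        # uniform-attention baseline
    return np.maximum(received - baseline, 0.0) / (baseline + eps)

# 4 tokens, 2 layers, 2 heads; token 0 hogs attention from every query
attn = np.full((2, 2, 4, 4), 0.1)
attn[..., 0] = 0.7                         # every query sends 0.7 to token 0
scores = sink_scores(attn)
print(scores.argmax())                     # token 0 flagged as the sink
```

Under this sketch, only the attention-hogging token gets a nonzero score, so thresholding or ranking on it isolates sink candidates without touching ordinary tokens.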
If this is right
- Existing pruning methods like VisionZip, FastVid, and Holitom gain substantial performance boosts on fine-grained tasks when combined with sink suppression.
- Pruning ratios up to 90% become viable for Video LLMs without the usual collapse in detailed understanding capabilities.
- The method works across diverse benchmarks including hallucination detection and compositional reasoning.
- Attention distortion from sink tokens is mitigated, allowing better visual grounding in generated responses.
Where Pith is reading between the lines
- This suggests that attention patterns in multimodal models may have systematic biases that pruning can correct without retraining.
- Similar sink token phenomena could be investigated in other domains like audio or text-only long context models.
- Designing pruning strategies that prioritize sink identification from the start might yield even higher efficiency gains.
- Longer video sequences could be handled efficiently if token reduction is made reliable for fine details.
Load-bearing premise
That the performance collapse on fine-grained tasks is primarily driven by the presence of sink tokens rather than by the loss of other important visual information during pruning.
What would settle it
If experiments show that suppressing tokens identified as sinks does not improve, or even harms, performance on fine-grained benchmarks relative to standard pruning, that would indicate the sink score does not address the root issue.
Original abstract
Video Large Language Models (Video LLMs) incur high inference latency due to a large number of visual tokens provided to LLMs. To address this, training-free visual token pruning has emerged as a solution to reduce computational costs; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens--semantically uninformative tokens that attract excessive attention--as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method that introduces a sink score to quantify each token's tendency to behave as a sink and applies this score to existing spatial and temporal pruning methods to suppress them, thereby enhancing video understanding. To validate the effectiveness of SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard training-free visual token pruning methods for Video LLMs suffer sharp performance drops on fine-grained tasks (e.g., hallucination evaluation) because semantically uninformative 'sink tokens' that attract excessive attention survive pruning and distort visual evidence. It introduces Sink-Token-aware Pruning (SToP), a plug-and-play extension that computes a sink score for each token and applies it to suppress sinks within existing spatial/temporal pruners (VisionZip, FastVid, Holitom). Experiments across hallucination, open-ended generation, compositional reasoning, and MCQA benchmarks reportedly show consistent gains even at 90% pruning ratios.
Significance. If the empirical gains prove robust, SToP would offer a lightweight, training-free way to improve fine-grained video understanding in efficient Video LLMs. The plug-and-play design and consistent application across multiple base pruners are practical strengths; the identification of sink tokens as a distinct failure mode on fine-grained tasks (distinct from coarse MCQA) could inform future pruning research.
Major comments (3)
- [§3 (Method)] The precise definition and computation of the sink score (central to SToP) are not provided with sufficient mathematical detail or pseudocode; without this, it is impossible to verify whether the score reliably isolates attention sinks without discarding task-critical visual information or introducing new biases.
- [§4 (Experiments)] Results lack error bars, standard deviations, or statistics from multiple random seeds/runs; this makes it difficult to judge whether the reported boosts on fine-grained benchmarks (especially at 90% pruning) are statistically reliable or sensitive to implementation choices.
- [§2 (Analysis)] The protocol for identifying sink tokens during the systematic analysis (data exclusion rules, attention threshold, video sampling) is not fully specified, weakening the causal claim that surviving sink tokens are the primary driver of the observed performance collapse on fine-grained tasks.
Minor comments (2)
- [Tables/Figures] Table captions and figure legends should explicitly list the exact pruning ratios and base methods used in each row/column for quick reference.
- [Abstract] The abstract mentions 'diverse benchmarks' but does not name the specific datasets (e.g., which hallucination benchmark); adding these would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. We commit to revisions that will improve the clarity, rigor, and reproducibility of the work without altering its core contributions.
Point-by-point responses
-
Referee: [§3 (Method)] The precise definition and computation of the sink score (central to SToP) are not provided with sufficient mathematical detail or pseudocode; without this, it is impossible to verify whether the score reliably isolates attention sinks without discarding task-critical visual information or introducing new biases.
Authors: We agree that the sink score requires more explicit mathematical formulation and implementation details for full reproducibility and verification. In the revised manuscript, we will expand Section 3 with the precise definition of the sink score (computed from normalized attention weights aggregated across layers and heads), the exact formula, and pseudocode illustrating its integration into existing spatial and temporal pruners. Our experiments applying SToP to VisionZip, FastVid, and Holitom demonstrate consistent gains on fine-grained tasks at high pruning ratios, supporting that the score targets uninformative sinks while preserving task-critical information; the added details will allow readers to confirm this directly. revision: yes
-
Referee: [§4 (Experiments)] Results lack error bars, standard deviations, or statistics from multiple random seeds/runs; this makes it difficult to judge whether the reported boosts on fine-grained benchmarks (especially at 90% pruning) are statistically reliable or sensitive to implementation choices.
Authors: We acknowledge that the lack of error bars and multi-seed statistics limits assessment of result reliability. In the revised version, we will rerun the key experiments (across hallucination, reasoning, and MCQA benchmarks) with multiple random seeds, report means accompanied by standard deviations in the tables, and include a brief discussion of observed variance, with particular attention to the 90% pruning regime. This will strengthen confidence in the reported improvements. revision: yes
-
Referee: [§2 (Analysis)] The protocol for identifying sink tokens during the systematic analysis (data exclusion rules, attention threshold, video sampling) is not fully specified, weakening the causal claim that surviving sink tokens are the primary driver of the observed performance collapse on fine-grained tasks.
Authors: We agree that the analysis protocol in Section 2 would benefit from fuller specification to support the causal interpretation. In the revision, we will explicitly detail the video sampling procedure, attention threshold for sink identification, and any data exclusion rules applied during the systematic study. These additions will clarify the methodology and reinforce the link between surviving sink tokens and degraded fine-grained performance. revision: yes
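The rebuttal describes SToP as a plug-and-play modifier on existing pruners' token rankings. A minimal sketch of how such sink suppression could slot into a generic top-k pruner follows; the subtractive combination rule and the mixing weight `lam` are assumptions for illustration, not SToP's published formula:

```python
import numpy as np

def stop_prune(saliency, sink, keep_ratio=0.1, lam=1.0):
    """Plug-and-play sink suppression (sketch): down-weight each token's
    base-pruner saliency by its sink score before top-k selection.
    `lam` is a hypothetical mixing weight; SToP's actual combination
    rule is not specified in this review.
    """
    adjusted = saliency - lam * sink
    k = max(1, int(len(saliency) * keep_ratio))
    keep = np.argsort(adjusted)[::-1][:k]   # indices of tokens to keep
    return np.sort(keep)

saliency = np.array([0.9, 0.2, 0.8, 0.1, 0.7])   # base pruner favors token 0
sink     = np.array([5.0, 0.0, 0.0, 0.0, 0.0])   # but token 0 is a sink
print(stop_prune(saliency, sink, keep_ratio=0.4))  # sink token 0 dropped
```

Because the adjustment happens before selection, the same wrapper applies to any pruner that exposes per-token saliency scores, which is what makes the approach portable across VisionZip-style spatial and FastVid/Holitom-style temporal methods.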
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper identifies sink tokens via systematic analysis of existing pruning methods' failures on fine-grained tasks, then defines a sink score as an additive modifier applied to independent prior pruners (VisionZip, FastVid, Holitom). No equations, predictions, or uniqueness theorems are presented that reduce the performance claims to fitted parameters, self-definitions, or self-citation chains. Validation rests on external benchmarks and empirical gains up to 90% pruning, which are falsifiable outside the method's own construction. The approach is self-contained with no load-bearing internal reductions.
Axiom & Free-Parameter Ledger
Invented entities (2)
- sink tokens: no independent evidence
- sink score: no independent evidence
Reference graph
Works this paper leans on
- [1] Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: Divprune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9392–9401 (2025)
- [2] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
- [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
- [4] Chen, G., Liu, Y., Huang, Y., He, Y., Pei, B., Xu, J., Wang, Y., Lu, T., Wang, L.: Cg-bench: Clue-grounded question answering benchmark for long video understanding. arXiv preprint arXiv:2412.12075 (2024)
- [5] Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: European Conference on Computer Vision. pp. 19–35. Springer (2024)
- [6] Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35, 16344–16359 (2022)
- [7] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)
- [8] Dhouib, M., Buscaldi, D., Vanier, S., Shabou, A.: Pact: Pruning and clustering-based token reduction for faster visual language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14582–14592 (2025)
- [9] Du, M., Ding, S., Jia, H.: Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems 99, 135–145 (2016)
- [10] Fan, Z., Chen, K., Xing, R., Li, Y., Jiang, L., Tian, Z.: Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging. arXiv preprint arXiv:2602.08024 (2026)
- [11] Feng, W., Wang, H., Wang, J., Zhang, X., Zhao, J., Liang, Y., Chen, X., Han, D.: Edit: Enhancing vision transformers by mitigating attention sink through an encoder-decoder architecture. In: International Conference on Optoelectronics, Computer Science, and Algorithms (OCSA 2025). vol. 14008, pp. 246–259. SPIE (2026)
- [12] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24108–24118 (2025)
- [13] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
- [14] Huang, X., Zhou, H., Han, K.: Prunevid: Visual token pruning for efficient video large language models. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 19959–19973 (2025)
- [15] Hyun, J., Hwang, S., Han, S.H., Kim, T., Lee, I., Wee, D., Lee, J.Y., Kim, S.J., Shim, M.: Multi-granular spatio-temporal token merging for training-free acceleration of video llms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23990–24000 (2025)
- [16] Jiang, N., Dravid, A., Efros, A., Gandelsman, Y.: Vision transformers don't need trained registers. arXiv preprint arXiv:2506.08010 (2025)
- [17] Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321 (2025)
- [18] Kim, D., Piergiovanni, A., Mallya, G., Angelova, A.: Videocomp: Advancing fine-grained compositional and temporal alignment in video-text models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 29060–29070 (2025)
- [19] Kim, J., Kim, K., Kim, W., Lee, B.K., Park, C.: Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding. arXiv preprint arXiv:2604.12358 (2026)
- [20] Kim, J., Kim, K., Seo, S., Park, C.: Compodistill: Attention distillation for compositional reasoning in multimodal llms. arXiv preprint arXiv:2510.12184 (2025)
- [21] Kim, S., Kim, J., Yeom, T., Park, W., Kim, K., Lee, J.: Activation quantization of vision encoders needs prefixing registers. arXiv preprint arXiv:2510.04547 (2025)
- [22] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
- [23] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)
- [24] Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision. pp. 323–340. Springer (2024)
- [25] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 5971–5984 (2024)
- [26] Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26689–26699 (2024)
- [27] Lu, A., Liao, W., Wang, L., Yang, H., Shi, J.: Artifacts and attention sinks: Structured approximations for efficient vision transformers. arXiv preprint arXiv:2507.16018 (2025)
- [28] Ma, J., Zhang, Q., Lu, M., Wang, Z., Zhou, Q., Song, J., Zhang, S.: Mmg-vid: Maximizing marginal gains at segment-level and token-level for efficient video llms. arXiv preprint arXiv:2508.21044 (2025)
- [29] Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-chatgpt: Towards detailed video understanding via large vision and language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12585–12602 (2024)
- [30] Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems 36, 46212–46244 (2023)
- [31] Qiu, H., Gao, M., Qian, L., Pan, K., Yu, Q., Li, J., Wang, W., Tang, S., Zhuang, Y., Chua, T.S.: Step: Enhancing video-llms' compositional reasoning by spatio-temporal graph-guided self-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3284–3294 (2025)
- [32] Rawal, R., Shirkavand, R., Huang, H., Somepalli, G., Goldstein, T.: Argus: Hallucination and omission evaluation in video-llms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20280–20290 (2025)
- [33] Shang, Y., Cai, M., Xu, B., Lee, Y.J., Yan, Y.: Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22857–22867 (2025)
- [34] Shao, K., Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: Holitom: Holistic token merging for fast video large language models. arXiv preprint arXiv:2505.21334 (2025)
- [35] Shen, L., Gong, G., He, T., Zhang, Y., Liu, P., Zhao, S., Ding, G.: Fastvid: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187 (2025)
- [36] Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., et al.: Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434 (2024)
- [37] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)
- [38] To sink or not to sink: Visual information pathways in large vision-language models
- [39] Sun, M., Chen, X., Kolter, J.Z., Liu, Z.: Massive activations in large language models. arXiv preprint arXiv:2402.17762 (2024)
- [40] Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: Dycoke: Dynamic compression of tokens for fast video large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18992–19001 (2025)
- [41] Tatler, B.W.: The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision 7(14), 4 (2007)
- [42] Wang, A., Sun, F., Chen, H., Lin, Z., Han, J., Ding, G.: [cls] token tells everything needed for training-free efficient mllms. arXiv preprint arXiv:2412.05819 (2024)
- [43] Wu, H., Li, D., Chen, B., Li, J.: Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754 (2024)
- [44] Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023)
- [45] Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9777–9786 (2021)
- [46] Xu, L., Zhao, Y., Zhou, D., Lin, Z., Ng, S.K., Feng, J.: Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994 (2024)
- [47] Yang, S., Chen, Y., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: Visionzip: Longer is better but not necessary in vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19792–19802 (2025)
- [48] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11975–11986 (2023)
- [49] Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)
- [50] Zhang, J., Jiao, Y., Chen, S., Zhao, N., Tan, Z., Li, H., Ma, X., Chen, J.: Eventhallusion: Diagnosing event hallucinations in video llms. arXiv preprint arXiv:2409.16597 (2024)
- [51] Zhang, Q., Cheng, A., Lu, M., Zhuo, Z., Wang, M., Cao, J., Guo, S., She, Q., Zhang, S.: [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster. arXiv e-prints, arXiv–2412 (2024)
- [52] Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., Zhang, S.: Beyond attention or similarity: Maximizing conditional diversity for token pruning in mllms. arXiv preprint arXiv:2506.10967 (2025)
- [53] Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)
- [54] Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., et al.: Mlvu: Benchmarking multi-task long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13691–13701 (2025)