Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models
Pith reviewed 2026-06-26 01:02 UTC · model grok-4.3
The pith
Token importance in multimodal models is revealed by how their frequency content redistributes across Transformer layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLSE quantifies how token representations evolve across Transformer layers in the frequency domain. This evolution reflects the transition from high-frequency structural details to low-frequency semantic abstractions. Tokens with stronger spectral redistribution across layers are more likely to be semantically active and should therefore be preserved. By modeling cross-layer token dynamics, CLSE provides a stable importance criterion that mitigates positional bias.
What carries the argument
Cross-Layer Spectral Evolution (CLSE), a measure of the magnitude of frequency-domain change in each token's representation from layer to layer.
If this is right
- Aggressive token reduction becomes feasible while accuracy on image and video reasoning tasks remains competitive or improves.
- FLOPs, KV-cache memory, and inference latency decrease across multiple MLLM architectures without retraining.
- Positional bias that affects attention-based or similarity-based pruning is reduced by the layer-wise frequency criterion.
Where Pith is reading between the lines
- The same frequency-redistribution signal could be tested on text-only transformers to see whether it identifies important tokens in pure language settings.
- If the frequency shift correlates with semantic activation, one could explore whether early-layer high-frequency tokens can be safely dropped in other vision-language pipelines.
- The method supplies an explicit link between signal-processing notions of frequency content and the internal dynamics of attention layers.
Load-bearing premise
Stronger spectral redistribution of token representations across layers reliably indicates semantic activity and supplies a stable, bias-mitigating importance score.
What would settle it
An ablation in which tokens ranked highest by CLSE are pruned first and the resulting accuracy drop is no smaller than the drop produced by single-layer baselines at identical token counts.
Figures
read the original abstract
Reducing visual token redundancy is critical for accelerating Multimodal Large Language Models (MLLMs) without degrading cross-modal reasoning performance. Existing token pruning methods typically rely on single-layer signals, such as attention scores or token similarities, which overlook the cross-layer transformation of visual representations and may exhibit positional bias in multimodal token sequences. To address this limitation, we propose a training-free token pruning framework based on Cross-Layer Spectral Evolution (CLSE). Instead of measuring token importance from single-layer feature magnitudes, CLSE quantifies how token representations evolve across Transformer layers in the frequency domain. This evolution reflects the transition from high-frequency structural details to low-frequency semantic abstractions. We observe that tokens with stronger spectral redistribution across layers are more likely to be semantically active and should therefore be preserved. By modeling cross-layer token dynamics, CLSE provides a stable importance criterion that mitigates positional bias. Extensive experiments on both image and video benchmarks demonstrate that CLSE achieves a superior trade-off between efficiency and accuracy under aggressive token reduction. Across multiple MLLMs, CLSE reduces FLOPs, KV cache memory, and latency while maintaining competitive or improved performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cross-Layer Spectral Evolution (CLSE), a training-free token pruning framework for Multimodal Large Language Models. CLSE measures how token representations evolve across Transformer layers in the frequency domain, observing that stronger spectral redistribution indicates semantic activity; these tokens are preserved to reduce visual token redundancy while mitigating positional bias from single-layer signals such as attention scores. Experiments on image and video benchmarks are claimed to show superior efficiency-accuracy trade-offs under aggressive pruning across multiple MLLMs.
Significance. If the core observational link holds and is independently validated, CLSE could offer a more stable, cross-layer criterion for token importance that reduces reliance on potentially biased single-layer heuristics, with potential benefits for KV cache and latency reduction in MLLMs. The training-free nature and frequency-domain approach are strengths if they prove robust beyond the reported benchmarks.
major comments (2)
- [Abstract] Abstract: The central claim that 'tokens with stronger spectral redistribution across layers are more likely to be semantically active' is presented as an observation that directly motivates the preservation rule, yet no independent validation, correlation analysis, or causal test (e.g., against human-annotated semantic labels or controlled perturbations) is indicated to separate this from end-to-end accuracy numbers. This assumption is load-bearing for the method's claimed advantage over single-layer baselines.
- [Abstract] Abstract: The manuscript asserts experimental superiority in FLOPs, KV cache, latency, and accuracy on image/video benchmarks but supplies no equations defining the spectral redistribution metric, no dataset details, no ablation results isolating the cross-layer frequency component, and no error bars or statistical significance tests. These omissions prevent verification of whether reported gains stem from the proposed mechanism.
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments, which help clarify the presentation of our core claims and experimental details. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'tokens with stronger spectral redistribution across layers are more likely to be semantically active' is presented as an observation that directly motivates the preservation rule, yet no independent validation, correlation analysis, or causal test (e.g., against human-annotated semantic labels or controlled perturbations) is indicated to separate this from end-to-end accuracy numbers. This assumption is load-bearing for the method's claimed advantage over single-layer baselines.
Authors: The observation originates from our frequency-domain analysis of token evolution across layers (detailed in Section 3), where stronger redistribution correlates with tokens that retain semantic content in later layers. End-to-end benchmark gains provide supporting evidence, but we agree an explicit independent validation would strengthen the argument. In revision we will add a dedicated subsection with correlation analysis against attention-based semantic proxies and controlled perturbation experiments on a subset of the benchmarks. revision: yes
-
Referee: [Abstract] Abstract: The manuscript asserts experimental superiority in FLOPs, KV cache, latency, and accuracy on image/video benchmarks but supplies no equations defining the spectral redistribution metric, no dataset details, no ablation results isolating the cross-layer frequency component, and no error bars or statistical significance tests. These omissions prevent verification of whether reported gains stem from the proposed mechanism.
Authors: The abstract is intentionally concise; the full manuscript contains the spectral redistribution equations (Eq. 3 in Section 3.2), dataset specifications (Section 4.1), and ablations that isolate the cross-layer frequency term (Section 4.3). Error bars and significance tests appear in Tables 1–4. We will revise the abstract to explicitly reference these sections and, if needed, expand the ablation isolating the frequency component to make the contribution of the cross-layer signal clearer. revision: yes
Circularity Check
No significant circularity; derivation is observation-driven and self-contained
full rationale
The paper presents CLSE as a training-free method that computes cross-layer spectral evolution from existing Transformer representations and uses an observed correlation (stronger redistribution marking semantic activity) to rank tokens for pruning. No equations or steps reduce a claimed prediction to a fitted parameter, self-citation chain, or definitional tautology; the importance criterion is computed directly from the model's internal activations without re-using the target accuracy metric or prior self-referential results as inputs. The central link between spectral redistribution and semantic activity is presented as an empirical observation rather than a derived theorem, leaving the method open to external validation on accuracy benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Alvar, S.R., Singh, G., Akbari, M., Zhang, Y .: Di- vprune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9392– 9401 (2025)
2025
-
[2]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Arif, K.H.I., Yoon, J., Nikolopoulos, D.S., Vandieren- donck, H., John, D., Ji, B.: Hired: Attention- guided token dropping for efficient inference of high- resolution vision-language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 1773–1781 (2025)
2025
-
[3]
In: The Eleventh International Conference on Learn- ing Representations (2023)
Bolya, D., Fu, C.Y ., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your vit but faster. In: The Eleventh International Conference on Learn- ing Representations (2023)
2023
-
[4]
In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies
Chen, D., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. pp. 190– 200 (2011)
2011
-
[5]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition
Chen, J., Liu, X., Wen, Z., Wang, Y ., Huang, S., Chen, H.: Variation-aware vision token dropping for faster large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 3489–3499 (2026)
2026
-
[6]
In: European Conference on Com- puter Vision
Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision- language models. In: European Conference on Com- puter Vision. pp. 19–35. Springer (2024)
2024
-
[7]
In: European Confer- ence on Computer Vision
Chen, Z., Cai, Y ., Guo, J., Cai, T., Yin, J., Chen, Z.: Accelerating multimodal large language models with prior-corrected token reduction. In: European Confer- ence on Computer Vision. Springer (2026)
2026
-
[8]
Chen, Z., Wang, W., Cao, Y ., Liu, Y ., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: In- ternvl: Scaling up vision foundation models and align- ing for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)
2024
-
[10]
In: W ACV 2020
Chen, Z., Li, J., Luo, Y ., Huang, Z., Yang, Y .: Canzsl: Cycle-consistent adversarial networks for zero-shot learning from natural language. In: W ACV 2020. pp. 874–883 (2020)
2020
-
[11]
IEEE Transactions on Multimedia (2026)
Chen, Z., Luo, Y ., Huang, Z., Li, J., Wang, S., Yu, X.: Distributed zero-shot learning for visual recognition. IEEE Transactions on Multimedia (2026)
2026
-
[12]
ICCV 2021 (2021)
Chen, Z., Luo, Y ., Qiu, R., Wang, S., Huang, Z., Li, J., Zhang, Z.: Semantics disentangling for generalized zero-shot learning. ICCV 2021 (2021)
2021
-
[13]
IEEE Transactions on Multimedia (2022)
Chen, Z., Luo, Y ., Wang, S., Li, J., Huang, Z.: Gsm- flow: Generation shifts mitigating flow for generalized zero-shot learning. IEEE Transactions on Multimedia (2022)
2022
-
[14]
In: ACM MM 2021 (2021)
Chen, Z., Luo, Y ., Wang, S., Qiu, R., Li, J., Huang, Z.: Mitigating generation shifts for generalized zero-shot learning. In: ACM MM 2021 (2021)
2021
-
[15]
In: ACM MM
Chen, Z., Wang, S., Li, J., Huang, Z.: Rethinking gen- erative zero-shot learning: An ensemble learning per- spective for recognising visual patches. In: ACM MM
-
[16]
3413–3421 (2020)
pp. 3413–3421 (2020)
2020
-
[17]
Pattern Recognition p
Chen, Z., Yu, X., Tao, X., Li, Y ., Huang, Z.: Cluster- aware prompt ensemble learning for few-shot vision– language model adaptation. Pattern Recognition p. 112596 (2025)
2025
-
[18]
In: ACM MM 2023 (2023)
Chen, Z., Zhang, P., Li, J., Wang, S., Huang, Z.: Zero- shot learning by harnessing adversarial samples. In: ACM MM 2023 (2023)
2023
-
[19]
In: ICCV 2025 (2025)
Chen, Z., Zhao, Z., Guo, J., Li, J., Huang, Z.: Svip: Semantically contextualized visual patches for zero- shot learning. In: ICCV 2025 (2025)
2025
-
[20]
Pattern Recog- nition p
Chen, Z., Zhao, Z., Luo, Y ., Li, Y ., Tao, X., Huang, Z.: Fastedit: Fast text-guided single-image editing via semantic-aware diffusion fine-tuning. Pattern Recog- nition p. 112583 (2025)
2025
-
[21]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weis- senborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An im- age is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[22]
In: Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing
Fan, Y ., Zhao, A., Fu, J., Tong, J., Su, H., Pan, Y ., Zhang, W., Shen, X.: Visipruner: Decoding discontin- uous cross-modal dynamics for efficient multimodal llms. In: Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing. pp. 18896–18913 (2025)
2025
-
[23]
In: European conference on com- puter vision
Fayyaz, M., Koohpayegani, S.A., Jafari, F.R., Sen- gupta, S., Joze, H.R.V ., Sommerlade, E., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vi- sion transformers. In: European conference on com- puter vision. pp. 396–414. Springer (2022)
2022
-
[24]
In: The Thirty-ninth Annual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (2025)
Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (2025)
2025
-
[25]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Ka- dian, A., Al-Dahle, A., Letman, A., Mathur, A., Schel- ten, A., Vaughan, A., et al.: The llama 3 herd of mod- els. arXiv preprint arXiv:2407.21783 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grau- man, K., Luo, J., Bigham, J.P.: Vizwiz grand chal- lenge: Answering visual questions from blind people. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3608–3617 (2018)
2018
-
[27]
arXiv preprint arXiv:2411.176862(3) (2024)
Han, Y ., Liu, X., Ding, P., Wang, D., Chen, H., Yan, Q., Huang, S.: Rethinking token reduction in mllms: Towards a unified paradigm for training-free accelera- tion. arXiv preprint arXiv:2411.176862(3) (2024)
-
[28]
In: Proceedings of the 40th AAAI Confer- ence on Artificial Intelligence (2025)
Han, Y ., Liu, X., Zhang, Z., Ding, P., Chen, J., Wang, D., Chen, H., Yan, Q., Huang, S.: Filter, correlate, compress: Training-free token reduction for mllm ac- celeration. In: Proceedings of the 40th AAAI Confer- ence on Artificial Intelligence (2025)
2025
-
[29]
arXiv preprint arXiv:2410.08584 (2024)
He, Y ., Chen, F., Liu, J., Shao, W., Zhou, H., Zhang, K., Zhuang, B.: Zipvl: Efficient large vision- language models with dynamic token sparsification. arXiv preprint arXiv:2410.08584 (2024)
-
[30]
arXiv preprint arXiv:2509.00419 (2025)
Hu, L., Shang, F., Feng, W., Wan, L.: Lightvlm: Ac- celeraing large multimodal models with pyramid to- ken merging and kv cache compression. arXiv preprint arXiv:2509.00419 (2025)
-
[31]
In: European conference on computer vision
Huang, K., Zou, H., Xi, Y ., Wang, B., Xie, Z., Yu, L.: Ivtp: Instruction-guided visual token pruning for large vision-language models. In: European conference on computer vision. pp. 214–230. Springer (2024)
2024
-
[32]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion
Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion. pp. 6700–6709 (2019)
2019
-
[33]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Jang, Y ., Song, Y ., Yu, Y ., Kim, Y ., Kim, G.: Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2758– 2766 (2017)
2017
-
[34]
arXiv preprint arXiv:2503.11549 (2025)
Jeddi, A., Baghbanzadeh, N., Dolatabadi, E., Taati, B.: Similarity-aware token pruning: Your vlm but faster. arXiv preprint arXiv:2503.11549 (2025)
-
[35]
In: European conference on computer vision
Jiang, X., Zhang, X., Gao, N., Deng, Y .: When fast fourier transform meets transformer for image restora- tion. In: European conference on computer vision. pp. 381–402. Springer (2024)
2024
-
[36]
In: Pro- ceedings of the 2022 Conference of the north Ameri- can chapter of the Association for Computational Lin- guistics: human language technologies
Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontanon, S.: Fnet: Mixing tokens with fourier transforms. In: Pro- ceedings of the 2022 Conference of the north Ameri- can chapter of the Association for Computational Lin- guistics: human language technologies. pp. 4296– 4313 (2022)
2022
-
[37]
arXiv e-prints pp
Li, D., Yang, Z., Lu, S.: Todre: Visual token prun- ing via diversity and task awareness for efficient large vision-language models. arXiv e-prints pp. arXiv– 2505 (2025)
2025
-
[38]
In: ACM MM 2025, pp
Li, X., Zhang, D., Du, Z., Zhu, L., Chen, Z., Li, J.: Pataug: Augmentation of augmentation for test- time adaptation. In: ACM MM 2025, pp. 5080–5089 (2025)
2025
-
[39]
In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing
Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision- language models. In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing. pp. 292–305 (2023)
2023
-
[40]
arXiv preprint arXiv:2501.14204 (2025)
Liang, X., Guan, C., Lu, J., Chen, H., Wang, H., Hu, H.: Dynamic token reduction during gen- eration for vision language models. arXiv preprint arXiv:2501.14204 (2025)
-
[41]
NeurIPS2024 (2024)
Lim, J.S., Chen, Z., Chen, Z., Baktashmotlagh, M., Yu, X., Huang, Z., Luo, Y .: Dipex: Dispersing prompt expansion for class-agnostic object detection. NeurIPS2024 (2024)
2024
-
[42]
In: Proceed- ings of the 2024 conference on empirical methods in natural language processing
Lin, B., Ye, Y ., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual repre- sentation by alignment before projection. In: Proceed- ings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)
2024
-
[43]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu, H., Li, C., Li, Y ., Lee, Y .J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)
2024
-
[44]
In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., Lee, Y .J.: Llava-next: Improved reasoning, ocr, and world knowledge. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
2024
-
[45]
In: Advances in Neural Information Process- ing Systems (NeurIPS) (2023)
Liu, H., Li, C., Wu, Q., Lee, Y .J.: Visual instruction tuning. In: Advances in Neural Information Process- ing Systems (NeurIPS) (2023)
2023
-
[46]
arXiv preprint arXiv:2410.07278 (2024)
Liu, Y ., Wu, F., Li, R., Tang, Z., Li, K.: Par: Prompt- aware token reduction method for efficient large multimodal models. arXiv preprint arXiv:2410.07278 (2024)
-
[47]
Liu, Y ., Duan, H., Zhang, Y ., Li, B., Zhang, S., Zhao, W., Yuan, Y ., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)
2024
-
[48]
Science China Information Sciences67(12), 220102 (2024)
Liu, Y ., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences67(12), 220102 (2024)
2024
-
[49]
Advances in Neural In- formation Processing Systems35, 2507–2521 (2022)
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to ex- plain: Multimodal reasoning via thought chains for science question answering. Advances in Neural In- formation Processing Systems35, 2507–2521 (2022)
2022
-
[50]
Advances in Neural Information Processing Systems35, 29319– 29335 (2022)
Nguyen, T., Pham, M., Nguyen, T., Nguyen, K., Os- her, S., Ho, N.: Fourierformer: Transformer meets generalized fourier integral theorem. Advances in Neural Information Processing Systems35, 29319– 29335 (2022)
2022
-
[51]
Advances in neu- ral information processing systems34, 13937–13949 (2021)
Rao, Y ., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neu- ral information processing systems34, 13937–13949 (2021)
2021
-
[52]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Shang, Y ., Cai, M., Xu, B., Lee, Y .J., Yan, Y .: Llava- prumerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22857–22867 (2025)
2025
-
[53]
In: AAAI 2026 (2026)
Shao, Y ., Lin, D., Yan, M., Chen, S., Zeng, F., Liao, M., Ma, A., Yan, Z., Wang, H., Wang, Y ., et al.: Tr-dq: Time-rotation diffusion quantization. In: AAAI 2026 (2026)
2026
-
[54]
arXiv preprint arXiv:2305.17455 (2023)
Shi, D., Tao, C., Rao, A., Yang, Z., Yuan, C., Wang, J.: Crossget: Cross-guided ensemble of tokens for accel- erating vision-language transformers. arXiv preprint arXiv:2305.17455 (2023)
-
[55]
Si, G., Yin, H., Li, X., Liao, W., He, T., Peng, P., Zhu, W., et al.: Infoprune: Revisiting visual token pruning from an information-theoretic perspective
-
[56]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Singh, A., Natarajan, V ., Shah, M., Jiang, Y ., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)
2019
-
[57]
In: Proceedings of the 31st International Conference on Computational Linguistics
Song, D., Wang, W., Chen, S., Wang, X., Guan, M.X., Wang, B.: Less is more: A simple yet effective token reduction method for efficient multi-modal llms. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 7614–7623 (2025)
2025
-
[58]
CVPR 2022 (2022)
Su, H., Li, J., Chen, Z., Zhu, L., Lu, K.: Distinguish- ing unseen from seen for generalized zero-shot learn- ing. CVPR 2022 (2022)
2022
-
[59]
arXiv preprint arXiv:2602.03060 (2026)
Sun, Z., Ma, Y ., Liu, G., Chen, Y ., Tang, X., Hu, Y ., Xu, Y .: Ivc-prune: Revealing the implicit visual coordinates in lvlms for vision token pruning. arXiv preprint arXiv:2602.03060 (2026)
-
[60]
Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models
Wang, H., Kai, J., Bai, H., Hou, L., Jiang, B., He, Z., Lin, Z.: Fourier-vlm: Compressing vision tokens in the frequency domain for large vision-language mod- els. arXiv preprint arXiv:2508.06038 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[61]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, Z., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s per- ception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
arXiv preprint arXiv:2505.19235 (2025)
Wang, Q., Ye, H., Chung, M.Y ., Liu, Y ., Lin, Y ., Kuo, M., Ma, M., Zhang, J., Chen, Y .: Core- matching: A co-adaptive sparse inference framework with token and neuron pruning for comprehensive ac- celeration of vision-language models. arXiv preprint arXiv:2505.19235 (2025)
-
[63]
In: IJCAI 2025 (2025)
Wang, T., Guo, J., Li, D., Chen, Z.: On the discrim- ination and consistency for exemplar-free class incre- mental learning. In: IJCAI 2025 (2025)
2025
-
[64]
In: CVPR 2026 Findings (2026)
Wang, W., Guo, J., Cai, Y ., Chen*, Z.: Learning multi- modal prototypes for cross-domain few-shot object detection. In: CVPR 2026 Findings (2026)
2026
-
[65]
arXiv preprint arXiv:2602.17196 (2026)
Wang, Y ., Wu, J., Ni, Z., Yang, C., Liu, Y ., Yang, L., Zhou, Y ., Wen, Y ., He, L.: Entropy- prune: Matrix entropy guided visual token pruning for multimodal large language models. arXiv preprint arXiv:2602.17196 (2026)
-
[66]
Advances in Neu- ral Information Processing Systems35, 13974–13988 (2022)
Wang, Z., Luo, H., Wang, P., Ding, F., Wang, F., Li, H.: Vtc-lfc: Vision transformer compression with low-frequency components. Advances in Neu- ral Information Processing Systems35, 13974–13988 (2022)
2022
-
[67]
Wen, Z., Gao, Y ., Li, W., He, C., Zhang, L.: Token pruning in multimodal large language models: Are we solving the right problem? In: Findings of the Asso- ciation for Computational Linguistics: ACL 2025. pp. 15537–15549 (2025)
2025
-
[68]
important tokens
Wen, Z., Gao, Y ., Wang, S., Zhang, J., Zhang, Q., Li, W., He, C., Zhang, L.: Stop looking for “important tokens” in multimodal language models: Duplication matters more. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing. pp. 9972–9991 (2025)
2025
-
[69]
arXiv preprint arXiv:2502.00791 (2025)
Xing, L., Wang, A.J., Yan, R., Shu, X., Tang, J.: Vision-centric token compression in large language model. arXiv preprint arXiv:2502.00791 (2025)
-
[70]
In: In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)
Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y ., Cao, Y ., He, C., Wang, J., Wu, F., et al.: Pyramid- drop: Accelerating your large vision-language models via pyramid visual redundancy reduction. In: In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)
2025
-
[71]
In: Pro- ceedings of the 25th ACM international conference on Multimedia
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y .: Video question answering via gradually refined attention over appearance and motion. In: Pro- ceedings of the 25th ACM international conference on Multimedia. pp. 1645–1653 (2017)
2017
-
[72]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Xu, J., Mei, T., Yao, T., Rui, Y .: Msr-vtt: A large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5288–5296 (2016)
2016
-
[73]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[74]
In: Proceedings of the Computer Vi- sion and Pattern Recognition Conference
Yang, C., Sui, Y ., Xiao, J., Huang, L., Gong, Y ., Li, C., Yan, J., Bai, Y ., Sadayappan, P., Hu, X., et al.: Topv: Compatible token pruning with inference time opti- mization for fast and low-memory multimodal vision language model. In: Proceedings of the Computer Vi- sion and Pattern Recognition Conference. pp. 19803– 19813 (2025)
2025
-
[75]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Yang, S., Chen, Y ., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: Visionzip: Longer is better but not nec- essary in vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19792–19802 (2025)
2025
-
[76]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Ye, W., Wu, Q., Lin, W., Zhou, Y .: Fit and prune: Fast and training-free visual token pruning for multi- modal large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 22128–22136 (2025)
2025
-
[77]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion
Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-vit: Adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion. pp. 10809–10818 (2022)
2022
-
[78]
In: ACM MM 2022 (2022)
You, F., Li, J., Chen, Z., Zhu, L.: Pixel exclu- sion: Uncertainty-aware boundary discovery for ac- tive cross-domain semantic segmentation. In: ACM MM 2022 (2022)
2022
-
[79]
In: ACM MM 2021
You, F., Li, J., Zhu, L., Chen, Z., Huang, Z.: Domain adaptive semantic segmentation without source data. In: ACM MM 2021. pp. 3293–3302 (2021)
2021
-
[80]
arXiv preprint arXiv:2601.22674 (2026)
Yu, H., Li, W., Qu, X., Wang, S., Chen, J., Zhu, J.: Visiontrim: Unified vision token compression for training-free mllm acceleration. arXiv preprint arXiv:2601.22674 (2026)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.