pith. machine review for the scientific record. sign in

arxiv: 2604.17320 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

Towards Joint Quantization and Token Pruning of Vision-Language Models

Lei Zhang, Ming-Ming Cheng, Xindong Zhang, Xin He, Xinqing Li, Yun Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language modelstoken pruningquantizationlow-bit inferenceKV cachemodel compressionvisual tokensefficiency
0
0 comments X

The pith

A unified quantization-and-pruning pipeline for vision-language models achieves higher accuracy retention than separate stage-wise methods at the same low bit-width.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that token pruning and low-bit quantization for vision-language models work better when combined into a single pipeline rather than applied one after the other. This matters because separate stages often create a mismatch between how calibration happens and how pruning runs, causing accuracy to drop more than expected under aggressive compression. The method turns low-bit calibration signals directly into pruning decisions evaluated inside the actual low-bit operators and quantized KV cache. Experiments on standard benchmarks show this joint approach keeps more performance while cutting visual tokens to 30 percent.

Core claim

The QUOTA framework converts low-bit calibration signals into a layer-wise token allocation schedule, evaluates token importance under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, and performs consistent budgeted top-k selection, achieving 95.65 percent average retention while retaining only 30 percent of visual tokens compared with about 94.3 percent retention for representative stage-wise combinations.

What carries the argument

The Quantization Unified Offline Token Allocator (QUOTA), which materializes low-bit calibration signals into a layer-wise pruning recipe evaluated under actual deployed operators.

If this is right

  • Reduces both prefill compute from long visual-token prefixes and KV cache growth during autoregressive decoding in a single deterministic pipeline.
  • Enables budgeted top-k selection that directly accounts for quantization effects instead of relying on mismatched calibration.
  • Delivers improved robustness in low-bit regimes while supporting aggressive visual-token reduction to 30 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The three-signal importance scoring could be adapted for other transformer architectures that face similar token and cache costs.
  • An online version of the allocation schedule might support dynamic pruning on varying input lengths during inference.
  • Combining the low-bit risk signal with additional compression methods such as distillation could produce further efficiency gains.

Load-bearing premise

Token importance scores computed under deployed W4A4 operators with a quantized KV cache remain reliable proxies for pruning decisions across different models, tasks, and bit-widths without introducing new failure modes.

What would settle it

If the joint method produces lower average retention than stage-wise baselines when tested on a new VLM architecture or task under the same W4A4 settings, the unification benefit would be falsified.

Figures

Figures reproduced from arXiv: 2604.17320 by Lei Zhang, Ming-Ming Cheng, Xindong Zhang, Xin He, Xinqing Li, Yun Liu.

Figure 1
Figure 1. Figure 1: Stage-wise vs. unified pipelines for low-bit calibration and visual-token pruning. accounts for most of the compute. During autoregressive decoding, the continu￾ously growing KV cache becomes a bandwidth bottleneck and inflates memory pressure [16, 43]. These issues are particularly severe on resource-constrained GPUs and edge servers, motivating compression methods that reduce both pre￾fill compute and de… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the collaborative pipeline: QUOTA derives a pruning recipe from low-bit calibration, which is executed during deployment under quantized inference. sensitivity signals into a deterministic pruning recipe that specifies the candi￾date layer set and a layer-wise keep-ratio schedule (i.e., token budgets). During deployment, the calibrated W4A4 model executes token pruning inside the quan￾tized for… view at source ↗
Figure 3
Figure 3. Figure 3: Layer-wise attention concentration and visual-token redundancy used to define the candidate layer set Lc. and keep the final blocks conservative, and then condition the per-layer budgets on low-bit calibration sensitivity to obtain a deployable keep-ratio schedule. To instantiate the candidate layer set Lc in the pruning recipe, we profile layer-wise attention concentration and visual-token redundancy on a… view at source ↗
Figure 4
Figure 4. Figure 4 [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Candidate-layer selection on Qwen2.5-VL-7B. Layers 8–12 are used as the prun￾ing candidate range. the trillion-token scale, constructed from cleaned web data, synthesized data, open-source resources, and in-house collected data. Its pre-training data cov￾ered a broad range of multimodal sources, including image captions, interleaved image–text data, OCR data, visual knowledge, multimodal academic questions… view at source ↗
read the original abstract

Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the \textbf{Q}uantization \textbf{U}nified \textbf{O}ffline \textbf{T}oken \textbf{A}llocator (\textbf{QUOTA}), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-$k$ selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65\% average retention while retaining only 30\% of visual tokens, compared with about 94.3\% retention for representative stage-wise combinations. The code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes QUOTA (Quantization Unified Offline Token Allocator), a joint quantization-and-pruning framework for vision-language models. It derives a layer-wise visual-token pruning schedule directly from low-bit (W4A4) calibration signals, evaluates token importance under deployed operators with quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk term, and performs deterministic top-k pruning. Experiments claim 95.65% average retention at 30% retained visual tokens versus approximately 94.3% for representative stage-wise quantization-plus-pruning baselines on standard VLM benchmarks.

Significance. If the reported robustness margin holds under broader validation, the work would be significant for efficient VLM deployment: it directly tackles the calibration-execution mismatch that makes naive stage-wise pipelines brittle, and the planned code release would support reproducibility. The parameter-free derivation of the pruning recipe from calibration signals is a conceptual strength.

major comments (2)
  1. [Abstract] Abstract: the central robustness claim (95.65% vs ~94.3% retention) is presented without error bars, named benchmarks, model sizes, or any ablation tables, so it is impossible to judge whether the 1.35-point margin is stable or sensitive to post-hoc choices.
  2. [Experiments] Experiments section: the claim that the composite importance score (activation magnitude + attention + low-bit risk under W4A4 with quantized KV cache) yields improved robustness rests on the untested assumption that this score remains a reliable proxy when bit-width, KV-cache precision, or VLM backbone changes; no cross-validation or sensitivity analysis is supplied to rule out new failure modes such as over-pruning of critical tokens.
minor comments (2)
  1. [Abstract] The abstract refers to 'standard VLM benchmarks' without enumeration; listing the concrete datasets and metrics would improve clarity.
  2. [Introduction] The QUOTA acronym and its expansion are introduced with bold formatting; ensure the same typographic convention is used consistently on first mention in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate to strengthen the presentation and validation of our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central robustness claim (95.65% vs ~94.3% retention) is presented without error bars, named benchmarks, model sizes, or any ablation tables, so it is impossible to judge whether the 1.35-point margin is stable or sensitive to post-hoc choices.

    Authors: We acknowledge that the abstract, due to its brevity constraints, omits error bars, specific benchmark names, model sizes, and ablation details. The Experiments section of the manuscript provides these elements, including results on standard VLM benchmarks with named datasets, model configurations, and supporting tables. To improve accessibility, we will revise the abstract to name the primary benchmarks and models while noting that detailed ablations, error bars, and robustness analysis appear in the main text. This addresses the concern without misrepresenting the high-level nature of the abstract. revision: partial

  2. Referee: [Experiments] Experiments section: the claim that the composite importance score (activation magnitude + attention + low-bit risk under W4A4 with quantized KV cache) yields improved robustness rests on the untested assumption that this score remains a reliable proxy when bit-width, KV-cache precision, or VLM backbone changes; no cross-validation or sensitivity analysis is supplied to rule out new failure modes such as over-pruning of critical tokens.

    Authors: We agree that demonstrating the generalizability of the composite importance score is important for the robustness claim. The current work validates the score specifically under the W4A4 regime with quantized KV cache for the tested VLMs, where it is designed to incorporate low-bit effects. To address this, we will add sensitivity analysis in the revised manuscript, including experiments across varied bit-widths, KV-cache precisions, and additional backbones, with explicit checks for over-pruning of critical tokens. This will provide the requested cross-validation. revision: yes

Circularity Check

0 steps flagged

No circularity: pruning schedule derived from calibration signals, evaluated externally on benchmarks

full rationale

The paper derives the QUOTA token allocation from low-bit calibration signals (activation magnitude + attention + explicit low-bit risk under W4A4 with quantized KV cache) and materializes it as a deterministic pruning recipe. The reported 95.65% retention at 30% tokens is an empirical measurement against stage-wise baselines on standard VLM benchmarks, not a quantity fitted to or defined by the final accuracy metric. No equation or step reduces the central claim to its own inputs by construction, and no load-bearing self-citation or ansatz is invoked in the abstract or described method. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review performed on abstract only; full paper text was not accessible, so free parameters, axioms, and invented entities cannot be exhaustively enumerated from the provided information.

invented entities (1)
  • QUOTA allocator no independent evidence
    purpose: Converts low-bit calibration signals into layer-wise token pruning schedule
    Framework introduced by the paper to unify quantization and pruning

pith-pipeline@v0.9.0 · 5543 in / 1426 out tokens · 38511 ms · 2026-05-10T06:57:15.700550+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    In: CVPR

    Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: DivPrune: Diversity-based visual token pruning for large multimodal models. In: CVPR. pp. 9392–9401 (2025) 9, 13, 23 Joint Quantization and Token Pruning for VLMs 25

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025) 21

  3. [3]

    arXiv preprint arXiv:2509.23729 (2025) 13, 23

    Bhatnagar, S., Xu, A., Tan, K.H., Ahuja, N.: Luq: Layerwise ultra-low bit quan- tization for multimodal large language models. arXiv preprint arXiv:2509.23729 (2025) 13, 23

  4. [4]

    In: ICLR

    Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token Merging: Your Vit but faster. In: ICLR. pp. 1–12 (2023) 2, 3

  5. [5]

    In: EMNLP

    Bondarenko, Y., Nagel, M., Blankevoort, T.: Understanding and overcoming the challenges of efficient Transformer quantization. In: EMNLP. pp. 7947–7969 (2021) 4

  6. [6]

    In: ECCV (2024) 9, 13, 23

    Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play inference acceleration for large Vision-Language Models. In: ECCV (2024) 9, 13, 23

  7. [7]

    Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

    Chen, Y., Habibian, A., Benini, L., Li, Y.: Gated relational alignment via confidence-based distillation for efficient vlms. arXiv preprint arXiv:2601.22709 (2026) 13

  8. [8]

    SQAP-VLA: Synergistic quantization-aware pruning for VLAs

    Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: SQAP-VLA: A synergistic quantization-aware pruning framework for high-performance vision-language- action models. arXiv preprint arXiv:2509.09090 (2025) 4

  9. [9]

    In: ICLR

    Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. In: ICLR. pp. 1–12 (2023) 2, 3, 13, 23

  10. [10]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023) 8, 21

  11. [11]

    In: AAAI

    Gong, Z., Liu, J., Wang, J., Cai, X., Zhao, D., Yan, R.: What makes quantization for large language model hard? An empirical study from the lens of perturbation. In: AAAI. pp. 18082–18089 (2024) 3, 4

  12. [12]

    In: ICML

    Guo, J., Wu, J., Wang, Z., Liu, J., Yang, G., Ding, Y., Gong, R., Qin, H., Liu, X.: Compressing large language models by joint sparsification and quantization. In: ICML. pp. 16945–16957 (2024) 4

  13. [13]

    In: ICLR

    Han, S., Mao, H., Dally, W.J.: Deep Compression: Compressing deep neural net- works with pruning, trained quantization and Huffman coding. In: ICLR. pp. 1–13 (2015) 2, 4

  14. [14]

    In: NeurIPS

    Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M.W., Shao, Y.S., Keutzer, K., Gholami, A.: KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In: NeurIPS. pp. 1270–1303 (2024) 2, 3

  15. [15]

    In: ECCV

    Huang, K., Zou, H., Xi, Y., Wang, B., Xie, Z., Yu, L.: IVTP: Instruction-guided visual token pruning for large vision-language models. In: ECCV. pp. 214–230 (2024) 13

  16. [16]

    In: ICLR

    Huang, W., Zhai, Z., Shen, Y., Cao, S., Zhao, F., Xu, X., Ye, Z., Hu, Y., Lin, S.: Dynamic-LLaVA: Efficient multimodal large language models via dynamic vision- language context sparsification. In: ICLR. pp. 1–15 (2025) 2

  17. [17]

    In: CVPR

    Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR. pp. 6700–6709 (2019) 8, 21

  18. [18]

    In: CVPR

    Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko,D.:Quantizationandtrainingofneuralnetworksforefficientinteger- arithmetic-only inference. In: CVPR. pp. 2704–2713 (2018) 3

  19. [19]

    In: ICLR

    Kim, M., Choi, J., Yang, H., Kim, J., Song, J., Kang, U.: Prune-then-Quantize or Quantize-then-Prune? Understanding the impact of compression order in joint model compression. In: ICLR. pp. 1–14 (2026) 2 26 X. Li et al

  20. [20]

    In: Proceedings of the 29th symposium on operating systems prin- ciples

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the 29th symposium on operating systems prin- ciples. pp. 611–626 (2023) 3

  21. [21]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: Bench- marking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125 (2023) 8, 21

  22. [22]

    Li, K., Chen, X., Gao, C., Li, Y., Chen, X.: Balanced Token Pruning: Accelerating visionlanguagemodelsbeyondlocaloptimization.arXivpreprintarXiv:2505.22038 (2025) 4, 5

  23. [23]

    In: EMNLP

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP. pp. 292–305 (2023) 8, 21

  24. [24]

    Proceedings of machine learning and systems 6, 87–100 (2024) 13, 23

    Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.M., Wang, W.C., Xiao, G., Dang, X., Gan, C., Han, S.: AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of machine learning and systems 6, 87–100 (2024) 13, 23

  25. [25]

    In: CVPR

    Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: VILA: On pre- training for visual language models. In: CVPR. pp. 26689–26699 (2024) 1

  26. [26]

    In: AAAI

    Lin, Z., Lin, M., Lin, L., Ji, R.: Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In: AAAI. pp. 5334–5342 (2025) 13, 23

  27. [27]

    In: CVPR

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. pp. 26296–26306 (2024) 9, 22

  28. [28]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? In: ECCV. pp. 216–233 (2024) 8, 21

  29. [29]

    In: NeurIPS

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to Explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS. pp. 2507–2521 (2022) 8, 21

  30. [30]

    In: Findings of the association for computational linguistics: ACL 2022

    Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022) 22

  31. [31]

    In: Proceedings of the IEEE/CVF winter conference on applications of computer vision

    Mathew,M.,Karatzas,D.,Jawahar,C.:DocVQA:AdatasetforVQAondocument images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021) 22

  32. [32]

    In: ICLR

    Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models. In: ICLR. pp. 1–10 (2017) 22

  33. [33]

    In: ICML

    Mozaffari, M., Yazdanbakhsh, A., Dehnavi, M.M.: SLiM: One-shot quantization and sparsity with low-rank approximation for LLM weight compression. In: ICML. pp. 45024–45049 (2025) 4

  34. [34]

    In: CVPR

    Qu, X., Aponte, D., Banbury, C., Robinson, D.P., Ding, T., Koishida, K., Zharkov, I., Chen, T.: Automatic joint structured pruning and quantization for efficient neural network training and compression. In: CVPR. pp. 15234–15244 (2025) 4

  35. [35]

    In: NeurIPS

    Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: DynamicViT: Efficient vision Transformers with dynamic token sparsification. In: NeurIPS. pp. 13937– 13949 (2021) 2, 5

  36. [36]

    In: ICCV

    Shang, Y., Cai, M., Xu, B., Lee, Y.J., Yan, Y.: LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In: ICCV. pp. 22857–22867 (2025) 1, 2, 3, 4 Joint Quantization and Token Pruning for VLMs 27

  37. [37]

    In: CVPR

    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: CVPR. pp. 8317–8326 (2019) 22

  38. [38]

    In: ICML

    Sun, Y., Liu, R., Bai, H., Bao, H., Zhao, K., Li, Y., Hu, J., Yu, X., Hou, L., Yuan, C., et al.: FlatQuant: Flatness matters for LLM quantization. In: ICML. pp. 57587–57613 (2025) 9

  39. [39]

    Wen, Z., Gao, Y., Li, W., He, C., Zhang, L.: Token Pruning in Multimodal Large LanguageModels:Arewesolvingtherightproblem?In:FindingsoftheAssociation for Computational Linguistics: ACL 2025. pp. 15537–15549 (2025) 9

  40. [40]

    arXiv preprint arXiv:2602.15580 (2026) 5

    Wu, H., Zhang, Y., Zhou, X.: How Vision Becomes Language: A layer- wise information-theoretic analysis of multimodal reasoning. arXiv preprint arXiv:2602.15580 (2026) 5

  41. [41]

    In: ICML

    Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., Han, S.: SmoothQuant: Accu- rate and efficient post-training quantization for large language models. In: ICML. pp. 38087–38099 (2023) 2, 3

  42. [42]

    Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction.arXiv preprint arXiv:2410.17247, 2024

    Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., et al.: PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247 (2024) 13, 23

  43. [43]

    In: CVPR

    Yang, C., Sui, Y., Xiao, J., Huang, L., Gong, Y., Li, C., Yan, J., Bai, Y., Sadayap- pan, P., Hu, X., et al.: TopV: Compatible token pruning with inference time opti- mization for fast and low-memory multimodal vision language model. In: CVPR. pp. 19803–19813 (2025) 2, 3, 4, 5, 9

  44. [44]

    In: AAAI

    Ye, W., Wu, Q., Lin, W., Zhou, Y.: Fit and Prune: Fast and training-free visual token pruning for multi-modal large language models. In: AAAI. pp. 22128–22136 (2025) 3

  45. [45]

    Zeng, C., Liu, S., Yang, S., Chen, F., Mei, X., Fu, L.: GQSA: Group quantization and sparsity for accelerating large language model inference. In: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. pp. 149–165 (2025) 4

  46. [46]

    In: Findings of the Association for Computational Linguistics: NAACL 2025

    Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J.A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., et al.: LMMs-Eval: Reality check on the evaluation of large mul- timodal models. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 881–916 (2025) 21

  47. [47]

    In: ICCV

    Zhang, Q., Cheng, A., Lu, M., Zhang, R., Zhuo, Z., Cao, J., Guo, S., She, Q., Zhang, S.: Beyond Text-Visual Attention: Exploiting visual cues for effective token pruning in VLMs. In: ICCV. pp. 20857–20867 (2025) 3, 9

  48. [48]

    In: ICML

    Zhang, Y., Fan, C.K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D.A., Okuno, T., Nakata, Y., Keutzer, K., et al.: SparseVLM: Visual token sparsification for efficient vision-language model inference. In: ICML. pp. 74840–74857 (2025) 4