Towards Joint Quantization and Token Pruning of Vision-Language Models
Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3
The pith
A unified quantization-and-pruning pipeline for vision-language models achieves higher accuracy retention than separate stage-wise methods at the same low bit-width.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The QUOTA framework converts low-bit calibration signals into a layer-wise token allocation schedule, then scores token importance under the deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal. Consistent budgeted top-k selection under this schedule achieves 95.65 percent average retention while keeping only 30 percent of visual tokens, versus about 94.3 percent for representative stage-wise combinations.
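The abstract names the three importance signals but not how they are combined. Below is a minimal sketch of one plausible scoring rule plus the budgeted top-k step; the weighted sum, the weights, and the sign given to the low-bit risk term are illustrative assumptions, not the paper's formula.

```python
import torch

def token_importance(hidden, attn, quant_error, w=(1.0, 1.0, 1.0)):
    """Score visual tokens from a low-bit forward pass.

    hidden:      (tokens, d) activations under the quantized operators
    attn:        (tokens,)   attention mass each visual token receives
    quant_error: (tokens, d) low-bit output minus a full-precision
                 reference, standing in for the paper's low-bit risk signal
    """
    magnitude = hidden.norm(dim=-1)   # activation-magnitude cue
    risk = quant_error.norm(dim=-1)   # how much W4A4 distorts the token
    # The sign of the risk term is an assumption: here, heavily distorted
    # tokens are down-weighted so pruning decisions stay stable under W4A4.
    return w[0] * magnitude + w[1] * attn - w[2] * risk

def budgeted_topk(scores, keep_ratio=0.3):
    """Deterministic budgeted top-k: keep a fixed fraction of tokens."""
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices.sort().values  # keep original order
```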
What carries the argument
The Quantization Unified Offline Token Allocator (QUOTA), which materializes low-bit calibration signals into a layer-wise pruning recipe evaluated under actual deployed operators.
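How calibration signals become a "recipe" is only sketched in the abstract; one reading is an offline pass that turns a per-layer sensitivity statistic into frozen per-layer keep ratios, replayed unchanged at inference. Everything below (the sensitivity proxy passed in as `sensitivity_fn`, the proportional budget rule) is a hypothetical reconstruction, not QUOTA's published algorithm.

```python
import torch

@torch.no_grad()
def calibrate_layer_budgets(layers, calib_batches, sensitivity_fn, total_keep=0.3):
    """Offline pass: turn a low-bit sensitivity proxy into per-layer keep ratios.

    sensitivity_fn(layer, batches) -> float is any monotone proxy, e.g. the
    mean perturbation of a layer's output under W4A4 on calibration data.
    """
    sens = torch.tensor([sensitivity_fn(l, calib_batches) for l in layers])
    # Give quantization-sensitive layers a larger share of the token budget,
    # keeping the global average near `total_keep`.
    budgets = total_keep * sens / sens.mean()
    return budgets.clamp(min=0.05, max=1.0).tolist()  # frozen pruning recipe
```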
If this is right
- Reduces both prefill compute from long visual-token prefixes and KV cache growth during autoregressive decoding in a single deterministic pipeline.
- Enables budgeted top-k selection that directly accounts for quantization effects instead of relying on mismatched calibration.
- Delivers improved robustness in low-bit regimes while supporting aggressive visual-token reduction to 30 percent.
Where Pith is reading between the lines
- The three-signal importance scoring could be adapted for other transformer architectures that face similar token and cache costs.
- An online version of the allocation schedule might support dynamic pruning on varying input lengths during inference.
- Combining the low-bit risk signal with additional compression methods such as distillation could produce further efficiency gains.
Load-bearing premise
Token importance scores computed under deployed W4A4 operators with a quantized KV cache remain reliable proxies for pruning decisions across different models, tasks, and bit-widths without introducing new failure modes.
What would settle it
If the joint method produces lower average retention than stage-wise baselines when tested on a new VLM architecture or task under the same W4A4 settings, the unification benefit would be falsified.
Original abstract
Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-k selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
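For intuition about the W4A4 / quantized-KV-cache setting the abstract describes, here is a generic per-token symmetric 4-bit fake-quantization round trip. It shows where a per-token low-bit error signal can come from; it is a textbook int4 simulation, not the paper's kernels, and the tensor shapes are illustrative.

```python
import torch

def fake_quant_int4(x, dim=-1):
    """Symmetric per-token 4-bit quantize-dequantize along `dim`."""
    qmax = 7  # symmetric int4 grid: integers in [-7, 7]
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

# e.g. keys of a quantized KV cache: (batch, heads, tokens, head_dim)
keys = torch.randn(1, 8, 576, 64)
kv_error = (fake_quant_int4(keys) - keys).norm(dim=-1)  # per-token int4 error
```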
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes QUOTA (Quantization Unified Offline Token Allocator), a joint quantization-and-pruning framework for vision-language models. It derives a layer-wise visual-token pruning schedule directly from low-bit (W4A4) calibration signals, evaluates token importance under deployed operators with quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk term, and performs deterministic top-k pruning. Experiments claim 95.65% average retention at 30% retained visual tokens versus approximately 94.3% for representative stage-wise quantization-plus-pruning baselines on standard VLM benchmarks.
Significance. If the reported robustness margin holds under broader validation, the work would be significant for efficient VLM deployment: it directly tackles the calibration-execution mismatch that makes naive stage-wise pipelines brittle, and the planned code release would support reproducibility. The parameter-free derivation of the pruning recipe from calibration signals is a conceptual strength.
major comments (2)
- [Abstract] The central robustness claim (95.65% vs. ~94.3% retention) is presented without error bars, named benchmarks, model sizes, or ablation tables, so it is impossible to judge whether the 1.35-point margin is stable or sensitive to post-hoc choices.
- [Experiments] The claim that the composite importance score (activation magnitude + attention + low-bit risk under W4A4 with a quantized KV cache) yields improved robustness rests on the untested assumption that the score remains a reliable proxy when the bit-width, KV-cache precision, or VLM backbone changes. No cross-validation or sensitivity analysis is supplied to rule out new failure modes such as over-pruning of critical tokens.
minor comments (2)
- [Abstract] The abstract refers to 'standard VLM benchmarks' without enumeration; listing the concrete datasets and metrics would improve clarity.
- [Introduction] The QUOTA acronym and its expansion are introduced with bold formatting; ensure the same typographic convention is used consistently on first mention in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, indicating planned revisions where appropriate to strengthen the presentation and validation of our claims.
Point-by-point responses
- Referee: [Abstract] The central robustness claim (95.65% vs. ~94.3% retention) is presented without error bars, named benchmarks, model sizes, or ablation tables, so it is impossible to judge whether the 1.35-point margin is stable or sensitive to post-hoc choices.
  Authors: We acknowledge that the abstract, given its brevity constraints, omits error bars, specific benchmark names, model sizes, and ablation details. The Experiments section provides these elements, including results on standard VLM benchmarks with named datasets, model configurations, and supporting tables. To improve accessibility, we will revise the abstract to name the primary benchmarks and models while noting that detailed ablations, error bars, and robustness analysis appear in the main text. This addresses the concern without misrepresenting the high-level nature of the abstract. (revision: partial)
- Referee: [Experiments] The claim that the composite importance score (activation magnitude + attention + low-bit risk under W4A4 with a quantized KV cache) yields improved robustness rests on the untested assumption that the score remains a reliable proxy when the bit-width, KV-cache precision, or VLM backbone changes; no cross-validation or sensitivity analysis is supplied to rule out new failure modes such as over-pruning of critical tokens.
  Authors: We agree that demonstrating the generalizability of the composite importance score is important for the robustness claim. The current work validates the score specifically under the W4A4 regime with a quantized KV cache for the tested VLMs, where it is designed to incorporate low-bit effects. In the revised manuscript we will add a sensitivity analysis spanning varied bit-widths, KV-cache precisions, and additional backbones, with explicit checks for over-pruning of critical tokens (a sketch of such a sweep appears after this list). This will provide the requested cross-validation. (revision: yes)
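The promised sensitivity analysis could be as simple as the harness below. `build_pipeline` and `evaluate` are placeholder names for whatever the released code eventually exposes, passed in as callables; this is a hypothetical sketch of the experiment's shape, not an actual API.

```python
def sensitivity_sweep(build_pipeline, evaluate, backbones, benchmarks, keep_ratio=0.3):
    """Sweep precisions and backbones; configurations where retention collapses
    are a cheap proxy for over-pruning of critical tokens."""
    results = {}
    for backbone in backbones:
        for w_bits, a_bits, kv_bits in [(4, 4, 4), (4, 8, 8), (8, 8, 8)]:
            pipe = build_pipeline(backbone, w_bits, a_bits, kv_bits, keep_ratio)
            results[(backbone, w_bits, a_bits, kv_bits)] = evaluate(pipe, benchmarks)
    return results
```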
Circularity Check
No circularity: pruning schedule derived from calibration signals, evaluated externally on benchmarks
Full rationale
The paper derives the QUOTA token allocation from low-bit calibration signals (activation magnitude + attention + explicit low-bit risk under W4A4 with quantized KV cache) and materializes it as a deterministic pruning recipe. The reported 95.65% retention at 30% tokens is an empirical measurement against stage-wise baselines on standard VLM benchmarks, not a quantity fitted to or defined by the final accuracy metric. No equation or step reduces the central claim to its own inputs by construction, and no load-bearing self-citation or ansatz is invoked in the abstract or described method. The derivation chain remains self-contained against external benchmarks.
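The rationale above treats "average retention" as an externally measured quantity. Assuming the usual definition (each benchmark's compressed-model score divided by its full-precision baseline, averaged across benchmarks; the abstract does not spell this out), the headline number reduces to:

```python
def average_retention(compressed_scores, baseline_scores):
    """Mean per-benchmark ratio of compressed to full-precision score, in %."""
    ratios = [c / b for c, b in zip(compressed_scores, baseline_scores)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical two-benchmark example:
# average_retention([60.1, 78.4], [62.0, 80.5])  ->  about 97.2
```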
Axiom & Free-Parameter Ledger
invented entities (1)
- QUOTA allocator: no independent evidence
Reference graph
Works this paper leans on
- [1] Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: DivPrune: Diversity-based visual token pruning for large multimodal models. In: CVPR, pp. 9392–9401 (2025)
- [2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
- [3] Bhatnagar, S., Xu, A., Tan, K.H., Ahuja, N.: LUQ: Layerwise ultra-low bit quantization for multimodal large language models. arXiv preprint arXiv:2509.23729 (2025)
- [4] Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token Merging: Your ViT but faster. In: ICLR, pp. 1–12 (2023)
- [5] Bondarenko, Y., Nagel, M., Blankevoort, T.: Understanding and overcoming the challenges of efficient Transformer quantization. In: EMNLP, pp. 7947–7969 (2021)
- [6] Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-play inference acceleration for large vision-language models. In: ECCV (2024)
- [7] Chen, Y., Habibian, A., Benini, L., Li, Y.: Gated relational alignment via confidence-based distillation for efficient VLMs. arXiv preprint arXiv:2601.22709 (2026)
- [8] Fang, H., Liu, Y., Du, Y., Du, L., Yang, H.: SQAP-VLA: A synergistic quantization-aware pruning framework for high-performance vision-language-action models. arXiv preprint arXiv:2509.09090 (2025)
- [9] Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. In: ICLR, pp. 1–12 (2023)
- [10] Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
- [11] Gong, Z., Liu, J., Wang, J., Cai, X., Zhao, D., Yan, R.: What makes quantization for large language model hard? An empirical study from the lens of perturbation. In: AAAI, pp. 18082–18089 (2024)
- [12] Guo, J., Wu, J., Wang, Z., Liu, J., Yang, G., Ding, Y., Gong, R., Qin, H., Liu, X.: Compressing large language models by joint sparsification and quantization. In: ICML, pp. 16945–16957 (2024)
- [13] Han, S., Mao, H., Dally, W.J.: Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR, pp. 1–13 (2015)
- [14] Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M.W., Shao, Y.S., Keutzer, K., Gholami, A.: KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In: NeurIPS, pp. 1270–1303 (2024)
- [15] Huang, K., Zou, H., Xi, Y., Wang, B., Xie, Z., Yu, L.: IVTP: Instruction-guided visual token pruning for large vision-language models. In: ECCV, pp. 214–230 (2024)
- [16] Huang, W., Zhai, Z., Shen, Y., Cao, S., Zhao, F., Xu, X., Ye, Z., Hu, Y., Lin, S.: Dynamic-LLaVA: Efficient multimodal large language models via dynamic vision-language context sparsification. In: ICLR, pp. 1–15 (2025)
- [17] Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR, pp. 6700–6709 (2019)
- [18] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: CVPR, pp. 2704–2713 (2018)
- [19] Kim, M., Choi, J., Yang, H., Kim, J., Song, J., Kang, U.: Prune-then-Quantize or Quantize-then-Prune? Understanding the impact of compression order in joint model compression. In: ICLR, pp. 1–14 (2026)
- [20] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626 (2023)
- [21] Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)
- [22]
- [23] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: EMNLP, pp. 292–305 (2023)
- [24] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.M., Wang, W.C., Xiao, G., Dang, X., Gan, C., Han, S.: AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems 6, 87–100 (2024)
- [25] Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: VILA: On pre-training for visual language models. In: CVPR, pp. 26689–26699 (2024)
- [26] Lin, Z., Lin, M., Lin, L., Ji, R.: Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In: AAAI, pp. 5334–5342 (2025)
- [27] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR, pp. 26296–26306 (2024)
- [28] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? In: ECCV, pp. 216–233 (2024)
- [29] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to Explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS, pp. 2507–2521 (2022)
- [30] Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279 (2022)
- [31] Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2200–2209 (2021)
- [32] Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models. In: ICLR, pp. 1–10 (2017)
- [33] Mozaffari, M., Yazdanbakhsh, A., Dehnavi, M.M.: SLiM: One-shot quantization and sparsity with low-rank approximation for LLM weight compression. In: ICML, pp. 45024–45049 (2025)
- [34] Qu, X., Aponte, D., Banbury, C., Robinson, D.P., Ding, T., Koishida, K., Zharkov, I., Chen, T.: Automatic joint structured pruning and quantization for efficient neural network training and compression. In: CVPR, pp. 15234–15244 (2025)
- [35] Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: DynamicViT: Efficient vision Transformers with dynamic token sparsification. In: NeurIPS, pp. 13937–13949 (2021)
- [36] Shang, Y., Cai, M., Xu, B., Lee, Y.J., Yan, Y.: LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In: ICCV, pp. 22857–22867 (2025)
- [37] Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: CVPR, pp. 8317–8326 (2019)
- [38] Sun, Y., Liu, R., Bai, H., Bao, H., Zhao, K., Li, Y., Hu, J., Yu, X., Hou, L., Yuan, C., et al.: FlatQuant: Flatness matters for LLM quantization. In: ICML, pp. 57587–57613 (2025)
- [39] Wen, Z., Gao, Y., Li, W., He, C., Zhang, L.: Token Pruning in Multimodal Large Language Models: Are we solving the right problem? In: Findings of the Association for Computational Linguistics: ACL 2025, pp. 15537–15549 (2025)
- [40] Wu, H., Zhang, Y., Zhou, X.: How Vision Becomes Language: A layer-wise information-theoretic analysis of multimodal reasoning. arXiv preprint arXiv:2602.15580 (2026)
- [41] Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., Han, S.: SmoothQuant: Accurate and efficient post-training quantization for large language models. In: ICML, pp. 38087–38099 (2023)
- [42] Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., et al.: PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247 (2024)
- [43] Yang, C., Sui, Y., Xiao, J., Huang, L., Gong, Y., Li, C., Yan, J., Bai, Y., Sadayappan, P., Hu, X., et al.: TopV: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In: CVPR, pp. 19803–19813 (2025)
- [44] Ye, W., Wu, Q., Lin, W., Zhou, Y.: Fit and Prune: Fast and training-free visual token pruning for multi-modal large language models. In: AAAI, pp. 22128–22136 (2025)
- [45] Zeng, C., Liu, S., Yang, S., Chen, F., Mei, X., Fu, L.: GQSA: Group quantization and sparsity for accelerating large language model inference. In: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 149–165 (2025)
- [46] Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J.A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., et al.: LMMs-Eval: Reality check on the evaluation of large multimodal models. In: Findings of the Association for Computational Linguistics: NAACL 2025, pp. 881–916 (2025)
- [47] Zhang, Q., Cheng, A., Lu, M., Zhang, R., Zhuo, Z., Cao, J., Guo, S., She, Q., Zhang, S.: Beyond Text-Visual Attention: Exploiting visual cues for effective token pruning in VLMs. In: ICCV, pp. 20857–20867 (2025)
- [48] Zhang, Y., Fan, C.K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D.A., Okuno, T., Nakata, Y., Keutzer, K., et al.: SparseVLM: Visual token sparsification for efficient vision-language model inference. In: ICML, pp. 74840–74857 (2025)