arxiv: 2604.12035 · v1 · submitted 2026-04-13 · 💻 cs.CV

Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs

Pith reviewed 2026-05-10 15:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual token pruning · model calibration · multimodal large language models · expected calibration error · POPE · ScienceQA · SCOPE pruning · saliency weight

The pith

Visual token pruning can lower calibration error in multimodal models while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether pruning visual tokens in multimodal large language models improves or harms calibration, meaning how well predicted confidence matches actual correctness. It tests multiple pruning methods including SCOPE variants, saliency-only, FastV, and random pruning on LLaVA-1.5-7B across POPE and ScienceQA-IMG, tracking Expected Calibration Error, Brier score, and AURC at varying token budgets. Results show that a pure-coverage SCOPE setting reduces ECE below the full unpruned model on POPE with comparable accuracy, and lowering the saliency weight consistently improves calibration across budgets. On ScienceQA-IMG pruning also cuts ECE while accuracy holds steady or rises slightly. The work argues that token pruning must be judged on confidence quality in addition to efficiency and accuracy.
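Since Expected Calibration Error is the headline metric here, a minimal sketch of how it is commonly computed may help. This uses equal-width confidence bins; the bin count of 10 and the function name are assumptions for illustration, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    n_bins=10 is a common convention and an assumption here, not the paper's stated setting.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Each bin stores [count, sum of confidences, number of correct predictions].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / n) * gap
    return ece
```

A model that answers with confidence 0.9 but is right only 75% of the time has a gap of 0.15 in that bin; a lower ECE means confidence tracks correctness more closely.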

Core claim

On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep shows reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration and real FastV causes severe performance degradation. On ScienceQA-IMG, pruning also reduces ECE with accuracy remaining stable or slightly improving. The default gap power exponent in coverage-based selection is not always optimal.

What carries the argument

SCOPE pruning with adjustable saliency weight alpha, evaluated by Expected Calibration Error (ECE) on visual question-answering tasks.
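To make the role of the saliency weight concrete, here is a toy greedy selector. It is a sketch under stated assumptions, not SCOPE's published scoring rule: the score `coverage_gain * saliency**alpha`, the cosine-similarity coverage measure, and every function name are hypothetical. The only property taken from the source is that α=0 reduces to pure coverage. The gap power exponent is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def select_tokens(features, saliency, k, alpha):
    """Greedily pick k token indices.

    Hypothetical score = coverage_gain * saliency**alpha (an assumption for
    illustration). With alpha = 0 the saliency factor is 1 for every token,
    so selection is driven by coverage alone.
    """
    n = len(features)
    k = min(k, n)
    # Pairwise similarities, clipped at zero so gains are non-negative.
    sim = [[max(cosine(features[i], features[j]), 0.0) for j in range(n)]
           for i in range(n)]
    covered = [0.0] * n  # best similarity of each token to the selected set so far
    selected = []
    for _ in range(k):
        best_idx, best_score = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            # Marginal coverage gain: how much better token c covers each token.
            gain = sum(max(sim[c][j] - covered[j], 0.0) for j in range(n))
            score = gain * (saliency[c] ** alpha)
            if score > best_score:
                best_idx, best_score = c, score
        selected.append(best_idx)
        covered = [max(covered[j], sim[best_idx][j]) for j in range(n)]
    return selected

# Toy inputs (made up): three 2-D token features and their saliency scores.
toy_features = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_saliency = [1.0, 0.9, 0.1]
pure_coverage = select_tokens(toy_features, toy_saliency, k=2, alpha=0.0)
hybrid = select_tokens(toy_features, toy_saliency, k=2, alpha=1.0)
```

In this toy case the pure-coverage run keeps the low-saliency token that covers an otherwise unrepresented direction, while the hybrid run prefers a more salient but more redundant token.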

If this is right

  • Certain pruning choices can improve reliability of confidence estimates without accuracy loss.
  • Saliency weight acts as a tunable knob: lowering it improves calibration while accuracy changes only slightly.
  • Saliency-only and FastV-style pruning degrade calibration more than coverage-based alternatives.
  • The gap power exponent in coverage selection should be tuned rather than left at default.
  • Multimodal systems needing trustworthy decisions must track calibration metrics when adopting pruning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These patterns suggest pruning could be used deliberately to produce more trustworthy confidence scores in deployed multimodal agents.
  • Combining coverage-based pruning with post-hoc calibration methods might yield additive gains not tested here.
  • The results raise the question of whether similar calibration benefits appear in vision-only or language-only pruning regimes.
  • If the trend holds, evaluation benchmarks for efficient MLLMs should routinely include ECE alongside accuracy.

Load-bearing premise

The calibration trends observed with these specific datasets, model, token budgets, and pruning implementations will generalize beyond the tested conditions.

What would settle it

Repeating the experiments on a different MLLM or dataset such as LLaVA-Next or VQA v2 and observing that all tested pruning methods increase ECE relative to the unpruned baseline.

Figures

Figures reproduced from arXiv: 2604.12035 by Kaizhen Tan.

Figure 1
Figure 1: Study design. An image-question pair is encoded into V=576 visual tokens. We compare five token selection strategies: SCOPE with α=1 (default hybrid), SCOPE with α=0.5, SCOPE with α=0 (pure coverage), saliency-only pruning based on CLS attention, and random pruning. The selected K tokens are then passed to the LLM. We extract first-token class probabilities and evaluate calibration with ECE, Brier score, …
Figure 2
Figure 2: (a) ECE heatmap across α and K on POPE. The lowest ECE at each tested token budget is achieved at α=0, and the best overall setting in this sweep is (α=0, K=128). (b) ECE decreases as the saliency weight is reduced for all three token budgets. For reference, we also show the random baseline, the saliency-only baseline at K=128, and the full unpruned model.
Figure 3
Figure 3: Main results on POPE. (a) ECE under default SCOPE across token budgets, with accuracy shown for reference. The best calibration is observed at K=128. (b) Comparison between SCOPE and random pruning at two token budgets. (c,d) Reliability diagrams for the full model (K=576) and SCOPE pruning at K=128. The high-confidence region shows a smaller confidence-accuracy gap after pruning.
Figure 4
Figure 4: Change in MME score relative to the default SCOPE
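The Figure 1 caption mentions extracting first-token class probabilities. One plausible reading, sketched below, is a softmax over the logits of the candidate answer tokens at the first generated position (for example "yes"/"no" on POPE). Restricting and renormalizing over the candidates is an assumption here, as are the function names and the example logits.

```python
import math

def class_probabilities(first_token_logits):
    """Softmax over candidate-answer logits of the first generated token.

    first_token_logits maps an answer label to its logit. Renormalizing over
    only the candidate labels is an assumption, not a confirmed detail.
    """
    m = max(first_token_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - m) for label, v in first_token_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def prediction_and_confidence(first_token_logits):
    """Return the most probable label and its probability (the confidence)."""
    probs = class_probabilities(first_token_logits)
    label = max(probs, key=probs.get)
    return label, probs[label]

# Made-up logits for a yes/no question.
label, confidence = prediction_and_confidence({"yes": 2.0, "no": 0.0})
```

The resulting confidence, paired with whether the label was correct, is the input that calibration metrics such as ECE consume.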
Original abstract

Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an empirical study examining how visual token pruning strategies affect calibration (via ECE, Brier score, and AURC) in multimodal LLMs, using LLaVA-1.5-7B on POPE and ScienceQA-IMG. It compares SCOPE (with varying saliency weights alpha), saliency-only pruning, FastV, and random pruning across token budgets. Central observations are that pure-coverage SCOPE on POPE yields substantially lower ECE than the unpruned baseline at comparable accuracy, that lowering alpha consistently improves calibration, and that pruning reduces ECE on ScienceQA-IMG with stable or slightly improved accuracy; the work concludes that pruning should be assessed on confidence quality in addition to accuracy.

Significance. If the directional trends hold under broader conditions, the work is significant for shifting evaluation of efficiency methods in MLLMs beyond accuracy to include calibration, which matters for reliable decision-making. It supplies concrete, controlled comparisons across multiple pruning variants and budgets on two datasets, and the internal alpha-sweep provides a falsifiable pattern that can guide follow-up work.

major comments (2)
  1. [Abstract and Results] The headline claim that 'pruning does not simply trade reliability for efficiency' and the recommendation to evaluate pruning on confidence quality rest on experiments with only LLaVA-1.5-7B; without cross-model validation on other MLLM architectures the observed ECE reductions could be architecture-specific rather than general.
  2. [Experiments/Results] No error bars, standard deviations across seeds, or statistical significance tests are reported for the ECE and accuracy differences (e.g., the 'substantially lower ECE' for pure-coverage SCOPE on POPE); this weakens the ability to judge whether the calibration gains are robust or could be due to run-to-run variation.
minor comments (1)
  1. [Abstract] The phrase 'gap power exponent in coverage-based selection' is introduced without definition or citation; a brief parenthetical explanation or reference would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study. The comments correctly identify limitations in model scope and statistical reporting. We address each major comment below and will revise the manuscript to strengthen the presentation without overstating generality.

Point-by-point responses
  1. Referee: [Abstract and Results] The headline claim that 'pruning does not simply trade reliability for efficiency' and the recommendation to evaluate pruning on confidence quality rest on experiments with only LLaVA-1.5-7B; without cross-model validation on other MLLM architectures the observed ECE reductions could be architecture-specific rather than general.

    Authors: We agree that the experiments are restricted to LLaVA-1.5-7B and that the ECE reductions may not generalize to other MLLM architectures. The manuscript does not claim universality; however, the abstract and conclusion use phrasing that could be read as broader. In the revised version we will explicitly qualify all headline statements (abstract, introduction, and conclusion) to note that results are for LLaVA-1.5-7B, add a limitations paragraph discussing the single-model scope, and explicitly recommend cross-model validation in future work. This preserves the value of the controlled comparisons while removing any implication of generality. revision: yes

  2. Referee: [Experiments/Results] No error bars, standard deviations across seeds, or statistical significance tests are reported for the ECE and accuracy differences (e.g., the 'substantially lower ECE' for pure-coverage SCOPE on POPE); this weakens the ability to judge whether the calibration gains are robust or could be due to run-to-run variation.

    Authors: We concur that the absence of variability measures limits assessment of robustness. The original runs used fixed seeds for reproducibility. For the revision we will re-execute the key POPE configurations (unpruned baseline, pure-coverage SCOPE at multiple budgets, and the alpha sweep) with at least three independent random seeds, report mean ECE and accuracy together with standard deviations, and add a brief statistical comparison (e.g., noting whether differences exceed one standard deviation). This will directly address concerns about run-to-run variation. revision: yes
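The comparison promised in the second response can be sketched in a few lines. This is an illustration only: the per-seed ECE values are made-up placeholders rather than results from the paper, the function names are hypothetical, and comparing the mean gap against the larger of the two standard deviations is one possible reading of "exceed one standard deviation".

```python
import statistics

def summarize(values):
    """Return (mean, sample standard deviation) of per-seed metric values."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)

def exceeds_one_std(baseline_runs, pruned_runs):
    """True if the gap between means is larger than the larger of the two stds."""
    base_mean, base_std = summarize(baseline_runs)
    pruned_mean, pruned_std = summarize(pruned_runs)
    return abs(base_mean - pruned_mean) > max(base_std, pruned_std)

# Placeholder ECE values for three seeds each (NOT numbers from the paper).
baseline_ece = [0.10, 0.11, 0.09]
pruned_ece = [0.05, 0.06, 0.04]
robust = exceeds_one_std(baseline_ece, pruned_ece)
```

With only three seeds this is a coarse screen rather than a significance test; a bootstrap over evaluation examples or a paired test would be a stronger check.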

Circularity Check

0 steps flagged

No circularity: pure empirical evaluation with no derivations or self-defined quantities

full rationale

This paper is a pure empirical study that runs experiments on LLaVA-1.5-7B using POPE and ScienceQA-IMG datasets, measuring ECE, Brier score, and AURC under various pruning strategies (SCOPE with alpha sweeps, FastV, random) and token budgets. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations exist; all claims are direct experimental observations. The work is self-contained against external benchmarks, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation and introduces no free parameters, new entities, or non-standard axioms beyond the established definitions of calibration metrics.

axioms (1)
  • domain assumption: Expected Calibration Error, Brier score, and AURC are appropriate and sufficient measures of model confidence quality.
    These are standard metrics drawn from the model calibration literature.
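For reference, the two metrics besides ECE can be sketched as follows. The Brier score is written in its top-label form (confidence in the predicted answer against 0/1 correctness), and AURC as the average error rate over increasingly large sets of the most confident predictions. Both forms are common conventions assumed here; the paper's exact variants are not stated in this summary.

```python
def brier_score(confidences, correct):
    """Mean squared gap between confidence in the predicted answer and 0/1 correctness."""
    if not confidences:
        raise ValueError("need at least one prediction")
    return sum((c - int(ok)) ** 2 for c, ok in zip(confidences, correct)) / len(confidences)

def aurc(confidences, correct):
    """Area under the risk-coverage curve.

    Predictions are ranked from most to least confident; the risk at each
    coverage level is the error rate among the top-ranked predictions, and
    AURC is the mean of those risks.
    """
    if not confidences:
        raise ValueError("need at least one prediction")
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    errors = 0
    total_risk = 0.0
    for rank, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        total_risk += errors / rank
    return total_risk / len(order)
```

Lower is better for both: the Brier score penalizes confident mistakes, while AURC rewards a model whose errors are concentrated among its least confident answers.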
