ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

Hongchen Wei; Yiling Gao; Zhenzhong Chen

arxiv: 2606.00543 · v1 · pith:P635D5NEnew · submitted 2026-05-30 · 💻 cs.CV

ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

Yiling Gao , Hongchen Wei , Zhenzhong Chen This is my paper

Pith reviewed 2026-06-28 18:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords token compressionvision-language modelsvisual tokenscross-attentioninformation distillationsufficient statisticKV-cachemultimodal inference

0 comments

The pith

Vision-language models can compress high-resolution images to a single visual token by preserving only the instruction-aware visual information needed for the task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution images fed to VLMs create large numbers of visual tokens that drive up computation and KV-cache memory during inference. The paper establishes that task loss minimization under compression requires the compact tokens to retain the instruction-aware sufficient statistic of task-relevant visual features. ETC implements this by weighting visual features according to text-to-image cross-attention and then applying variational information distillation to transfer the essential content into far fewer tokens. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B demonstrate that the approach holds performance even at one-token compression while substantially lowering memory overhead.

Core claim

Minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. ETC approximates this statistic by weighting original visual features with text-to-image cross-attention scores and uses variational information distillation so the reduced tokens recover the same predictive content.

What carries the argument

The instruction-aware sufficient statistic of task-relevant visual information, approximated by text-to-image cross-attention weights and preserved via variational information distillation.

If this is right

VLMs can process high-resolution inputs with far lower KV-cache memory during inference.
Performance on standard vision-language benchmarks holds even when visual tokens are reduced to one.
The same compression pipeline works on both LLaVA-1.5-7B and Qwen3-VL-2B without retraining the base model.
Task loss directly guides the amount of visual information retained rather than relying on generic token pruning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention scores already computed inside the model may be reused for compression decisions in future architectures.
The same sufficient-statistic principle could guide compression of other modalities if analogous cross-modal statistics can be identified.
Designers might embed the distillation step as a fixed layer rather than a separate training phase.
Limits of the method would appear first on tasks where cross-attention fails to highlight the truly predictive visual regions.

Load-bearing premise

Text-to-image cross-attention weights serve as a reliable proxy for the latent instruction-aware predictive statistic that must be kept under compression.

What would settle it

Measure task accuracy at single-token compression after replacing the cross-attention weighting step with uniform or random weights; a large drop would indicate the approximation is necessary.

Figures

Figures reproduced from arXiv: 2606.00543 by Hongchen Wei, Yiling Gao, Zhenzhong Chen.

**Figure 1.** Figure 1: Overview of ETC. Training: compressed tokens Z are inserted between visual tokens V and text tokens T, and a bottleneck attention mask allows Z to aggregate information from V while preventing T from directly attending to the raw visual tokens. At the final LLM layer, text-to-image cross-attention scores produce instruction-aware weights that define the predictivestatistic estimate Xb; in parallel, an MLP… view at source ↗

**Figure 2.** Figure 2: Text-to-image cross-attention patterns in LLaVA-1.5-7B. (a) Layer-wise average of cross-modal attention scores across [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison between selective compression methods [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Token-budget scaling behavior of ETC on SQA (left) and TextVQA (right). ETC approaches the full-token baseline [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of ETC with LLaVA-v1.5-7B and Qwen3VL-2B. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

read the original abstract

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces a variational information distillation, enabling the compact representation to preserve the essential information to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ETC gets extreme compression working in practice on LLaVA and Qwen but the cross-attention step is presented as capturing the sufficient statistic without a derivation showing why.

read the letter

The main takeaway is that this paper gets single-token compression to hold up on LLaVA-1.5-7B and Qwen3-VL-2B while cutting KV-cache use, which matters for running high-res VLMs on limited hardware.

It does the practical part cleanly. The framework weights visual features with text-to-image cross-attention, then uses variational distillation to keep enough information to recover the task-relevant signal. The experiments show the method stays effective even at the extreme end, which is the result that would interest people shipping these models.

The information-theoretic framing is the part that feels thinner. The claim is that the compact tokens must preserve the instruction-aware sufficient statistic, and cross-attention is used to approximate it. Nothing in the abstract or the stress-test note shows a derivation that turns the attention weights into that statistic rather than a useful relevance heuristic. The variational step then assumes the weighted features already contain what needs preserving. That gap makes the theory read more like post-hoc justification than a tight argument.

The rest looks standard for the area: they build on existing VLM compression ideas without obvious circularity in the reported numbers. No machine-checked proofs or external data releases are mentioned, so the evidence stays at the level of the reported runs.

This is for groups working on inference efficiency in vision-language models. Someone already running token-reduction experiments would get concrete numbers and a workable recipe to try. It is worth sending to a serious referee because the empirical compression result is sharp enough to justify review time, even if the authors need to tighten the link between attention weights and the claimed sufficient statistic.

Referee Report

2 major / 1 minor

Summary. The paper proposes ETC, a framework for extreme token compression in VLMs that minimizes task loss via variational information distillation. It claims an information-theoretic result that the compact representation must preserve the instruction-aware sufficient statistic of task-relevant visual information, which is approximated in practice by weighting original visual features with text-to-image cross-attention; a variational step then distills the compact tokens to retain information needed to recover this statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B report that the method remains effective even at single-token compression while substantially reducing KV-cache overhead.

Significance. If the central information-theoretic argument is valid and the cross-attention approximation is justified, ETC would provide a principled, task-aware alternative to heuristic token pruning methods, enabling efficient high-resolution VLM inference. The reported single-token results, if reproducible across tasks, would be a notable empirical contribution to the compression literature.

major comments (2)

[§3] The information-theoretic claim (abstract and §3) requires that the compact tokens preserve exactly the instruction-aware sufficient statistic; however, the manuscript approximates this statistic via text-to-image cross-attention weights without a derivation showing why these weights yield (an approximation to) the minimal sufficient statistic rather than a heuristic relevance score. The subsequent variational distillation step assumes the weighted features already encode the target statistic, which is not independently verified and is load-bearing for the central claim.
[§4] §4 and experimental section: no ablation studies isolate the contribution of the cross-attention approximation versus the variational distillation objective, nor compare against other potential estimators of the sufficient statistic; this leaves open whether performance under extreme compression is due to the claimed information-theoretic grounding or to the specific implementation choices.

minor comments (1)

[§3] Notation for the variational objective and the definition of the 'instruction-aware sufficient statistic' should be introduced with explicit equations early in §3 to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments point by point below. We agree that clarifications and additional experiments are warranted and will revise the manuscript accordingly.

read point-by-point responses

Referee: [§3] The information-theoretic claim (abstract and §3) requires that the compact tokens preserve exactly the instruction-aware sufficient statistic; however, the manuscript approximates this statistic via text-to-image cross-attention weights without a derivation showing why these weights yield (an approximation to) the minimal sufficient statistic rather than a heuristic relevance score. The subsequent variational distillation step assumes the weighted features already encode the target statistic, which is not independently verified and is load-bearing for the central claim.

Authors: We acknowledge that the manuscript presents cross-attention weighting as a practical approximation to the instruction-aware sufficient statistic without a formal derivation establishing it as the minimal sufficient statistic. The information-theoretic argument shows that the compact tokens must preserve this statistic to minimize task loss, but the choice of cross-attention is motivated by its ability to capture text-conditioned relevance rather than proven optimality. The variational distillation then operates on these weighted features. We will revise §3 to explicitly distinguish the theoretical requirement from the practical approximation, add discussion of why cross-attention is a reasonable proxy, and note the lack of independent verification of the statistic. This will be addressed in the revision. revision: yes
Referee: [§4] §4 and experimental section: no ablation studies isolate the contribution of the cross-attention approximation versus the variational distillation objective, nor compare against other potential estimators of the sufficient statistic; this leaves open whether performance under extreme compression is due to the claimed information-theoretic grounding or to the specific implementation choices.

Authors: We agree that the current experiments do not include ablations separating the cross-attention weighting from the variational objective, nor comparisons to alternative estimators of the sufficient statistic. Such studies would strengthen the claims regarding the source of performance gains under extreme compression. We will add these ablations in the revised manuscript, including variants that replace cross-attention with uniform or random weighting and comparisons to other relevance estimators where feasible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

Abstract states an information-theoretic claim that minimizing task loss requires preserving the instruction-aware sufficient statistic, then describes using cross-attention weights as a practical approximation to that statistic followed by variational distillation. No equations are provided, no self-citations are invoked to justify the approximation as a theorem, and the method is not shown to reduce to a fitted input or renamed known result by construction. The derivation chain is self-contained against external benchmarks with no load-bearing reductions exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5700 in / 1042 out tokens · 16170 ms · 2026-06-28T18:49:56.184925+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Qwen3 Technical Report

An Y ang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Y u, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv Preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Y e, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and eﬃciency. arXiv Preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

LLaV A-CoT: Let vision lan- guage models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Y uan. LLaV A-CoT: Let vision lan- guage models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2087–2098, 2025

2087
[4]

TokenPacker: Eﬃcient visual projector for multimodal LLM

Wentong Li, Y uqian Y uan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. TokenPacker: Eﬃcient visual projector for multimodal LLM. Inter- national Journal of Computer Vision , pages 6794–6812, 2025

2025
[5]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Y uheng Li, and Y ong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 26296–26306, 2024

2024
[6]

Attention is all you need

Ashish V aswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Ad- vances in Neural Information Processing Systems , 30: 5998–6008, 2017

2017
[7]

SparseVLM: Visual token sparsiﬁcation for e ﬃcient vision-language model inference

Y uan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, To- moyuki Okuno, Y ohei Nakata, Kurt Keutzer, and Shang- hang Zhang. SparseVLM: Visual token sparsiﬁcation for e ﬃcient vision-language model inference. In Inter- national Conference on Machine Learning , pages 74840– 74857, 2025

2025
[8]

TopV: Compatible token pruning with inference time optimization for fast and low- memory multimodal vision language model

Cheng Y ang, Y ang Sui, Jinqi Xiao, Lingyi Huang, Y u Gong, Chendi Li, Jinghua Y an, Y u Bai, Ponnuswamy Sadayappan, Xia Hu, et al. TopV: Compatible token pruning with inference time optimization for fast and low- memory multimodal vision language model. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19803–19...

2025
[9]

LLaV A-PruMerge: Adaptive token reduction for eﬃcient large multimodal models

Y uzhang Shang, Mu Cai, Bingxin Xu, Y ong Jae Lee, and Y an Y an. LLaV A-PruMerge: Adaptive token reduction for eﬃcient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 22857–22867, 2025

2025
[10]

A TP-LLaV A: Adaptive token pruning for large vision language models

Xubing Y e, Y ukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Y ansong Tang. A TP-LLaV A: Adaptive token pruning for large vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24972–24982, 2025

2025
[11]

Matryoshka query transformer for large vision-language models

Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems , pages 50168– 50188, 2024.7

2024
[12]

V oco-LLaMA: Towards vision compres- sion with large language models

Xubing Y e, Y ukang Gan, Xiaoke Huang, Yixiao Ge, and Y ansong Tang. V oco-LLaMA: Towards vision compres- sion with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29836–29846, 2025

2025
[13]

PVC: Progressive visual token com- pression for uniﬁed image and video processing in large vision-language models

Chenyu Y ang, Xuan Dong, Xizhou Zhu, Weijie Su, Jia- hao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, and Jifeng Dai. PVC: Progressive visual token com- pression for uniﬁed image and video processing in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 24939–24949, 2025

2025
[14]

V ariational information dis- tillation for knowledge transfer

Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. V ariational information dis- tillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019

2019
[15]

E ﬃcient self-attention with smart pruning for sustainable large lan- guage models

Samir Brahim Belhaouari and Insaf Kraidia. E ﬃcient self-attention with smart pruning for sustainable large lan- guage models. Scientiﬁc Reports, 15(1):10171, 2025

2025
[16]

DyLoFViT: A novel approach for real-time metal 3d printing surface quality classiﬁcation

Y uqin Zeng, Lianli Liu, Ze Wen, Jiquan Liu, and Shuqian Fan. DyLoFViT: A novel approach for real-time metal 3d printing surface quality classiﬁcation. IET Image Process- ing, 19(1):e70182, 2025

2025
[17]

LLaV A-OneVision: Easy visual task transfer

Bo Li, Y uanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Y anwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer. Transactions on Machine Learn- ing Research, 2025. ISSN 2835-8856

2025
[18]

Qwen 2.5: A comprehensive review of the lead- ing resource-e ﬃcient LLM with potentioal to surpass all competitors

Imtiaz Ahmed, Sadman Islam, Partha Protim Datta, Im- ran Kabir, Naseef Ur Rahman Chowdhury, and Ahshanul Haque. Qwen 2.5: A comprehensive review of the lead- ing resource-e ﬃcient LLM with potentioal to surpass all competitors. Authorea Preprints, 2025

2025
[19]

RocketKV: Ac- celerating long-context LLM inference via two-stage KV cache compression

Payman Behnam, Y aosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Y u, and Alexey Tumanov. RocketKV: Ac- celerating long-context LLM inference via two-stage KV cache compression. In International Conference on Ma- chine Learning, pages 3358–3392, 2025

2025
[20]

SCOPE: Optimizing key- value cache compression in long-context generation

Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Y ulan He, and Deyu Zhou. SCOPE: Optimizing key- value cache compression in long-context generation. In Proceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics , pages 10775–10790, 2025

2025
[21]

Accelerating multi- modal large language models by searching optimal vision token reduction

Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N Metaxas, and Licheng Y u. Accelerating multi- modal large language models by searching optimal vision token reduction. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 29869–29879, 2025

2025
[22]

ST3: Accelerating multimodal large language model by spatial-temporal visual token trimming

Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, and Haoji Hu. ST3: Accelerating multimodal large language model by spatial-temporal visual token trimming. In Proceedings of the AAAI Conference on Ar- tiﬁcial Intelligence, pages 11049–11057, 2025

2025
[23]

L VPruning: An e ﬀective yet simple language-guided vi- sion token pruning approach for multi-modal large lan- guage models

Yizheng Sun, Y anze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, and Riza Theresa Batista-Navarro. L VPruning: An e ﬀective yet simple language-guided vi- sion token pruning approach for multi-modal large lan- guage models. In Findings of the Association for Com- putational Linguistics: NAACL , pages 4299–4308, 2025

2025
[24]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Weihao Y e, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence , pages 22128– 22136, 2025

2025
[25]

PACT: Pruning and clustering-based to- ken reduction for faster visual language models

Mohamed Dhouib, Davide Buscaldi, Sonia V anier, and Aymen Shabou. PACT: Pruning and clustering-based to- ken reduction for faster visual language models. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14582–14592, 2025

2025
[26]

TempMe: Video temporal token merging for eﬃcient text- video retrieval

Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, pengzhang liu, Y ongjun Bao, and Guiguang Ding. TempMe: Video temporal token merging for eﬃcient text- video retrieval. In International Conference on Learning Representations, pages 60839–60860, 2025

2025
[27]

E ﬃcient visual transformer by learnable token merging

Y ancheng Wang and Yingzhen Y ang. E ﬃcient visual transformer by learnable token merging. IEEE Transac- tions on Pattern Analysis & Machine Intelligence, 47(11): 9597–9608, 2025

2025
[28]

HierarQ: Task-aware hierarchical Q-Former for enhanced video understanding

Shehreen Azad, Vibhav Vineet, and Y ogesh Singh Rawat. HierarQ: Task-aware hierarchical Q-Former for enhanced video understanding. In Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition , pages 8545–8556, 2025

2025
[29]

Per- ceive

Roberto Amoroso, Gengyuan Zhang, Rajat Koner, Lorenzo Baraldi, Rita Cucchiara, and V olker Tresp. Per- ceive. query & reason: Enhancing video QA with question-guided temporal queries. In IEEE/CVF Winter Conference on Applications of Computer Vision , pages 8853–8862. IEEE, 2025

2025
[30]

LLaMA-Vid: An image is worth 2 tokens in large language models

Y anwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-Vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–
[31]

Semedo, and J Zico Kolter

Kevin Li, Sachin Goyal, João D. Semedo, and J Zico Kolter. Inference optimal VLMs need fewer visual to- kens and more parameters. In International Conference on Learning Representations, pages 96066–96083, 2025

2025
[32]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision , pages 740–755. Springer, 2014

2014
[33]

GQA: A new dataset for real-world visual reasoning and com- positional question answering

Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and com- positional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019

2019
[34]

OCR-VQA: Visual question8 answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question8 answering by reading text in images. In International Con- ference on Document Analysis and Recognition , pages 947–952. IEEE, 2019

2019
[35]

Towards VQA models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Y u Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 8317–8326, 2019

2019
[36]

Visual genome: Connecting language and vision using crowd- sourced dense image annotations

Ranjay Krishna, Y uke Zhu, Oliver Groth, Justin John- son, Kenji Hata, Joshua Kravitz, Stephanie Chen, Y annis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowd- sourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017

2017
[37]

MMBench: Is your multi- modal model an all-around player? In European Confer- ence on Computer Vision, pages 216–233

Y uan Liu, Haodong Duan, Y uanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Y uan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi- modal model an all-around player? In European Confer- ence on Computer Vision, pages 216–233. Springer, 2024

2024
[38]

MME: A comprehensive evalu- ation benchmark for multimodal large language models

Chaoyou Fu, Peixian Chen, Y unhang Shen, Y ulei Qin, Mengdan Zhang, Xu Lin, Jinrui Y ang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evalu- ation benchmark for multimodal large language models. In Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025
[39]

Seed-Bench: Benchmarking multimodal large language models

Bohao Li, Y uying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-Bench: Benchmarking multimodal large language models. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 13299–13308, 2024

2024
[40]

Learn to explain: Multimodal rea- soning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai- Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal rea- soning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35: 2507–2521, 2022

2022
[41]

Making the V in VQA matter: El- evating the role of image understanding in visual question answering

Y ash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: El- evating the role of image understanding in visual question answering. In Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition, pages 6904– 6913, 2017

2017
[42]

Q-Bench: A bench- mark for general-purpose foundation models on low-level vision

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Y an, Guangtao Zhai, and Weisi Lin. Q-Bench: A bench- mark for general-purpose foundation models on low-level vision. In International Conference on Learning Repre- sentations, 2024

2024
[43]

VisionZIP: Longer is better but not necessary in vision language models

Senqiao Y ang, Y ukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Y u, and Jiaya Jia. VisionZIP: Longer is better but not necessary in vision language models. In Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition , pages 19792– 19802, 2025

2025
[44]

Conical visual concen- tration for e ﬃcient large vision-language models

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Y uhang Zang, Y uhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. Conical visual concen- tration for e ﬃcient large vision-language models. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14593–14603, 2025

2025
[45]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Jun- yang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models. In European Conference on Computer Vision , pages 19–35. Springer, 2024

2024
[46]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning , pages 19730–19742, 2023

2023
[47]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Y uandong Tian. Extending context window of large lan- guage models via positional interpolation. arXiv Preprint arXiv:2306.15595, 2023. 9 A Appendix A.1 Proofs and Additional Derivations This appendix provides the full derivations for Section 3. We derive the ideal requirement for the compact representatio...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Qwen3 Technical Report

An Y ang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Y u, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv Preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Y e, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and eﬃciency. arXiv Preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

LLaV A-CoT: Let vision lan- guage models reason step-by-step

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Y uan. LLaV A-CoT: Let vision lan- guage models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2087–2098, 2025

2087

[4] [4]

TokenPacker: Eﬃcient visual projector for multimodal LLM

Wentong Li, Y uqian Y uan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. TokenPacker: Eﬃcient visual projector for multimodal LLM. Inter- national Journal of Computer Vision , pages 6794–6812, 2025

2025

[5] [5]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Y uheng Li, and Y ong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 26296–26306, 2024

2024

[6] [6]

Attention is all you need

Ashish V aswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Ad- vances in Neural Information Processing Systems , 30: 5998–6008, 2017

2017

[7] [7]

SparseVLM: Visual token sparsiﬁcation for e ﬃcient vision-language model inference

Y uan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, To- moyuki Okuno, Y ohei Nakata, Kurt Keutzer, and Shang- hang Zhang. SparseVLM: Visual token sparsiﬁcation for e ﬃcient vision-language model inference. In Inter- national Conference on Machine Learning , pages 74840– 74857, 2025

2025

[8] [8]

TopV: Compatible token pruning with inference time optimization for fast and low- memory multimodal vision language model

Cheng Y ang, Y ang Sui, Jinqi Xiao, Lingyi Huang, Y u Gong, Chendi Li, Jinghua Y an, Y u Bai, Ponnuswamy Sadayappan, Xia Hu, et al. TopV: Compatible token pruning with inference time optimization for fast and low- memory multimodal vision language model. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19803–19...

2025

[9] [9]

LLaV A-PruMerge: Adaptive token reduction for eﬃcient large multimodal models

Y uzhang Shang, Mu Cai, Bingxin Xu, Y ong Jae Lee, and Y an Y an. LLaV A-PruMerge: Adaptive token reduction for eﬃcient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 22857–22867, 2025

2025

[10] [10]

A TP-LLaV A: Adaptive token pruning for large vision language models

Xubing Y e, Y ukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Y ansong Tang. A TP-LLaV A: Adaptive token pruning for large vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24972–24982, 2025

2025

[11] [11]

Matryoshka query transformer for large vision-language models

Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems , pages 50168– 50188, 2024.7

2024

[12] [12]

V oco-LLaMA: Towards vision compres- sion with large language models

Xubing Y e, Y ukang Gan, Xiaoke Huang, Yixiao Ge, and Y ansong Tang. V oco-LLaMA: Towards vision compres- sion with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29836–29846, 2025

2025

[13] [13]

PVC: Progressive visual token com- pression for uniﬁed image and video processing in large vision-language models

Chenyu Y ang, Xuan Dong, Xizhou Zhu, Weijie Su, Jia- hao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, and Jifeng Dai. PVC: Progressive visual token com- pression for uniﬁed image and video processing in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 24939–24949, 2025

2025

[14] [14]

V ariational information dis- tillation for knowledge transfer

Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. V ariational information dis- tillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9163–9171, 2019

2019

[15] [15]

E ﬃcient self-attention with smart pruning for sustainable large lan- guage models

Samir Brahim Belhaouari and Insaf Kraidia. E ﬃcient self-attention with smart pruning for sustainable large lan- guage models. Scientiﬁc Reports, 15(1):10171, 2025

2025

[16] [16]

DyLoFViT: A novel approach for real-time metal 3d printing surface quality classiﬁcation

Y uqin Zeng, Lianli Liu, Ze Wen, Jiquan Liu, and Shuqian Fan. DyLoFViT: A novel approach for real-time metal 3d printing surface quality classiﬁcation. IET Image Process- ing, 19(1):e70182, 2025

2025

[17] [17]

LLaV A-OneVision: Easy visual task transfer

Bo Li, Y uanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Y anwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer. Transactions on Machine Learn- ing Research, 2025. ISSN 2835-8856

2025

[18] [18]

Qwen 2.5: A comprehensive review of the lead- ing resource-e ﬃcient LLM with potentioal to surpass all competitors

Imtiaz Ahmed, Sadman Islam, Partha Protim Datta, Im- ran Kabir, Naseef Ur Rahman Chowdhury, and Ahshanul Haque. Qwen 2.5: A comprehensive review of the lead- ing resource-e ﬃcient LLM with potentioal to surpass all competitors. Authorea Preprints, 2025

2025

[19] [19]

RocketKV: Ac- celerating long-context LLM inference via two-stage KV cache compression

Payman Behnam, Y aosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Y u, and Alexey Tumanov. RocketKV: Ac- celerating long-context LLM inference via two-stage KV cache compression. In International Conference on Ma- chine Learning, pages 3358–3392, 2025

2025

[20] [20]

SCOPE: Optimizing key- value cache compression in long-context generation

Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Y ulan He, and Deyu Zhou. SCOPE: Optimizing key- value cache compression in long-context generation. In Proceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics , pages 10775–10790, 2025

2025

[21] [21]

Accelerating multi- modal large language models by searching optimal vision token reduction

Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N Metaxas, and Licheng Y u. Accelerating multi- modal large language models by searching optimal vision token reduction. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 29869–29879, 2025

2025

[22] [22]

ST3: Accelerating multimodal large language model by spatial-temporal visual token trimming

Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, and Haoji Hu. ST3: Accelerating multimodal large language model by spatial-temporal visual token trimming. In Proceedings of the AAAI Conference on Ar- tiﬁcial Intelligence, pages 11049–11057, 2025

2025

[23] [23]

L VPruning: An e ﬀective yet simple language-guided vi- sion token pruning approach for multi-modal large lan- guage models

Yizheng Sun, Y anze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, and Riza Theresa Batista-Navarro. L VPruning: An e ﬀective yet simple language-guided vi- sion token pruning approach for multi-modal large lan- guage models. In Findings of the Association for Com- putational Linguistics: NAACL , pages 4299–4308, 2025

2025

[24] [24]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Weihao Y e, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence , pages 22128– 22136, 2025

2025

[25] [25]

PACT: Pruning and clustering-based to- ken reduction for faster visual language models

Mohamed Dhouib, Davide Buscaldi, Sonia V anier, and Aymen Shabou. PACT: Pruning and clustering-based to- ken reduction for faster visual language models. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14582–14592, 2025

2025

[26] [26]

TempMe: Video temporal token merging for eﬃcient text- video retrieval

Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, pengzhang liu, Y ongjun Bao, and Guiguang Ding. TempMe: Video temporal token merging for eﬃcient text- video retrieval. In International Conference on Learning Representations, pages 60839–60860, 2025

2025

[27] [27]

E ﬃcient visual transformer by learnable token merging

Y ancheng Wang and Yingzhen Y ang. E ﬃcient visual transformer by learnable token merging. IEEE Transac- tions on Pattern Analysis & Machine Intelligence, 47(11): 9597–9608, 2025

2025

[28] [28]

HierarQ: Task-aware hierarchical Q-Former for enhanced video understanding

Shehreen Azad, Vibhav Vineet, and Y ogesh Singh Rawat. HierarQ: Task-aware hierarchical Q-Former for enhanced video understanding. In Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition , pages 8545–8556, 2025

2025

[29] [29]

Per- ceive

Roberto Amoroso, Gengyuan Zhang, Rajat Koner, Lorenzo Baraldi, Rita Cucchiara, and V olker Tresp. Per- ceive. query & reason: Enhancing video QA with question-guided temporal queries. In IEEE/CVF Winter Conference on Applications of Computer Vision , pages 8853–8862. IEEE, 2025

2025

[30] [30]

LLaMA-Vid: An image is worth 2 tokens in large language models

Y anwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-Vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–

[31] [31]

Semedo, and J Zico Kolter

Kevin Li, Sachin Goyal, João D. Semedo, and J Zico Kolter. Inference optimal VLMs need fewer visual to- kens and more parameters. In International Conference on Learning Representations, pages 96066–96083, 2025

2025

[32] [32]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision , pages 740–755. Springer, 2014

2014

[33] [33]

GQA: A new dataset for real-world visual reasoning and com- positional question answering

Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and com- positional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019

2019

[34] [34]

OCR-VQA: Visual question8 answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question8 answering by reading text in images. In International Con- ference on Document Analysis and Recognition , pages 947–952. IEEE, 2019

2019

[35] [35]

Towards VQA models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Y u Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 8317–8326, 2019

2019

[36] [36]

Visual genome: Connecting language and vision using crowd- sourced dense image annotations

Ranjay Krishna, Y uke Zhu, Oliver Groth, Justin John- son, Kenji Hata, Joshua Kravitz, Stephanie Chen, Y annis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowd- sourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017

2017

[37] [37]

MMBench: Is your multi- modal model an all-around player? In European Confer- ence on Computer Vision, pages 216–233

Y uan Liu, Haodong Duan, Y uanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Y uan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi- modal model an all-around player? In European Confer- ence on Computer Vision, pages 216–233. Springer, 2024

2024

[38] [38]

MME: A comprehensive evalu- ation benchmark for multimodal large language models

Chaoyou Fu, Peixian Chen, Y unhang Shen, Y ulei Qin, Mengdan Zhang, Xu Lin, Jinrui Y ang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evalu- ation benchmark for multimodal large language models. In Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025

[39] [39]

Seed-Bench: Benchmarking multimodal large language models

Bohao Li, Y uying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-Bench: Benchmarking multimodal large language models. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 13299–13308, 2024

2024

[40] [40]

Learn to explain: Multimodal rea- soning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai- Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal rea- soning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35: 2507–2521, 2022

2022

[41] [41]

Making the V in VQA matter: El- evating the role of image understanding in visual question answering

Y ash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: El- evating the role of image understanding in visual question answering. In Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition, pages 6904– 6913, 2017

2017

[42] [42]

Q-Bench: A bench- mark for general-purpose foundation models on low-level vision

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Y an, Guangtao Zhai, and Weisi Lin. Q-Bench: A bench- mark for general-purpose foundation models on low-level vision. In International Conference on Learning Repre- sentations, 2024

2024

[43] [43]

VisionZIP: Longer is better but not necessary in vision language models

Senqiao Y ang, Y ukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Y u, and Jiaya Jia. VisionZIP: Longer is better but not necessary in vision language models. In Proceedings of the IEEE /CVF Conference on Computer Vision and Pattern Recognition , pages 19792– 19802, 2025

2025

[44] [44]

Conical visual concen- tration for e ﬃcient large vision-language models

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Y uhang Zang, Y uhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. Conical visual concen- tration for e ﬃcient large vision-language models. In Pro- ceedings of the IEEE /CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14593–14603, 2025

2025

[45] [45]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Jun- yang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models. In European Conference on Computer Vision , pages 19–35. Springer, 2024

2024

[46] [46]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning , pages 19730–19742, 2023

2023

[47] [47]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Y uandong Tian. Extending context window of large lan- guage models via positional interpolation. arXiv Preprint arXiv:2306.15595, 2023. 9 A Appendix A.1 Proofs and Additional Derivations This appendix provides the full derivations for Section 3. We derive the ideal requirement for the compact representatio...

work page internal anchor Pith review Pith/arXiv arXiv 2023