pith. machine review for the scientific record.

arxiv: 2603.27960 · v2 · submitted 2026-03-30 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Bo Han, Surendra Pathak

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords large vision-language models · inference optimization · visual token compression · memory management · efficient architecture · decoding strategies · multimodal models · survey

The pith

A survey organizes techniques to accelerate large vision-language model inference into four categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys methods to make large vision-language models faster during inference. It groups existing optimization approaches into four main dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. The authors review these techniques, examine their limitations, and point out open problems that could guide future work. The survey addresses how massive visual token counts from high-resolution inputs create quadratic attention costs that hinder practical deployment of these models.
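The quadratic attention cost driving this survey can be made concrete with a back-of-the-envelope sketch. The token counts and head dimension below are hypothetical, chosen only to illustrate the scaling, not figures from the paper:

```python
# Illustrative sketch: attention-score work grows quadratically in token count.
# All dimensions here are hypothetical, not taken from the surveyed paper.

def attention_score_flops(num_tokens: int, head_dim: int = 128) -> int:
    """Multiply-adds for one head's QK^T score matrix: n * n * d."""
    return num_tokens * num_tokens * head_dim

text_only = attention_score_flops(256)           # short text prompt
with_image = attention_score_flops(256 + 2048)   # high-res image adds many visual tokens
print(with_image / text_only)  # 81.0 — ~9x the tokens, ~81x the score-matrix work
```

This is why visual token compression, the first taxonomy dimension, attacks token count directly: halving tokens roughly quarters the attention cost.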

Core claim

The paper presents a comprehensive survey of state-of-the-art techniques for accelerating LVLM inference and introduces a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. It further examines the limitations of current methodologies and identifies open problems to inspire future research on efficient multimodal systems.

What carries the argument

The four-dimension taxonomy that groups optimization frameworks for LVLM inference acceleration.
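Read as a data structure, the taxonomy is a partition: every method files under exactly one dimension. The sketch below uses illustrative technique names drawn from the cited literature; the assignments are an assumption for demonstration, not the paper's own tables:

```python
# Hedged sketch: the survey's four-dimension taxonomy as a simple lookup.
# Technique-to-dimension assignments are illustrative, not the paper's.
TAXONOMY = {
    "visual token compression":       ["token pruning", "token merging"],
    "memory management and serving":  ["paged KV cache", "KV cache eviction"],
    "efficient architectural design": ["mixture-of-experts", "linear attention"],
    "advanced decoding strategies":   ["speculative decoding", "early exit"],
}

def classify(technique: str):
    """Return the single dimension a technique falls under, or None if uncovered."""
    for dimension, techniques in TAXONOMY.items():
        if technique in techniques:
            return dimension
    return None  # an uncovered technique would challenge the taxonomy's completeness

print(classify("speculative decoding"))  # advanced decoding strategies
```

A `None` result is exactly the falsifier discussed under "What would settle it" below: a method that fits no dimension.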

If this is right

  • Visual token compression reduces the quadratic complexity of attention mechanisms caused by high-resolution inputs.
  • Memory management and serving techniques improve efficiency when deploying large models at scale.
  • Efficient architectural designs lower the overall computational load during inference.
  • Advanced decoding strategies accelerate the output generation process.
  • Critical examination of current limitations helps define open problems for future multimodal efficiency research.
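As a toy instance of the first bullet, similarity-based token merging (in the spirit of Token Merging, ref. 6) can be sketched as finding the most redundant pair of visual tokens and averaging them. Real methods use bipartite matching inside attention blocks; this minimal version, with made-up 2-d token vectors, only shows the core idea:

```python
import math

# Minimal, illustrative sketch of similarity-based visual token merging.
# Not the algorithm of any surveyed paper; tokens are toy 2-d vectors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def merge_most_similar_pair(tokens):
    """Merge the single most-similar adjacent token pair by averaging."""
    best = max(range(len(tokens) - 1), key=lambda i: cosine(tokens[i], tokens[i + 1]))
    merged = [(x + y) / 2 for x, y in zip(tokens[best], tokens[best + 1])]
    return tokens[:best] + [merged] + tokens[best + 2:]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # first two are near-duplicates
print(len(merge_most_similar_pair(tokens)))      # 2 — the redundant pair collapsed
```

Applied repeatedly, each merge removes one token, and the quadratic attention cost shrinks accordingly.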

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a reference frame for researchers to classify and compare new inference methods consistently.
  • Combining techniques across the four dimensions might produce efficiency improvements larger than any single category alone.
  • Real-world benchmarks on high-resolution multimodal tasks could test whether the taxonomy covers emerging approaches.
  • Resolving the identified open problems may support wider use of these models on devices with limited compute resources.
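The limited-compute concern in the last bullet is largely KV-cache arithmetic. The sketch below estimates cache size under hypothetical decoder dimensions (fp16; the layer, head, and token counts are assumptions, not figures from the paper):

```python
# Back-of-the-envelope KV-cache size for an LVLM decoder.
# 2 tensors (K and V) * layers * tokens * heads * head_dim * bytes per element.
# All model dimensions below are hypothetical.

def kv_cache_bytes(num_tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * num_tokens * heads * head_dim * dtype_bytes

with_image = kv_cache_bytes(256 + 2048)  # visual tokens dominate the prompt
print(with_image / 2**30)                # 1.125 GiB before a single token is generated
```

Because the cache grows linearly with token count, the memory-management dimension (paging, eviction, compression) directly determines what fits on an edge device.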

Load-bearing premise

That the four-dimension taxonomy is systematic and complete, and that the identified limitations accurately reflect the full literature.

What would settle it

A new optimization method for LVLM inference that cannot be placed in any of the four taxonomy dimensions or that shows the stated limitations do not hold across recent work.

Figures

Figures reproduced from arXiv: 2603.27960 by Bo Han, Surendra Pathak.

Figure 1: Schematic of a standard Large Vision Language Model (LVLM) architecture. The vision encoder and text encoder convert the visual and text inputs to corresponding tokens. Vision tokens are further transformed by the projector before being fused with the text tokens. The aggregated tokens are then passed to the backbone language model for autoregressive generation. Advantages of optimization in this portion of…
Figure 2: Taxonomy of Efficient Inference Strategies for Large Vision-Language Models (LVLMs).
read the original abstract

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper surveys techniques for accelerating inference in Large Vision-Language Models (LVLMs). It introduces a systematic four-dimension taxonomy covering visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies, while critically examining limitations of current methods and identifying open problems to guide future work.

Significance. If the taxonomy proves systematic and the coverage representative, the survey could serve as a useful reference for organizing the literature on LVLM efficiency, helping researchers navigate deployment challenges in multimodal systems.

minor comments (1)
  1. [Abstract] The claim of a 'systematic taxonomy' would be strengthened by briefly stating the paper-selection criteria or the number of works reviewed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of our survey on inference optimization techniques for Large Vision-Language Models. We appreciate the acknowledgment that a systematic four-dimension taxonomy covering visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies, along with discussion of limitations and open problems, could serve as a useful reference for the community. We note the 'uncertain' recommendation but observe that no specific major comments were provided in the report.

Circularity Check

0 steps flagged

No circularity: survey with external literature review only

full rationale

This is a survey paper that introduces a four-dimension taxonomy to organize existing LVLM inference acceleration techniques drawn from the broader literature. No derivations, equations, fitted parameters, predictions, or self-referential chains appear in the abstract or described structure. The taxonomy is presented as an organizational framework rather than a result derived from internal assumptions or prior self-citations. All claims rest on external references, satisfying the condition for a self-contained review with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a literature survey and does not introduce or rely on any free parameters, axioms, or invented entities beyond standard background knowledge in machine learning.

pith-pipeline@v0.9.0 · 5422 in / 1028 out tokens · 51292 ms · 2026-05-14T21:37:33.204559+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 8 internal anchors

  1. Agarwal, S., Acun, B., Hosmer, B., Elhoushi, M., Lee, Y., Venkataraman, S., Papailiopoulos, D., Wu, C.J.: Chai: Clustered head attention for efficient LLM inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 291–312 (2024)

  2. Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)

  3. Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: DivPrune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 9392–9401 (2025)

  4. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  5. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)

  6. Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461 (2022)

  7. Brandon, W., Nrusimha, A., Qian, K., Ankner, Z., Jin, T., Song, Z., Ragan-Kelley, J.: Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431 (2023)

  8. Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., Xiao, W.: PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In: Second Conference on Language Modeling (2025), https://openreview.net/forum?id=ayi7qezU87

  9. Cao, Q., Paranjape, B., Hajishirzi, H.: PuMer: Pruning and merging tokens for efficient vision language models. arXiv preprint arXiv:2305.17530 (2023)

  10. Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: European Conference on Computer Vision, pp. 19–35. Springer (2024)

  11. Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198 (2024)

  12. Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023)

  13. Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=mZn2Xyh9Ec

  14. Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35, 16344–16359 (2022)

  15. Devoto, A., Zhao, Y., Scardapane, S., Minervini, P.: A simple and effective L2 norm-based strategy for KV cache compression. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18476–18499 (2024)

  16. Fan, S., Jiang, X., Li, X., Meng, X., Han, P., Shang, S., Sun, A., Wang, Y.: Not all layers of LLMs are necessary during inference. In: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pp. 5083–5091 (2025)

  17. Fu, T., Liu, T., Han, Q., Dai, G., Yan, S., Yang, H., Ning, X., Wang, Y.: FrameFusion: Combining similarity and importance for video token reduction on large vision language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22654–22663 (2025)

  18. Gagrani, M., Goel, R., Jeon, W., Park, J., Lee, M., Lott, C.: On speculative decoding for multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8285–8289 (2024)

  19. Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024)

  20. Holmes, C., Tanaka, M., Wyatt, M., Awan, A.A., Rasley, J., Rajbhandari, S., Aminabadi, R.Y., Qin, H., Bakhtiari, A., Kurilenko, L., et al.: DeepSpeed-FastGen: High-throughput text generation for LLMs via MII and DeepSpeed-Inference. arXiv preprint arXiv:2401.08671 (2024)

  21. Hooper, C.R.C., Kim, S., Mohammadzadeh, H., Maheswaran, M., Zhao, S., Paik, J., Mahoney, M.W., Keutzer, K., Gholami, A.: Squeezed attention: Accelerating long context length LLM inference. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 32631–32652 (2025)

  22. Hu, C., Huang, H., Hu, J., Xu, J., Chen, X., Xie, T., Wang, C., Wang, S., Bao, Y., Sun, N., et al.: MemServe: Context caching for disaggregated LLM serving with elastic memory pool. arXiv preprint arXiv:2406.17565 (2024)

  23. Hu, C., Huang, H., Xu, L., Chen, X., Wang, C., Xu, J., Chen, S., Feng, H., Wang, S., Bao, Y., et al.: ShuffleInfer: Disaggregate LLM inference for mixed downstream workloads. ACM Transactions on Architecture and Code Optimization 22(2), 1–24 (2025)

  24. Hyun, J., Hwang, S., Han, S.H., Kim, T., Lee, I., Wee, D., Lee, J.Y., Kim, S.J., Shim, M.: Multi-granular spatio-temporal token merging for training-free acceleration of video LLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23990–24000 (2025)

  25. Jang, D., Park, S., Yang, J.Y., Jung, Y., Yun, J., Kundu, S., Kim, S.Y., Yang, E.: LANTERN: Accelerating visual autoregressive models with relaxed speculative decoding. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=98d7DLMGdt

  26. Ji, Y., Zhang, J., Xia, H., Chen, J., Shou, L., Chen, G., Li, H.: SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 7216–7230 (2025)

  27. Jiang, A., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)

  28. Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13700–13710 (2024)

  29. Kang, J., Shu, H., Li, W., Zhai, Y., Chen, X.: ViSpec: Accelerating vision-language models with vision-aware speculative decoding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=x2BsIdJJJW

  30. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: Fast autoregressive transformers with linear attention. In: International Conference on Machine Learning, pp. 5156–5165. PMLR (2020)

  31. Kim, J.H., Yeom, J., Yun, S., Song, H.O.: Compressed context memory for online language model interaction. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=64kSvC4iPg

  32. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626 (2023)

  33. Li, Y., Wang, C., Jia, J.: LLaMA-VID: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision, pp. 323–340. Springer (2024)

  34. Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., Chen, D.: SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, 22947–22970 (2024)

  35. Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: MoE-LLaVA: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)

  36. Lin, B., Zhang, C., Peng, T., Zhao, H., Xiao, W., Sun, M., Liu, A., Zhang, Z., Li, L., Qiu, X., et al.: Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache. arXiv preprint arXiv:2401.02669 (2024)

  37. Liu, H., Zaharia, M., Abbeel, P.: RingAttention with blockwise transformers for near-infinite context. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=WsRHpHH4s0

  38. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306 (2024)

  39. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

  40. Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: NVILA: Efficient frontier visual language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4122–4134 (2025)

  41. McKinzie, B., Gan, Z., Fauconnier, J.P., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Belyi, A., et al.: MM1: Methods, analysis and insights from multimodal LLM pre-training. In: European Conference on Computer Vision, pp. 304–323. Springer (2024)

  42. Nawrot, P., Łańcucki, A., Chochowski, M., Tarjan, D., Ponti, E.M.: Dynamic memory compression: Retrofitting LLMs for accelerated inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 37396–37412 (2024)

  43. Pathak, S., Han, B.: ASAP: Attention-shift-aware pruning for efficient LVLM inference. arXiv preprint arXiv:2603.14549 (2026)

  44. Qin, Z., Cao, Y., Lin, M., Hu, W., Fan, S., Cheng, K., Lin, W., Li, J.: CAKE: Cascading and adaptive KV cache eviction with layer preferences. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=EQgEMAD4kv

  45. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  46. Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R.Y., Awan, A.A., Rasley, J., He, Y.: DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In: International Conference on Machine Learning, pp. 18332–18346. PMLR (2022)

  47. Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., Orr, D.: SparQ attention: Bandwidth-efficient LLM inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 42558–42583 (2024)

  48. Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., Dao, T.: FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems 37, 68658–68685 (2024)

  49. Shang, Y., Cai, M., Xu, B., Lee, Y.J., Yan, Y.: LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22857–22867 (2025)

  50. Shao, K., Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: HoliTom: Holistic token merging for fast video large language models. arXiv preprint arXiv:2505.21334 (2025)

  51. Shen, L., Gong, G., He, T., Zhang, Y., Liu, P., Zhao, S., Ding, G.: FastVID: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187 (2025)

  52. Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., et al.: LongVU: Spatiotemporal adaptive compression for long video-language understanding. In: International Conference on Machine Learning, pp. 54582–54599. PMLR (2025)

  53. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., Zhang, C.: FlexGen: High-throughput generative inference of large language models with a single GPU. In: International Conference on Machine Learning, pp. 31094–31116. PMLR (2023)

  54. Song, D., Wang, W., Chen, S., Wang, X., Guan, M.X., Wang, B.: Less is more: A simple yet effective token reduction method for efficient multi-modal LLMs. In: Proceedings of the 31st International Conference on Computational Linguistics, pp. 7614–7623 (2025)

  55. Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., Han, S.: Quest: Query-aware sparsity for efficient long-context LLM inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 47901–47911 (2024)

  56. Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: DyCoke: Dynamic compression of tokens for fast video large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18992–19001 (2025)

  57. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  58. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  59. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  60. Wan, Z., Wu, X., Zhang, Y., Xin, Y., Tao, C., Zhu, Z., Wang, X., Luo, S., Xiong, J., Wang, L., Zhang, M.: D2O: Dynamic discriminative operations for efficient long-context inference of large language models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=HzBfoUdjHt

  61. Wang, H., Nie, Y., Ye, Y., Wang, Y., Li, S., Yu, H., Lu, J., Huang, C.: Dynamic-VLM: Simple dynamic visual token compression for VideoLLM. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20812–20823 (2025)

  62. Wang, X., Zhang, J., Wang, T., Zhang, H., Zheng, F.: Seeing more, saying more: Lightweight language experts are dynamic video token compressors. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 541–558 (2025)

  63. Wang, Y., Xiao, Z.: LoMA: Lossless compressed memory attention. arXiv preprint arXiv:2401.09486 (2024)

  64. Wu, B., Liu, S., Zhong, Y., Sun, P., Liu, X., Jin, X.: LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism. In: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 640–654 (2024)

  65. Wu, B., Zhong, Y., Zhang, Z., Liu, S., Liu, F., Sun, Y., Huang, G., Liu, X., Jin, X.: Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920 (2023)

  66. Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)

  67. Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., Sun, M.: InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. Advances in Neural Information Processing Systems 37, 119638–119661 (2024)

  68. Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=NG7sS51zVF

  69. Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., et al.: PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247 (2024)

  70. Yang, S., Chen, Y., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: VisionZip: Longer is better but not necessary in vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19792–19802 (2025)

  71. Ye, L., Tao, Z., Huang, Y., Li, Y.: ChunkAttention: Efficient self-attention with prefix-aware KV cache and two-phase partition. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11608–11620 (2024)

  72. Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., Chun, B.G.: Orca: A distributed serving system for Transformer-based generative models. In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538 (2022)

  73. Yunzhuzhang, Lu, Y., Wang, T., Rao, F., Yang, Y., Zhu, L.: FlexSelect: Flexible token selection for efficient long video understanding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  74. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986 (2023)

  75. Zhang, H., Fu, Y.: VQToken: Neural discrete token representation learning for extreme token reduction in video large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  76. Zhang, Q., Cheng, A., Lu, M., Zhang, R., Zhuo, Z., Cao, J., Guo, S., She, Q., Zhang, S.: Beyond text-visual attention: Exploiting visual cues for effective token pruning in VLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20857–20867 (2025)

  77. Zhang, Y., Fan, C.K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D., Okuno, T., Nakata, Y., Keutzer, K., et al.: SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417 (2024)

  78. Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al.: H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, 34661–34710 (2023)

  79. Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37, 62557–62583 (2024)

  80. Zheng, Z., Ji, X., Fang, T., Zhou, F., Liu, C., Peng, G.: BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching. arXiv preprint arXiv:2412.03594 (2024)

Showing first 80 references.