pith. machine review for the scientific record.

arxiv: 2603.27960 · v2 · submitted 2026-03-30 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Bo Han, Surendra Pathak

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords large vision-language models · inference optimization · visual token compression · memory management · efficient architecture · decoding strategies · multimodal models · survey

The pith

A survey organizes techniques to accelerate large vision-language model inference into four categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys methods to make large vision-language models faster during inference. It groups existing optimization approaches into four main dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. The authors review these techniques, examine their limitations, and point out open problems that could guide future work. The survey addresses how massive visual token counts from high-resolution inputs create quadratic attention costs that hinder practical deployment of these models.
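The quadratic attention cost driving this survey can be made concrete with a back-of-the-envelope sketch. The token counts and head dimension below are hypothetical, chosen only to illustrate the scaling, not figures from the paper:

```python
# Illustrative sketch: attention-score work grows quadratically in token count.
# All dimensions here are hypothetical, not taken from the surveyed paper.

def attention_score_flops(num_tokens: int, head_dim: int = 128) -> int:
    """Multiply-adds for one head's QK^T score matrix: n * n * d."""
    return num_tokens * num_tokens * head_dim

text_only = attention_score_flops(256)           # short text prompt
with_image = attention_score_flops(256 + 2048)   # high-res image adds many visual tokens
print(with_image / text_only)  # 81.0 — ~9x the tokens, ~81x the score-matrix work
```

This is why visual token compression, the first taxonomy dimension, attacks token count directly: halving tokens roughly quarters the attention cost.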

Core claim

The paper presents a comprehensive survey of state-of-the-art techniques for accelerating LVLM inference and introduces a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. It further examines the limitations of current methodologies and identifies open problems to inspire future research on efficient multimodal systems.

What carries the argument

The four-dimension taxonomy that groups optimization frameworks for LVLM inference acceleration.
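Read as a data structure, the taxonomy is a partition: every method files under exactly one dimension. The sketch below uses illustrative technique names drawn from the cited literature; the assignments are an assumption for demonstration, not the paper's own tables:

```python
# Hedged sketch: the survey's four-dimension taxonomy as a simple lookup.
# Technique-to-dimension assignments are illustrative, not the paper's.
TAXONOMY = {
    "visual token compression":       ["token pruning", "token merging"],
    "memory management and serving":  ["paged KV cache", "KV cache eviction"],
    "efficient architectural design": ["mixture-of-experts", "linear attention"],
    "advanced decoding strategies":   ["speculative decoding", "early exit"],
}

def classify(technique: str):
    """Return the single dimension a technique falls under, or None if uncovered."""
    for dimension, techniques in TAXONOMY.items():
        if technique in techniques:
            return dimension
    return None  # an uncovered technique would challenge the taxonomy's completeness

print(classify("speculative decoding"))  # advanced decoding strategies
```

A `None` result is exactly the falsifier discussed under "What would settle it" below: a method that fits no dimension.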

If this is right

  • Visual token compression reduces the quadratic complexity of attention mechanisms caused by high-resolution inputs.
  • Memory management and serving techniques improve efficiency when deploying large models at scale.
  • Efficient architectural designs lower the overall computational load during inference.
  • Advanced decoding strategies accelerate the output generation process.
  • Critical examination of current limitations helps define open problems for future multimodal efficiency research.
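As a toy instance of the first bullet, similarity-based token merging (in the spirit of Token Merging, ref. 6) can be sketched as finding the most redundant pair of visual tokens and averaging them. Real methods use bipartite matching inside attention blocks; this minimal version, with made-up 2-d token vectors, only shows the core idea:

```python
import math

# Minimal, illustrative sketch of similarity-based visual token merging.
# Not the algorithm of any surveyed paper; tokens are toy 2-d vectors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def merge_most_similar_pair(tokens):
    """Merge the single most-similar adjacent token pair by averaging."""
    best = max(range(len(tokens) - 1), key=lambda i: cosine(tokens[i], tokens[i + 1]))
    merged = [(x + y) / 2 for x, y in zip(tokens[best], tokens[best + 1])]
    return tokens[:best] + [merged] + tokens[best + 2:]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # first two are near-duplicates
print(len(merge_most_similar_pair(tokens)))      # 2 — the redundant pair collapsed
```

Applied repeatedly, each merge removes one token, and the quadratic attention cost shrinks accordingly.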

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a reference frame for researchers to classify and compare new inference methods consistently.
  • Combining techniques across the four dimensions might produce efficiency improvements larger than any single category alone.
  • Real-world benchmarks on high-resolution multimodal tasks could test whether the taxonomy covers emerging approaches.
  • Resolving the identified open problems may support wider use of these models on devices with limited compute resources.
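The limited-compute concern in the last bullet is largely KV-cache arithmetic. The sketch below estimates cache size under hypothetical decoder dimensions (fp16; the layer, head, and token counts are assumptions, not figures from the paper):

```python
# Back-of-the-envelope KV-cache size for an LVLM decoder.
# 2 tensors (K and V) * layers * tokens * heads * head_dim * bytes per element.
# All model dimensions below are hypothetical.

def kv_cache_bytes(num_tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * num_tokens * heads * head_dim * dtype_bytes

with_image = kv_cache_bytes(256 + 2048)  # visual tokens dominate the prompt
print(with_image / 2**30)                # 1.125 GiB before a single token is generated
```

Because the cache grows linearly with token count, the memory-management dimension (paging, eviction, compression) directly determines what fits on an edge device.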

Load-bearing premise

That the four-dimension taxonomy is systematic and complete, and that the identified limitations accurately reflect the full literature.

What would settle it

A new optimization method for LVLM inference that cannot be placed in any of the four taxonomy dimensions or that shows the stated limitations do not hold across recent work.

Figures

Figures reproduced from arXiv: 2603.27960 by Bo Han, Surendra Pathak.

Figure 1: Schematic of a standard Large Vision Language Model (LVLM) architecture. The vision encoder and text encoder convert the visual and text inputs to corresponding tokens. Vision tokens are further transformed by the projector before being fused with the text tokens. The aggregated tokens are then passed to the backbone language model for autoregressive generation. Advantages of optimization in this portion of…
Figure 2: Taxonomy of Efficient Inference Strategies for Large Vision-Language Models (LVLMs).
read the original abstract

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of visual tokens from high-resolution input data aggravates the situation due to the quadratic complexity of attention mechanisms. To address these issues, the research community has developed several optimization frameworks. This paper presents a comprehensive survey of the current state-of-the-art techniques for accelerating LVLM inference. We introduce a systematic taxonomy that categorizes existing optimization frameworks into four primary dimensions: visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies. Furthermore, we critically examine the limitations of these current methodologies and identify critical open problems to inspire future research directions in efficient multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper surveys techniques for accelerating inference in Large Vision-Language Models (LVLMs). It introduces a systematic four-dimension taxonomy covering visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies, while critically examining limitations of current methods and identifying open problems to guide future work.

Significance. If the taxonomy proves systematic and the coverage representative, the survey could serve as a useful reference for organizing the literature on LVLM efficiency, helping researchers navigate deployment challenges in multimodal systems.

minor comments (1)
  1. [Abstract] The claim of a 'systematic taxonomy' would be strengthened by briefly stating the paper-selection criteria or the number of works reviewed.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of our survey on inference optimization techniques for Large Vision-Language Models. We appreciate the acknowledgment that a systematic four-dimension taxonomy covering visual token compression, memory management and serving, efficient architectural design, and advanced decoding strategies, along with discussion of limitations and open problems, could serve as a useful reference for the community. We note the 'uncertain' recommendation but observe that no specific major comments were provided in the report.

Circularity Check

0 steps flagged

No circularity: survey with external literature review only

full rationale

This is a survey paper that introduces a four-dimension taxonomy to organize existing LVLM inference acceleration techniques drawn from the broader literature. No derivations, equations, fitted parameters, predictions, or self-referential chains appear in the abstract or described structure. The taxonomy is presented as an organizational framework rather than a result derived from internal assumptions or prior self-citations. All claims rest on external references, satisfying the condition for a self-contained review with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a literature survey and does not introduce or rely on any free parameters, axioms, or invented entities beyond standard background knowledge in machine learning.

pith-pipeline@v0.9.0 · 5422 in / 1028 out tokens · 51292 ms · 2026-05-14T21:37:33.204559+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 8 internal anchors

  1. Agarwal, S., Acun, B., Hosmer, B., Elhoushi, M., Lee, Y., Venkataraman, S., Papailiopoulos, D., Wu, C.J.: Chai: Clustered head attention for efficient LLM inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 291–312 (2024)

  2. Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)

  3. Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: DivPrune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 9392–9401 (2025)

  4. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  5. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)

  6. Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461 (2022)

  7. Brandon, W., Nrusimha, A., Qian, K., Ankner, Z., Jin, T., Song, Z., Ragan-Kelley, J.: Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431 (2023)

  8. Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., Xiao, W.: PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In: Second Conference on Language Modeling (2025), https://openreview.net/forum?id=ayi7qezU87

  9. Cao, Q., Paranjape, B., Hajishirzi, H.: PuMer: Pruning and merging tokens for efficient vision language models. arXiv preprint arXiv:2305.17530 (2023)

  10. Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: European Conference on Computer Vision, pp. 19–35. Springer (2024)

  11. Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198 (2024)

  12. Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023)

  13. Dao, T.: FlashAttention-2: Faster attention with better parallelism and work partitioning. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=mZn2Xyh9Ec

  14. Dao, T., Fu, D., Ermon, S., Rudra, A., Ré, C.: FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35, 16344–16359 (2022)

  15. Devoto, A., Zhao, Y., Scardapane, S., Minervini, P.: A simple and effective L2 norm-based strategy for KV cache compression. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18476–18499 (2024)

  16. Fan, S., Jiang, X., Li, X., Meng, X., Han, P., Shang, S., Sun, A., Wang, Y.: Not all layers of LLMs are necessary during inference. In: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pp. 5083–5091 (2025)

  17. Fu, T., Liu, T., Han, Q., Dai, G., Yan, S., Yang, H., Ning, X., Wang, Y.: FrameFusion: Combining similarity and importance for video token reduction on large vision language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22654–22663 (2025)

  18. Gagrani, M., Goel, R., Jeon, W., Park, J., Lee, M., Lott, C.: On speculative decoding for multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8285–8289 (2024)

  19. Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024)

  20. Holmes, C., Tanaka, M., Wyatt, M., Awan, A.A., Rasley, J., Rajbhandari, S., Aminabadi, R.Y., Qin, H., Bakhtiari, A., Kurilenko, L., et al.: DeepSpeed-FastGen: High-throughput text generation for LLMs via MII and DeepSpeed-Inference. arXiv preprint arXiv:2401.08671 (2024)

  21. Hooper, C.R.C., Kim, S., Mohammadzadeh, H., Maheswaran, M., Zhao, S., Paik, J., Mahoney, M.W., Keutzer, K., Gholami, A.: Squeezed attention: Accelerating long context length LLM inference. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 32631–32652 (2025)

  22. Hu, C., Huang, H., Hu, J., Xu, J., Chen, X., Xie, T., Wang, C., Wang, S., Bao, Y., Sun, N., et al.: MemServe: Context caching for disaggregated LLM serving with elastic memory pool. arXiv preprint arXiv:2406.17565 (2024)

  23. Hu, C., Huang, H., Xu, L., Chen, X., Wang, C., Xu, J., Chen, S., Feng, H., Wang, S., Bao, Y., et al.: ShuffleInfer: Disaggregate LLM inference for mixed downstream workloads. ACM Transactions on Architecture and Code Optimization 22(2), 1–24 (2025)

  24. Hyun, J., Hwang, S., Han, S.H., Kim, T., Lee, I., Wee, D., Lee, J.Y., Kim, S.J., Shim, M.: Multi-granular spatio-temporal token merging for training-free acceleration of video LLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23990–24000 (2025)

  25. Jang, D., Park, S., Yang, J.Y., Jung, Y., Yun, J., Kundu, S., Kim, S.Y., Yang, E.: LANTERN: Accelerating visual autoregressive models with relaxed speculative decoding. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=98d7DLMGdt

  26. Ji, Y., Zhang, J., Xia, H., Chen, J., Shou, L., Chen, G., Li, H.: SpecVLM: Enhancing speculative decoding of video LLMs via verifier-guided token pruning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 7216–7230 (2025)

  27. Jiang, A., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D., Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)

  28. Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13700–13710 (2024)

  29. Kang, J., Shu, H., Li, W., Zhai, Y., Chen, X.: ViSpec: Accelerating vision-language models with vision-aware speculative decoding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025), https://openreview.net/forum?id=x2BsIdJJJW

  30. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: Fast autoregressive transformers with linear attention. In: International Conference on Machine Learning, pp. 5156–5165. PMLR (2020)

  31. Kim, J.H., Yeom, J., Yun, S., Song, H.O.: Compressed context memory for online language model interaction. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=64kSvC4iPg

  32. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the 29th Symposium on Operating Systems Principles, pp. 611–626 (2023)

  33. Li, Y., Wang, C., Jia, J.: LLaMA-VID: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision, pp. 323–340. Springer (2024)

  34. Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., Chen, D.: SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems 37, 22947–22970 (2024)

  35. Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: MoE-LLaVA: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)

  36. Lin, B., Zhang, C., Peng, T., Zhao, H., Xiao, W., Sun, M., Liu, A., Zhang, Z., Li, L., Qiu, X., et al.: Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache. arXiv preprint arXiv:2401.02669 (2024)

  37. Liu, H., Zaharia, M., Abbeel, P.: RingAttention with blockwise transformers for near-infinite context. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=WsRHpHH4s0

  38. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306 (2024)

  39. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)

  40. Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: NVILA: Efficient frontier visual language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4122–4134 (2025)

  41. McKinzie, B., Gan, Z., Fauconnier, J.P., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Belyi, A., et al.: MM1: Methods, analysis and insights from multimodal LLM pre-training. In: European Conference on Computer Vision, pp. 304–323. Springer (2024)

  42. Nawrot, P., Łańcucki, A., Chochowski, M., Tarjan, D., Ponti, E.M.: Dynamic memory compression: Retrofitting LLMs for accelerated inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 37396–37412 (2024)

  43. Pathak, S., Han, B.: ASAP: Attention-shift-aware pruning for efficient LVLM inference. arXiv preprint arXiv:2603.14549 (2026)

  44. Qin, Z., Cao, Y., Lin, M., Hu, W., Fan, S., Cheng, K., Lin, W., Li, J.: CAKE: Cascading and adaptive KV cache eviction with layer preferences. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=EQgEMAD4kv

  45. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  46. Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R.Y., Awan, A.A., Rasley, J., He, Y.: DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In: International Conference on Machine Learning, pp. 18332–18346. PMLR (2022)

  47. Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., Orr, D.: SparQ attention: Bandwidth-efficient LLM inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 42558–42583 (2024)

  48. Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., Dao, T.: FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems 37, 68658–68685 (2024)

  49. Shang, Y., Cai, M., Xu, B., Lee, Y.J., Yan, Y.: LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22857–22867 (2025)

  50. Shao, K., Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: HoliTom: Holistic token merging for fast video large language models. arXiv preprint arXiv:2505.21334 (2025)

  51. Shen, L., Gong, G., He, T., Zhang, Y., Liu, P., Zhao, S., Ding, G.: FastVID: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187 (2025)

  52. Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., et al.: LongVU: Spatiotemporal adaptive compression for long video-language understanding. In: International Conference on Machine Learning, pp. 54582–54599. PMLR (2025)

  53. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., Zhang, C.: FlexGen: High-throughput generative inference of large language models with a single GPU. In: International Conference on Machine Learning, pp. 31094–31116. PMLR (2023)

  54. Song, D., Wang, W., Chen, S., Wang, X., Guan, M.X., Wang, B.: Less is more: A simple yet effective token reduction method for efficient multi-modal LLMs. In: Proceedings of the 31st International Conference on Computational Linguistics, pp. 7614–7623 (2025)

  55. Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., Han, S.: Quest: Query-aware sparsity for efficient long-context LLM inference. In: Proceedings of the 41st International Conference on Machine Learning, pp. 47901–47911 (2024)

  56. Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: DyCoke: Dynamic compression of tokens for fast video large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18992–19001 (2025)

  57. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  58. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  59. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  60. Wan, Z., Wu, X., Zhang, Y., Xin, Y., Tao, C., Zhu, Z., Wang, X., Luo, S., Xiong, J., Wang, L., Zhang, M.: D2O: Dynamic discriminative operations for efficient long-context inference of large language models. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=HzBfoUdjHt

  61. Wang, H., Nie, Y., Ye, Y., Wang, Y., Li, S., Yu, H., Lu, J., Huang, C.: Dynamic-VLM: Simple dynamic visual token compression for VideoLLM. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20812–20823 (2025)

  62. Wang, X., Zhang, J., Wang, T., Zhang, H., Zheng, F.: Seeing more, saying more: Lightweight language experts are dynamic video token compressors. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 541–558 (2025)

  63. Wang, Y., Xiao, Z.: LoMA: Lossless compressed memory attention. arXiv preprint arXiv:2401.09486 (2024)

  64. Wu, B., Liu, S., Zhong, Y., Sun, P., Liu, X., Jin, X.: LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism. In: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 640–654 (2024)

  65. Wu, B., Zhong, Y., Zhang, Z., Liu, S., Liu, F., Sun, Y., Huang, G., Liu, X., Jin, X.: Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920 (2023)

  66. Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)

  67. Xiao, C., Zhang, P., Han, X., Xiao, G., Lin, Y., Zhang, Z., Liu, Z., Sun, M.: InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory. Advances in Neural Information Processing Systems 37, 119638–119661 (2024)

  68. Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=NG7sS51zVF

  69. Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., et al.: PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247 (2024)

  70. Yang, S., Chen, Y., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: VisionZip: Longer is better but not necessary in vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19792–19802 (2025)

  71. Ye, L., Tao, Z., Huang, Y., Li, Y.: ChunkAttention: Efficient self-attention with prefix-aware KV cache and two-phase partition. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11608–11620 (2024)

  72. Yu, G.I., Jeong, J.S., Kim, G.W., Kim, S., Chun, B.G.: Orca: A distributed serving system for Transformer-based generative models. In: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538 (2022)

  73. Yunzhuzhang, Lu, Y., Wang, T., Rao, F., Yang, Y., Zhu, L.: FlexSelect: Flexible token selection for efficient long video understanding. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  74. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986 (2023)

  75. Zhang, H., Fu, Y.: VQToken: Neural discrete token representation learning for extreme token reduction in video large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  76. Zhang, Q., Cheng, A., Lu, M., Zhang, R., Zhuo, Z., Cao, J., Guo, S., She, Q., Zhang, S.: Beyond text-visual attention: Exploiting visual cues for effective token pruning in VLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20857–20867 (2025)

  77. Zhang, Y., Fan, C.K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D., Okuno, T., Nakata, Y., Keutzer, K., et al.: SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417 (2024)

  78. Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al.: H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36, 34661–34710 (2023)

  79. Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C.H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J.E., et al.: SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37, 62557–62583 (2024)

  80. Zheng, Z., Ji, X., Fang, T., Zhou, F., Liu, C., Peng, G.: BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching. arXiv preprint arXiv:2412.03594 (2024)

Showing first 80 references.