WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
Pith reviewed 2026-05-09 16:02 UTC · model grok-4.3
The pith
WindowQuant applies mixed-precision KV cache quantization to VLMs at the window level using similarity to the text prompt for faster inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WindowQuant pairs a window-level quantization search module, which sets bit-width configurations for KV cache windows from their similarity scores to the text prompt, with a window-level KV cache computation module that reorders the windows before quantization to remove hardware inefficiency; together they deliver better inference speed and memory use than token-granularity mixed-precision baselines on multiple datasets.
What carries the argument
Window-level similarity scoring that assigns bit widths for mixed-precision KV cache quantization and enables reordering for hardware-efficient computation.
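A minimal sketch of how such a scoring-to-bit-width mapping could work. The cosine-similarity metric, the (0.6, 0.3) cutoffs, and the three-tier bit schedule are all hypothetical illustrations; the paper does not publish its metric or schedule. Following this review's reading, higher-similarity windows are assumed to tolerate more aggressive quantization:

```python
import numpy as np

def assign_window_bitwidths(visual_windows, text_embedding, thresholds=(0.6, 0.3)):
    """Map each visual-token window to a KV cache bit width via its
    cosine similarity to the pooled text-prompt embedding.

    All names and the (0.6, 0.3) cutoffs are hypothetical; the paper
    does not publish its similarity metric or bit schedule.
    """
    bits = []
    for window in visual_windows:          # window: (window_len, dim) array
        pooled = window.mean(axis=0)       # mean-pool tokens in the window
        sim = pooled @ text_embedding / (
            np.linalg.norm(pooled) * np.linalg.norm(text_embedding) + 1e-8
        )
        # Per the review's reading, higher-similarity windows are assumed
        # to tolerate more aggressive (lower-bit) quantization.
        if sim >= thresholds[0]:
            bits.append(2)
        elif sim >= thresholds[1]:
            bits.append(4)
        else:
            bits.append(8)
    return bits
```

Because the per-window decision is a single pooled dot product, the search cost scales with the number of windows rather than the number of tokens, which is the claimed speedup over token-granularity search.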
If this is right
- VLMs can handle longer visual sequences with lower peak memory and faster per-token processing.
- Mixed-precision KV cache becomes practical on existing GPUs without custom kernels for every token.
- Accuracy holds across datasets while beating prior KV cache quantization techniques.
- Deployment of video language models becomes feasible on edge hardware with limited memory.
Where Pith is reading between the lines
- The same window-similarity idea could be tried in standard large language models to quantize long context caches.
- Dynamic window sizing based on content change rate might further improve the accuracy-speed trade-off.
- Hardware vendors could add reorder buffers tuned to this pattern to make mixed-precision even faster.
- If the similarity proxy holds, quantization pipelines might skip some full accuracy validation steps.
Load-bearing premise
Similarity scores between visual token windows and the text prompt can pick bit widths that keep overall model accuracy without checking tokens one by one.
What would settle it
A test set where windows rated highly similar to the prompt still produce large accuracy drops after low-bit quantization, or where the reordering step fails to increase measured throughput.
Original abstract
Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in VLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called WindowQuant, which employs window-adaptive mixed-precision quantization to optimize the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, maintaining the model accuracy. Furthermore, window-level KV cache computation reorders the KV cache windows before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLM models and KV cache quantization methods on various datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes WindowQuant, a window-adaptive mixed-precision KV cache quantization method for VLMs. It uses a window-level quantization search that assigns bit-widths to visual token windows according to their similarity scores with the text prompt, paired with a reordering step in the window-level KV cache computation module to mitigate hardware inefficiency from mixed-precision operations. The central claim is that this maintains model accuracy while outperforming existing VLM models and KV cache quantization methods across various datasets.
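The reordering step the summary mentions can be sketched as grouping windows by assigned precision so that each bit width runs through one contiguous kernel pass rather than interleaved mixed-precision dispatches. This is a sketch under assumptions; the paper's actual memory layout and kernels are not described:

```python
def reorder_windows_by_bitwidth(kv_windows, bitwidths):
    """Sort KV cache windows so same-precision windows are contiguous.

    Returns the permutation (needed to map attention outputs back to the
    original window order) alongside the reordered windows. Python's sort
    is stable, so relative order within each precision group is preserved.
    """
    order = sorted(range(len(kv_windows)), key=lambda i: bitwidths[i])
    reordered = [kv_windows[i] for i in order]
    return order, reordered

# Example: windows assigned 8, 2, and 4 bits regroup as [2-bit, 4-bit, 8-bit].
order, grouped = reorder_windows_by_bitwidth(["w0", "w1", "w2"], [8, 2, 4])
```

The permutation must be retained so that attention over the reordered cache can be mapped back to the original token positions; any per-position state (e.g. rotary phase) would have to travel with its window.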
Significance. If the experimental validation holds, the work could meaningfully advance efficient inference for VLMs by replacing per-token search with a faster window-level heuristic and addressing hardware costs of mixed-precision KV caches, potentially lowering memory footprint and latency for long visual sequences without accuracy loss.
Major comments (2)
- Abstract: the assertion that 'extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLM models and KV cache quantization methods' supplies no quantitative results, baselines, error bars, ablation details, or dataset names, rendering the central performance claim unevaluable from the manuscript text.
- Window-level quantization search (method description): the core assumption that similarity scores between visual token windows and the text prompt reliably predict bit-width configurations that preserve end-to-end accuracy is presented without any correlation analysis, ablation study, or sensitivity measurement showing that higher-similarity windows tolerate aggressive quantization; this proxy is load-bearing for the method's correctness and is not shown to be a valid surrogate for per-window quantization tolerance.
Minor comments (1)
- Abstract: the phrasing 'outperforms state-of-the-art VLM models' is imprecise, as the contribution targets KV cache quantization rather than end-to-end VLM architecture changes; clarify the exact baselines compared.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that the abstract and method description require strengthening with additional quantitative details and validation of the core heuristic. We will revise the manuscript accordingly to address these points.
Point-by-point responses
- Referee: Abstract: the assertion that 'extensive experiments demonstrate that WindowQuant outperforms state-of-the-art VLM models and KV cache quantization methods' supplies no quantitative results, baselines, error bars, ablation details, or dataset names, rendering the central performance claim unevaluable from the manuscript text.
  Authors: We agree that the abstract should provide concrete quantitative highlights to allow readers to evaluate the claims immediately. In the revised version, we will expand the abstract to include specific results such as accuracy retention (e.g., within 1% of full-precision baseline), latency reductions (e.g., 1.8x speedup), and memory savings on named datasets including VideoChat, LLaVA, and ActivityNet, with explicit baselines (per-token mixed-precision methods and uniform quantization) and mention of multi-run error bars. revision: yes
- Referee: Window-level quantization search (method description): the core assumption that similarity scores between visual token windows and the text prompt reliably predict bit-width configurations that preserve end-to-end accuracy is presented without any correlation analysis, ablation study, or sensitivity measurement showing that higher-similarity windows tolerate aggressive quantization; this proxy is load-bearing for the method's correctness and is not shown to be a valid surrogate for per-window quantization tolerance.
  Authors: The referee is correct that the similarity-based proxy requires explicit validation to demonstrate it is a reliable surrogate. While end-to-end experiments in the current manuscript show accuracy preservation, we did not include dedicated correlation or sensitivity analyses. In the revision, we will add a new subsection with (1) scatter plots correlating window-text similarity scores against assigned bit-widths and measured per-window accuracy impact, (2) an ablation varying the similarity threshold and reporting resulting accuracy/latency trade-offs, and (3) sensitivity measurements confirming higher-similarity windows tolerate lower precision. This will directly address the load-bearing assumption. revision: yes
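The correlation analysis promised in the second response could be as simple as checking whether window-prompt similarity tracks measured per-window accuracy drop under low-bit quantization. All numbers below are hypothetical placeholders, not results from the paper:

```python
def pearson_corr(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-window measurements: similarity to the prompt vs.
# accuracy drop (%) when that window alone is quantized to 2 bits.
similarities = [0.9, 0.7, 0.5, 0.3, 0.1]
accuracy_drops = [0.2, 0.5, 1.1, 2.4, 3.0]

r = pearson_corr(similarities, accuracy_drops)
# A strongly negative r would support the load-bearing premise that
# higher-similarity windows tolerate aggressive quantization.
```

A weak or positive correlation on real measurements would be exactly the settling evidence the review asks for: it would show the similarity proxy is not a valid surrogate for per-window quantization tolerance.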
Circularity Check
No circularity: WindowQuant uses an external similarity heuristic for bit allocation with no self-referential derivations or fitted predictions.
Full rationale
The paper introduces a window-level similarity heuristic to assign KV cache bit-widths without any equations, derivations, or parameter fitting shown in the abstract or described method. The core claim rests on empirical validation via experiments rather than mathematical reduction to inputs. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the similarity-to-bitwidth mapping. The reordering step addresses hardware efficiency separately. This is a standard non-circular empirical proposal; the lack of shown correlation between similarity and quantization tolerance is a correctness concern, not circularity.