arxiv: 2604.06694 · v1 · submitted 2026-04-08 · 💻 cs.SD

Recognition: unknown

AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

Yuxuan Wang , Peize He , Xiyan Gui , Xiaoqian Liu , Junhao He , Xuyang Liu , Zichen Wen , Xuming Hu

show 1 more author

Linfeng Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:23 UTC · model grok-4.3

classification 💻 cs.SD

keywords KV cache compressionLarge Audio Language ModelsAudioKVSpectral Score Smoothingattention head analysismultimodal inferenceaudio signal processing

0 comments

The pith

AudioKV compresses audio model KV cache to 40% with 0.45% accuracy drop

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AudioKV to solve the memory bottleneck of KV caches in Large Audio-Language Models during long-context inference. It shows that generic KV compression methods break down on audio inputs because they ignore the continuous temporal structure of acoustic signals. AudioKV instead locates audio-critical attention heads by studying attention patterns on ASR tasks and routes KV cache budgets toward those heads using a semantic-acoustic alignment step. It further applies Spectral Score Smoothing, an FFT filter that removes high-frequency noise from token importance scores, to produce smoother and more balanced eviction decisions. The result is that 40% compression incurs only a 0.45% accuracy loss on Qwen3-Omni-30B while conventional methods trigger repetition and collapse.

Core claim

AudioKV identifies modality-specialized attention heads from ASR attention analysis, allocates KV budgets preferentially to them via semantic-acoustic alignment, and applies FFT-based Spectral Score Smoothing to importance scores, enabling KV cache eviction at high ratios while preserving accuracy far better than standard techniques across Qwen and Gemma LALMs.

What carries the argument

Semantic-acoustic alignment that ranks attention heads by ASR importance combined with Spectral Score Smoothing (SSS), an FFT global filter that suppresses high-frequency fluctuations in token scores to recover smooth trends for eviction.

If this is right

LALMs can run longer audio contexts on the same hardware without retraining.
Standard eviction policies that cause repetition become unnecessary for audio inputs.
The approach works across multiple model families including Qwen and Gemma series.
Overall inference memory and compute drop while task accuracy stays close to the uncompressed baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same head-prioritization idea may transfer to video-language models where one modality also has strong temporal continuity.
Applying Spectral Score Smoothing alone to existing LLM eviction methods could reduce jitter in token selection without audio-specific changes.
Real-time streaming audio applications could benefit if the smoothing step runs incrementally rather than on full context.

Load-bearing premise

That heads found critical on ASR tasks will stay the most important across other audio-language tasks and that the alignment step will correctly rank audio-relevant tokens without any retraining.

What would settle it

Measuring a drop larger than 5% accuracy or noticeable repetition on any evaluated LALM at exactly 40% compression would show the method does not maintain near-full performance.

Figures

Figures reproduced from arXiv: 2604.06694 by Junhao He, Linfeng Zhang, Peize He, Xiaoqian Liu, Xiyan Gui, Xuming Hu, Xuyang Liu, Yuxuan Wang, Zichen Wen.

**Figure 2.** Figure 2: Distribution shift of Top-K token selection before [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of AudioKV. (A) Offline identification of audio-critical attention heads. We analyze attention distributions [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Performance comparison of KV cache methods across five models with varying retain ratios on LibriSpeech-long and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative visualization of long-form decoding [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: Comparison of KV cache memory footprint. We [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes AudioKV, a KV-cache eviction framework for Large Audio-Language Models that first ranks attention heads by importance on ASR tasks to identify 'modality-specialized' heads, then preferentially allocates KV budget to those heads while applying Spectral Score Smoothing (an FFT-based low-pass filter) to the per-token importance scores. The central empirical claim is that this yields only a 0.45% accuracy drop at 40% compression on Qwen3-Omni-30B, while baseline eviction methods produce catastrophic degradation and repetition.

Significance. If the transferability of ASR-derived head rankings and the effectiveness of SSS hold under broader evaluation, the work would offer a practical, hardware-friendly route to long-context audio inference with modest memory savings. The reliance on standard FFT operations and empirical attention analysis rather than learned parameters is a methodological strength that aids reproducibility.

major comments (3)

[Abstract and §4 (Evaluation)] The central 0.45% drop claim at 40% compression rests on the assumption that head importance rankings derived from ASR attention analysis remain stable across the evaluation tasks (audio QA, captioning, etc.). No quantitative overlap statistics or cross-task head-ranking correlation are reported, so it is unclear whether the preferential budget allocation actually protects the tokens that matter for the reported benchmarks.
[Abstract] The abstract states that 'extensive evaluations across multiple LALMs' were performed, yet supplies no baseline definitions, dataset statistics, number of runs, or statistical significance tests. Without these, the magnitude of the reported improvement over 'traditional methods' cannot be assessed.
[§3.2 (Spectral Score Smoothing)] Spectral Score Smoothing is described as an FFT-based global filter, but the manuscript does not specify the cutoff frequency, windowing, or how the smoothed scores are normalized before eviction; these choices directly affect which tokens are retained and therefore bear on the robustness claim.

minor comments (2)

[Abstract] The abstract claims 'near-full accuracy' at 40% compression; the precise metric (e.g., WER, BLEU, or task-specific accuracy) and the uncompressed baseline value should be stated explicitly.
[Figures 3-5] Figure captions and axis labels for the compression-ratio plots should include error bars or standard deviations if multiple runs were performed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Abstract and §4 (Evaluation)] The central 0.45% drop claim at 40% compression rests on the assumption that head importance rankings derived from ASR attention analysis remain stable across the evaluation tasks (audio QA, captioning, etc.). No quantitative overlap statistics or cross-task head-ranking correlation are reported, so it is unclear whether the preferential budget allocation actually protects the tokens that matter for the reported benchmarks.

Authors: We acknowledge that explicit quantitative validation of head-ranking stability (e.g., overlap statistics or Spearman correlations) across ASR and downstream tasks was not reported in the initial submission. Our design choice to derive rankings from ASR is grounded in the observation that ASR attention patterns capture fundamental acoustic-semantic alignments that transfer to QA and captioning, as evidenced by the strong empirical results on those benchmarks. In the revised manuscript we will add a dedicated analysis (new subsection in §4 or appendix) reporting top-k head overlap (Jaccard index) and rank correlations between ASR-derived rankings and those computed on the evaluation tasks, thereby directly addressing transferability. revision: yes
Referee: [Abstract] The abstract states that 'extensive evaluations across multiple LALMs' were performed, yet supplies no baseline definitions, dataset statistics, number of runs, or statistical significance tests. Without these, the magnitude of the reported improvement over 'traditional methods' cannot be assessed.

Authors: The abstract is intentionally concise; the full experimental protocol—including baseline definitions (H2O, SnapKV, etc.), dataset statistics, number of runs (averaged over three random seeds with standard deviations), and significance testing—is detailed in §4. To improve standalone readability we will revise the abstract to briefly note the evaluated model families, compression ratios, and that all results include run statistics, while retaining the pointer to §4 for complete information. revision: partial
Referee: [§3.2 (Spectral Score Smoothing)] Spectral Score Smoothing is described as an FFT-based global filter, but the manuscript does not specify the cutoff frequency, windowing, or how the smoothed scores are normalized before eviction; these choices directly affect which tokens are retained and therefore bear on the robustness claim.

Authors: We agree that these implementation details are essential for reproducibility. The revised §3.2 will explicitly state the cutoff frequency (normalized frequency 0.05), the window function (Hamming), and the post-smoothing min-max normalization to [0,1] before eviction. We will also insert the precise mathematical formulation and a short ablation on cutoff frequency to demonstrate that performance remains stable within a reasonable range. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical head ranking and standard FFT smoothing are independent of target metrics

full rationale

The paper's core steps—identifying modality-specialized heads via ASR attention analysis and applying Spectral Score Smoothing via FFT—are presented as empirical procedures and standard signal-processing operations. No equations, uniqueness theorems, or self-citations are invoked to derive the eviction policy or performance claims. The 0.45% accuracy drop at 40% compression is reported as an experimental outcome on Qwen3-Omni-30B rather than a quantity forced by construction from the method's own inputs. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review identifies no explicit free parameters, axioms, or invented entities; the method builds on standard transformer attention and FFT operations already present in the literature.

pith-pipeline@v0.9.0 · 5546 in / 969 out tokens · 64624 ms · 2026-05-10T18:23:34.283078+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
cs.SD 2026-04 unverdicted novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Reference graph

Works this paper leans on

31 extracted references · 19 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. 2023. Whis- perx: Time-accurate speech transcription of long-form audio.arXiv preprint arXiv:2303.00747(2023)

work page arXiv 2023
[2]

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long- document transformer.arXiv preprint arXiv:2004.05150(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[3]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Kem- ing Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. 2025. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069 [cs.CL] https://arxiv.org/abs/2406.02069

work page internal anchor Pith review arXiv 2025
[4]

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4960–4964. doi:10.1109/ICASSP.2016.7472621

work page doi:10.1109/icassp.2016.7472621 2016
[5]

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691(2023)

work page internal anchor Pith review arXiv 2023
[6]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

2022
[7]

arXiv preprint arXiv:2407.11550 , year =

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. 2025. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv:2407.11550 [cs.CL] https://arxiv.org/abs/2407.11550

work page arXiv 2025
[8]

Google Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Rea- soning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 [cs.CL] https://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as- a-Judge. arXiv:2411.15594 [cs.CL] https://arxiv.org/abs/2411.15594

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, and Junmo Kim
[11]

arXiv:2511.21477 [cs.CV] https://arxiv.org/abs/2511.21477

Frequency-Aware Token Reduction for Efficient Vision Transformer. arXiv:2511.21477 [cs.CV] https://arxiv.org/abs/2511.21477

work page arXiv
[12]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

2024
[13]

James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontañón. 2021. FNet: Mixing Tokens with Fourier Transforms.CoRRabs/2105.03824 (2021). arXiv:2105.03824 https://arxiv.org/abs/2105.03824

work page arXiv 2021
[14]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran A...

work page doi:10.52202/079017-0722 2024
[15]

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One?CoRRabs/1905.10650 (2019). arXiv:1905.10650 http://arxiv.org/ abs/1905.10650

work page arXiv 2019
[16]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. ArXivabs/2012.03411 (2020)

work page arXiv 2020
[17]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning. PMLR, 28492–28518

2023
[18]

Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. 2021. Global Filter Networks for Image Classification. arXiv:2107.00645 [cs.CV] https://arxiv. org/abs/2107.00645

work page arXiv 2021
[19]

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision.Advances in Neural Information Processing Systems37 (2024), 68658–68685

2024
[20]

Qundong Shi, Jie Zhou, Biyuan Lin, Junbo Cui, Guoyang Zeng, Yixuan Zhou, Ziyang Wang, Xin Liu, Zhen Luo, Yudong Wang, et al. 2026. UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models. arXiv preprint arXiv:2601.01373(2026)

work page arXiv 2026
[21]

Zhiyuan Tang, Dong Wang, Yanguang Xu, Jianwei Sun, Xiaoning Lei, Shuaijiang Zhao, Cheng Wen, Xi Tan, Chuandong Xie, Shuran Zhou, Rui Yan, Chenjia Lv, Yang Han, Wei Zou, and Xiangang Li. 2021. KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects. InNeurIPS Datasets and Benchmarks. https://api.semanticscholar.org/CorpusID:244905405

2021
[22]

Gemma Team. 2025. Gemma 3n. (2025). https://ai.google.dev/gemma/docs/ gemma-3n

2025
[23]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

2017
[24]

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.CoRRabs/1905.09418 (2019). arXiv:1905.09418 http: //arxiv.org/abs/1905.09418

work page Pith review arXiv 2019
[25]

Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. 2021. CoVoST 2 and Massively Multilingual Speech Translation. InInterspeech 2021. 2247–2251. doi:10. 21437/Interspeech.2021-2027

2021
[26]

Jiahui Wang, Zuyan Liu, Yongming Rao, and Jiwen Lu. 2025. SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 23177–23187

2025
[27]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. InInternational Conference on Representation Learning, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 21875–21895. https://proceedings.iclr.cc/paper_files/paper/2024/file/ 5e5fd18f863cbe...

2024
[28]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

work page internal anchor Pith review arXiv 2025
[30]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "At- las" Wang, and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Ef- ficient Generative Inference of Large Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson...

2023
[31]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2023. KIVI : Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization. (2023). doi:10.13140/ RG.2.2.28167.37282 MM ’26, November 10–14, 2026, Rio de Janeiro, Brazil Wang et al. A Appendix A.1 Ablation Study Table Methods Autom...

work page arXiv 2023