OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

Chao Zhang; Guangzhi Sun; Yixuan Li; Yudong Yang

arXiv: 2606.07577 · v1 · pith:NIVC4BKDnew · submitted 2026-05-26 · cs.AI · cs.CV · cs.SD · eess.AS

Pith reviewed 2026-06-29 16:54 UTC · model grok-4.3

classification: cs.AI · cs.CV · cs.SD · eess.AS

keywords: memory compression, KV cache, audio-visual LLM, streaming inference, modality-aware allocation, perturbation-aware selection, long video understanding

The pith

OmniMem compresses KV caches in audio-visual LLMs by allocating memory separately to visual and audio tokens and selecting non-redundant states through perturbation awareness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OmniMem as a streaming compression method for audio-visual large language models that must handle long videos without exhausting memory. Existing approaches apply uniform compression across all tokens, which ignores the large imbalance between visual and audio token counts and often discards information needed for long-range reasoning. OmniMem instead splits the memory allocation between modalities and chooses which key-value states to keep by measuring how much each one would perturb the model's output if removed. The method also includes an optional fine-tuning stage that trains the model to pack useful information into the retained states. Under fixed memory budgets the approach raises absolute accuracy by 2-4% over prior training-free baselines on long-video benchmarks, with a further 1-2% gain after fine-tuning.

Core claim

OmniMem shows that modality-aware allocation combined with perturbation-aware selection of KV states produces more compact memory footprints than uniform compression while preserving the long-range understanding required for audio-visual video tasks. The framework therefore allows streaming inference on extended video sequences without proportional growth in memory usage.

What carries the argument

Modality-aware memory allocation paired with perturbation-aware memory selection: the method budgets visual and audio contexts separately and retains only those KV states whose removal would most change the model's predictions.
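To make the selection idea concrete, here is a minimal sketch of one way a perturbation score could be computed. It is an illustration, not the paper's criterion, which is not available in the abstract. The toy single-head attention and the names `attend`, `perturbation_scores`, and `select_states` are assumptions: each KV state is scored by the L2 change in the attention output when that state alone is removed.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def perturbation_scores(query, keys, values):
    """Hypothetical leave-one-out score: output change when state i is removed."""
    if len(keys) < 2:
        raise ValueError("need at least two KV states for leave-one-out scoring")
    full = attend(query, keys, values)
    scores = []
    for i in range(len(keys)):
        reduced = attend(query, keys[:i] + keys[i + 1:], values[:i] + values[i + 1:])
        scores.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(full, reduced))))
    return scores

def select_states(query, keys, values, budget):
    """Return the indices (in original order) of the `budget` highest-scoring states."""
    if budget >= len(keys):
        return list(range(len(keys)))
    scores = perturbation_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

In this toy setting, two identical KV states each receive a low score, because removing one leaves the other to carry the same information. That is one plausible reading of why a perturbation criterion would favour non-redundant states.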

If this is right

Longer video sequences become feasible under fixed GPU memory limits.
Audio and visual information remain balanced even when their token counts differ by orders of magnitude.
Fine-tuning under the compression budget further improves retained information density.
The same selection logic can be applied at inference time without retraining the base model.
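The balance point in the list above can be illustrated with a small budget-splitting sketch. The rule here is an assumption for illustration only (the paper's actual allocation is not described in the abstract): each modality is guaranteed a floor share `min_share` of the per-layer budget, and the remainder is divided in proportion to token counts. The function name and the default floor are hypothetical.

```python
def split_budget(total_budget, n_visual, n_audio, min_share=0.2):
    """Split a per-layer token budget between visual and audio memory.

    Hypothetical rule: each modality gets at least `min_share` of the budget,
    the rest is shared in proportion to token counts, and a modality never
    receives more slots than it has tokens.
    """
    if not 0.0 <= min_share <= 0.5:
        raise ValueError("min_share must lie in [0, 0.5]")
    if n_visual + n_audio <= total_budget:
        return n_visual, n_audio  # everything fits, nothing to compress

    floor = int(total_budget * min_share)
    remaining = total_budget - 2 * floor
    visual = floor + remaining * n_visual // (n_visual + n_audio)
    audio = total_budget - visual

    # Hand back slots that a modality cannot fill.
    if audio > n_audio:
        audio = n_audio
        visual = total_budget - audio
    elif visual > n_visual:
        visual = n_visual
        audio = total_budget - visual
    return visual, audio

# 90,000 visual tokens vs 10,000 audio tokens under an 8,000-slot budget:
print(split_budget(8000, 90000, 10000))  # (5920, 2080)
```

A purely proportional split would give audio only 800 of the 8,000 slots in this example; the floor raises that to 2,080, which is the kind of protection against imbalance the list describes.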

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation criterion might extend to other multimodal settings such as text-plus-image or speech-plus-text models that also face token imbalance.
If perturbation scores correlate with semantic importance, they could serve as a general diagnostic for what information the model actually uses during long-context reasoning.
Budget-aware fine-tuning may reduce the need for ever-larger context windows by teaching models to compress their own histories more effectively.

Load-bearing premise

The argument assumes that perturbation-aware selection can reliably keep the most informative, non-redundant KV states, and that splitting memory between modalities resolves the visual-audio imbalance without creating new failure modes.

What would settle it

A controlled experiment that applies the same memory budget to long video inputs and finds that OmniMem produces lower accuracy than uniform compression baselines on VideoMME Long or LVBench would falsify the central claim.
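A sketch of how such a comparison could be scored, assuming per-question correctness is recorded for both methods on the same questions under the same budget. The helper name and the use of an exact McNemar (sign) test on discordant pairs are choices made for this illustration, not something the paper specifies.

```python
from math import comb

def paired_comparison(correct_a, correct_b):
    """Compare two methods evaluated on the same questions.

    correct_a, correct_b: sequences of 0/1 flags, one per question.
    Returns accuracies, their difference, and a two-sided exact
    McNemar p-value computed on the discordant pairs.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = a_only + b_only
    if discordant == 0:
        p_value = 1.0
    else:
        k = min(a_only, b_only)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return {"acc_a": acc_a, "acc_b": acc_b, "diff": acc_a - acc_b, "p_value": p_value}
```

A negative `diff` with a small `p_value`, where method A is OmniMem and method B is a uniform-compression baseline, would be the falsifying outcome described above.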

Figures

Figures reproduced from arXiv: 2606.07577 by Chao Zhang, Guangzhi Sun, Yixuan Li, Yudong Yang.

**Figure 2.** Plot of cosine similarity and normalized entropy against layers for audio and visual positions on a small

**Figure 3.** Fine-tuning process with allocated budgets.

**Figure 4.** Runtime statistics, including peak memory consumption and first-token latency, using video-SALMONN 2+ (8B) and (4B) on a single H800 GPU. An 8K per-layer memory size is used. … of GPU memory, with a slight < 1 GB overhead in OmniMem. This overhead was mainly due to the retention of hidden states for the cosine-similarity redundancy computation, which we found important to achieve good performance for videos over 1 hou…

**Figure 5.** Sensitivity to the key hyper-parameters λ in Eqn. (8) and temperature T in Eqn. (14). Experiments are performed using the video-SALMONN 2+ (8B) model on the VideoMME long partition and LVBench datasets.

**Figure 6.** Accuracy variation against varying memory size from 4k to 48k per layer using video-SALMONN 2+

**Figure 7.** Budget distribution across layers for video-SALMONN 2+ (8B) with
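The text accompanying Figure 4 mentions retaining hidden states for a cosine-similarity redundancy computation. A minimal sketch of what such a check might look like follows; the greedy filter, the function names, and the 0.95 threshold are assumptions for illustration, since the paper's actual formulation is not reproduced on this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def redundancy(candidate, retained):
    """Highest cosine similarity between a candidate and any retained state."""
    return max((cosine(candidate, r) for r in retained), default=0.0)

def filter_redundant(states, threshold=0.95):
    """Greedy pass: keep a state only if it is not too similar to one already kept.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    kept = []
    for i, state in enumerate(states):
        if redundancy(state, [states[j] for j in kept]) < threshold:
            kept.append(i)
    return kept

hidden = [[1.0, 0.0], [2.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
print(filter_redundant(hidden))  # [0, 2, 3]
```

The second state points in almost the same direction as the first and is dropped; keeping the hidden states around for this comparison is the kind of extra storage the Figure 4 text attributes the overhead to.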

Original abstract

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.
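The linear growth the abstract refers to can be made concrete with the standard KV cache size formula for a transformer decoder. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 2-byte values) are hypothetical and are not the configuration of video-SALMONN 2+ or Qwen-2.5-Omni.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache in bytes.

    The leading factor 2 accounts for storing one key and one value
    vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical configuration, for illustration only.
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **CONFIG)
print(per_token)                                 # 131072 bytes, i.e. 128 KiB per token
print(kv_cache_bytes(8192, **CONFIG) / 2 ** 30)  # 1.0 GiB for 8,192 cached tokens
```

Because the cost is linear in `num_tokens`, an unbounded stream grows the cache without limit, whereas a fixed per-layer memory size caps it regardless of video length.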

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OmniMem pairs modality-specific KV allocation with perturbation selection for audio-visual LLMs and reports 2-4% gains over baselines, but the evidence stays thin without ablations or implementation details.

The paper's core move is to split memory allocation between visual and audio tokens instead of treating them the same, then use perturbation to decide which KV states to drop. That combination is presented as new for streaming audio-visual models, and the experiments run on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni. The reported lift is 2-4% absolute accuracy over training-free baselines at fixed memory budgets, plus another 1-2% after budget-aware fine-tuning.

What stands out is the focus on the real token imbalance between modalities, which uniform compression methods overlook. The benchmarks are the right ones for long-form video understanding, so the setup matches the stated problem.

The soft spots are the size of the gains and the missing supporting analysis. Two to four percent is modest, and without ablations it is hard to know whether the perturbation step is doing the work or whether the modality split alone would have been enough. The fine-tuning step is described only at a high level, so it is unclear how much it depends on the compression choices or how stable the results are across different budgets. If those details are in the full paper they need to be shown clearly; otherwise the claims rest on limited evidence.

This is for people working on memory-efficient inference for multimodal LLMs who already deal with long video streams. A reader in that area might pick up the modality split idea, but the current results are not strong enough to adopt the method as is.

I would send it to peer review. The problem is practical and the approach is straightforward, even if the paper will need more experiments and transparency to hold up.

Referee Report

1 major / 0 minor

Summary. The paper proposes OmniMem, a memory-efficient streaming framework for audio-visual LLMs. It introduces a modality-aware memory allocation strategy to separately manage visual and audio contexts (addressing token imbalance) and a perturbation-aware memory selection method to retain informative, non-redundant KV states. The framework also explores budget-aware fine-tuning. Experiments on VideoMME Long, LVBench, and LVOmniBench using video-SALMONN 2+ and Qwen-2.5-Omni report 2-4% absolute accuracy gains over strong training-free compression baselines under identical memory budgets, plus an additional 1-2% after fine-tuning.

Significance. If the empirical gains are robustly validated, the work would offer a practical contribution to long-form audio-visual understanding by mitigating KV cache growth in streaming settings. The modality-aware allocation directly targets a known multi-modal imbalance, and the perturbation-aware selection provides a mechanism for preserving long-range dependencies under compression. The multi-benchmark, multi-model evaluation strengthens the case for applicability.

major comments (1)

[Abstract] The central empirical claim of consistent 2-4% accuracy improvements (plus 1-2% after fine-tuning) is presented without implementation details on the perturbation-aware selection criterion, the precise memory budget allocations per modality, ablation studies isolating each component, or error analysis. This absence makes it impossible to determine whether the reported gains are attributable to the proposed mechanisms or to uncontrolled variables in the baseline comparisons.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comment on the abstract. We address the point below.

Referee: [Abstract] The central empirical claim of consistent 2-4% accuracy improvements (plus 1-2% after fine-tuning) is presented without implementation details on the perturbation-aware selection criterion, the precise memory budget allocations per modality, ablation studies isolating each component, or error analysis. This absence makes it impossible to determine whether the reported gains are attributable to the proposed mechanisms or to uncontrolled variables in the baseline comparisons.

Authors: Abstracts are intentionally concise and high-level; they are not the venue for implementation specifics, precise allocations, full ablations, or error analysis, which are provided in the main manuscript (modality-aware allocation in Sec. 3.1, perturbation-aware selection in Sec. 3.2, ablations isolating components in Sec. 4.2, and error analysis in Sec. 4.3). Baseline comparisons control for identical memory budgets as stated in Sec. 4.1. We can revise the abstract to add one sentence briefly naming the two core mechanisms if space permits, but the detailed validation remains in the body.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical performance claims only

The paper's central claims are direct experimental measurements of accuracy improvements (2-4% over baselines, plus 1-2% post-fine-tuning) on public benchmarks (VideoMME Long, LVBench, LVOmniBench) using specific models. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-citation load-bearing premises are present in the abstract or described claims. The modality-aware allocation and perturbation-aware selection are presented as engineering mechanisms evaluated empirically, not as quantities derived by construction from their own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review performed on abstract only; full methods, equations, and experimental protocols unavailable. Free parameters and detailed assumptions cannot be enumerated precisely.

axioms (2)

Domain assumption: Audio-visual LLMs are limited by linear growth of video tokens and KV caches during long-video inference.
Explicitly stated as the core problem in the abstract.
Ad hoc to paper: Separate management of visual and audio contexts addresses severe token imbalance between modalities.
Central design choice of the proposed modality-aware strategy.

invented entities (1)

Perturbation-aware memory selection (no independent evidence)
Purpose: Identify and retain informative non-redundant KV states for compact memory.
New technique introduced to enable compression without sacrificing understanding; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5744 in / 1438 out tokens · 58189 ms · 2026-06-29T16:54:22.778187+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 9 canonical work pages · 4 internal anchors

[1] StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding. arXiv:2508.15717.
[2] StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding. 2024.
[3] StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression. arXiv:2511.07278.
[4] InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding. Proc. NeurIPS 2025.
[5] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding. 2026.
[6] video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models. arXiv:2506.15220, 2025.
[7] Qwen2.5-Omni Technical Report. arXiv:2503.20215.
[8] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. arXiv:2405.21075.
[9] LVBench: An Extreme Long Video Understanding Benchmark. arXiv:2406.08035.
[10] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs.
[11] StreamForest: Efficient Online Video Understanding with Persistent Event Memory. 2025.
[12] video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM. 2026.
[13] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. 2025.
[14] Squeezed Attention: Accelerating Long Context Length LLM Inference. arXiv:2411.09688.
[15] LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation. 2025.
[16] EvolKV: Evolutionary KV Cache Compression for LLM Inference. 2025.
[17] Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang. Streaming video question-answering with in-context video KV-cache retrieval. arXiv preprint arXiv:2503.00540, 2025.
[18] Long Context Transfer from Language to Vision. arXiv preprint arXiv:2406.16852.
[19] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra.
[20] Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao.
[21] StreamingTOM: Streaming Token Compression for Efficient Video Understanding. CVPR.
[22] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang.
[23] LongVILA: Scaling Long-Context Visual Language Models for Long Videos. 2024.
[24] Xiangrui Liu, Yan Shu, Zheng Liu, Ao Li, Yang Tian, Bo Zhao.
