pith. machine review for the scientific record.

arxiv: 2602.04804 · v2 · submitted 2026-02-04 · 💻 cs.CL

Recognition: unknown

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

classification 💻 cs.CL
keywords compression, omnisift, token, omni-llms, omni-modal, audio, baselines, framework

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
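To make the 25% token budget concrete, here is a minimal sketch of fixed-ratio token selection. It is not OmniSIFT's method: the abstract describes learned video-pruning and audio-selection modules trained end-to-end with a straight-through estimator, while this sketch only ranks tokens by a given score and keeps the top fraction. The function name `keep_top_tokens` and the score values are hypothetical, chosen for illustration.

```python
from typing import List, Sequence


def keep_top_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order.

    Hypothetical helper: a generic top-k selector, not the paper's module.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Token budget: the fraction of the sequence to retain (at least one slot).
    budget = max(1, int(len(scores) * keep_ratio))
    # Rank token positions by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering so the compressed sequence stays in order.
    return sorted(ranked[:budget])


# Made-up saliency scores for 8 tokens (illustrative values only).
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4]
kept = keep_top_tokens(scores, keep_ratio=0.25)
print(kept)  # positions of the 2 retained tokens out of 8
```

With 8 tokens and a 25% ratio the budget is 2 tokens. A hard selection like this is not differentiable, which is presumably why the abstract mentions a straight-through estimator for end-to-end training.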

This paper has not been read by Pith yet.


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    ContextGuard prunes 55% of tokens in Qwen2.5-Omni-7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.

  3. HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 6.0

    HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

  4. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  5. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.