Stateful Token Reduction for Long-Video Hybrid VLMs

Amala Sanjay Deshmukh; Andrew Tao; Guilin Liu; Jan Kautz; Jindong Jiang; Karan Sapra; Kateryna Chumachenko; Pavlo Molchanov; Wonmin Byeon; Zhiding Yu

arxiv: 2603.00198 · v2 · pith:5TDINR7Xnew · submitted 2026-02-27 · 💻 cs.CV · cs.AI

keywords: reduction, token, hybrid, layers, importance, long-video, mamba, method

Token reduction accelerates long-video vision-language models (VLMs), but existing methods target Transformers, where reduction is treated as token pruning. We study token reduction in hybrid Mamba-Transformer VLMs and find that it is stateful: Mamba layers maintain a recurrent state that accumulates information from earlier tokens, allowing information from discarded tokens to persist, so reduction behaves more like compression than dropping. We support this view with a representation-based probing method that measures how much information from discarded tokens is retained, and we analyze layer-wise sparsity and cross-layer importance stability. Our findings show that importance is sparse within layers but unstable across layers, making aggressive early pruning unreliable while hybrids remain robust to later reduction. Motivated by this, we propose a hybrid-aware token reduction framework with a low-to-high progressive schedule and a unified query-conditioned importance score for attention and Mamba layers. For Mamba, excluding the position-dependent decay from the recurrence produces a stronger selection signal. Across long-video benchmarks, our method achieves 3.8× to 4.2× prefilling speedups at a 25% token budget while maintaining near-baseline accuracy and improving with light finetuning. Hybrid models benefit from aggressive reduction, improving both efficiency and accuracy, whereas Transformers exhibit the standard trade-off. Our method also outperforms prior baselines on the same hybrid backbone and combines effectively with visual redundancy reduction methods.
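The "stateful" observation can be illustrated with a toy example. The sketch below is not the paper's method: it assumes a scalar linear recurrence h_t = decay * h_{t-1} + x_t as a stand-in for a Mamba layer, and the helper names (`linear_recurrence`, `top_k_indices`, `run`), the importance scores, and the decay value are all hypothetical. It only shows that dropping tokens after a recurrent layer leaves their contribution in the surviving states, while pruning before the layer removes it.

```python
def linear_recurrence(xs, decay=0.9):
    """Toy scalar scan (assumed stand-in for a Mamba layer):
    h_t = decay * h_{t-1} + x_t. Returns the state at every position."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def top_k_indices(scores, budget):
    """Keep the highest-scoring fraction `budget` of positions,
    returned in their original temporal order."""
    k = max(1, int(round(budget * len(scores))))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run(tokens, scores, budget=0.25):
    keep = top_k_indices(scores, budget)
    # (a) Reduce AFTER the recurrent layer: kept states already
    #     summarize every earlier token, including discarded ones.
    states = linear_recurrence(tokens)
    reduce_after = [states[i] for i in keep]
    # (b) Prune BEFORE the layer: discarded tokens never reach the state.
    reduce_before = linear_recurrence([tokens[i] for i in keep])
    return keep, reduce_after, reduce_before


# Hypothetical token values and importance scores (illustration only).
scores = [0.1, 0.2, 0.9, 0.1, 0.3, 0.1, 0.8, 0.1]
tokens = [1.0, 5.0, 2.0, 0.5, 3.0, 0.1, 4.0, 0.2]
keep, after_a, before_a = run(tokens, scores)

# Change a token that gets discarded (position 1) and rerun.
tokens_b = list(tokens)
tokens_b[1] = -5.0
_, after_b, before_b = run(tokens_b, scores)

print("kept positions:", keep)
print("reduce-after changed: ", after_a != after_b)
print("reduce-before changed:", before_a != before_b)
```

In this toy setting, editing a discarded token changes the states kept under (a) but not the outputs under (b), which is the sense in which reduction in a recurrent layer acts like compression rather than dropping.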

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC · 2026-04 · unverdicted · novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...