Beyond training: Dynamic token merging for zero-shot video understanding
2 Pith papers cite this work, both from 2026. Verdicts: 2, currently unverdicted; polarity classification is still indexing. 2 representative citing papers are listed in the explorer below.
Citing papers explorer

- Mosaic: Cross-Modal Clustering for Efficient Video Understanding
  Mosaic uses cross-modal clusters as the unit of KVCache organization in VLMs, achieving up to a 1.38x speedup in streaming long-video understanding.
- GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
  GRIP-VLM prunes visual tokens in VLMs using group-relative policy optimization, a reinforcement learning method, yielding up to a 15% inference speedup at matched accuracy compared with prior approaches.
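The dynamic token merging named in the paper's title can be illustrated with a minimal sketch: consecutive visual-token embeddings that point in nearly the same direction are averaged into a single token, shrinking the sequence with no training. The function name, the greedy consecutive-merge strategy, and the similarity threshold below are hypothetical illustrations, not the paper's actual algorithm.

```python
import math

def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge consecutive token embeddings whose cosine
    similarity exceeds `threshold`, replacing each run with its mean.

    tokens: list of equal-length embedding vectors (lists of floats).
    Returns a shorter (or equal-length) list of merged embeddings.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-12)

    merged = [list(tokens[0])]
    counts = [1]  # how many original tokens each merged slot represents
    for tok in tokens[1:]:
        if cosine(merged[-1], tok) > threshold:
            # fold the new token into the running mean of the current group
            c = counts[-1]
            merged[-1] = [(m * c + t) / (c + 1) for m, t in zip(merged[-1], tok)]
            counts[-1] = c + 1
        else:
            merged.append(list(tok))
            counts.append(1)
    return merged

# Two near-duplicate tokens collapse into one; the distinct token survives.
print(len(merge_similar_tokens([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])))  # 2
```

Because merging is zero-shot, the same routine can run at inference time on any frozen VLM's visual tokens; the trade-off is that a fixed threshold controls the compression ratio rather than a learned policy.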