arXiv preprint arXiv:2110.03860 (2021)

Token pooling in vision transformers · 2021 · arXiv 2110.03860

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.

Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

STORM is a training-free spatial-aware token reduction framework that reformulates compression on spatial units to preserve grid topology and neighborhood coherence in visual state space models.

Token Merging: Your ViT But Faster

cs.CV · 2022-10-17 · unverdicted · novelty 6.0

Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

cs.CV · 2026-06-06 · unverdicted · novelty 4.0

RAPID is a training-free, depth-aware token reduction framework for ViTs that switches from redundancy-aware pruning in shallow layers to importance-aware merging in deep layers and reports better accuracy-compression tradeoffs than ToMe on ImageNet.

citing papers explorer

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection
eess.AS · 2026-06-28 · unverdicted · none · ref 45
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
cs.AI · 2026-04-17 · unverdicted · none · ref 34
Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models
cs.CV · 2026-06-18 · unverdicted · none · ref 81
Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
cs.CV · 2026-05-21 · unverdicted · none · ref 26
RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT
cs.CV · 2026-06-06 · unverdicted · none · ref 12
