Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
Token pooling in vision transformers
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 2representative citing papers
Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.
citing papers explorer
-
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
-
Token Merging: Your ViT But Faster
Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.