arXiv preprint arXiv:2110.03860 (2021)
6 Pith papers cite this work.
Representative citing papers
Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.
citing papers explorer
- DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection
  DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
- Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
  Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
- Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models
  STORM is a training-free spatial-aware token reduction framework that reformulates compression on spatial units to preserve grid topology and neighborhood coherence in visual state space models.
- Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
  Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
- RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT
  RAPID is a training-free, depth-aware token reduction framework for ViTs that switches from redundancy-aware pruning in shallow layers to importance-aware merging in deep layers and reports better accuracy-compression tradeoffs than ToMe on ImageNet.