CoReDiT reduces self-attention FLOPs in DiTs by up to 55% via linear-time spatial coherence pruning and neighbor-based reconstruction, delivering 1.33x-1.72x speedups with maintained quality.
hub
arXiv preprint arXiv:2202.07800 , year=
16 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
STS is a two-stage pruning framework that decouples structural diversity via repulsion sampling from semantic filtering via cross-attention to reduce redundancy in visual tokens for VLMs.
S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
FedMITR uses sparse model inversion and token relabeling to improve one-shot federated learning with ViTs under non-IID conditions, delivering a tighter generalization bound via algorithmic stability analysis and better empirical performance.
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
MaMe is a differentiable matrix-only token merging method that doubles ViT-B throughput with a 2% accuracy drop on pre-trained models and enables faster, higher-quality image synthesis when paired with MaRe.
APT adaptively varies patch sizes within a single image to reduce ViT token count, delivering 40-50% throughput gains on large models with no downstream performance loss.
TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.
MVPruner is a two-stage adaptive token pruning technique for multi-view VLMs that achieves 87.3% FLOPs reduction and 4.97x prefilling speedup while retaining 98.5% accuracy on DriveLM.
VisionPulse is a step-wise visual token pruning method for LMMs that retains 5% of tokens per step, shortens reasoning traces by 11.2%, and maintains accuracy.
ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
SEPatch3D accelerates ViT-based 3D object detectors up to 57% faster than StreamPETR via dynamic patch sizing and cross-granularity enhancement while keeping comparable accuracy on nuScenes and Argoverse 2.
CATP prunes low-confidence tokens in COD Transformers and uses dual-path compensation to cut computation while preserving segmentation accuracy on boundary regions.
citing papers explorer
No citing papers match the current filters.