Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pith reviewed 2026-05-10 07:55 UTC · model grok-4.3
The pith
Training-free token reduction collapses at high compression because pairwise similarity signals lose ranking consistency in deeper layers of Vision Transformers, and a signal-agnostic error amplifier inherent to layer-wise reduction turns that instability into a cliff-like accuracy drop.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the observed collapse arises from two interacting factors: a signal-agnostic error amplifier inherent to sequential layer-wise token reduction, which predicts convex Pareto curves and a critical reduction ratio scaling as 1/L with model depth, and the shared reliance on pairwise similarity signals whose ranking consistency degrades sharply from 0.88 to 0.27 across layers. Pairwise rankings are unstable because they are exposed to O(N_p²) joint perturbations, while unary signals, exposed only to O(N_p) perturbations, are stabilized by the central limit theorem. This diagnosis yields three design principles, which are validated by constructing an alternative method built on unary cues.
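To make the counting intuition concrete, here is a minimal sketch of the dependency argument (an illustrative reading, not the paper's derivation; the specific score forms $f$ and $\mathrm{sim}$ are assumed). With $N_p$ tokens each receiving an independent perturbation $\epsilon_i$, a unary score touches one perturbed token while a pairwise score touches all of them:
$$u_i = f(x_i + \epsilon_i) \;\Rightarrow\; O(N_p) \text{ perturbed dependencies}, \qquad p_i = \max_{j \neq i} \operatorname{sim}(x_i + \epsilon_i,\, x_j + \epsilon_j) \;\Rightarrow\; O(N_p^2) \text{ jointly perturbed similarity entries},$$
so the unary score vector is exposed to only $O(N_p)$ independent disturbances in total, whereas the pairwise similarity matrix carries $O(N_p^2)$ jointly perturbed entries and its induced ranking has correspondingly more opportunities to flip.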
What carries the argument
Two elements carry it: the diagnostic framework built on ranking consistency ρ_s and off-diagonal correlation ρ_off, which decomposes the collapse into error amplification and pairwise signal instability, and the contrast between pairwise similarity signals and unary signals as carriers of ranking information.
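A toy simulation along these lines (not the paper's code; the two scorers, the noise scale, and the Spearman-based proxy for ρ_s are assumptions made for illustration) shows how such a diagnostic can be computed and why pairwise rankings tend to be the less stable of the two:

```python
import numpy as np
from scipy.stats import spearmanr

# Toy illustration (assumed setup, not the paper's experiment):
# measure how consistently each scorer ranks tokens when the
# features are re-perturbed with small independent noise.
rng = np.random.default_rng(0)
N_p, d, sigma = 196, 64, 0.05   # token count, feature dim, noise scale (assumed)
X = rng.normal(size=(N_p, d))

def pairwise_score(T):
    """Score each token by its max cosine similarity to any other token."""
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    S = T @ T.T
    np.fill_diagonal(S, -np.inf)
    return S.max(axis=1)

def unary_score(T):
    """Score each token by its own feature norm (a simple unary cue)."""
    return np.linalg.norm(T, axis=1)

def ranking_consistency(score_fn, trials=50):
    """Mean Spearman correlation between clean and perturbed rankings (a rho_s proxy)."""
    base = score_fn(X)
    rhos = [spearmanr(base, score_fn(X + sigma * rng.normal(size=X.shape)))[0]
            for _ in range(trials)]
    return float(np.mean(rhos))

print("pairwise rho_s proxy:", ranking_consistency(pairwise_score))
print("unary    rho_s proxy:", ranking_consistency(unary_score))
```

Sweeping the noise scale or the feature distribution in this sketch would mimic, in miniature, the layer-depth degradation the paper measures.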
If this is right
- All methods that rely on pairwise similarity signals will display similar cliff-like accuracy drops once compression exceeds the critical ratio.
- The critical compression ratio scales inversely with the number of layers in the model.
- Switching to unary signals raises the compression threshold at which collapse begins.
- A triage step that limits error propagation allows higher compression while preserving most of the original accuracy.
- The alternative method built from these principles retains 96.9 percent of original accuracy at 63 percent FLOPs reduction.
Where Pith is reading between the lines
- The same layer-wise instability may appear in sequence reduction tasks outside image classification whenever reductions are applied sequentially through transformer blocks.
- Layer-depth-adaptive choice between pairwise and unary signals could be tested as a general strategy for other efficiency techniques.
- The diagnostic metrics could be applied to quantify robustness in related compression settings such as attention head pruning.
- Models with fewer layers would be expected to tolerate higher per-layer reduction ratios before the error amplifier dominates.
Load-bearing premise
The measured drop in pairwise ranking consistency and the layer-wise error amplification are the dominant drivers of the collapse, rather than other unmeasured factors such as particular attention patterns or dataset properties.
What would settle it
A controlled replacement of pairwise similarity scoring with unary signals in an existing reduction pipeline, followed by measurement of whether accuracy remains above 80 percent at the 63 percent FLOPs reduction point where pairwise methods drop to the 43-65 percent range on the same Vision Transformer and dataset.
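A minimal sketch of what such a controlled swap could look like, assuming a generic layer-wise reduction loop with a pluggable scorer; the function names, the drop-lowest-score rule standing in for merging, and the fixed per-layer schedule are hypothetical, not the pipeline of any specific baseline:

```python
import torch

def unary_scores(tokens):
    # Unary cue: each token scored from its own features only (assumed choice).
    return tokens.norm(dim=-1)

def pairwise_scores(tokens):
    # Pairwise cue: each token scored by its max cosine similarity to the others.
    t = torch.nn.functional.normalize(tokens, dim=-1)
    sim = t @ t.transpose(-2, -1)
    sim.fill_diagonal_(float("-inf"))
    return sim.max(dim=-1).values

def reduce_layerwise(tokens, blocks, score_fn, r_per_layer):
    """Apply a fixed reduction schedule while only the scoring signal varies.

    tokens: (N, d) token features; blocks: list of transformer blocks;
    r_per_layer: number of tokens dropped after each block (the schedule).
    Dropping the lowest-scored tokens stands in for merging to keep the sketch short.
    """
    for block in blocks:
        tokens = block(tokens)
        scores = score_fn(tokens)
        keep = scores.argsort(descending=True)[: tokens.shape[0] - r_per_layer]
        tokens = tokens[keep.sort().values]   # preserve original token order
    return tokens

if __name__ == "__main__":
    # Dummy identity blocks just to exercise the loop; real ViT blocks would go here.
    x = torch.randn(196, 768)
    out = reduce_layerwise(x, blocks=[lambda t: t] * 4, score_fn=unary_scores, r_per_layer=16)
    print(out.shape)   # 196 - 4*16 = 132 tokens remain
```

Running the identical schedule once with pairwise_scores and once with unary_scores, and recording accuracy across compression ratios, is the comparison that would settle the question.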
read the original abstract
Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains \emph{why}. We develop a diagnostic framework with two tools, ranking consistency $\rho_s$ and off-diagonal correlation $\rho_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) shared reliance on \emph{pairwise} similarity signals whose ranking consistency degrades from $\rho_s{=}0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations) while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43--65%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that training-free token reduction methods (ToMe, ToFu, PiToMe, MCTF) for Vision Transformers exhibit closely matched cliff-like collapse at high compression because of two factors: (1) a signal-agnostic error amplifier inherent to layer-wise reduction that predicts convex Pareto curves and r_crit ∝ 1/L, and (2) shared reliance on pairwise similarity signals whose ranking consistency ρ_s degrades from 0.88 to 0.27 in deep layers due to O(N_p²) joint perturbations (versus O(N_p) for unary signals under the central limit theorem). It introduces diagnostic metrics ρ_s and ρ_off to decompose these effects, derives three design principles, and validates them by constructing CATIS (unary signals plus triage), which retains 96.9% of vanilla accuracy (81.0%) at 63% FLOPs reduction on ViT-Large/ImageNet-1K while baselines drop to 43-65%.
Significance. If the decomposition and causality claims hold, the work supplies a diagnostic framework that explains the similar failure points across disparate methods and yields concrete design principles for more robust reduction. The concrete experimental outcomes on ImageNet-1K with ViT-Large, including specific accuracy and FLOPs numbers, and the construction of CATIS as a constructive validation are clear strengths. The layer-wise model that predicts r_crit proportionality without free parameters is a positive element if the derivation is made fully explicit.
major comments (3)
- [Abstract] Abstract and diagnostic framework: the assertion that the error amplifier is signal-agnostic and that pairwise instability is the shared cause of collapse across methods is load-bearing, yet the manuscript provides no derivation or controlled experiment that swaps the scoring signal to unary while holding the reduction schedule fixed; without this, the O(N_p²) perturbation argument remains correlational with layer depth rather than shown to be the dominant driver.
- [Diagnostic framework] Layer-wise model and r_crit prediction: the proportionality r_crit ∝ 1/L is presented as a prediction from the layer-wise model rather than a post-hoc fit, but the manuscript does not supply the explicit derivation steps or verification that this prediction matches observed collapse points independently of the measured ρ_s degradation; this weakens the claim that the amplifier is isolated from signal type.
- [Experiments] CATIS validation experiments: while the reported 96.9% retention at 63% FLOPs reduction is promising, the results do not include an ablation that isolates the unary-signal component from the triage component, leaving open whether the improvement truly addresses the proposed pairwise instability mechanism or other unmeasured ViT factors such as attention patterns or residual accumulation.
minor comments (1)
- [Abstract] The abstract introduces ρ_off without a one-sentence definition, which reduces immediate readability for readers encountering the diagnostics for the first time.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. These have helped us strengthen the explicitness of our derivations, add controlled experiments, and provide component ablations to better support the causality claims. We address each major comment point-by-point below.
read point-by-point responses
- Referee: [Abstract] Abstract and diagnostic framework: the assertion that the error amplifier is signal-agnostic and that pairwise instability is the shared cause of collapse across methods is load-bearing, yet the manuscript provides no derivation or controlled experiment that swaps the scoring signal to unary while holding the reduction schedule fixed; without this, the O(N_p²) perturbation argument remains correlational with layer depth rather than shown to be the dominant driver.
Authors: We appreciate this point. The O(N_p²) joint perturbation count for pairwise signals (versus O(N_p) for unary under the central limit theorem) is derived from ranking stability analysis in Section 3.2. To establish causality beyond correlation, the revision adds a controlled experiment in Section 4.2: a unary token-importance scorer (attention-norm based) is substituted while exactly preserving the layer-wise reduction schedule of ToMe. This unary variant delays collapse to ~75% reduction (vs. ~50% for pairwise), isolating the pairwise instability mechanism. The perturbation-count derivation has also been expanded with explicit steps in the main text. revision: yes
- Referee: [Diagnostic framework] Layer-wise model and r_crit prediction: the proportionality r_crit ∝ 1/L is presented as a prediction from the layer-wise model rather than a post-hoc fit, but the manuscript does not supply the explicit derivation steps or verification that this prediction matches observed collapse points independently of the measured ρ_s degradation; this weakens the claim that the amplifier is isolated from signal type.
Authors: The layer-wise error-amplifier model begins in Section 3.1 with the recursive error-propagation equation; the full steps yielding convex Pareto curves and r_crit ∝ 1/L appear in Appendix A. We agree the main-text presentation was insufficiently explicit. The revision inserts the key derivation equations (error recursion to r_crit = (1-ρ_off)/L) into Section 3.1. We further add a verification experiment (Section 4.1) that holds ρ_s fixed in simulation across layers, confirming that the model's predicted collapse points match observed r_crit values independently of ρ_s degradation, thereby isolating the signal-agnostic amplifier. revision: yes
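Taking the rebuttal's stated form at face value, a worked example with an illustrative value of $\rho_\text{off}$ (not a number reported in the paper, and reading $r_\text{crit}$ as a per-layer reduction ratio) shows the $1/L$ scaling: with $\rho_\text{off} = 0.5$, a 24-layer ViT-Large gives $r_\text{crit} = (1 - 0.5)/24 \approx 0.021$, while a 12-layer ViT-Base gives $r_\text{crit} = (1 - 0.5)/12 \approx 0.042$, so halving the depth doubles the reduction tolerated per layer before the amplifier dominates.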
- Referee: [Experiments] CATIS validation experiments: while the reported 96.9% retention at 63% FLOPs reduction is promising, the results do not include an ablation that isolates the unary-signal component from the triage component, leaving open whether the improvement truly addresses the proposed pairwise instability mechanism or other unmeasured ViT factors such as attention patterns or residual accumulation.
Authors: We concur that isolating the unary and triage contributions is essential. The revision adds Table 3 with a full ablation on ViT-Large/ImageNet-1K: (i) pairwise baseline, (ii) unary signals only, (iii) triage only, and (iv) CATIS (unary+triage). Unary signals alone improve ρ_s stability and raise the collapse threshold; triage alone reduces erroneous merges via ρ_off suppression. Their combination reproduces the 96.9% retention at 63% FLOPs reduction. All variants use the identical ViT backbone and training-free protocol, controlling for attention patterns and residual effects. revision: yes
Circularity Check
No circularity: derivation from perturbation counting, CLT, and layer-wise model is independent of target results
full rationale
The paper derives the signal-agnostic amplifier and r_crit ∝ 1/L from a layer-wise reduction model, then separately measures ρ_s degradation on real ViT layers and validates via CATIS experiments on ImageNet-1K. The O(N_p²) vs O(N_p) instability argument follows directly from counting joint perturbations plus the central limit theorem; neither step is fitted to the collapse data nor defined in terms of the final accuracy numbers. No self-citation is load-bearing, no ansatz is smuggled in, and the Pareto convexity prediction is presented as a model output rather than a post-hoc fit. External benchmarks (ViT-Large accuracy retention) serve as an independent check, keeping the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Central Limit Theorem governs stability of unary signals under independent perturbations
Reference graph
Works this paper leans on
- [3] Bhojanapalli, S.; Chakrabarti, A.; Glasner, D.; Li, D.; Unterthiner, T.; and Veit, A. 2021. Understanding robustness of transformers for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10231--10241.
- [4] Bolya, D.; Fu, C.-Y.; Dai, X.; Zhang, P.; Feichtenhofer, C.; and Hoffman, J. 2022. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461.
- [6] Cao, Q.; Paranjape, B.; and Hajishirzi, H. 2023. PuMer: Pruning and merging tokens for efficient vision language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 12890--12903.
- [7] Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650--9660.
- [8] Chavan, A.; Shen, Z.; Liu, Z.; Liu, Z.; Cheng, K.-T.; and Xing, E. P. 2022. Vision transformer slimming: Multi-dimension searching in continuous optimization space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4931--4941.
- [9] Chen, M.; Lin, M.; Li, K.; Shen, Y.; Wu, Y.; Chao, F.; and Ji, R. 2023a. CF-ViT: A general coarse-to-fine method for vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 7042--7052.
- [10] Chen, M.; Shao, W.; Xu, P.; Lin, M.; Zhang, K.; Chao, F.; Ji, R.; Qiao, Y.; and Luo, P. 2023b. DiffRate: Differentiable compression rate for efficient vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 17164--17174.
- [11] Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; and Ré, C. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35: 16344--16359.
- [12] Darcet, T.; Oquab, M.; Mairal, J.; and Bojanowski, P. 2023. Vision transformers need registers. arXiv preprint arXiv:2309.16588.
- [13] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248--255. IEEE.
- [14] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171--4186.
- [15] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [16] Fayyaz, M.; Koohpayegani, S. A.; Jafari, F. R.; Sengupta, S.; Joze, H. R. V.; Sommerlade, E.; Pirsiavash, H.; and Gall, J. 2022. Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision, 396--414. Springer.
- [17] Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F. A.; and Brendel, W. 2018. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
- [18] He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; and Girshick, R. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000--16009.
- [19] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770--778.
- [20] Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Guo, M.; et al. 2021a. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8340--8349.
- [21] Hendrycks, D.; and Dietterich, T. 2019. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
- [22] Hendrycks, D.; and Gimpel, K. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
- [23] Hendrycks, D.; Zhao, K.; Basart, S.; Steinhardt, J.; and Song, D. 2021b. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15262--15271.
- [24] Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [25] Katharopoulos, A.; Vyas, A.; Pappas, N.; and Fleuret, F. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 5156--5165. PMLR.
- [26] Kim, M.; Gao, S.; Hsu, Y.-C.; Shen, Y.; and Jin, H. 2024. Token fusion: Bridging the gap between token pruning and token merging. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1383--1392.
- [27] Kong, Z.; Dong, P.; Ma, X.; Meng, X.; Niu, W.; Sun, M.; Shen, X.; Yuan, G.; Ren, B.; Tang, H.; et al. 2022. SPViT: Enabling faster vision transformers via latency-aware soft token pruning. In European Conference on Computer Vision, 620--640. Springer.
- [29] LeCun, Y.; Denker, J.; and Solla, S. 1989. Optimal brain damage. Advances in Neural Information Processing Systems, 2.
- [30] Lee, K.; Lee, K.; Lee, H.; and Shin, J. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems, 31.
- [31] Lee, S.; Choi, J.; and Kim, H. J. 2024. Multi-criteria token fusion with one-step-ahead attention for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15741--15750.
- [33] Mao, X.; Qi, G.; Chen, Y.; Li, X.; Duan, R.; Ye, S.; He, Y.; and Xue, H. 2022. Towards robust vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12042--12051.
- [34] Marin, D.; Chang, J.-H. R.; Ranjan, A.; Prabhu, A.; Rastegari, M.; and Tuzel, O. 2021. Token pooling in vision transformers. arXiv preprint arXiv:2110.03860.
- [35] Menghani, G. 2023. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 55(12): 1--37.
- [36] Naseer, M. M.; Ranasinghe, K.; Khan, S. H.; Hayat, M.; Shahbaz Khan, F.; and Yang, M.-H. 2021. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34: 23296--23308.
- [37] Nikzad, N.; Liao, Y.; Gao, Y.; and Zhou, J. 2025. SATA: Spatial autocorrelation token analysis for enhancing the robustness of vision transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, 9730--9739.
- [38] Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. 2023. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- [39] Pan, B.; Panda, R.; Jiang, Y.; Wang, Z.; Feris, R.; and Oliva, A. 2021. IA-RED²: Interpretability-aware redundancy reduction for vision transformers. Advances in Neural Information Processing Systems, 34: 24898--24911.
- [40] Papa, L.; Russo, P.; Amerini, I.; and Zhou, L. 2024. A survey on efficient vision transformers: Algorithms, techniques, and performance benchmarking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12): 7682--7700.
- [41] Paul, S.; and Chen, P.-Y. 2022. Vision transformers are robust learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2071--2081.
- [42] Qin, Y.; Zhang, C.; Chen, T.; Lakshminarayanan, B.; Beutel, A.; and Wang, X. 2022. Understanding and improving robustness of vision transformers through patch-based negative augmentation. Advances in Neural Information Processing Systems, 35: 16276--16289.
- [43] Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; and Dosovitskiy, A. 2021. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34: 12116--12128.
- [44] Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; and Hsieh, C.-J. 2021. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34: 13937--13949.
- [46] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211--252.
- [47] Ryoo, M. S.; Piergiovanni, A.; Arnab, A.; Dehghani, M.; and Angelova, A. 2021. TokenLearner: What can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297.
- [48] Siméoni, O.; Vo, H. V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. 2025. DINOv3. arXiv preprint arXiv:2508.10104.
- [49] Soomro, K.; Zamir, A. R.; and Shah, M. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- [50] Tang, Y.; Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; and Tao, D. 2022. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12165--12174.
- [51] Taori, R.; Dave, A.; Shankar, V.; Carlini, N.; Recht, B.; and Schmidt, L. 2020. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33: 18583--18599.
- [52] Tong, Z.; Song, Y.; Wang, J.; and Wang, L. 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35: 10078--10093.
- [53] Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347--10357. PMLR.
- [54] Touvron, H.; Cord, M.; and Jégou, H. 2022. DeiT III: Revenge of the ViT. In European Conference on Computer Vision, 516--533. Springer.
- [55] Tran, H.-C.; Nguyen, D. M.; Nguyen, T.; Le, N.; Xie, P.; Sonntag, D.; Zou, J.; Nguyen, B. T.; and Niepert, M. 2024. Accelerating transformers with spectrum-preserving token merging. Advances in Neural Information Processing Systems, 37: 30772--30810.
- [56] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- [57] Wang, H.; Ge, S.; Lipton, Z.; and Xing, E. P. 2019. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32.
- [58] Wei, S.; Ye, T.; Zhang, S.; Tang, Y.; and Liang, J. 2023. Joint token pruning and squeezing towards more aggressive compression of vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2092--2101.
- [59] Xu, Y.; Zhang, Z.; Zhang, M.; Sheng, K.; Li, K.; Dong, W.; Zhang, L.; Xu, C.; and Sun, X. 2022. Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2964--2972.
- [60] Yin, H.; Vahdat, A.; Alvarez, J. M.; Mallya, A.; Kautz, J.; and Molchanov, P. 2022. A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10809--10818.
- [61] Yu, F.; Huang, K.; Wang, M.; Cheng, Y.; Chu, W.; and Cui, L. 2022. Width & depth pruning for vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 3143--3151.
- [62] Yu, L.; and Xiang, W. 2023. X-Pruner: Explainable pruning for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24355--24363.
- [63] Zong, Z.; Li, K.; Song, G.; Wang, Y.; Qiao, Y.; Leng, B.; and Liu, Y. 2022. Self-slimmed vision transformer. In European Conference on Computer Vision, 432--448. Springer.
discussion (0)