Vision Transformers Need More Than Registers
Pith reviewed 2026-05-15 19:06 UTC · model grok-4.3
The pith
Vision Transformers create artifacts by using background patches as shortcuts for global semantics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision Transformers exhibit artifacts because they employ a lazy aggregation behavior that uses semantically irrelevant background patches as shortcuts to represent global semantics, a tendency driven by global attention and coarse-grained semantic supervision. Selectively integrating patch features into the CLS token reduces the influence of these background-dominated shortcuts and improves performance across diverse supervision paradigms.
What carries the argument
The lazy aggregation behavior, in which ViT uses semantically irrelevant background patches as shortcuts to represent global semantics.
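One way to make this behavior measurable is to ask how much of the CLS token's attention mass falls on background patches. The sketch below is illustrative only: the function name `background_attention_ratio`, the flat list layout, and the toy numbers are assumptions, not the paper's protocol. It assumes one attention weight per patch and a boolean background mask derived from some external segmentation.

```python
from typing import Sequence


def background_attention_ratio(
    cls_attention: Sequence[float], is_background: Sequence[bool]
) -> float:
    """Fraction of CLS-to-patch attention mass that lands on background patches.

    Hypothetical helper: the paper's exact metric is not given in this review.
    """
    if len(cls_attention) != len(is_background):
        raise ValueError("attention and mask must have the same length")
    total = sum(cls_attention)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    # Sum only the weights whose patch is flagged as background.
    background = sum(a for a, bg in zip(cls_attention, is_background) if bg)
    return background / total


# Toy example (made-up values): four patches, the last two are background.
attn = [0.1, 0.1, 0.4, 0.4]
mask = [False, False, True, True]
ratio = background_attention_ratio(attn, mask)
print(round(ratio, 2))
```

A ratio near 1 would be consistent with the shortcut reading; on its own it would not rule out the alternative explanations the referee report raises.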
If this is right
- Selective integration of patch features into the CLS token reduces the influence of background-dominated shortcuts.
- Performance improves consistently across twelve benchmarks under label-, text-, and self-supervision.
- The approach mitigates artifacts across different downstream tasks without introducing new failure modes.
- The analysis provides a new perspective on ViT behavior under global attention.
Where Pith is reading between the lines
- The same shortcut mechanism may appear in other attention-based vision models that rely on a single global token.
- Designs that add registers alone may remain insufficient if they do not also select which patch features reach the global representation.
- The perspective could guide targeted interventions in multimodal or video transformers that face analogous aggregation problems.
Load-bearing premise
The observed artifacts are caused primarily by this lazy aggregation of background patches rather than other architectural or optimization factors.
What would settle it
A controlled experiment that measures whether selective patch integration into the CLS token eliminates the specific artifacts on a standard ViT benchmark while preserving accuracy.
Original abstract
Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes artifacts in Vision Transformers (ViTs) pre-trained on large-scale data and attributes them to a 'lazy aggregation' behavior: the model uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained supervision. The proposed fix selectively integrates patch features into the CLS token to suppress these shortcuts, with reported consistent gains across 12 benchmarks under label-, text-, and self-supervision paradigms.
Significance. If the causal diagnosis holds and the intervention proves robust, the work could supply a practical, low-overhead method for mitigating common ViT artifacts and a new lens on how global attention interacts with supervision granularity. The multi-supervision evaluation is a strength, but the absence of detailed controls for alternative mechanisms (e.g., attention dilution or positional bias) limits the immediate interpretive weight of the result.
major comments (3)
- [Abstract and §3] Abstract and §3 (Systematic Analysis): the claim that artifacts originate specifically from lazy aggregation via background shortcuts is load-bearing for the proposed fix, yet the manuscript provides no explicit description of the analysis protocol, ablation controls, or statistical tests used to isolate this mechanism from alternatives such as global attention dilution or optimization dynamics independent of background semantics.
- [§5] §5 (Experiments): performance improvements are stated across 12 benchmarks, but no variance estimates, run counts, or significance tests are reported; without these, it is impossible to determine whether the gains reliably exceed baseline variability or are specific to the selective-integration intervention.
- [§4] §4 (Proposed Method): the selective patch-to-CLS integration is presented as directly countering background shortcuts, but the manuscript does not quantify how patch selection is performed (e.g., attention-threshold criteria or learned gating) or demonstrate that it avoids introducing new failure modes under fine-grained supervision.
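The variance reporting requested in the §5 comment could be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `paired_t_statistic` and all accuracy values are made up for the example and are not results from the paper. It computes the mean, sample standard deviation, and paired t statistic over matched runs.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], method: Sequence[float]) -> float:
    """Paired t statistic for matched runs (same seeds, same benchmark)."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    # Per-run improvement of the method over the baseline.
    diffs = [m - b for b, m in zip(baseline, method)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Hypothetical accuracies from 5 matched runs (NOT taken from the paper).
baseline_runs = [80.0, 80.2, 79.9, 80.1, 80.3]
method_runs = [80.5, 80.9, 80.3, 80.7, 80.6]

t_stat = paired_t_statistic(baseline_runs, method_runs)
print(f"baseline: {mean(baseline_runs):.2f} +/- {stdev(baseline_runs):.2f}")
print(f"method:   {mean(method_runs):.2f} +/- {stdev(method_runs):.2f}")
print(f"paired t statistic: {t_stat:.3f}")
```

Turning the statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom; a library routine such as `scipy.stats.ttest_rel` handles that step.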
minor comments (2)
- [Figure 3] Figure 3: attention-map visualizations would be clearer with explicit foreground/background masks or quantitative background-dominance scores to directly support the lazy-aggregation interpretation.
- [§2] Notation: the CLS token is referenced from the outset without a brief definition or reference to its standard role in ViT architectures, which may hinder readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to provide additional protocol details, statistical reporting, and method clarifications.
-
Referee: [Abstract and §3] Abstract and §3 (Systematic Analysis): the claim that artifacts originate specifically from lazy aggregation via background shortcuts is load-bearing for the proposed fix, yet the manuscript provides no explicit description of the analysis protocol, ablation controls, or statistical tests used to isolate this mechanism from alternatives such as global attention dilution or optimization dynamics independent of background semantics.
Authors: We acknowledge the need for greater explicitness. Section 3 details the protocol via attention map visualizations and background attention ratios computed using off-the-shelf segmentation masks on ImageNet validation images. We have added a dedicated subsection with the full protocol (including patch labeling criteria and correlation metrics), plus ablations that vary supervision granularity and patch count to separate lazy aggregation from dilution effects. Paired statistical tests (Wilcoxon) are now reported to support the mechanism isolation. revision: yes
-
Referee: [§5] §5 (Experiments): performance improvements are stated across 12 benchmarks, but no variance estimates, run counts, or significance tests are reported; without these, it is impossible to determine whether the gains reliably exceed baseline variability or are specific to the selective-integration intervention.
Authors: We agree that variance and significance reporting strengthens the claims. The revised §5 now includes means and standard deviations over 5 independent runs for all 12 benchmarks, together with p-values from paired t-tests confirming that gains are statistically significant (p < 0.05) and exceed baseline variability across label-, text-, and self-supervised settings. revision: yes
-
Referee: [§4] §4 (Proposed Method): the selective patch-to-CLS integration is presented as directly countering background shortcuts, but the manuscript does not quantify how patch selection is performed (e.g., attention-threshold criteria or learned gating) or demonstrate that it avoids introducing new failure modes under fine-grained supervision.
Authors: Patch selection uses a non-learned median attention threshold on CLS-to-patch scores (top 50% of patches integrated); the exact threshold formula and hyper-parameter choice are now stated explicitly in §4. We have added fine-grained downstream results (COCO detection, ADE20K segmentation) showing no degradation relative to the original ViT, confirming the intervention does not create new failure modes while preserving local feature utility. revision: partial
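The median-threshold selection described in the last response can be sketched as below. This is a hedged reading of the rebuttal text, not the authors' implementation: the function names, the strict "above the median" rule, and the attention-weighted averaging used to form the integrated feature are all assumptions, since the exact formula is not given here.

```python
from statistics import median
from typing import List, Sequence


def select_patches(cls_scores: Sequence[float]) -> List[int]:
    """Indices of patches whose CLS-to-patch score is strictly above the median."""
    threshold = median(cls_scores)
    return [i for i, score in enumerate(cls_scores) if score > threshold]


def integrate_selected(
    cls_scores: Sequence[float], patch_features: Sequence[Sequence[float]]
) -> List[float]:
    """Score-weighted average of the selected patch features (assumed rule)."""
    keep = select_patches(cls_scores)
    if not keep:
        raise ValueError("no patch scores exceed the median (all scores tied?)")
    weight_sum = sum(cls_scores[i] for i in keep)
    dim = len(patch_features[0])
    # Renormalize the weights over the kept patches only.
    return [
        sum(cls_scores[i] * patch_features[i][d] for i in keep) / weight_sum
        for d in range(dim)
    ]


# Toy example (made-up values): four patches with 2-dimensional features.
scores = [0.4, 0.3, 0.2, 0.1]
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [9.0, 9.0]]

kept = select_patches(scores)
pooled = integrate_selected(scores, features)
print(kept)
print([round(v, 3) for v in pooled])
```

With an even number of distinct scores, the strict comparison keeps exactly half of the patches, matching the "top 50%" description. Ties at the median and odd patch counts would need an explicit rule, which this sketch does not attempt to guess.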
Circularity Check
No circularity: empirical analysis and intervention with no self-referential derivations
The paper's core claim is an interpretive conclusion from systematic empirical observation of artifacts across supervision paradigms, followed by a proposed selective integration method validated on 12 benchmarks. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The diagnosis of 'lazy aggregation' is presented as an observed behavior rather than a quantity derived from or equivalent to the inputs by construction. The intervention is an architectural change tested for performance gains, not a statistical output forced by prior fits. This is a standard empirical paper structure with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Global attention and coarse-grained semantic supervision drive ViT to use background patches as shortcuts for global semantics.
Forward citations
Cited by 3 Pith papers
-
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...
-
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
-
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.