Recognition: 2 theorem links
· Lean Theorem
Vision Transformers Need Registers
Pith reviewed 2026-05-13 09:36 UTC · model grok-4.3
The pith
Vision transformers develop high-norm tokens in background regions that disrupt feature maps, but adding register tokens absorbs this role and cleans up representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
High-norm tokens appear in low-informative background regions of ViT feature maps and are repurposed for internal computations. Adding a few register tokens to the input sequence lets these computations occur separately, removing the artifacts from image token representations and yielding cleaner outputs for both supervised and self-supervised training.
What carries the argument
Register tokens, a small set of additional input tokens that absorb internal computations previously performed by high-norm background tokens.
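For concreteness, a minimal sketch (not the authors' implementation) of how such register tokens can be spliced into a ViT forward pass, assuming they are learned parameters appended to the class-plus-patch sequence after positional embeddings and discarded from the output; all class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Illustrative wrapper: learnable register tokens appended to a ViT token sequence."""

    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_registers=4):
        super().__init__()
        self.num_registers = num_registers
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Register tokens: a small set of learned vectors shared across all images.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) patch embeddings, positional embeddings already added.
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.registers.expand(B, -1, -1)
        # Sequence layout: [CLS] + registers + patches; registers receive no positional embedding here.
        x = torch.cat([cls, reg, patch_tokens], dim=1)
        x = self.blocks(x)
        # Register outputs are discarded: they exist only to absorb the high-norm computations.
        cls_out = x[:, 0]
        patch_out = x[:, 1 + self.num_registers:]
        return cls_out, patch_out
```

Whether the registers sit before or after the patch tokens, and whether they receive positional embeddings, are implementation choices the paper's method section would pin down; the sketch only shows that the extra tokens participate in attention and are dropped at the end.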
If this is right
- Eliminates artifacts entirely in both supervised and self-supervised ViTs
- Sets new state of the art for self-supervised models on dense visual prediction tasks
- Enables object discovery methods with larger models
- Produces smoother feature maps and attention maps for downstream processing
Where Pith is reading between the lines
- Register tokens may reduce unnecessary computation in background regions across other transformer vision models
- The method could extend to tasks like semantic segmentation by further isolating non-informative areas
- Standard ViT tokenization may need rethinking to separate image content from internal model bookkeeping
Load-bearing premise
The high-norm tokens are mainly handling internal computations rather than carrying essential image information, so extra registers can take over without degrading useful representations.
What would settle it
If adding register tokens either lowers accuracy on standard image classification or creates new high-norm artifacts in feature maps, the claim that they fully resolve the issue would be falsified.
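The second half of that test can be operationalized by counting norm outliers among the output patch tokens, which is roughly how the artifacts are characterized in the first place. A hedged sketch, assuming a frozen backbone that exposes final-layer patch features of shape (B, N, D); the outlier cutoff here is an illustrative statistical proxy, not the paper's threshold.

```python
import torch

@torch.no_grad()
def high_norm_token_rate(patch_features, threshold=None):
    """Fraction of patch tokens whose feature norm is an outlier.

    patch_features: (B, N, D) final-layer patch features from a frozen ViT.
    threshold: absolute L2-norm cutoff; if None, fall back to mean + 3*std as a rough proxy.
    """
    norms = patch_features.norm(dim=-1)              # (B, N)
    if threshold is None:
        threshold = norms.mean() + 3 * norms.std()
    return (norms > threshold).float().mean().item()

# A model whose registers really absorb the artifacts should report a rate near zero here.
```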
read the original abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies artifacts in the feature maps of Vision Transformers (both supervised and self-supervised), characterized by high-norm tokens appearing in low-informative background regions of images. These tokens are interpreted as being repurposed for internal computations within the network. The authors propose adding a small number of additional 'register' tokens to the input sequence to absorb this role. They claim that this simple modification eliminates the artifacts entirely, produces smoother feature and attention maps, achieves new state-of-the-art results on dense visual prediction tasks for self-supervised models, and facilitates object discovery with larger models.
Significance. If the central claims hold, this represents a significant practical contribution to Vision Transformer architectures. The addition of register tokens offers a minimal-change solution that improves feature quality for downstream tasks, particularly benefiting self-supervised learning pipelines and applications requiring dense predictions or object discovery. The empirical nature of the fix, while simple, could have broad adoption if the improvements are robustly demonstrated.
major comments (3)
- [§3] §3 (analysis of token norms): The interpretation that high-norm background tokens serve exclusively as computational sinks (rather than encoding any image information) is central to the motivation but relies on correlational evidence from norm statistics and attention maps. A controlled ablation, such as zeroing out these tokens or replacing their features while keeping the rest of the model fixed, would be necessary to isolate their role and rule out the possibility that they carry useful background context, which is precisely what the work's load-bearing premise assumes away; a sketch of such an intervention follows this list.
- [§5] §5 (experiments on dense tasks): The claim of setting a new state of the art on dense visual prediction tasks is load-bearing for the paper's impact but requires explicit numerical comparisons in the main results table, including the exact performance deltas over previous best methods, the number of register tokens used, and ablations showing that gains are attributable to the registers rather than other factors.
- [§4.1] §4.1 (qualitative results): The assertion that the solution 'fixes that problem entirely' for both supervised and self-supervised models needs stronger quantitative backing, such as metrics quantifying the reduction in artifact severity (e.g., background feature variance or attention entropy) before and after adding registers, beyond the qualitative examples provided.
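As flagged in the first comment above, a minimal sketch of the kind of controlled intervention that would separate "computational sink" from "useful background context": a forward hook that zeroes high-norm patch tokens at one block of a frozen model during inference. The module path, block index, and norm cutoff are placeholder assumptions, not values from the paper.

```python
import torch

def make_zeroing_hook(norm_threshold, num_prefix_tokens=1):
    """Forward hook that zeroes patch tokens whose norm exceeds a cutoff at inference time.

    num_prefix_tokens: leading tokens (CLS and any registers) that are left untouched.
    """
    def hook(module, inputs, output):
        x = output[0] if isinstance(output, tuple) else output   # (B, T, D)
        mask = (x.norm(dim=-1, keepdim=True) > norm_threshold).float()
        mask[:, :num_prefix_tokens] = 0.0                        # never zero CLS / registers
        x = x * (1.0 - mask)
        return (x,) + output[1:] if isinstance(output, tuple) else x
    return hook

# Illustrative wiring (model, block index, and cutoff are hypothetical):
# handle = model.blocks[8].register_forward_hook(make_zeroing_hook(norm_threshold=150.0))
# ...evaluate downstream accuracy with and without the hook...
# handle.remove()
```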
minor comments (2)
- [Abstract] The abstract states the claims strongly but omits any quantitative results or specific numbers, which would help readers assess the magnitude of the improvements immediately.
- [Method] Clarify the exact implementation details of how register tokens are initialized (learned parameters vs. fixed) and their interaction with positional embeddings, as this affects reproducibility and should be stated explicitly in the method section.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. Where the suggestions strengthen the evidence or clarity, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [§3] §3 (analysis of token norms): The interpretation that high-norm background tokens serve exclusively as computational sinks (rather than encoding any image information) is central to the motivation but relies on correlational evidence from norm statistics and attention maps. A controlled ablation, such as zeroing out these tokens or replacing their features while keeping the rest of the model fixed, would be necessary to isolate their role and rule out the possibility that they carry useful background context, which is precisely what the work's load-bearing premise assumes away.
Authors: We agree that the evidence in §3 is observational and correlational, relying on norm statistics and attention patterns rather than direct causal intervention. A zeroing ablation on internal activations would be a useful addition for stronger isolation of the role, but it is non-trivial to implement without modifying the forward pass in ways that could introduce new artifacts. In the revised manuscript we have added a dedicated paragraph in §3 explicitly acknowledging the correlational nature of the evidence and discussing why the register-token results provide indirect support (the high-norm role is visibly transferred to the registers while image tokens remain low-norm and semantically coherent). We do not claim the interpretation is proven beyond correlation, but maintain that the practical elimination of artifacts via registers is the core contribution. revision: partial
-
Referee: [§5] §5 (experiments on dense tasks): The claim of setting a new state of the art on dense visual prediction tasks is load-bearing for the paper's impact but requires explicit numerical comparisons in the main results table, including the exact performance deltas over previous best methods, the number of register tokens used, and ablations showing that gains are attributable to the registers rather than other factors.
Authors: We accept this point. The original submission presented results but did not tabulate deltas or isolate the register contribution with dedicated ablations. In the revised version we have updated the primary results table in §5 to report (i) exact performance numbers and deltas versus the prior best methods, (ii) the precise number of register tokens used in each experiment (typically 4), and (iii) an additional ablation column that varies only the presence of registers while holding all other hyperparameters fixed. These changes make the source of the gains explicit. revision: yes
-
Referee: [§4.1] §4.1 (qualitative results): The assertion that the solution 'fixes that problem entirely' for both supervised and self-supervised models needs stronger quantitative backing, such as metrics quantifying the reduction in artifact severity (e.g., background feature variance or attention entropy) before and after adding registers, beyond the qualitative examples provided.
Authors: We agree that quantitative metrics would make the 'fixes entirely' claim more robust. We have therefore added, in the revised §4.1, two new quantitative measures: (1) average background feature variance (computed over non-object regions identified by off-the-shelf saliency) and (2) attention entropy across layers. Both metrics show large, consistent reductions (background variance drops by approximately 80 % on average) when registers are introduced, for both supervised and self-supervised models. These numbers are reported alongside the existing qualitative figures. revision: yes
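A hedged sketch of the two metrics described in this response, assuming patch features of shape (B, N, D), a boolean background mask of shape (B, N) from some off-the-shelf saliency method, and softmax-normalized attention weights of shape (B, heads, T, T); this is illustrative, not the authors' evaluation code.

```python
import torch

@torch.no_grad()
def background_feature_variance(patch_features, background_mask):
    """Mean per-dimension variance of features restricted to background tokens.

    patch_features: (B, N, D); background_mask: (B, N) boolean, True for background patches.
    """
    per_image = []
    for feats, mask in zip(patch_features, background_mask):
        bg = feats[mask]                                  # (num_background_tokens, D)
        if bg.shape[0] > 1:
            per_image.append(bg.var(dim=0).mean())
    return torch.stack(per_image).mean().item()

@torch.no_grad()
def mean_attention_entropy(attn, eps=1e-9):
    """Average entropy of each query's attention distribution over keys.

    attn: (B, heads, T, T), rows already softmax-normalized.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)    # (B, heads, T)
    return entropy.mean().item()
```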
Circularity Check
No circularity: empirical observation leads to architectural addition without self-referential reduction
full rationale
The paper identifies high-norm tokens in background regions via direct inspection of ViT activations and attention maps, then proposes adding a small number of register tokens to the input sequence as an empirical fix. This proposal is validated through downstream experiments on supervised and self-supervised models but does not rely on any equation that defines the solution in terms of itself, any fitted parameter renamed as a prediction, or a load-bearing self-citation chain. The central claim remains an independent architectural intervention supported by ablation results and qualitative visualizations, with no derivation step that collapses to the input observations by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math · Vision Transformers process images as fixed-length sequences of patch embeddings plus a class token
invented entities (1)
-
register tokens
no independent evidence
Lean theorems connected to this paper
-
Foundation.EightTick · eight_tick_forces_D3 · echoes "We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role."
Forward citations
Cited by 27 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
-
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
-
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
-
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding
GTokenLLMs do not fully understand graph tokens, exhibiting over-sensitivity or insensitivity to instruction changes and relying heavily on text for reasoning even when graph information is preserved.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
-
Self-supervised pretraining for an iterative image size agnostic vision transformer
A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
-
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
-
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
-
Enhancing event reconstruction for $\gamma$-ray particle detector arrays using transformers
Transformer models applied to simulated water-Cherenkov array data improve gamma-hadron separation and reconstruction of direction, core position, and energy compared to established techniques.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...
-
Generative Event Pretraining with Foundation Model Alignment
GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
-
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
-
Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers
Pre-trained ViT representations combined with active learning and targeted design choices for annotations and selection improve object class retrieval in multi-object scenes.