pith. machine review for the scientific record.

arxiv: 2604.14433 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.LG

Recognition: unknown

Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

Authors on Pith no claims yet

Pith reviewed 2026-05-10 12:53 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords zero-ablation · register tokens · DINO vision transformers · feature probing · ablation controls · frozen features · performance dependence

The pith

Zero-ablation overstates how much DINO vision transformers rely on exact register content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether large performance drops from zeroing register tokens in DINO models indicate a need for their precise, image-specific values. By testing alternative replacements like mean activations, random noise, and registers from other images, it finds that these preserve performance nearly as well as the original model on classification, correspondence, and segmentation tasks. This suggests that the models depend on having some form of plausible register-like signal rather than the exact content tied to each image. The results hold across different model sizes and highlight that zeroing creates unusually large disruptions compared to other changes. Consequently, registers appear to provide a buffering role for other features without carrying irreplaceable specific information.

Core claim

Zero-ablation of register tokens in DINOv2 and DINOv3 produces large drops in classification and segmentation performance, but mean-substitution, noise-substitution, and cross-image register-shuffling keep performance within about 1 percentage point of the unmodified baseline. Internal representation perturbations are measured via per-patch cosine similarity, showing that zeroing causes disproportionately large changes while the other replacements perturb representations without degrading the tasks. The conclusion is that in these frozen-feature setups, performance depends on plausible register-like activations rather than exact image-specific values. Registers still buffer dense features from [CLS] dependence and are associated with compressed patch geometry.
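The perturbation measurement is simple to state. Below is a minimal sketch of per-patch cosine similarity between the unmodified and perturbed runs, assuming patch features of shape (images, patches, dim) extracted from the same layer; the function name and shapes are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def per_patch_cosine(feats_full: torch.Tensor, feats_perturbed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between matching patch vectors under two conditions.

    Both tensors have shape (n_images, n_patches, dim); the result has shape
    (n_images, n_patches). Values near 1 mean the perturbation barely moved that
    patch's representation; low values mean a large internal change.
    """
    return F.cosine_similarity(feats_full, feats_perturbed, dim=-1)

# Example: compare a replacement control against the unmodified ("Full") run.
# sims = per_patch_cosine(feats_full, feats_mean_substituted)
# print(sims.mean().item(), sims.min().item())
```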

What carries the argument

Comparison of zero-ablation against mean, noise, and cross-image shuffle replacements for register tokens, combined with cosine similarity analysis of internal activations.
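A minimal sketch of how such replacement controls can be implemented with PyTorch forward hooks, assuming block outputs of shape (batch, tokens, dim) with [CLS] at index 0 followed by the register tokens; the token layout, module path, and noise/shuffle details here are illustrative assumptions rather than the authors' implementation.

```python
import torch

def make_register_hook(mode: str, n_reg: int, mean_regs=None):
    """Return a forward hook that overwrites register hidden states.

    mode: 'zero' | 'mean' | 'noise' | 'shuffle'
    Assumes block outputs of shape (batch, tokens, dim) with [CLS] at index 0
    and registers at indices 1 .. n_reg (an assumption about token layout).
    """
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        regs = hs[:, 1:1 + n_reg, :]
        if mode == "zero":
            new = torch.zeros_like(regs)
        elif mode == "mean":
            new = mean_regs.to(regs).expand_as(regs)        # per-layer dataset-mean registers
        elif mode == "noise":
            new = torch.randn_like(regs) * regs.std() + regs.mean()
        elif mode == "shuffle":
            new = regs[torch.randperm(regs.shape[0])]       # registers taken from other images
        else:
            raise ValueError(mode)
        hs = hs.clone()
        hs[:, 1:1 + n_reg, :] = new
        return (hs, *output[1:]) if isinstance(output, tuple) else hs
    return hook

# handles = [blk.register_forward_hook(make_register_hook("mean", 4, layer_means[i]))
#            for i, blk in enumerate(model.encoder.layer)]   # module path is illustrative
```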

If this is right

  • Performance relies on the presence of register-like activations rather than their precise values from the input.
  • Zero-ablation is an unreliable method for assessing token importance because it introduces extreme perturbations.
  • Registers serve to buffer [CLS] dependence and influence patch geometry in these models.
  • These patterns replicate consistently at the larger ViT-B scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This implies that deployed systems could use generic or averaged register values with little loss.
  • Probing techniques in transformers should incorporate multiple control conditions to distinguish necessary presence from specific content.
  • Similar overstatement risks may exist when zero-ablation is applied to other special tokens in vision or language models.

Load-bearing premise

The mean, noise, and cross-image shuffle replacements isolate the effect of removing exact register content without adding compensating artifacts or altering task-relevant statistics.
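One way to make this premise checkable is to compare coarse statistics of the replaced registers against the originals while confirming that image-specific content is gone. A small illustrative check, assuming register activations of shape (images, registers, dim); this is not taken from the paper.

```python
import torch
import torch.nn.functional as F

def replacement_sanity_check(regs: torch.Tensor, replaced: torch.Tensor) -> dict:
    """Compare a register replacement against the original register activations.

    regs, replaced: (n_images, n_reg, dim). A 'plausible' replacement should keep
    coarse statistics (mean activation, typical norm) close to the originals while
    discarding image-specific content (low per-image cosine to the original).
    """
    return {
        "mean_gap": (replaced.mean() - regs.mean()).abs().item(),
        "norm_ratio": (replaced.norm(dim=-1).mean() / regs.norm(dim=-1).mean()).item(),
        "content_cosine": F.cosine_similarity(
            regs.flatten(1), replaced.flatten(1), dim=-1).mean().item(),
    }
```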

What would settle it

If a new evaluation task or architecture shows that mean or shuffled register replacements cause performance drops comparable to zero-ablation, this would indicate the replacements do not preserve the necessary properties.

Figures

Figures reproduced from arXiv: 2604.14433 by Felipe Parodi, Jordan Matelsky, Melanie Segado.

Figure 1
Figure 1. Approach overview. We compare three ViT-S/B models with different register configurations. Hook-based ablations zero [CLS] or register hidden states at every block output, and we evaluate on global (classification, retrieval) and dense (correspondence, segmentation) tasks. Three replacement controls test whether observed deficits reflect genuine content dependence or distributional shift from out-of-dist…
Figure 2
Figure 2. Zero-ablation effects and patch geometry. (a) Task × Ablation heatmap (Δ pp from Full). v2 = DINOv2, v2+R = DINOv2+reg, v3 = DINOv3. Zeroing registers produces large drops, but plausible replacement controls preserve all tasks (Tab. 1). (b) Effective rank (median ± std): registers compress patch geometry; DINOv3 exhibits the most compression. (c) Normalized eigenspectrum (log scale, Full condition; eigenva…
Figure 3
Figure 3. PCA projection of patch features under ablation (ViT-S, layer 11, 3-component RGB). Rows: models. Columns: input image, Full (no ablation), Zero CLS, Zero Registers. Zero CLS barely alters spatial structure with registers present; Zero Registers drastically reorganizes the feature space.
Figure 4
Figure 4. Attention flow across all 12 layers (ViT-S, 200 images). Rows: source token type (CLS, registers, patches). Columns: model. Stacked areas show what fraction of attention each source directs to CLS (gray), registers (magenta), and patches (olive). Register attention builds gradually from mid-layers, yet classification dependence on registers emerges abruptly at layers 10–11 (Fig. 8b). Park et al. [17] sho…
Figure 5
Figure 5. Attention patterns under ablation (1,000 images). (a) JS divergence vs. layer for ViT-S (dark) and ViT-B (light): register zeroing (solid) causes cascading divergence while mean-substitution (dashed) preserves attention patterns, supporting the distributional-shift interpretation. (b) CLS attention redistribution at last layer when registers are zeroed (ViT-S). Mean-substitution uses per-layer dataset-mean…
Figure 6
Figure 6. Qualitative correspondence under ablation (DINOv3-B, layer 11, tolerance 1). Green lines: correct NN matches. Red: incorrect. Top: Full condition preserves correct correspondences. Middle: Zero CLS causes minimal disruption (register buffering). Bottom: Zero Registers collapses correspondence accuracy.
Figure 8
Figure 8. Patch compression and register dependence across layers. (a) Effective rank (Full condition) decreases across layers; DINOv3 is already compressed by layer 6. (b) CLS accuracy under Full (solid) vs. register-zeroed (dashed): dependence is layer-specific, emerging only at layers 10–11. ViT-S models.
Figure 9
Figure 9. Task performance across transformer layers. (a) CLS classification (linear probe, 50 epochs, 1 seed). Classification emerges at layers 10–11. (b) Patch correspondence (tolerance = 1). Correspondence peaks at layers 6–8 then declines, except DINOv3, which maintains 78.9% at layer 11. ViT-S models.
Figure 10
Figure 10. CLS attention (ViT-S, last layer, 200 images). (a) CLS attention fraction per token type. DINOv2+reg: 17.9% to registers; DINOv3: 29.1%. (b) Per-register breakdown. This routing structure is maintained under all plausible register replacements.
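Figures 2(b) and 8(a) report the effective rank of patch features, following Roy and Vetterli [32]: the exponential of the Shannon entropy of the normalized singular values. A minimal sketch, assuming a single image's patch-feature matrix of shape (patches, dim); the function name is illustrative.

```python
import torch

def effective_rank(patch_feats: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank of a feature matrix (Roy & Vetterli, 2007).

    Defined as exp(H(p)), where p are the singular values normalized to sum to 1
    and H is the Shannon entropy. Higher values mean the patch features spread
    energy over more directions (less compressed geometry).
    """
    s = torch.linalg.svdvals(patch_feats.float())
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return float(torch.exp(entropy))
```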
read the original abstract

Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims that zero-ablation overstates register content dependence in DINO vision transformers. Experiments show that replacing register token activations with zero vectors causes large performance drops (up to -36.6 pp classification, -30.9 pp segmentation), but three controls—mean-substitution, noise-substitution, and cross-image register shuffling—preserve performance within ~1 pp of baseline across classification, correspondence, and segmentation. Cosine similarity analysis confirms that the non-zero replacements perturb internal representations (though less than zeroing), supporting the conclusion that frozen-feature performance depends on plausible register-like activations rather than exact image-specific values. Registers are still shown to buffer dense features from [CLS] dependence and associate with compressed patch geometry; all findings replicate at ViT-B scale.

Significance. If the results hold, the work offers a useful methodological caution for ablation studies in vision transformers, showing that zero-ablation can exaggerate token importance when plausible alternatives suffice. Strengths include multiple independent replacement controls, verification of representation perturbation via cosine similarity, replication across three tasks, and confirmation at ViT-B scale. These elements provide solid empirical grounding for the scoped claim without circularity or post-hoc exclusions.

minor comments (1)
  1. The abstract and methods would benefit from an explicit statement of the precise layers at which register replacements and cosine-similarity measurements are performed, to aid exact replication.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and significance assessment. The review accurately captures the core claim and the empirical controls.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports direct experimental comparisons of zero-ablation versus mean-, noise-, and cross-image-shuffle replacements on frozen DINOv2/v3 features for classification, correspondence, and segmentation tasks. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text. Performance deltas are measured against unmodified baselines and supported by per-patch cosine-similarity checks; these are externally replicable measurements rather than quantities defined in terms of themselves. No load-bearing premise reduces to a self-citation chain or an ansatz smuggled via prior work by the same authors. The scoped claim that zero-ablation overstates exact-content dependence follows from the observed pattern that plausible register-like activations suffice, without circular reduction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study relying on standard machine-learning experimental practices with no new free parameters, mathematical axioms, or postulated entities.

axioms (1)
  • domain assumption Frozen-feature evaluations on classification, correspondence, and segmentation tasks are representative of register function in DINO models
    The paper explicitly limits its claims to frozen-feature settings and the listed tasks.

pith-pipeline@v0.9.0 · 5489 in / 1160 out tokens · 69743 ms · 2026-05-10T12:53:46.395791+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] Eric Jonas and Konrad Paul Kording. Could a neuroscientist understand a microprocessor? PLoS Computational Biology, 13(1):e1005268, 2017.

  2. [2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  3. [3] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick… DINOv2: Learning robust visual features without supervision.

  4. [4] Melanie Segado, Felipe Parodi, Jordan K. Matelsky, Michael L. Platt, Eva B. Dyer, and Konrad P. Kording. Grounding intelligence in movement. arXiv preprint arXiv:2507.02771, 2025.

  5. [5] Felipe Parodi, Jordan K. Matelsky, Alessandro P. Lamacchia, Melanie Segado, Yaoguang Jiang, Alejandra Regla-Vargas, Liala Sofi, Clare Kimock, Bridget M. Waller, Michael L. Platt, and Konrad P. Kording. PrimateFace: A machine learning resource for automated face analysis in human and non-human primates. bioRxiv, 2025.

  6. [6] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.

  7. [7] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Timothée Darcet, Hervé Jégou, Piotr Bojanowski, Julien Mairal, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  8. [8] Nicholas Jiang, Amil Dravid, Alexei A. Efros, and Yossi Gandelsman. Vision transformers don't need trained registers. In NeurIPS, 2025.

  9. [9] Alexander Lappe and Martin A. Giese. Register and [CLS] tokens induce a decoupling of local and global features in large ViTs. In NeurIPS, 2025.

  10. [10] Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, and Huy V. Vo. Revisiting [CLS] and patch token interaction in vision transformers. In ICLR, 2026.

  11. [11] Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep Singh Lubana, Talia Konkle, Demba E. Ba, and Martin Wattenberg. Into the rabbit hull: From task-relevant concepts in DINO to Minkowski geometry. In ICLR, 2026.

  12. [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

  13. [13] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024.

  14. [14] Srikar Yellapragada, Kowshik Thopalli, Vivek Narayanaswamy, Wesam Sakla, Yang Liu, Yamen Mubarka, Dimitris Samaras, and Jayaraman J. Thiagarajan. Leveraging registers in vision transformers for robust adaptation. In ICASSP, 2025.

  15. [15] Cheng Shi, Yizhou Yu, and Sibei Yang. Vision transformers need more than registers. In CVPR, 2026.

  16. [16] Zipeng Yan, Yinjie Chen, Chong Zhou, Bo Dai, and Andrew F. Luo. Vision transformers with self-distilled registers. In NeurIPS, 2025.

  17. [17] Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, and Sangdoo Yun. What do self-supervised vision transformers learn? In ICLR, 2023.

  18. [18] Valentinos Pariza, Mohammadreza Salehi, Gertjan J. Burghouts, Francesco Locatello, and Yuki M. Asano. Near, far: Patch-ordering enhances vision foundation models' scene understanding. In ICLR, 2025.

  19. [19] Jiawei Yang, Katie Z. Luo, Jiefeng Li, Congyue Deng, Leonidas Guibas, Dilip Krishnan, Kilian Q. Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers. In ECCV, 2024.

  20. [20] Haoqi Wang, Tong Zhang, and Mathieu Salzmann. SINDER: Repairing the singular defects of DINOv2. In ECCV, 2024.

  21. [21] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022.

  22. [22] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, 2019.

  23. [23] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.

  24. [24] Peter Hase, Harry Xie, and Mohit Bansal. The out-of-distribution problem in explainability and search methods for feature importance explanations. In NeurIPS, 2021.

  25. [25] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017.

  26. [26] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In NeurIPS, 2021.

  27. [27] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In NeurIPS, 2022.

  28. [28] Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255, 2024.

  29. [29] Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. In ICLR, 2024.

  30. [30] Maximilian Li and Lucas Janson. Optimal ablation for interpretability. In NeurIPS, 2024.

  31. [31] Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. SPair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019.

  32. [32] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In European Signal Processing Conference (EUSIPCO), pages 606–610, 2007.

  33. [33] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.

  34. [34] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.

  35. [35] Extended Results. The ViT-B task × ablation matrix (Tab. 5) complements the ViT-S matrix in the main text (Tab. 2). Per-task breakdowns with confidence intervals follow. Segmentation; synthetic correspondence: Table 7 provides per-condition correspondence with bootstrapped CIs. Register zeroing reduces correspondence from 69–79% to 58–64% (ViT-S; Tab. 7), …

  36. [36] Controls and Statistical Tests. Random-patch negative control: zeroing 4 random patch tokens (5 seeds) causes ≤1 pp CLS drop for ViT-S and ≤2.3 pp for ViT-B (Tab. 9), vs. −18.9 / −36.6 pp for register zeroing. Mean-substitution control: replacing registers with per-layer dataset-mean activations (5,000 images; Tab. 1 in main text) has negligible effect (−0.3…

  37. [37] Representation Geometry. Effective rank across layers reveals when patch compression and register dependence emerge (Tabs. 11 and 12 and Fig. 9; see also Fig. 8). DINOv3 is already compressed at layer 6 (effective rank 6.4 vs. 32.3 for DINOv2), yet register dependence for classification emerges only at layers 10–11. In DINOv3, register zeroing improves CLS…

  38. [38] Mechanistic Analysis. Attention flow across layers: Fig. 4 traces how attention mass distributes between token types at each of the 12 transformer layers. In DINOv2 (no registers), CLS self-attention dominates early layers then declines. In both register models, register attention share builds gradually from mid-layers: DINOv2+reg stabilizes at ∼20% CLS→r…

  39. [39] ViT-B Scale Validation. We replicate the zero-ablation experiments at ViT-B scale: DINOv2-B (86.6M params), DINOv2-B+reg (with four register tokens), and DINOv3-B (85.7M params). The full Figure 10 (CLS attention; ViT-S, last layer, 200 images): (a) CLS attention fraction per token type, DINOv2+reg: 17.9% to registers, DINOv3: 29.1%; (b) per-register breakdo…

  40. [40] Experimental Details. Feature extraction: all features are extracted using HuggingFace transformers (facebook/dinov2-small and facebook/dinov2-with-registers-small for ViT-S; facebook/dinov2-base and facebook/dinov2-with-registers-base for ViT-B). DINOv3 models are loaded via torch.hub with locally cached weights (dinov3_vits16 and dinov3_vitb16). Inp…

  41. [41] …per patch token; masks downsampled to the patch grid via nearest-neighbor interpolation. AdamW, weight decay 10⁻², learning rate 10⁻³, constant, 100 epochs. Per-pixel cross-entropy, ignoring void (index 255). kNN retrieval: 2,000 ImageNet val images, each producing two augmented views (RandomResizedCrop, ColorJitter, RandomHorizontalFlip). Cosine simila…
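Entries [40] and [41] above outline the frozen-feature pipeline: features from HuggingFace DINOv2 checkpoints (torch.hub for DINOv3) and a per-patch linear segmentation probe trained with AdamW (learning rate 10⁻³, weight decay 10⁻², 100 epochs, per-pixel cross-entropy ignoring index 255). A minimal, hedged sketch of that probe step, assuming features and patch-grid masks are already extracted; batching and checkpoint handling are simplified, and nothing here is the authors' released code.

```python
import torch
import torch.nn as nn

# Frozen backbone (illustrative checkpoint name from entry [40]):
# from transformers import AutoModel
# backbone = AutoModel.from_pretrained("facebook/dinov2-with-registers-small").eval()

def train_linear_probe(feats: torch.Tensor, masks: torch.Tensor,
                       num_classes: int, dim: int, epochs: int = 100) -> nn.Linear:
    """Per-patch linear segmentation probe on frozen features.

    feats: (N, P, dim) patch features; masks: (N, P) long class indices, 255 = void.
    Optimizer and loss settings follow the supplementary fragment; full-batch
    training is used here for brevity.
    """
    probe = nn.Linear(dim, num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-2)
    loss_fn = nn.CrossEntropyLoss(ignore_index=255)
    for _ in range(epochs):
        logits = probe(feats)                                   # (N, P, num_classes)
        loss = loss_fn(logits.reshape(-1, num_classes), masks.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```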