pith. machine review for the scientific record.

arxiv: 2111.06377 · v3 · submitted 2021-11-11 · 💻 cs.CV

Masked Autoencoders Are Scalable Vision Learners

Pith reviewed 2026-05-16 06:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords masked autoencoders · self-supervised learning · vision transformers · image reconstruction · ImageNet pretraining · scalable vision models · ViT-Huge

The pith

Masked autoencoders learn scalable vision features by reconstructing heavily masked image patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple self-supervised method called masked autoencoders works well for training large vision models. Random patches of an input image are hidden, and the model must reconstruct the missing pixel values. An asymmetric design runs the main encoder only on the visible patches while a small decoder adds mask tokens to rebuild the full image. High masking ratios around 75 percent turn the task into a useful challenge that avoids easy shortcuts. This setup speeds up training by three times or more and lets a plain ViT-Huge model reach 87.8 percent accuracy on ImageNet-1K data alone, with stronger transfer to other tasks than supervised pre-training.

Core claim

Masked autoencoders are scalable self-supervised learners for computer vision. The approach masks random patches of the input image and reconstructs the missing pixels. It is based on an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches without mask tokens, along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Masking a high proportion of the input image, such as 75 percent, yields a nontrivial and meaningful self-supervisory task. Coupling these designs enables efficient training of large models that generalize well, for example a vanilla ViT-Huge model achieving the 87

What carries the argument

Asymmetric encoder-decoder where the encoder processes only visible patches and the lightweight decoder reconstructs the full image from latent features plus mask tokens.
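
The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `random_masking` and `build_decoder_input`, the string stand-ins for latent vectors, and the 196-patch grid (a 224x224 image cut into 16x16 patches) are all assumptions made for the example.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Pick visible and masked patch indices uniformly at random (hypothetical helper)."""
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible_ids = sorted(order[:num_keep])
    masked_ids = sorted(order[num_keep:])
    return visible_ids, masked_ids


def build_decoder_input(encoded_visible, visible_ids, num_patches, mask_token):
    """Put encoded visible patches back in place; every other slot gets the mask token."""
    full = [mask_token] * num_patches
    for idx, latent in zip(visible_ids, encoded_visible):
        full[idx] = latent
    return full


rng = random.Random(0)
num_patches = 196  # assumption: 224x224 image, 16x16 patches
visible_ids, masked_ids = random_masking(num_patches, 0.75, rng)

# Stand-in for the encoder: it only ever receives the visible patches.
encoded_visible = [f"z{idx}" for idx in visible_ids]

# The decoder sees the full-length sequence: latents plus mask tokens.
decoder_input = build_decoder_input(encoded_visible, visible_ids, num_patches, "[MASK]")

print(len(visible_ids), len(masked_ids), len(decoder_input))
```

The point of the sketch is the asymmetry: the expensive encoder works on a quarter of the sequence, and only the small decoder handles the full-length input.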

If this is right

  • Training accelerates by 3x or more while accuracy improves.
  • Vanilla ViT-Huge reaches 87.8 percent accuracy on ImageNet-1K using only that data.
  • Transfer performance on downstream tasks exceeds supervised pre-training.
  • The method exhibits promising scaling behavior as model size grows.
  • High masking ratios produce meaningful self-supervision that supports large models.
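
A rough token count helps show why skipping masked patches in the encoder saves compute. The sketch below assumes a 224x224 input and 16x16 patches (neither is stated in this summary) and relies on the fact that self-attention compares every token with every other token. It ignores the decoder and all non-encoder costs, so it is not a derivation of the reported 3x end-to-end speedup.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def encoder_workload(total_patches, mask_ratio):
    """Tokens the encoder sees, and token pairs its self-attention compares."""
    visible = int(total_patches * (1 - mask_ratio))
    return {"tokens": visible, "attention_pairs": visible * visible}


total = num_patches(224, 16)            # assumed sizes
full = encoder_workload(total, 0.0)     # encoder fed every patch
masked = encoder_workload(total, 0.75)  # encoder fed visible patches only

token_saving = full["tokens"] / masked["tokens"]
pair_saving = full["attention_pairs"] / masked["attention_pairs"]
print(total, masked["tokens"], token_saving, pair_saving)
```

Under these assumptions the encoder handles 49 of 196 tokens, which is a 4x reduction in tokens and a 16x reduction in attention pairs per layer.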

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking-plus-reconstruction pattern could apply directly to video or audio by hiding patches across time or frequency.
  • The training efficiency opens the door to pre-training on image collections far larger than ImageNet without labels.
  • Reconstruction objectives may serve as a drop-in replacement for contrastive losses when scaling vision transformers.
  • Hybrid versions that combine this decoder with contrastive heads could be tested on the same architectures.

Load-bearing premise

Masking a high proportion of the input creates a nontrivial self-supervisory task whose difficulty drives useful feature learning rather than trivial solutions.

What would settle it

Either result would count against the claim: a ViT-Huge model pre-trained with 75 percent masking on ImageNet-1K falls short of 87.8 percent top-1 accuracy, or a lower masking ratio yields equal or higher accuracy.

Original abstract

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces masked autoencoders (MAE) as a scalable self-supervised pre-training approach for vision. It masks a high fraction (e.g., 75%) of random image patches and reconstructs the missing pixels via an asymmetric encoder-decoder: the encoder processes only the visible patches (no mask tokens), while a lightweight decoder reconstructs the full image from the latent representation plus mask tokens. This design enables efficient training of large ViT models; a vanilla ViT-Huge achieves 87.8% top-1 accuracy on ImageNet-1K using only ImageNet-1K data and shows strong transfer gains over supervised pre-training.

Significance. If the empirical results hold, the work is significant because it demonstrates that a simple, high-masking-ratio reconstruction task combined with an asymmetric architecture can scale self-supervised learning to high-capacity vision models, yielding both 3x+ training acceleration and state-of-the-art ImageNet-1K accuracy among ImageNet-only methods. The extensive ablations on masking ratio and decoder depth, together with downstream transfer experiments, provide direct support for the central scalability claim.

minor comments (2)
  1. [Abstract] The statement that 87.8% is the 'best accuracy among methods that use only ImageNet-1K data' would be strengthened by an explicit footnote or table reference listing the exact competing methods and their scores.
  2. [Section 4.2] The description of the masking ratio ablation would benefit from a brief statement of the reconstruction loss behavior at 75% versus lower ratios to make the 'nontrivial task' claim more concrete.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and insightful review, as well as the recommendation to accept the manuscript. The referee's summary accurately captures our core contributions regarding the asymmetric encoder-decoder design and high masking ratio in masked autoencoders for scalable self-supervised pre-training of vision transformers.

Circularity Check

0 steps flagged

No significant circularity; empirical method is self-contained

full rationale

The paper presents an empirical self-supervised method (asymmetric encoder-decoder with 75% random patch masking) whose core designs are stated directly as architectural choices and training procedures. All reported results, including the 87.8% ImageNet-1K accuracy for ViT-Huge, are obtained from end-to-end training and evaluation on fixed public benchmarks. No central quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation chain; the nontriviality of the masking task is tested via ablations rather than assumed by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the standard Vision Transformer architecture and pixel reconstruction loss from prior literature; the main addition is the masking strategy and asymmetry, which are validated empirically rather than derived from axioms.
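
As a concrete reading of "pixel reconstruction loss", the sketch below computes a mean squared error over masked patches only. That choice of loss and the function name `masked_mse` are assumptions for illustration; the summary above does not spell out the exact form, and the toy patches with two pixels each are made up.

```python
def masked_mse(pred, target, masked_ids):
    """Mean squared error averaged over masked patches only (assumed loss form)."""
    total = 0.0
    for idx in masked_ids:
        p, t = pred[idx], target[idx]
        # Per-patch error: mean of squared pixel differences.
        total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
    return total / len(masked_ids)


# Toy data: four patches, two pixel values each.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pred = [[9.0, 9.0], [1.0, 3.0], [2.0, 2.0], [3.0, 3.0]]
masked_ids = [1, 2, 3]  # patch 0 is visible, so its error is ignored

loss = masked_mse(pred, target, masked_ids)
print(loss)
```

Patch 0 is badly predicted here but does not affect the loss, because in this sketch only the hidden patches are scored.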

free parameters (1)
  • masking ratio = 75%
    Set to 75% after ablation; chosen because lower ratios make the task too easy.
axioms (1)
  • [standard math] Vision Transformer patch embedding and self-attention from Dosovitskiy et al. 2020
    The encoder is a standard ViT; no new mathematical foundation is introduced.

pith-pipeline@v0.9.0 · 5499 in / 1200 out tokens · 26738 ms · 2026-05-16T06:49:33.669252+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we mask random patches of the input image and reconstruct the missing pixels... masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  2. Representing 3D Faces with Learnable B-Spline Volumes

    cs.CV 2026-04 unverdicted novelty 7.0

    CUBE encodes 3D faces via a grid of learned high-dimensional B-spline features that map parametrically to a base shape plus MLP-refined displacements, enabling dense correspondence and state-of-the-art registration fr...

  3. Learning to Discover at Test Time

    cs.LG 2026-01 unverdicted novelty 7.0

    TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

  4. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  5. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    cs.LG 2022-11 conditional novelty 7.0

    PatchTST uses subseries patching and channel-independent Transformers to deliver significantly better long-term multivariate time series forecasting and strong self-supervised transfer performance.

  6. Earth System Foundation Model (ESFM): A unified framework for heterogeneous data integration and forecasting

    physics.ao-ph 2026-04 unverdicted novelty 6.0

    ESFM is a single open foundation model that unifies heterogeneous Earth data sources and forecasts missing regions while preserving inter-variable physical relationships.

  7. Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography

    cs.CV 2026-04 unverdicted novelty 6.0

    LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts an...

  8. Self-supervised Pretraining of Cell Segmentation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

  9. Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi

    cs.CV 2026-04 unverdicted novelty 6.0

    Scale-ALiBi adds a spatial-scale bias to ALiBi attention, enabling effective representation learning across high- and low-resolution optical and SAR satellite images.

  10. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI 2026-04 unverdicted novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  11. Physics-Informed Transformer for Real-Time High-Fidelity Topology Optimization

    cs.CE 2026-04 unverdicted novelty 6.0

    A transformer model with self-attention and auxiliary physics losses learns a direct non-iterative mapping from loads and fields to manufacturable optimized topologies.

  12. C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

    cs.LG 2026-03 unverdicted novelty 6.0

    C²FG provides a time-dependent exponential decay control for classifier-free guidance based on theoretical upper bounds on conditional-unconditional score discrepancies in diffusion processes.

  13. Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery

    cs.CV 2026-02 unverdicted novelty 6.0

    PerASCD sets new state-of-the-art Sek scores on SECOND and LandsatSCD datasets by using a modular cascaded gated decoder on PerA foundation model features plus a new consistency loss.

  14. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  15. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV 2024-02 conditional novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  16. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  17. PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs

    cs.CV 2026-02 unverdicted novelty 5.0

    PANC augments Normalized Cut with anchor-augmented token graphs using priors to steer spectral partitions, yielding mIoU gains of 2.3-8.7% over baselines on DUTS-TE, DUT-OMRON, and CrackForest.

  18. Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

    cs.RO 2026-02 unverdicted novelty 5.0

    An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.

  19. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  20. Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

    cs.CV 2026-04 unverdicted novelty 4.0

    DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.

  21. AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer

    eess.IV 2026-04 unverdicted novelty 4.0

    An attention-based fusion model combining semi-supervised CT segmentation, radiomics, and clinical features predicts metastatic recurrence, overall survival, and disease-free survival in HPV+ oropharyngeal cancer with...

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 21 Pith papers · 5 internal anchors

  1. [1]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016

  2. [2]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021. Accessed in June 2021

  3. [3]

    Self-organizing neural network that discovers surfaces in random-dot stereograms

    Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  5. [5]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  6. [6]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020

  7. [7]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020

  8. [8]

    Exploring simple Siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple Siamese representation learning. In CVPR, 2021

  9. [9]

    An empirical study of training self-supervised Vision Transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021

  10. [10]

    ELECTRA: Pre-training text encoders as discriminators rather than generators

    Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020

  11. [11]

    Support-vector networks

    Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 1995

  12. [12]

    Randaugment: Practical automated data augmentation with a reduced search space

    Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, 2020

  13. [13]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009

  14. [14]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019

  15. [15]

    Unsupervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015

  16. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  17. [17]

    Unsupervised representation learning by predicting image rotations

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018

  18. [18]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010

  19. [19]

    Self-supervised pretraining of visual features in the wild

    Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, and Piotr Bojanowski. Self-supervised pretraining of visual features in the wild. arXiv:2103.01988, 2021

  20. [20]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017

  21. [21]

    Bootstrap your own latent - a new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020

  22. [22]

    Dimensionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006

  23. [23]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020

  24. [24]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017

  25. [25]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016

  26. [26]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021

  27. [27]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019

  28. [28]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021

  29. [29]

    Autoencoders, minimum description length, and helmholtz free energy

    Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length, and helmholtz free energy. In NeurIPS, 1994

  30. [30]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016

  31. [31]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

  32. [32]

    Quality-agnostic image recognition via invertible decoder

    Insoo Kim, Seungju Han, Ji-won Baek, Seong-Jin Park, Jae-Joon Han, and Jinwoo Shin. Quality-agnostic image recognition via invertible decoder. In CVPR, 2021

  33. [33]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012

  34. [34]

    Backpropagation applied to handwritten zip code recognition

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989

  35. [35]

    Benchmarking detection transfer learning with vision transformers

    Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaiming He, and Ross Girshick. Benchmarking detection transfer learning with vision transformers. In preparation, 2021

  36. [36]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017

  37. [37]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014

  38. [38]

    SGDR: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  40. [40]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018

  41. [41]

    Towards robust vision transformer

    Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision transformer. arXiv:2105.07926, 2021

  42. [42]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016

  43. [43]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018

  44. [44]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017

  45. [45]

    Learning features by watching objects move

    Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017

  46. [46]

    Context encoders: Feature learning by inpainting

    Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016

  47. [47]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

  48. [48]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  49. [49]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020

  50. [50]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021

  51. [51]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015

  52. [52]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016

  53. [53]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021

  54. [54]

    Grafit: Learning fine-grained image representations with coarse labels

    Hugo Touvron, Alexandre Sablayrolles, Matthijs Douze, Matthieu Cord, and Hervé Jégou. Grafit: Learning fine-grained image representations with coarse labels. In ICCV, 2021

  55. [55]

    Fixing the train-test resolution discrepancy

    Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. arXiv:1906.06423, 2019

  56. [56]

    The iNaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018

  57. [57]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  58. [58]

    Extracting and composing robust features with denoising autoencoders

    Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008

  59. [59]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010

  60. [60]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019

  61. [61]

    Unsupervised learning of visual representations using videos

    Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015

  62. [62]

    Unsupervised feature learning via non-parametric instance discrimination

    Zhirong Wu, Yuanjun Xiong, Stella Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018

  63. [63]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018

  64. [64]

    Early convolutions help transformers see better

    Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021

  65. [65]

    How transferable are features in deep neural networks?

    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014

  66. [66]

    Large Batch Training of Convolutional Networks

    Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv:1708.03888, 2017

  67. [67]

    VOLO: Vision outlooker for visual recognition

    Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. VOLO: Vision outlooker for visual recognition. arXiv:2106.13112, 2021

  68. [68]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019

  69. [69]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018

  70. [70]

    Colorful image colorization

    Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016

  71. [71]

    Learning deep features for scene recognition using Places database

    Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using Places database. In NeurIPS, 2014

  72. [72]

    Semantic understanding of scenes through the ADE20K dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019

    A. Implementation Details. A.1. ImageNet Experiments. ViT architecture. We follow the standard ViT architecture [16]. It has a stack of Transformer blocks [57], and each block consists...

  73. [73]

    Our MAE does not use relative position or layer scaling (which are used in the code of [2])

    (the sine-cosine version) to both the encoder and decoder inputs. Our MAE does not use relative position or layer scaling (which are used in the code of [2]). We extract features from the encoder output for fine-tuning and linear probing. As ViT has a class token [16], to adapt to this design, in our MAE pre-training we append an auxiliary dummy token t...

  74. [74]

    ViT has a stack of Transformer blocks that all produce feature maps at a single scale (e.g., stride 16)

    in Mask R-CNN [24]. ViT has a stack of Transformer blocks that all produce feature maps at a single scale (e.g., stride 16). We equally divide this stack into 4 subsets and apply convolutions to upsample or downsample the intermediate feature maps for producing different scales (stride 4, 8, 16, or 32, the same as a standard ResNet [25]). FPN is built ...