arxiv: 2106.08254 · v2 · submitted 2021-06-15 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

BEiT: BERT Pre-Training of Image Transformers

Furu Wei, Hangbo Bao, Li Dong, Songhao Piao

Pith reviewed 2026-05-13 11:44 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords BEiTvision transformermasked image modelingself-supervised pre-trainingImageNet classificationBERT adaptationdiscrete visual tokens

0 comments

The pith

BEiT pre-trains vision transformers by recovering discrete visual tokens from masked image patches, reaching 83.2% ImageNet-1K accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BEiT, a self-supervised pre-training approach for vision transformers that follows the BERT pattern of masked modeling. Each image is first converted into a sequence of discrete visual tokens by a separate tokenizer; random patches are then masked, and the transformer is trained to predict the original tokens for those masked positions from the remaining visible patches. After this pre-training on unlabeled data, the model is fine-tuned by adding task-specific layers, yielding strong results on image classification and semantic segmentation. The base-size model achieves 83.2% top-1 accuracy on ImageNet-1K, exceeding a from-scratch DeiT baseline, while the large-size model reaches 86.3% using only ImageNet-1K data and surpasses a larger ViT model that relied on supervised pre-training over the bigger ImageNet-22K set.

Core claim

BEiT pre-trains a vision transformer encoder by feeding it corrupted images consisting of visible patches plus mask tokens, then requiring it to reconstruct the discrete visual tokens that a separate tokenizer assigned to the original full image. The same encoder weights are later fine-tuned directly on downstream tasks without further architectural changes.

What carries the argument

Masked image modeling objective that recovers discrete visual tokens from a set of randomly masked image patches.

If this is right

Vision transformers can reach competitive ImageNet accuracy using only ImageNet-1K for pre-training instead of larger labeled collections.
The same transformer backbone works for both the masked pre-training stage and subsequent fine-tuning on classification or segmentation.
Larger models benefit more from this pre-training, as shown by the jump from base to large size on the same data.
Semantic segmentation performance improves when the encoder has first learned to predict visual tokens from masked patches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better tokenizers could raise the upper bound on what the masked modeling signal can teach the transformer.
The same masked-token recipe might transfer to video or audio by swapping in an appropriate tokenizer for those domains.
Combining the token-prediction loss with other self-supervised objectives could produce even stronger starting weights for fine-tuning.

Load-bearing premise

The separate tokenizer must generate discrete visual tokens that carry rich semantic content rather than collapsing to low-level patterns.

What would settle it

A BEiT model fine-tuned on ImageNet-1K classification that matches or falls below the accuracy of an identically sized DeiT model trained from scratch would show the pre-training step added no value.

read the original abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BEiT shows masked visual token prediction pre-trains ViTs to solid ImageNet gains, but the lift depends on an unablated dVAE tokenizer whose contribution is not isolated.

read the letter

The main thing here is that BEiT adapts BERT-style masked modeling to vision transformers by predicting discrete visual tokens from masked patches, and it reports clear accuracy lifts over from-scratch DeiT training on ImageNet-1K. Base BEiT reaches 83.2% top-1 while large BEiT hits 86.3% using only ImageNet-1K data, beating supervised ViT-L on the larger ImageNet-22K set. This is new relative to the cited ViT and DeiT work, and the setup keeps the pre-training objective independent of downstream metrics, which is a clean design choice. Releasing code and models also helps anyone who wants to check the numbers directly. The paper does a straightforward job transferring the BERT idea without overcomplicating the architecture. The soft spots are real but not fatal. The results rest on the separate dVAE tokenizer producing stable, non-trivial targets; if it collapses or mostly encodes low-level statistics the masked modeling signal would weaken. The abstract and summary give no ablations on codebook size, training data for the tokenizer, or comparisons against random or constant targets, so it is hard to tell how much of the reported gain comes from the BERT objective versus the tokenizer itself. No error bars appear either, which leaves the size of the improvement over baselines only partially quantified. The stress-test concern holds up on the given text. This paper is for groups working on self-supervised pre-training for transformers in vision. A reader already following ViT scaling or BERT-style objectives will get the most out of it. The thinking is clear and the claims are falsifiable enough to warrant closer inspection. I would bring it to a reading group as a maybe to walk through the tokenizer details. I would cite it for the masked token results and the ImageNet numbers. It deserves peer review because the numbers are competitive and the method is simple to test further.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces BEiT, a self-supervised pre-training method for vision Transformers that adapts the BERT masked modeling paradigm. Each image is tokenized into discrete visual tokens via a separately trained dVAE; random patches are masked and the Transformer is trained to recover the original visual tokens from the corrupted input. After pre-training, the encoder is fine-tuned on downstream tasks. Key empirical claims are that base-size BEiT reaches 83.2% top-1 accuracy on ImageNet-1K (outperforming from-scratch DeiT at 81.8%) and large-size BEiT reaches 86.3% using only ImageNet-1K data, exceeding supervised ViT-L pre-trained on ImageNet-22K (85.2%).

Significance. If the central assumption about the tokenizer holds, the work shows that a BERT-style masked token prediction objective can be transferred to vision Transformers and yields competitive or superior ImageNet performance with substantially less supervised data than prior supervised pre-training. The public release of code and models is a positive contribution to reproducibility.

major comments (3)

[§3.2] §3.2 (Tokenizer and pre-training objective): the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.
[Table 1] Table 1 (ImageNet-1K results): the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.
[§4.2] §4.2 (Large-model comparison): the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.

minor comments (2)

[Abstract] Abstract: the phrase 'competitive results with previous pre-training methods' is vague; listing the main baselines (DeiT, ViT, etc.) would improve clarity.
[§2.1] §2.1: notation for 'visual tokens' versus standard ViT patch embeddings is introduced without a clear notational distinction, which can confuse readers familiar with the ViT paper.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and valuable suggestions. We have revised the manuscript to address the major comments and provide additional details and experiments where appropriate.

read point-by-point responses

Referee: [§3.2] §3.2 (Tokenizer and pre-training objective): the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.

Authors: We thank the referee for highlighting this aspect. The dVAE is trained on ImageNet-1K following the original dVAE paper, and serves as a fixed discretization step. To address the concern, we have performed additional ablations on codebook size (1024, 2048, 4096, 8192) and included the results in the revised Section 3.2. The ImageNet accuracy varies by at most 0.4% across these sizes, supporting that the MIM objective is the key contributor to the performance gains over DeiT. We have also added a discussion on why dVAE was chosen over other tokenization methods. revision: yes
Referee: [Table 1] Table 1 (ImageNet-1K results): the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.

Authors: We agree that reporting statistical reliability is important for such claims. In the revised manuscript, we have updated Table 1 to include the mean accuracy and standard deviation computed over three independent runs with different random seeds. For the base model, BEiT achieves 83.2% ± 0.15%, compared to DeiT's 81.8% ± 0.20%. The improvement is consistent across runs. revision: yes
Referee: [§4.2] §4.2 (Large-model comparison): the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.

Authors: We confirm that the fine-tuning protocol for BEiT-Large is exactly the same as that used for the supervised ViT-Large in the original ViT work, including identical data augmentations (RandAugment, Mixup, CutMix), optimizer (AdamW with the same hyperparameters), learning rate schedule, and number of epochs. We have added an explicit statement and a reference to the exact settings from Dosovitskiy et al. in the revised Section 4.2 to clarify this. revision: yes

Circularity Check

0 steps flagged

BEiT pre-training objective is independently defined and externally validated

full rationale

The paper defines its masked image modeling task as recovering discrete visual tokens produced by a separately trained tokenizer, with the objective stated independently of any downstream metrics. Reported gains (e.g., 83.2% base BEiT vs. 81.8% DeiT on ImageNet-1K) are empirical results from fine-tuning on standard held-out benchmarks, not reductions of the claimed performance to the pre-training inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described pipeline; the tokenizer is an external component whose quality is not derived from BEiT equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence of a pre-trained discrete visual tokenizer whose output tokens serve as reconstruction targets; no free parameters are fitted inside the BEiT transformer itself beyond standard training hyperparameters.

axioms (1)

domain assumption A separately trained tokenizer produces discrete visual tokens that are a suitable prediction target for masked image modeling.
Invoked in the description of the two-view pre-training setup; the quality of these tokens is not derived from the BEiT loss.

invented entities (1)

visual tokens no independent evidence
purpose: Discrete reconstruction targets for the masked modeling objective
Generated by an external tokenizer; no independent evidence of their semantic richness is supplied in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1286 out tokens · 33104 ms · 2026-05-13T11:44:06.846514+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear
base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%)

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
cs.CV 2026-05 unverdicted novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
Rethink MAE with Linear Time-Invariant Dynamics
cs.CV 2026-04 unverdicted novelty 7.0

Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
cs.CV 2026-04 unverdicted novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
Segment Anything
cs.CV 2023-04 unverdicted novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
iBOT: Image BERT Pre-Training with Online Tokenizer
cs.CV 2021-11 unverdicted novelty 7.0

iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
cs.CV 2026-05 unverdicted novelty 6.0

AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
cs.CV 2026-05 unverdicted novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
cs.CV 2026-05 unverdicted novelty 6.0

ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 63...
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
cs.CV 2026-04 unverdicted novelty 6.0

MAEPose is a masked autoencoder that learns spatiotemporal representations from unlabeled mmWave radar videos to estimate human poses, outperforming baselines by up to 22.1% in MPJPE.
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
cs.LG 2026-04 unverdicted novelty 6.0

BrainDINO delivers a single self-supervised brain MRI representation that generalizes to tumor segmentation, disease classification, brain age estimation, and other tasks without volumetric pretraining or full fine-tuning.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 unverdicted novelty 6.0

Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
cs.CL 2026-04 unverdicted novelty 6.0

MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.
Rapidly deploying on-device eye tracking by distilling visual foundation models
cs.CV 2026-04 unverdicted novelty 6.0

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
cs.CV 2023-03 conditional novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
cs.CV 2026-05 unverdicted novelty 5.0

ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
Sapiens2
cs.CV 2026-04 unverdicted novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
cs.CV 2026-04 unverdicted novelty 5.0

ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.
PRAGMA: Revolut Foundation Model
cs.LG 2026-04 unverdicted novelty 5.0

PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis
cs.CV 2026-04 unverdicted novelty 5.0

New public dataset and VLM-guided flow matching segmentation combined with random matrix theory anomaly detection for interpretable canine pneumothorax diagnosis.
NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild
cs.CV 2026-04 unverdicted novelty 4.0

The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection models.
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
cs.CV 2026-04 unverdicted novelty 3.0

DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 26 Pith papers · 5 internal anchors

[1]

UniLMv2: Pseudo- masked language models for uniﬁed language model pre-training

[BDW+20] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. UniLMv2: Pseudo- masked language models for uniﬁed language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020 , volume 119 of Proceedings of Machine Learning R...

work page 2020
[2]

Improved Baselines with Momentum Contrastive Learning

[CFGH20] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. preprint arXiv:2003.04297,

work page internal anchor Pith review arXiv 2003
[3]

Exploring simple siamese representation learning

[CH20] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. preprint arXiv:2011.10566,

work page arXiv 2011
[4]

A Simple Framework for Contrastive Learning of Visual Representations

[CKNH20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709,

work page internal anchor Pith review arXiv 2002
[5]

Emerging Properties in Self-Supervised Vision Transformers

[CTM+21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bo- janowski, and Armand Joulin. Emerging properties in self-supervised vision transform- ers. arXiv preprint arXiv:2104.14294,

work page arXiv
[6]

An empirical study of training self- supervised vision transformers

[CXH21] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self- supervised vision transformers. ArXiv, abs/2104.02057,

work page arXiv
[7]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[DBK+20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[8]

BERT: pre- training of deep bidirectional transformers for language understanding

10 [DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre- training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational ...

work page 2019
[9]

Self-attention attribution: Interpreting information interactions inside Transformer

[HDWX20] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Self-attention attribution: Interpreting information interactions inside Transformer. arXiv preprint arXiv:2004.11207,

work page arXiv 2004
[10]

Weinberger

[HSL+16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 646–661, Cham,

work page 2016
[11]

Categorical reparameterization with gumbel- softmax

[JGP17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel- softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,

work page 2017
[12]

Kingma and Max Welling

[KW14] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014 ,

work page 2014
[13]

https://arxiv.org/abs/2103.14030 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N

[LLC+21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,

work page arXiv
[14]

Representation Learning with Contrastive Predictive Coding

[OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Zero-Shot Text-to-Image Generation

[RPG+21] A. Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Rad- ford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092,

work page internal anchor Pith review arXiv
[16]

Training data-efficient image transformers & distillation through attention,

Association for Computational Linguistics. [TCD+20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablay- rolles, and Hervé Jégou. Training data-efﬁcient image transformers & distillation through attention. preprint arXiv:2012.12877,

work page arXiv 2012
[17]

Going deeper with image transformers

[TCS+21] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239,

work page arXiv
[18]

Selﬁe: Self-supervised pretraining for image embedding

[TLL19] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selﬁe: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940,

work page arXiv 1906
[19]

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N

Curran Associates Inc. [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors,Advances in Neural Information Processi...

work page 2017
[20]

Self- supervised learning with swin transformers

[XLY+21] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553,

work page arXiv
[21]

Scaling vision transformers, 2022

[ZKHB21] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560,

work page arXiv
[22]

[ZLZ+20] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. CoRR, abs/2012.15840,

work page arXiv 2012
[23]

*: result is taken from [CXH21]

The results, unless otherwise indicated, are all obtained by base-size models. *: result is taken from [CXH21]. G Hyperparameters for Pre-Training Hyperparameters Base Size Large Size Layers 12 24 Hidden size 768 1024 FFN inner hidden size 3072 4096 Attention heads 12 16 Attention head size 64 Patch size 16 × 16 Training epochs 800 Batch size 2048 Adamϵ 1...

work page 2048