Recognition: 2 theorem links · Lean Theorem
BEiT: BERT Pre-Training of Image Transformers
Pith reviewed 2026-05-13 11:44 UTC · model grok-4.3
The pith
BEiT pre-trains vision transformers by recovering discrete visual tokens from masked image patches, reaching 83.2% ImageNet-1K accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEiT pre-trains a vision transformer encoder by feeding it corrupted images consisting of visible patches plus mask tokens, then requiring it to reconstruct the discrete visual tokens that a separate tokenizer assigned to the original full image. The same encoder weights are later fine-tuned directly on downstream tasks without further architectural changes.
What carries the argument
A masked image modeling (MIM) objective: given an image in which a random subset of patches has been masked, the encoder must predict the discrete visual tokens at the masked positions.
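The objective can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the patch features, the stand-in tokenizer output, the all-zero mask token (a learnable embedding in a real model), the mask count, and the uniform "model" logits are all hypothetical placeholders.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def corrupt(patches, masked, mask_token):
    """Replace every patch at a masked position with the shared mask token."""
    return [mask_token if i in masked else p for i, p in enumerate(patches)]


def mim_loss(logits_per_position, visual_tokens, masked):
    """Mean cross-entropy against the tokenizer's tokens, masked positions only."""
    losses = []
    for i in sorted(masked):
        probs = softmax(logits_per_position[i])
        losses.append(-math.log(probs[visual_tokens[i]]))
    return sum(losses) / len(losses)


rng = random.Random(0)
num_patches, vocab_size, patch_dim = 8, 5, 4  # toy sizes (assumption)

patches = [[rng.random() for _ in range(patch_dim)] for _ in range(num_patches)]
# Stand-in for the separate tokenizer: one discrete token id per patch.
visual_tokens = [rng.randrange(vocab_size) for _ in range(num_patches)]
# Randomly chosen masked positions; the count of 3 is an assumption.
masked = set(rng.sample(range(num_patches), 3))
mask_token = [0.0] * patch_dim  # hypothetical placeholder for a learned embedding

corrupted = corrupt(patches, masked, mask_token)
# Stand-in for the transformer: uniform logits at every position.
logits = [[0.0] * vocab_size for _ in range(num_patches)]
loss = mim_loss(logits, visual_tokens, masked)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline an untrained predictor would sit at. Visible positions contribute nothing to the loss, matching the description that only the masked patches' tokens are recovered.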
If this is right
- Vision transformers can reach competitive ImageNet accuracy using only ImageNet-1K for pre-training instead of larger labeled collections.
- The same transformer backbone works for both the masked pre-training stage and subsequent fine-tuning on classification or segmentation.
- Larger models benefit more from this pre-training, as shown by the jump from base to large size on the same data.
- Semantic segmentation performance improves when the encoder has first learned to predict visual tokens from masked patches.
Where Pith is reading between the lines
- Better tokenizers could raise the upper bound on what the masked modeling signal can teach the transformer.
- The same masked-token recipe might transfer to video or audio by swapping in an appropriate tokenizer for those domains.
- Combining the token-prediction loss with other self-supervised objectives could produce even stronger starting weights for fine-tuning.
Load-bearing premise
The separate tokenizer must generate discrete visual tokens that carry rich semantic content rather than collapsing to low-level patterns.
What would settle it
A BEiT model fine-tuned on ImageNet-1K classification that matches or falls below the accuracy of an identically sized DeiT model trained from scratch would show the pre-training step added no value.
Original abstract
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
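The "image patches" view mentioned in the abstract can be illustrated with a small patch-splitting helper. This is a minimal sketch: the function name `patchify`, the single-channel list-of-rows image, and the 224 × 224 input size are assumptions made for the example; only the 16 × 16 patch size comes from the abstract.

```python
def patchify(image, patch_size=16):
    """Split an H x W image (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order, each as a list of rows.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][left:left + patch_size]
                     for r in range(top, top + patch_size)]
            patches.append(patch)
    return patches


# Synthetic 224 x 224 single-channel image (the size is an assumption).
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
patches = patchify(image)
num_patches = len(patches)  # 14 * 14 grid for a 224 x 224 input
```

Each patch then has a counterpart visual token from the tokenizer, so the two views line up position by position.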
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BEiT, a self-supervised pre-training method for vision Transformers that adapts the BERT masked modeling paradigm. Each image is tokenized into discrete visual tokens via a separately trained dVAE; random patches are masked and the Transformer is trained to recover the original visual tokens from the corrupted input. After pre-training, the encoder is fine-tuned on downstream tasks. Key empirical claims are that base-size BEiT reaches 83.2% top-1 accuracy on ImageNet-1K (outperforming from-scratch DeiT at 81.8%) and large-size BEiT reaches 86.3% using only ImageNet-1K data, exceeding supervised ViT-L pre-trained on ImageNet-22K (85.2%).
Significance. If the central assumption about the tokenizer holds, the work shows that a BERT-style masked token prediction objective can be transferred to vision Transformers and yields competitive or superior ImageNet performance with substantially less supervised data than prior supervised pre-training. The public release of code and models is a positive contribution to reproducibility.
major comments (3)
- [§3.2] §3.2 (Tokenizer and pre-training objective): the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.
- [Table 1] Table 1 (ImageNet-1K results): the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.
- [§4.2] §4.2 (Large-model comparison): the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.
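The error-bar concern in the Table 1 comment can be made concrete with a rough significance check from summary statistics. The sketch below uses Welch's t-statistic, which is a choice made here and not a test named on this page; the inputs are the mean and standard deviation figures quoted in the simulated rebuttal further down (83.2 ± 0.15 and 81.8 ± 0.20, three runs each).

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t-statistic and approximate degrees of freedom from summary stats."""
    se_sq_a = std_a ** 2 / n_a  # squared standard error of mean A
    se_sq_b = std_b ** 2 / n_b  # squared standard error of mean B
    t_stat = (mean_a - mean_b) / math.sqrt(se_sq_a + se_sq_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_sq_a + se_sq_b) ** 2 / (
        se_sq_a ** 2 / (n_a - 1) + se_sq_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Figures as quoted in the simulated rebuttal: BEiT base vs. DeiT, 3 runs each.
t_stat, dof = welch_t(83.2, 0.15, 3, 81.8, 0.20, 3)
```

With these inputs the statistic comes out near 9.7 on roughly 3.7 degrees of freedom. Three runs per arm is a small sample, so this is best read as a sanity check on the size of the gap relative to run-to-run noise.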
minor comments (2)
- [Abstract] Abstract: the phrase 'competitive results with previous pre-training methods' is vague; listing the main baselines (DeiT, ViT, etc.) would improve clarity.
- [§2.1] §2.1: notation for 'visual tokens' versus standard ViT patch embeddings is introduced without a clear notational distinction, which can confuse readers familiar with the ViT paper.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and valuable suggestions. We have revised the manuscript to address the major comments and provide additional details and experiments where appropriate.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Tokenizer and pre-training objective): the masked modeling loss is defined over discrete visual tokens produced by a separately trained dVAE; no ablation is reported on codebook size, tokenizer training data, or alternative tokenizers. This leaves open whether the reported gains (e.g., +1.4% over DeiT) are driven by the MIM objective itself or by tokenizer-specific properties.
Authors: We thank the referee for highlighting this aspect. The dVAE is trained on ImageNet-1K following the original dVAE paper, and serves as a fixed discretization step. To address the concern, we have performed additional ablations on codebook size (1024, 2048, 4096, 8192) and included the results in the revised Section 3.2. The ImageNet accuracy varies by at most 0.4% across these sizes, supporting that the MIM objective is the key contributor to the performance gains over DeiT. We have also added a discussion on why dVAE was chosen over other tokenization methods.
Revision: yes
-
Referee: [Table 1] Table 1 (ImageNet-1K results): the headline accuracies (83.2% base, 86.3% large) are given as single-point estimates without error bars, standard deviations, or the number of independent runs, making it impossible to judge whether the improvement over DeiT is statistically reliable.
Authors: We agree that reporting statistical reliability is important for such claims. In the revised manuscript, we have updated Table 1 to include the mean accuracy and standard deviation computed over three independent runs with different random seeds. For the base model, BEiT achieves 83.2% ± 0.15%, compared to DeiT's 81.8% ± 0.20%. The improvement is consistent across runs.
Revision: yes
-
Referee: [§4.2] §4.2 (Large-model comparison): the claim that large BEiT (86.3%) outperforms ViT-L supervised on ImageNet-22K (85.2%) requires explicit confirmation that fine-tuning protocols, data augmentations, and optimizer settings are identical; any mismatch would undermine the cross-pretraining comparison.
Authors: We confirm that the fine-tuning protocol for BEiT-Large is exactly the same as that used for the supervised ViT-Large in the original ViT work, including identical data augmentations (RandAugment, Mixup, CutMix), optimizer (AdamW with the same hyperparameters), learning rate schedule, and number of epochs. We have added an explicit statement and a reference to the exact settings from Dosovitskiy et al. in the revised Section 4.2 to clarify this.
Revision: yes
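The protocol-identity claim in the last response is the kind of statement that can be checked mechanically once both fine-tuning configurations are written down. The sketch below is hypothetical throughout: the helper `protocol_mismatches`, the setting keys, and the placeholder schedule names are invented for illustration, and only the augmentation and optimizer names are taken from the response itself.

```python
def protocol_mismatches(protocol_a, protocol_b):
    """Return {setting: (value_a, value_b)} for every setting that differs.

    A setting present in only one protocol is reported with None on the
    other side.
    """
    diffs = {}
    for key in sorted(set(protocol_a) | set(protocol_b)):
        value_a = protocol_a.get(key)
        value_b = protocol_b.get(key)
        if value_a != value_b:
            diffs[key] = (value_a, value_b)
    return diffs


# Placeholder configurations; values are illustrative, not reported settings.
beit_large = {
    "optimizer": "AdamW",
    "augmentations": ("RandAugment", "Mixup", "CutMix"),
    "lr_schedule": "schedule_A",
}
vit_large = {
    "optimizer": "AdamW",
    "augmentations": ("RandAugment", "Mixup", "CutMix"),
    "lr_schedule": "schedule_B",
}

mismatches = protocol_mismatches(beit_large, vit_large)
```

An empty result would support the "identical protocol" statement; any entry in the result names a setting the comparison would need to control for.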
Circularity Check
BEiT pre-training objective is independently defined and externally validated
Full rationale
The paper defines its masked image modeling task as recovering discrete visual tokens produced by a separately trained tokenizer, with the objective stated independently of any downstream metrics. Reported gains (e.g., 83.2% base BEiT vs. 81.8% DeiT on ImageNet-1K) are empirical results from fine-tuning on standard held-out benchmarks, not reductions of the claimed performance to the pre-training inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described pipeline; the tokenizer is an external component whose quality is not derived from BEiT equations.
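One way to write the objective described in this rationale is as a log-likelihood over masked positions. The notation is chosen for this sketch: $\mathcal{D}$ is the unlabeled training set, $\mathcal{M}$ the set of masked patch positions, $x^{\mathcal{M}}$ the corrupted image, and $z_i$ the visual token the tokenizer assigns to position $i$.

```latex
\max \; \sum_{x \in \mathcal{D}} \mathbb{E}_{\mathcal{M}}
  \left[ \sum_{i \in \mathcal{M}} \log p_{\mathrm{MIM}}\!\left( z_i \mid x^{\mathcal{M}} \right) \right]
```

Nothing in this expression refers to downstream labels or accuracy, which is the sense in which the pre-training target is defined independently of the fine-tuning results.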
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: a separately trained tokenizer produces discrete visual tokens that are a suitable prediction target for masked image modeling.
invented entities (1)
- visual tokens: no independent evidence
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear · base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%)
Forward citations
Cited by 26 Pith papers
-
TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
-
Rethink MAE with Linear Time-Invariant Dynamics
Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.
-
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT achieves 82.3% linear probing accuracy and 87.8% fine-tuning accuracy on ImageNet-1K using masked image modeling with a jointly trained online tokenizer.
-
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
-
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
-
Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis
ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 63...
-
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
MAEPose is a masked autoencoder that learns spatiotemporal representations from unlabeled mmWave radar videos to estimate human poses, outperforming baselines by up to 22.1% in MPJPE.
-
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
BrainDINO delivers a single self-supervised brain MRI representation that generalizes to tumor segmentation, disease classification, brain age estimation, and other tasks without volumetric pretraining or full fine-tuning.
-
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
VFM⁴SDG uses a frozen vision foundation model to inject cross-domain stability priors into both the encoding and decoding stages of object detectors, reducing missed detections in unseen environments.
-
Image Generators are Generalist Vision Learners
Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
-
When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs
MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.
-
Rapidly deploying on-device eye tracking by distilling visual foundation models
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
EVA-CLIP: Improved Training Techniques for CLIP at Scale
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
Sapiens2
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
-
Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.
-
PRAGMA: Revolut Foundation Model
PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
-
Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis
New public dataset and VLM-guided flow matching segmentation combined with random matrix theory anomaly detection for interpretable canine pneumothorax diagnosis.
-
NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild
The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection models.
-
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.
Reference graph
Works this paper leans on
-
[1]
UniLMv2: Pseudo-masked language models for unified language model pre-training
[BDW+20] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning R...
-
[2]
Improved Baselines with Momentum Contrastive Learning
[CFGH20] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. preprint arXiv:2003.04297,
-
[3]
Exploring simple siamese representation learning
[CH20] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. preprint arXiv:2011.10566,
-
[4]
A Simple Framework for Contrastive Learning of Visual Representations
[CKNH20] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709,
-
[5]
Emerging Properties in Self-Supervised Vision Transformers
[CTM+21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294,
-
[6]
An empirical study of training self-supervised vision transformers
[CXH21] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. ArXiv, abs/2104.02057,
-
[7]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[DBK+20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. preprint arXiv:2010.11929,
-
[8]
BERT: pre-training of deep bidirectional transformers for language understanding
[DCLT19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational ...
-
[9]
Self-attention attribution: Interpreting information interactions inside Transformer
[HDWX20] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Self-attention attribution: Interpreting information interactions inside Transformer. arXiv preprint arXiv:2004.11207,
-
[10]
[HSL+16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 646–661, Cham,
-
[11]
Categorical reparameterization with gumbel-softmax
[JGP17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net,
-
[12]
[KW14] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014,
-
[13]
[LLC+21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,
-
[14]
Representation Learning with Contrastive Predictive Coding
[OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. preprint arXiv:1807.03748,
-
[15]
Zero-Shot Text-to-Image Generation
[RPG+21] A. Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092,
-
[16]
Training data-efficient image transformers & distillation through attention
[TCD+20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. preprint arXiv:2012.12877,
-
[17]
Going deeper with image transformers
[TCS+21] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239,
-
[18]
Selfie: Self-supervised pretraining for image embedding
[TLL19] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940,
-
[19]
Attention is all you need
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processi...
-
[20]
Self-supervised learning with swin transformers
[XLY+21] Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and Han Hu. Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553,
-
[21]
Scaling vision transformers
[ZKHB21] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560,
- [22]
-
[23]
*: result is taken from [CXH21]
The results, unless otherwise indicated, are all obtained by base-size models. *: result is taken from [CXH21].
G Hyperparameters for Pre-Training
Hyperparameters: Base Size / Large Size
Layers: 12 / 24
Hidden size: 768 / 1024
FFN inner hidden size: 3072 / 4096
Attention heads: 12 / 16
Attention head size: 64
Patch size: 16 × 16
Training epochs: 800
Batch size: 2048
Adam ϵ: 1...
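The architecture figures in the hyperparameter excerpt above can be captured as a small config object with internal consistency checks. This is a sketch under assumptions: the class name `EncoderConfig`, its method names, and the 224-pixel input side used in the usage lines are hypothetical; the numeric values are the ones listed in the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Transformer encoder sizes as listed in the pre-training hyperparameters."""
    layers: int
    hidden_size: int
    ffn_inner_size: int
    attention_heads: int
    attention_head_size: int = 64
    patch_size: int = 16

    def head_dims_consistent(self):
        """True when heads * head size reproduces the hidden size."""
        return self.attention_heads * self.attention_head_size == self.hidden_size

    def num_patches(self, image_size):
        """Number of patches for a square image with the given side length."""
        if image_size % self.patch_size:
            raise ValueError("image_size must be a multiple of patch_size")
        side = image_size // self.patch_size
        return side * side


BASE = EncoderConfig(layers=12, hidden_size=768, ffn_inner_size=3072,
                     attention_heads=12)
LARGE = EncoderConfig(layers=24, hidden_size=1024, ffn_inner_size=4096,
                      attention_heads=16)

# 224 is an assumed input side length, not a value from the excerpt.
base_patches = BASE.num_patches(224)
```

Both listed configurations pass the head-size check, and in both the FFN inner size is four times the hidden size.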