pith. machine review for the scientific record.

arxiv: 2104.14294 · v2 · submitted 2021-04-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Emerging Properties in Self-Supervised Vision Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 13:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · vision transformers · semantic segmentation · DINO · k-nearest neighbors · ImageNet · self-distillation · linear evaluation

The pith

Self-supervised Vision Transformers encode explicit semantic segmentation information in their features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks whether self-supervised learning imparts distinctive properties to Vision Transformers that convolutional networks lack. It demonstrates that features from self-supervised ViTs carry explicit information about the semantic segmentation of an image, a signal that is weaker in both supervised ViTs and convnets. The same features also function as strong k-nearest-neighbor classifiers, reaching 78.3 percent top-1 accuracy on ImageNet with a small ViT. Building on these observations, the authors introduce DINO, a label-free self-distillation procedure that reaches 80.1 percent top-1 accuracy in linear evaluation with ViT-Base, relying on a momentum encoder, multi-crop training, and small patches.
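The k-NN claim is a frozen-feature evaluation: no fine-tuning, just a weighted vote among the nearest training embeddings. Below is a minimal sketch of the weighted k-NN protocol the paper describes (20 neighbors and a similarity temperature, following Wu et al.); the function name and toy data are illustrative, not the released evaluation code.

    import torch
    import torch.nn.functional as F

    def weighted_knn_predict(train_feats, train_labels, test_feats,
                             num_classes, k=20, temperature=0.07):
        """Weighted k-NN over L2-normalized features: each of the k nearest
        training features votes for its class with weight
        exp(cosine_similarity / temperature)."""
        train_feats = F.normalize(train_feats, dim=1)
        test_feats = F.normalize(test_feats, dim=1)

        sims = test_feats @ train_feats.T              # (n_test, n_train) cosine similarities
        topk_sims, topk_idx = sims.topk(k, dim=1)      # k nearest neighbors per test image
        topk_labels = train_labels[topk_idx]           # (n_test, k) neighbor labels

        weights = (topk_sims / temperature).exp()      # similarity-weighted votes
        votes = torch.zeros(test_feats.size(0), num_classes)
        votes.scatter_add_(1, topk_labels, weights)    # accumulate votes per class
        return votes.argmax(dim=1)

    # Toy usage: random vectors stand in for frozen ViT embeddings.
    train_feats = torch.randn(1000, 384)
    train_labels = torch.randint(0, 10, (1000,))
    test_feats = torch.randn(50, 384)
    preds = weighted_knn_predict(train_feats, train_labels, test_feats, num_classes=10)

The appeal of this protocol is that it exercises the feature space directly: no weights are learned on top of the backbone, so a high score is evidence about the representation itself.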

Core claim

Self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. These features are also excellent k-NN classifiers, reaching 78.3 percent top-1 on ImageNet with a small ViT. The authors implement their findings into DINO, a form of self-distillation with no labels, and show the synergy between DINO and ViTs by achieving 80.1 percent top-1 on ImageNet in linear evaluation with ViT-Base, while underlining the importance of momentum encoder, multi-crop training, and small patches.

What carries the argument

DINO, a label-free self-distillation procedure applied to Vision Transformers that uses a momentum encoder and multi-crop augmentation to produce the observed features.
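To make that machinery concrete, here is a minimal sketch of one self-distillation step as the paper describes it: a student network sees one crop, an exponential-moving-average (EMA) teacher sees another, the teacher output is centered and sharpened, and the student minimizes cross-entropy against it. Linear layers stand in for the ViT backbone plus projection head and all schedules are omitted; a sketch under those simplifications, not the reference implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def update_teacher(student, teacher, momentum=0.996):
        # EMA update: teacher parameters slowly track the student's.
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

    def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
        # Teacher targets: centered, then sharpened with a low temperature.
        t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
        # Student: tempered log-softmax; cross-entropy against the teacher.
        s = F.log_softmax(student_out / tau_s, dim=-1)
        return -(t * s).sum(dim=-1).mean()

    dim, out_dim = 384, 1024
    student = torch.nn.Linear(dim, out_dim)   # stand-in for ViT + projection head
    teacher = torch.nn.Linear(dim, out_dim)
    teacher.load_state_dict(student.state_dict())
    for p in teacher.parameters():
        p.requires_grad_(False)               # teacher is never trained by gradients

    center = torch.zeros(out_dim)
    opt = torch.optim.SGD(student.parameters(), lr=0.1)

    global_crop, local_crop = torch.randn(8, dim), torch.randn(8, dim)
    loss = dino_loss(student(local_crop), teacher(global_crop), center)
    opt.zero_grad()
    loss.backward()
    opt.step()
    update_teacher(student, teacher)
    # The center is itself an EMA of teacher outputs; together with
    # sharpening it counteracts collapse to a uniform or one-hot solution.
    center = 0.9 * center + 0.1 * teacher(global_crop).mean(dim=0)

In the full method both networks see multiple global and local crops and the loss is averaged over student-teacher crop pairs; the sketch keeps one pair to expose the update order.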

Load-bearing premise

The semantic segmentation signal and strong k-NN performance arise specifically from the interaction of self-supervision with the ViT architecture rather than from hyperparameter choices, dataset statistics, or evaluation protocols alone.

What would settle it

Train an identical ViT architecture in a fully supervised manner using the same momentum encoder, multi-crop strategy, and small patch size, then measure whether the explicit semantic segmentation maps in the features disappear or weaken substantially.
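A hypothetical probe for that experiment: extract each model's [CLS]-to-patch self-attention, binarize it by keeping a fixed fraction of the attention mass (the paper keeps 60 percent for its attention masks), and score the result against ground-truth masks with Jaccard similarity. The function names and data plumbing here are illustrative:

    import torch

    def attention_mask(cls_attn, keep_mass=0.6):
        # Keep the smallest set of patches holding `keep_mass` of the
        # [CLS] attention probability mass; return a boolean patch mask.
        probs, idx = cls_attn.sort(descending=True)
        kept = idx[probs.cumsum(dim=-1) <= keep_mass]
        mask = torch.zeros_like(cls_attn, dtype=torch.bool)
        mask[kept] = True
        return mask

    def jaccard(pred, gt):
        inter = (pred & gt).sum().float()
        union = (pred | gt).sum().float()
        return (inter / union.clamp(min=1)).item()

    def probe(cls_attns, gt_masks):
        # cls_attns: per-image [CLS]->patch attention (one head, or a head
        # average), flattened to (num_patches,); gt_masks: matching boolean
        # patch-level ground-truth masks.
        scores = [jaccard(attention_mask(a), m) for a, m in zip(cls_attns, gt_masks)]
        return sum(scores) / len(scores)

    # The settling comparison: identical probing, two training regimes.
    # `dino_attns` and `supervised_attns` would come from models trained
    # under the same augmentations, schedule, and patch size, differing
    # only in the use of labels:
    # delta = probe(dino_attns, gt_masks) - probe(supervised_attns, gt_masks)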

Original abstract

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DINO, a self-supervised learning method for Vision Transformers based on self-distillation without labels. It claims that self-supervised ViT features contain explicit semantic segmentation information that does not emerge as clearly in supervised ViTs or convnets, that these features yield strong k-NN classifiers (78.3% top-1 on ImageNet with a small ViT), and that momentum encoder, multi-crop training, and small patches are critical. The authors demonstrate the synergy by achieving 80.1% top-1 linear evaluation accuracy on ImageNet with ViT-Base.

Significance. If the central empirical claims hold under matched training protocols, the work provides valuable evidence for architectural synergies between self-supervision and transformers, offering new insights into emergent properties of representations and practical gains for downstream tasks such as unsupervised semantic segmentation. The detailed ablations on momentum, multi-crop, and patch size, together with the release of code, strengthen the contribution.

major comments (2)
  1. [§4] Experiments and associated baseline descriptions: the central claim that semantic segmentation information arises specifically from the self-supervision + ViT interaction (rather than from the multi-crop/momentum/small-patch regime) requires explicit confirmation that the supervised ViT baseline was trained under an identical augmentation policy, optimizer schedule, and patch size. Without this matching, the attribution to supervision type versus training protocol remains confounded.
  2. [Table 1, §4.2] k-NN evaluation: the reported 78.3% and 80.1% top-1 figures are presented without error bars, standard deviations across runs, or full hyperparameter details for the supervised ViT and convnet baselines; this weakens the performance comparison that underpins the emerging-properties narrative.
minor comments (2)
  1. [Abstract] The specific accuracy numbers (78.3%, 80.1%) should include forward references to the corresponding tables or sections for immediate verifiability.
  2. [Figures 3-5] Figure captions (e.g., segmentation visualizations): add quantitative metrics such as mIoU on a held-out set, or at least the exact inference procedure used to generate the maps (a generic mIoU computation is sketched below).
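On minor comment 2, the requested metric is standard: per-class intersection-over-union, averaged over the classes present. A generic sketch of the computation (not the paper's evaluation script):

    import numpy as np

    def mean_iou(pred, gt, num_classes, ignore_index=255):
        # Mean intersection-over-union between integer label maps.
        valid = gt != ignore_index
        ious = []
        for c in range(num_classes):
            p, g = (pred == c) & valid, (gt == c) & valid
            union = (p | g).sum()
            if union == 0:
                continue  # class absent from both maps; skip it
            ious.append((p & g).sum() / union)
        return float(np.mean(ious))

    # Toy example: 4-class segmentation on an 8x8 label map.
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 4, size=(8, 8))
    pred = gt.copy()
    pred[0, :4] = (pred[0, :4] + 1) % 4   # perturb a few pixels
    print(f"mIoU = {mean_iou(pred, gt, num_classes=4):.3f}")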

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [§4] Experiments and associated baseline descriptions: the central claim that semantic segmentation information arises specifically from the self-supervision + ViT interaction (rather than from the multi-crop/momentum/small-patch regime) requires explicit confirmation that the supervised ViT baseline was trained under an identical augmentation policy, optimizer schedule, and patch size. Without this matching, the attribution to supervision type versus training protocol remains confounded.

    Authors: We acknowledge that the supervised ViT baseline follows the standard protocol from Dosovitskiy et al. (different augmentations, no multi-crop, different optimizer schedule) rather than an exact match to the DINO regime. Our central claim concerns the emergence of semantic segmentation in self-supervised ViTs relative to standard supervised training of both ViTs and convnets, which is the typical comparison in the literature. In the revised manuscript we have expanded §4 to explicitly list the training protocol, augmentation policy, and patch size for every baseline, added a clarifying paragraph on why exact matching across supervised and self-supervised regimes is not always feasible, and included an additional ablation training a supervised ViT with multi-crop augmentations to further isolate the effect of supervision type. revision: yes

  2. Referee: [Table 1, §4.2] k-NN evaluation: the reported 78.3% and 80.1% top-1 figures are presented without error bars, standard deviations across runs, or full hyperparameter details for the supervised ViT and convnet baselines; this weakens the performance comparison that underpins the emerging-properties narrative.

    Authors: We agree that variability measures and fuller hyperparameter disclosure would strengthen the comparisons. In the revised version we report standard deviations over three independent runs for all DINO k-NN and linear-evaluation numbers. We have moved the complete hyperparameter tables for every baseline (including our re-implementations of supervised ViT and convnet models) to the appendix and added a note in §4.2 clarifying which baseline numbers are taken from the original papers versus re-run under our evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations from training runs

Full rationale

The paper reports direct experimental results: self-supervised ViT features exhibit semantic segmentation properties (absent or weaker in supervised ViTs and convnets), strong k-NN classification (78.3% top-1), and the DINO method achieves 80.1% linear evaluation. These quantities are measured outputs of training and evaluation protocols, not quantities derived from equations that reduce to fitted inputs or self-citations by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear; the emphasis on momentum encoder, multi-crop, and small patches is presented as empirical findings rather than a mathematical derivation. The paper grounds its claims in external benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work to force its conclusions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical adaptation of self-supervised objectives to the ViT architecture; key free choices include momentum encoder, multi-crop strategy, and patch size, which are treated as important but not derived from first principles.

free parameters (2)
  • momentum coefficient
    Momentum encoder is stated as important; the exact coefficient is a training hyperparameter, not stated in the abstract or derived from first principles (its schedule is sketched after this ledger).
  • patch size
    Use of small patches is highlighted as beneficial for ViTs; the choice is empirical rather than theoretically fixed.
axioms (1)
  • domain assumption: Self-supervised learning objectives can be directly adapted to Vision Transformer architectures
    The entire study presupposes successful adaptation of SSL methods to ViT before observing the new properties.
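On the momentum coefficient flagged in the ledger: in the paper, the teacher momentum is not a single fixed value but follows a cosine schedule from 0.996 toward 1 over training. A sketch of that schedule, with an illustrative helper name:

    import math

    def teacher_momentum(step, total_steps, base=0.996, final=1.0):
        # Cosine ramp of the EMA momentum from `base` to `final`.
        cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
        return final - (final - base) * cos

    # Early in training the teacher tracks the student quickly;
    # late in training it is nearly frozen.
    for step in (0, 5000, 10000):
        print(step, round(teacher_momentum(step, total_steps=10000), 5))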

pith-pipeline@v0.9.0 · 5495 in / 1374 out tokens · 42987 ms · 2026-05-16T13:59:46.222058+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

  2. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  3. ProtoSSL: Interpretable Prototype Learning from Unlabeled Time-Series Data

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    ProtoSSL discovers generalizable prototypes from unlabeled time-series via self-supervision and assigns them to new tasks for interpretable predictions, outperforming supervised baselines in low-data regimes on ECG datasets.

  4. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  5. Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A framework uses modality-agnostic prompts to adapt SAM for multi-modal camouflaged object detection, with a mask refine module for better boundaries.

  6. BEiT: BERT Pre-Training of Image Transformers

    cs.CV · 2021-06 · conditional · novelty 7.0

    BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.

  7. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  8. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  9. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  10. Self-supervised Pretraining of Cell Segmentation Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

  11. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  12. Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks

    cs.CV · 2026-04 · conditional · novelty 6.0

    VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...

  13. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO · 2025-12 · unverdicted · novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  14. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV · 2024-02 · conditional · novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  15. Atlas: Few-shot Learning with Retrieval Augmented Language Models

    cs.CL · 2022-08 · unverdicted · novelty 6.0

    Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

  16. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR · 2021-12 · unverdicted · novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  17. FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

    cs.RO · 2026-05 · unverdicted · novelty 5.0

    FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.

  18. PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs

    cs.CV · 2026-02 · unverdicted · novelty 5.0

    PANC augments Normalized Cut with anchor-augmented token graphs using priors to steer spectral partitions, yielding mIoU gains of 2.3-8.7% over baselines on DUTS-TE, DUT-OMRON, and CrackForest.

  19. Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

    cs.CV · 2026-04 · unverdicted · novelty 4.0

    DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 18 Pith papers · 17 internal anchors

  1. [1] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
  2. [2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
  3. [3] Mahmoud Assran, Nicolas Ballas, Lluis Castrejon, and Michael Rabbat. Recovering petaflops in contrastive semi-supervised learning of visual representations. arXiv preprint arXiv:2006.10803, 2020.
  4. [4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  5. [5] Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos, and Matthijs Douze. MultiGrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509, 2019.
  6. [6] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In ICML, 2017.
  7. [7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In SIGKDD, 2006.
  8. [8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  9. [9] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
  10. [10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
  11. [11] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.
  12. [12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  13. [13] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
  14. [14] Weijie Chen, Shiliang Pu, Di Xie, Shicai Yang, Yilu Guo, and Luojun Lin. Unsupervised image classification for deep representation learning. arXiv preprint arXiv:2006.11480, 2020.
  15. [15] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  16. [16] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
  17. [17] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
  18. [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  19. [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  20. [20] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI, 2016.
  21. [21] Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg, and Cordelia Schmid. Evaluation of gist descriptors for web-scale image search. In CIVR, 2009.
  22. [22] Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, and Hervé Jégou. Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644, 2021.
  23. [23] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. arXiv preprint arXiv:2007.06346, 2020.
  24. [24] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
  25. [25] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.
  26. [26] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In CVPR, 2020.
  27. [27] Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, and Patrick Pérez. Online bag-of-visual-words generation for unsupervised representation learning. arXiv preprint arXiv:2012.11552, 2020.
  28. [28] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021.
  29. [29] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  30. [30] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
  31. [31] Shir Gur, Ameen Ali, and Lior Wolf. Visualization of supervised and self-supervised neural networks via attribution guided factorization. arXiv preprint arXiv:2012.02166, 2020.
  32. [32] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  33. [33] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  34. [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  35. [35] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  36. [36] Jiabo Huang, Qi Dong, Shaogang Gong, and Xiatian Zhu. Unsupervised deep learning by neighbourhood discovery. In ICML, 2019.
  37. [37] Allan Jabri, Andrew Owens, and Alexei A. Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020.
  38. [38] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
  39. [39] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.
  40. [40] Zihang Lai, Erika Lu, and Weidi Xie. MAST: A memory-augmented self-supervised tracker. In CVPR, 2020.
  41. [41] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
  42. [42] Junnan Li, Pan Zhou, Caiming Xiong, and Steven C. H. Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2021.
  43. [43] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  44. [44] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 2018.
  45. [45] Julien Mairal. Cyanure: An open-source toolbox for empirical risk minimization for Python, C++, and soon more. arXiv preprint arXiv:1912.08165, 2019.
  46. [46] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
  47. [47] Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
  48. [48] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
  49. [49] Hieu Pham, Qizhe Xie, Zihang Dai, and Quoc V. Le. Meta pseudo labels. arXiv preprint arXiv:2003.10580, 2020.
  50. [50] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
  51. [51] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  52. [52] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  53. [53] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
  54. [54] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN image retrieval with no human annotation. TPAMI, 2018.
  55. [55] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  56. [56] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020.
  57. [57] Jerome Revaud, Jon Almazán, Rafael S. Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In ICCV, 2019.
  58. [58] Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, et al. BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241, 2020.
  59. [59] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, 1988.
  60. [60] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
  61. [61] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
  62. [62] Mert Bulent Sariyildiz, Yannis Kalantidis, Diane Larlus, and Karteek Alahari. Concept generalization in visual representation learning. arXiv preprint arXiv:2012.05649, 2020.
  63. [63] Zhiqiang Shen, Zechun Liu, Jie Qin, Lei Huang, Kwang-Ting Cheng, and Marios Savvides. S2-BNN: Bridging the gap between self-supervised real and 1-bit neural networks via guided distribution calibration. arXiv preprint arXiv:2102.08946, 2021.
  64. [64] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
  65. [65] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780, 2017.
  66. [66] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
  67. [67] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. In NeurIPS, 2020.
  68. [68] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879, 2015.
  69. [69] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
  70. [70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  71. [71] Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
  72. [72] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020.
  73. [73] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  74. [74] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
  75. [75] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2020.
  76. [76] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In CVPR, 2020.
  77. [77] Haohang Xu, Xiaopeng Zhang, Hao Li, Lingxi Xie, Hongkai Xiong, and Qi Tian. Seed the views: Hierarchical semantic alignment for contrastive representation learning. arXiv preprint arXiv:2012.02733, 2021.
  78. [78] Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, and Ronan Collobert. Iterative pseudo-labeling for speech recognition. arXiv preprint arXiv:2005.09267, 2020.
  79. [79] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
  80. [80] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, 2016.

Showing first 80 references.