pith. machine review for the scientific record.

arxiv: 2104.14294 · v2 · submitted 2021-04-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Emerging Properties in Self-Supervised Vision Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 13:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · vision transformers · semantic segmentation · DINO · k-nearest neighbors · ImageNet · self-distillation · linear evaluation

The pith

Self-supervised Vision Transformers encode explicit semantic segmentation information in their features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks whether self-supervised learning imparts distinctive properties to Vision Transformers that convolutional networks lack. It demonstrates that features from self-supervised ViTs carry explicit information about the semantic segmentation of an image, a signal that is weaker in both supervised ViTs and convnets. The same features also function as strong k-nearest-neighbor classifiers, reaching 78.3 percent top-1 accuracy on ImageNet with a small ViT. Building on these observations, the authors introduce DINO, a label-free self-distillation procedure that reaches 80.1 percent top-1 accuracy in linear evaluation with ViT-Base, relying on a momentum encoder, multi-crop training, and small patches.
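The k-NN claim is a frozen-feature evaluation: no fine-tuning, just a weighted vote among the nearest training embeddings. Below is a minimal sketch of the weighted k-NN protocol the paper describes (20 neighbors and a similarity temperature, following Wu et al.); the function name and toy data are illustrative, not the released evaluation code.

    import torch
    import torch.nn.functional as F

    def weighted_knn_predict(train_feats, train_labels, test_feats,
                             num_classes, k=20, temperature=0.07):
        """Weighted k-NN over L2-normalized features: each of the k nearest
        training features votes for its class with weight
        exp(cosine_similarity / temperature)."""
        train_feats = F.normalize(train_feats, dim=1)
        test_feats = F.normalize(test_feats, dim=1)

        sims = test_feats @ train_feats.T              # (n_test, n_train) cosine similarities
        topk_sims, topk_idx = sims.topk(k, dim=1)      # k nearest neighbors per test image
        topk_labels = train_labels[topk_idx]           # (n_test, k) neighbor labels

        weights = (topk_sims / temperature).exp()      # similarity-weighted votes
        votes = torch.zeros(test_feats.size(0), num_classes)
        votes.scatter_add_(1, topk_labels, weights)    # accumulate votes per class
        return votes.argmax(dim=1)

    # Toy usage: random vectors stand in for frozen ViT embeddings.
    train_feats = torch.randn(1000, 384)
    train_labels = torch.randint(0, 10, (1000,))
    test_feats = torch.randn(50, 384)
    preds = weighted_knn_predict(train_feats, train_labels, test_feats, num_classes=10)

The appeal of this protocol is that it exercises the feature space directly: no weights are learned on top of the backbone, so a high score is evidence about the representation itself.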

Core claim

Self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. These features are also excellent k-NN classifiers, reaching 78.3 percent top-1 on ImageNet with a small ViT. The authors implement their findings into DINO, a form of self-distillation with no labels, and show the synergy between DINO and ViTs by achieving 80.1 percent top-1 on ImageNet in linear evaluation with ViT-Base, while underlining the importance of momentum encoder, multi-crop training, and small patches.

What carries the argument

DINO, a label-free self-distillation procedure applied to Vision Transformers that uses a momentum encoder and multi-crop augmentation to produce the observed features.
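To make that machinery concrete, here is a minimal sketch of one self-distillation step as the paper describes it: a student network sees one crop, an exponential-moving-average (EMA) teacher sees another, the teacher output is centered and sharpened, and the student minimizes cross-entropy against it. Linear layers stand in for the ViT backbone plus projection head and all schedules are omitted; a sketch under those simplifications, not the reference implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def update_teacher(student, teacher, momentum=0.996):
        # EMA update: teacher parameters slowly track the student's.
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

    def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
        # Teacher targets: centered, then sharpened with a low temperature.
        t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
        # Student: tempered log-softmax; cross-entropy against the teacher.
        s = F.log_softmax(student_out / tau_s, dim=-1)
        return -(t * s).sum(dim=-1).mean()

    dim, out_dim = 384, 1024
    student = torch.nn.Linear(dim, out_dim)   # stand-in for ViT + projection head
    teacher = torch.nn.Linear(dim, out_dim)
    teacher.load_state_dict(student.state_dict())
    for p in teacher.parameters():
        p.requires_grad_(False)               # teacher is never trained by gradients

    center = torch.zeros(out_dim)
    opt = torch.optim.SGD(student.parameters(), lr=0.1)

    global_crop, local_crop = torch.randn(8, dim), torch.randn(8, dim)
    loss = dino_loss(student(local_crop), teacher(global_crop), center)
    opt.zero_grad()
    loss.backward()
    opt.step()
    update_teacher(student, teacher)
    # The center is itself an EMA of teacher outputs; together with
    # sharpening it counteracts collapse to a uniform or one-hot solution.
    center = 0.9 * center + 0.1 * teacher(global_crop).mean(dim=0)

In the full method both networks see multiple global and local crops and the loss is averaged over student-teacher crop pairs; the sketch keeps one pair to expose the update order.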

Load-bearing premise

The semantic segmentation signal and strong k-NN performance arise specifically from the interaction of self-supervision with the ViT architecture rather than from hyperparameter choices, dataset statistics, or evaluation protocols alone.

What would settle it

Train an identical ViT architecture in a fully supervised manner using the same momentum encoder, multi-crop strategy, and small patch size, then measure whether the explicit semantic segmentation maps in the features disappear or weaken substantially.
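A hypothetical probe for that experiment: extract each model's [CLS]-to-patch self-attention, binarize it by keeping a fixed fraction of the attention mass (the paper keeps 60 percent for its attention masks), and score the result against ground-truth masks with Jaccard similarity. The function names and data plumbing here are illustrative:

    import torch

    def attention_mask(cls_attn, keep_mass=0.6):
        # Keep the smallest set of patches holding `keep_mass` of the
        # [CLS] attention probability mass; return a boolean patch mask.
        probs, idx = cls_attn.sort(descending=True)
        kept = idx[probs.cumsum(dim=-1) <= keep_mass]
        mask = torch.zeros_like(cls_attn, dtype=torch.bool)
        mask[kept] = True
        return mask

    def jaccard(pred, gt):
        inter = (pred & gt).sum().float()
        union = (pred | gt).sum().float()
        return (inter / union.clamp(min=1)).item()

    def probe(cls_attns, gt_masks):
        # cls_attns: per-image [CLS]->patch attention (one head, or a head
        # average), flattened to (num_patches,); gt_masks: matching boolean
        # patch-level ground-truth masks.
        scores = [jaccard(attention_mask(a), m) for a, m in zip(cls_attns, gt_masks)]
        return sum(scores) / len(scores)

    # The settling comparison: identical probing, two training regimes.
    # `dino_attns` and `supervised_attns` would come from models trained
    # under the same augmentations, schedule, and patch size, differing
    # only in the use of labels:
    # delta = probe(dino_attns, gt_masks) - probe(supervised_attns, gt_masks)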

Original abstract

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DINO, a self-supervised learning method for Vision Transformers based on self-distillation without labels. It claims that self-supervised ViT features contain explicit semantic segmentation information that does not emerge as clearly in supervised ViTs or convnets, that these features yield strong k-NN classifiers (78.3% top-1 on ImageNet with a small ViT), and that momentum encoder, multi-crop training, and small patches are critical. The authors demonstrate the synergy by achieving 80.1% top-1 linear evaluation accuracy on ImageNet with ViT-Base.

Significance. If the central empirical claims hold under matched training protocols, the work provides valuable evidence for architectural synergies between self-supervision and transformers, offering new insights into emergent properties of representations and practical gains for downstream tasks such as unsupervised semantic segmentation. The detailed ablations on momentum, multi-crop, and patch size, together with the release of code, strengthen the contribution.

major comments (2)
  1. [§4] Experiments and associated baseline descriptions: the central claim that semantic segmentation information arises specifically from the self-supervision + ViT interaction (rather than from the multi-crop/momentum/small-patch regime) requires explicit confirmation that the supervised ViT baseline was trained under an identical augmentation policy, optimizer schedule, and patch size. Without this matching, the attribution to supervision type versus training protocol remains confounded.
  2. [Table 1, §4.2] k-NN evaluation: the reported 78.3% and 80.1% top-1 figures are presented without error bars, standard deviations across runs, or full hyperparameter details for the supervised ViT and convnet baselines; this weakens the performance comparison that underpins the emerging-properties narrative.
minor comments (2)
  1. [Abstract] The specific accuracy numbers (78.3%, 80.1%) should include forward references to the corresponding tables or sections for immediate verifiability.
  2. [Figures 3-5] Figure captions (e.g., segmentation visualizations): add quantitative metrics such as mIoU on a held-out set, or at least the exact inference procedure used to generate the maps (a generic mIoU computation is sketched below).
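On minor comment 2, the requested metric is standard: per-class intersection-over-union, averaged over the classes present. A generic sketch of the computation (not the paper's evaluation script):

    import numpy as np

    def mean_iou(pred, gt, num_classes, ignore_index=255):
        # Mean intersection-over-union between integer label maps.
        valid = gt != ignore_index
        ious = []
        for c in range(num_classes):
            p, g = (pred == c) & valid, (gt == c) & valid
            union = (p | g).sum()
            if union == 0:
                continue  # class absent from both maps; skip it
            ious.append((p & g).sum() / union)
        return float(np.mean(ious))

    # Toy example: 4-class segmentation on an 8x8 label map.
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 4, size=(8, 8))
    pred = gt.copy()
    pred[0, :4] = (pred[0, :4] + 1) % 4   # perturb a few pixels
    print(f"mIoU = {mean_iou(pred, gt, num_classes=4):.3f}")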

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [§4] Experiments and associated baseline descriptions: the central claim that semantic segmentation information arises specifically from the self-supervision + ViT interaction (rather than from the multi-crop/momentum/small-patch regime) requires explicit confirmation that the supervised ViT baseline was trained under an identical augmentation policy, optimizer schedule, and patch size. Without this matching, the attribution to supervision type versus training protocol remains confounded.

    Authors: We acknowledge that the supervised ViT baseline follows the standard protocol from Dosovitskiy et al. (different augmentations, no multi-crop, different optimizer schedule) rather than an exact match to the DINO regime. Our central claim concerns the emergence of semantic segmentation in self-supervised ViTs relative to standard supervised training of both ViTs and convnets, which is the typical comparison in the literature. In the revised manuscript we have expanded §4 to explicitly list the training protocol, augmentation policy, and patch size for every baseline, added a clarifying paragraph on why exact matching across supervised and self-supervised regimes is not always feasible, and included an additional ablation training a supervised ViT with multi-crop augmentations to further isolate the effect of supervision type. revision: yes

  2. Referee: [Table 1, §4.2] k-NN evaluation: the reported 78.3% and 80.1% top-1 figures are presented without error bars, standard deviations across runs, or full hyperparameter details for the supervised ViT and convnet baselines; this weakens the performance comparison that underpins the emerging-properties narrative.

    Authors: We agree that variability measures and fuller hyperparameter disclosure would strengthen the comparisons. In the revised version we report standard deviations over three independent runs for all DINO k-NN and linear-evaluation numbers. We have moved the complete hyperparameter tables for every baseline (including our re-implementations of supervised ViT and convnet models) to the appendix and added a note in §4.2 clarifying which baseline numbers are taken from the original papers versus re-run under our evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations from training runs

Full rationale

The paper reports direct experimental results: self-supervised ViT features exhibit semantic segmentation properties (absent or weaker in supervised ViTs and convnets), strong k-NN classification (78.3% top-1), and the DINO method achieves 80.1% linear evaluation. These quantities are measured outputs of training and evaluation protocols, not quantities derived from equations that reduce to fitted inputs or self-citations by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear; the emphasis on momentum encoder, multi-crop, and small patches is presented as empirical findings rather than a mathematical derivation. The paper grounds its claims in external benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work to force its conclusions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical adaptation of self-supervised objectives to the ViT architecture; key free choices include momentum encoder, multi-crop strategy, and patch size, which are treated as important but not derived from first principles.

free parameters (2)
  • momentum coefficient
    Momentum encoder is stated as important; the exact coefficient is a training hyperparameter, not stated in the abstract or derived from first principles (its schedule is sketched after this ledger).
  • patch size
    Use of small patches is highlighted as beneficial for ViTs; the choice is empirical rather than theoretically fixed.
axioms (1)
  • domain assumption: Self-supervised learning objectives can be directly adapted to Vision Transformer architectures
    The entire study presupposes successful adaptation of SSL methods to ViT before observing the new properties.
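On the momentum coefficient flagged in the ledger: in the paper, the teacher momentum is not a single fixed value but follows a cosine schedule from 0.996 toward 1 over training. A sketch of that schedule, with an illustrative helper name:

    import math

    def teacher_momentum(step, total_steps, base=0.996, final=1.0):
        # Cosine ramp of the EMA momentum from `base` to `final`.
        cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
        return final - (final - base) * cos

    # Early in training the teacher tracks the student quickly;
    # late in training it is nearly frozen.
    for step in (0, 5000, 10000):
        print(step, round(teacher_momentum(step, total_steps=10000), 5))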

pith-pipeline@v0.9.0 · 5495 in / 1374 out tokens · 42987 ms · 2026-05-16T13:59:46.222058+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

  2. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  3. ProtoSSL: Interpretable Prototype Learning from Unlabeled Time-Series Data

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    ProtoSSL discovers generalizable prototypes from unlabeled time-series via self-supervision and assigns them to new tasks for interpretable predictions, outperforming supervised baselines in low-data regimes on ECG datasets.

  4. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  5. Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A framework uses modality-agnostic prompts to adapt SAM for multi-modal camouflaged object detection, with a mask refine module for better boundaries.

  6. BEiT: BERT Pre-Training of Image Transformers

    cs.CV · 2021-06 · conditional · novelty 7.0

    BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.

  7. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  8. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  9. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  10. Self-supervised Pretraining of Cell Segmentation Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

  11. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  12. Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks

    cs.CV · 2026-04 · conditional · novelty 6.0

    VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models...

  13. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO · 2025-12 · unverdicted · novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  14. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV · 2024-02 · conditional · novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  15. Atlas: Few-shot Learning with Retrieval Augmented Language Models

    cs.CL · 2022-08 · unverdicted · novelty 6.0

    Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

  16. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR · 2021-12 · unverdicted · novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  17. FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

    cs.RO · 2026-05 · unverdicted · novelty 5.0

    FUS3DMaps fuses voxel- and instance-level open-vocabulary layers inside a shared 3D voxel map to improve both layers and enable scalable accurate semantic mapping.

  18. PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs

    cs.CV · 2026-02 · unverdicted · novelty 5.0

    PANC augments Normalized Cut with anchor-augmented token graphs using priors to steer spectral partitions, yielding mIoU gains of 2.3-8.7% over baselines on DUTS-TE, DUT-OMRON, and CrackForest.

  19. Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

    cs.CV · 2026-04 · unverdicted · novelty 4.0

    DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 18 Pith papers · 17 internal anchors

  1. [1] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
  2. [2] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
  3. [3] Mahmoud Assran, Nicolas Ballas, Lluis Castrejon, and Michael Rabbat. Recovering petaflops in contrastive semi-supervised learning of visual representations. arXiv preprint arXiv:2006.10803, 2020.
  4. [4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  5. [5] Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos, and Matthijs Douze. MultiGrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509, 2019.
  6. [6] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In ICML, 2017.
  7. [7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In SIGKDD, 2006.
  8. [8] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
  9. [9] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
  10. [10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
  11. [11] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849, 2018.
  12. [12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  13. [13] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
  14. [14] Weijie Chen, Shiliang Pu, Di Xie, Shicai Yang, Yilu Guo, and Luojun Lin. Unsupervised image classification for deep representation learning. arXiv preprint arXiv:2006.11480, 2020.
  15. [15] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  16. [16] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566, 2020.
  17. [17] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, 2013.
  18. [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  19. [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  20. [20] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI, 2016.
  21. [21] Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg, and Cordelia Schmid. Evaluation of gist descriptors for web-scale image search. In CIVR, 2009.
  22. [22] Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, and Hervé Jégou. Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644, 2021.
  23. [23] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. arXiv preprint arXiv:2007.06346, 2020.
  24. [24] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
  25. [25] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. In ICLR, 2021.
  26. [26] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In CVPR, 2020.
  27. [27] Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, and Patrick Pérez. Online bag-of-visual-words generation for unsupervised representation learning. arXiv preprint arXiv:2012.11552, 2020.
  28. [28] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021.
  29. [29] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  30. [30] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
  31. [31] Shir Gur, Ameen Ali, and Lior Wolf. Visualization of supervised and self-supervised neural networks via attribution guided factorization. arXiv preprint arXiv:2012.02166, 2020.
  32. [32] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
  33. [33] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  34. [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  35. [35] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  36. [36] Jiabo Huang, Qi Dong, Shaogang Gong, and Xiatian Zhu. Unsupervised deep learning by neighbourhood discovery. In ICML, 2019.
  37. [37] Allan Jabri, Andrew Owens, and Alexei A. Efros. Space-time correspondence as a contrastive random walk. In NeurIPS, 2020.
  38. [38] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
  39. [39] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.
  40. [40] Zihang Lai, Erika Lu, and Weidi Xie. MAST: A memory-augmented self-supervised tracker. In CVPR, 2020.
  41. [41] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
  42. [42] Junnan Li, Pan Zhou, Caiming Xiong, and Steven C. H. Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2021.
  43. [43] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  44. [44] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. 2018.
  45. [45] Julien Mairal. Cyanure: An open-source toolbox for empirical risk minimization for Python, C++, and soon more. arXiv preprint arXiv:1912.08165, 2019.
  46. [46] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
  47. [47] Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervised learning via knowledge transfer. In CVPR, 2018.
  48. [48] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
  49. [49] Hieu Pham, Qizhe Xie, Zihang Dai, and Quoc V. Le. Meta pseudo labels. arXiv preprint arXiv:2003.10580, 2020.
  50. [50] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
  51. [51] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  52. [52] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  53. [53] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting Oxford and Paris: Large-scale image retrieval benchmarking. In CVPR, 2018.
  54. [54] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN image retrieval with no human annotation. TPAMI, 2018.
  55. [55] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  56. [56] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020.
  57. [57] Jerome Revaud, Jon Almazán, Rafael S. Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In ICCV, 2019.
  58. [58] Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, et al. BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241, 2020.
  59. [59] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, 1988.
  60. [60] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
  61. [61] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
  62. [62] Mert Bulent Sariyildiz, Yannis Kalantidis, Diane Larlus, and Karteek Alahari. Concept generalization in visual representation learning. arXiv preprint arXiv:2012.05649, 2020.
  63. [63] Zhiqiang Shen, Zechun Liu, Jie Qin, Lei Huang, Kwang-Ting Cheng, and Marios Savvides. S2-BNN: Bridging the gap between self-supervised real and 1-bit neural networks via guided distribution calibration. arXiv preprint arXiv:2102.08946, 2021.
  64. [64] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
  65. [65] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780, 2017.
  66. [66] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
  67. [67] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. In NeurIPS, 2020.
  68. [68] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of CNN activations. arXiv preprint arXiv:1511.05879, 2015.
  69. [69] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
  70. [70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  71. [71] Xiaolong Wang, Allan Jabri, and Alexei A. Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019.
  72. [72] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020.
  73. [73] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
  74. [74] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
  75. [75] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2020.
  76. [76] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In CVPR, 2020.
  77. [77] Haohang Xu, Xiaopeng Zhang, Hao Li, Lingxi Xie, Hongkai Xiong, and Qi Tian. Seed the views: Hierarchical semantic alignment for contrastive representation learning. arXiv preprint arXiv:2012.02733, 2021.
  78. [78] Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, and Ronan Collobert. Iterative pseudo-labeling for speech recognition. arXiv preprint arXiv:2005.09267, 2020.
  79. [79] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
  80. [80] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, 2016.

Showing first 80 references.