Pith · machine review for the scientific record

arxiv: 2605.09963 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links


Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · spatial prediction · pretext tasks · visual representations · inductive bias · generalization · robustness · semantic segmentation

The pith

Adding a spatial prediction task to self-supervised learning gives models an inductive bias for part-to-part geometry and better generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that typical self-supervised methods focus on object identity but ignore how parts of an image relate in space. It introduces Spatial Prediction, a regression task that requires the model to output the relative position and scale between two local views cropped from the same image. This forces the learned features to encode compositional structure in a continuous geometric space rather than only invariant semantics. The task is designed as a decoupled plug-in that can be added to existing frameworks. If the claim holds, downstream performance should rise on tasks that need geometric understanding, such as segmentation and depth estimation, while also increasing robustness when test images differ from training data.
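
The abstract does not spell out how relative position and scale are parameterized. As a rough sketch, one common convention for this kind of box regression is a normalized center offset plus log scale ratios; the function name and normalization below are illustrative assumptions, not the paper's stated formulation:

```python
import numpy as np

def sp_target(ref_box, tgt_box):
    """Relative position and scale of a target crop w.r.t. a reference crop.

    Boxes are (x, y, w, h) in image coordinates. This sketch normalizes the
    center offset by the reference crop's size and uses log scale ratios, a
    common convention for box regression; the paper's exact encoding may differ.
    """
    rx, ry, rw, rh = ref_box
    tx, ty, tw, th = tgt_box
    # center offset, normalized by the reference crop's size
    dx = ((tx + tw / 2) - (rx + rw / 2)) / rw
    dy = ((ty + th / 2) - (ry + rh / 2)) / rh
    # relative scale as log ratios (0 when the crops share a size)
    sw = np.log(tw / rw)
    sh = np.log(th / rh)
    return np.array([dx, dy, sw, sh])
```

Under this encoding, two identical crops map to the zero target, and a crop shifted right by half the reference width maps to dx = 0.5, which is what a regression head trained on such pairs would learn to predict.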

Core claim

The central claim is that explicitly modeling spatial information via a Spatial Prediction pretext task, which regresses the relative position and scale between a pair of disentangled local views, provides an effective inductive bias for self-supervised learning. This produces representations that capture fine-grained spatial dependencies and the compositional structure of scenes, leading to consistent gains on image recognition, fine-grained classification, semantic segmentation, depth estimation, and out-of-distribution robustness, plus stronger results on dedicated spatial reasoning tests.

What carries the argument

Spatial Prediction (SP), a plug-in regression pretext task that predicts relative position and scale between pairs of local image views to capture part-to-part spatial relationships.

If this is right

  • Gains appear on image recognition and fine-grained classification benchmarks.
  • Performance rises on semantic segmentation and depth estimation tasks.
  • Out-of-distribution robustness improves for object recognition.
  • New spatial reasoning tests show stronger position prediction and jigsaw understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupled design implies the task can be layered onto other pretext objectives to add geometric awareness without redesigning the base method.
  • Continuous regression in geometric space may support finer localization needs in applications that require precise part placement.
  • The emphasis on spatial structure could transfer to settings where geometry is central, such as multi-view or temporal data.
  • The paper's own controls leave open whether similar gains would arise from any auxiliary regression head of comparable strength.

Load-bearing premise

The observed gains come specifically from learning spatial part-to-part relationships rather than from any added training signal or hyperparameter adjustments.

What would settle it

An experiment that adds a non-spatial regression task of matched complexity to the same SSL frameworks and checks whether the gains on spatial and downstream tasks disappear.
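
A minimal version of such a control, under the assumption that targets are batched: permute the spatial targets across the batch so the auxiliary head receives a signal with identical dimensionality, scale, and loss weighting but no consistent geometric relation to the view pair. The helper below is a hypothetical sketch of one such protocol, not anything the paper describes:

```python
import numpy as np

def permuted_control_targets(spatial_targets, seed=0):
    """Non-spatial control for the SP ablation.

    Randomly permutes the (position, scale) target rows across the batch,
    preserving the target distribution while destroying the pairing between
    each view pair and its true geometric relationship. Architecture, loss,
    and schedule would be kept identical to the SP run.
    """
    rng = np.random.default_rng(seed)
    targets = np.asarray(spatial_targets)
    return targets[rng.permutation(len(targets))]
```

If downstream gains survive this control, they are attributable to generic auxiliary supervision rather than spatial structure; if they vanish, the inductive-bias reading is strengthened.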

Figures

Figures reproduced from arXiv: 2605.09963 by Mengmi Zhang, Qing Lin, Weronika Hryniewska-Guzik, Yang Shen, Yusen Cai.

Figure 1
Figure 1: Problem setting and task overview for visual representation learning from partial observations. The left panel illustrates the problem setting: two cropped and resized views (orange and blue) are sampled from the same image, and the SSL model predicts the relative position and scale of the orange patch with respect to the blue reference. The right panel summarizes the evaluation tasks for benchmarking SSL …
Figure 2
Figure 2: Overview of Spatial Prediction (SP). Given a reference view Ir and a target view It sampled from the same image, both views are encoded by a shared Vision Transformer (ViT) [50]. The class token ([CLS]) of the reference view serves as the query (Q), while patch tokens from both reference and target views act as keys (K) and values (V) to compute a cross-attention-like interaction (⊗), producing reference …
Figure 3
Figure 3: Example visualizations of learned representations by SSL models with and without SP. (a) Qualitative comparison of spatial attention maps. Following [31], we visualize attention by computing the attention weights between the classification token z and all patch tokens Z, reshaping them into a 2D spatial map, and upsampling to the input image size. Each row corresponds to an input image, and columns show at…
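
The visualization recipe in the Figure 3 caption (CLS-to-patch attention weights reshaped to a 2D grid, then upsampled to the input size) can be sketched as below; nearest-neighbor upsampling via a Kronecker product stands in for whatever interpolation the authors actually use:

```python
import numpy as np

def attention_map(attn_cls_to_patches, grid, out_size):
    """Turn [CLS]-to-patch attention weights into a 2D heat map.

    `attn_cls_to_patches` holds one weight per patch token; `grid` is the
    patch grid side (e.g. 14 for a 224px image with 16px patches). Each
    weight is tiled into an (out_size/grid)-sized block.
    """
    h = np.asarray(attn_cls_to_patches, dtype=float).reshape(grid, grid)
    scale = out_size // grid
    return np.kron(h, np.ones((scale, scale)))  # (out_size, out_size)
```

For a ViT-B/16 at 224px input this would be `attention_map(weights, 14, 224)`, producing a 224x224 map overlayable on the image.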
read the original abstract

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes Spatial Prediction (SP), a pretext regression task for self-supervised learning that predicts the continuous relative position and scale between pairs of disentangled local views from the same image. SP is implemented as a decoupled plug-in module that can be added to existing SSL frameworks to encourage capture of part-to-part spatial relationships and compositional scene structure. The authors report consistent empirical gains on image recognition, fine-grained classification, semantic segmentation, depth estimation, and out-of-distribution robustness, and introduce two new spatial-reasoning evaluation tasks (position/scale prediction on patch pairs and jigsaw understanding after reconstruction).

Significance. If the reported gains are shown to arise specifically from the geometric content of the SP task rather than the addition of any auxiliary regression signal, the work supplies a lightweight inductive bias for spatial structure in SSL representations. This would be useful for downstream tasks that require geometric awareness. The introduction of dedicated spatial-reasoning probes is a constructive addition for future evaluation of SSL methods. The plug-in design and promised code release are practical strengths.

major comments (1)
  1. §4 (Experiments) and Abstract: The central claim that 'explicitly modeling spatial information provides an effective inductive bias' (Abstract) requires that observed gains on recognition, segmentation, depth, and OOD tasks stem from regressing continuous relative position/scale rather than from the mere addition of an auxiliary loss term. No matched control is described in which the regression target is non-spatial (e.g., fixed offsets or randomly permuted positions) while preserving identical architecture, loss weighting, and training schedule. Without this control, the inductive-bias interpretation remains unisolated from generic effects of extra gradient signal or regularization.
minor comments (3)
  1. Abstract: The statements of 'consistent improvements' and 'substantial gains' are not accompanied by any numerical results, standard deviations, or statistical tests, which reduces immediate assessability of effect sizes.
  2. §3 (Method): The precise formulation of the regression targets (how position and scale are encoded and normalized) and the weighting hyper-parameter between the SP loss and the base SSL objective should be stated explicitly with equations.
  3. Evaluation tasks: The construction details, difficulty baselines, and human-performance references for the new position/scale prediction and jigsaw-understanding probes should be expanded to allow independent replication.
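
For concreteness, the loss composition the referee asks to see stated explicitly presumably takes the plug-in form L_total = L_base + λ·L_SP. The sketch below assumes that form, with a smooth-L1 regression term chosen purely for illustration; neither the weighting λ nor the loss form is visible in the abstract:

```python
import numpy as np

def combined_loss(base_loss, sp_pred, sp_target, lam=1.0):
    """Hypothetical plug-in loss composition: base SSL objective plus a
    weighted SP regression term. `lam` and the smooth-L1 choice are
    assumptions for illustration, not the paper's stated hyper-parameters.
    """
    diff = np.abs(np.asarray(sp_pred) - np.asarray(sp_target))
    # smooth-L1: quadratic near zero, linear in the tails
    sp_loss = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()
    return base_loss + lam * sp_loss
```

Making this equation (and the target normalization feeding it) explicit would address the reproducibility concern directly.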

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive review. The suggestion to better isolate the spatial inductive bias is well-taken. We address the major comment below and will incorporate the requested control in the revised manuscript.

read point-by-point responses
  1. Referee: §4 (Experiments) and Abstract: The central claim that 'explicitly modeling spatial information provides an effective inductive bias' (Abstract) requires that observed gains on recognition, segmentation, depth, and OOD tasks stem from regressing continuous relative position/scale rather than from the mere addition of an auxiliary loss term. No matched control is described in which the regression target is non-spatial (e.g., fixed offsets or randomly permuted positions) while preserving identical architecture, loss weighting, and training schedule. Without this control, the inductive-bias interpretation remains unisolated from generic effects of extra gradient signal or regularization.

    Authors: We agree that a matched control with non-spatial regression targets is necessary to strengthen the claim that gains arise specifically from the geometric content of the SP task. In the revised manuscript we will add an ablation study that replaces the continuous relative position/scale targets with non-spatial alternatives (fixed arbitrary offsets and randomly permuted position labels) while keeping the network architecture, loss weighting, optimizer, and training schedule identical. Results from this control will be reported alongside the original SP results on the main downstream tasks. We expect this addition to clarify whether the observed improvements are attributable to spatial structure rather than generic auxiliary supervision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pretext task with independent experimental validation

full rationale

The paper introduces Spatial Prediction (SP) as a new decoupled pretext regression task for SSL, with claims of improved spatial structure and generalization supported solely by empirical results across multiple benchmarks and new evaluation probes. No equations, derivations, or first-principles predictions are presented that reduce the claimed benefits to quantities defined by the method itself. The approach is a plug-in auxiliary loss whose performance gains are measured externally rather than forced by construction or self-citation chains. The central inductive-bias interpretation rests on comparative experiments, not on any self-referential definition or fitted input renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of adding a spatial regression pretext task to existing SSL pipelines. No first-principles derivation is offered; success is asserted via experiments whose details are not visible in the abstract.

axioms (1)
  • domain assumption Adding a spatial prediction objective to standard SSL frameworks will produce representations that capture compositional structure without harming other learning objectives.
    This premise underpins the plug-in design and the expectation of consistent gains across tasks.

pith-pipeline@v0.9.0 · 5521 in / 1242 out tokens · 66632 ms · 2026-05-12T03:31:06.545037+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 2 internal anchors

  1. [1]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  2. [2]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

  3. [3]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019

  4. [4]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021

  5. [5]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019

  6. [6]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008

  7. [7]

    Describing textures in the wild

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014

  8. [8]

    Food-101 – mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. InEuropean Conference on Computer Vision, 2014

  9. [9]

    The pascal visual object classes (voc) challenge

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010

  10. [10]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012

  11. [11]

    Recognition-by-components: a theory of human image understanding

    Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987

  12. [12]

    Dynamic binding in a neural network for shape recognition

    John E Hummel and Irving Biederman. Dynamic binding in a neural network for shape recognition. Psychological Review, 99(3):480, 1992

  13. [13]

    Label-efficient online continual object detection in streaming video

    Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, and Mike Zheng Shou. Label-efficient online continual object detection in streaming video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19246–19255, 2023

  14. [14]

    Pose prior learner: Unsupervised categorical prior learning for pose estimation

    Ziyu Wang, Shuangpeng Han, and Mengmi Zhang. Pose prior learner: Unsupervised categorical prior learning for pose estimation. arXiv preprint arXiv:2410.03858, 2024

  15. [15]

    Putting visual object recognition in context

    Mengmi Zhang, Claire Tseng, and Gabriel Kreiman. Putting visual object recognition in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12985–12994, 2020

  16. [16]

    When pigs fly: Contextual reasoning in synthetic and natural scenes

    Philipp Bomatter, Mengmi Zhang, Dimitar Karev, Spandan Madan, Claire Tseng, and Gabriel Kreiman. When pigs fly: Contextual reasoning in synthetic and natural scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 255–264, 2021

  17. [17]

    Reason from context with self-supervised learning

    Xiao Liu, Ankur Sikarwar, Gabriel Kreiman, Zenglin Shi, and Mengmi Zhang. Reason from context with self-supervised learning. arXiv preprint arXiv:2211.12817, 2022

  18. [18]

    Seeing sound, hearing sight: Uncovering modality bias and conflict of ai models in sound localization

    Yanhao Jia, Ji Xie, S Jivaganesh, Hao Li, Xu Wu, and Mengmi Zhang. Seeing sound, hearing sight: Uncovering modality bias and conflict of ai models in sound localization. arXiv preprint arXiv:2505.11217, 2025

  19. [19]

    Adaptive visual scene understanding: incremental scene graph generation

    Naitik Khandelwal, Xiao Liu, and Mengmi Zhang. Adaptive visual scene understanding: incremental scene graph generation. arXiv preprint arXiv:2310.01636, 2023

  20. [20]

    Object-centric learning with cyclic walks between parts and whole

    Ziyu Wang, Mike Zheng Shou, and Mengmi Zhang. Object-centric learning with cyclic walks between parts and whole. Advances in Neural Information Processing Systems, 36:9388–9408, 2023

  21. [21]

    Flow snapshot neurons in action: Deep neural networks generalize to biological motion perception

    Shuangpeng Han, Ziyu Wang, and Mengmi Zhang. Flow snapshot neurons in action: Deep neural networks generalize to biological motion perception. Advances in Neural Information Processing Systems, 37:53732–53763, 2024

  22. [22]

    Peering into the unknown: Active view selection with neural uncertainty maps for 3d reconstruction

    Zhengquan Zhang, Feng Xu, and Mengmi Zhang. Peering into the unknown: Active view selection with neural uncertainty maps for 3d reconstruction. arXiv preprint arXiv:2506.14856, 2025

  23. [23]

    Learning to see through a baby’s eyes: Early visual diets enable robust visual intelligence in humans and machines

    Yusen Cai, Bhargava Satya Nunna, Qing Lin, and Mengmi Zhang. Learning to see through a baby’s eyes: Early visual diets enable robust visual intelligence in humans and machines. arXiv preprint arXiv:2511.14440, 2025

  24. [24]

    Tta-nav: Test-time adaptive reconstruction for point-goal navigation under visual corruptions

    Maytus Piriyajitakonkij, Mingfei Sun, Mengmi Zhang, and Wei Pan. Tta-nav: Test-time adaptive reconstruction for point-goal navigation under visual corruptions. arXiv preprint arXiv:2403.01977, 2024

  25. [25]

    Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition

    Kaiyu Yang, Olga Russakovsky, and Jia Deng. Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2051–2060, 2019

  26. [26]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  27. [27]

    Egocentric spatial memory

    Mengmi Zhang, Keng Teck Ma, Shih-Cheng Yen, Joo Hwee Lim, Qi Zhao, and Jiashi Feng. Egocentric spatial memory. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 137–144. IEEE, 2018

  28. [28]

    Unsupervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015

  29. [29]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, June 2023

  30. [30]

    An empirical study of training self-supervised vision transformers, 2021

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers, 2021

  31. [31]

    Emerging properties in self-supervised vision transformers, 2021

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers, 2021

  32. [32]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

  33. [33]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022

  34. [34]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18 Jul 2020

  35. [35]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  36. [36]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  37. [37]

    Exploring simple siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021

  38. [38]

    Unsupervised learning of dense visual representations

    Pedro O O Pinheiro, Amjad Almahairi, Ryan Benmalek, Florian Golemo, and Aaron C Courville. Unsupervised learning of dense visual representations. Advances in Neural Information Processing Systems, 33:4489–4500, 2020

  39. [39]

    Patch-level representation learning for self-supervised vision transformers

    Sukmin Yun, Hankook Lee, Jaehyung Kim, and Jinwoo Shin. Patch-level representation learning for self-supervised vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8354–8363, 2022

  40. [40]

    Near, far: Patch-ordering enhances vision foundation models’ scene understanding

    Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, and Yuki M Asano. Near, far: Patch-ordering enhances vision foundation models’ scene understanding. arXiv preprint arXiv:2408.11054, 2024

  41. [41]

    Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning

    Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16684–16693, 2021

  42. [42]

    Beit: Bert pre-training of image transformers, 2022

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers, 2022

  43. [43]

    Masked siamese networks for label-efficient learning

    Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, pages 456–473. Springer, 2022

  44. [44]

    Revealing the dark secrets of masked image modeling

    Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14475–14485, 2023

  45. [45]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 69–84, Cham, 2016. Springer International Publishing

  46. [46]

    Self-supervised learning of pretext-invariant representations

    Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020

  47. [47]

    Context encoders: Feature learning by inpainting

    Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016

  48. [48]

    Colorful image colorization

    Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016

  49. [49]

    Unsupervised representation learning by predicting image rotations, 2018

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations, 2018

  50. [50]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  51. [51]

    data2vec: A general framework for self-supervised learning in speech, vision and language

    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 16...

  52. [52]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  53. [53]

    Learning by reconstruction produces uninformative features for perception

    Randall Balestriero and Yann LeCun. Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337, 2024

  54. [54]

    Discriminative unsupervised feature learning with exemplar convolutional neural networks

    Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE TPAMI, 38(9):1734–1747, 2016

  55. [55]

    Unsupervised learning by predicting noise

    Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In International conference on machine learning, pages 517–526. PMLR, 2017

  56. [56]

    Unsupervised feature learning via non-parametric instance discrimination

    Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018

  57. [57]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  58. [58]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020

  59. Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020

  60. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M....

  61. Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018

  62. Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371, 2019

  63. Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033, 2021

  64. Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2022

  65. Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1329–1338, 2017

  66. Joohyung Lee, Changhun Kim, Hyunsu Kim, Kwanhyung Lee, and Juho Lee. Soft equivariance regularization for invariant self-supervised learning. arXiv preprint arXiv:2603.06693, 2026

  67. Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, and Joshua Susskind. Position prediction as an effective pretraining strategy. arXiv preprint arXiv:2207.07611, 2022

  68. Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. International Journal of Computer Vision, 132(1):208–223, 2024

  69. Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, and Zhaoxiang Zhang. DropPos: Pre-training vision transformers by reconstructing dropped positions. Advances in Neural Information Processing Systems, 36:46134–46151, 2023

  70. Melika Ayoughi, Samira Abnar, Chen Huang, Chris Sandino, Sayeri Lala, Eeshan Gunesh Dhekane, Dan Busbridge, Shuangfei Zhai, Vimal Thilak, Josh Susskind, et al. How parts assemble into wholes: Learning the relative composition of images. arXiv preprint arXiv:2506.03682, 2025

  71. Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo-learn: A library of self-supervised methods for visual representation learning. Journal of Machine Learning Research, 23(56):1–6, 2022

  72. Igor Susmelj, Matthias Heller, Philipp Wirth, Jeremy Prescott, Malte Ebner, et al. Lightly

  73. Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...