Recognition: 2 theorem links · Lean Theorem
Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning
Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3
The pith
Adding a spatial prediction task to self-supervised learning gives models an inductive bias for part-to-part geometry and better generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that explicitly modeling spatial information via a Spatial Prediction pretext task, which regresses the relative position and scale between a pair of disentangled local views, provides an effective inductive bias for self-supervised learning. This produces representations that capture fine-grained spatial dependencies and the compositional structure of scenes, leading to consistent gains on image recognition, fine-grained classification, semantic segmentation, depth estimation, and out-of-distribution robustness, plus stronger results on dedicated spatial reasoning tests.
What carries the argument
Spatial Prediction (SP), a plug-in regression pretext task that predicts relative position and scale between pairs of local image views to capture part-to-part spatial relationships.
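To make the mechanism concrete, here is a minimal sketch of what such a plug-in head and loss could look like, assuming a PyTorch-style setup. The 3-dimensional target encoding (center offsets normalized by image size plus a log scale ratio), the MLP head, and the loss weights λ_p and λ_s are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an SP-style head and loss; names and encoding are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialPredictionHead(nn.Module):
    """Regress relative position and scale between two local-view embeddings."""

    def __init__(self, embed_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 3),  # (dx, dy, d_log_scale)
        )

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([z_a, z_b], dim=-1))


def relative_targets(box_a: torch.Tensor, box_b: torch.Tensor, img_size: float) -> torch.Tensor:
    """Encode the relative geometry of two square crop boxes (x, y, side) in source-image
    coordinates: center offsets normalized by image size, plus a log side-length ratio."""
    cx_a, cy_a = box_a[:, 0] + box_a[:, 2] / 2, box_a[:, 1] + box_a[:, 2] / 2
    cx_b, cy_b = box_b[:, 0] + box_b[:, 2] / 2, box_b[:, 1] + box_b[:, 2] / 2
    dx = (cx_b - cx_a) / img_size
    dy = (cy_b - cy_a) / img_size
    d_log_scale = torch.log(box_b[:, 2] / box_a[:, 2])
    return torch.stack([dx, dy, d_log_scale], dim=-1)


def sp_loss(pred: torch.Tensor, target: torch.Tensor,
            lambda_pos: float = 1.0, lambda_scale: float = 1.0) -> torch.Tensor:
    """L_SP = λ_p ||p̂ - p||² + λ_s ||ŝ - s||², with position and scale terms weighted separately."""
    pos_term = F.mse_loss(pred[:, :2], target[:, :2])
    scale_term = F.mse_loss(pred[:, 2:], target[:, 2:])
    return lambda_pos * pos_term + lambda_scale * scale_term
```

In a plug-in integration of this kind, the SP term would simply be added to the base SSL loss with its own weight, leaving the base objective untouched.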
If this is right
- Gains appear on image recognition and fine-grained classification benchmarks.
- Performance rises on semantic segmentation and depth estimation tasks.
- Out-of-distribution robustness improves for object recognition.
- New spatial reasoning tests show stronger position prediction and jigsaw understanding.
Where Pith is reading between the lines
- The decoupled design implies the task can be layered onto other pretext objectives to add geometric awareness without redesigning the base method.
- Continuous regression in geometric space may support finer localization needs in applications that require precise part placement.
- The emphasis on spatial structure could transfer to settings where geometry is central, such as multi-view or temporal data.
- The paper's own controls leave open whether similar gains would arise from any auxiliary regression head of comparable strength.
Load-bearing premise
The observed gains come specifically from learning spatial part-to-part relationships rather than from any added training signal or hyperparameter adjustments.
What would settle it
An experiment that adds a non-spatial regression task of matched complexity to the same SSL frameworks and checks whether the gains on spatial and downstream tasks disappear.
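A minimal sketch of how such a matched control could be wired in, assuming the target encoding and `sp_loss` from the sketch above. The "permute" and "fixed" modes are hypothetical names for two ways of stripping the geometric content while keeping dimensionality, architecture, loss weighting, and schedule fixed.

```python
# Hedged sketch of a non-spatial control with matched complexity; names are illustrative.
import torch


def non_spatial_targets(spatial_targets: torch.Tensor, mode: str = "permute") -> torch.Tensor:
    """Return regression targets of matched dimensionality but with no spatial
    relationship to the corresponding view pair."""
    if mode == "permute":
        # Shuffle targets across the batch, breaking the pairing with the views.
        perm = torch.randperm(spatial_targets.shape[0], device=spatial_targets.device)
        return spatial_targets[perm]
    if mode == "fixed":
        # A constant target: the head still receives gradient signal, but nothing spatial to learn.
        return torch.zeros_like(spatial_targets)
    raise ValueError(f"unknown mode: {mode}")


# In the control run, the only change to the training step would be:
#   loss = base_ssl_loss + sp_loss(head(z_a, z_b), non_spatial_targets(targets))
# If downstream gains persist, they reflect generic auxiliary supervision;
# if they vanish, the geometric content of the SP targets is doing the work.
```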
read the original abstract
Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Spatial Prediction (SP), a pretext regression task for self-supervised learning that predicts the continuous relative position and scale between pairs of disentangled local views from the same image. SP is implemented as a decoupled plug-in module that can be added to existing SSL frameworks to encourage capture of part-to-part spatial relationships and compositional scene structure. The authors report consistent empirical gains on image recognition, fine-grained classification, semantic segmentation, depth estimation, and out-of-distribution robustness, and introduce two new spatial-reasoning evaluation tasks (position/scale prediction on patch pairs and jigsaw understanding after reconstruction).
Significance. If the reported gains are shown to arise specifically from the geometric content of the SP task rather than the addition of any auxiliary regression signal, the work supplies a lightweight inductive bias for spatial structure in SSL representations. This would be useful for downstream tasks that require geometric awareness. The introduction of dedicated spatial-reasoning probes is a constructive addition for future evaluation of SSL methods. The plug-in design and promised code release are practical strengths.
major comments (1)
- §4 (Experiments) and Abstract: The central claim that 'explicitly modeling spatial information provides an effective inductive bias' (Abstract) requires that observed gains on recognition, segmentation, depth, and OOD tasks stem from regressing continuous relative position/scale rather than from the mere addition of an auxiliary loss term. No matched control is described in which the regression target is non-spatial (e.g., fixed offsets or randomly permuted positions) while preserving identical architecture, loss weighting, and training schedule. Without this control, the inductive-bias interpretation cannot be separated from generic effects of extra gradient signal or regularization.
minor comments (3)
- Abstract: The statements of 'consistent improvements' and 'substantial gains' are not accompanied by any numerical results, standard deviations, or statistical tests, which makes the effect sizes difficult to assess from the abstract alone.
- §3 (Method): The precise formulation of the regression targets (how position and scale are encoded and normalized) and the weighting hyper-parameter between the SP loss and the base SSL objective should be stated explicitly with equations.
- Evaluation tasks: The construction details, difficulty baselines, and human-performance references for the new position/scale prediction and jigsaw-understanding probes should be expanded to allow independent replication; one hypothetical construction is sketched after this list.
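To make the replication concern concrete, here is one hypothetical construction of the patch-pair position/scale probe, assuming a frozen pretrained encoder and a lightweight regression readout. Crop-size ranges, the target encoding, and all names are illustrative assumptions rather than the paper's protocol.

```python
# Hedged sketch of a patch-pair position/scale probe; all choices are assumptions.
import math
import random
import torch


def sample_patch_pair(image: torch.Tensor, min_frac: float = 0.15, max_frac: float = 0.4):
    """image: a (C, H, W) tensor. Crop two random square patches and return them with
    their ground-truth relative geometry (dx, dy, d_log_scale), normalized by the
    short side of the image."""
    _, h, w = image.shape
    short = min(h, w)
    patches, centers = [], []
    for _ in range(2):
        side = int(short * random.uniform(min_frac, max_frac))
        x = random.randint(0, w - side)
        y = random.randint(0, h - side)
        patches.append(image[:, y:y + side, x:x + side])
        centers.append((x + side / 2, y + side / 2, side))
    (cxa, cya, sa), (cxb, cyb, sb) = centers
    target = torch.tensor([(cxb - cxa) / short, (cyb - cya) / short, math.log(sb / sa)])
    return patches[0], patches[1], target


# Probe protocol (sketch): resize each patch, embed it with the frozen pretrained
# encoder, concatenate the two embeddings, and fit only a small linear regressor
# against `target`; report, e.g., mean absolute error on held-out images.
```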
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The suggestion to better isolate the spatial inductive bias is well-taken. We address the major comment below and will incorporate the requested control in the revised manuscript.
read point-by-point responses
- Referee: §4 (Experiments) and Abstract: The central claim that 'explicitly modeling spatial information provides an effective inductive bias' (Abstract) requires that observed gains on recognition, segmentation, depth, and OOD tasks stem from regressing continuous relative position/scale rather than from the mere addition of an auxiliary loss term. No matched control is described in which the regression target is non-spatial (e.g., fixed offsets or randomly permuted positions) while preserving identical architecture, loss weighting, and training schedule. Without this control, the inductive-bias interpretation cannot be separated from generic effects of extra gradient signal or regularization.
Authors: We agree that a matched control with non-spatial regression targets is necessary to strengthen the claim that gains arise specifically from the geometric content of the SP task. In the revised manuscript we will add an ablation study that replaces the continuous relative position/scale targets with non-spatial alternatives (fixed arbitrary offsets and randomly permuted position labels) while keeping the network architecture, loss weighting, optimizer, and training schedule identical. Results from this control will be reported alongside the original SP results on the main downstream tasks. We expect this addition to clarify whether the observed improvements are attributable to spatial structure rather than generic auxiliary supervision.
Revision: yes
Circularity Check
No circularity: empirical pretext task with independent experimental validation
full rationale
The paper introduces Spatial Prediction (SP) as a new decoupled pretext regression task for SSL, with claims of improved spatial structure and generalization supported solely by empirical results across multiple benchmarks and new evaluation probes. No equations, derivations, or first-principles predictions are presented that reduce the claimed benefits to quantities defined by the method itself. The approach is a plug-in auxiliary loss whose performance gains are measured externally rather than forced by construction or self-citation chains. The central inductive-bias interpretation rests on comparative experiments, not on any self-referential definition or fitted input renamed as prediction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Adding a spatial prediction objective to standard SSL frameworks will produce representations that capture compositional structure without harming other learning objectives.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "SP ... predicts the relative position and scale between a pair of disentangled local views ... L_SP = λ_p ||p̂ - p||_2^2 + λ_s ||ŝ - s||_2^2"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "explicitly modeling spatial information provides an effective inductive bias for SSL"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.