Pith · machine review for the scientific record

arxiv: 2605.09963 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links


Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · spatial prediction · pretext tasks · visual representations · inductive bias · generalization · robustness · semantic segmentation

The pith

Adding a spatial prediction task to self-supervised learning gives models an inductive bias for part-to-part geometry and better generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that typical self-supervised methods focus on object identity but ignore how parts of an image relate in space. It introduces Spatial Prediction, a regression task that requires the model to output the relative position and scale between two local views cropped from the same image. This forces the learned features to encode compositional structure in a continuous geometric space rather than only invariant semantics. The task is designed as a decoupled plug-in that can be added to existing frameworks. If the claim holds, downstream performance should rise on tasks that need geometric understanding, such as segmentation and depth estimation, while also increasing robustness when test images differ from training data.
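
The abstract does not spell out how relative position and scale are parameterized. As a rough sketch, one common convention for this kind of box regression is a normalized center offset plus log scale ratios; the function name and normalization below are illustrative assumptions, not the paper's stated formulation:

```python
import numpy as np

def sp_target(ref_box, tgt_box):
    """Relative position and scale of a target crop w.r.t. a reference crop.

    Boxes are (x, y, w, h) in image coordinates. This sketch normalizes the
    center offset by the reference crop's size and uses log scale ratios, a
    common convention for box regression; the paper's exact encoding may differ.
    """
    rx, ry, rw, rh = ref_box
    tx, ty, tw, th = tgt_box
    # center offset, normalized by the reference crop's size
    dx = ((tx + tw / 2) - (rx + rw / 2)) / rw
    dy = ((ty + th / 2) - (ry + rh / 2)) / rh
    # relative scale as log ratios (0 when the crops share a size)
    sw = np.log(tw / rw)
    sh = np.log(th / rh)
    return np.array([dx, dy, sw, sh])
```

Under this encoding, two identical crops map to the zero target, and a crop shifted right by half the reference width maps to dx = 0.5, which is what a regression head trained on such pairs would learn to predict.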

Core claim

The central claim is that explicitly modeling spatial information via a Spatial Prediction pretext task, which regresses the relative position and scale between a pair of disentangled local views, provides an effective inductive bias for self-supervised learning. This produces representations that capture fine-grained spatial dependencies and the compositional structure of scenes, leading to consistent gains on image recognition, fine-grained classification, semantic segmentation, depth estimation, and out-of-distribution robustness, plus stronger results on dedicated spatial reasoning tests.

What carries the argument

Spatial Prediction (SP), a plug-in regression pretext task that predicts relative position and scale between pairs of local image views to capture part-to-part spatial relationships.

If this is right

  • Gains appear on image recognition and fine-grained classification benchmarks.
  • Performance rises on semantic segmentation and depth estimation tasks.
  • Out-of-distribution robustness improves for object recognition.
  • New spatial reasoning tests show stronger position prediction and jigsaw understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupled design implies the task can be layered onto other pretext objectives to add geometric awareness without redesigning the base method.
  • Continuous regression in geometric space may support finer localization needs in applications that require precise part placement.
  • The emphasis on spatial structure could transfer to settings where geometry is central, such as multi-view or temporal data.
  • The paper's own controls leave open whether similar gains would arise from any auxiliary regression head of comparable strength.

Load-bearing premise

The observed gains come specifically from learning spatial part-to-part relationships rather than from any added training signal or hyperparameter adjustments.

What would settle it

An experiment that adds a non-spatial regression task of matched complexity to the same SSL frameworks and checks whether the gains on spatial and downstream tasks disappear.
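
A minimal version of such a control, under the assumption that targets are batched: permute the spatial targets across the batch so the auxiliary head receives a signal with identical dimensionality, scale, and loss weighting but no consistent geometric relation to the view pair. The helper below is a hypothetical sketch of one such protocol, not anything the paper describes:

```python
import numpy as np

def permuted_control_targets(spatial_targets, seed=0):
    """Non-spatial control for the SP ablation.

    Randomly permutes the (position, scale) target rows across the batch,
    preserving the target distribution while destroying the pairing between
    each view pair and its true geometric relationship. Architecture, loss,
    and schedule would be kept identical to the SP run.
    """
    rng = np.random.default_rng(seed)
    targets = np.asarray(spatial_targets)
    return targets[rng.permutation(len(targets))]
```

If downstream gains survive this control, they are attributable to generic auxiliary supervision rather than spatial structure; if they vanish, the inductive-bias reading is strengthened.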

Figures

Figures reproduced from arXiv: 2605.09963 by Mengmi Zhang, Qing Lin, Weronika Hryniewska-Guzik, Yang Shen, Yusen Cai.

Figure 1
Figure 1: Problem setting and task overview for visual representation learning from partial observations. The left panel illustrates the problem setting: two cropped and resized views (orange and blue) are sampled from the same image, and the SSL model predicts the relative position and scale of the orange patch with respect to the blue reference. The right panel summarizes the evaluation tasks for benchmarking SSL …
Figure 2
Figure 2: Overview of Spatial Prediction (SP). Given a reference view Ir and a target view It sampled from the same image, both views are encoded by a shared Vision Transformer (ViT) [50]. The class token ([CLS]) of the reference view serves as the query (Q), while patch tokens from both reference and target views act as keys (K) and values (V) to compute a cross-attention-like interaction (⊗), producing reference …
Figure 3
Figure 3: Example visualizations of learned representations by SSL models with and without SP. (a) Qualitative comparison of spatial attention maps. Following [31], we visualize attention by computing the attention weights between the classification token z and all patch tokens Z, reshaping them into a 2D spatial map, and upsampling to the input image size. Each row corresponds to an input image, and columns show at…
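
The visualization recipe in the Figure 3 caption (CLS-to-patch attention weights reshaped to a 2D grid, then upsampled to the input size) can be sketched as below; nearest-neighbor upsampling via a Kronecker product stands in for whatever interpolation the authors actually use:

```python
import numpy as np

def attention_map(attn_cls_to_patches, grid, out_size):
    """Turn [CLS]-to-patch attention weights into a 2D heat map.

    `attn_cls_to_patches` holds one weight per patch token; `grid` is the
    patch grid side (e.g. 14 for a 224px image with 16px patches). Each
    weight is tiled into an (out_size/grid)-sized block.
    """
    h = np.asarray(attn_cls_to_patches, dtype=float).reshape(grid, grid)
    scale = out_size // grid
    return np.kron(h, np.ones((scale, scale)))  # (out_size, out_size)
```

For a ViT-B/16 at 224px input this would be `attention_map(weights, 14, 224)`, producing a 224x224 map overlayable on the image.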
read the original abstract

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes Spatial Prediction (SP), a pretext regression task for self-supervised learning that predicts the continuous relative position and scale between pairs of disentangled local views from the same image. SP is implemented as a decoupled plug-in module that can be added to existing SSL frameworks to encourage capture of part-to-part spatial relationships and compositional scene structure. The authors report consistent empirical gains on image recognition, fine-grained classification, semantic segmentation, depth estimation, and out-of-distribution robustness, and introduce two new spatial-reasoning evaluation tasks (position/scale prediction on patch pairs and jigsaw understanding after reconstruction).

Significance. If the reported gains are shown to arise specifically from the geometric content of the SP task rather than the addition of any auxiliary regression signal, the work supplies a lightweight inductive bias for spatial structure in SSL representations. This would be useful for downstream tasks that require geometric awareness. The introduction of dedicated spatial-reasoning probes is a constructive addition for future evaluation of SSL methods. The plug-in design and promised code release are practical strengths.

major comments (1)
  1. §4 (Experiments) and Abstract: The central claim that 'explicitly modeling spatial information provides an effective inductive bias' (Abstract) requires that observed gains on recognition, segmentation, depth, and OOD tasks stem from regressing continuous relative position/scale rather than from the mere addition of an auxiliary loss term. No matched control is described in which the regression target is non-spatial (e.g., fixed offsets or randomly permuted positions) while preserving identical architecture, loss weighting, and training schedule. Without this control, the inductive-bias interpretation remains unisolated from generic effects of extra gradient signal or regularization.
minor comments (3)
  1. Abstract: The statements of 'consistent improvements' and 'substantial gains' are not accompanied by any numerical results, standard deviations, or statistical tests, which reduces immediate assessability of effect sizes.
  2. §3 (Method): The precise formulation of the regression targets (how position and scale are encoded and normalized) and the weighting hyper-parameter between the SP loss and the base SSL objective should be stated explicitly with equations.
  3. Evaluation tasks: The construction details, difficulty baselines, and human-performance references for the new position/scale prediction and jigsaw-understanding probes should be expanded to allow independent replication.
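
For concreteness, the loss composition the referee asks to see stated explicitly presumably takes the plug-in form L_total = L_base + λ·L_SP. The sketch below assumes that form, with a smooth-L1 regression term chosen purely for illustration; neither the weighting λ nor the loss form is visible in the abstract:

```python
import numpy as np

def combined_loss(base_loss, sp_pred, sp_target, lam=1.0):
    """Hypothetical plug-in loss composition: base SSL objective plus a
    weighted SP regression term. `lam` and the smooth-L1 choice are
    assumptions for illustration, not the paper's stated hyper-parameters.
    """
    diff = np.abs(np.asarray(sp_pred) - np.asarray(sp_target))
    # smooth-L1: quadratic near zero, linear in the tails
    sp_loss = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()
    return base_loss + lam * sp_loss
```

Making this equation (and the target normalization feeding it) explicit would address the reproducibility concern directly.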

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive review. The suggestion to better isolate the spatial inductive bias is well-taken. We address the major comment below and will incorporate the requested control in the revised manuscript.

read point-by-point responses
  1. Referee: §4 (Experiments) and Abstract: The central claim that 'explicitly modeling spatial information provides an effective inductive bias' (Abstract) requires that observed gains on recognition, segmentation, depth, and OOD tasks stem from regressing continuous relative position/scale rather than from the mere addition of an auxiliary loss term. No matched control is described in which the regression target is non-spatial (e.g., fixed offsets or randomly permuted positions) while preserving identical architecture, loss weighting, and training schedule. Without this control, the inductive-bias interpretation remains unisolated from generic effects of extra gradient signal or regularization.

    Authors: We agree that a matched control with non-spatial regression targets is necessary to strengthen the claim that gains arise specifically from the geometric content of the SP task. In the revised manuscript we will add an ablation study that replaces the continuous relative position/scale targets with non-spatial alternatives (fixed arbitrary offsets and randomly permuted position labels) while keeping the network architecture, loss weighting, optimizer, and training schedule identical. Results from this control will be reported alongside the original SP results on the main downstream tasks. We expect this addition to clarify whether the observed improvements are attributable to spatial structure rather than generic auxiliary supervision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pretext task with independent experimental validation

full rationale

The paper introduces Spatial Prediction (SP) as a new decoupled pretext regression task for SSL, with claims of improved spatial structure and generalization supported solely by empirical results across multiple benchmarks and new evaluation probes. No equations, derivations, or first-principles predictions are presented that reduce the claimed benefits to quantities defined by the method itself. The approach is a plug-in auxiliary loss whose performance gains are measured externally rather than forced by construction or self-citation chains. The central inductive-bias interpretation rests on comparative experiments, not on any self-referential definition or fitted input renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of adding a spatial regression pretext task to existing SSL pipelines. No first-principles derivation is offered; success is asserted via experiments whose details are not visible in the abstract.

axioms (1)
  • domain assumption Adding a spatial prediction objective to standard SSL frameworks will produce representations that capture compositional structure without harming other learning objectives.
    This premise underpins the plug-in design and the expectation of consistent gains across tasks.

pith-pipeline@v0.9.0 · 5521 in / 1242 out tokens · 66632 ms · 2026-05-12T03:31:06.545037+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 2 internal anchors

  1. [1]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009

  2. [2]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

  3. [3]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019

  4. [4]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021

  5. [5]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019

  6. [6]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008

  7. [7]

    Describing textures in the wild

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014

  8. [8]

    Food-101 – mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. InEuropean Conference on Computer Vision, 2014

  9. [9]

    The pascal visual object classes (voc) challenge

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010

  10. [10]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012

  11. [11]

    Recognition-by-components: a theory of human image understanding

    Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 94(2):115, 1987

  12. [12]

    Dynamic binding in a neural network for shape recognition

    John E Hummel and Irving Biederman. Dynamic binding in a neural network for shape recognition. Psychological Review, 99(3):480, 1992

  13. [13]

    Label-efficient online continual object detection in streaming video

    Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, and Mike Zheng Shou. Label-efficient online continual object detection in streaming video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19246–19255, 2023

  14. [14]

    Pose prior learner: Unsupervised categorical prior learning for pose estimation

    Ziyu Wang, Shuangpeng Han, and Mengmi Zhang. Pose prior learner: Unsupervised categorical prior learning for pose estimation. arXiv preprint arXiv:2410.03858, 2024

  15. [15]

    Putting visual object recognition in context

    Mengmi Zhang, Claire Tseng, and Gabriel Kreiman. Putting visual object recognition in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12985–12994, 2020

  16. [16]

    When pigs fly: Contextual reasoning in synthetic and natural scenes

    Philipp Bomatter, Mengmi Zhang, Dimitar Karev, Spandan Madan, Claire Tseng, and Gabriel Kreiman. When pigs fly: Contextual reasoning in synthetic and natural scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 255–264, 2021

  17. [17]

    Reason from context with self-supervised learning

    Xiao Liu, Ankur Sikarwar, Gabriel Kreiman, Zenglin Shi, and Mengmi Zhang. Reason from context with self-supervised learning. arXiv preprint arXiv:2211.12817, 2022

  18. [18]

    Seeing sound, hearing sight: Uncovering modality bias and conflict of ai models in sound localization

    Yanhao Jia, Ji Xie, S Jivaganesh, Hao Li, Xu Wu, and Mengmi Zhang. Seeing sound, hearing sight: Uncovering modality bias and conflict of ai models in sound localization. arXiv preprint arXiv:2505.11217, 2025

  19. [19]

    Adaptive visual scene understanding: incremental scene graph generation

    Naitik Khandelwal, Xiao Liu, and Mengmi Zhang. Adaptive visual scene understanding: incremental scene graph generation. arXiv preprint arXiv:2310.01636, 2023

  20. [20]

    Object-centric learning with cyclic walks between parts and whole

    Ziyu Wang, Mike Zheng Shou, and Mengmi Zhang. Object-centric learning with cyclic walks between parts and whole. Advances in Neural Information Processing Systems, 36:9388–9408, 2023

  21. [21]

    Flow snapshot neurons in action: Deep neural networks generalize to biological motion perception

    Shuangpeng Han, Ziyu Wang, and Mengmi Zhang. Flow snapshot neurons in action: Deep neural networks generalize to biological motion perception. Advances in Neural Information Processing Systems, 37:53732–53763, 2024

  22. [22]

    Peering into the unknown: Active view selection with neural uncertainty maps for 3d reconstruction

    Zhengquan Zhang, Feng Xu, and Mengmi Zhang. Peering into the unknown: Active view selection with neural uncertainty maps for 3d reconstruction. arXiv preprint arXiv:2506.14856, 2025

  23. [23]

    Learning to see through a baby’s eyes: Early visual diets enable robust visual intelligence in humans and machines

    Yusen Cai, Bhargava Satya Nunna, Qing Lin, and Mengmi Zhang. Learning to see through a baby’s eyes: Early visual diets enable robust visual intelligence in humans and machines. arXiv preprint arXiv:2511.14440, 2025

  24. [24]

    Tta-nav: Test-time adaptive reconstruction for point-goal navigation under visual corruptions

    Maytus Piriyajitakonkij, Mingfei Sun, Mengmi Zhang, and Wei Pan. Tta-nav: Test-time adaptive reconstruction for point-goal navigation under visual corruptions. arXiv preprint arXiv:2403.01977, 2024

  25. [25]

    Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition

    Kaiyu Yang, Olga Russakovsky, and Jia Deng. Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2051–2060, 2019

  26. [26]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  27. [27]

    Egocentric spatial memory

    Mengmi Zhang, Keng Teck Ma, Shih-Cheng Yen, Joo Hwee Lim, Qi Zhao, and Jiashi Feng. Egocentric spatial memory. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 137–144. IEEE, 2018

  28. [28]

    Unsupervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015

  29. [29]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, June 2023

  30. [30]

    An empirical study of training self-supervised vision transformers, 2021

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers, 2021

  31. [31]

    Emerging properties in self-supervised vision transformers, 2021

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers, 2021

  32. [32]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

  33. [33]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022

  34. [34]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 13–18 Jul 2020

  35. [35]

    Dinov2: Learning robust visual features without supervision, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  36. [36]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  37. [37]

    Exploring simple siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021

  38. [38]

    Unsupervised learning of dense visual representations

    Pedro O O Pinheiro, Amjad Almahairi, Ryan Benmalek, Florian Golemo, and Aaron C Courville. Unsupervised learning of dense visual representations. Advances in Neural Information Processing Systems, 33:4489–4500, 2020

  39. [39]

    Patch-level representation learning for self-supervised vision transformers

    Sukmin Yun, Hankook Lee, Jaehyung Kim, and Jinwoo Shin. Patch-level representation learning for self-supervised vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8354–8363, 2022

  40. [40]

    Near, far: Patch-ordering enhances vision foundation models’ scene understanding

    Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, and Yuki M Asano. Near, far: Patch-ordering enhances vision foundation models’ scene understanding. arXiv preprint arXiv:2408.11054, 2024

  41. [41]

    Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning

    Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16684–16693, 2021

  42. [42]

    Beit: Bert pre-training of image transformers, 2022

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers, 2022

  43. [43]

    Masked siamese networks for label-efficient learning

    Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, pages 456–473. Springer, 2022

  44. [44]

    Revealing the dark secrets of masked image modeling

    Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14475–14485, 2023

  45. [45]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 69–84, Cham, 2016. Springer International Publishing

  46. [46]

    Self-supervised learning of pretext-invariant representations

    Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020

  47. [47]

    Context encoders: Feature learning by inpainting

    Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016

  48. [48]

    Colorful image colorization

    Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016

  49. [49]

    Unsupervised representation learning by predicting image rotations, 2018

    Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations, 2018

  50. [50]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  51. [51]

    data2vec: A general framework for self-supervised learning in speech, vision and language

    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 16...

  52. [52]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022

  53. [53]

    Learning by reconstruction produces uninformative features for perception

    Randall Balestriero and Yann LeCun. Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337, 2024

  54. [54]

    Discriminative unsupervised feature learning with exemplar convolutional neural networks

    Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE TPAMI, 38(9):1734–1747, 2016

  55. [55]

    Unsupervised learning by predicting noise

    Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In International conference on machine learning, pages 517–526. PMLR, 2017

  56. [56]

    Unsupervised feature learning via non-parametric instance discrimination

    Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018

  57. [57]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  58. [58]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020

  59. Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020

  60. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M....

  61. Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018

  62. Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371, 2019

  63. Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033, 2021

  64. Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2022

  65. Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1329–1338, 2017

  66. Joohyung Lee, Changhun Kim, Hyunsu Kim, Kwanhyung Lee, and Juho Lee. Soft equivariance regularization for invariant self-supervised learning. arXiv preprint arXiv:2603.06693, 2026

  67. Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, and Joshua Susskind. Position prediction as an effective pretraining strategy. arXiv preprint arXiv:2207.07611, 2022

  68. Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. International Journal of Computer Vision, 132(1):208–223, 2024

  69. Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, and Zhaoxiang Zhang. DropPos: Pre-training vision transformers by reconstructing dropped positions. Advances in Neural Information Processing Systems, 36:46134–46151, 2023

  70. Melika Ayoughi, Samira Abnar, Chen Huang, Chris Sandino, Sayeri Lala, Eeshan Gunesh Dhekane, Dan Busbridge, Shuangfei Zhai, Vimal Thilak, Josh Susskind, et al. How parts assemble into wholes: Learning the relative composition of images. arXiv preprint arXiv:2506.03682, 2025

  71. Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo-learn: A library of self-supervised methods for visual representation learning. Journal of Machine Learning Research, 23(56):1–6, 2022

  72. Igor Susmelj, Matthias Heller, Philipp Wirth, Jeremy Prescott, Malte Ebner, et al. Lightly

  73. Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...