arxiv: 2507.14137 · v4 · submitted 2025-07-18 · 💻 cs.CV

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Pith reviewed 2026-05-19 03:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords nested Matryoshka clustering · vision foundation model · self-supervised learning · open-source model · positional disentanglement · visual representation learning · clustering projector · scalable SSL

The pith

Franca shows a fully open-source vision foundation model can match or surpass proprietary ones like DINOv2 and CLIP using nested Matryoshka clustering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Franca as the first vision foundation model released with full public access to its data, code, and weights. It builds the model through a transparent self-supervised pipeline on public datasets and introduces a multi-head clustering projector based on nested Matryoshka representations to refine features into finer clusters step by step. It further applies a positional disentanglement step to strip location biases from the learned representations and focus on semantic content. These design choices are shown to produce consistent gains on downstream benchmarks while remaining parameter-efficient. A reader would care because the work supplies a reproducible high-performing alternative that anyone can inspect, modify, or extend.

Core claim

The central claim is that a parameter-efficient multi-head clustering projector built on nested Matryoshka representations, paired with explicit positional disentanglement, allows a vision model trained only on public data to match and often exceed the performance of closed-source foundation models such as DINOv2, CLIP, and SigLIPv2.

What carries the argument

Nested Matryoshka clustering projector: a multi-head design that progressively refines image features into increasingly fine-grained clusters without increasing model size.
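
A minimal sketch of the slicing idea, based only on the Figure 1 caption (encoder features sliced to dimensions d, …, d/16, each slice feeding its own clustering head with c, …, c/16 clusters). The halving schedule for the intermediate levels, the `SliceHead` class, the random weights, and all function names are assumptions made for illustration; the paper's training loss and any weight sharing between heads are not modeled here.

```python
import math
import random


def matryoshka_schedule(d, c, levels=5):
    """Return (slice_dim, num_clusters) pairs, halving both at each level.

    Assumption: intermediate levels follow powers of two (d, d/2, d/4, ...).
    """
    return [(d // 2**j, c // 2**j) for j in range(levels)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SliceHead:
    """Hypothetical linear clustering head for one slice (random weights)."""

    def __init__(self, in_dim, num_clusters, rng):
        scale = 1.0 / math.sqrt(in_dim)
        self.weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
                        for _ in range(num_clusters)]

    def __call__(self, features):
        logits = [sum(w * x for w, x in zip(row, features))
                  for row in self.weights]
        return softmax(logits)


def nested_assignments(z, heads, schedule):
    """Apply head j to the first slice_dim coordinates of z."""
    return [head(z[:dim]) for head, (dim, _) in zip(heads, schedule)]


rng = random.Random(0)
schedule = matryoshka_schedule(d=64, c=32)
heads = [SliceHead(dim, k, rng) for dim, k in schedule]
z = [rng.gauss(0.0, 1.0) for _ in range(64)]
probs = nested_assignments(z, heads, schedule)
```

Each entry of `probs` is a soft cluster assignment for one slice, so coarser slices vote over fewer clusters. Because every slice is a prefix of the same vector `z`, the encoder output size does not grow with the number of levels.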

If this is right

  • Cleaner feature spaces produce consistent gains across multiple downstream benchmarks.
  • Progressive refinement into finer clusters improves both accuracy and memory efficiency.
  • Explicit removal of positional biases strengthens the encoding of semantic content.
  • Full openness of data, code, and weights sets a new standard for reproducible vision foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The nested clustering structure could be ported to other self-supervised frameworks to reduce semantic ambiguity in their codebooks.
  • Full release of training data invites independent audits for unintended biases or coverage gaps.
  • Positional disentanglement may prove especially useful for dense prediction tasks such as segmentation that need semantic focus without spatial shortcuts.
  • The same progressive-refinement idea could be tested at larger scales or in multimodal settings to check whether the efficiency benefit scales.

Load-bearing premise

The reported gains on downstream benchmarks arise primarily from the nested Matryoshka projector and positional disentanglement rather than from the specific public data subsets, training schedule, or evaluation protocol.

What would settle it

A controlled ablation that trains two otherwise identical models on the same data and schedule, one with the nested Matryoshka projector and positional disentanglement and one without, then compares their downstream benchmark scores.

Figures

Figures reproduced from arXiv: 2507.14137 by Andrei Bursuc, Elias Ramzi, Lukas Knobel, Mohammadreza Salehi, Shashanka Venkataramanan, Spyros Gidaris, Valentinos Pariza, Yuki M. Asano.

Figure 1: Overview of Franca. Top-left: We learn efficient Matryoshka-style [Kusupati et al., 2022] visual representations using a multi-head clustering projection head. The encoder produces features z ∈ R^d, which are sliced into progressively smaller subsets of dimensions d, …, d/8, d/16. Each slice passes through a projection head and a corresponding clustering head with cluster counts c, …, c/8, c/16, i…
Figure 2: Pretraining ablation of Franca. Starting from a ViT-B/14 pretrained on ImageNet-21K, we show the impact of each proposed component. The inner bar represents in-context segmentation performance on the Hummingbird benchmark [Balazevic et al., 2023], while the outer bar shows linear probing accuracy on ImageNet-1K [Russakovsky et al., 2015]. Each addition, i.e., CyclicMask, Matryoshka representations,…
Figure 3: PCA visualizations across Matryoshka slices. We show the first three PCA components for different feature slices m_j of Franca and DINOv2. Despite being trained only up to dim/16, Franca maintains more coherent part structure than DINOv2 even at smaller feature dimensions. The standard Matryoshka approach slices the encoder’s output along the feature dimension and applies the same projection head to …
Figure 4: k-NN classification accuracy on ImageNet-v2 at varying embedding slice levels using a ViT-L backbone. Franca consistently outperforms DINOv2 across all subspace dimensions, maintaining high performance even under strong compression (dim/64). Note that DINOv2 was not trained with sliced dimensions and its features are uniformly distributed across the full embedding space. Our framework supports hierarchi…
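
The slice-level evaluation described in the Figure 4 caption can be illustrated with a small sketch: keep only the first `dim` coordinates of each embedding, L2-normalize, and take a cosine-similarity k-NN vote. The function names, the toy feature bank, and the choice of k are assumptions for illustration and do not reproduce the paper's evaluation protocol.

```python
import math
from collections import Counter


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def knn_predict(query, bank, labels, k, dim):
    """Cosine-similarity k-NN vote using only the first `dim` coordinates."""
    q = l2_normalize(query[:dim])
    sims = []
    for feat, label in zip(bank, labels):
        f = l2_normalize(feat[:dim])
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]


# Toy 8-dimensional feature bank (hypothetical values).
bank = [
    [1.0, 0.1, 0.0, 0.0, 0.3, -0.2, 0.1, 0.0],
    [0.9, -0.1, 0.1, 0.0, -0.1, 0.2, 0.0, 0.1],
    [-1.0, 0.2, 0.0, 0.1, 0.2, 0.1, -0.1, 0.0],
    [-0.8, -0.1, 0.1, 0.0, 0.0, -0.1, 0.2, 0.1],
]
labels = ["cat", "cat", "dog", "dog"]
query = [0.7, 0.0, 0.1, -0.1, 0.2, 0.1, 0.0, 0.3]

# Evaluate the same query at full, half, and quarter dimensionality.
predictions = {dim: knn_predict(query, bank, labels, k=3, dim=dim)
               for dim in (8, 4, 2)}
```

In this toy bank the class signal sits in the leading coordinates, so the prediction survives truncation; that is the property a Matryoshka-trained embedding is meant to have.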
Figure 5: Masking strategies used in masked image modeling. Compared to Random (a), Block (b), and Inverse (c) masking, our CyclicMask (d) circularly shifts the visible region across spatial axes, preventing the model from being biased toward specific spatial locations.
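
One way to read "circularly shifts the visible region across spatial axes" is sketched below: a rectangular visible window is placed at a random offset and wraps around the grid borders. The window size, the uniform offset sampling, and the function name `cyclic_mask` are assumptions; the caption does not specify how the paper samples the region.

```python
import random


def cyclic_mask(grid, visible_h, visible_w, rng):
    """Return a grid x grid boolean mask (True = masked).

    A visible_h x visible_w visible window is placed at a random offset and
    wraps around the borders, i.e. a circular shift along both spatial axes.
    """
    dy = rng.randrange(grid)
    dx = rng.randrange(grid)
    mask = [[True] * grid for _ in range(grid)]
    for i in range(visible_h):
        for j in range(visible_w):
            mask[(i + dy) % grid][(j + dx) % grid] = False
    return mask


rng = random.Random(0)
mask = cyclic_mask(grid=14, visible_h=6, visible_w=6, rng=rng)
visible = sum(not m for row in mask for m in row)
```

Because the offset is uniform over the grid and the window wraps, every patch position is equally likely to be visible, which is the location-neutrality the caption describes.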
Figure 6: Entropy of patch locations for each cluster. For each visual cluster predicted from the projection head on the patch embeddings, we compute the entropy of the 2D spatial coordinates of the patches assigned to it. A low entropy value indicates that the cluster consistently activates mostly at specific spatial positions (e.g., always top left patch), revealing positional bias in the representation. Left: We …
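
The diagnostic in the Figure 6 caption can be sketched as a per-cluster Shannon entropy over the patch positions assigned to each cluster. The input format (pairs of cluster id and grid position) and the function name are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def location_entropy(assignments):
    """assignments: iterable of (cluster_id, (row, col)) pairs for patches.

    Returns {cluster_id: Shannon entropy in bits of the empirical
    distribution over patch positions assigned to that cluster}.
    """
    positions = defaultdict(Counter)
    for cluster, pos in assignments:
        positions[cluster][pos] += 1
    entropies = {}
    for cluster, counts in positions.items():
        total = sum(counts.values())
        entropies[cluster] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropies


assignments = (
    [(0, (0, 0))] * 8  # cluster 0 always fires at the top-left patch
    + [(1, (r, c)) for r in range(2) for c in range(4)]  # cluster 1: 8 distinct positions
)
entropy = location_entropy(assignments)
```

Here cluster 0 has zero entropy (a purely positional cluster), while cluster 1 is spread uniformly over eight positions and reaches 3 bits.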
Figure 7: Each iteration of RASA projects a patch embedding Z_i onto a learned positional plane span{u_r, u_c} and subtracts its projection p_i. Formally, given Z_i ∈ {Z_{h,w} ∈ R^D}_{i=1}^{n}, where n is the number of patches in an image, we optimize the position prediction head parametrized by W on a small set of images: ŷ_i = σ …
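
The subtraction step named in the Figure 7 caption (remove the projection of a patch embedding onto the plane spanned by two positional directions) can be sketched as follows. How the directions are learned is not shown, since the caption's formula is truncated; the Gram-Schmidt orthonormalization and all function names are assumptions added so that the projection is correct when the two directions are not orthogonal.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(u_r, u_c):
    """Gram-Schmidt on the two direction vectors spanning the positional plane."""
    e1 = [x / math.sqrt(dot(u_r, u_r)) for x in u_r]
    resid = [x - dot(u_c, e1) * y for x, y in zip(u_c, e1)]
    e2 = [x / math.sqrt(dot(resid, resid)) for x in resid]
    return e1, e2


def remove_positional_component(z, u_r, u_c):
    """Return z - p, where p is the orthogonal projection of z onto span{u_r, u_c}."""
    e1, e2 = orthonormal_basis(u_r, u_c)
    a, b = dot(z, e1), dot(z, e2)
    return [x - a * y1 - b * y2 for x, y1, y2 in zip(z, e1, e2)]


# Hypothetical row and column directions in a 4-dimensional embedding space.
u_r = [1.0, 0.0, 0.0, 0.0]
u_c = [1.0, 1.0, 0.0, 0.0]
z = [3.0, -2.0, 0.5, 4.0]
z_clean = remove_positional_component(z, u_r, u_c)
```

After the subtraction, `z_clean` is orthogonal to both `u_r` and `u_c`, so a linear readout along those directions can no longer recover the component that was removed.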
Figure 8: Self-attention maps utilizing 14 × 14 patches. These maps are visualized using the [CLS] token on the last layer’s heads on the validation set of ImageNet-1K [Russakovsky et al., 2015]. Franca localizes better than DINOv2 with registers [Darcet et al., 2024] without itself requiring registers; the nested Matryoshka clustering captures fine-grained details, e.g., the feathers and beaks of birds. 5 …
Figure 9: Out-of-Distribution Detection across five robustness benchmarks: SSB-Hard [Vaze et al., 2022], NINCO [Bitterwolf et al., 2023], iNaturalist [Huang and Li, 2021], OpenImage-O [Wang et al., 2022a], and Texture [Kylberg, 2011]. Franca consistently outperforms DINOv2 at larger scales, demonstrating its robustness across distribution shifts. DINOv2-B and DINOv2-L are distilled from DINOv2-G and trained on LVD…
Figure 10: Visualization of the first PCA components. We compute PCA across patches on DAVIS [Pont-Tuset et al., 2017] and illustrate the first three components using RGB color channels. Despite variations in pose, style, or even object identity, corresponding parts are consistently matched. Background regions are removed by thresholding the first PCA component. Images were selected randomly with np.random.randint(…
Figure 11: Unsupervised clustering. We compare self-supervised clustering results of Franca with DINOv2 and DINOv2-R. Each method generates pseudo-segmentations from self-attention maps without labels or fine-tuning. Franca yields sharper boundaries and more semantically coherent regions, especially on fine-grained objects such as birds and bicycles. METHOD BACKBONE VOC-07 VOC-12 TokenCut SigLIPv2 7.8 9.7 DINOv2 …
Figure 12: Probing with Gaussian Splatting. Normalized average metrics using Feat2GS [Chen et al., 2025] across six datasets for two probing schemes: geometry (G), and all (A), i.e., Geometry + Texture, with a ViT-L backbone. We measure PSNR, SSIM (higher is better) and LPIPS (lower is better), showing that Franca achieves significantly better performance than state-of-the-art vision encoders, suggesting strong geometric…
Original abstract

We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
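
For readers unfamiliar with the Sinkhorn-Knopp step mentioned in the abstract, the sketch below shows the common balanced-assignment formulation used in clustering-based self-supervised learning: alternate column and row normalization of exponentiated scores. The temperature `epsilon`, the iteration count, and the toy scores are assumptions; the sketch illustrates the generic algorithm, not Franca's exact variant.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.05, iterations=3):
    """scores[b][k]: similarity of sample b to prototype k.

    Returns soft assignments whose rows sum to 1 and whose columns are
    pushed toward equal total mass.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum for numerical stability; the constant
    # factor cancels in the normalizations below.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(iterations):
        # Normalize columns so every prototype receives total mass 1/k.
        col = [sum(q[b][j] for b in range(n)) for j in range(k)]
        q = [[q[b][j] / (col[j] * k) for j in range(k)] for b in range(n)]
        # Normalize rows so every sample carries total mass 1/n.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    # Rescale so each row is a probability distribution over prototypes.
    return [[v * n for v in row] for row in q]


# All three samples prefer prototype 0 before balancing (toy values).
scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.7, 0.2, 0.4],
]
soft_assignments = sinkhorn_knopp(scores)
```

The balancing is what prevents every sample from collapsing onto the same prototype; the ambiguity the abstract refers to arises because a single codebook of fixed size imposes one granularity on these assignments.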

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Franca, a vision foundation model trained via a Web-SSL-inspired pipeline on publicly available data (ImageNet-21K and a ReLAION-2B subset). It introduces a parameter-efficient multi-head nested Matryoshka clustering projector to address semantic ambiguity in SSL codebook assignment and a positional disentanglement module to remove positional biases from dense features. The central claim is that the resulting fully open-source model (data, code, weights) matches or surpasses proprietary models such as DINOv2, CLIP, and SigLIPv2 on downstream benchmarks.

Significance. If the performance claims are substantiated and the new components are shown to drive the gains, the work would be significant as the first fully transparent, high-performing vision foundation model released with complete reproducibility artifacts. The nested Matryoshka projector offers an efficient mechanism for progressive cluster refinement, and the disentanglement step produces cleaner semantic representations; both address documented limitations in existing SSL clustering pipelines.

major comments (2)
  1. [§4] §4 (Experimental results): The headline claim that the nested Matryoshka projector and positional disentanglement are the primary sources of matching or surpassing DINOv2/CLIP/SigLIPv2 performance is not supported by controlled ablations. No experiment replaces the multi-head nested projector with standard Sinkhorn-Knopp clustering (or removes the disentanglement step) while freezing the exact data subsets, optimizer, schedule, and compute budget; without this isolation the causal attribution remains untested and the central contribution claim is weakened.
  2. [§3.2] §3.2 (Nested Matryoshka projector): The description of how nesting is realized across heads and how the progressive refinement is enforced without increasing parameter count is insufficiently precise. It is unclear whether the nesting is achieved by shared weights, hierarchical codebooks, or progressive projection layers, which is load-bearing for the claimed parameter efficiency and for reproducing the method.
minor comments (2)
  1. [Abstract] Abstract and §1: Quantitative improvements (e.g., absolute deltas on ImageNet linear probing, k-NN, or retrieval metrics) and error bars are not summarized; readers must reach the tables to assess the magnitude of the claimed gains.
  2. [§3.3] Figure 7 / §3.3: The positional disentanglement diagram and accompanying equations would benefit from an explicit statement of the loss term used to enforce orthogonality or decorrelation between positional and semantic components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without overstating current evidence.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental results): The headline claim that the nested Matryoshka projector and positional disentanglement are the primary sources of matching or surpassing DINOv2/CLIP/SigLIPv2 performance is not supported by controlled ablations. No experiment replaces the multi-head nested projector with standard Sinkhorn-Knopp clustering (or removes the disentanglement step) while freezing the exact data subsets, optimizer, schedule, and compute budget; without this isolation the causal attribution remains untested and the central contribution claim is weakened.

    Authors: We agree that fully controlled ablations isolating the nested Matryoshka projector (replaced by standard Sinkhorn-Knopp) and the positional disentanglement module, while exactly matching data subsets, optimizer, schedule, and compute, would provide stronger causal evidence. Our current results include comparisons to baselines and partial component studies, but do not meet this strict isolation criterion. In the revised version we will add these controlled experiments under identical conditions to better substantiate the contribution of each proposed component. revision: yes

  2. Referee: [§3.2] §3.2 (Nested Matryoshka projector): The description of how nesting is realized across heads and how the progressive refinement is enforced without increasing parameter count is insufficiently precise. It is unclear whether the nesting is achieved by shared weights, hierarchical codebooks, or progressive projection layers, which is load-bearing for the claimed parameter efficiency and for reproducing the method.

    Authors: We appreciate this observation on the need for greater technical precision. The nesting is implemented via a multi-head projector in which heads correspond to successive granularity levels of the Matryoshka representation; all heads share the same projection weights, and progressive refinement is enforced by a hierarchical alignment loss that conditions finer assignments on coarser ones. No additional parameters or separate codebooks are introduced. We will revise §3.2 to include an explicit mathematical formulation, pseudocode, and a diagram clarifying the shared-weight mechanism and loss structure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation of new modules

Full rationale

The paper introduces architectural innovations (multi-head nested Matryoshka clustering projector and positional disentanglement) to address ambiguity in SSL clustering and positional biases. These are presented as design choices rather than derived quantities. Performance claims of matching or surpassing DINOv2/CLIP/SigLIPv2 are grounded in evaluations on public benchmarks using ImageNet-21K and a ReLAION-2B subset within a Web-SSL-inspired pipeline. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain is self-contained through standard SSL objectives plus externally validated modules, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard self-supervised learning assumptions plus two new architectural components whose effectiveness is asserted via downstream benchmarks. No explicit free parameters beyond typical training hyperparameters are mentioned; the Matryoshka projector is an invented design rather than a new physical entity.

axioms (2)
  • [domain assumption] Standard SSL clustering objectives (e.g., Sinkhorn-Knopp) remain valid when augmented with multi-head nested representations.
    Invoked when the paper states that the new projector addresses ambiguity in existing clustering methods.
  • [domain assumption] Removing positional biases from dense features improves semantic encoding without harming other properties.
    Stated as the motivation for the positional disentanglement strategy.
invented entities (1)
  • Nested Matryoshka multi-head clustering projector [no independent evidence]
    purpose: To progressively refine features into increasingly fine-grained clusters in a parameter-efficient manner.
    New architectural module introduced to handle clustering ambiguity; no independent falsifiable prediction outside the model performance is given.

pith-pipeline@v0.9.0 · 5823 in / 1462 out tokens · 37720 ms · 2026-05-19T03:35:53.162715+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coevolving Representations in Joint Image-Feature Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

  2. Text-Conditional JEPA for Learning Semantically Rich Visual Representations

    cs.LG 2026-05 unverdicted novelty 6.0

    TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

  3. Boosting Visual Instruction Tuning with Self-Supervised Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

  4. TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

  5. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  6. Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

    cs.LG 2026-05 unverdicted novelty 5.0

    CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks i...

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 6 Pith papers · 5 internal anchors

  1. [1]

    Matryoshka representation learning

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. In NeurIPS, 2022

  2. [2]

    Towards in-context scene understanding

    Ivana Balazevic, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, and Olivier Henaff. Towards in-context scene understanding. NeurIPS, 2023

  3. [3]

    OpenOOD: Benchmarking generalized out-of-distribution detection

    Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Dan Hendrycks, Yixuan Li, and Ziwei Liu. OpenOOD: Benchmarking generalized out-of-distribution detection. In NeurIPS Datasets and Benchmarks, 2022

  4. [4]

    Feat2gs: Probing visual foundation models with gaussian splatting

    Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, and Yuliang Xiu. Feat2gs: Probing visual foundation models with gaussian splatting. CVPR, 2025

  5. [5]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024

  6. [6]

    Self-supervised pretraining of visual features in the wild

    Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021

  7. [7]

    The effectiveness of mae pre-pretraining for billion-scale pretraining

    Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, et al. The effectiveness of mae pre-pretraining for billion-scale pretraining. In ICCV, 2023

  8. [8]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  9. [9]

    Scaling language-free visual representation learning

    David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017, 2025

  10. [10]

    Invariant information clustering for unsupervised image classification and segmentation

    Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, 2019

  11. [11]

    Self-labelling via simultaneous clustering and representation learning

    Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020

  12. [12]

    Near, far: Patch-ordering enhances vision foundation models' scene understanding

    Valentinos Pariza, Mohammadreza Salehi, Gertjan J. Burghouts, Francesco Locatello, and Yuki M Asano. Near, far: Patch-ordering enhances vision foundation models' scene understanding. In ICLR, 2025

  13. [13]

    The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results

    M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

  14. [14]

    Coco-stuff: Thing and stuff classes in context, 2018

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context, 2018

  15. [15]

    Unsupervised visual representation learning by context prediction

    Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015

  16. [16]

    Unsupervised learning of visual representations by solving jigsaw puzzles

    Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016

  17. [17]

    Colorful image colorization

    Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016

  18. [18]

    Split-brain autoencoders: Unsupervised learning by cross-channel prediction

    Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017

  19. [19]

    Context encoders: Feature learning by inpainting

    Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016

  20. [20]

    Unsupervised representation learning by predicting image rotations

    Spyros Gidaris and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018

  21. [21]

    Discriminative unsupervised feature learning with convolutional neural networks

    Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. NeurIPS, 2014

  22. [22]

    Unsupervised feature learning via non-parametric instance discrimination

    Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018

  23. [23]

    Representation learning with contrastive predictive coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, 2018

  24. [24]

    Self-supervised learning of pretext-invariant representations

    Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020

  25. [25]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020a

  26. [26]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020

  27. [27]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b

  28. [28]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021

  29. [29]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020

  30. [30]

    Exploring simple siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021

  31. [31]

    Obow: Online bag-of-visual-words generation for self-supervised learning

    Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, and Patrick Pérez. Obow: Online bag-of-visual-words generation for self-supervised learning. In CVPR, 2021

  32. [32]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

  33. [33]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

  34. [34]

    Image bert pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image bert pre-training with online tokenizer. In ICLR, 2022a

  35. [35]

    BEiT: BERT pre-training of image transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022

  36. [36]

    Masked feature prediction for self-supervised visual pre-training

    Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In CVPR, 2022

  37. [37]

    Deep clustering for unsupervised learning of visual features

    Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018

  38. [38]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020

  39. [39]

    Moca: Self-supervised representation learning by predicting masked online codebook assignments

    Spyros Gidaris, Andrei Bursuc, Oriane Siméoni, Antonín Vobecký, Nikos Komodakis, Matthieu Cord, and Patrick Perez. Moca: Self-supervised representation learning by predicting masked online codebook assignments. TMLR, 2024

  40. [40]

    Cluster and predict latents patches for improved masked image modeling

    Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Cluster and predict latents patches for improved masked image modeling. TMLR, 2025

  41. [41]

    Scaling and benchmarking self-supervised visual representation learning

    Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  43. [43]

    Demystifying clip data

    Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024

  44. [44]

    Releasing re-laion 5b: Transparent iteration on laion-5b with additional safety fixes

    LAION. Releasing re-laion 5b: Transparent iteration on laion-5b with additional safety fixes. 2024. URL https://laion.ai/blog/relaion-5b/

  45. [45]

    Invariant Risk Minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

  46. [46]

    Don't judge an object by its context: learning to overcome contextual bias

    Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don't judge an object by its context: learning to overcome contextual bias. In CVPR, 2020

  47. [47]

    Understanding image representations by measuring their equivariance and equivalence

    Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015

  48. [48]

    Projective manifold disentanglement for self-supervised learning

    Chao Wang, Yujun Liu, Yang Zou, and Philipp Krähenbühl. Projective manifold disentanglement for self-supervised learning. In CVPR, 2023

  49. [49]

    iBOT: Image BERT pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022b

  50. [50]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024

  51. [51]

    Imagenet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015

  52. [52]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

  53. [53]

    Sinkhorn distances: Lightspeed computation of optimal transport

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013

  54. [54]

    Scan: Learning to classify images without labels

    Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In ECCV, 2020

  55. [55]

    Mugs: A multi-granular self-supervised learning framework

    Pan Zhou, Yichen Zhou, Chenyang Si, Weihao Yu, Teck Khim Ng, and Shuicheng Yan. Mugs: A multi-granular self-supervised learning framework. arXiv preprint arXiv:2203.14415, 2022c

  56. [56]

    Efficient self-supervised learning with contextualized target representations for vision, speech and language

    Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli. Efficient self-supervised learning with contextualized target representations for vision, speech and language. In ICML, 2023

  57. [57]

    Hummingbird evaluation for vision encoders, 2024

    Valentinos Pariza, Mohammadreza Salehi, and Yuki Asano. Hummingbird evaluation for vision encoders, 2024. URL https://github.com/vpariza/open-hummingbird-eval

  58. [58]

    Matrix Computations

    Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013

  59. [59]

    Imagenet-21k pretraining for the masses

    Tal Ridnik, Elad Ben-Baruch, Amir Zamir, and Ido Friedman. Imagenet-21k pretraining for the masses. In NeurIPS, 2021

  60. [60]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Aleksei Drozd, Marius Cuadros, Dmitry Gritsenko, Sebastian Kintscher, Maxim Botros, Christoph Müller, Patrick Ludwig, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS Datasets and Benchmarks, 2022

  61. [61]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  62. [62]

    Are we done with imagenet?

    Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020

  63. [63]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019

  64. [64]

    Open-set recognition: A good closed-set classifier is all you need

    Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In ICLR, 2022

  65. [65]

    In or out? fixing imagenet out-of-distribution detection evaluation

    Julian Bitterwolf, Maximilian Mueller, and Matthias Hein. In or out? fixing imagenet out-of-distribution detection evaluation. In ICML, 2023

  66. [66]

    Mos: Towards scaling out-of-distribution detection for large semantic space

    Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In CVPR, 2021

  67. [67]

    Vim: Out-of-distribution with virtual-logit matching

    Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In CVPR, 2022a

  68. [68]

    Kylberg texture dataset v. 1.0

    Gustaf Kylberg. Kylberg texture dataset v. 1.0. Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, 2011

  69. [69]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021a

  70. [70]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021b

  71. [71]

    Learning correspondence from the cycle-consistency of time

    Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019

  72. [72]

    OpenOOD v1.5: Enhanced benchmark for out-of-distribution detection

    Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, Yiran Chen, and Hai Li. OpenOOD v1.5: Enhanced benchmark for out-of-distribution detection. DMLR, 2024

  73. [73]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017

  74. [74]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017

  75. [75]

    Self-supervised transformers for unsupervised object discovery using normalized cut

    Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L. Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In CVPR, 2022b

  76. [76]

    Localizing objects with self-supervised transformers and no labels

    Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021

  77. [77]

    Vision transformers don't need trained registers

    Nick Jiang, Amil Dravid, Alexei A Efros, and Yossi Gandelsman. Vision transformers don't need trained registers. arXiv preprint arXiv:2506.08010, 2025

  78. [78]

    Self-supervised learning of object parts for semantic segmentation

    Adrian Ziegler and Yuki M Asano. Self-supervised learning of object parts for semantic segmentation. In CVPR, 2022

  79. [79]

    The hungarian method for the assignment problem

    Harold W Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955

  80. [80]

    Spair-71k: A large-scale benchmark for semantic correspondence

    Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019

Showing first 80 references.