Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Andrei Bursuc; Elias Ramzi; Lukas Knobel; Mohammadreza Salehi; Shashanka Venkataramanan; Spyros Gidaris; Valentinos Pariza; Yuki M. Asano

arxiv: 2507.14137 · v4 · submitted 2025-07-18 · 💻 cs.CV

Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano

Pith reviewed 2026-05-19 03:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords nested Matryoshka clustering · vision foundation model · self-supervised learning · open-source model · positional disentanglement · visual representation learning · clustering projector · scalable SSL

The pith

Franca shows a fully open-source vision foundation model can match or surpass proprietary ones like DINOv2 and CLIP using nested Matryoshka clustering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Franca as the first vision foundation model released with full public access to its data, code, and weights. It builds the model through a transparent self-supervised pipeline on public datasets and introduces a multi-head clustering projector based on nested Matryoshka representations to refine features into finer clusters step by step. It further applies a positional disentanglement step to strip location biases from the learned representations and focus on semantic content. These design choices are shown to produce consistent gains on downstream benchmarks while remaining parameter-efficient. A reader would care because the work supplies a reproducible high-performing alternative that anyone can inspect, modify, or extend.

Core claim

The central claim is that a parameter-efficient multi-head clustering projector built on nested Matryoshka representations, paired with explicit positional disentanglement, allows a vision model trained only on public data to match and often exceed the performance of closed-source foundation models such as DINOv2, CLIP, and SigLIPv2.

What carries the argument

Nested Matryoshka clustering projector: a multi-head design that progressively refines image features into increasingly fine-grained clusters without increasing model size.
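
To make the mechanism concrete, here is a minimal Python sketch of the slicing described in the Figure 1 caption: the feature vector is cut into nested prefixes, and each prefix gets its own cluster head with a proportionally smaller cluster count. Everything beyond that is an assumption for illustration: the halving schedule between d and d/16 (the caption elides the middle values), the function names, and the use of one random linear layer in place of the paper's separate projection and clustering heads. It uses only the standard library and does not reproduce the authors' training code or their parameter-sharing scheme.

```python
import random


def matryoshka_slices(z, num_levels=5):
    """Return nested prefixes of z with lengths d, d/2, ..., d / 2**(num_levels - 1).

    The halving schedule is an assumption; the caption only shows d, ..., d/8, d/16.
    """
    d = len(z)
    return [z[: d // (2 ** k)] for k in range(num_levels)]


def make_head(in_dim, num_clusters, rng):
    """Hypothetical linear cluster head: one random prototype row per cluster."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(num_clusters)]


def assign_cluster(head, x):
    """Index of the prototype with the largest dot product with x."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in head]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
d, c = 64, 32                      # toy feature size and cluster count
z = [rng.gauss(0.0, 1.0) for _ in range(d)]

slices = matryoshka_slices(z)
heads = [make_head(len(s), c // (2 ** k), rng) for k, s in enumerate(slices)]
assignments = [assign_cluster(h, s) for h, s in zip(heads, slices)]

print([len(s) for s in slices])    # [64, 32, 16, 8, 4]
print([len(h) for h in heads])     # [32, 16, 8, 4, 2]
```

Because the slices are prefixes of one vector, a stored embedding can be truncated to a smaller slice without re-encoding the image, which is the property the k-NN comparison in Figure 4 relies on.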

If this is right

Cleaner feature spaces produce consistent gains across multiple downstream benchmarks.
Progressive refinement into finer clusters improves both accuracy and memory efficiency.
Explicit removal of positional biases strengthens the encoding of semantic content.
Full openness of data, code, and weights sets a new standard for reproducible vision foundation models.
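
The positional-bias point can be illustrated with the projection step from the Figure 7 caption: a patch embedding is projected onto a plane spanned by two positional directions u_r and u_c, and that projection is subtracted. The sketch below is a hedged illustration, not the authors' RASA implementation. The directions are hand-picked rather than learned, the Gram-Schmidt step and all function names are assumptions, and only a single iteration is shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_pair(u_r, u_c):
    """Gram-Schmidt: an orthonormal basis (e1, e2) of span{u_r, u_c}."""
    norm_r = math.sqrt(dot(u_r, u_r))
    e1 = [x / norm_r for x in u_r]
    coeff = dot(u_c, e1)
    residual = [x - coeff * y for x, y in zip(u_c, e1)]
    norm_res = math.sqrt(dot(residual, residual))
    e2 = [x / norm_res for x in residual]
    return e1, e2


def remove_positional_component(z, u_r, u_c):
    """Subtract from z its orthogonal projection onto span{u_r, u_c}."""
    e1, e2 = orthonormal_pair(u_r, u_c)
    a, b = dot(z, e1), dot(z, e2)
    projection = [a * x + b * y for x, y in zip(e1, e2)]
    return [zi - pi for zi, pi in zip(z, projection)]


# Toy positional directions (hypothetical; in the paper they come from a learned head).
u_r = [1.0, 0.0, 0.0, 0.0]
u_c = [1.0, 1.0, 0.0, 0.0]
z = [3.0, -2.0, 0.5, 4.0]

z_clean = remove_positional_component(z, u_r, u_c)
print(z_clean)  # [0.0, 0.0, 0.5, 4.0]
```

After the subtraction the embedding has no component along either positional direction, while the remaining coordinates are unchanged.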

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The nested clustering structure could be ported to other self-supervised frameworks to reduce semantic ambiguity in their codebooks.
Full release of training data invites independent audits for unintended biases or coverage gaps.
Positional disentanglement may prove especially useful for dense prediction tasks such as segmentation that need semantic focus without spatial shortcuts.
The same progressive-refinement idea could be tested at larger scales or in multimodal settings to check whether the efficiency benefit scales.
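
One way to check for the spatial shortcuts mentioned above is the diagnostic described in the Figure 6 caption: for each cluster, compute the entropy of the patch positions assigned to it. A low value means the cluster fires mostly at fixed locations. The sketch below assumes the assignments arrive as (cluster_id, (row, col)) pairs; the function name and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def position_entropy_per_cluster(assignments):
    """Map each cluster id to the entropy (in bits) of its patch positions.

    assignments: iterable of (cluster_id, (row, col)) pairs.
    """
    positions = defaultdict(Counter)
    for cluster_id, pos in assignments:
        positions[cluster_id][pos] += 1

    entropies = {}
    for cluster_id, counts in positions.items():
        total = sum(counts.values())
        # Shannon entropy: sum of p * log2(1 / p) over occupied positions.
        entropies[cluster_id] = sum(
            (n / total) * math.log2(total / n) for n in counts.values()
        )
    return entropies


# Cluster 0 always fires at the top-left patch; cluster 1 is spread over four patches.
toy = [(0, (0, 0))] * 4 + [(1, (0, 0)), (1, (0, 1)), (1, (1, 0)), (1, (1, 1))]
print(position_entropy_per_cluster(toy))  # {0: 0.0, 1: 2.0}
```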

Load-bearing premise

The reported gains on downstream benchmarks arise primarily from the nested Matryoshka projector and positional disentanglement rather than from the specific public data subsets, training schedule, or evaluation protocol.

What would settle it

A controlled ablation that trains two otherwise identical models on the same data and schedule, one with the nested Matryoshka projector and positional disentanglement and one without, then compares their downstream benchmark scores.
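
As a small illustration of what "otherwise identical" means in practice, the sketch below builds the two run configurations and checks that they differ only in the two components under study. The configuration keys and placeholder values are hypothetical; they are not options of the released Franca code.

```python
def make_ablation_pair(shared):
    """Return (baseline, full) configs that differ only in the two studied components."""
    baseline = dict(
        shared, nested_matryoshka_projector=False, positional_disentanglement=False
    )
    full = dict(
        shared, nested_matryoshka_projector=True, positional_disentanglement=True
    )
    return baseline, full


def differing_keys(a, b):
    """Sorted list of keys whose values differ between two configs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Placeholder shared settings (hypothetical names and values).
shared = {"data": "same-public-subset", "schedule": "same-schedule", "seed": 0}
baseline, full = make_ablation_pair(shared)

print(differing_keys(baseline, full))
# ['nested_matryoshka_projector', 'positional_disentanglement']
```

Toggling the two flags independently instead of together would give a four-run grid that separates the contribution of each component.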

Figures

Figures reproduced from arXiv: 2507.14137 by Andrei Bursuc, Elias Ramzi, Lukas Knobel, Mohammadreza Salehi, Shashanka Venkataramanan, Spyros Gidaris, Valentinos Pariza, Yuki M. Asano.

**Figure 1.** Overview of Franca. Top-left: We learn efficient Matryoshka-style [Kusupati et al., 2022] visual representations using a multi-head clustering projection head. The encoder produces features $z \in \mathbb{R}^d$, which are sliced into progressively smaller subsets of dimensions $d, \ldots, d/8, d/16$. Each slice passes through a projection head and a corresponding clustering head with cluster counts $c, \ldots, c/8, c/16$, i…

**Figure 2.** Pretraining ablation of Franca. Starting from a ViT-B/14 pretrained on ImageNet-21K, we show the impact of each proposed component. The inner bar represents in-context segmentation performance on the Hummingbird benchmark [Balazevic et al., 2023], while the outer bar shows linear probing accuracy on ImageNet-1K [Russakovsky et al., 2015]. Each addition, i.e., CyclicMask, Matryoshka representations,…

**Figure 3.** PCA visualizations across Matryoshka slices. We show the first three PCA components for different feature slices $m_j$ of Franca and DINOv2. Despite being trained only up to dim/16, Franca maintains coherent part structure even in smaller feature dimensions, compared to DINOv2. The standard Matryoshka approach slices the encoder's output along the feature dimension and applies the same projection head to …

**Figure 4.** k-NN classification accuracy on ImageNet-v2 at varying embedding slice levels using a ViT-L backbone. Franca consistently outperforms DINOv2 across all subspace dimensions, maintaining high performance even under strong compression (dim/64). Note that DINOv2 was not trained with sliced dimensions and its features are uniformly distributed across the full embedding space. Our framework supports hierarchi…

**Figure 5.** Masking strategies used in masked image modeling. Compared to Random (a), Block (b), and Inverse (c) masking, our CyclicMask (d) circularly shifts the visible region across spatial axes, preventing the model from being biased toward specific spatial locations.

**Figure 6.** Entropy of patch locations for each cluster. For each visual cluster predicted from the projection head on the patch embeddings, we compute the entropy of the 2D spatial coordinates of the patches assigned to it. A low entropy value indicates that the cluster consistently activates mostly at specific spatial positions (e.g., always top left patch), revealing positional bias in the representation. Left: We …

**Figure 7.** Each iteration of RASA projects a patch embedding $Z_i$ onto a learned positional plane $\mathrm{span}\{u_r, u_c\}$ and subtracts its projection $p_i$. Formally, given $Z_i \in \{Z_{h,w} \in \mathbb{R}^{D}\}_{i=1}^{n}$, where $n$ is the number of patches in an image, we optimize the position prediction head parametrized by $W$ on a small set of images: $\hat{y}_i = \sigma$ …

**Figure 8.** Self-attention maps utilizing 14 × 14 patches. These maps are visualized using the [CLS] token on the last layer's heads on the validation set of ImageNet-1K [Russakovsky et al., 2015]. Franca has better localization than DINOv2 with registers [Darcet et al., 2024] without requiring the use of registers, where the nested Matryoshka clustering captures fine-grained details, e.g., feathers and beaks of birds. …

**Figure 9.** Out-of-distribution detection across five robustness benchmarks: SSB-Hard [Vaze et al., 2022], NINCO [Bitterwolf et al., 2023], iNaturalist [Huang and Li, 2021], OpenImage-O [Wang et al., 2022a], and Texture [Kylberg, 2011]. Franca consistently outperforms DINOv2 at larger scales, demonstrating its robustness across distribution shifts. DINOv2-B and DINOv2-L are distilled from DINOv2-G and trained on LVD…

**Figure 10.** Visualization of the first PCA components. We compute PCA across patches on DAVIS [Pont-Tuset et al., 2017] and illustrate the first three components using RGB color channels. Despite variations in pose, style, or even object identity, corresponding parts are consistently matched. Background regions are removed by thresholding the first PCA component. Images were selected randomly with np.random.randint(…

**Figure 11.** Unsupervised clustering. We compare self-supervised clustering results of Franca with DINOv2 and DINOv2-R. Each method generates pseudo-segmentations from self-attention maps without labels or fine-tuning. Franca yields sharper boundaries and more semantically coherent regions, especially on fine-grained objects such as birds and bicycles. METHOD BACKBONE VOC-07 VOC-12 TokenCut SigLIPv2 7.8 9.7 DINOv2 …

**Figure 12.** Probing with Gaussian Splatting. Normalized average metrics using Feat2GS [Chen et al., 2025] across six datasets for two probing schemes: geometry (G) and all (A), i.e., geometry + texture, with a ViT-L backbone. We measure PSNR and SSIM (higher is better) and LPIPS (lower is better), showing that Franca achieves significantly better performance than state-of-the-art vision encoders, suggesting strong geometric…
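
The CyclicMask idea in the Figure 5 caption, circularly shifting the visible region across both spatial axes, can be sketched in a few lines. This is an illustration under assumptions, not the paper's masking code: the visible region is taken to be a single rectangle, the shift offsets are passed in explicitly rather than sampled, and the function name is hypothetical.

```python
def cyclic_mask(grid_h, grid_w, vis_h, vis_w, shift_r, shift_c):
    """Return a grid_h x grid_w grid of booleans (True = masked, False = visible).

    A vis_h x vis_w visible rectangle anchored at the top-left corner is shifted
    circularly by (shift_r, shift_c), wrapping around the grid edges.
    """
    mask = [[True] * grid_w for _ in range(grid_h)]
    for r in range(vis_h):
        for c in range(vis_w):
            mask[(r + shift_r) % grid_h][(c + shift_c) % grid_w] = False
    return mask


mask = cyclic_mask(4, 4, 2, 2, shift_r=3, shift_c=3)
for row in mask:
    print("".join("#" if masked else "." for masked in row))
# .##.
# ####
# ####
# .##.
```

With the wrap-around, the visible rectangle can straddle the image border, so no grid position is systematically favored as visible or masked.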

Original abstract

We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
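
For readers unfamiliar with the Sinkhorn-Knopp step the abstract refers to, the following is a minimal standard-library sketch of the balanced-assignment idea as commonly used in clustering-based SSL: exponentiated scores are alternately rescaled so that every cluster receives equal total mass and every sample's assignment sums to one. The function name, the toy scores, and the value of epsilon are assumptions for illustration; this is not the Franca implementation.

```python
import math


def sinkhorn_knopp(scores, epsilon, num_iters=50):
    """Turn an N x K score matrix into soft assignments with balanced clusters.

    Rows (samples) sum to 1; columns (clusters) converge toward N / K.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]

    for _ in range(num_iters):
        # Balance clusters: rescale each column to sum to n / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] * (n / k) / col_sums[j] for j in range(k)] for i in range(n)]
        # Normalize samples: rescale each row to sum to 1.
        rows = []
        for row in q:
            total = sum(row)
            rows.append([v / total for v in row])
        q = rows
    return q


# Toy scores: every sample prefers cluster 0, so a plain argmax would collapse.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
q = sinkhorn_knopp(scores, epsilon=0.5)

print([round(sum(row[j] for row in q), 6) for j in range(2)])  # [2.0, 2.0]
```

A smaller epsilon gives sharper assignments but needs more iterations to balance the clusters, which is why the toy call uses a fairly large value.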

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Franca releases a competitive open vision model on public data, but the new nested clustering and disentanglement still need ablations to show they explain the gains.

The main point is that this paper ships a fully open vision foundation model trained on ImageNet-21K plus a ReLAION subset, with code, weights, and data all released, and claims it matches or beats DINOv2, CLIP, and SigLIPv2 on downstream tasks. That release itself is useful for anyone who wants a reproducible starting point without closed data or models. The two technical additions are a nested Matryoshka multi-head clustering projector meant to handle semantic ambiguity more efficiently than standard Sinkhorn-Knopp, and an explicit positional disentanglement step for dense features. Both are described clearly enough in the abstract to see what they aim to fix. The training pipeline follows the Web-SSL template, so the novelty sits mainly in those two modules and the decision to go fully open at this scale.

The open release and the parameter-efficient design of the projector are the parts that stand out as practical and worth looking at. The soft spot is exactly the one the stress-test note flags: without ablations that swap the nested projector for a plain clustering head and remove the disentanglement while holding data, schedule, and optimizer fixed, it is hard to know whether the reported improvements trace to the new components or to the particular data subsets and training choices. The abstract itself contains no numbers, tables, or error bars, so the performance claims cannot be checked from what is shown here.

This paper is aimed at groups working on open self-supervised vision models who need a public baseline they can actually run and modify. A reader who wants to test open alternatives or extend clustering methods for dense representations would get concrete value from the released artifacts. I would send it to peer review so the experimental section can be examined and the ablations can be added or clarified.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Franca, a vision foundation model trained via a Web-SSL-inspired pipeline on publicly available data (ImageNet-21K and a ReLAION-2B subset). It introduces a parameter-efficient multi-head nested Matryoshka clustering projector to address semantic ambiguity in SSL codebook assignment and a positional disentanglement module to remove positional biases from dense features. The central claim is that the resulting fully open-source model (data, code, weights) matches or surpasses proprietary models such as DINOv2, CLIP, and SigLIPv2 on downstream benchmarks.

Significance. If the performance claims are substantiated and the new components are shown to drive the gains, the work would be significant as the first fully transparent, high-performing vision foundation model released with complete reproducibility artifacts. The nested Matryoshka projector offers an efficient mechanism for progressive cluster refinement, and the disentanglement step produces cleaner semantic representations; both address documented limitations in existing SSL clustering pipelines.

major comments (2)

§4 (Experimental results): The headline claim that the nested Matryoshka projector and positional disentanglement are the primary sources of matching or surpassing DINOv2/CLIP/SigLIPv2 performance is not supported by controlled ablations. No experiment replaces the multi-head nested projector with standard Sinkhorn-Knopp clustering (or removes the disentanglement step) while freezing the exact data subsets, optimizer, schedule, and compute budget; without this isolation the causal attribution remains untested and the central contribution claim is weakened.
§3.2 (Nested Matryoshka projector): The description of how nesting is realized across heads and how the progressive refinement is enforced without increasing parameter count is insufficiently precise. It is unclear whether the nesting is achieved by shared weights, hierarchical codebooks, or progressive projection layers, which is load-bearing for the claimed parameter efficiency and for reproducing the method.

minor comments (2)

Abstract and §1: Quantitative improvements (e.g., absolute deltas on ImageNet linear probing, k-NN, or retrieval metrics) and error bars are not summarized; readers must reach the tables to assess the magnitude of the claimed gains.
Figure 7 / §3.3: The positional disentanglement diagram and accompanying equations would benefit from an explicit statement of the loss term used to enforce orthogonality or decorrelation between positional and semantic components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without overstating current evidence.

Referee: §4 (Experimental results): The headline claim that the nested Matryoshka projector and positional disentanglement are the primary sources of matching or surpassing DINOv2/CLIP/SigLIPv2 performance is not supported by controlled ablations. No experiment replaces the multi-head nested projector with standard Sinkhorn-Knopp clustering (or removes the disentanglement step) while freezing the exact data subsets, optimizer, schedule, and compute budget; without this isolation the causal attribution remains untested and the central contribution claim is weakened.

Authors: We agree that fully controlled ablations isolating the nested Matryoshka projector (replaced by standard Sinkhorn-Knopp) and the positional disentanglement module, while exactly matching data subsets, optimizer, schedule, and compute, would provide stronger causal evidence. Our current results include comparisons to baselines and partial component studies, but do not meet this strict isolation criterion. In the revised version we will add these controlled experiments under identical conditions to better substantiate the contribution of each proposed component.

revision: yes

Referee: §3.2 (Nested Matryoshka projector): The description of how nesting is realized across heads and how the progressive refinement is enforced without increasing parameter count is insufficiently precise. It is unclear whether the nesting is achieved by shared weights, hierarchical codebooks, or progressive projection layers, which is load-bearing for the claimed parameter efficiency and for reproducing the method.

Authors: We appreciate this observation on the need for greater technical precision. The nesting is implemented via a multi-head projector in which heads correspond to successive granularity levels of the Matryoshka representation; all heads share the same projection weights, and progressive refinement is enforced by a hierarchical alignment loss that conditions finer assignments on coarser ones. No additional parameters or separate codebooks are introduced. We will revise §3.2 to include an explicit mathematical formulation, pseudocode, and a diagram clarifying the shared-weight mechanism and loss structure.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation of new modules

The paper introduces architectural innovations (multi-head nested Matryoshka clustering projector and positional disentanglement) to address ambiguity in SSL clustering and positional biases. These are presented as design choices rather than derived quantities. Performance claims of matching or surpassing DINOv2/CLIP/SigLIPv2 are grounded in evaluations on public benchmarks using ImageNet-21K and a ReLAION-2B subset within a Web-SSL-inspired pipeline. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain is self-contained through standard SSL objectives plus externally validated modules, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard self-supervised learning assumptions plus two new architectural components whose effectiveness is asserted via downstream benchmarks. No explicit free parameters beyond typical training hyperparameters are mentioned; the Matryoshka projector is an invented design rather than a new physical entity.

axioms (2)

domain assumption: Standard SSL clustering objectives (e.g., Sinkhorn-Knopp) remain valid when augmented with multi-head nested representations.
Invoked when the paper states that the new projector addresses ambiguity in existing clustering methods.
domain assumption: Removing positional biases from dense features improves semantic encoding without harming other properties.
Stated as the motivation for the positional disentanglement strategy.

invented entities (1)

Nested Matryoshka multi-head clustering projector · no independent evidence
purpose: To progressively refine features into increasingly fine-grained clusters in a parameter-efficient manner.
New architectural module introduced to handle clustering ambiguity; no independent falsifiable prediction outside the model performance is given.

pith-pipeline@v0.9.0 · 5823 in / 1462 out tokens · 37720 ms · 2026-05-19T03:35:53.162715+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"nested Matryoshka representations... progressively refines features into increasingly fine-grained clusters... multi-head clustering projector"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"hierarchical clustering that aligns naturally with the granularity of the features... coarse heads capture global semantics, while fine heads focus on local structure"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coevolving Representations in Joint Image-Feature Diffusion
cs.CV · 2026-04 · unverdicted · novelty 7.0
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

Text-Conditional JEPA for Learning Semantically Rich Visual Representations
cs.LG · 2026-05 · unverdicted · novelty 6.0
TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

Boosting Visual Instruction Tuning with Self-Supervised Guidance
cs.CV · 2026-04 · unverdicted · novelty 6.0
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
cs.CV · 2026-04 · unverdicted · novelty 6.0
TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
cs.CV · 2026-04 · unverdicted · novelty 6.0
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
cs.LG · 2026-05 · unverdicted · novelty 5.0
CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks i...

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 6 Pith papers · 5 internal anchors

[1] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. In NeurIPS, 2022.
[2] Ivana Balazevic, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, and Olivier Henaff. Towards in-context scene understanding. NeurIPS, 2023.
[3] Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Dan Hendrycks, Yixuan Li, and Ziwei Liu. OpenOOD: Benchmarking generalized out-of-distribution detection. In NeurIPS Datasets and Benchmarks, 2022.
[4] Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, and Yuliang Xiu. Feat2GS: Probing visual foundation models with Gaussian splatting. CVPR, 2025.
[5] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
[6] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021.
[7] Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, et al. The effectiveness of MAE pre-pretraining for billion-scale pretraining. In ICCV, 2023.
[8] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[9] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, and Saining Xie. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017, 2025.
[10] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, 2019.
[11] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
[12] Valentinos Pariza, Mohammadreza Salehi, Gertjan J. Burghouts, Francesco Locatello, and Yuki M Asano. Near, far: Patch-ordering enhances vision foundation models' scene understanding. In ICLR, 2025.
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
[14] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context, 2018.
[15] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[16] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[17] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
[18] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
[19] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[20]

Unsupervised representation learning by predicting image rotations

Spyros Gidaris and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018

work page 2018
[21]

Discriminative unsupervised feature learning with convolutional neural networks

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. NeurIPS, 2014

work page 2014
[22]

Unsupervised feature learning via non-parametric instance discrimination

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018

work page 2018
[23]

Representation learning with contrastive predictive coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, 2018

work page 2018
[24]

Self-supervised learning of pretext-invariant representations

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020

work page 2020
[25]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020 a

work page 2020
[26]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020

work page 2020
[27]

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020 b

work page internal anchor Pith review Pith/arXiv arXiv 2003
[28]

An empirical study of training self-supervised vision transformers

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021

work page 2021
[29]

Bootstrap your own latent-a new approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. NeurIPS, 2020

[30]

Exploring simple siamese representation learning

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021

[31]

Obow: Online bag-of-visual-words generation for self-supervised learning

Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, and Patrick Pérez. Obow: Online bag-of-visual-words generation for self-supervised learning. In CVPR, 2021

[32]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021

[33]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

[34]

Image bert pre-training with online tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image bert pre-training with online tokenizer. In ICLR, 2022a

[35]

BEiT: BERT pre-training of image transformers

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022

[36]

Masked feature prediction for self-supervised visual pre-training

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In CVPR, 2022

[37]

Deep clustering for unsupervised learning of visual features

Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018

[38]

Unsupervised learning of visual features by contrasting cluster assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020

[39]

Moca: Self-supervised representation learning by predicting masked online codebook assignments

Spyros Gidaris, Andrei Bursuc, Oriane Siméoni, Antonín Vobecký, Nikos Komodakis, Matthieu Cord, and Patrick Pérez. Moca: Self-supervised representation learning by predicting masked online codebook assignments. TMLR, 2024

[40]

Cluster and predict latent patches for improved masked image modeling

Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Cluster and predict latent patches for improved masked image modeling. TMLR, 2025

[41]

Scaling and benchmarking self-supervised visual representation learning

Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019

[42]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

[43]

Demystifying clip data

Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024

[44]

Releasing re-laion 5b: Transparent iteration on laion-5b with additional safety fixes

LAION. Releasing re-laion 5b: Transparent iteration on laion-5b with additional safety fixes. 2024. URL https://laion.ai/blog/relaion-5b/

[45]

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019

[46]

Don't judge an object by its context: learning to overcome contextual bias

Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don't judge an object by its context: learning to overcome contextual bias. In CVPR, 2020

[47]

Understanding image representations by measuring their equivariance and equivalence

Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015

[48]

Projective manifold disentanglement for self-supervised learning

Chao Wang, Yujun Liu, Yang Zou, and Philipp Krähenbühl. Projective manifold disentanglement for self-supervised learning. In CVPR, 2023

[49]

iBOT: Image BERT pre-training with online tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022b

[50]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024

[51]

Imagenet large scale visual recognition challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015

[52]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

[53]

Sinkhorn distances: Lightspeed computation of optimal transport

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013

[54]

Scan: Learning to classify images without labels

Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In ECCV, 2020

[55]

Mugs: A multi-granular self-supervised learning framework

Pan Zhou, Yichen Zhou, Chenyang Si, Weihao Yu, Teck Khim Ng, and Shuicheng Yan. Mugs: A multi-granular self-supervised learning framework. arXiv preprint arXiv:2203.14415, 2022c

[56]

Efficient self-supervised learning with contextualized target representations for vision, speech and language

Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli. Efficient self-supervised learning with contextualized target representations for vision, speech and language. In ICML, 2023

[57]

Hummingbird evaluation for vision encoders, 2024

Valentinos Pariza, Mohammadreza Salehi, and Yuki Asano. Hummingbird evaluation for vision encoders, 2024. URL https://github.com/vpariza/open-hummingbird-eval

[58]

Matrix Computations

Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013

[59]

Imagenet-21k pretraining for the masses

Tal Ridnik, Elad Ben-Baruch, Amir Zamir, and Ido Friedman. Imagenet-21k pretraining for the masses. In NeurIPS, 2021

[60]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Aleksei Drozd, Marius Cuadros, Dmitry Gritsenko, Sebastian Kintscher, Maxim Botros, Christoph Müller, Patrick Ludwig, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS Datasets and Benchmarks, 2022

[61]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

[62]

Are we done with imagenet?

Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020

[63]

Do imagenet classifiers generalize to imagenet?

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019

[64]

Open-set recognition: A good closed-set classifier is all you need

Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In ICLR, 2022

[65]

In or out? fixing imagenet out-of-distribution detection evaluation

Julian Bitterwolf, Maximilian Mueller, and Matthias Hein. In or out? fixing imagenet out-of-distribution detection evaluation. In ICML, 2023

[66]

Mos: Towards scaling out-of-distribution detection for large semantic space

Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In CVPR, 2021

[67]

Vim: Out-of-distribution with virtual-logit matching

Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In CVPR, 2022a

[68]

Kylberg texture dataset v. 1.0

Gustaf Kylberg. Kylberg texture dataset v. 1.0. Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, 2011

[69]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021a

[70]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021b

[71]

Learning correspondence from the cycle-consistency of time

Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019

[72]

OpenOOD v1.5: Enhanced benchmark for out-of-distribution detection

Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, Yiran Chen, and Hai Li. OpenOOD v1.5: Enhanced benchmark for out-of-distribution detection. DMLR, 2024

[73]

The 2017 DAVIS Challenge on Video Object Segmentation

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017

[74]

Scene parsing through ade20k dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017

[75]

Self-supervised transformers for unsupervised object discovery using normalized cut

Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L. Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In CVPR, 2022b

[76]

Localizing objects with self-supervised transformers and no labels

Oriane Sim \'e oni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick P \'e rez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021

[77]

Vision transformers don't need trained registers

Nick Jiang, Amil Dravid, Alexei A Efros, and Yossi Gandelsman. Vision transformers don't need trained registers. arXiv preprint arXiv:2506.08010, 2025

[78]

Self-supervised learning of object parts for semantic segmentation

Adrian Ziegler and Yuki M Asano. Self-supervised learning of object parts for semantic segmentation. In CVPR, 2022

[79]

The hungarian method for the assignment problem

Harold W Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955

[80]

Spair-71k: A large-scale benchmark for semantic correspondence

Juhong Min, Jongmin Lee, Jean Ponce, and Minsu Cho. Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543, 2019


Showing first 80 references.
