Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Pith reviewed 2026-05-19 03:35 UTC · model grok-4.3
The pith
Franca shows a fully open-source vision foundation model can match or surpass proprietary ones like DINOv2 and CLIP using nested Matryoshka clustering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a parameter-efficient multi-head clustering projector built on nested Matryoshka representations, paired with explicit positional disentanglement, allows a vision model trained only on public data to match and often exceed the performance of closed-source foundation models such as DINOv2, CLIP, and SigLIPv2.
What carries the argument
Nested Matryoshka clustering projector: a multi-head design that progressively refines image features into increasingly fine-grained clusters without increasing model size.
If this is right
- Cleaner feature spaces produce consistent gains across multiple downstream benchmarks.
- Progressive refinement into finer clusters improves both accuracy and memory efficiency.
- Explicit removal of positional biases strengthens the encoding of semantic content.
- Full openness of data, code, and weights sets a new standard for reproducible vision foundation models.
Where Pith is reading between the lines
- The nested clustering structure could be ported to other self-supervised frameworks to reduce semantic ambiguity in their codebooks.
- Full release of training data invites independent audits for unintended biases or coverage gaps.
- Positional disentanglement may prove especially useful for dense prediction tasks such as segmentation that need semantic focus without spatial shortcuts.
- The same progressive-refinement idea could be tested at larger scales or in multimodal settings to check whether the efficiency benefit scales.
Load-bearing premise
The reported gains on downstream benchmarks arise primarily from the nested Matryoshka projector and positional disentanglement rather than from the specific public data subsets, training schedule, or evaluation protocol.
What would settle it
A controlled ablation that trains two otherwise identical models on the same data and schedule, one with the nested Matryoshka projector and positional disentanglement and one without, then compares their downstream benchmark scores.
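The ablation described above can be laid out as a small configuration grid. The sketch below is illustrative only: `RunConfig`, its field names, and the placeholder values are assumptions made for this example and are not taken from the paper or its repository. It toggles the two proposed components while holding every other setting fixed.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical fields; the values are placeholders, not the paper's settings.
    data: str = "public-subset"
    schedule: str = "shared-schedule"
    seed: int = 0
    matryoshka_projector: bool = True
    positional_disentanglement: bool = True


def ablation_grid(base: RunConfig) -> list[RunConfig]:
    """Return the 2x2 grid that toggles only the two proposed components."""
    return [
        replace(base, matryoshka_projector=m, positional_disentanglement=p)
        for m, p in product([True, False], repeat=2)
    ]


def controlled_fields_match(runs: list[RunConfig]) -> bool:
    """Check that data, schedule, and seed are identical across all runs."""
    return len({(r.data, r.schedule, r.seed) for r in runs}) == 1


runs = ablation_grid(RunConfig())
```

Comparing the run with both components enabled against the run with both disabled gives the two-model comparison named above; the two mixed runs would additionally separate the contribution of each component.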
Original abstract
We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
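For readers unfamiliar with the Sinkhorn-Knopp step the abstract refers to, the following is a minimal pure-Python sketch of the balanced-assignment normalization commonly used in clustering-based SSL. It is not Franca's implementation: the function name, the `epsilon` temperature, and the iteration count are assumptions chosen for illustration.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Turn an N x K score table (samples x clusters) into soft assignments.

    Alternating column and row normalization pushes clusters toward equal
    total mass while each sample's assignment ends up summing to 1.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating, for numerical stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives total mass 1/k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every sample receives total mass 1/n.
        row_sums = [sum(row) for row in q]
        q = [[v / (rs * n) for v in row] for row, rs in zip(q, row_sums)]
    # Rescale so that each sample's assignment sums to 1.
    return [[v * n for v in row] for row in q]


example_scores = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.95]]
assignments = sinkhorn_knopp(example_scores)
```

The result is a single soft assignment per sample against one codebook, which is the setting in which the abstract says ambiguity in clustering semantics goes unaddressed.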
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Franca, a vision foundation model trained via a Web-SSL-inspired pipeline on publicly available data (ImageNet-21K and a ReLAION-2B subset). It introduces a parameter-efficient multi-head nested Matryoshka clustering projector to address semantic ambiguity in SSL codebook assignment and a positional disentanglement module to remove positional biases from dense features. The central claim is that the resulting fully open-source model (data, code, weights) matches or surpasses proprietary models such as DINOv2, CLIP, and SigLIPv2 on downstream benchmarks.
Significance. If the performance claims are substantiated and the new components are shown to drive the gains, the work would be significant as the first fully transparent, high-performing vision foundation model released with complete reproducibility artifacts. The nested Matryoshka projector offers an efficient mechanism for progressive cluster refinement, and the disentanglement step produces cleaner semantic representations; both address documented limitations in existing SSL clustering pipelines.
major comments (2)
- [§4] §4 (Experimental results): The headline claim that the nested Matryoshka projector and positional disentanglement are the primary sources of matching or surpassing DINOv2/CLIP/SigLIPv2 performance is not supported by controlled ablations. No experiment replaces the multi-head nested projector with standard Sinkhorn-Knopp clustering (or removes the disentanglement step) while freezing the exact data subsets, optimizer, schedule, and compute budget; without this isolation the causal attribution remains untested and the central contribution claim is weakened.
- [§3.2] §3.2 (Nested Matryoshka projector): The description of how nesting is realized across heads and how the progressive refinement is enforced without increasing parameter count is insufficiently precise. It is unclear whether the nesting is achieved by shared weights, hierarchical codebooks, or progressive projection layers, which is load-bearing for the claimed parameter efficiency and for reproducing the method.
minor comments (2)
- [Abstract] Abstract and §1: Quantitative improvements (e.g., absolute deltas on ImageNet linear probing, k-NN, or retrieval metrics) and error bars are not summarized; readers must reach the tables to assess the magnitude of the claimed gains.
- [§3.3] Figure 2 / §3.3: The positional disentanglement diagram and accompanying equations would benefit from an explicit statement of the loss term used to enforce orthogonality or decorrelation between positional and semantic components.
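Because the loss term is not stated in the material reviewed here, the sketch below shows only one generic way to make a feature orthogonal to a set of positional directions: Gram-Schmidt orthonormalization followed by projection onto the orthogonal complement. It is an assumption for illustration, not the paper's method, and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: build an orthonormal basis spanning the given directions."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def remove_positional_component(feature, positional_basis):
    """Project a feature onto the orthogonal complement of the positional basis."""
    out = list(feature)
    for b in positional_basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example: the first two coordinates are treated as "positional".
basis = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cleaned = remove_positional_component([3.0, 4.0, 5.0], basis)
```

An explicit loss, as the comment requests, would instead penalize the residual correlation between the two components during training; which of these the authors use is exactly what the comment asks them to state.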
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without overstating current evidence.
- Referee: [§4] §4 (Experimental results): The headline claim that the nested Matryoshka projector and positional disentanglement are the primary sources of matching or surpassing DINOv2/CLIP/SigLIPv2 performance is not supported by controlled ablations. No experiment replaces the multi-head nested projector with standard Sinkhorn-Knopp clustering (or removes the disentanglement step) while freezing the exact data subsets, optimizer, schedule, and compute budget; without this isolation the causal attribution remains untested and the central contribution claim is weakened.
  Authors: We agree that fully controlled ablations isolating the nested Matryoshka projector (replaced by standard Sinkhorn-Knopp) and the positional disentanglement module, while exactly matching data subsets, optimizer, schedule, and compute, would provide stronger causal evidence. Our current results include comparisons to baselines and partial component studies, but do not meet this strict isolation criterion. In the revised version we will add these controlled experiments under identical conditions to better substantiate the contribution of each proposed component.
  Revision: yes
- Referee: [§3.2] §3.2 (Nested Matryoshka projector): The description of how nesting is realized across heads and how the progressive refinement is enforced without increasing parameter count is insufficiently precise. It is unclear whether the nesting is achieved by shared weights, hierarchical codebooks, or progressive projection layers, which is load-bearing for the claimed parameter efficiency and for reproducing the method.
  Authors: We appreciate this observation on the need for greater technical precision. The nesting is implemented via a multi-head projector in which heads correspond to successive granularity levels of the Matryoshka representation; all heads share the same projection weights, and progressive refinement is enforced by a hierarchical alignment loss that conditions finer assignments on coarser ones. No additional parameters or separate codebooks are introduced. We will revise §3.2 to include an explicit mathematical formulation, pseudocode, and a diagram clarifying the shared-weight mechanism and loss structure.
  Revision: yes
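The shared-weight reading given in the simulated response can be made concrete with a toy sketch. Everything here is an assumption for illustration: the function names, the use of prefix slices, and the single shared prototype table are one possible interpretation of "heads share the same projection weights", and the released code may differ. The hierarchical alignment loss is not sketched.

```python
def head_logits(feature, prototypes, dim, n_clusters):
    """Score the first `dim` feature coordinates against the first
    `n_clusters` prototypes, reusing one shared prototype table."""
    x = feature[:dim]
    return [
        sum(a * b for a, b in zip(x, proto[:dim]))
        for proto in prototypes[:n_clusters]
    ]


def matryoshka_logits(feature, prototypes, levels):
    """Return one logit vector per (dim, n_clusters) level, coarse to fine."""
    return [head_logits(feature, prototypes, d, k) for d, k in levels]


def parameter_count(prototypes):
    """Parameters in the shared table; independent of the number of levels."""
    return sum(len(proto) for proto in prototypes)


# Toy setup: 4-dimensional feature, 4 prototypes, two nested levels.
feature = [1.0, 2.0, 3.0, 4.0]
prototypes = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
levels = [(2, 2), (4, 4)]  # (feature dims used, clusters scored)
logits = matryoshka_logits(feature, prototypes, levels)
```

Under this reading, adding a level only adds another slice of the same table, so the parameter count stays fixed; whether that matches the paper is the point the referee asks the authors to clarify.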
Circularity Check
No significant circularity; claims rest on empirical evaluation of new modules
The paper introduces architectural innovations (multi-head nested Matryoshka clustering projector and positional disentanglement) to address ambiguity in SSL clustering and positional biases. These are presented as design choices rather than derived quantities. Performance claims of matching or surpassing DINOv2/CLIP/SigLIPv2 are grounded in evaluations on public benchmarks using ImageNet-21K and a ReLAION-2B subset within a Web-SSL-inspired pipeline. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The derivation chain is self-contained through standard SSL objectives plus externally validated modules, with no reduction of results to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Standard SSL clustering objectives (e.g., Sinkhorn-Knopp) remain valid when augmented with multi-head nested representations.
- domain assumption: Removing positional biases from dense features improves semantic encoding without harming other properties.
invented entities (1)
- Nested Matryoshka multi-head clustering projector: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  nested Matryoshka representations... progressively refines features into increasingly fine-grained clusters... multi-head clustering projector
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: embed_strictMono_of_one_lt (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  hierarchical clustering that aligns naturally with the granularity of the features... coarse heads capture global semantics, while fine heads focus on local structure
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 6 Pith papers
- Coevolving Representations in Joint Image-Feature Diffusion
  CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
- Text-Conditional JEPA for Learning Semantically Rich Visual Representations
  TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
- Boosting Visual Instruction Tuning with Self-Supervised Guidance
  Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
  TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
- Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
  Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
- Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
  CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks i...