pith. machine review for the scientific record.

arxiv: 2604.27932 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

Arjen P. de Vries, Martha Larson, Mingliang Liang, Zhuoran Liu

Pith reviewed 2026-05-07 07:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · data sampling · long-tail · efficient pre-training · cluster-based sampling · semantic balance · VLM training

The pith

Dynamic cluster sampling reduces vision-language model training costs while boosting performance on rare long-tail concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DynamiCS, a method that samples training data by dynamically resizing semantic clusters at each epoch. Large clusters of common data are downsized while small clusters of rare concepts are enlarged. This differs from previous efficient training methods that often discard rare data or flatten the overall distribution. The approach aims to cut computational expenses without losing the natural balance of topics. Experiments indicate it saves on training resources and helps the model learn infrequent concepts better.
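
The paper's exact sampling rule is not reproduced in this summary, so the following is a minimal sketch of the mechanism, assuming pre-computed cluster assignments and an illustrative square-root scaling rule; `alpha`, the epoch budget, and the power form are all assumptions, not the published method.

    import numpy as np

    rng = np.random.default_rng(0)

    def dynamics_epoch_indices(cluster_ids, budget, alpha=0.5):
        """One epoch of cluster-resized sampling: large clusters are thinned,
        small ones repeated. The rule n_c**alpha is monotone in cluster size,
        so the relative ordering of clusters is preserved. Illustrative only."""
        clusters, counts = np.unique(cluster_ids, return_counts=True)
        weights = counts.astype(float) ** alpha
        targets = np.round(budget * weights / weights.sum()).astype(int)
        epoch = []
        for c, t in zip(clusters, targets):
            members = np.flatnonzero(cluster_ids == c)
            # upsample small clusters with replacement, downsample large ones without
            epoch.append(rng.choice(members, size=t, replace=bool(t > len(members))))
        out = np.concatenate(epoch)
        rng.shuffle(out)
        return out

    # re-drawn at every epoch, so each epoch trains on a freshly balanced subset
    cluster_ids = rng.integers(0, 100, size=1_000_000)  # toy cluster assignments
    for epoch in range(3):
        idx = dynamics_epoch_indices(cluster_ids, budget=200_000)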

Core claim

We introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.

What carries the argument

Dynamic cluster-based sampling (DynamiCS), which at each training epoch downsamples large semantic clusters and upsamples small ones to emphasize the long tail while preserving relative cluster order.
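
To make "preserving relative cluster order" concrete, here is a toy contrast between a monotone power rescaling (order kept, tail boosted) and full flattening (order erased); the square-root exponent is a hypothetical stand-in for whatever monotone rule the paper uses.

    import numpy as np

    sizes = np.array([100_000, 10_000, 1_000, 100])  # head -> tail cluster sizes
    budget = 50_000

    power = sizes ** 0.5                             # monotone, so order survives
    scaled = budget * power / power.sum()            # tail share: ~0.09% -> ~2.2%

    flat = np.full(len(sizes), budget / len(sizes))  # flattening: order erased
    print(scaled, flat)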

If this is right

  • VLMs can be pre-trained with fewer total data samples processed while retaining or improving accuracy.
  • Models achieve higher effectiveness on tasks involving rare or long-tail visual-language concepts.
  • Semantic cluster order is preserved, avoiding the loss of topic importance that comes from uniform flattening.
  • Dynamic adjustment per epoch proves more effective than static sampling strategies for balancing efficiency and coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This sampling strategy might extend to other data-intensive training tasks like large language model pre-training where long-tail issues arise.
  • Combining it with other efficiency methods such as gradient checkpointing could yield further compute savings.
  • Downstream applications in specialized domains with rare events, like medical imaging, could benefit from better rare-concept modeling.

Load-bearing premise

Dynamically resizing data clusters each epoch will keep enough examples from common concepts to support learning while increasing exposure to rare ones enough to improve their capture, without creating new biases or slowing convergence.
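
One way to probe this premise numerically is to look at the expected per-example exposure each epoch under a resizing rule (again the hypothetical square-root rule, not the paper's published one): head examples should be thinned rather than discarded, while the rarest examples are repeated within the epoch.

    import numpy as np

    sizes = np.array([100_000, 10_000, 1_000, 100])
    budget = 50_000

    weights = sizes ** 0.5
    targets = budget * weights / weights.sum()

    # expected number of draws per example per epoch in each cluster
    for n, e in zip(sizes, targets / sizes):
        print(f"cluster of {n:>7}: {e:.2f} draws per example per epoch")
    # head examples: ~0.35 draws (thinned, not discarded);
    # rarest examples: ~10.9 draws (repeated within the epoch)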

What would settle it

A compute-matched comparison of zero-shot retrieval accuracy on long-tail concepts: if a model trained with DynamiCS performs no better than one trained with uniform random sampling at identical total compute, the core claim fails.
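
A sketch of how such a settling experiment could be scored, binning test classes by pre-training concept frequency (in the spirit of Figure 3) and comparing compute-matched runs; every array below is a toy stand-in, not data from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def accuracy_by_frequency_bin(correct, class_freq, n_bins=5):
        """Mean per-class accuracy, binned by log pre-training concept
        frequency; the lowest bin is the long tail."""
        logf = np.log10(class_freq + 1)
        edges = np.quantile(logf, np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.digitize(logf, edges[1:-1]), 0, n_bins - 1)
        return np.array([correct[bins == b].mean() for b in range(n_bins)])

    freqs = rng.pareto(1.0, size=1000) * 100 + 1     # long-tailed concept frequencies
    acc_uniform = rng.uniform(0.2, 0.8, size=1000)   # toy per-class accuracy
    acc_dynamics = acc_uniform + 0.05                # hypothetical gain

    gap = (accuracy_by_frequency_bin(acc_dynamics, freqs)
           - accuracy_by_frequency_bin(acc_uniform, freqs))
    # the claim survives only if `gap` stays positive in the low-frequency bins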

Figures

Figures reproduced from arXiv: 2604.27932 by Arjen P. de Vries, Martha Larson, Mingliang Liang, Zhuoran Liu.

Figure 1
Figure 1. Zero-shot top-1 accuracy on ImageNet-1K [16] and Let-it-wag! [17] (long-tail test set). DynamiCS outperforms cost-saving baselines (RECLIP [13], FLIP [7], CLIPA [14]) and dual-purpose approaches (DataComp [18], DFN [19], Captioning [20]) while using fewer computational resources, and achieves accuracy competitive with full-scale pre-training, e.g., OpenCLIP [3]. Experiments are conducted with ViT-B/16 pre-tr… view at source ↗
Figure 2
Figure 2. Image examples from three semantic clusters (e.g., “sea,” “tennis,” “dog”), where visually similar concepts are… view at source ↗
Figure 3
Figure 3. Log-linear relationship between concept frequency and zero-shot performance on ImageNet-1K. Our approach outperforms both RECLIP-Random pruning and RECLIP and substantially surpasses them on long-tail categories. We pre-train RECLIP-Random and RECLIP-DynamiCS with 1.28B samples seen and fine-tune in small steps, reducing their training cost to 50% of RECLIP. Our model was pre-trained on the LAION-400M… view at source ↗
read the original abstract

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, \emph{long-tail concepts} remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a \emph{dynamic cluster-based sampling approach (DynamiCS)} that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DynamiCS, a dynamic cluster-based sampling approach for efficient vision-language pre-training. It dynamically downsamples large semantic clusters and upsamples small ones at each epoch to reduce training costs while providing better representation for long-tail concepts. The work claims to demonstrate the importance of dynamic sampling and the advantages of maintaining the relative order of clusters over flattening the distribution, supported by experiments showing reduced computational cost and improved long-tail performance.

Significance. If the experimental claims are robustly supported, this method could have significant impact on scaling VLM pre-training to larger datasets by improving efficiency and addressing long-tail biases in a principled way. The focus on dynamic adjustment per epoch and preserving semantic cluster structure offers a potentially novel angle compared to static or distribution-flattening methods.

major comments (2)
  1. [§5 Experimental Results] The reported experiments do not include a control where cluster scaling ratios are computed once from the initial clustering and applied statically across all epochs. Without this ablation, gains cannot be confidently attributed to the dynamic per-epoch component of DynamiCS rather than to the cluster-based scaling rule itself. This is critical because the central claim emphasizes the dynamism (see the sketch after the minor comments below).
  2. [§3 Method] Insufficient detail is provided on the cluster formation process, including the embedding model used for clustering, the clustering algorithm, and the number of clusters. This information is load-bearing for understanding and reproducing how 'large' and 'small' clusters are identified and scaled.
minor comments (2)
  1. [Abstract] The abstract does not specify the datasets, VLM architectures, baselines, or evaluation metrics employed in the experiments, which makes it challenging to assess the strength of the claims without reading the full text.
  2. [Throughout] Some notation for cluster sizes and sampling ratios could be more clearly defined with equations to avoid ambiguity.
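
To see why the first major comment matters (the ablation referenced in item 1 above): if clusters are recomputed on evolving embeddings, ratios frozen at epoch 0 drift away from the per-epoch ideal. A toy illustration, with both the drift model and the exponent assumed rather than taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    sizes = np.array([8000.0, 1500.0, 400.0, 100.0])    # cluster sizes at epoch 0
    alpha = 0.5
    static = sizes ** alpha / (sizes ** alpha).sum()    # ratios frozen at epoch 0

    for epoch in range(5):
        dynamic = sizes ** alpha / (sizes ** alpha).sum()  # recomputed per epoch
        tv = 0.5 * np.abs(static - dynamic).sum()          # total variation distance
        print(f"epoch {epoch}: TV(static, dynamic) = {tv:.3f}")
        sizes = sizes * rng.uniform(0.7, 1.3, size=sizes.shape)  # clusters drift

If the static arm matched the dynamic arm under this kind of drift, the gains would belong to the scaling rule, not the dynamism.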

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of DynamiCS. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and reproducibility of our results.

read point-by-point responses
  1. Referee: [§5 Experimental Results] The reported experiments do not include a control where cluster scaling ratios are computed once from the initial clustering and applied statically across all epochs. Without this ablation, gains cannot be confidently attributed to the dynamic per-epoch component of DynamiCS rather than to the cluster-based scaling rule itself. This is critical because the central claim emphasizes the dynamism.

    Authors: We agree that this ablation is important for isolating the contribution of per-epoch dynamism. While our original experiments already compared DynamiCS against static sampling and distribution-flattening baselines, we did not include a static application of our own cluster-derived scaling ratios. In the revised manuscript, we will add this control in §5: scaling ratios will be computed once from the initial clustering and held fixed across epochs, then directly compared to the dynamic version on both efficiency and long-tail metrics. This addition will more rigorously support our emphasis on dynamism. revision: yes

  2. Referee: [§3 Method] Insufficient detail is provided on the cluster formation process, including the embedding model used for clustering, the clustering algorithm, and the number of clusters. This information is load-bearing for understanding and reproducing how 'large' and 'small' clusters are identified and scaled.

    Authors: We appreciate this point and agree that these details are necessary for reproducibility. In the revised §3, we will explicitly specify the embedding model used to obtain representations for clustering, the clustering algorithm, the number of clusters, and the precise rule for designating clusters as large or small (including any size thresholds or relative ordering criteria). We will also add a brief note on how these choices were validated. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with explicit definitions and experimental validation

full rationale

The paper defines DynamiCS explicitly as a per-epoch dynamic cluster-based sampling procedure that downsamples large clusters and upsamples small ones while preserving relative cluster order. It states two sequential demonstrations (importance of dynamism, then advantage of the scaling rule) and reports experimental outcomes on cost reduction and long-tail performance. No equations, first-principles derivations, or fitted parameters are presented that reduce a claimed prediction to its own inputs by construction. No self-citation is invoked as a load-bearing uniqueness theorem or ansatz. The method is not a renaming of a known result but a concrete sampling rule tested against baselines. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical derivations, fitted parameters, or postulated entities; the contribution is described at the level of a high-level algorithmic idea without explicit axioms or free parameters.

pith-pipeline@v0.9.0 · 5514 in / 1136 out tokens · 51271 ms · 2026-05-07T07:34:11.315483+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. CoRR, abs/2001.08361, 2020

  2. [2]

    Learning Transferable Visual Models from Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning, 2021

  3. [3]

    OpenCLIP, July 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021

  4. [4]

    Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In Proceedings of Machine Learning Research, 2021

  5. [5]

    VirTex: Learning Visual Representations from Textual Annotations

    Karan Desai and Justin Johnson. VirTex: Learning Visual Representations from Textual Annotations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

  6. [6]

    Contrastive Learning of Medical Visual Representations from Paired Images and Text

    Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Machine Learning for Healthcare Conference, 2022

  7. [7]

    Scaling Language-Image Pre-Training via Masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling Language-Image Pre-Training via Masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  8. [8]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In IEEE/CVF International Conference on Computer Vision, 2023

  9. [9]

    BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, 2022

  10. [10]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Conference on Neural Information Processing Systems, 2023

  11. [11]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, and Sandhini Agarwal et al. GPT-4 Technical Report. CoRR, abs/2303.08774, 2023

  12. [12]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. CoRR, abs/2204.06125, 2022

  13. [13]

    RECLIP: Resource-Efficient CLIP by Training with Small Images

    Runze Li, Dahun Kim, Bir Bhanu, and Weicheng Kuo. RECLIP: Resource-Efficient CLIP by Training with Small Images. Transactions on Machine Learning Research, 2023

  14. [14]

    An Inverse Scaling Law for CLIP Training

    Xianhang Li, Zeyu Wang, and Cihang Xie. An Inverse Scaling Law for CLIP Training. In Conference on Neural Information Processing Systems, 2023

  15. [15]

    Demystifying CLIP Data

    Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP Data. In International Conference on Learning Representations, 2024

  16. [16]

    ImageNet: A Large-Scale Hierarchical Image Database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009

  17. [17]

    No "Zero-Shot" without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

    Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip Torr, Adel Bibi, Samuel Albanie, and Matthias Bethge. No "Zero-Shot" without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance. In Conference on Neural Information Processing Systems, 2024

  18. [18]

    DataComp: In Search of the Next Generation of Multimodal Datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah M Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alex...

  19. [19]

    Data Filtering Networks

    Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander T Toshev, and Vaishaal Shankar. Data Filtering Networks. In International Conference on Learning Representations, 2024

  20. [20]

    Improving Multimodal Datasets with Image Captioning

    Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. Improving Multimodal Datasets with Image Captioning. In Conference on Neural Information Processing Systems, 2023

  21. [21]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. In NeurIPS Workshop on Data-Centric AI, 2021

  22. [22]

    Effective Pruning of Web-Scale Datasets Based on Complexity of Concept Clusters

    Amro Kamal Mohamed Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, and Ari S. Morcos. Effective Pruning of Web-Scale Datasets Based on Complexity of Concept Clusters. In International Conference on Learning Representations, 2024

  23. [23]

    HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

    Zhixiang Wei, Guangting Wang, Xiaoxiao Ma, Ke Mei, Huaian Chen, Yi Jin, and Fengyun Rao. HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models. In IEEE/CVF International Conference on Computer Vision, 2025

  24. [24]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. CoRR, abs/1807.03748, 2018

  25. [25]

    Attentive Mask CLIP

    Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, et al. Attentive Mask CLIP. In IEEE/CVF International Conference on Computer Vision, 2023

  26. [26]

    Centered Masking for Language-Image Pre-Training

    Mingliang Liang and Martha Larson. Centered Masking for Language-Image Pre-Training. In Machine Learning and Knowledge Discovery in Databases. Research Track and Demo Track, 2024

  27. [27]

    Seeing What Matters: Empowering CLIP with Patch Generation-To-Selection

    Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, and Yazhou Yao. Seeing What Matters: Empowering CLIP with Patch Generation-To-Selection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  28. [28]

    Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-Training

    Mingliang Liang and Martha Larson. Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-Training. In IEEE/CVF Winter Conference on Applications of Computer Vision, 2026

  29. [29]

    Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset for Automatic Image Captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset for Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics, 2018

  30. [30]

    Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

  31. [31]

    LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Mod...

  32. [32]

    What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

    Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, and Xiaojuan Qi. What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights. In Conference on Neural Information Processing Systems, 2024

  33. [33]

    The Neglected Tails in Vision-Language Models

    Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, and Shu Kong. The Neglected Tails in Vision-Language Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  34. [34]

    SemDeDup: Data-Efficient Learning at Web-Scale through Semantic Deduplication

    Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. SemDeDup: Data-Efficient Learning at Web-Scale through Semantic Deduplication. CoRR, abs/2303.09540, 2023

  35. [35]

    On the De-Duplication of LAION-2B

    Ryan Webster, Julien Rabin, Loic Simon, and Frederic Jurie. On the De-Duplication of LAION-2B. CoRR, abs/2303.12733, 2023

  36. [36]

    What If We Recaption Billions of Web Images with LLaMA-3?

    Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, and Cihang Xie. What If We Recaption Billions of Web Images with LLaMA-3? In International Conference on Machine Learning, 2025

  37. [37]

    OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

    Xianhang Li, Yanqing Liu, Haoqin Tu, and Cihang Xie. OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning. In IEEE/CVF International Conference on Computer Vision, 2025

  38. [38]

    Improving CLIP Training with Language Rewrites

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP Training with Language Rewrites. In Conference on Neural Information Processing Systems, 2023

  39. [39]

    Balanced Data Sampling for Language Model Training with Clustering

    Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, and Xipeng Qiu. Balanced Data Sampling for Language Model Training with Clustering. In Findings of the Association for Computational Linguistics: ACL, 2024

  40. [40]

    YFCC100M: The New Data in Multimedia Research

    Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, 59(2), 2016

  41. [41]

    Distributed Representations of Words and Phrases and Their Compositionality

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and Their Compositionality. In Conference on Neural Information Processing Systems, 2013

  42. [42]

    An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021

  43. [43]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Conference on Neural Information Processing Systems, 2017

  44. [44]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, 2014

  45. [45]

    From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014

  46. [46]

    The Caltech-UCSD Birds-200-2011 Dataset

    C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, California Institute of Technology, 2011

  47. [47]

    Automated Flower Classification over a Large Number of Classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated Flower Classification over a Large Number of Classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008

  48. [48]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR, abs/1706.02677, 2017

  49. [49]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019

  50. [50]

    Generative Pretraining from Pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative Pretraining from Pixels. In International Conference on Machine Learning, 2020

  51. [51]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. In International Conference on Learning Representations, 2017

  52. [52]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  53. [53]

    The Faiss Library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss Library. IEEE Transactions on Big Data, 2025. Early Access

  54. [54]

    img2dataset: Easily Turn Large Sets of Image URLs to an Image Dataset

    Romain Beaumont. img2dataset: Easily Turn Large Sets of Image URLs to an Image Dataset. https://github.com/rom1504/img2dataset, 2021

  55. [55]

    CLIP Benchmark, November 2022

    Mehdi Cherti and Romain Beaumont. CLIP Benchmark, November 2022