pith. machine review for the scientific record.

arxiv: 2210.08402 · v1 · submitted 2022-10-16 · 💻 cs.CV · cs.AI · cs.LG

LAION-5B: An open large-scale dataset for training next generation image-text models

Aarush Katta, Cade Gordon, Christoph Schuhmann, Clayton Mullis, Jenia Jitsev, Katherine Crowson, Ludwig Schmidt, Mehdi Cherti, Mitchell Wortsman, Patrick Schramowski, Richard Vencu, Robert Kaczmarczyk, Romain Beaumont, Ross Wightman, Srivatsa Kundurthy, Theo Coombes

Pith reviewed 2026-05-13 14:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords LAION-5B · image-text dataset · CLIP filtering · multimodal models · Stable Diffusion · dataset release · open research

The pith

LAION-5B supplies 5.85 billion CLIP-filtered image-text pairs to support open replication of large multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LAION-5B, an openly available collection of 5.85 billion image-text pairs that have been filtered with CLIP, including 2.32 billion English-language examples. It demonstrates that this resource can be used to replicate and fine-tune models such as CLIP, GLIDE, and Stable Diffusion. The work addresses the prior absence of public datasets at this scale, which had limited broader study of language-vision systems. By releasing the data along with nearest-neighbor indices, a web exploration interface, and content detection scores, the authors aim to enable additional experiments on training and capabilities of such models.

Core claim

The central claim is that LAION-5B, consisting of 5.85 billion CLIP-filtered image-text pairs of which 2.32 billion are in English, serves as effective training data for replicating foundational language-vision models including CLIP, GLIDE, and Stable Diffusion.

What carries the argument

The LAION-5B dataset of CLIP-filtered image-text pairs, which supplies the raw training material shown to support model replication and fine-tuning.

If this is right

  • Researchers without proprietary data access can now replicate and fine-tune models like CLIP and Stable Diffusion.
  • The provided nearest-neighbor indices and web interface enable efficient subset generation and dataset exploration for targeted experiments.
  • Detection scores for watermark, NSFW, and toxic content support safer curation of training subsets.
  • The scale of the open collection opens the door to further studies of training dynamics in large multimodal systems.
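
The curation point above can be illustrated with a minimal sketch, assuming each metadata row carries per-sample detection scores. The field names (`watermark_score`, `nsfw_score`) and both thresholds are hypothetical choices made for this example; the page does not state the dataset's actual column names or recommended cutoffs.

```python
from typing import Dict, Iterable, List

# Hypothetical thresholds, chosen only for illustration.
MAX_WATERMARK = 0.5
MAX_NSFW = 0.1


def curate(rows: Iterable[Dict[str, object]],
           max_watermark: float = MAX_WATERMARK,
           max_nsfw: float = MAX_NSFW) -> List[Dict[str, object]]:
    """Keep rows whose detection scores fall below both thresholds."""
    kept = []
    for row in rows:
        # Hypothetical field names; a missing score drops the row
        # rather than treating it as safe.
        watermark = row.get("watermark_score")
        nsfw = row.get("nsfw_score")
        if watermark is None or nsfw is None:
            continue
        if watermark < max_watermark and nsfw < max_nsfw:
            kept.append(row)
    return kept


sample = [
    {"url": "a", "watermark_score": 0.05, "nsfw_score": 0.01},
    {"url": "b", "watermark_score": 0.90, "nsfw_score": 0.01},
    {"url": "c", "watermark_score": 0.10, "nsfw_score": 0.70},
    {"url": "d", "watermark_score": 0.10},  # no NSFW score recorded
]
subset = curate(sample)
print([row["url"] for row in subset])
```

Dropping rows with missing scores is a conservative design choice for this sketch; a real pipeline might instead route them to a separate review bucket.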

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption may surface new patterns in how data volume and filtering affect zero-shot generalization and out-of-distribution robustness.
  • The dataset could accelerate work on specialized or domain-adapted multimodal models by allowing groups to start from a common public base.
  • Questions around long-term data maintenance, versioning, and bias auditing become more tractable with a fixed public reference collection.

Load-bearing premise

CLIP-based filtering at web scale produces training data of sufficient quality and diversity to support effective model replication.
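
As a rough illustration of what CLIP-based filtering means in practice, the sketch below keeps an image-text pair only when the cosine similarity of its two embeddings reaches a cutoff. It assumes the embeddings have already been computed by some CLIP-style encoder; the threshold value and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical cutoff for illustration; not the value used to build LAION-5B.
SIMILARITY_THRESHOLD = 0.3

Pair = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs: List[Pair],
                 threshold: float = SIMILARITY_THRESHOLD) -> List[str]:
    """Return ids of pairs whose image and text embeddings agree closely enough."""
    return [
        pair_id
        for pair_id, image_emb, text_emb in pairs
        if cosine_similarity(image_emb, text_emb) >= threshold
    ]


# Toy embeddings standing in for real encoder outputs.
toy_pairs: List[Pair] = [
    ("aligned", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ("orthogonal", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ("opposed", [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]),
]
kept_ids = filter_pairs(toy_pairs)
print(kept_ids)
```

The premise under test is whether a single scalar cutoff of this kind yields data that is both clean and diverse enough; raising the threshold trades coverage for tighter image-text agreement.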

What would settle it

A direct side-by-side comparison in which a model such as Stable Diffusion is trained from scratch on LAION-5B and evaluated on the same benchmarks used for the original model; large gaps in performance metrics would undermine the replication claim.

Original abstract

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection. Announcement page https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents LAION-5B, a publicly released dataset of 5.85 billion CLIP-filtered image-text pairs (2.32 billion English) extracted from Common Crawl. It demonstrates the dataset's utility through reported successful replication and fine-tuning of CLIP, GLIDE, and Stable Diffusion, and supplies supporting resources including nearest-neighbor indices, a web interface, and NSFW/watermark/toxicity detection scores.

Significance. If the replication results hold under scrutiny, the open release of this scale of filtered multimodal data would substantially lower barriers to research on large vision-language models, enabling independent verification and extension of work previously limited to well-resourced labs. The provision of auxiliary tools further increases practical value.

major comments (2)
  1. [Dataset construction] Dataset construction section: the exact CLIP similarity threshold, any secondary filtering heuristics, and the precise Common Crawl snapshot(s) used are not quantified, preventing exact reproduction of the 5.85 B pair corpus and undermining the central utility claim.
  2. [Experiments] Replication experiments: no quantitative benchmark numbers (zero-shot ImageNet accuracy for the CLIP replication, FID or CLIP score for GLIDE/Stable Diffusion) are reported against the original models or against training on other public datasets, so the assertion of 'successful replication' cannot be evaluated.
minor comments (2)
  1. [Abstract and Section 1] The abstract states 5.85 billion pairs while the body occasionally rounds to 5.8 B; standardize the figure throughout.
  2. [Figures] Figure captions for the nearest-neighbor index examples should include the exact query text and similarity scores used.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address the two major comments point by point below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the exact CLIP similarity threshold, any secondary filtering heuristics, and the precise Common Crawl snapshot(s) used are not quantified, preventing exact reproduction of the 5.85 B pair corpus and undermining the central utility claim.

    Authors: We agree that these parameters must be stated explicitly to support reproducibility. The revised manuscript will quantify the CLIP similarity threshold, describe all secondary filtering heuristics (including deduplication, image-size and aspect-ratio constraints, and language detection), and list the exact Common Crawl snapshots employed in constructing the 5.85 B corpus. revision: yes

  2. Referee: [Experiments] Replication experiments: no quantitative benchmark numbers (zero-shot ImageNet accuracy for the CLIP replication, FID or CLIP score for GLIDE/Stable Diffusion) are reported against the original models or against training on other public datasets, so the assertion of 'successful replication' cannot be evaluated.

    Authors: We accept that the current text does not supply the requested quantitative benchmarks. The revised version will add tables reporting zero-shot ImageNet accuracy for the CLIP model trained on LAION-5B, FID and CLIP scores for the GLIDE and Stable Diffusion replications, and direct comparisons against the original models as well as models trained on other public datasets. revision: yes
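
For readers unfamiliar with the zero-shot accuracy metric requested above, here is a minimal sketch of how top-1 zero-shot accuracy is commonly computed for a CLIP-style model: each image is assigned the class whose text embedding has the highest cosine similarity. The embeddings and labels below are toy values invented for the example, and the function name `zero_shot_top1` is hypothetical; no numbers here come from the paper.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_top1(image_embs: Sequence[Sequence[float]],
                   class_embs: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> float:
    """Fraction of images whose most similar class embedding is the true class."""
    if not labels or len(labels) != len(image_embs):
        raise ValueError("labels and image_embs must be non-empty and equal in length")
    classes = [_normalize(c) for c in class_embs]
    correct = 0
    for emb, label in zip(image_embs, labels):
        unit = _normalize(emb)
        # Dot product of unit vectors equals cosine similarity.
        scores = [sum(x * y for x, y in zip(unit, c)) for c in classes]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy data: two classes, four images, one of which is misclassified.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
labels = [0, 1, 1, 0]
accuracy = zero_shot_top1(image_embs, class_embs, labels)
print(accuracy)
```

In a real evaluation the class embeddings would come from encoding text prompts for each label, and the same routine would be run for both the replicated and the original model on an identical test set.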

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is a dataset release paper whose central claim is the public availability of 5.85B CLIP-filtered image-text pairs together with reported external replications of CLIP, GLIDE and Stable Diffusion. No mathematical derivations, fitted parameters, predictions, or uniqueness theorems appear in the argument. The filtering pipeline is stated as an explicit design choice rather than derived from prior results by the same authors. No self-citation is load-bearing for any internal claim, and the replications are presented as independent evidence of utility rather than outputs forced by the paper's own equations or definitions. The derivation chain is therefore empty; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper. No free parameters, mathematical axioms, or invented entities are introduced; the work relies on standard web crawling and existing CLIP filtering.

pith-pipeline@v0.9.0 · 5626 in / 1021 out tokens · 42274 ms · 2026-05-13T14:17:35.420231+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

    cs.AI 2026-05 unverdicted novelty 7.0

    A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.

  2. Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks

    cs.AI 2026-05 unverdicted novelty 7.0

    Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...

  3. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  4. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  5. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  6. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  7. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  8. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  9. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    cs.CV 2023-03 conditional novelty 7.0

    BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumo...

  10. Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...

  11. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  12. Euclid Quick Data Release (Q1). AstroVink: A vision transformer approach to find strong gravitational lens systems

    astro-ph.IM 2026-04 conditional novelty 6.0

    A vision transformer classifier trained on simulated and real Euclid data recovers all known strong lenses in test sets and finds 8 Grade A plus 26 Grade B new candidates in the Q1 data.

  13. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  14. Kosmos-2: Grounding Multimodal Large Language Models to the World

    cs.CL 2023-06 unverdicted novelty 6.0

    Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.

  15. Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    cs.CV 2023-06 conditional novelty 6.0

    HPD v2 is the largest human preference dataset for text-to-image images with 798k choices, and HPS v2 is the resulting CLIP-based scorer that better predicts human judgments and responds to model improvements.

  16. EVA-CLIP: Improved Training Techniques for CLIP at Scale

    cs.CV 2023-03 conditional novelty 6.0

    EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.

  17. Aligning Text-to-Image Models using Human Feedback

    cs.LG 2023-02 unverdicted novelty 6.0

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  18. Making AI Drafts Count: A Quality Threshold in Audio Description Workflows

    cs.HC 2026-05 unverdicted novelty 5.0

    AI drafts for audio description reduce editing time and cognitive load only when they exceed a content-dependent quality threshold, unlike unguided baseline drafts.

  19. FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...

  20. From Cradle to Cloud: A Life Cycle Review of AI's Environmental Footprint

    cs.CY 2026-05 unverdicted novelty 4.0

    A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.

  21. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

  22. Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning

    cs.CV 2026-04 unverdicted novelty 3.0

    DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.

  23. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 23 Pith papers · 11 internal anchors

  1. [1]

    15https://github.com/lucidrains/DALLE-pytorch 16https://discord.gg/xBPBXfcFHd 17https://gauss-centre.eu 13

    URL https://commoncrawl.org/. 15https://github.com/lucidrains/DALLE-pytorch 16https://discord.gg/xBPBXfcFHd 17https://gauss-centre.eu 13

  2. [2]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.arXiv preprint arXiv:2204.14198, 2022

  3. [3]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. InAdvances in Neural Information Processing Systems (NeurIPS), 2019. URL https://proceedings.neurips.cc/paper/2019/file/97af07a14ca cba...

  4. [4]

    Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 610–623, 2021

  5. [5]

    Large image datasets: A pyrrhic win for computer vision? InProceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546

    Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? InProceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. IEEE, 2021

  6. [6]

    Multimodaldatasets: misogyny, pornography, and malignant stereotypes

    AbebaBirhane, VinayUdayPrabhu, andEmmanuelKahembwe. Multimodaldatasets: misogyny, pornography, and malignant stereotypes. October 2021

  7. [7]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean Conference on Computer Vision (ECCV), 2014. https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/

  8. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  9. [9]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

  10. [10]

    Cross-lingual and multilingual clip

    Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. Cross-lingual and multilingual clip. InProceedings of the Language Resources and Evaluation Conference, pages 6848–6854, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.739

  11. [11]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021

  12. [12]

    Describing textures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InConference on Computer Vision and Pattern Recognition (CVPR), 2014. https://arxiv.org/abs/1311.3618. 14

  13. [13]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  14. [14]

    Redcaps: Web-curated image-text data created by the people, for the people.arXiv preprint arXiv:2111.11431, 2021

    Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people.arXiv preprint arXiv:2111.11431, 2021

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  16. [16]

    Magma–multimodal augmentation of generative models through adapter-based finetuning

    Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. MAGMA - multimodal augmentation of generative models through adapter-based finetuning. CoRR, abs/2112.05253, 2021. URLhttps://arxiv.org/abs/2112.05253

  17. [17]

    CLIP on wheels: Zero-shot object navigation as object localization and exploration, 2022

    Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CLIP on wheels: Zero-shot object navigation as object localization and exploration, 2022

  18. [18]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  19. [19]

    Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, 2022

    Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, and Chunhua Shen. Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, 2022. URL https: //arxiv.org/abs/2204.14095

  20. [20]

    Just: Large-scale multi-tier storage infrastructure at the jülich supercomputing centre

    Stephan Graf and Olaf Mextorf. Just: Large-scale multi-tier storage infrastructure at the jülich supercomputing centre. Journal of large-scale research facilities JLSRF, 7:180, 2021

  21. [21]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. CoRR, abs/2111.14822, 2021. URL https://arxiv.org/abs/2111.14822

  22. [22]

    Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C

    Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C. Nelson, Jessica L. Mega, and Dale R. Webster. Devel- opment and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus P...

  23. [23]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. International Conference on Computer Vision (ICCV), 2021. https://arxiv.org/abs/2006.1 6241

  24. [24]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021. 15

  25. [25]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. Conference on Computer Vision and Pattern Recognition (CVPR), 2021. https://arxiv.org/abs/1907.07174

  26. [26]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Li- juan Wang. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233, 2021

  27. [27]

    Openclip, July 2021

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URLhttps://doi.org/10.5281/ze nodo.5143773

  28. [28]

    13 Published as a conference paper at ICLR 2026 Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, and Shijian Lu

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision.CoRR, abs/2102.05918, 2021. URLhttps://arxiv.org/ abs/2102.05918

  29. [29]

    JUWELS Booster Supercomputer, 2020.https://apps.fz- juelich.de/jsc/hps/juwels/configuration.html#hardware-configuration-of-the-sys tem-name-booster-module

    Juelich Supercomputing Center. JUWELS Booster Supercomputer, 2020.https://apps.fz- juelich.de/jsc/hps/juwels/configuration.html#hardware-configuration-of-the-sys tem-name-booster-module

  30. [30]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  31. [31]

    Deep visual-semantic alignments for generating image descrip- tions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descrip- tions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015

  32. [32]

    Simple but effective: Clip embeddings for embodied ai.arXiv preprint arXiv:2111.09888, 2021

    Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but effective: Clip embeddings for embodied ai.arXiv preprint arXiv:2111.09888, 2021

  33. [33]

    Big transfer (bit): General visual representation learning

    Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. InEuropean conference on computer vision, pages 491–507. Springer, 2020

  34. [34]

    Do better imagenet models transfer better? In Conference on Computer Vision and Pattern Recognition (CVPR), 2019

    Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. https: //arxiv.org/abs/1805.08974

  35. [35]

    3d object representations for fine- grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InInternational Conference on Computer Vision (ICCV) Workshops,

  36. [36]

    https://ieeexplore.ieee.org/document/6755945

  37. [37]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017. 16

  38. [38]

    Learning multiple layers of features from tiny images,

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images,

  39. [39]

    https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

  40. [40]

    Lantern-rd: Enabling deep learning for mitigation of the invasive spotted lanternfly, 2022

    Srivatsa Kundurthy. Lantern-rd: Enabling deep learning for mitigation of the invasive spotted lanternfly, 2022. URL https://arxiv.org/abs/2205.06397

  41. [41]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.IJCV, 2020

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.IJCV, 2020

  42. [42]

    The bigscience roots corpus: A 1.6 tb composite multilingual dataset

    Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  43. [43]

    Learning visual n-grams from web data

    Ang Li, Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. Learning visual n-grams from web data. InProceedings of the IEEE International Conference on Computer Vision, pages 4183–4192, 2017

  44. [45]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022

  45. [46]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  46. [47]

    Pseudo numerical methods for diffusion models on manifolds

    Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022

  47. [48]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018

  48. [49]

    Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015

    Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015

  49. [50]

    Ciagan: Conditional identity anonymization generative adversarial networks

    Maxim Maximov, Ismail Elezi, and Laura Leal-Taixé. Ciagan: Conditional identity anonymization generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5447–5456, 2020

  50. [51]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT). ACM, 2019

  51. [52]

    Clipcap: Clip prefix for image captioning

    Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021

  52. [53]

    Image-to-word transformation based on dividing and vector quantizing images with words

    Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In First international workshop on multimedia intelligent storage and retrieval management, pages 1–9. Citeseer, 1999

  53. [54]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2021. URL https://arxiv.org/abs/2112.10741

  54. [55]

    cld3: Google’s Compact Language Detector 3, 2022

    Jeroen Ooms. cld3: Google’s Compact Language Detector 3, 2022. https://docs.ropensci.org/cld3/, https://github.com/ropensci/cld3 (devel), https://github.com/google/cld3 (upstream)

  55. [56]

    Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021

    Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V Le. Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021

  56. [57]

    Connecting vision and language with localized narratives

    Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020

  57. [58]

    Learning visual representations using images with captions

    Ariadna Quattoni, Michael Collins, and Trevor Darrell. Learning visual representations using images with captions. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007

  58. [59]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  59. [60]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  60. [61]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. URL https://arxiv.org/abs/2102.12092

  61. [62]

    Hierarchical text-conditional image generation with clip latents, 2022

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204.06125

  62. [63]

    Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML),

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML),

  63. [64]

    https://arxiv.org/abs/1902.10811.

  64. [65]

    Generative adversarial text to image synthesis

    Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016

  65. [66]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752

  66. [67]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  67. [68]

    ImageNet Large Scale Visual Recognition Challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y

  68. [69]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. URL https://arxiv.org/abs/2205.11487

  69. [70]

    Clipfa: Connecting farsi text and images. https://github.com/SajjjadAyobi/CLIPfa, 2021

    Navid Kanaani Sajjad Ayoubi. Clipfa: Connecting farsi text and images. https://github.com/SajjjadAyobi/CLIPfa, 2021

  70. [71]

    Patrick Schramowski, Christopher Tauchmann, and Kristian Kersting. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT). ACM, 2022

  71. [72]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  72. [73]

    Do image classifiers generalize across time?, 2019. https://arxiv.org/abs/1906.02168

    Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. Do image classifiers generalize across time?, 2019. https://arxiv.org/abs/1906.02168

  73. [74]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia, July 2018. Association for Computational L...

  74. [75]

    How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  75. [76]

    Japanese clip. https://github.com/rinnakk/japanese-clip, May 2022

    Makoto Shing. Japanese clip. https://github.com/rinnakk/japanese-clip, May 2022.

  76. [77]

    Koclip. https://github.com/jaketae/koclip, 20201

    Guijin Son, Hansol Park, Jake Tae, and Trent Oh. Koclip. https://github.com/jaketae/koclip, 20201

  77. [78]

    Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2443–2449, 2021

  78. [79]

    Image representations learned with unsupervised pre-training contain human-like biases

    Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 701–713, 2021

  79. [80]

    Revisiting unreasonable effectiveness of data in deep learning era

    Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017

  80. [81]

    Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020

    Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020

Showing first 80 references.