arxiv: 2210.08402 · v1 · submitted 2022-10-16 · 💻 cs.CV · cs.AI · cs.LG

LAION-5B: An open large-scale dataset for training next generation image-text models

Aarush Katta, Cade Gordon, Christoph Schuhmann, Clayton Mullis, Jenia Jitsev, Katherine Crowson, Ludwig Schmidt, Mehdi Cherti, Mitchell Wortsman, Patrick Schramowski, Richard Vencu, Robert Kaczmarczyk, Romain Beaumont, Ross Wightman, Srivatsa Kundurthy, Theo Coombes

Pith reviewed 2026-05-13 14:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords LAION-5B · image-text dataset · CLIP filtering · multimodal models · Stable Diffusion · dataset release · open research

The pith

LAION-5B supplies 5.85 billion CLIP-filtered image-text pairs to support open replication of large multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LAION-5B, an openly available collection of 5.85 billion image-text pairs that have been filtered with CLIP, including 2.32 billion English-language examples. It demonstrates that this resource can be used to replicate and fine-tune models such as CLIP, GLIDE, and Stable Diffusion. The work addresses the prior absence of public datasets at this scale, which had limited broader study of language-vision systems. By releasing the data along with nearest-neighbor indices, a web exploration interface, and content detection scores, the authors aim to enable additional experiments on training and capabilities of such models.

Core claim

The central claim is that LAION-5B, consisting of 5.85 billion CLIP-filtered image-text pairs of which 2.32 billion are in English, serves as effective training data for replicating foundational language-vision models including CLIP, GLIDE, and Stable Diffusion.

What carries the argument

The LAION-5B dataset of CLIP-filtered image-text pairs, which supplies the raw training material shown to support model replication and fine-tuning.

If this is right

Researchers without proprietary data access can now replicate and fine-tune models like CLIP and Stable Diffusion.
The provided nearest-neighbor indices and web interface enable efficient subset generation and dataset exploration for targeted experiments.
Detection scores for watermark, NSFW, and toxic content support safer curation of training subsets.
The scale of the open collection opens the door to further studies of training dynamics in large multimodal systems.
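
The curation use of the detection scores mentioned above can be sketched as a simple cutoff filter. This is a minimal illustration, not the authors' tooling: the record type `PairRecord`, its field names `p_watermark` and `p_unsafe`, the example URLs, and both default cutoffs are assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairRecord:
    """One image-text pair with detector scores (field names are illustrative)."""
    url: str
    caption: str
    p_watermark: float  # assumed: estimated probability that the image is watermarked
    p_unsafe: float     # assumed: estimated probability that the image is NSFW


def curate_subset(records, max_watermark=0.5, max_unsafe=0.1):
    """Keep pairs whose detector scores are at or below both cutoffs.

    The default cutoffs are placeholders, not values taken from the paper.
    """
    return [
        r for r in records
        if r.p_watermark <= max_watermark and r.p_unsafe <= max_unsafe
    ]


# Toy records with made-up scores, only to exercise the filter.
records = [
    PairRecord("https://example.org/a.jpg", "a red bicycle", 0.05, 0.01),
    PairRecord("https://example.org/b.jpg", "stock photo of a beach", 0.92, 0.02),
    PairRecord("https://example.org/c.jpg", "untitled", 0.10, 0.75),
]
subset = curate_subset(records)
print([r.url for r in subset])  # only the first record passes both cutoffs
```

Keeping the cutoffs as explicit arguments makes the trade-off visible: a stricter cutoff drops more flagged content but also shrinks the subset.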

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption may surface new patterns in how data volume and filtering affect zero-shot generalization and out-of-distribution robustness.
The dataset could accelerate work on specialized or domain-adapted multimodal models by allowing groups to start from a common public base.
Questions around long-term data maintenance, versioning, and bias auditing become more tractable with a fixed public reference collection.

Load-bearing premise

CLIP-based filtering at web scale produces training data of sufficient quality and diversity to support effective model replication.
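
To make the premise concrete, here is a minimal sketch of similarity-threshold filtering, assuming each pair already has an image embedding and a text embedding from the same encoder. The function names, the toy three-dimensional vectors, and the threshold of 0.3 are illustrative assumptions; the review above notes that the actual threshold is not quantified in the text it read.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, threshold):
    """Return the ids of pairs whose image/text similarity is at least `threshold`.

    `pairs` is an iterable of (pair_id, image_embedding, text_embedding).
    """
    kept = []
    for pair_id, image_emb, text_emb in pairs:
        if cosine_similarity(image_emb, text_emb) >= threshold:
            kept.append(pair_id)
    return kept


# Toy embeddings (hypothetical); real CLIP embeddings have hundreds of dimensions.
toy_pairs = [
    ("match", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ("weak", [1.0, 0.0, 0.0], [0.2, 1.0, 0.0]),
    ("mismatch", [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
]
kept_ids = filter_pairs(toy_pairs, threshold=0.3)  # placeholder threshold
print(kept_ids)
```

The premise is load-bearing because this single scalar decides what enters the corpus: lowering it admits noisier captions, raising it narrows diversity.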

What would settle it

A direct side-by-side comparison in which a model such as Stable Diffusion is trained from scratch on LAION-5B and evaluated on the same benchmarks used for the original model; large gaps in performance metrics would undermine the replication claim.
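
One way to operationalize such a comparison is sketched below, assuming per-benchmark metrics are available for both the original and the replicated model. The metric names, every number, and the tolerances are placeholders invented for illustration; none are results from the paper.

```python
def replication_gaps(reference, replica, higher_is_better):
    """Signed gap per benchmark; a positive value means the replica is better."""
    gaps = {}
    for name, ref_value in reference.items():
        if name not in replica:
            continue  # skip benchmarks the replica was not evaluated on
        diff = replica[name] - ref_value
        gaps[name] = diff if higher_is_better[name] else -diff
    return gaps


def flag_large_gaps(gaps, tolerance):
    """Names of benchmarks where the replica trails by more than the tolerance."""
    return sorted(name for name, gap in gaps.items() if gap < -tolerance[name])


# Placeholder numbers only; they do not come from any reported experiment.
reference = {"zero_shot_top1": 0.60, "fid": 12.0}
replica = {"zero_shot_top1": 0.59, "fid": 15.0}
higher_is_better = {"zero_shot_top1": True, "fid": False}
tolerance = {"zero_shot_top1": 0.02, "fid": 1.0}

gaps = replication_gaps(reference, replica, higher_is_better)
flagged = flag_large_gaps(gaps, tolerance)
print(flagged)
```

Fixing the tolerances before running the comparison is the important design choice: it keeps "large gap" from being defined after the fact.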

Original abstract

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection. Announcement page https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LAION-5B is a straightforward release of a 5.85B-pair CLIP-filtered image-text dataset that already supports real model training outside closed labs.

The core thing to know is that this paper ships an openly downloadable dataset of 5.85 billion image-text pairs (2.32 billion English) filtered from Common Crawl using CLIP similarity scores, plus supporting indices and safety metadata. That scale was previously only available inside a few companies, so the release itself moves the field forward by letting more groups run large multimodal experiments without starting from scratch on scraping and filtering. The authors back the claim with replications: training a CLIP model from scratch, fine-tuning GLIDE, and producing a working Stable Diffusion checkpoint, all on this data. Those results are the main evidence that the filtering produces usable training material rather than just noise. The accompanying tools (nearest-neighbor search, web explorer, watermark/NSFW detectors) are practical additions that make the resource immediately usable.

The write-up is mostly descriptive, which fits a data-release paper; there are no new algorithms or derivations to evaluate. The main soft spot is that the exact filtering thresholds, language distribution details, and quantitative quality metrics are only sketched at a high level, so a reader has to trust the replication outcomes or go look at the released code and subsets to judge noise levels or biases. That is a limitation but not a fatal one, since the models trained on it demonstrably work.

This paper is for anyone training or studying vision-language models at scale who needs a public starting point. It deserves a serious referee because the resource is large, the replications provide concrete validation, and the community has already started using it. I would send it to peer review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript presents LAION-5B, a publicly released dataset of 5.85 billion CLIP-filtered image-text pairs (2.32 billion English) extracted from Common Crawl. It demonstrates the dataset's utility through reported successful replication and fine-tuning of CLIP, GLIDE, and Stable Diffusion, and supplies supporting resources including nearest-neighbor indices, a web interface, and NSFW/watermark/toxicity detection scores.

Significance. If the replication results hold under scrutiny, the open release of this scale of filtered multimodal data would substantially lower barriers to research on large vision-language models, enabling independent verification and extension of work previously limited to well-resourced labs. The provision of auxiliary tools further increases practical value.

major comments (2)

[Dataset construction] Dataset construction section: the exact CLIP similarity threshold, any secondary filtering heuristics, and the precise Common Crawl snapshot(s) used are not quantified, preventing exact reproduction of the 5.85 B pair corpus and undermining the central utility claim.
[Experiments] Replication experiments: no quantitative benchmark numbers (zero-shot ImageNet accuracy for the CLIP replication, FID or CLIP score for GLIDE/Stable Diffusion) are reported against the original models or against training on other public datasets, so the assertion of 'successful replication' cannot be evaluated.

minor comments (2)

[Abstract and Section 1] The abstract states 5.85 billion pairs while the body occasionally rounds to 5.8 B; standardize the figure throughout.
[Figures] Figure captions for the nearest-neighbor index examples should include the exact query text and similarity scores used.
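
For context on what a caption would need to record, the sketch below returns both neighbor ids and their similarity scores for a query embedding. It uses exact brute-force cosine search as a stand-in, not the approximate nearest-neighbor indices released with the dataset; the function names, item ids, and two-dimensional vectors are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def top_k_neighbors(query, index, k):
    """Return the k most similar items as (item_id, cosine_similarity) pairs.

    `index` maps item ids to embeddings. Every item is scored, so this is an
    exact search, unlike an approximate index built for billions of vectors.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(emb))), item_id)
        for item_id, emb in index.items()
    )
    best = heapq.nlargest(k, scored)
    return [(item_id, round(score, 4)) for score, item_id in best]


# Hypothetical embeddings; in practice these would come from a CLIP encoder.
index = {
    "cat_photo": [0.9, 0.1],
    "dog_photo": [0.6, 0.8],
    "car_photo": [0.0, 1.0],
}
results = top_k_neighbors([1.0, 0.0], index, k=2)
for item_id, score in results:
    print(item_id, score)
```

Reporting the query together with each returned score, as the comment requests, lets a reader judge how close the displayed neighbors actually are.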

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address the two major comments point by point below and will revise the manuscript to incorporate the requested details.

Referee: [Dataset construction] Dataset construction section: the exact CLIP similarity threshold, any secondary filtering heuristics, and the precise Common Crawl snapshot(s) used are not quantified, preventing exact reproduction of the 5.85 B pair corpus and undermining the central utility claim.

Authors: We agree that these parameters must be stated explicitly to support reproducibility. The revised manuscript will quantify the CLIP similarity threshold, describe all secondary filtering heuristics (including deduplication, image-size and aspect-ratio constraints, and language detection), and list the exact Common Crawl snapshots employed in constructing the 5.85 B corpus. revision: yes

Referee: [Experiments] Replication experiments: no quantitative benchmark numbers (zero-shot ImageNet accuracy for the CLIP replication, FID or CLIP score for GLIDE/Stable Diffusion) are reported against the original models or against training on other public datasets, so the assertion of 'successful replication' cannot be evaluated.

Authors: We accept that the current text does not supply the requested quantitative benchmarks. The revised version will add tables reporting zero-shot ImageNet accuracy for the CLIP model trained on LAION-5B, FID and CLIP scores for the GLIDE and Stable Diffusion replications, and direct comparisons against the original models as well as models trained on other public datasets. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a dataset release paper whose central claim is the public availability of 5.85B CLIP-filtered image-text pairs together with reported external replications of CLIP, GLIDE and Stable Diffusion. No mathematical derivations, fitted parameters, predictions, or uniqueness theorems appear in the argument. The filtering pipeline is stated as an explicit design choice rather than derived from prior results by the same authors. No self-citation is load-bearing for any internal claim, and the replications are presented as independent evidence of utility rather than outputs forced by the paper's own equations or definitions. The derivation chain is therefore empty; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper. No free parameters, mathematical axioms, or invented entities are introduced; the work relies on standard web crawling and existing CLIP filtering.

pith-pipeline@v0.9.0 · 5626 in / 1021 out tokens · 42274 ms · 2026-05-13T14:17:35.420231+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
cs.AI 2026-05 unverdicted novelty 7.0

A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
cs.AI 2026-05 unverdicted novelty 7.0

Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
cs.CV 2026-04 unverdicted novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
cs.CV 2023-10 unverdicted novelty 7.0

Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
cs.CV 2023-07 unverdicted novelty 7.0

A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
cs.CV 2023-03 conditional novelty 7.0

BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumo...
Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
Euclid Quick Data Release (Q1). AstroVink: A vision transformer approach to find strong gravitational lens systems
astro-ph.IM 2026-04 conditional novelty 6.0

A vision transformer classifier trained on simulated and real Euclid data recovers all known strong lenses in test sets and finds 8 Grade A plus 26 Grade B new candidates in the Q1 data.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
Kosmos-2: Grounding Multimodal Large Language Models to the World
cs.CL 2023-06 unverdicted novelty 6.0

Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
cs.CV 2023-06 conditional novelty 6.0

HPD v2 is the largest human preference dataset for text-to-image images with 798k choices, and HPS v2 is the resulting CLIP-based scorer that better predicts human judgments and responds to model improvements.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
cs.CV 2023-03 conditional novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
Aligning Text-to-Image Models using Human Feedback
cs.LG 2023-02 unverdicted novelty 6.0

A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
Making AI Drafts Count: A Quality Threshold in Audio Description Workflows
cs.HC 2026-05 unverdicted novelty 5.0

AI drafts for audio description reduce editing time and cognitive load only when they exceed a content-dependent quality threshold, unlike unguided baseline drafts.
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
cs.CV 2026-04 unverdicted novelty 5.0

FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
From Cradle to Cloud: A Life Cycle Review of AI's Environmental Footprint
cs.CY 2026-05 unverdicted novelty 4.0

A review of AI sustainability studies finds inconsistent life cycle definitions and predominant reliance on coarse CO2e proxies, with limited coverage of water, materials, and multi-impact assessments.
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
cs.CV 2023-08 unverdicted novelty 4.0

OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
cs.CV 2026-04 unverdicted novelty 3.0

DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 23 Pith papers · 11 internal anchors

[1]

15https://github.com/lucidrains/DALLE-pytorch 16https://discord.gg/xBPBXfcFHd 17https://gauss-centre.eu 13

URL https://commoncrawl.org/. 15https://github.com/lucidrains/DALLE-pytorch 16https://discord.gg/xBPBXfcFHd 17https://gauss-centre.eu 13

work page
[2]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeﬀ Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.arXiv preprint arXiv:2204.14198, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. InAdvances in Neural Information Processing Systems (NeurIPS), 2019. URL https://proceedings.neurips.cc/paper/2019/file/97af07a14ca cba...

work page 2019
[4]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 610–623, 2021

work page 2021
[5]

Large image datasets: A pyrrhic win for computer vision? InProceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546

Abeba Birhane and Vinay Uday Prabhu. Large image datasets: A pyrrhic win for computer vision? InProceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1536–1546. IEEE, 2021

work page 2021
[6]

Multimodaldatasets: misogyny, pornography, and malignant stereotypes

AbebaBirhane, VinayUdayPrabhu, andEmmanuelKahembwe. Multimodaldatasets: misogyny, pornography, and malignant stereotypes. October 2021

work page 2021
[7]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. InEuropean Conference on Computer Vision (ECCV), 2014. https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/

work page 2014
[8]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[9]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeﬀrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[10]

Cross-lingual and multilingual clip

Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. Cross-lingual and multilingual clip. InProceedings of the Language Resources and Evaluation Conference, pages 6848–6854, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.739

work page 2022
[11]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021

work page 2021
[12]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InConference on Computer Vision and Pattern Recognition (CVPR), 2014. https://arxiv.org/abs/1311.3618. 14

work page arXiv 2014
[13]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[14]

Redcaps: Web-curated image-text data created by the people, for the people.arXiv preprint arXiv:2111.11431, 2021

Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people.arXiv preprint arXiv:2111.11431, 2021

work page arXiv 2021
[15]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[16]

Magma–multimodal augmentation of generative models through adapter-based ﬁnetuning

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. MAGMA - multimodal augmentation of generative models through adapter-based ﬁnetuning. CoRR, abs/2112.05253, 2021. URLhttps://arxiv.org/abs/2112.05253

work page arXiv 2021
[17]

CLIP on wheels: Zero-shot object navigation as object localization and exploration, 2022

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. CLIP on wheels: Zero-shot object navigation as object localization and exploration, 2022

work page 2022
[18]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[19]

Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, 2022

Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, and Chunhua Shen. Pyramidclip: Hierarchical feature alignment for vision-language model pretraining, 2022. URL https: //arxiv.org/abs/2204.14095

work page arXiv 2022
[20]

Just: Large-scale multi-tier storage infrastructure at the jülich supercomputing centre

Stephan Graf and Olaf Mextorf. Just: Large-scale multi-tier storage infrastructure at the jülich supercomputing centre. Journal of large-scale research facilities JLSRF, 7:180, 2021

work page 2021
[21]

Vector quantized diﬀusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diﬀusion model for text-to-image synthesis. CoRR, abs/2111.14822, 2021. URL https://arxiv.org/abs/2111.14822

work page arXiv 2021
[22]

Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C

Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C. Nelson, Jessica L. Mega, and Dale R. Webster. Devel- opment and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus P...

work page doi:10.1001/jama.2016.17216 2016
[23]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. International Conference on Computer Vision (ICCV), 2021. https://arxiv.org/abs/2006.1 6241

work page 2021
[24]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021. 15

work page 2021
[25]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. Conference on Computer Vision and Pattern Recognition (CVPR), 2021. https://arxiv.org/abs/1907.07174

work page arXiv 2021
[26]

Scaling up vision-language pre-training for image captioning

Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Li- juan Wang. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233, 2021

work page arXiv 2021
[27]

Openclip, July 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URLhttps://doi.org/10.5281/ze nodo.5143773

work page doi:10.5281/ze 2021
[28]

13 Published as a conference paper at ICLR 2026 Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, and Shijian Lu

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision.CoRR, abs/2102.05918, 2021. URLhttps://arxiv.org/ abs/2102.05918

work page arXiv 2021
[29]

JUWELS Booster Supercomputer, 2020.https://apps.fz- juelich.de/jsc/hps/juwels/configuration.html#hardware-configuration-of-the-sys tem-name-booster-module

Juelich Supercomputing Center. JUWELS Booster Supercomputer, 2020.https://apps.fz- juelich.de/jsc/hps/juwels/configuration.html#hardware-configuration-of-the-sys tem-name-booster-module

work page 2020
[30]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeﬀrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[31]

Deep visual-semantic alignments for generating image descrip- tions

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descrip- tions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015

work page 2015
[32]

Simple but eﬀective: Clip embeddings for embodied ai.arXiv preprint arXiv:2111.09888, 2021

Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, and Aniruddha Kembhavi. Simple but eﬀective: Clip embeddings for embodied ai.arXiv preprint arXiv:2111.09888, 2021

work page arXiv 2021
[33]

Big transfer (bit): General visual representation learning

Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. InEuropean conference on computer vision, pages 491–507. Springer, 2020

work page 2020
[34]

Do better imagenet models transfer better? In Conference on Computer Vision and Pattern Recognition (CVPR), 2019

Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. https: //arxiv.org/abs/1805.08974

work page arXiv 2019
[35]

3d object representations for ﬁne- grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for ﬁne- grained categorization. InInternational Conference on Computer Vision (ICCV) Workshops,

work page
[36]

https://ieeexplore.ieee.org/document/6755945

work page arXiv
[37]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017

[38]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

[40]

Lantern-rd: Enabling deep learning for mitigation of the invasive spotted lanternfly

Srivatsa Kundurthy. Lantern-rd: Enabling deep learning for mitigation of the invasive spotted lanternfly, 2022. URL https://arxiv.org/abs/2205.06397

[41]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020

[42]

The bigscience roots corpus: A 1.6 tb composite multilingual dataset

Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

[43]

Learning visual n-grams from web data

Ang Li, Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. Learning visual n-grams from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 4183–4192, 2017

[45]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022

[46]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

[47]

Pseudo numerical methods for diffusion models on manifolds

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022

[48]

Exploring the limits of weakly supervised pretraining

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018

[49]

Generating images from captions with attention

Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015

[50]

Ciagan: Conditional identity anonymization generative adversarial networks

Maxim Maximov, Ismail Elezi, and Laura Leal-Taixé. Ciagan: Conditional identity anonymization generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5447–5456, 2020

[51]

Model cards for model reporting

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT). ACM, 2019

[52]

Clipcap: Clip prefix for image captioning

Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021

[53]

Image-to-word transformation based on dividing and vector quantizing images with words

Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In First international workshop on multimedia intelligent storage and retrieval management, pages 1–9. Citeseer, 1999

[54]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2021. URL https://arxiv.org/abs/2112.10741

[55]

cld3: Google’s Compact Language Detector 3

Jeroen Ooms. cld3: Google’s Compact Language Detector 3, 2022. https://docs.ropensci.org/cld3/, https://github.com/ropensci/cld3 (devel), https://github.com/google/cld3 (upstream)

[56]

Combined scaling for zero-shot transfer learning

Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V Le. Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021

[57]

Connecting vision and language with localized narratives

Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. Connecting vision and language with localized narratives. In ECCV, 2020

[58]

Learning visual representations using images with captions

Ariadna Quattoni, Michael Collins, and Trevor Darrell. Learning visual representations using images with captions. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007

[59]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

[60]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[61]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. URL https://arxiv.org/abs/2102.12092

[62]

Hierarchical text-conditional image generation with clip latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204.06125

[63]

Do ImageNet classifiers generalize to ImageNet?

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019. https://arxiv.org/abs/1902.10811

[65]

Generative adversarial text to image synthesis

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016

[66]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. URL https://arxiv.org/abs/2112.10752

[67]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

[68]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y

[69]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. URL https://arxiv.org/abs/2205.11487

[70]

Clipfa: Connecting farsi text and images

Navid Kanaani Sajjad Ayoubi. Clipfa: Connecting farsi text and images. https://github.com/SajjjadAyobi/CLIPfa, 2021

[71]

Patrick Schramowski, Christopher Tauchmann, and Kristian Kersting. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT). ACM, 2022

[72]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

[73]

Do image classifiers generalize across time?

Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. Do image classifiers generalize across time?, 2019. https://arxiv.org/abs/1906.02168

[74]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia, July 2018. Association for Computational L...

doi:10.18653/v1/p18-1238
[75]

How much can clip benefit vision-and-language tasks?

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

[76]

Japanese clip

Makoto Shing. Japanese clip. https://github.com/rinnakk/japanese-clip, May 2022

[77]

Koclip

Guijin Son, Hansol Park, Jake Tae, and Trent Oh. Koclip. https://github.com/jaketae/koclip, 2021

[78]

Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning

Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2443–2449, 2021

[79]

Image representations learned with unsupervised pre-training contain human-like biases

Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 701–713, 2021

[80]

Revisiting unreasonable effectiveness of data in deep learning era

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017

[81]

Measuring robustness to natural distribution shifts in image classification

Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020

Showing first 80 references.