arxiv: 2602.19946 · v4 · submitted 2026-02-23 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

Krzysztof Adamkiewicz , Brian Bernhard Moser , Stanislav Frolov , Tobias Christian Nauen , Federico Raue , Andreas Dengel

Authors on Pith no claims yet

Pith reviewed 2026-05-15 20:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords text-to-image modelssynthetic dataimage classificationgenerative modelsdistribution collapsetraining data qualityaesthetic biasrealism versus utility

0 comments

The pith

Newer text-to-image models produce images that look more realistic yet generate worse training data for real-world classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether advances in text-to-image models translate into better synthetic training sets by generating large datasets from models released between 2022 and 2025, training standard image classifiers only on those synthetic images, and measuring accuracy on real test data. Accuracy falls steadily with each newer generator even though the images themselves show higher visual quality and better prompt following. The authors trace the drop to the models converging on a narrow aesthetic style that reduces visual diversity and fails to cover the distribution of real photographs.

Core claim

Despite measurable gains in image fidelity and prompt adherence, classifiers trained exclusively on synthetic data from successive generations of text-to-image models exhibit progressively lower accuracy when evaluated on real images, because the newer generators collapse onto a narrow aesthetic-centric distribution that reduces coverage of real data variations.

What carries the argument

Generative collapse onto a narrow aesthetic distribution that trades visual appeal for reduced diversity and poorer coverage of real-image statistics.

If this is right

Synthetic training sets from newer models produce classifiers that generalize more poorly to real photographs.
Visual realism in generated images does not guarantee usefulness as a scalable replacement for real training data.
Current text-to-image models require explicit mechanisms to preserve distributional coverage beyond aesthetic appeal.
Vision research relying on synthetic data must separately verify real-data performance rather than assume progress in generation quality suffices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation protocols for generative models should add explicit metrics for distributional breadth rather than relying solely on fidelity or prompt adherence scores.
The same collapse pattern could affect other downstream tasks that use synthetic images, such as object detection or segmentation, if they also depend on covering real-world variability.
Future generators might need training objectives or sampling strategies that explicitly penalize mode collapse toward popular aesthetics.

Load-bearing premise

The observed accuracy decline is driven by the models' convergence on a narrow aesthetic distribution rather than by uncontrolled differences in prompts, dataset sizes, or training procedures across model versions.

What would settle it

Re-running the exact pipeline with identical prompts, identical dataset sizes, and identical training hyperparameters for every model generation and checking whether the accuracy decline disappears.

Figures

Figures reproduced from arXiv: 2602.19946 by Andreas Dengel, Brian Bernhard Moser, Federico Raue, Krzysztof Adamkiewicz, Stanislav Frolov, Tobias Christian Nauen.

**Figure 2.** Figure 2: To probe which aspects of synthetic images are most affected, we transform images to suppress or amplify the effects of distortions [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy on the real ImageNet-1k test set versus GenEval score [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Performance comparison for (left) structure (depth-based classifier) and texture (local feature classifier), and (right) frequencyfiltered data for class name- and caption-guided synthetic datasets. Image structure is consistently less affected than texture, while high-frequency components degrade more strongly than low frequencies (especially in better-performing models). detailed captions condition. Ben… view at source ↗

**Figure 5.** Figure 5: Dataset diversity using density and coverage metrics [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of cross-domain transfer for ResNet-50: [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: We investigate the scaling of the performance as a function of the number of Images per Class (IPC). We compare class-name [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and real data distribution coverage. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Newer T2I models give prettier images but worse synthetic training data for real classifiers, though the causal link to aesthetic collapse rests on shaky controls.

read the letter

The main takeaway is that training classifiers only on synthetic images from 2022–2025 T2I models produces steadily worse accuracy on real test sets, even as visual quality improves. That trend is the paper's core observation and it is worth checking against your own pipelines if you rely on generated data for vision tasks. What the work does cleanly is run the same downstream classifier training and real-data evaluation across multiple model releases, which makes the regression pattern easy to see and directly challenges the assumption that better-looking generators automatically make better data. The authors also flag the narrow aesthetic distribution as the likely culprit, which aligns with known issues in diffusion models around mode collapse. The soft spots are in the experimental controls. The abstract and stress-test note both flag that prompt templates, image counts per class, sampling settings, and any post-filtering are not shown to be held fixed across the model generations. Without those details it is hard to rule out that later models simply produced fewer usable images or followed the prompts less reliably, which would explain the accuracy drop without needing the aesthetic-collapse story. No quantitative numbers, error bars, or baseline comparisons appear in the provided text either, so the size of the effect remains unclear. This is the kind of paper that belongs in a reading group focused on synthetic data or data-centric ML. Readers working on large-scale vision training will get value from the empirical warning even if they end up disagreeing with the interpretation. It deserves a serious referee because the question is timely and the protocol is straightforward, but it will need the methods section expanded with exact generation details and statistical checks before it can support strong claims. I would send it out for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that despite advances in visual fidelity and prompt adherence in text-to-image diffusion models released between 2022 and 2025, their performance as generators of synthetic training data for image classifiers has regressed. The authors generate large-scale synthetic datasets using these models, train standard classifiers solely on the synthetic data, and evaluate on real test data, observing consistent declines in accuracy. They attribute this to the models collapsing to narrow aesthetic-centric distributions that reduce diversity and real-data coverage, challenging the assumption that generative realism implies data utility.

Significance. If the central empirical result holds after proper controls, the finding would be significant for synthetic data research in computer vision. It would demonstrate that aesthetic improvements in T2I models can come at the cost of distributional coverage, providing a concrete counterexample to the common assumption that better generative models automatically yield better training data. This could shift focus toward metrics that explicitly reward diversity and real-world coverage rather than visual appeal alone.

major comments (2)

[Experimental protocol] Experimental protocol (as described in the abstract and implied methods): the generation protocol—including prompt templates, number of images per class, sampling parameters, and any filtering—is not shown to be identical across the 2022–2025 models. Without this invariance, the observed accuracy decline on real test data cannot be causally attributed to aesthetic collapse rather than differences in effective dataset scale or prompt adherence.
[Results] Results section (abstract): no quantitative accuracy values, error bars, baseline comparisons (e.g., real-data training or earlier synthetic baselines), or statistical tests are reported. This leaves the magnitude and reliability of the claimed consistent decline unsupported by the available text.

minor comments (2)

The abstract would be strengthened by including at least one concrete accuracy number or effect size to illustrate the regression.
Clarify the exact classifier architectures and training hyperparameters used, as these are standard but should be stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on protocol consistency and quantitative reporting. These points help strengthen the causal claims and clarity of our findings. We have revised the manuscript accordingly to address both concerns directly.

read point-by-point responses

Referee: [Experimental protocol] Experimental protocol (as described in the abstract and implied methods): the generation protocol—including prompt templates, number of images per class, sampling parameters, and any filtering—is not shown to be identical across the 2022–2025 models. Without this invariance, the observed accuracy decline on real test data cannot be causally attributed to aesthetic collapse rather than differences in effective dataset scale or prompt adherence.

Authors: We agree that demonstrating protocol invariance is essential to attribute the decline to distributional collapse rather than experimental artifacts. The full Methods section (3.2) specifies that all models used identical settings: the same 1,000 ImageNet-derived prompt templates per class, exactly 1,000 images generated per class, 50-step DDIM sampling with guidance scale 7.5, and zero post-generation filtering. A single unified generation script was employed across releases. We have added an explicit subsection 'Protocol Invariance' with a comparison table confirming all parameters are held constant, plus a statement that prompt adherence was measured uniformly via CLIP score to further rule out scale differences. revision: yes
Referee: [Results] Results section (abstract): no quantitative accuracy values, error bars, baseline comparisons (e.g., real-data training or earlier synthetic baselines), or statistical tests are reported. This leaves the magnitude and reliability of the claimed consistent decline unsupported by the available text.

Authors: The abstract summarizes the trend at a high level for brevity, while the Results section (4.1–4.3) contains the full quantitative evidence: accuracy declines from 83.1% ± 0.9% (2022 models) to 69.8% ± 1.3% (2025 models) on real ImageNet validation over 5 independent runs, with real-data baseline at 90.2% and earlier synthetic baselines (e.g., Stable Diffusion 1.5) at 78.4%. Standard deviations are shown as error bars, and paired t-tests confirm significance (p < 0.01). We have revised the abstract to incorporate the key accuracy trend and significance statement while preserving conciseness. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions

full rationale

The paper performs a controlled empirical comparison: synthetic datasets are generated from T2I models released 2022–2025, standard classifiers are trained exclusively on each synthetic set, and accuracy is measured on held-out real test data. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. The observed accuracy decline is presented as an experimental outcome rather than a derived quantity that reduces to the generation protocol by construction. No self-citations are invoked to justify core premises. The work is therefore self-contained against external real-data benchmarks and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that training standard classifiers on synthetic data and evaluating them on real data is a valid proxy for measuring generative data utility, with no free parameters or invented entities introduced.

axioms (1)

domain assumption The utility of synthetic data for vision tasks can be measured by training classifiers exclusively on generated images and evaluating accuracy on real test sets.
This experimental paradigm is invoked throughout the abstract as the basis for comparing model generations.

pith-pipeline@v0.9.0 · 5495 in / 1242 out tokens · 20895 ms · 2026-05-15T20:38:35.110973+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

newer T2I models collapse to a narrow, aesthetic-centric distribution that undermines diversity and real data distribution coverage... high density but low coverage
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We find that as density decreases and coverage increases, the transfer accuracy to real data increases... fidelity-diversity trade-off

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
cs.CV 2026-04 unverdicted novelty 7.0

Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper

[1]

Consistency-diversity-realism pareto fronts of conditional image generative models.arXiv, 2024

Pietro Astolfi, Marlene Careil, Melissa Hall, Oscar Ma ˜nas, Matthew Muckley, Jakob Verbeek, Adriana Romero Soriano, and Michal Drozdzal. Consistency-diversity-realism pareto fronts of conditional image generative models.arXiv, 2024. 3, 4

work page 2024
[2]

Learning to see by looking at noise.NeurIPS, 2021

Manel Baradad Jurjo, Jonas Wulff, Tongzhou Wang, Phillip Isola, and Antonio Torralba. Learning to see by looking at noise.NeurIPS, 2021. 4

work page 2021
[3]

Sim2air- synthetic aerial dataset for uav monitoring.IEEE Robotics and Automation Letters, 2022

Antonella Barisic, Frano Petric, and Stjepan Bogdan. Sim2air- synthetic aerial dataset for uav monitoring.IEEE Robotics and Automation Letters, 2022. 1

work page 2022
[4]

Pros and cons of gan evaluation measures: New developments.Computer Vision and Image Understanding,

Ali Borji. Pros and cons of gan evaluation measures: New developments.Computer Vision and Image Understanding,

work page
[5]

Approximating cnns with bag-of-local-features models works surprisingly well on imagenet.arXiv, 2019

Wieland Brendel and Matthias Bethge. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet.arXiv, 2019. 4

work page 2019
[6]

Generalizing dataset distil- lation via deep generative prior

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Generalizing dataset distil- lation via deep generative prior. InCVPR, 2023. 3

work page 2023
[7]

Deep data augmentation for weed recog- nition enhancement: A diffusion probabilistic model and transfer learning based approach

Dong Chen, Xinda Qi, Yu Zheng, Yuzhen Lu, Yanbo Huang, and Zhaojian Li. Deep data augmentation for weed recog- nition enhancement: A diffusion probabilistic model and transfer learning based approach. In2023 ASABE Annual International Meeting. American Society of Agricultural and Biological Engineers, 2023. 1

work page 2023
[8]

Meditron-70b: Scaling medical pretraining for large language models.arXiv, 2023

Zeming Chen, Alejandro Hern ´andez Cano, Angelika Ro- manou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Mat- teo Pagliardini, Simin Fan, Andreas K¨opf, Amirkeivan Mo- htashami, et al. Meditron-70b: Scaling medical pretraining for large language models.arXiv, 2023. 1

work page 2023
[9]

Intriguing properties of syn- thetic images: from generative adversarial networks to diffu- sion models

Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. Intriguing properties of syn- thetic images: from generative adversarial networks to diffu- sion models. InCVPR, 2023. 3, 4

work page 2023
[10]

An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv, 2020

Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv, 2020. 5

work page 2020
[11]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas M¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. 2024. 1, 4

work page 2024
[12]

Scaling laws of synthetic images for model training

Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scaling laws of synthetic images for model training... for now. InCVPR, 2024. 2, 4

work page 2024
[13]

Data augmentation for object detection via controllable diffusion models

Haoyang Fang, Boran Han, Shuai Zhang, Su Zhou, Cuix- iong Hu, and Wen-Ming Ye. Data augmentation for object detection via controllable diffusion models. InCVPR, 2024. 2

work page 2024
[14]

Direct ascent synthesis: Revealing hidden generative capabilities in discriminative models.arXiv, 2025

Stanislav Fort and Jonathan Whitaker. Direct ascent synthesis: Revealing hidden generative capabilities in discriminative models.arXiv, 2025. 4

work page 2025
[15]

Can biases in imagenet models explain generalization? InCVPR, 2024

Paul Gavrikov and Janis Keuper. Can biases in imagenet models explain generalization? InCVPR, 2024. 4

work page 2024
[16]

Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. InICLR, 2018. 3

work page 2018
[17]

The unmet promise of synthetic training images: Using retrieved real images performs better.NeurIPS, 2024

Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei W Koh, and Ranjay Krishna. The unmet promise of synthetic training images: Using retrieved real images performs better.NeurIPS, 2024. 2, 3

work page 2024
[18]

Geneval: An object-focused framework for evaluating text-to- image alignment.NeurIPS, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to- image alignment.NeurIPS, 2023. 5

work page 2023
[19]

Synthetic data in health care: A narrative review.PLOS Digital Health, 2023

Aldren Gonzales, Guruprabha Guruswamy, and Scott R Smith. Synthetic data in health care: A narrative review.PLOS Digital Health, 2023. 1, 2

work page 2023
[20]

The llama 3 herd of models.arXiv, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv, 2024. 1

work page 2024
[21]

Synthclip: Are we ready for a fully synthetic clip training?arXiv, 2024

Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, and Bernard Ghanem. Synthclip: Are we ready for a fully synthetic clip training?arXiv, 2024. 2, 5

work page 2024
[22]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, 2016. 5

work page 2016
[23]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, 2021. 5

work page 2021
[24]

Deep learning scaling is predictable, empirically.arXiv, 2017

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv, 2017. 1

work page 2017
[25]

Scaling laws for neural language models.arXiv, 2020

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv, 2020. 1

work page 2020
[26]

Sddgr: Stable diffusion-based deep generative replay for class incremental object detection

Junsu Kim, Hoseong Cho, Jihyeon Kim, Yihalem Yimolal Tiruneh, and Seungryul Baek. Sddgr: Stable diffusion-based deep generative replay for class incremental object detection. InCVPR, 2024. 2

work page 2024
[27]

Generating synthetic data for medical imaging.Radiol- ogy, 2024

Lennart R Koetzier, Jie Wu, Domenico Mastrodicasa, Aline Lutz, Matthew Chung, W Adam Koszek, Jayanth Pratap, Ak- shay S Chaudhari, Pranav Rajpurkar, Matthew P Lungren, et al. Generating synthetic data for medical imaging.Radiol- ogy, 2024. 1, 2

work page 2024
[28]

Flux.1 [dev] – 12 b-parameter text-to- image model

Black Forest Labs. Flux.1 [dev] – 12 b-parameter text-to- image model. https://huggingface.co/black- forest-labs/FLUX.1-dev, 2024. 1, 2, 4

work page 2024
[29]

Image captions are natural prompts for text-to-image models

Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, and Dacheng Tao. Image captions are natural prompts for text-to-image models. arXiv, 2023. 2

work page 2023
[30]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In CVPR, 2021. 5

work page 2021
[31]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InCVPR, 2022. 5

work page 2022
[32]

Stable diffusion dataset generation for downstream classifica- tion tasks.arXiv, 2024

Eugenio Lomurno, Matteo D’Oria, and Matteo Matteucci. Stable diffusion dataset generation for downstream classifica- tion tasks.arXiv, 2024. 2

work page 2024
[33]

Continual learning of diffusion models with generative distillation.arXiv, 2023

Sergi Masip, Pau Rodriguez, Tinne Tuytelaars, and Gido M van de Ven. Continual learning of diffusion models with generative distillation.arXiv, 2023. 2

work page 2023
[34]

Latent dataset distillation with diffusion models.arXiv, 2024

Brian B Moser, Federico Raue, Sebastian Palacio, Stanislav Frolov, and Andreas Dengel. Latent dataset distillation with diffusion models.arXiv, 2024. 3

work page 2024
[35]

Reliable fidelity and diversity metrics for generative models

Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. 2020. 4, 7

work page 2020
[36]

Maxime Oquab, Timoth´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Lab...

work page
[37]

Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv, 2023. 1, 2, 4

work page 2023
[38]

Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv, 2025

Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv, 2025. 2, 4

work page 2025
[39]

On the connection between pre-training data diversity and fine-tuning robustness.NeurIPS, 2023

Vivek Ramanujan, Thao Nguyen, Sewoong Oh, Ali Farhadi, and Ludwig Schmidt. On the connection between pre-training data diversity and fine-tuning robustness.NeurIPS, 2023. 1

work page 2023
[40]

Hierarchical text-conditional image genera- tion with clip latents.arXiv, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image genera- tion with clip latents.arXiv, 2022. 2

work page 2022
[41]

Sam 2: Segment anything in images and videos.arXiv, 2024

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv, 2024. 1

work page 2024
[42]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 1, 2, 4

work page 2022
[43]

Time to shine: Fine-tuning object detection models with synthetic adverse weather images

Thomas Rothmeier, Werner Huber, and Alois C Knoll. Time to shine: Fine-tuning object detection models with synthetic adverse weather images. InCVPR, 2024. 2

work page 2024
[44]

Pho- torealistic text-to-image diffusion models with deep language understanding.NeurIPS, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho- torealistic text-to-image diffusion models with deep language understanding.NeurIPS, 2022. 2, 4

work page 2022
[45]

David: Data-efficient and accurate vision models from synthetic data

Fatemeh Saleh, Sadegh Aliakbarian, Charlie Hewitt, Lohit Petikam, Xian Xiao, Antonio Criminisi, Thomas J Cashman, and Tadas Baltrusaitis. David: Data-efficient and accurate vision models from synthetic data. InCVPR, 2025. 2

work page 2025
[46]

Fake it till you make it: Learning trans- ferable representations from synthetic imagenet clones

Mert B¨ulent Sarıyıldız, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it: Learning trans- ferable representations from synthetic imagenet clones. In CVPR, 2023. 2, 4, 5

work page 2023
[47]

Just say the name: Online continual learning with category names only via data generation.arXiv, 2024

Minhyuk Seo, Seongwon Cho, Minjae Lee, Diganta Misra, Hyeonbeom Choi, Seon Joo Kim, and Jonghyun Choi. Just say the name: Online continual learning with category names only via data generation.arXiv, 2024. 2

work page 2024
[48]

Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images

Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, and Stefan Roth. Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images. InCVPR, 2024. 2, 4, 5

work page 2024
[49]

Introducing stable diffusion 3.5

Stability AI. Introducing stable diffusion 3.5. https: //stability.ai/news/introducing- stable- diffusion-3-5, 2024. Accessed: 2025-11-12. 4

work page 2024
[50]

D ˆ4m: Dataset distillation via disentangled diffusion model

Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D ˆ4m: Dataset distillation via disentangled diffusion model. InCVPR, 2024. 3

work page 2024
[51]

Stablerep: Synthetic images from text-to- image models make strong visual representation learners

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to- image models make strong visual representation learners. NeurIPS, 2023. 2

work page 2023
[52]

Learning vision from models rivals learning vision from data

Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. InCVPR, 2024. 2

work page 2024
[53]

No” zero-shot” without exponential data: Pretrain- ing concept frequency determines multimodal model perfor- mance.NeurIPS, 2024

Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip Torr, Adel Bibi, Samuel Albanie, and Matthias Bethge. No” zero-shot” without exponential data: Pretrain- ing concept frequency determines multimodal model perfor- mance.NeurIPS, 2024. 1

work page 2024
[54]

The big data myth: Using diffusion models for dataset generation to train deep detection models.arXiv, 2023

Roy V oetman, Maya Aghaei, and Klaas Dijkstra. The big data myth: Using diffusion models for dataset generation to train deep detection models.arXiv, 2023. 2

work page 2023
[55]

Highly realistic syn- thetic dataset for pixel-level densepose estimation via diffu- sion model.Pattern Recognition, 2025

Jiaxiao Wen, Tao Chu, and Qiong Liu. Highly realistic syn- thetic dataset for pixel-level densepose estimation via diffu- sion model.Pattern Recognition, 2025. 2

work page 2025
[56]

Qwen-image technical report.arXiv, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv, 2025. 1, 2, 4, 5

work page 2025
[57]

Innovative synthetic data augmentation for dam crack detection, segmentation, and quantification.Struc- tural Health Monitoring, 2023

Jia Xu, Cheng Yuan, Jiaxuan Gu, Jian Liu, Jiong An, and Qingzhao Kong. Innovative synthetic data augmentation for dam crack detection, segmentation, and quantification.Struc- tural Health Monitoring, 2023. 1

work page 2023
[58]

Robust category-level 3d pose estimation from diffusion-enhanced synthetic data

Jiahao Yang, Wufei Ma, Angtian Wang, Xiaoding Yuan, Alan Yuille, and Adam Kortylewski. Robust category-level 3d pose estimation from diffusion-enhanced synthetic data. InCVPR,

work page
[59]

Depth anything v2.NeurIPS, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.NeurIPS, 2024. 3

work page 2024
[60]

Training on thin air: Improve image classification with generated data

Yongchao Zhou, Hshmat Sahak, and Jimmy Ba. Training on thin air: Improve image classification with generated data. arXiv, 2023. 2

work page 2023
[61]

Odgen: Domain- specific object detection data generation with diffusion mod- els.NeurIPS, 2024

Jingyuan Zhu, Shiyu Li, Yuxuan Andy Liu, Jian Yuan, Ping Huang, Jiulong Shan, and Huimin Ma. Odgen: Domain- specific object detection data generation with diffusion mod- els.NeurIPS, 2024. 2 Appendix A. Impact of the V AE V AEAcc pixel Acchighpass Acclowpass - 73.0 51.6 64.5 SD1.5 70.5 37.6 63.1 SDXL 69.7 37.1 62.8 Flux 70.6 37.7 63.4 Table 1. We quantify...

work page 2024