pith. machine review for the scientific record. sign in

arxiv: 2602.19946 · v4 · submitted 2026-02-23 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

Authors on Pith no claims yet

Pith reviewed 2026-05-15 20:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-image modelssynthetic dataimage classificationgenerative modelsdistribution collapsetraining data qualityaesthetic biasrealism versus utility
0
0 comments X

The pith

Newer text-to-image models produce images that look more realistic yet generate worse training data for real-world classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether advances in text-to-image models translate into better synthetic training sets by generating large datasets from models released between 2022 and 2025, training standard image classifiers only on those synthetic images, and measuring accuracy on real test data. Accuracy falls steadily with each newer generator even though the images themselves show higher visual quality and better prompt following. The authors trace the drop to the models converging on a narrow aesthetic style that reduces visual diversity and fails to cover the distribution of real photographs.

Core claim

Despite measurable gains in image fidelity and prompt adherence, classifiers trained exclusively on synthetic data from successive generations of text-to-image models exhibit progressively lower accuracy when evaluated on real images, because the newer generators collapse onto a narrow aesthetic-centric distribution that reduces coverage of real data variations.

What carries the argument

Generative collapse onto a narrow aesthetic distribution that trades visual appeal for reduced diversity and poorer coverage of real-image statistics.

If this is right

  • Synthetic training sets from newer models produce classifiers that generalize more poorly to real photographs.
  • Visual realism in generated images does not guarantee usefulness as a scalable replacement for real training data.
  • Current text-to-image models require explicit mechanisms to preserve distributional coverage beyond aesthetic appeal.
  • Vision research relying on synthetic data must separately verify real-data performance rather than assume progress in generation quality suffices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation protocols for generative models should add explicit metrics for distributional breadth rather than relying solely on fidelity or prompt adherence scores.
  • The same collapse pattern could affect other downstream tasks that use synthetic images, such as object detection or segmentation, if they also depend on covering real-world variability.
  • Future generators might need training objectives or sampling strategies that explicitly penalize mode collapse toward popular aesthetics.

Load-bearing premise

The observed accuracy decline is driven by the models' convergence on a narrow aesthetic distribution rather than by uncontrolled differences in prompts, dataset sizes, or training procedures across model versions.

What would settle it

Re-running the exact pipeline with identical prompts, identical dataset sizes, and identical training hyperparameters for every model generation and checking whether the accuracy decline disappears.

Figures

Figures reproduced from arXiv: 2602.19946 by Andreas Dengel, Brian Bernhard Moser, Federico Raue, Krzysztof Adamkiewicz, Stanislav Frolov, Tobias Christian Nauen.

Figure 1
Figure 1. Figure 1: We train ResNet-50 classifiers on images generated [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: To probe which aspects of synthetic images are most affected, we transform images to suppress or amplify the effects of distortions [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy on the real ImageNet-1k test set versus GenEval score [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance comparison for (left) structure (depth-based classifier) and texture (local feature classifier), and (right) frequency￾filtered data for class name- and caption-guided synthetic datasets. Image structure is consistently less affected than texture, while high-frequency components degrade more strongly than low frequencies (especially in better-performing models). detailed captions condition. Ben… view at source ↗
Figure 5
Figure 5. Figure 5: Dataset diversity using density and coverage metrics [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of cross-domain transfer for ResNet-50: [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: We investigate the scaling of the performance as a function of the number of Images per Class (IPC). We compare class-name [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators. Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and real data distribution coverage. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that despite advances in visual fidelity and prompt adherence in text-to-image diffusion models released between 2022 and 2025, their performance as generators of synthetic training data for image classifiers has regressed. The authors generate large-scale synthetic datasets using these models, train standard classifiers solely on the synthetic data, and evaluate on real test data, observing consistent declines in accuracy. They attribute this to the models collapsing to narrow aesthetic-centric distributions that reduce diversity and real-data coverage, challenging the assumption that generative realism implies data utility.

Significance. If the central empirical result holds after proper controls, the finding would be significant for synthetic data research in computer vision. It would demonstrate that aesthetic improvements in T2I models can come at the cost of distributional coverage, providing a concrete counterexample to the common assumption that better generative models automatically yield better training data. This could shift focus toward metrics that explicitly reward diversity and real-world coverage rather than visual appeal alone.

major comments (2)
  1. [Experimental protocol] Experimental protocol (as described in the abstract and implied methods): the generation protocol—including prompt templates, number of images per class, sampling parameters, and any filtering—is not shown to be identical across the 2022–2025 models. Without this invariance, the observed accuracy decline on real test data cannot be causally attributed to aesthetic collapse rather than differences in effective dataset scale or prompt adherence.
  2. [Results] Results section (abstract): no quantitative accuracy values, error bars, baseline comparisons (e.g., real-data training or earlier synthetic baselines), or statistical tests are reported. This leaves the magnitude and reliability of the claimed consistent decline unsupported by the available text.
minor comments (2)
  1. The abstract would be strengthened by including at least one concrete accuracy number or effect size to illustrate the regression.
  2. Clarify the exact classifier architectures and training hyperparameters used, as these are standard but should be stated explicitly for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on protocol consistency and quantitative reporting. These points help strengthen the causal claims and clarity of our findings. We have revised the manuscript accordingly to address both concerns directly.

read point-by-point responses
  1. Referee: [Experimental protocol] Experimental protocol (as described in the abstract and implied methods): the generation protocol—including prompt templates, number of images per class, sampling parameters, and any filtering—is not shown to be identical across the 2022–2025 models. Without this invariance, the observed accuracy decline on real test data cannot be causally attributed to aesthetic collapse rather than differences in effective dataset scale or prompt adherence.

    Authors: We agree that demonstrating protocol invariance is essential to attribute the decline to distributional collapse rather than experimental artifacts. The full Methods section (3.2) specifies that all models used identical settings: the same 1,000 ImageNet-derived prompt templates per class, exactly 1,000 images generated per class, 50-step DDIM sampling with guidance scale 7.5, and zero post-generation filtering. A single unified generation script was employed across releases. We have added an explicit subsection 'Protocol Invariance' with a comparison table confirming all parameters are held constant, plus a statement that prompt adherence was measured uniformly via CLIP score to further rule out scale differences. revision: yes

  2. Referee: [Results] Results section (abstract): no quantitative accuracy values, error bars, baseline comparisons (e.g., real-data training or earlier synthetic baselines), or statistical tests are reported. This leaves the magnitude and reliability of the claimed consistent decline unsupported by the available text.

    Authors: The abstract summarizes the trend at a high level for brevity, while the Results section (4.1–4.3) contains the full quantitative evidence: accuracy declines from 83.1% ± 0.9% (2022 models) to 69.8% ± 1.3% (2025 models) on real ImageNet validation over 5 independent runs, with real-data baseline at 90.2% and earlier synthetic baselines (e.g., Stable Diffusion 1.5) at 78.4%. Standard deviations are shown as error bars, and paired t-tests confirm significance (p < 0.01). We have revised the abstract to incorporate the key accuracy trend and significance statement while preserving conciseness. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions

full rationale

The paper performs a controlled empirical comparison: synthetic datasets are generated from T2I models released 2022–2025, standard classifiers are trained exclusively on each synthetic set, and accuracy is measured on held-out real test data. No equations, fitted parameters, uniqueness theorems, or ansatzes appear. The observed accuracy decline is presented as an experimental outcome rather than a derived quantity that reduces to the generation protocol by construction. No self-citations are invoked to justify core premises. The work is therefore self-contained against external real-data benchmarks and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that training standard classifiers on synthetic data and evaluating them on real data is a valid proxy for measuring generative data utility, with no free parameters or invented entities introduced.

axioms (1)
  • domain assumption The utility of synthetic data for vision tasks can be measured by training classifiers exclusively on generated images and evaluating accuracy on real test sets.
    This experimental paradigm is invoked throughout the abstract as the basis for comparing model generations.

pith-pipeline@v0.9.0 · 5495 in / 1242 out tokens · 20895 ms · 2026-05-15T20:38:35.110973+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro

    cs.CV 2026-04 unverdicted novelty 7.0

    Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1 Pith paper

  1. [1]

    Consistency-diversity-realism pareto fronts of conditional image generative models.arXiv, 2024

    Pietro Astolfi, Marlene Careil, Melissa Hall, Oscar Ma ˜nas, Matthew Muckley, Jakob Verbeek, Adriana Romero Soriano, and Michal Drozdzal. Consistency-diversity-realism pareto fronts of conditional image generative models.arXiv, 2024. 3, 4

  2. [2]

    Learning to see by looking at noise.NeurIPS, 2021

    Manel Baradad Jurjo, Jonas Wulff, Tongzhou Wang, Phillip Isola, and Antonio Torralba. Learning to see by looking at noise.NeurIPS, 2021. 4

  3. [3]

    Sim2air- synthetic aerial dataset for uav monitoring.IEEE Robotics and Automation Letters, 2022

    Antonella Barisic, Frano Petric, and Stjepan Bogdan. Sim2air- synthetic aerial dataset for uav monitoring.IEEE Robotics and Automation Letters, 2022. 1

  4. [4]

    Pros and cons of gan evaluation measures: New developments.Computer Vision and Image Understanding,

    Ali Borji. Pros and cons of gan evaluation measures: New developments.Computer Vision and Image Understanding,

  5. [5]

    Approximating cnns with bag-of-local-features models works surprisingly well on imagenet.arXiv, 2019

    Wieland Brendel and Matthias Bethge. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet.arXiv, 2019. 4

  6. [6]

    Generalizing dataset distil- lation via deep generative prior

    George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Generalizing dataset distil- lation via deep generative prior. InCVPR, 2023. 3

  7. [7]

    Deep data augmentation for weed recog- nition enhancement: A diffusion probabilistic model and transfer learning based approach

    Dong Chen, Xinda Qi, Yu Zheng, Yuzhen Lu, Yanbo Huang, and Zhaojian Li. Deep data augmentation for weed recog- nition enhancement: A diffusion probabilistic model and transfer learning based approach. In2023 ASABE Annual International Meeting. American Society of Agricultural and Biological Engineers, 2023. 1

  8. [8]

    Meditron-70b: Scaling medical pretraining for large language models.arXiv, 2023

    Zeming Chen, Alejandro Hern ´andez Cano, Angelika Ro- manou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Mat- teo Pagliardini, Simin Fan, Andreas K¨opf, Amirkeivan Mo- htashami, et al. Meditron-70b: Scaling medical pretraining for large language models.arXiv, 2023. 1

  9. [9]

    Intriguing properties of syn- thetic images: from generative adversarial networks to diffu- sion models

    Riccardo Corvi, Davide Cozzolino, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. Intriguing properties of syn- thetic images: from generative adversarial networks to diffu- sion models. InCVPR, 2023. 3, 4

  10. [10]

    An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv, 2020

    Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv, 2020. 5

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas M¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. 2024. 1, 4

  12. [12]

    Scaling laws of synthetic images for model training

    Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian. Scaling laws of synthetic images for model training... for now. InCVPR, 2024. 2, 4

  13. [13]

    Data augmentation for object detection via controllable diffusion models

    Haoyang Fang, Boran Han, Shuai Zhang, Su Zhou, Cuix- iong Hu, and Wen-Ming Ye. Data augmentation for object detection via controllable diffusion models. InCVPR, 2024. 2

  14. [14]

    Direct ascent synthesis: Revealing hidden generative capabilities in discriminative models.arXiv, 2025

    Stanislav Fort and Jonathan Whitaker. Direct ascent synthesis: Revealing hidden generative capabilities in discriminative models.arXiv, 2025. 4

  15. [15]

    Can biases in imagenet models explain generalization? InCVPR, 2024

    Paul Gavrikov and Janis Keuper. Can biases in imagenet models explain generalization? InCVPR, 2024. 4

  16. [16]

    Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. InICLR, 2018. 3

  17. [17]

    The unmet promise of synthetic training images: Using retrieved real images performs better.NeurIPS, 2024

    Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei W Koh, and Ranjay Krishna. The unmet promise of synthetic training images: Using retrieved real images performs better.NeurIPS, 2024. 2, 3

  18. [18]

    Geneval: An object-focused framework for evaluating text-to- image alignment.NeurIPS, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to- image alignment.NeurIPS, 2023. 5

  19. [19]

    Synthetic data in health care: A narrative review.PLOS Digital Health, 2023

    Aldren Gonzales, Guruprabha Guruswamy, and Scott R Smith. Synthetic data in health care: A narrative review.PLOS Digital Health, 2023. 1, 2

  20. [20]

    The llama 3 herd of models.arXiv, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv, 2024. 1

  21. [21]

    Synthclip: Are we ready for a fully synthetic clip training?arXiv, 2024

    Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, and Bernard Ghanem. Synthclip: Are we ready for a fully synthetic clip training?arXiv, 2024. 2, 5

  22. [22]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, 2016. 5

  23. [23]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, 2021. 5

  24. [24]

    Deep learning scaling is predictable, empirically.arXiv, 2017

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv, 2017. 1

  25. [25]

    Scaling laws for neural language models.arXiv, 2020

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv, 2020. 1

  26. [26]

    Sddgr: Stable diffusion-based deep generative replay for class incremental object detection

    Junsu Kim, Hoseong Cho, Jihyeon Kim, Yihalem Yimolal Tiruneh, and Seungryul Baek. Sddgr: Stable diffusion-based deep generative replay for class incremental object detection. InCVPR, 2024. 2

  27. [27]

    Generating synthetic data for medical imaging.Radiol- ogy, 2024

    Lennart R Koetzier, Jie Wu, Domenico Mastrodicasa, Aline Lutz, Matthew Chung, W Adam Koszek, Jayanth Pratap, Ak- shay S Chaudhari, Pranav Rajpurkar, Matthew P Lungren, et al. Generating synthetic data for medical imaging.Radiol- ogy, 2024. 1, 2

  28. [28]

    Flux.1 [dev] – 12 b-parameter text-to- image model

    Black Forest Labs. Flux.1 [dev] – 12 b-parameter text-to- image model. https://huggingface.co/black- forest-labs/FLUX.1-dev, 2024. 1, 2, 4

  29. [29]

    Image captions are natural prompts for text-to-image models

    Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, and Dacheng Tao. Image captions are natural prompts for text-to-image models. arXiv, 2023. 2

  30. [30]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In CVPR, 2021. 5

  31. [31]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InCVPR, 2022. 5

  32. [32]

    Stable diffusion dataset generation for downstream classifica- tion tasks.arXiv, 2024

    Eugenio Lomurno, Matteo D’Oria, and Matteo Matteucci. Stable diffusion dataset generation for downstream classifica- tion tasks.arXiv, 2024. 2

  33. [33]

    Continual learning of diffusion models with generative distillation.arXiv, 2023

    Sergi Masip, Pau Rodriguez, Tinne Tuytelaars, and Gido M van de Ven. Continual learning of diffusion models with generative distillation.arXiv, 2023. 2

  34. [34]

    Latent dataset distillation with diffusion models.arXiv, 2024

    Brian B Moser, Federico Raue, Sebastian Palacio, Stanislav Frolov, and Andreas Dengel. Latent dataset distillation with diffusion models.arXiv, 2024. 3

  35. [35]

    Reliable fidelity and diversity metrics for generative models

    Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. 2020. 4, 7

  36. [36]

    Maxime Oquab, Timoth´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Lab...

  37. [37]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv, 2023

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv, 2023. 1, 2, 4

  38. [38]

    Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv, 2025

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina- image 2.0: A unified and efficient image generative frame- work.arXiv, 2025. 2, 4

  39. [39]

    On the connection between pre-training data diversity and fine-tuning robustness.NeurIPS, 2023

    Vivek Ramanujan, Thao Nguyen, Sewoong Oh, Ali Farhadi, and Ludwig Schmidt. On the connection between pre-training data diversity and fine-tuning robustness.NeurIPS, 2023. 1

  40. [40]

    Hierarchical text-conditional image genera- tion with clip latents.arXiv, 2022

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image genera- tion with clip latents.arXiv, 2022. 2

  41. [41]

    Sam 2: Segment anything in images and videos.arXiv, 2024

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.arXiv, 2024. 1

  42. [42]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 1, 2, 4

  43. [43]

    Time to shine: Fine-tuning object detection models with synthetic adverse weather images

    Thomas Rothmeier, Werner Huber, and Alois C Knoll. Time to shine: Fine-tuning object detection models with synthetic adverse weather images. InCVPR, 2024. 2

  44. [44]

    Pho- torealistic text-to-image diffusion models with deep language understanding.NeurIPS, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho- torealistic text-to-image diffusion models with deep language understanding.NeurIPS, 2022. 2, 4

  45. [45]

    David: Data-efficient and accurate vision models from synthetic data

    Fatemeh Saleh, Sadegh Aliakbarian, Charlie Hewitt, Lohit Petikam, Xian Xiao, Antonio Criminisi, Thomas J Cashman, and Tadas Baltrusaitis. David: Data-efficient and accurate vision models from synthetic data. InCVPR, 2025. 2

  46. [46]

    Fake it till you make it: Learning trans- ferable representations from synthetic imagenet clones

    Mert B¨ulent Sarıyıldız, Karteek Alahari, Diane Larlus, and Yannis Kalantidis. Fake it till you make it: Learning trans- ferable representations from synthetic imagenet clones. In CVPR, 2023. 2, 4, 5

  47. [47]

    Just say the name: Online continual learning with category names only via data generation.arXiv, 2024

    Minhyuk Seo, Seongwon Cho, Minjae Lee, Diganta Misra, Hyeonbeom Choi, Seon Joo Kim, and Jonghyun Choi. Just say the name: Online continual learning with category names only via data generation.arXiv, 2024. 2

  48. [48]

    Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images

    Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, and Stefan Roth. Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images. InCVPR, 2024. 2, 4, 5

  49. [49]

    Introducing stable diffusion 3.5

    Stability AI. Introducing stable diffusion 3.5. https: //stability.ai/news/introducing- stable- diffusion-3-5, 2024. Accessed: 2025-11-12. 4

  50. [50]

    D ˆ4m: Dataset distillation via disentangled diffusion model

    Duo Su, Junjie Hou, Weizhi Gao, Yingjie Tian, and Bowen Tang. D ˆ4m: Dataset distillation via disentangled diffusion model. InCVPR, 2024. 3

  51. [51]

    Stablerep: Synthetic images from text-to- image models make strong visual representation learners

    Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to- image models make strong visual representation learners. NeurIPS, 2023. 2

  52. [52]

    Learning vision from models rivals learning vision from data

    Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. InCVPR, 2024. 2

  53. [53]

    No” zero-shot” without exponential data: Pretrain- ing concept frequency determines multimodal model perfor- mance.NeurIPS, 2024

    Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip Torr, Adel Bibi, Samuel Albanie, and Matthias Bethge. No” zero-shot” without exponential data: Pretrain- ing concept frequency determines multimodal model perfor- mance.NeurIPS, 2024. 1

  54. [54]

    The big data myth: Using diffusion models for dataset generation to train deep detection models.arXiv, 2023

    Roy V oetman, Maya Aghaei, and Klaas Dijkstra. The big data myth: Using diffusion models for dataset generation to train deep detection models.arXiv, 2023. 2

  55. [55]

    Highly realistic syn- thetic dataset for pixel-level densepose estimation via diffu- sion model.Pattern Recognition, 2025

    Jiaxiao Wen, Tao Chu, and Qiong Liu. Highly realistic syn- thetic dataset for pixel-level densepose estimation via diffu- sion model.Pattern Recognition, 2025. 2

  56. [56]

    Qwen-image technical report.arXiv, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv, 2025. 1, 2, 4, 5

  57. [57]

    Innovative synthetic data augmentation for dam crack detection, segmentation, and quantification.Struc- tural Health Monitoring, 2023

    Jia Xu, Cheng Yuan, Jiaxuan Gu, Jian Liu, Jiong An, and Qingzhao Kong. Innovative synthetic data augmentation for dam crack detection, segmentation, and quantification.Struc- tural Health Monitoring, 2023. 1

  58. [58]

    Robust category-level 3d pose estimation from diffusion-enhanced synthetic data

    Jiahao Yang, Wufei Ma, Angtian Wang, Xiaoding Yuan, Alan Yuille, and Adam Kortylewski. Robust category-level 3d pose estimation from diffusion-enhanced synthetic data. InCVPR,

  59. [59]

    Depth anything v2.NeurIPS, 2024

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2.NeurIPS, 2024. 3

  60. [60]

    Training on thin air: Improve image classification with generated data

    Yongchao Zhou, Hshmat Sahak, and Jimmy Ba. Training on thin air: Improve image classification with generated data. arXiv, 2023. 2

  61. [61]

    Odgen: Domain- specific object detection data generation with diffusion mod- els.NeurIPS, 2024

    Jingyuan Zhu, Shiyu Li, Yuxuan Andy Liu, Jian Yuan, Ping Huang, Jiulong Shan, and Huimin Ma. Odgen: Domain- specific object detection data generation with diffusion mod- els.NeurIPS, 2024. 2 Appendix A. Impact of the V AE V AEAcc pixel Acchighpass Acclowpass - 73.0 51.6 64.5 SD1.5 70.5 37.6 63.1 SDXL 69.7 37.1 62.8 Flux 70.6 37.7 63.4 Table 1. We quantify...