Self-Improving Diffusion Models with Synthetic Data

Ahmed Imtiaz Humayun; John Collomosse; Richard Baraniuk; Shruti Agarwal; Sina Alemohammad

arxiv: 2408.16333 · v1 · pith:GRTXIRD7new · submitted 2024-08-29 · 💻 cs.LG · cs.AI

keywords: data, synthetic, model, models, diffusion, generative, sims, training

Abstract

The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends avoiding synthetic data for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
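
The negative-guidance idea in the abstract can be illustrated with a toy sketch. The abstract does not state the guidance formula, so the extrapolation rule below (the base score minus a weighted difference between an auxiliary synthetic-data score and the base score) is an assumption, and every name in it (`gaussian_score`, `negatively_guided_score`, `w`) is hypothetical. One-dimensional Gaussians stand in for the score networks of real diffusion models.

```python
def gaussian_score(x, mean, std=1.0):
    """Score (derivative of the log-density) of N(mean, std^2) at x."""
    return -(x - mean) / (std ** 2)


def negatively_guided_score(x, base_score, aux_score, w):
    """Push the base score away from the auxiliary (synthetic-data) score.

    Assumed form, not taken from the abstract:
        s_guided = s_base - w * (s_aux - s_base)
    """
    s_base = base_score(x)
    s_aux = aux_score(x)
    return s_base - w * (s_aux - s_base)


# Toy setup (all values are illustrative assumptions):
#   real data      ~ N(0.0, 1)
#   base model     ~ N(0.5, 1)  (slightly off the real distribution)
#   auxiliary model ~ N(1.0, 1) (trained on base samples, drifted further)
def base(x):
    return gaussian_score(x, mean=0.5)


def aux(x):
    return gaussian_score(x, mean=1.0)


if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.5):
        guided = negatively_guided_score(x, base, aux, w=1.0)
        print(x, guided, gaussian_score(x, mean=0.0))
```

In this toy setup, guidance weight `w = 1.0` makes the guided score coincide with the score of the assumed real distribution, because the auxiliary model's drift is exactly twice the base model's. Real score networks would not cancel this neatly; the sketch only shows the direction of the correction.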

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules
cs.LG 2026-06 unverdicted novelty 7.0

ActFlow expands the generable set of pre-trained flow models for out-of-distribution molecular and sequence design via active synthetic data generation and verifier feedback, with new statistical guarantees.

ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition
cs.CV 2026-06 unverdicted novelty 5.0

ReSAGE-PAR adapts diffusion models with LoRA, scores generated images via vision-language prompts, and applies Bayesian classification to produce pseudo-labels, yielding up to 8.7% gains when used to expand PAR datasets.

Enhancing Malware Detection with Generative AI: Using Variational Autoencoders to Boost Machine Learning Classifiers' Performance
cs.CR 2026-05 unverdicted novelty 3.0

VAEs generate synthetic malware to augment datasets, yielding reported gains in accuracy, precision, recall, and F1 for three ML classifiers.