
arxiv: 2408.16333 · v1 · pith:GRTXIRD7new · submitted 2024-08-29 · 💻 cs.LG · cs.AI

Self-Improving Diffusion Models with Synthetic Data

classification 💻 cs.LG cs.AI
keywords data, synthetic, model, models, diffusion, generative, sims, training

The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
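The negative-guidance idea in the abstract can be illustrated with a toy 1-D sketch. The abstract does not give the guidance formula, so the linear extrapolation below, the function names, the guidance weight `w`, and the Gaussian means are all assumptions chosen for illustration, not the paper's stated method. The sketch assumes a base model whose distribution has drifted slightly from the real data (mean 0.2 instead of 0.0) and an auxiliary model, trained on the base model's own samples, that has drifted further in the same direction (mean 0.4).

```python
def gaussian_score(x, mean, var=1.0):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, var)."""
    return -(x - mean) / var


def negative_guidance(score_base, score_aux, w):
    """Push the base score away from the auxiliary (synthetic-data) score.

    Assumed form: s_base - w * (s_aux - s_base). With w = 0 this is
    just the base score; larger w extrapolates further from the
    synthetic-data direction.
    """
    return score_base - w * (score_aux - score_base)


def guided_score(x, w, base_mean=0.2, aux_mean=0.4):
    """Guided score for the toy setup (means are hypothetical values)."""
    s_base = gaussian_score(x, base_mean)
    s_aux = gaussian_score(x, aux_mean)
    return negative_guidance(s_base, s_aux, w)
```

In this unit-variance toy setting the guided score is that of a Gaussian with mean `base_mean + w * (base_mean - aux_mean)`, so `w = 1` moves the mode from 0.2 back to 0.0. Real diffusion models apply such a combination to learned, noise-level-dependent score networks, where the effect is not this exact.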



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules

    cs.LG 2026-06 unverdicted novelty 7.0

    ActFlow expands the generable set of pre-trained flow models for out-of-distribution molecular and sequence design via active synthetic data generation and verifier feedback, with new statistical guarantees.

  2. ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition

    cs.CV 2026-06 unverdicted novelty 5.0

    ReSAGE-PAR adapts diffusion models with LoRA, scores generated images via vision-language prompts, and applies Bayesian classification to produce pseudo-labels, yielding up to 8.7% gains when used to expand PAR datasets.

  3. Enhancing Malware Detection with Generative AI: Using Variational Autoencoders to Boost Machine Learning Classifiers' Performance

    cs.CR 2026-05 unverdicted novelty 3.0

    VAEs generate synthetic malware to augment datasets, yielding reported gains in accuracy, precision, recall, and F1 for three ML classifiers.