arxiv: 2202.00512 · v2 · submitted 2022-02-01 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 2 theorem links · Lean Theorem

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans, Jonathan Ho


Pith reviewed 2026-05-11 09:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: diffusion models · progressive distillation · fast sampling · image generation · generative modeling · FID score · CIFAR-10 · few-step sampling


The pith

Progressive distillation reduces diffusion model sampling from thousands of steps to 4 while keeping high image quality on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to remove the main practical drawback of diffusion models—the need for hundreds or thousands of model evaluations to produce one sample—by combining two changes. First, it introduces new parameterizations that make the models more stable when run with very few steps. Second, it shows a distillation process that takes a trained many-step sampler and trains a new model to match its outputs using half as many steps, then repeats the halving until only 4 steps remain. A reader would care because the resulting models still reach low FID scores, for example 3.0 on CIFAR-10, and the entire sequence of distillations costs no more training time than the original model.

Core claim

Starting from a deterministic diffusion sampler that uses up to 8192 steps, the authors apply a repeated distillation procedure in which each new model is trained to reproduce the previous sampler's output using half the number of steps. Together with parameterizations that increase stability at low step counts, this yields usable models that generate samples in only 4 steps on CIFAR-10, ImageNet, and LSUN while preserving most of the original perceptual quality.
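The halving arithmetic behind this claim can be checked directly. The sketch below is a minimal illustration, not the authors' code: `halving_schedule` is a hypothetical helper name, and it assumes each distillation round exactly halves the sampler's step count, as the abstract describes.

```python
def halving_schedule(start_steps: int, final_steps: int) -> list:
    """Return the step counts visited when a sampler is halved repeatedly.

    Hypothetical helper for illustration; assumes every distillation
    round exactly halves the number of sampling steps.
    """
    if final_steps < 1 or start_steps < final_steps:
        raise ValueError("need start_steps >= final_steps >= 1")
    schedule = [start_steps]
    # Keep halving until we reach (or overshoot) the requested final count.
    while schedule[-1] > final_steps:
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != final_steps:
        raise ValueError("final_steps is not reachable by exact halving")
    return schedule


schedule = halving_schedule(8192, 4)
rounds = len(schedule) - 1  # number of distillation rounds
print(schedule, rounds)
```

Under this assumption, going from 8192 to 4 steps takes 11 halving rounds, since 8192 / 4 = 2^11.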

What carries the argument

The progressive distillation procedure, which trains a student diffusion model to match a teacher sampler's multi-step trajectory using half the steps, combined with re-parameterizations that stabilize few-step sampling.
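One way to make "match a teacher using half the steps" concrete is to solve for the clean-sample prediction that makes a single deterministic (DDIM-style) student step land exactly where two teacher steps land. The scalar sketch below illustrates that idea under stated assumptions and is not the authors' implementation: the cosine schedule in `alpha_sigma`, the `toy_teacher` denoiser, and all function names are hypothetical choices made for this example.

```python
import math


def alpha_sigma(t):
    """Assumed cosine schedule with alpha_t**2 + sigma_t**2 == 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(z_t, x_hat, t, s):
    """Deterministic update from time t to an earlier time s, given a prediction x_hat."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    eps_hat = (z_t - a_t * x_hat) / s_t  # noise implied by the x prediction
    return a_s * x_hat + s_s * eps_hat


def toy_teacher(z_t, t):
    """Hypothetical stand-in for a trained denoiser that predicts x from z_t."""
    a_t, _ = alpha_sigma(t)
    return math.tanh(a_t * z_t)


def distillation_target(z_t, t, num_student_steps, teacher=toy_teacher):
    """Return (x_target, z_end): the x prediction for which one student step
    from t to t - 1/N reproduces two teacher steps of size 0.5/N."""
    t_mid = t - 0.5 / num_student_steps
    t_end = t - 1.0 / num_student_steps
    z_mid = ddim_step(z_t, teacher(z_t, t), t, t_mid)
    z_end = ddim_step(z_mid, teacher(z_mid, t_mid), t_mid, t_end)
    a_t, s_t = alpha_sigma(t)
    a_e, s_e = alpha_sigma(t_end)
    ratio = s_e / s_t
    # Invert the one-step update z_end = a_e * x + ratio * (z_t - a_t * x) for x.
    x_target = (z_end - ratio * z_t) / (a_e - ratio * a_t)
    return x_target, z_end


z_t, t, n_student = 0.7, 0.5, 4
x_target, z_end = distillation_target(z_t, t, n_student)
student_z = ddim_step(z_t, x_target, t, t - 1.0 / n_student)
print(abs(student_z - z_end))
```

In an actual training loop the student network would be regressed toward `x_target` rather than being handed it; the final two lines only confirm that a student predicting this target would reproduce the teacher's two-step result in one step.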

Load-bearing premise

That successive rounds of distillation do not accumulate enough error to degrade image quality and that the new parameterizations keep sampling stable when the step count is reduced across different image datasets.

What would settle it

A direct comparison on CIFAR-10 or ImageNet in which the 4-step distilled model produces visibly worse samples or a substantially higher FID than the original 8192-step sampler, or in which further distillation rounds cause a sudden quality collapse.

Original abstract

Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague


This paper shows how to distill diffusion models from thousands of steps down to four while keeping FID scores competitive on CIFAR-10 and similar benchmarks. The authors start with a high-step deterministic sampler and repeatedly train a new model to match it but with half the steps, applying the process several times until they reach four steps. Along the way they introduce new parameterizations that improve stability for low-step sampling. The headline result is an FID of 3.0 on CIFAR-10 at four steps, with comparable preservation of quality on ImageNet and LSUN. The full sequence of distillations takes no more wall-clock time than training the original model once.

What stands out is the iterative halving approach itself. Earlier distillation work tended to target a single large reduction in steps; here the progressive schedule lets them maintain quality across multiple halvings without a sharp drop at any stage. The efficiency claim is also useful because it removes the usual worry that extra training stages will dominate the compute budget.

The soft spots are limited but worth noting. The abstract gives clean headline numbers yet leaves out error bars, ablations on the new parameterizations, and checks on how much the outcome depends on the exact distillation hyperparameters or random seeds. If the full paper supplies those controls and shows the procedure is not overly brittle, the results will be more convincing. The work also stays within standard image benchmarks, so its behavior on other data types remains open.

This is aimed at anyone who trains or deploys diffusion models and needs faster sampling without retraining from scratch. A practitioner looking for a drop-in speed-up recipe will find the method straightforward to try once the details are in hand. It deserves a serious referee because the empirical gains are large enough and the procedure is simple enough that expert feedback on reproducibility and edge cases would be valuable to the community.

Referee Report

2 major / 2 minor

Summary. The paper claims that new parameterizations of diffusion models increase stability for few-step sampling, and that a progressive distillation procedure can iteratively halve the number of sampling steps (from up to 8192 down to 4) while preserving perceptual quality on image generation tasks. It reports concrete results such as an FID of 3.0 on CIFAR-10 with 4 steps, along with results on ImageNet and LSUN, and states that the full distillation procedure takes no more time than training the original model.

Significance. If the empirical results hold, the work is significant for addressing the slow sampling drawback of diffusion models, enabling fast generation competitive with alternatives like GANs while retaining quality and density estimation advantages. The progressive distillation approach combined with the new parameterizations provides a practical, efficient solution, and the manuscript supplies falsifiable benchmark outcomes across multiple standard datasets.

major comments (2)

[§5] §5 (Experimental results): The central claim that progressive distillation preserves perceptual quality down to 4 steps (e.g., CIFAR-10 FID of 3.0) is load-bearing, yet the reported benchmark numbers lack error bars, multiple random seed statistics, or ablations isolating the new parameterizations from the distillation procedure; this directly affects assessment of robustness against error accumulation.
[§3.2] §3.2 (New parameterizations): The claim that the introduced parameterizations reliably stabilize few-step sampling is central to enabling the progressive procedure, but the section provides no analysis or equations demonstrating their effect on sampling dynamics or variance reduction, relying only on end-to-end empirical outcomes.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly state the exact sequence of distillation steps applied and the base model architectures used for each benchmark.
[§4] Notation for the teacher-student alignment in the distillation loss could be clarified with an additional equation showing how the student is trained to match the teacher's multi-step trajectory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where we will strengthen the presentation of results and analysis.

point-by-point responses

Referee: [§5] §5 (Experimental results): The central claim that progressive distillation preserves perceptual quality down to 4 steps (e.g., CIFAR-10 FID of 3.0) is load-bearing, yet the reported benchmark numbers lack error bars, multiple random seed statistics, or ablations isolating the new parameterizations from the distillation procedure; this directly affects assessment of robustness against error accumulation.

Authors: We acknowledge that error bars, multi-seed statistics, and explicit ablations would strengthen the assessment of robustness. The manuscript reports results from single runs with fixed seeds for reproducibility, but demonstrates consistency by applying the same progressive procedure across CIFAR-10, ImageNet, and LSUN while preserving quality from 8192 steps down to 4. The load-bearing claim is further supported by the fact that each halving step maintains perceptual quality without retraining from scratch. To address the concern directly, we will revise §5 to include error bars from additional runs (where feasible given compute), a note on seed consistency, and a targeted ablation isolating the new parameterizations' contribution from the distillation steps.
revision: yes
Referee: [§3.2] §3.2 (New parameterizations): The claim that the introduced parameterizations reliably stabilize few-step sampling is central to enabling the progressive procedure, but the section provides no analysis or equations demonstrating their effect on sampling dynamics or variance reduction, relying only on end-to-end empirical outcomes.

Authors: Section 3.2 introduces the new parameterizations (including the velocity parameterization) as direct modifications to the standard diffusion model output that reduce sensitivity to accumulated errors in few-step regimes. The section provides the explicit functional forms and motivates them via their effect on the reverse-process update. While the primary validation is through the end-to-end progressive distillation results, we agree that additional equations would clarify the variance-reduction mechanism. We will revise §3.2 to include the sampling update equations under these parameterizations and a short derivation showing how they lower the effective variance of the predicted clean image relative to noise prediction, thereby enabling stable halving.
revision: yes
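A small numerical sketch can show why the choice of prediction target matters near pure noise, which is the point at issue in this exchange. It assumes a variance-preserving schedule with alpha = cos(phi) and sigma = sin(phi), and defines the velocity target as v = alpha * eps - sigma * x. The function names are hypothetical and the numbers are illustrative only; they are not results from the paper.

```python
import math


def coeffs(phi):
    """Assumed variance-preserving coefficients: alpha**2 + sigma**2 == 1."""
    return math.cos(phi), math.sin(phi)


def diffuse(x, eps, phi):
    """Noisy latent z = alpha * x + sigma * eps."""
    alpha, sigma = coeffs(phi)
    return alpha * x + sigma * eps


def velocity(x, eps, phi):
    """Velocity target v = alpha * eps - sigma * x."""
    alpha, sigma = coeffs(phi)
    return alpha * eps - sigma * x


def x_from_v(z, v, phi):
    """Recover x from a velocity prediction: x = alpha * z - sigma * v."""
    alpha, sigma = coeffs(phi)
    return alpha * z - sigma * v


def x_from_eps(z, eps, phi):
    """Recover x from a noise prediction: x = (z - sigma * eps) / alpha."""
    alpha, sigma = coeffs(phi)
    return (z - sigma * eps) / alpha


x, eps = 0.3, -1.2
phi = 0.499 * math.pi  # very close to pure noise, so alpha is tiny
z = diffuse(x, eps, phi)
v = velocity(x, eps, phi)

delta = 1e-3  # the same small prediction error applied to each target
err_v = abs(x_from_v(z, v + delta, phi) - x)
err_eps = abs(x_from_eps(z, eps + delta, phi) - x)
print(err_v, err_eps)
```

The same perturbation is scaled by sigma (at most 1) when x is recovered from v, but by sigma / alpha when it is recovered from a noise prediction, and that ratio grows without bound as alpha approaches zero. This is only a sensitivity argument for a single conversion; it does not by itself establish the stability of a full few-step sampler.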

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical training procedure (progressive distillation) and new parameterizations for diffusion models, with all load-bearing claims consisting of experimental outcomes measured on held-out benchmarks such as CIFAR-10 FID scores. No equations, predictions, or first-principles derivations reduce outputs to inputs by construction, and no self-citations serve as the sole justification for the central method or results. The procedure is self-contained against external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard diffusion model assumptions plus the empirical claim that distillation can be applied progressively without quality collapse. It introduces no new physical entities and no unstated mathematical axioms beyond those typical of ML training.

free parameters (1)

distillation hyperparameters
Choices such as learning rate and step-halving schedule are tuned to achieve reported results.

axioms (1)

domain assumption: Diffusion models admit parameterizations that remain stable under few-step sampling
Invoked as the first contribution enabling distillation.

pith-pipeline@v0.9.0 · 10101 in / 1056 out tokens · 78185 ms · 2026-05-11T09:31:49.097576+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps"

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
cs.CV 2026-05 unverdicted novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
Query Lower Bounds for Diffusion Sampling
cs.LG 2026-04 unverdicted novelty 8.0

Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.
Training-Free Generative Sampling via Moment-Matched Score Smoothing
stat.ML 2026-05 unverdicted novelty 7.0

MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation
cs.CV 2026-05 unverdicted novelty 7.0

A hypernetwork maps style motion embeddings to LoRA updates that stylize text-driven motion diffusion models with improved generalization to unseen styles via contrastive structuring of the style space.
One-Step Generative Modeling via Wasserstein Gradient Flows
cs.LG 2026-05 conditional novelty 7.0

W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
Muninn: Your Trajectory Diffusion Model But Faster
cs.RO 2026-05 unverdicted novelty 7.0

Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
cs.HC 2026-05 unverdicted novelty 7.0

HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
cs.CV 2026-05 unverdicted novelty 7.0

LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
PODiff: Latent Diffusion in Proper Orthogonal Decomposition Space for Scientific Super-Resolution
cs.LG 2026-05 unverdicted novelty 7.0

PODiff performs conditional diffusion in a fixed, variance-ordered POD latent space to enable efficient probabilistic super-resolution of high-dimensional scientific fields with lower memory and better-calibrated unce...
Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
cs.CV 2026-05 unverdicted novelty 7.0

SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
cs.CV 2026-04 unverdicted novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
cs.CV 2026-04 unverdicted novelty 7.0

Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement
cs.CV 2026-04 unverdicted novelty 7.0

A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse
cs.CV 2026-04 unverdicted novelty 7.0

Chorus accelerates video DiT serving up to 45% via inter-request caching reuse in a three-stage denoising strategy with token-guided attention amplification.
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
cs.CV 2026-04 conditional novelty 7.0

1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
cs.CV 2026-03 unverdicted novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
Training Agents Inside of Scalable World Models
cs.AI 2025-09 conditional novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
cs.AI 2025-07 unverdicted novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
One Step Diffusion via Shortcut Models
cs.LG 2024-10 conditional novelty 7.0

Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
cs.CV 2023-10 unverdicted novelty 7.0

Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
cs.LG 2022-08 unverdicted novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
cs.LG 2026-05 conditional novelty 6.0

ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.
Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies
physics.ao-ph 2026-05 unverdicted novelty 6.0

A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.
FlashMol: High-Quality Molecule Generation in as Few as Four Steps
cs.LG 2026-05 unverdicted novelty 6.0

FlashMol produces chemically valid 3D molecules in 4 steps via distribution matching distillation with respaced timesteps and Jensen-Shannon regularization, matching or exceeding 1000-step teacher performance on QM9 a...
MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
cs.CV 2026-04 unverdicted novelty 6.0

MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

WFM achieves near-diffusion quality for all four BraTS MRI modalities with one 82M model in 1-2 steps by flowing from the mean of conditioning modalities in wavelet space, running 250-1000x faster.
Allo{SR}$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows
cs.CV 2026-04 unverdicted novelty 6.0

Allo{SR}^2 rectifies one-step super-resolution trajectories with allomorphic generative flows via SNR initialization, velocity supervision, and self-adversarial matching to deliver state-of-the-art fidelity and realism.
Fisher Decorator: Refining Flow Policy via a Local Transport Map
cs.LG 2026-04 unverdicted novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
cs.CV 2026-04 unverdicted novelty 6.0

CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.
Self-Adversarial One Step Generation via Condition Shifting
cs.CV 2026-04 unverdicted novelty 6.0

APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
ELT: Elastic Looped Transformers for Visual Generation
cs.CV 2026-04 unverdicted novelty 6.0

Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning
cs.LG 2026-04 unverdicted novelty 6.0

JFDL allows pre-trained Consistency Models to perform guided image generation post-hoc by aligning flow distributions, reducing FID scores on CIFAR-10 and ImageNet without needing a teacher model.
Diffusion-Based Point-Cloud Generation of Heavy-Ion Events
hep-ph 2026-04 unverdicted novelty 6.0

A two-stage score-driven diffusion model with Point-Edge Transformer generates realistic high-multiplicity heavy-ion events as point clouds.
MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
cs.CV 2026-03 unverdicted novelty 6.0

MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
cs.CV 2024-08 unverdicted novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV 2023-11 conditional novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
cs.CV 2023-07 conditional novelty 6.0

SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
physics.ins-det 2026-05 unverdicted novelty 5.0

CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
cs.SD 2026-05 unverdicted novelty 5.0

A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...
Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction
cs.CV 2026-04 unverdicted novelty 5.0

Training-inference input alignment outweighs framework choice for longitudinal retinal image prediction, with deterministic regression matching complex models when acquisition variability dominates disease progression.
ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression
cs.CV 2026-04 unverdicted novelty 5.0

ADP-DiT is a text-conditioned diffusion transformer for synthesizing longitudinal Alzheimer's MRI scans, reporting SSIM 0.8739 and PSNR 29.32 dB with improvements over a DiT baseline.
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
cs.LG 2026-04 unverdicted novelty 5.0

SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
Elucidating Representation Degradation Problem in Diffusion Model Training
cs.LG 2026-05 unverdicted novelty 4.0

Diffusion models suffer representation degradation at high noise due to recoverability mismatch; ERD mitigates this by dynamic optimization reallocation, accelerating convergence across backbones.
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
cs.GR 2026-04 unverdicted novelty 4.0

Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction
cs.CV 2026-04 unverdicted novelty 4.0

An end-to-end framework redacts PHI from medical images via CRNN detection and restores them with Stable Diffusion inpainting to enable privacy-preserving data sharing without losing downstream utility.
Enhancing the accuracy of under-resolved numerical simulations of atmospheric flows with super resolution
physics.flu-dyn 2026-04 unverdicted novelty 4.0

A multi-scale CNN super-resolution model outperforms baseline CNN, attention CNN, and diffusion-based approaches in reconstructing fine-scale features from under-resolved atmospheric flow simulations on standard benchmarks.
Discrete Meanflow Training Curriculum
cs.LG 2026-04 unverdicted novelty 4.0

A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
Flow Matching Guide and Code
cs.LG 2024-12 unverdicted novelty 2.0

Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
cs.CV 2024-02 unverdicted novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 55 Pith papers · 2 internal anchors

[1]

Structured denoising diffusion models in discrete state-spaces

Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. CoRR, abs/2107.03006,

work page arXiv
[2]

Learning gradient fields for shape generation

Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. Learning gradient fields for shape generation. arXiv preprint arXiv:2008.06520,

work page arXiv 2008
[3]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233,

work page internal anchor Pith review arXiv
[4]

FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models

Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367,

work page Pith review arXiv
[5]

Cascaded diffusion models for high fidelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282,

work page arXiv
[6]

Argmax flows and multinomial diffusion: Learning categorical distributions, 2021

Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint arXiv:2102.05379,

work page arXiv
[7]

Gotta go fast when generating data with score-based models

Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080,

work page arXiv
[8]

Variational diffusion models

Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630,

work page arXiv
[9]

On fast sampling of diffusion probabilistic models

Published as a conference paper at ICLR 2022.

Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. arXiv preprint arXiv:2106.00132,

work page arXiv 2022
[10]

Bilateral denoising diffusion models

Max WY Lam, Jun Wang, Rongjie Huang, Dan Su, and Dong Yu. Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514,

work page arXiv
[11]

Srdiff: Single image super-resolution with diffusion probabilistic models

Haoying Li, Yifan Yang, Meng Chang, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. arXiv preprint arXiv:2104.14951,

work page arXiv
[12]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Knowledge distillation in iterative generative models for improved sampling speed

Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388,

work page arXiv
[14]

Non gaussian denoising diffusion models

Eliya Nachmani, Robin San Roman, and Lior Wolf. Non gaussian denoising diffusion models. arXiv preprint arXiv:2106.07582,

work page arXiv
[15]

Fast generation for convolutional autoregressive models

Prajit Ramachandran, Tom Le Paine, Pooya Khorrami, Mohammad Babaeizadeh, Shiyu Chang, Yang Zhang, Mark A Hasegawa-Johnson, Roy H Campbell, and Thomas S Huang. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001,

work page arXiv
[16]

Image super-resolution via iterative refinement

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636,

work page arXiv
[17]

Noise estimation for generative diffusion models

Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. arXiv preprint arXiv:2104.02600,

work page arXiv
[18]

Maximum likelihood training of score-based diffusion models

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. arXiv e-prints, pp. arXiv–2101, 2021b.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference ...

work page arXiv 2004
[19]

Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit

Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019a.

Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference ...

work page arXiv 1905
[20]

Learning to efficiently sample from diffusion probabilistic models

Daniel Watson, Jonathan Ho, Mohammad Norouzi, and William Chan. Learning to efficiently sample from diffusion probabilistic models. arXiv preprint arXiv:2106.03802,

work page arXiv
[21]

A PROBABILITY FLOW ODE IN TERMS OF LOG-SNR

Song et al. (2021c) formulate the forward diffusion process in terms of an SDE of the form $dz = f(z,t)\,dt + g(t)\,dW$ (10) and show that samples from this diffusion process can be generated by solving the associated probability flow ODE: $dz = [f(z,t) - \frac{1}{2} g^2(t) \nabla_z$ ...

work page 2022
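The forward SDE in Eq. (10) above can be simulated numerically. What follows is a minimal sketch rather than the authors' code: it applies an Euler-Maruyama discretization to dz = f(z,t)dt + g(t)dW. The particular drift and diffusion functions (a variance-preserving process with a constant rate `BETA`) and all function names are assumptions, chosen only so that the result can be compared with a known closed-form variance.

```python
import math
import random


def simulate_forward_sde(z0, f, g, rng, t_end=1.0, n_steps=100):
    """Euler-Maruyama discretization of dz = f(z, t) dt + g(t) dW."""
    dt = t_end / n_steps
    z, t = z0, 0.0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)  # standard normal increment
        z = z + f(z, t) * dt + g(t) * math.sqrt(dt) * noise
        t += dt
    return z


BETA = 1.0  # assumed constant noise rate, for illustration only


def drift(z, t):
    # Assumed variance-preserving drift f(z, t) = -0.5 * BETA * z.
    return -0.5 * BETA * z


def diffusion(t):
    # Assumed diffusion coefficient g(t) = sqrt(BETA).
    return math.sqrt(BETA)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [simulate_forward_sde(0.0, drift, diffusion, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For this assumed process, starting from z = 0, the sample variance at t = 1 should land close to 1 - exp(-BETA), which is one way to sanity-check the discretization.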
[22]

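is given by $z_s = \frac{\sigma_s}{\sigma_t}[z_t - \alpha_t \hat{x}_\theta(z_t)] + \alpha_s \hat{x}_\theta(z_t)$ (20) for $s < t$. Taking the derivative of this expression with respect to $\lambda_s$, assuming again a variance preserving diffusion process, and using $\frac{d\alpha_\lambda}{d\lambda} = \frac{1}{2}\alpha_\lambda\sigma_\lambda^2$ and $\frac{d\sigma_\lambda}{d\lambda} = -\frac{1}{2}\sigma_\lambda\alpha_\lambda^2$, gives $\frac{dz_{\lambda_s}}{d\lambda_s} = \frac{d\sigma_{\lambda_s}}{d\lambda_s}\frac{1}{\sigma_t}[z_t - \alpha_t \hat{x}_\theta(z_t)] + \frac{d\alpha_{\lambda_s}}{d\lambda_s}\hat{x}_\theta(z_t)$ (21) $= -\frac{1}{2}\alpha_s^2\frac{\sigma_s}{\sigma_t}[z_t - \alpha_t \hat{x}_\theta(z_t)] + \frac{1}{2}\alpha_s\sigma_s^2$...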

work page 2022
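The update in Eq. (20) and the two derivative identities used in Eq. (21) can be checked numerically. The sketch below is illustrative only. It assumes the variance-preserving relation alpha^2 = sigmoid(lambda), sigma^2 = sigmoid(-lambda), which follows from alpha^2 + sigma^2 = 1 with lambda = log(alpha^2 / sigma^2). The function names are hypothetical, and scalars stand in for image tensors.

```python
import math


def alpha_sigma(lam):
    """Variance-preserving schedule as a function of log-SNR lam."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-lam)))  # alpha^2 = sigmoid(lam)
    sigma = math.sqrt(1.0 / (1.0 + math.exp(lam)))   # sigma^2 = sigmoid(-lam)
    return alpha, sigma


def ddim_step(z_t, x_hat, lam_t, lam_s):
    """Eq. (20): z_s = (sigma_s / sigma_t) * (z_t - alpha_t * x_hat) + alpha_s * x_hat."""
    alpha_t, sigma_t = alpha_sigma(lam_t)
    alpha_s, sigma_s = alpha_sigma(lam_s)
    return (sigma_s / sigma_t) * (z_t - alpha_t * x_hat) + alpha_s * x_hat


def central_difference(fn, lam, h=1e-6):
    """Numerical derivative of fn at lam."""
    return (fn(lam + h) - fn(lam - h)) / (2.0 * h)


# Check d(alpha)/d(lam) = 0.5 * alpha * sigma^2 and d(sigma)/d(lam) = -0.5 * sigma * alpha^2.
lam = 0.7
alpha, sigma = alpha_sigma(lam)
d_alpha = central_difference(lambda l: alpha_sigma(l)[0], lam)
d_sigma = central_difference(lambda l: alpha_sigma(l)[1], lam)

# With a perfect denoiser (x_hat equal to the true x), one step from lam_t to a
# less noisy lam_s keeps the same noise sample eps at the new noise level.
x_true, eps = 1.5, -0.4
lam_t, lam_s = -1.0, 0.5
alpha_t, sigma_t = alpha_sigma(lam_t)
alpha_s, sigma_s = alpha_sigma(lam_s)
z_t = alpha_t * x_true + sigma_t * eps
z_s = ddim_step(z_t, x_true, lam_t, lam_s)
```

Because s < t corresponds to less noise, `lam_s` is chosen larger than `lam_t` in the example.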
[23]

E SETTINGS USED IN EXPERIMENTS: Our model architectures closely follow those described by Dhariwal & Nichol (2021)

Figure 5: Visualization of reparameterizing the diffusion process in terms of φ and v_φ.

E SETTINGS USED IN EXPERIMENTS

Our model architectures closely follow those described by Dhariwal & Nichol (2021). For 64 × 64 ImageNet we use their model exactly, with 192 channels at the highest resolution. All other models are slight variations with different hyperp...

work page 2021
[24]

We use single-headed attention, and only apply this at the 16 × 16 and 8 × 8 resolutions

At each resolution we apply 3 residual blocks, as described by Dhariwal & Nichol (2021). We use single-headed attention, and only apply this at the 16 × 16 and 8 × 8 resolutions. We use dropout of 0.2 when training the original model. No dropout is used during distillation. For LSUN we use a model similar to that for ImageNet, but with a reduced number ...

work page 2021
[25]

We clip the norm of gradients to a global norm of 1 before calculating parameter updates

with a constant of 0.001. We clip the norm of gradients to a global norm of 1 before calculating parameter updates. For CIFAR-10 we train for 800k parameter updates, for ImageNet we use 550k updates, and for LSUN we use 400k updates. During distillation we train for 50k updates per iteration, except for the distillation to 2 and 1 sampling steps, for whic...

work page 2022
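The gradient clipping described above ("a global norm of 1 before calculating parameter updates") can be sketched as follows. This is a generic illustration of clipping by global norm, not the authors' training code. The function name and the toy gradients are hypothetical, and deep learning libraries differ in small details such as an added epsilon in the denominator.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm > max_norm:
        scale = max_norm / global_norm
    else:
        scale = 1.0  # already within the limit, leave gradients untouched
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, global_norm


# Toy gradients for two parameter tensors; their joint norm is 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# Gradients that are already small pass through unchanged.
small, _ = clip_by_global_norm([[0.1, 0.2]], max_norm=1.0)
```

All tensors share a single scale factor, so the direction of the overall update is preserved and only its length is capped.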
[26]

[Plot: FID against sampling steps on 64x64 ImageNet, with curves for Distilled DDIM, Distilled Stochastic, and Undistilled Stochastic.]

Figure 6: FID of generated samples from distilled and undistilled models, using DDIM or stochastic sampling. For the stochastic sampling results we present the best FID obtained by a grid-search over 11 possible noise levels, spa...

work page 2020
[27]

forms a non-Gaussian distribution that falls outside the family of Gaussian distributions that can be modelled by a single DDPM student step: a multi-step stochastic DDPM sampler therefore cannot be distilled into a few-step sampler without some loss in fidelity. This is in contrast with the deterministic DDIM sampler: here both the two-step DDIM teacher upd...

work page 2021
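The contrast drawn above, that a deterministic two-step DDIM teacher can be matched by a single student step, can be illustrated with scalars. This is a sketch under stated assumptions, not the authors' implementation. `toy_teacher` is a hypothetical stand-in for a trained denoiser, the schedule is the variance-preserving one indexed by log-SNR, and the target formula is taken from the paper's distillation algorithm, which this section does not reproduce.

```python
import math


def alpha_sigma(lam):
    """Variance-preserving schedule: alpha^2 = sigmoid(lam), sigma^2 = sigmoid(-lam)."""
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))


def ddim_step(z, x_hat, lam_from, lam_to):
    """Deterministic DDIM update from log-SNR lam_from to lam_to."""
    a_from, s_from = alpha_sigma(lam_from)
    a_to, s_to = alpha_sigma(lam_to)
    return (s_to / s_from) * (z - a_from * x_hat) + a_to * x_hat


def toy_teacher(z, lam, data_var=4.0):
    """Hypothetical denoiser: posterior mean of x given z when x ~ N(0, data_var)."""
    a, s = alpha_sigma(lam)
    return a * data_var * z / (a * a * data_var + s * s)


def student_target(z_t, lam_t, lam_mid, lam_end):
    """Run two teacher DDIM steps, then solve for the x-prediction that
    reaches the same endpoint in a single DDIM step."""
    z_mid = ddim_step(z_t, toy_teacher(z_t, lam_t), lam_t, lam_mid)
    z_end = ddim_step(z_mid, toy_teacher(z_mid, lam_mid), lam_mid, lam_end)
    a_t, s_t = alpha_sigma(lam_t)
    a_end, s_end = alpha_sigma(lam_end)
    ratio = s_end / s_t
    x_target = (z_end - ratio * z_t) / (a_end - ratio * a_t)
    return x_target, z_end


z_t, lam_t, lam_mid, lam_end = 0.9, -2.0, -1.0, 0.0
x_target, z_end = student_target(z_t, lam_t, lam_mid, lam_end)
one_step = ddim_step(z_t, x_target, lam_t, lam_end)  # student step using the target
```

Since every operation is deterministic, `one_step` reproduces the teacher's two-step endpoint `z_end` up to floating-point error. A stochastic teacher has no such single target, which is the point the passage makes.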
[28]

For each schedule we selected the optimal learning rate from [5e−5, 1e−4, 2e−4, 3e−4]

All reported numbers are averages over 4 random seeds. For each schedule we selected the optimal learning rate from [5e−5, 1e−4, 2e−4, 3e−4].

[Plot: FID against sampling steps in two panels. 64x64 ImageNet: 50k updates, 10k updates. 128x128 LSUN Bedrooms: 50k upda...]

work page 2022