Deep Unsupervised Learning using Nonequilibrium Thermodynamics
22 Pith papers cite this work. Polarity classification is still in progress.
abstract
A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.
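The abstract's forward/reverse construction can be made concrete in a few lines. Below is a minimal sketch in the now-standard DDPM-style epsilon-prediction parameterization, not the paper's original Gaussian/binomial Theano reference implementation; the noise schedule, the tiny network, and the toy 2-D data are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    T = 1000
    betas = torch.linspace(1e-4, 2e-2, T)            # assumed forward noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

    def q_sample(x0, t, eps):
        # Forward diffusion: iteratively destroy structure with Gaussian noise,
        # q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).
        ab = alphas_bar[t].view(-1, 1)
        return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

    class TinyDenoiser(nn.Module):
        # Toy reverse model: predicts the noise that was added at step t.
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                     nn.Linear(128, dim))

        def forward(self, xt, t):
            tt = (t.float() / T).view(-1, 1)
            return self.net(torch.cat([xt, tt], dim=-1))

    model = TinyDenoiser(dim=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x0 = 0.1 * torch.randn(256, 2) + 1.0             # stand-in 2-D "data"

    for step in range(200):                          # learn the reverse process
        t = torch.randint(0, T, (x0.shape[0],))
        eps = torch.randn_like(x0)
        loss = ((model(q_sample(x0, t, eps), t) - eps) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    @torch.no_grad()
    def p_sample(n=64):
        # Reverse diffusion: start from noise and iteratively restore structure.
        x = torch.randn(n, 2)
        for t in reversed(range(T)):
            tv = torch.full((n,), t)
            eps_hat = model(x, tv)
            x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps_hat) / (1.0 - betas[t]).sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)
        return x

Each reverse step inverts one small forward step, which is why the paper can stack thousands of them while keeping training, sampling, and likelihood evaluation tractable.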
roles
background 1
polarities
background 1
citing papers explorer
-
Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds
Joint KL yields horizon-free approximation error but an information-theoretic Ω(H) lower bound on estimation error in autoregressive learning, with matching computationally efficient upper bounds.
-
Generative models on phase space
Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.
-
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
-
Hierarchical Text-Conditional Image Generation with CLIP Latents
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
-
High-Resolution Image Synthesis with Latent Diffusion Models
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while substantially cutting computational requirements relative to pixel-space diffusion models (see the pipeline sketch after this list).
-
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting (a classifier-free guidance sketch follows after this list).
-
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training (see the SDEdit-style sketch after this list).
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve better FID scores than GANs on unconditional and conditional ImageNet image synthesis.
-
PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes meet both visual-fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.
-
Diffusion model for SU(N) gauge theories
Implicit score matching trains diffusion models that successfully sample SU(3) Wilson gauge configurations on lattices, with a Hamiltonian-dynamics corrector needed for strong coupling.
-
GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model
GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.
-
Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework
FMDiffWA uses frequency-domain modulation inside diffusion sampling to neutralize watermarks in images while preserving visual quality and generalizing across watermarking schemes.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
Discrete Flow Maps
Discrete Flow Maps recast flow map training for discrete domains using simplex geometry to enable single-step text generation from noise and outperform prior discrete flow models.
-
MuPPet: Multi-person 2D-to-3D Pose Lifting
MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving occlusion robustness.
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-the-art generators.
-
VideoGPT: Video Generation using VQ-VAE and Transformers
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
-
Mesh Based Simulations with Spatial and Temporal awareness
A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention correction, and adding 3D rotary positional embeddings.
-
Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI
A pre-trained Earth Observation diffusion model generates realistic post-wildfire Sentinel-2 imagery from burn masks via inpainting, achieving Burn IoU 0.456 and improved saliency over full generation.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
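For the latent-diffusion entry above, the pipeline reduced to its skeleton: encode once with a pretrained autoencoder, run the entire diffusion in the compact latent space, decode once. `autoencoder` and `denoise_latents` are placeholders assumed for illustration, not the actual LDM modules.

    import torch

    @torch.no_grad()
    def generate_latent(autoencoder, denoise_latents, cond, latent_shape):
        z = torch.randn(latent_shape)        # noise in latent, not pixel, space
        z = denoise_latents(z, cond)         # full reverse diffusion over latents
        return autoencoder.decode(z)         # one decode back to pixels

Sampling in the lower-dimensional latent space is where the computational savings claimed in the summary come from.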
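For the GLIDE entry, a sketch of the classifier-free guidance rule it relies on: one denoiser is evaluated with and without the text conditioning, and the two noise predictions are extrapolated. Function and argument names are illustrative assumptions, not GLIDE's actual API.

    import torch

    def guided_eps(model, x_t, t, cond, null_cond, scale=3.0):
        # eps_hat = eps(x,t,null) + scale * (eps(x,t,cond) - eps(x,t,null))
        eps_uncond = model(x_t, t, null_cond)   # unconditional prediction
        eps_cond = model(x_t, t, cond)          # text-conditional prediction
        return eps_uncond + scale * (eps_cond - eps_uncond)

Scales above 1 push samples toward the prompt, trading diversity for fidelity.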
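And for the SDEdit entry, its noise-then-denoise recipe sketched against a DDPM-style model such as the one under the abstract: perturb the guide input to an intermediate noise level t0, then run only the tail of the reverse process, so no task-specific training is required. The value of t0 and all names are assumptions.

    import torch

    @torch.no_grad()
    def sdedit(x_guide, model, betas, alphas_bar, t0=400):
        # Partial forward noising of the user-provided guide.
        ab = alphas_bar[t0]
        x = ab.sqrt() * x_guide + (1.0 - ab).sqrt() * torch.randn_like(x_guide)
        # Partial reverse denoising, from step t0 back to 0.
        for t in reversed(range(t0 + 1)):
            tv = torch.full((x.shape[0],), t)
            eps_hat = model(x, tv)
            x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps_hat) / (1.0 - betas[t]).sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)
        return x

Larger t0 erases more of the guide and yields more realistic but less faithful outputs; smaller t0 does the opposite.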