A note on the evaluation of generative models
12 Pith papers cite this work.
abstract
Probabilistic generative models can be used for compression, denoising, inpainting, texture synthesis, semi-supervised learning, unsupervised feature learning, and other tasks. Given this wide range of applications, it is not surprising that a lot of heterogeneity exists in the way these models are formulated, trained, and evaluated. As a consequence, direct comparison between models is often difficult. This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models with a focus on image models. In particular, we show that three of the currently most commonly used criteria---average log-likelihood, Parzen window estimates, and visual fidelity of samples---are largely independent of each other when the data is high-dimensional. Good performance with respect to one criterion therefore need not imply good performance with respect to the other criteria. Our results show that extrapolation from one criterion to another is not warranted and generative models need to be evaluated directly with respect to the application(s) they were intended for. In addition, we provide examples demonstrating that Parzen window estimates should generally be avoided.
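Since the abstract's argument turns on what a Parzen window estimate actually computes, here is a minimal sketch of the standard construction, assuming an isotropic Gaussian kernel fit to model samples; the function name and interface are illustrative, not from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, samples, sigma):
    """Average log-likelihood of test points under a Gaussian Parzen
    window (kernel density estimate) fit to samples from a model.

    test_x:  (n, d) array of held-out data points
    samples: (m, d) array of samples drawn from the generative model
    sigma:   kernel bandwidth, typically tuned on a validation set
    """
    n, d = test_x.shape
    m = samples.shape[0]
    # Squared distance between every test point and every model sample: (n, m)
    sq_dists = ((test_x[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    # log N(x; s_i, sigma^2 I) for every pair, then average the mixture
    log_kernel = -sq_dists / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    return (logsumexp(log_kernel, axis=1) - np.log(m)).mean()
```

In high dimensions the resulting number is dominated by the bandwidth and the number of samples rather than by the model being evaluated, which is one reason the paper argues such estimates should generally be avoided.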
citing papers
- Density estimation using Real NVP
  Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations (see the coupling-layer sketch after this list).
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
  DCGANs with architectural constraints learn a hierarchy of representations, from object parts to scenes, in both generator and discriminator across image datasets.
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics
  A forward diffusion process iteratively adds noise to data until all structure is destroyed, and a neural network learns the reverse process to generate new samples from the original distribution (see the forward-process sketch after this list).
- Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors
  Diffusion model priors enable training-free Bayesian sampling that reconstructs rain fields from path-integrated commercial microwave link measurements more accurately than Gaussian process baselines.
- Large Scale GAN Training for High Fidelity Natural Image Synthesis
  BigGANs achieve state-of-the-art class-conditional synthesis on ImageNet 128×128, with an Inception Score of 166.5 and an FID of 7.4, by scaling GANs and applying orthogonal regularization plus truncation (see the truncation sketch after this list).
- Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
  Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
- Coupling Models for One-Step Discrete Generation
  Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
- Learning to Theorize the World from Observation
  NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
- GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays
  GazeVaLM provides 960 gaze recordings from 16 radiologists on 60 chest X-rays (half synthetic), plus LLM predictions, for diagnostic accuracy and real-fake detection under matched conditions.
- Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
  Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models (see the decomposition after this list).
- Supersampling Stable Diffusion and More: An Approach for Interpolating Neural Networks Using Common Interpolation Methods
  Kernel interpolation with a constant scaling factor enables Stable Diffusion to produce higher-resolution images without training, and extends to general neural networks with small accuracy drops.
- Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural Network
  A fully connected neural network with a randomized loss synthesizes real-world tabular data distributions from Gaussian noise faster than state-of-the-art deep generative models.
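For the Real NVP entry, here is a minimal NumPy sketch of a single affine coupling layer, assuming toy tanh/linear conditioners in place of the paper's deep networks; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, half = 6, 3
# Placeholder conditioners: Real NVP uses deep nets here; fixed random
# linear maps are enough to demonstrate exact invertibility.
Ws = 0.1 * rng.normal(size=(half, half))
Wt = 0.1 * rng.normal(size=(half, half))
s = lambda x1: np.tanh(x1 @ Ws)  # log-scale, bounded for stability
t = lambda x1: x1 @ Wt           # translation

def forward(x):
    x1, x2 = x[:half], x[half:]
    y2 = x2 * np.exp(s(x1)) + t(x1)  # affine map of x2, conditioned on x1
    log_det = s(x1).sum()            # exact log |det Jacobian|, no matrix inversion
    return np.concatenate([x1, y2]), log_det

def inverse(y):
    y1, y2 = y[:half], y[half:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))  # closed-form inverse
    return np.concatenate([y1, x2])

x = rng.normal(size=d)
y, log_det = forward(x)
assert np.allclose(inverse(y), x)  # invertibility is exact
```

Because the Jacobian is triangular, its log-determinant is just the sum of the predicted log-scales, which is what makes exact density evaluation cheap.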
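For the nonequilibrium-thermodynamics entry, here is a minimal sketch of the Gaussian forward (noising) process; the linear schedule and its constants are assumptions for illustration, and the learned reverse process is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

def q_sample(x0, t):
    """Draw x_t from the forward process q(x_t | x_0): a Gaussian that
    mixes ever more noise into the data as t grows."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=8)
x_mid, x_end = q_sample(x0, T // 2), q_sample(x0, T - 1)
print(np.sqrt(alpha_bar[[0, T // 2, T - 1]]))  # remaining signal fraction per step
```

By the final step almost no signal remains, so generation can start from pure noise and invert the chain with the learned reverse model.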
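For the BigGAN entry, the truncation trick replaces the standard normal latent with a truncated normal at sampling time; BigGAN describes this as resampling out-of-range components, sketched below with an illustrative latent dimension and threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_z(dim, tau):
    """Latent from a standard normal truncated to [-tau, tau], obtained
    by resampling any component whose magnitude exceeds the threshold."""
    z = rng.normal(size=dim)
    while True:
        mask = np.abs(z) > tau
        if not mask.any():
            return z
        z[mask] = rng.normal(size=mask.sum())

z = truncated_z(128, tau=0.5)  # smaller tau: higher fidelity, less variety
```

The threshold lets a single trained model trade sample fidelity against variety at test time, without retraining.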
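For the Generative Frontiers entry, the tagline is consistent with the standard identity below, stated for a model q and a reference distribution p_ref; the paper's exact definitions and sign conventions may differ.

```latex
\mathrm{KL}(q \,\|\, p_{\mathrm{ref}})
  = \underbrace{\mathbb{E}_{x \sim q}\!\left[-\log p_{\mathrm{ref}}(x)\right]}_{\text{log generative perplexity}}
  \;-\; \underbrace{H(q)}_{\text{model entropy}}
```

The first term is the cross-entropy of model samples under the reference (the log of generative perplexity), so low perplexity alone can be bought by collapsing entropy, which is presumably why the paper evaluates both quantities jointly as a frontier.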