pith. machine review for the scientific record.

arxiv: 1605.08803 · v3 · submitted 2016-05-27 · 💻 cs.LG · cs.AI · cs.NE · stat.ML

Recognition: 3 theorem links · Lean Theorem

Density estimation using Real NVP

Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio

Pith reviewed 2026-05-11 23:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE · stat.ML
keywords density estimation · real NVP · invertible transformations · unsupervised learning · generative models · natural images · exact likelihood · latent space

The pith

Real NVP transformations provide invertible mappings that make density estimation tractable with exact likelihood computation, sampling, and latent inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces real-valued non-volume preserving transformations, called real NVP, to expand the class of usable probabilistic models for unsupervised learning. These transformations are designed to be invertible and learnable, so that the resulting models support exact log-likelihood evaluation, exact sampling from the model, exact recovery of latent variables, and an interpretable latent space. The authors apply the method to natural images and evaluate it through generated samples, likelihood scores, and direct manipulation of the latent variables on four datasets. A sympathetic reader cares because most high-dimensional density estimators previously required approximations that made some of these operations intractable or biased.

Core claim

We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.

What carries the argument

real NVP transformations built from stacked affine coupling layers whose scale and translation functions are parameterized by neural networks, allowing the Jacobian determinant to be computed in closed form.
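
The mechanics are compact enough to sketch. Below is a minimal NumPy rendering of one affine coupling layer, offered as an illustration rather than the paper's implementation: `scale_net` and `translate_net` are hypothetical callables standing in for the deep convolutional conditioners, and the binary `mask` selects the partition that passes through unchanged.

```python
import numpy as np

def affine_coupling_forward(x, mask, scale_net, translate_net):
    """One affine coupling layer: the masked partition is copied, the rest
    is scaled and shifted conditioned on it. The Jacobian is triangular,
    so log|det J| is just the sum of the scale outputs."""
    x_masked = x * mask                        # partition that stays identical
    s = scale_net(x_masked) * (1.0 - mask)     # log-scale for the other partition
    t = translate_net(x_masked) * (1.0 - mask)
    y = x_masked + (1.0 - mask) * (x * np.exp(s) + t)
    return y, s.sum(axis=-1)                   # closed-form log-det Jacobian

def affine_coupling_inverse(y, mask, scale_net, translate_net):
    """Exact inverse in one pass: the copied partition reproduces s and t,
    so no iterative solve is needed."""
    y_masked = y * mask
    s = scale_net(y_masked) * (1.0 - mask)
    t = translate_net(y_masked) * (1.0 - mask)
    return y_masked + (1.0 - mask) * (y - t) * np.exp(-s)
```

Forward and inverse each cost a single pass through the conditioner networks, which is why likelihood evaluation and sampling stay equally cheap.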

If this is right

  • Any data point can be assigned an exact probability under the learned distribution (all four operations are sketched in code after this list).
  • New samples are obtained by drawing from a simple base distribution and applying the inverse transformation.
  • Latent codes for observed images are recovered exactly rather than approximated.
  • The latent space supports direct arithmetic operations that produce semantically meaningful changes in the generated images.
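
A hedged sketch of how those four operations fall out of a stack of such layers, reusing the coupling functions above; `layers` is an assumed list of (mask, scale_net, translate_net) triples and the base distribution is a standard Gaussian:

```python
import numpy as np

def flow_log_prob(x, layers):
    """Exact log p(x): map x to the latent space through every coupling
    layer, accumulating the change-of-variables correction, then score
    the result under the standard-normal base distribution."""
    z, log_det = x, np.zeros(x.shape[0])
    for mask, s_net, t_net in layers:
        z, ldj = affine_coupling_forward(z, mask, s_net, t_net)
        log_det += ldj
    base = -0.5 * (z ** 2 + np.log(2.0 * np.pi)).sum(axis=-1)
    return base + log_det

def flow_sample(n, dim, layers, seed=0):
    """Exact sampling: draw z ~ N(0, I) and apply the layer inverses in
    reverse order."""
    z = np.random.default_rng(seed).standard_normal((n, dim))
    for mask, s_net, t_net in reversed(layers):
        z = affine_coupling_inverse(z, mask, s_net, t_net)
    return z

# Exact latent recovery is the forward loop of flow_log_prob without the
# log-det term; latent arithmetic then maps edited codes back through the
# inverse loop of flow_sample.
```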

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same coupling-layer construction could be adapted to sequential or graph-structured data if the conditioner networks are replaced by appropriate architectures.
  • Exact inference removes the need for variational bounds, which may simplify training objectives in other generative settings.
  • Because the volume change under the transformations is tracked exactly by the Jacobian determinant, they might be combined with other invertible flows to trade off expressivity against computational cost.

Load-bearing premise

The neural-network-parameterized affine coupling layers are expressive enough to capture the structure of natural images without needing impractically many layers.

What would settle it

If a real NVP model trained on the same image datasets produces samples that bear no visual resemblance to the data or reports log-likelihood values far below those of other published density estimators, the practical utility claim would be refuted.

read the original abstract

Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces real-valued non-volume preserving (Real NVP) transformations based on affine coupling layers. These yield invertible maps whose Jacobians are triangular, allowing exact log-likelihood evaluation via the change-of-variables formula, exact sampling by inversion, and exact latent inference. The model is demonstrated on four image datasets (CIFAR-10, ImageNet at 32×32 and 64×64, LSUN, CelebA) with reported log-likelihoods, samples, and latent-space manipulations.

Significance. If the central construction holds, the work is significant: it supplies a flow-based generative model that simultaneously achieves exact likelihood, exact sampling, and competitive performance on high-dimensional natural images, addressing a key limitation of contemporaneous methods such as VAEs and GANs. The multi-scale architecture and neural-network parameterizations for the scale and translation functions are shown to be sufficiently expressive for the reported tasks.

minor comments (3)
  1. [§3.2] Eq. (6): the multi-scale architecture description would benefit from an explicit statement of how the checkerboard and channel-wise masks are alternated across layers to ensure full mixing (a toy mask construction is sketched after these comments).
  2. [Table 1] The log-likelihood numbers are given without standard errors across multiple runs; adding these would strengthen the quantitative comparison to NICE and other baselines.
  3. [Figure 4] The latent-space arithmetic examples are visually informative, but the paper does not report a quantitative measure (e.g., reconstruction error after manipulation) to support the claim of an interpretable latent space.
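
On comment 1, a toy construction of the two mask families may be useful. This is an assumption-laden sketch (the paper's squeeze operation between resolutions is omitted), with the `invert` flag flipping which partition is held fixed from layer to layer:

```python
import numpy as np

def checkerboard_mask(h, w, invert=False):
    """Spatial checkerboard: 1 where (row + col) is even, 0 elsewhere."""
    mask = (np.indices((h, w)).sum(axis=0) % 2 == 0).astype(np.float32)
    return 1.0 - mask if invert else mask

def channel_mask(c, invert=False):
    """Channel-wise split: first half of the channels fixed, rest transformed."""
    mask = np.zeros(c, dtype=np.float32)
    mask[: c // 2] = 1.0
    return 1.0 - mask if invert else mask

# Alternating invert from layer to layer ensures every coordinate is
# eventually transformed, so no dimension passes through the flow untouched.
masks = [checkerboard_mask(32, 32, invert=(i % 2 == 1)) for i in range(4)]
```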

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive evaluation of the manuscript. The provided summary accurately reflects the core contributions of Real NVP, including the use of affine coupling layers for invertible transformations with tractable Jacobians, enabling exact likelihood, sampling, and inference. We are pleased that the significance for flow-based generative modeling on high-dimensional image data is recognized.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The central construction defines affine coupling layers whose Jacobian is triangular by direct substitution (scale factors on one partition, identity on the other), yielding an exactly computable determinant via the change-of-variables formula. Log-likelihood, sampling, and latent inference follow immediately from this definition without fitted parameters or self-referential predictions. Prior work (NICE) is cited for context but is not load-bearing for the new real NVP properties or reported results. Empirical log-likelihoods on image datasets are external benchmarks, not internal fits renamed as predictions. No self-definitional, uniqueness-imported, or ansatz-smuggled steps appear.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard change-of-variables formula for densities under diffeomorphisms and on the assumption that neural networks can parameterize sufficiently flexible coupling functions; no ad-hoc constants or new entities are introduced.

axioms (1)
  • standard math · Change of variables formula for probability densities under invertible differentiable transformations
    Invoked to obtain exact log-likelihood from the Jacobian determinant of the coupling layers; the formula is written out below.
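
For reference, the invoked formula and the step that makes it tractable, in schematic notation (the partition into the first d of D coordinates follows the paper's coupling-layer convention):

```latex
% Change of variables for an invertible, differentiable f : X -> Z
\[
  \log p_X(x) = \log p_Z\!\big(f(x)\big)
              + \log\left|\det\frac{\partial f(x)}{\partial x^{\top}}\right|
\]
% Affine coupling layer (first d of D coordinates pass through unchanged):
\[
  y_{1:d} = x_{1:d}, \qquad
  y_{d+1:D} = x_{d+1:D} \odot \exp\!\big(s(x_{1:d})\big) + t(x_{1:d})
\]
% The Jacobian is block-triangular, so the log-determinant is a plain sum:
\[
  \log\left|\det\frac{\partial y}{\partial x^{\top}}\right|
    = \sum_{j} s(x_{1:d})_{j}
\]
```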

pith-pipeline@v0.9.0 · 5396 in / 1202 out tokens · 60648 ms · 2026-05-11T23:49:35.826368+00:00 · methodology

discussion (0)


Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Generative Modeling with Flux Matching

    cs.LG 2026-05 unverdicted novelty 8.0

    Flux Matching generalizes score-based generative modeling by using a weaker objective that admits infinitely many non-conservative vector fields with the data as stationary distribution, enabling new design choices be...

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. Denoising Diffusion Probabilistic Models

    cs.LG 2020-06 accept novelty 8.0

    Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.

  4. DriftXpress: Faster Drifting Models via Projected RKHS Fields

    cs.LG 2026-05 unverdicted novelty 7.0

    DriftXpress approximates drifting kernels via projected RKHS fields to lower training cost of one-step generative models while matching original FID scores.

  5. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  6. On the Invariance and Generality of Neural Scaling Laws

    cs.LG 2026-05 unverdicted novelty 7.0

    Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.

  7. TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models

    stat.ML 2026-05 unverdicted novelty 7.0

    TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.

  8. TMDs in the Lens of Generative AI: A Pixel-Based Approach to Partonic Imaging

    hep-ph 2026-05 unverdicted novelty 7.0

    A nonparametric pixel-based Bayesian method integrates TMD evolution with generative AI and SVD to image parton distributions and reveal null TMDs unconstrained by observables.

  9. Risk-Controlled Post-Processing of Decision Policies

    stat.ML 2026-05 unverdicted novelty 7.0

    Risk-controlled post-processing yields a threshold-structured policy that follows the baseline except where an oracle fallback sharply reduces conditional violation risk, achieving O(log n/n) expected excess risk in i...

  10. Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

    cs.LG 2026-05 unverdicted novelty 7.0

    Diffusion model priors enable training-free Bayesian sampling for more accurate rain field reconstruction from path-integrated commercial microwave link measurements than Gaussian process baselines.

  11. Personalized Multi-Interest Modeling for Cross-Domain Recommendation to Cold-Start Users

    cs.IR 2026-04 unverdicted novelty 7.0

    NF-NPCDR enhances neural processes with normalizing flows to model personalized multi-interest preferences and uses a preference pool plus adaptive decoder to improve cross-domain recommendations for cold-start users.

  12. Probing the 3D Structures of Supernovae through IR Signatures of CO and SiO

    astro-ph.HE 2026-04 unverdicted novelty 7.0

    MOFAT applied to SN2024ggi shows CO triggering inner SiO formation with a receding edge, order-of-magnitude mass drop, clumping signatures, and no dust formation.

  13. MorphoFlow: Sparse-Supervised Generative Shape Modeling with Adaptive Latent Relevance

    cs.CV 2026-04 unverdicted novelty 7.0

    MorphoFlow learns compact probabilistic 3D shape representations from sparse annotations using neural implicits, autodecoders, autoregressive flows, and adaptive sparsity priors on latent dimensions.

  14. Differentiable free energy surface: a variational approach to directly observing rare events using generative deep-learning models

    physics.comp-ph 2026-04 unverdicted novelty 7.0

    VaFES constructs a latent space from reversible collective variables and variationally optimizes a tractable-density generative model to produce a continuous free energy surface from which rare events are directly sampled.

  15. Operator Spectroscopy of Trained Lattice Samplers

    hep-lat 2026-05 unverdicted novelty 6.0

    Operator projections of trained sampler functions in 2D phi^4 lattice theory decompose residuals into zero-mode Binder and finite-k correlator components, distinguishing flow-matching, diffusion, and normalizing-flow models.

  16. CONTRA: Conformal Prediction Region via Normalizing Flow Transformation

    stat.ML 2026-05 unverdicted novelty 6.0

    CONTRA generates sharp multi-dimensional conformal prediction regions by defining nonconformity scores as distances from the center in the latent space of a normalizing flow.

  17. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  18. Accelerating the Simulation of Ordinary Differential Equations Through Physics-Preserving Neural Networks

    math.NA 2026-05 unverdicted novelty 6.0

    A neural network maps ODE states to a slow-evolving latent space with dynamics derived from the original equations via the chain rule, enabling accelerated simulations with fewer function calls.

  19. Conservative Flows: A New Paradigm of Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...

  20. Robust Conditional Conformal Prediction via Branched Normalizing Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    Branched Normalizing Flow improves conditional coverage robustness of conformal prediction under distribution shift by normalizing test inputs to the calibration distribution and mapping prediction sets back.

  21. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  22. OLLM: Options-based Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    OLLM models next-token generation as a latent-indexed set of options, enabling up to 70% math reasoning correctness versus 51% baselines and structure-based alignment via a compact latent policy.

  23. Lookahead Drifting Model

    cs.LG 2026-04 unverdicted novelty 6.0

    The lookahead drifting model improves upon the drifting model by sequentially computing multiple drifting terms that incorporate higher-order gradient information, leading to better performance on toy examples and CIFAR10.

  24. Dartmouth Stellar Evolution Emulator (DSEE) 1: Generative Stellar Evolution Model Database

    astro-ph.SR 2026-04 unverdicted novelty 6.0

    DSEE is a flow-based emulator that generates stellar evolution tracks and isochrones as probabilistic outputs from a single model trained on millions of simulations, enabling fast interpolation and uncertainty-aware analyses.

  25. Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Jeffreys Flow distills Parallel Tempering trajectories via Jeffreys divergence to produce robust Boltzmann generators that suppress mode collapse and correct sampling inaccuracies for rare event sampling.

  26. VideoGPT: Video Generation using VQ-VAE and Transformers

    cs.CV 2021-04 accept novelty 6.0

    VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.

  27. To Use AI as Dice of Possibilities with Timing Computation

    cs.AI 2026-05 unverdicted novelty 5.0

    Proposes verb-based paradigm with timing computation to enable data-driven discovery of patient trajectories and counterfactual timing from EHR data without domain knowledge.

  28. Pre-localization of Massive Black Hole Binaries in the Millihertz Band

    gr-qc 2026-04 unverdicted novelty 5.0

    A neural spline flow pipeline performs amortized inference on millihertz MBHB signals, delivering ~20 deg² pre-merger sky localizations in ~1 minute while matching PTMCMC sky modes and parameter uncertainties.

  29. Generative Design of a Gas Turbine Combustor Using Invertible Neural Networks

    cs.AI 2026-04 unverdicted novelty 5.0

    Invertible Neural Networks are used to generate gas turbine combustor designs that meet specified performance criteria from a training database of parameterized designs and simulations.

  30. Scalable DDPM-Polycube: An Extended Diffusion-Based Method for Hexahedral Mesh and Volumetric Spline Construction

    cs.CE 2026-04 unverdicted novelty 3.0

    Scalable DDPM-Polycube adds a blind-hole cube primitive, enlarges the grid to 3D, and introduces genus-guided hierarchical verification to improve diffusion-based polycube generation for complex geometries.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 30 Pith papers · 8 internal anchors

  1. [1]

    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

    Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016

  2. [2]

    Understanding symmetries in deep networks

    Vijay Badrinarayanan, Bamdev Mishra, and Roberto Cipolla. Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029, 2015

  3. [3]

    Density modeling of images using a generalized normalization transformation

    Johannes Ballé, Valero Laparra, and Eero P Simoncelli. Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281, 2015

  4. [4]

    An information-maximization approach to blind separation and blind deconvolution

    Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995

  5. [5]

    Artificial neural networks and their application to sequence recognition

    Yoshua Bengio. Artificial neural networks and their application to sequence recognition. 1991

  6. [6]

    Modeling high-dimensional discrete data with multi-layer neural networks

    Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS, volume 99, pages 400–406, 1999

  7. [7]

    Stochastic gradient estimate variance in contrastive divergence and persistent contrastive divergence

    Mathias Berglund and Tapani Raiko. Stochastic gradient estimate variance in contrastive divergence and persistent contrastive divergence. arXiv preprint arXiv:1312.6002, 2013

  8. [8]

    Generating sentences from a continuous space

    Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015

  9. [9]

    Super-resolution with deep convolutional sufficient statistics

    Joan Bruna, Pablo Sprechmann, and Yann LeCun. Super-resolution with deep convolutional sufficient statistics. arXiv preprint arXiv:1511.05666, 2015

  10. [10]

    Importance weighted autoencoders

    Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015

  11. [11]

    Gaussianization

    Scott Shaobing Chen and Ramesh A Gopinath. Gaussianization. In Advances in Neural Information Processing Systems, 2000

  12. [12]

    A recurrent latent variable model for sequential data

    Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2962–2970, 2015

  13. [13]

    The helmholtz machine

    Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995

  14. [14]

    Higher order statistical decorrelation without information loss

    Gustavo Deco and Wilfried Brauer. Higher order statistical decorrelation without information loss. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors,Advances in Neural Information Processing Systems 7, pages 247–254. MIT Press, 1995

  15. [15]

    Deep generative image models using a Laplacian pyramid of adversarial networks

    Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28, 2015

  16. [16]

    Sample-based non-uniform random variate generation

    Luc Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM, 1986

  17. [17]

    NICE: Non-linear Independent Components Estimation

    Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014

  18. [18]

    Graphical models for machine learning and digital communication

    Brendan J Frey. Graphical models for machine learning and digital communication. MIT press, 1998

  19. [19]

    Texture synthesis using convolutional neural networks

    Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 262–270, 2015

  20. [20]

    MADE: masked autoencoder for distribution estimation

    Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for distribution estimation. CoRR, abs/1502.03509, 2015

  21. [21]

    Generative adversarial nets

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014

  22. [22]

    Towards conceptual compression

    Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. arXiv preprint arXiv:1604.08772, 2016

  23. [23]

    Continuous deep q-learning with model-based acceleration

    Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. arXiv preprint arXiv:1603.00748, 2016

  24. [24]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015

  25. [25]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016

  26. [26]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural Computation, 9(8):1735–1780, 1997

  27. [27]

    Stochastic variational inference

    Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013

  28. [28]

    Independent component analysis, volume 46

    Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004

  29. [29]

    Nonlinear independent component analysis: Existence and uniqueness results

    Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999

  30. [30]

    Generating images with recurrent adversarial networks

    Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016

  31. [31]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

  32. [32]

    Exploring the limits of language modeling

    Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016

  33. [33]

    Adam: A Method for Stochastic Optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  34. [34]

    Improving variational inference with inverse autoregressive flow

    Diederik P Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016

  35. [35]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  36. [36]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009

  37. [37]

    The neural autoregressive distribution estimator

    Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, 2011

  38. [38]

    Autoencoding beyond pixels using a learned similarity metric

    Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. CoRR, abs/1512.09300, 2015

  39. [39]

    Efficient backprop

    Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012

  40. [40]

    Deeply-supervised nets

    Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014

  41. [41]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015

  42. [42]

    Auxiliary deep generative models

    Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016

  43. [43]

    Neural variational inference and learning in belief networks

    Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014

  44. [44]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

  45. [45]

    A view of the em algorithm that justifies incremental, sparse, and other variants

    Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998

  46. [46]

    Pixel recurrent neural networks

    Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016

  47. [47]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015

  48. [48]

    Variational inference with normalizing flows

    Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015

  49. [49]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014

  50. [50]

    High-Dimensional Probability Estimation with Deep Density Models

    Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013

  51. [51]

    Learning representations by backpropagating errors

    David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by backpropagating errors. Cognitive modeling, 5(3):1, 1988

  52. [52]

    Imagenet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015

  53. [53]

    Deep boltzmann machines

    Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In International conference on artificial intelligence and statistics, pages 448–455, 2009

  54. [54]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks

    Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016

  55. [55]

    Markov chain monte carlo and variational inference: Bridging the gap

    Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, 2014

  56. [56]

    Mean field theory for sigmoid belief networks

    Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of artificial intelligence research, 4(1):61–76, 1996

  57. [57]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  58. [58]

    Information processing in dynamical systems: Foundations of harmony theory

    Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986

  59. [59]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2256–2265, 2015

  60. [60]

    Resnet in resnet: Generalizing residual architectures

    Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Generalizing residual architectures. CoRR, abs/1603.08029, 2016

  61. [61]

    Generative image modeling using spatial lstms

    Lucas Theis and Matthias Bethge. Generative image modeling using spatial lstms. In Advances in Neural Information Processing Systems, pages 1918–1926, 2015

  62. [62]

    A note on the evaluation of generative models

    Lucas Theis, Aäron Van Den Oord, and Matthias Bethge. A note on the evaluation of generative models. CoRR, abs/1511.01844, 2015

  63. [63]

    Variational gaussian process

    Dustin Tran, Rajesh Ranganath, and David M Blei. Variational gaussian process. arXiv preprint arXiv:1511.06499, 2015

  64. [64]

    RNADE: The real-valued neural autoregressive density-estimator

    Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013

  65. [65]

    Learning functions across many orders of magnitudes

    Hado van Hasselt, Arthur Guez, Matteo Hessel, and David Silver. Learning functions across many orders of magnitudes. arXiv preprint arXiv:1602.07714, 2016

  66. [66]

    Order matters: Sequence to sequence for sets

    Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015

  67. [67]

    Embed to control: A locally linear latent dynamics model for control from raw images

    Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2728–2736, 2015

  68. [68]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992

  69. [69]

    Multi-scale context aggregation by dilated convolutions

    Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015

  70. [70]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015

  71. [71]

    Colorful image colorization

    Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016