pith. machine review for the scientific record.

arxiv: 2604.20041 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

Normalizing Flows with Iterative Denoising

David Berthelot, Jiatao Gu, Joshua Susskind, Shuangfei Zhai, Tianrong Chen

Pith reviewed 2026-05-10 02:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords normalizing flows · iterative denoising · image generation · ImageNet · autoregressive sampling · generative models · likelihood-based training

The pith

Adding iterative denoising after autoregressive sampling lets normalizing flows reach competitive ImageNet performance at 64, 128, and 256 pixels while keeping exact likelihood training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces iTARFlow, which trains a normalizing flow end-to-end with a standard likelihood objective but switches to a two-stage sampling procedure: first an autoregressive pass, then an iterative denoising refinement step borrowed from diffusion methods. This combination is shown through experiments to produce competitive generation quality on ImageNet at multiple resolutions. The authors also examine the visual artifacts that remain, suggesting directions for further refinement. A sympathetic reader would care because the approach keeps the training objective unchanged and therefore preserves the ability to compute exact likelihoods, unlike diffusion models.
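A minimal sketch of that two-stage sampler, in Python-style pseudocode. The interface names (flow.inverse, score_fn, num_denoise_steps) are illustrative assumptions for exposition, not the released iTARFlow API.

    import torch

    def sample_itarflow(flow, score_fn, shape, num_denoise_steps=4, step_size=1e-2):
        """Schematic two-stage sampler: autoregressive flow inversion, then
        a short iterative-denoising refinement. Names are illustrative only."""
        # Stage 1: draw Gaussian latents and invert the flow. For a causal,
        # block-autoregressive flow this inversion proceeds block by block.
        z = torch.randn(shape)
        x = flow.inverse(z)  # assumed interface: latent noise -> (slightly noisy) image

        # Stage 2: diffusion-style refinement. score_fn(x) stands in for an
        # estimate of the gradient of log p(x); only a few steps are needed
        # because the flow output is already close to clean.
        for _ in range(num_denoise_steps):
            x = x + step_size * score_fn(x)
        return x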

Core claim

iTARFlow performs autoregressive generation followed by an iterative denoising procedure during sampling. The model is trained entirely with the standard normalizing-flow likelihood objective. On ImageNet at 64, 128, and 256 pixel resolutions the resulting samples are competitive with other contemporary generative models, and the method yields measurable improvements over the base TARFlow architecture.

What carries the argument

The iterative denoising procedure applied after autoregressive generation in the sampling stage, which refines the output while the training objective remains unchanged.
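For orientation, the objective that stays fixed is the standard change-of-variables likelihood for an invertible map f_θ with base density p_Z; this is the generic normalizing-flow form, not an equation quoted from the paper.

    \log p_\theta(x) = \log p_Z\big(f_\theta(x)\big) + \log\left|\det \frac{\partial f_\theta(x)}{\partial x}\right|,
    \qquad
    \mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim p_\text{data}}\big[\log p_\theta(x)\big].

Because the denoising refinement enters only at sampling time, this objective, and the exact likelihoods it yields, are left untouched.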

If this is right

  • Normalizing flows can incorporate diffusion-style refinement steps at sampling time without retraining or losing the likelihood guarantee.
  • Performance on ImageNet scales to 256 pixels with this hybrid procedure, narrowing the gap with non-likelihood models.
  • Artifact analysis provides concrete visual diagnostics for identifying and correcting remaining failure modes in flow-based generators.
  • The separation of autoregressive generation and denoising stages allows independent tuning of each component.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same two-stage sampling pattern could be tested on non-image data such as audio or point clouds to check whether the denoising benefit generalizes beyond pixels.
  • If the denoising procedure can be expressed as an additional invertible transformation, it might be folded back into the flow itself to restore a single-stage sampler.
  • The reported competitiveness on ImageNet suggests that likelihood-based models may benefit from post-hoc refinement modules more broadly, even when those modules are not likelihood-preserving.

Load-bearing premise

The iterative denoising step added at sampling time does not break the validity of the likelihood objective or introduce inconsistencies that make the reported likelihood values unreliable.

What would settle it

An experiment that recomputes exact log-likelihoods on held-out data with the denoising stage disabled and finds that the values no longer match the likelihoods reported for the full iTARFlow pipeline. Because the denoising acts only at sampling time, held-out likelihoods should be unchanged by the toggle; a mismatch would undermine the premise.
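A sketch of how that check could be run, assuming a flow object exposing a log_prob method and a held-out data loader (both hypothetical names); the exact held-out likelihood depends only on the flow itself.

    import math
    import torch

    @torch.no_grad()
    def heldout_bits_per_dim(flow, data_loader, num_dims):
        """Exact negative log-likelihood of held-out data under the flow, in bits/dim.
        flow.log_prob(x) is an assumed interface returning log p(x) per example."""
        total_nll, count = 0.0, 0
        for x in data_loader:
            total_nll += (-flow.log_prob(x)).sum().item()
            count += x.shape[0]
        return total_nll / (count * num_dims * math.log(2.0))

    # Run once with the sampling-time denoising enabled in the pipeline and once
    # with it disabled: identical values support the premise; a mismatch against
    # the reported full-pipeline likelihoods would undermine it.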

Figures

Figures reproduced from arXiv: 2604.20041 by David Berthelot, Jiatao Gu, Joshua Susskind, Shuangfei Zhai, Tianrong Chen.

Figure 1. Demonstration of noise dilemma: When the maximum noise level tmax used during training is too small, the model tends to generate images with rich local textures but poor global structure. Conversely, when the maximum noise level tmax is large, the model generates samples with accurate global structure but noticeably blurred fine details and visible artifacts (zoom in for details), even after self-denoising…
Figure 2. Demonstration of iTARFlow. During training, we optimize a TARFlow across a range of noise levels using a shared network, analogous to diffusion models. The TARFlow is an invertible, causal Transformer-based NF composed of L stacked causal Transformer blocks. During sampling, the model first generates a noisy sample xt and then performs iterative denoising by leveraging automatic differentiation of the para… (a schematic of this denoising step is sketched after the figure list)
Figure 3. Samples from iTARFlow in pixel space. Left to right, top to bottom: ImageNet 256, 128, and 64 resolution with …
Figure 6. Ablation study of patch size across ImageNet res…
Figure 5. (a) Ablation study of the choice of tmax given tmin = 0.01 over training epochs. (b) Ablation study of the number of denoising steps used in iTARFlow. Since the normalizing flow already produces samples with relatively small noise, the number of steps required to obtain a clean image is typically not large.
Figure 7. Two primary artifact types are observed in the …
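The Figure 2 caption attributes the denoising step to automatic differentiation of the model. One common reading, sketched below under assumptions, is a Tweedie-style update that uses the gradient of the flow's own log-density as the score; the update rule and interface names here are illustrative, not the paper's exact procedure.

    import torch

    def self_denoise(flow, x_noisy, sigma, num_steps=4):
        """Refine a sample with the flow's own score, obtained by automatic
        differentiation of log p(x). A schematic reading of Figure 2, not the
        exact published algorithm."""
        x = x_noisy
        for _ in range(num_steps):
            x = x.detach().requires_grad_(True)
            log_p = flow.log_prob(x).sum()            # assumed interface
            score = torch.autograd.grad(log_p, x)[0]  # gradient of log p(x) w.r.t. x
            # Tweedie-style move toward higher model density, scaled by the
            # assumed noise level and spread over a few small steps.
            x = x + (sigma ** 2 / num_steps) * score
        return x.detach()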
Original abstract

Normalizing Flows (NFs) are a classical family of likelihood-based methods that have received revived attention. Recent efforts such as TARFlow have shown that NFs are capable of achieving promising performance on image modeling tasks, making them viable alternatives to other methods such as diffusion models. In this work, we further advance the state of Normalizing Flow generative models by introducing iterative TARFlow (iTARFlow). Unlike diffusion models, iTARFlow maintains a fully end-to-end, likelihood-based objective during training. During sampling, it performs autoregressive generation followed by an iterative denoising procedure inspired by diffusion-style methods. Through extensive experiments, we show that iTARFlow achieves competitive performance across ImageNet resolutions of 64, 128, and 256 pixels, demonstrating its potential as a strong generative model and advancing the frontier of Normalizing Flows. In addition, we analyze the characteristic artifacts produced by iTARFlow, offering insights that may shed light on future improvements. Code is available at https://github.com/apple/ml-itarflow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces iTARFlow as an extension of TARFlow for normalizing flows on images. Training remains fully end-to-end and likelihood-based, while sampling combines autoregressive generation with an iterative denoising procedure inspired by diffusion models. The authors report competitive performance on ImageNet at 64, 128, and 256 pixel resolutions and analyze characteristic artifacts produced by the model.

Significance. If the added iterative denoising can be shown to preserve exact likelihoods under the trained density, the approach could meaningfully advance normalizing flows as competitive alternatives to diffusion models on high-resolution image synthesis. The public release of code is a clear strength for reproducibility.

major comments (1)
  1. [Sampling procedure section] The sampling procedure (described after the training objective) adds iterative denoising steps after autoregressive generation. The manuscript must explicitly show that these steps are either absorbed into the invertible flow layers or do not change the base density, so that the reported likelihoods remain valid for the trained model. Without a derivation or empirical verification that the full sampling distribution matches the likelihood objective, the central claim that iTARFlow is a valid normalizing flow is at risk.
minor comments (2)
  1. [Abstract] The abstract asserts competitive performance across ImageNet resolutions but supplies no numerical metrics, baselines, or error bars; the full manuscript should ensure these appear in the main results tables with clear comparisons.
  2. [Methods] Notation for the iterative denoising steps should be introduced with explicit equations showing how they interact with the autoregressive components and the original TARFlow layers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the relationship between training and sampling in iTARFlow. We address the major comment below.

Point-by-point responses
  1. Referee: [Sampling procedure section] The sampling procedure (described after the training objective) adds iterative denoising steps after autoregressive generation. The manuscript must explicitly show that these steps are either absorbed into the invertible flow layers or do not change the base density, so that the reported likelihoods remain valid for the trained model. Without a derivation or empirical verification that the full sampling distribution matches the likelihood objective, the central claim that iTARFlow is a valid normalizing flow is at risk.

    Authors: We agree that an explicit derivation is required to confirm that the reported likelihoods remain valid. In the current manuscript the training objective is strictly the negative log-likelihood of the autoregressive TARFlow layers, which are fully invertible. The iterative denoising is a post-sampling refinement step. In the revision we will add a dedicated subsection that (i) derives the composite transformation as a composition of the original flow with additional invertible denoising operators whose Jacobian determinant can be computed exactly, and (ii) provides a short empirical check that the likelihood of samples drawn with the full procedure matches the likelihood of the base model within numerical tolerance. This will be placed immediately after the description of the sampling algorithm. revision: yes
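For reference, the identity such a subsection would rest on is the generic composition rule for invertible maps: if the sampler applies denoising operators g_1, …, g_K after the flow inverse f_θ^{-1}, and each g_k is invertible with a tractable Jacobian, the composite still defines an exact density. The notation below is a sketch under that assumption, not an equation from the manuscript.

    T = g_K \circ \cdots \circ g_1 \circ f_\theta^{-1}, \qquad
    \log p_T(x) = \log p_Z\big(T^{-1}(x)\big)
      + \sum_{k=1}^{K} \log\left|\det J_{g_k^{-1}}(u_k)\right|
      + \log\left|\det J_{f_\theta}(u_0)\right|,

    \text{where } u_K = x,\; u_{k-1} = g_k^{-1}(u_k),\; \text{and } u_0 \text{ is the input to } f_\theta \text{ along the inverse path.}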

Circularity Check

0 steps flagged

No circularity detected; likelihood objective stated as independent of sampling

Full rationale

The provided abstract and context present iTARFlow as maintaining an end-to-end likelihood-based training objective that is explicitly distinguished from the sampling procedure (autoregressive generation plus iterative denoising). No equations, fitted parameters, or self-citations are shown that would reduce the reported performance or likelihood validity to quantities that hold by construction from the inputs. The central claim of competitive performance on ImageNet is supported by experiments rather than by renaming or self-referential definitions. This satisfies the default expectation of a self-contained derivation without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training details, or architectural specifications, so no free parameters, axioms, or invented entities can be identified with certainty.

pith-pipeline@v0.9.0 · 5487 in / 1113 out tokens · 52216 ms · 2026-05-10T02:08:05.362635+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  2. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Diffusion Models Beat GANs on Image Synthesis

    Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233.

  2. [2]

    Density estimation using Real NVP

    Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

  3. [3]

    STARFlow: Scaling Latent Normalizing Flows for High-Resolution Image Synthesis

    Gu, J., Chen, T., Berthelot, D., Zheng, H., Wang, Y., Zhang, R., Dinh, L., Bautista, M. A., Susskind, J., and Zhai, S. STARFlow: Scaling latent normalizing flows for high-resolution image synthesis. arXiv preprint arXiv:2506.06276, 2025a. Gu, J., Shen, Y., Chen, T., Dinh, L., Wang, Y., Bautista, M. Á., Berthelot, D., Susskind, J. M., and Zhai, S. End-...

  4. [4]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

  5. [5]

    Denoising Diffusion Probabilistic Models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.

  6. [6]

    Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with Pixel-Space Diffusion

    Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., and Salimans, T. Simpler diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324.

  7. [7]

    Scalable Adaptive Computation for Iterative Generation

    Jabri, A., Fleet, D., and Chen, T. Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972.

  8. [8]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  9. [9]

    Fractal Generative Models

    Li, T., Sun, Q., Fan, L., and He, K. Fractal generative models. arXiv preprint arXiv:2502.17437.

  10. [10]

    Scaling Laws for Diffusion Transformers

    Liang, Z., He, H., Yang, C., and Dai, B. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184.

  11. [11]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

  12. [12]

    Improved Denoising Diffusion Probabilistic Models

    Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672.

  13. [13]

    Masked Autoregressive Flow for Density Estimation

    Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. arXiv preprint arXiv:1705.07057.

  14. [14]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.

  15. [15]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a. Song, Y. and Dhariwal, P. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189.

  16. [16]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b. Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. arXiv preprint arXiv:2303.01469.

  17. [17]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., and Yuan, Z. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.

  18. [18]

    GIVT: Generative Infinite-Vocabulary Transformers

    Tschannen, M., Eastwood, C., and Mentzer, F. GIVT: Generative infinite-vocabulary transformers. In European Conference on Computer Vision, pp. 292–309. Springer, 2024a. Tschannen, M., Pinto, A. S., and Kolesnikov, A. Jetformer: An autoregressive generative model of raw images and text. arXiv preprint arXiv:2411.19722, 2024b. Van den Oord, A., Kalchbrenner...

  19. [19]

    PixNerd: Pixel Neural Field Diffusion

    Wang, S., Gao, Z., Zhu, C., Huang, W., and Wang, L. PixNerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268.

  20. [20]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.

  21. [21]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., and Xie, S. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940.

  22. [22]

    Normalizing Flows are Capable Generative Models

    Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. A., Jaitly, N., and Susskind, J. Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329.

  23. [23]

    Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows

    Zhang, R., Zhai, S., Gu, J., Zhang, Y., Zheng, H., Chen, T., Bautista, M. A., Susskind, J., and Jaitly, N. Flexible language modeling in continuous space with transformer-based autoregressive flows. arXiv preprint arXiv:2507.00425.

  24. [24]

    Farmer: Flow Autoregressive Transformer over Pixels

    Zheng, G., Zhao, Q., Yang, T., Xiao, F., Lin, Z., Wu, J., Deng, J., Zhang, Y., and Zhu, R. Farmer: Flow autoregressive transformer over pixels. arXiv preprint arXiv:2510.23588.