pith. machine review for the scientific record.

arxiv: 2511.13720 · v2 · submitted 2025-11-17 · 💻 cs.CV

Recognition: 3 theorem links


Back to Basics: Let Denoising Generative Models Denoise

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 22:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords: denoising diffusion, generative models, transformers, image generation, manifold assumption, ImageNet, pixel-level prediction, clean data prediction

The pith

Predicting clean images directly with simple Transformers on raw pixels produces competitive generative models for ImageNet at 256 and 512 resolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current denoising diffusion models avoid directly predicting clean images and instead forecast noise or other noised quantities. It argues that this choice ignores the manifold structure of natural data, where clean images occupy a low-dimensional surface while noisy versions fill the full high-dimensional space. By instead training models to map noisy inputs straight back to clean pixels, apparently limited networks can still succeed in pixel space without tokenizers, pre-training, or extra losses. The resulting JiT models, which are plain large-patch Transformers, reach competitive generation quality on ImageNet at both 256 and 512 resolution.

Core claim

Directly predicting the clean data from noised inputs, rather than predicting noise or a noised quantity, lets simple large-patch Transformers operate effectively as generative models on raw pixels. These JiT networks require no tokenizer, no pre-training, and no auxiliary loss, yet produce competitive samples on ImageNet at 256 and 512 resolution, where high-dimensional noise prediction tends to fail.
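The distinction between the two prediction targets can be made concrete with a small NumPy sketch. The schedule, variable names, and loss forms here are generic illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean image" batch: 4 flattened 32x32x3 pixel vectors.
x0 = rng.uniform(-1.0, 1.0, size=(4, 3072))

# Generic variance-preserving noising: x_t = alpha*x0 + sigma*eps.
alpha, sigma = 0.6, 0.8  # alpha**2 + sigma**2 == 1
eps = rng.standard_normal(x0.shape)
x_t = alpha * x0 + sigma * eps

# Target A (common practice): the network regresses the noise eps.
def eps_prediction_loss(pred_eps):
    return float(np.mean((pred_eps - eps) ** 2))

# Target B (advocated here): the network regresses the clean data x0.
def clean_prediction_loss(pred_x0):
    return float(np.mean((pred_x0 - x0) ** 2))

# The targets are algebraically interchangeable given x_t ...
assert np.allclose((x_t - sigma * eps) / alpha, x0)
# ... but the paper's point is that x0 lies on a low-dimensional
# manifold while eps fills the full ambient space, so which quantity
# a limited-capacity network must output matters.
```

Either target determines the other given x_t; the argument is about which one a network with limited capacity can represent well.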

What carries the argument

JiT, or Just image Transformers: large-patch Transformers applied directly to pixels that predict clean data from noised versions by exploiting the manifold structure of natural images.

If this is right

  • Networks with limited capacity can still generate high-resolution images when trained to recover points on the data manifold.
  • Generative performance remains competitive without tokenizers or pre-training when the prediction target is the clean image.
  • Large patch sizes of 16 and 32 become viable for Transformer-based diffusion on raw pixels.
  • A self-contained training paradigm for diffusion models on natural images is possible without auxiliary components.
  • Direct clean-image prediction avoids catastrophic failure modes observed when predicting high-dimensional noised quantities.
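The "large patch" point above is easy to quantify. A quick back-of-envelope using the resolutions and patch sizes reported in the abstract (the helper function itself is ours):

```python
def patch_stats(resolution: int, patch: int, channels: int = 3) -> tuple[int, int]:
    """Token count and per-token pixel dimensionality after patchifying."""
    tokens = (resolution // patch) ** 2
    dim_per_token = patch * patch * channels
    return tokens, dim_per_token

# Settings from the abstract:
print(patch_stats(256, 16))  # (256, 768)
print(patch_stats(256, 32))  # (64, 3072)
print(patch_stats(512, 32))  # (256, 3072)
```

At patch size 32, every token is a 3072-dimensional regression target, which is the regime where the review says predicting high-dimensional noised quantities tends to fail.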

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same direct-prediction strategy could reduce architectural complexity in generative models for other high-dimensional data such as audio or video.
  • Training dynamics might change when the network is explicitly encouraged to map back onto the manifold rather than into the ambient noise space.
  • Model-size requirements for high-resolution generation could be re-examined under the clean-prediction objective.
  • Classical signal-processing denoising ideas may map more directly onto modern diffusion training once the target is restored to clean data.

Load-bearing premise

Natural data lies on a low-dimensional manifold while noised data does not.
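A toy numerical illustration of this premise (our construction, not an experiment from the paper): data confined to a low-dimensional linear subspace concentrates its variance in a few principal components, while its noised version spreads variance across the full ambient space.

```python
import numpy as np

rng = np.random.default_rng(0)
ambient, intrinsic, n = 100, 5, 2000

# "Clean" data: lives on a random 5-dimensional subspace of R^100.
basis = rng.standard_normal((intrinsic, ambient))
clean = rng.standard_normal((n, intrinsic)) @ basis

# "Noised" data: the same points plus full-dimensional Gaussian noise.
noised = clean + rng.standard_normal((n, ambient))

def components_for_90pct(data: np.ndarray) -> int:
    """Principal components needed to capture 90% of the variance."""
    s = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
    ratios = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(ratios), 0.9)) + 1

print(components_for_90pct(clean))   # at most 5: low-dimensional
print(components_for_90pct(noised))  # far more: fills the ambient space
```

This is exactly the kind of intrinsic-dimension check (via PCA) the referee notes the paper does not run on real image data.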

What would settle it

A clean-data-predicting large-patch Transformer that produces visibly worse or incoherent samples than a noise-predicting baseline at 512 resolution on ImageNet would falsify the central claim.

read the original abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard denoising diffusion models predict noise or noised quantities rather than clean data, and that directly predicting clean images is fundamentally different because natural data lies on a low-dimensional manifold while noised quantities do not. This allows simple, under-capacity networks to operate in high-dimensional pixel space. The authors introduce JiT (Just image Transformers): large-patch pixel Transformers trained with no tokenizer, no pre-training, and no extra loss, and report competitive ImageNet results at 256 and 512 resolutions with patch sizes 16 and 32, where noise-prediction baselines fail catastrophically.

Significance. If the results hold, the work demonstrates that a back-to-basics clean-data prediction target can enable competitive generative performance with minimal architectural complexity on raw pixels. This provides an empirical existence proof for simple large-patch Transformers as generative models and highlights the modeling choice of prediction target as potentially more important than tokenization or pre-training in high-dimensional settings.

major comments (2)
  1. [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablation on manifold properties, or comparison of effective dimensionality at training noise levels are provided to ground this premise.
  2. [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID or other metrics cannot be confidently attributed to the manifold-based rationale.
minor comments (2)
  1. [§2] The precise mathematical formulation of the clean-data prediction objective (e.g., the training loss and how it differs from standard noise-prediction diffusion) should be stated explicitly with an equation for reproducibility.
  2. [Tables and figures] Ensure quantitative tables report both patch size and resolution explicitly and include error bars or multiple seeds for the ImageNet 256/512 results to allow direct comparison with noise-prediction baselines.
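Minor comment 1 asks for the objective in equation form. For reference, this is the generic shape such an equation would take in standard diffusion notation (our reconstruction, not the paper's own statement):

```latex
% Forward noising (generic schedule; notation ours):
%   x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
% Noise-prediction objective (standard diffusion training):
\mathcal{L}_{\epsilon}(\theta)
  = \mathbb{E}_{x_0,\, \epsilon,\, t}
    \bigl\| \epsilon_\theta(x_t, t) - \epsilon \bigr\|^2
% Clean-data objective (the target the paper advocates):
\mathcal{L}_{x}(\theta)
  = \mathbb{E}_{x_0,\, \epsilon,\, t}
    \bigl\| x_\theta(x_t, t) - x_0 \bigr\|^2
```

The two losses differ only in the regression target; the paper's actual loss weighting and schedule may differ from this generic form.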

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablation on manifold properties, or comparison of effective dimensionality at training noise levels are provided to ground this premise.

    Authors: We appreciate this observation. The manifold hypothesis for natural images is a standard assumption in the field, with substantial supporting evidence from prior studies on the low-dimensional structure of image data. Our work builds on this by demonstrating that direct prediction of clean data enables effective modeling in high-dimensional pixel space with simple architectures, in contrast to noise prediction. While we do not provide new intrinsic dimension calculations, the empirical results, particularly the failure of noise prediction at large patch sizes, serve as indirect validation. In the revised manuscript, we will expand the discussion in Section 1 to include references to key literature on image manifolds and clarify the role of this assumption. revision: partial

  2. Referee: [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID or other metrics cannot be confidently attributed to the manifold-based rationale.

    Authors: We agree that careful isolation of variables strengthens the argument. Our experiments compare clean-data prediction (JiT) against noise-prediction baselines using the exact same Transformer architecture, patch sizes, and training protocol on raw pixels, with the only difference being the prediction target. This setup controls for network capacity and largely for optimization dynamics, as the training procedure is identical. The loss geometry is inherently tied to the choice of target, which is the central modeling decision under investigation. We believe this provides sufficient evidence for the importance of the prediction target. However, we will add a note in the experiments section acknowledging potential confounding factors and discussing why the target choice is the primary variable. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical demonstration remains self-contained without reductions to fitted inputs or self-citations.

full rationale

The manuscript advances an empirical claim that direct clean-image prediction with large-patch pixel Transformers yields competitive ImageNet results at 256/512 resolution, without tokenizers or pre-training. The manifold assumption is invoked as an explanatory premise for why this modeling choice succeeds where noise prediction fails, but the paper presents no equations, derivations, or parameter fits that reduce the reported performance to the assumption by construction. No self-citation chains, uniqueness theorems, or ansatzes are used to justify core choices; results are benchmark numbers rather than forced predictions. The derivation chain is therefore independent of its inputs and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about data manifolds; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Natural data lies on a low-dimensional manifold, whereas noised quantities do not.
    Invoked in the abstract to justify why predicting clean data is fundamentally different and advantageous.

pith-pipeline@v0.9.0 · 5504 in / 1122 out tokens · 41294 ms · 2026-05-11T22:10:03.888395+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces.

  • Foundation.JCostCoshIdentity jcost_exp_cosh_form echoes

    Predicting clean data is fundamentally different from predicting noise or a noised quantity.

  • Foundation.DiscretenessForcing continuous_no_isolated_zero_defect echoes

    simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss.

What do these tags mean?
echoes
The paper passage has the same mathematical shape or conceptual pattern as the theorem, but is not a direct formal dependency.
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 7.0

    FluxFlow is a conservative pixel-space flow-matching framework for astronomical super-resolution that incorporates real atmospheric uncertainty and a training-free Wiener correction, outperforming baselines on a new 1...

  2. Binomial flows: Denoising and flow matching for discrete ordinal data

    cs.LG 2026-05 unverdicted novelty 7.0

    Binomial flows close the gap between continuous flow matching and discrete ordinal data by using binomial distributions to enable unified denoising, sampling, and exact likelihoods in diffusion models.

  3. Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement

    cs.CV 2026-04 unverdicted novelty 7.0

    A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

  4. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  5. Coevolving Representations in Joint Image-Feature Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

  6. FARM: Foundational Aerial Radio Map for Intelligent Low-Altitude Networking

    eess.SP 2026-04 unverdicted novelty 7.0

    FARM is a foundation model combining masked autoencoders and diffusion decoders to estimate high-resolution aerial radio maps from a new multi-band low-altitude dataset, claiming superior accuracy and generalization o...

  7. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  8. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  9. BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

  10. Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies

    physics.ao-ph 2026-05 unverdicted novelty 6.0

    A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.

  11. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  12. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  13. Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

    cs.CV 2026-05 unverdicted novelty 6.0

    Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...

  14. FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

  15. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  16. A Few-Step Generative Model on Cumulative Flow Maps

    cs.LG 2026-05 unverdicted novelty 6.0

    Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.

  17. High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification

    cs.CV 2026-04 unverdicted novelty 6.0

    MSDiff maps degraded hyperspectral data to a low-dimensional manifold and uses diffusion to regularize features for more robust classification under complex degradations.

  18. CoreFlow: Low-Rank Matrix Generative Models

    cs.LG 2026-04 unverdicted novelty 6.0

    CoreFlow is a low-rank matrix generative model that trains normalizing flows on shared subspaces to improve efficiency and quality for high-dimensional limited-sample data, including incomplete matrices.

  19. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  20. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  21. VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport

    eess.IV 2026-04 unverdicted novelty 6.0

    VOLT is a probabilistic transport method with a 3D anisotropic network that improves wide-field microscopy volume reconstruction in lateral and axial directions while supplying voxel-wise credibility estimates.

  22. Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing

    cs.LG 2026-04 unverdicted novelty 6.0

    RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.

  23. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  24. FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...

  25. CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

    cs.CV 2026-04 unverdicted novelty 6.0

    CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.

  26. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  27. From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    LGCD creates pseudo-overlapping user data via LLM reasoning and uses conditional diffusion to generate target-domain user representations for inter-domain sequential recommendation without real overlapping users.

  28. ML-based approach to classification and generation of structured light propagation in turbulent media

    physics.optics 2026-04 unverdicted novelty 6.0

    ML models classify and generate structured light in turbulence using CNNs and diffusion models enhanced by Bregman distance minimization.

  29. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  30. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  31. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  32. FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 5.0

    FluxFlow uses conservative pixel-space flow-matching with uncertainty weights and Wiener test-time correction to outperform baselines on photometric and scientific accuracy for ground-to-space super-resolution, valida...

  33. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  34. Scaling Properties of Continuous Diffusion Spoken Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

  35. UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement

    cs.CV 2026-04 unverdicted novelty 5.0

    UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.

  36. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  37. PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction

    cs.AI 2026-04 unverdicted novelty 5.0

    PoreDiT generates 1024^3 voxel digital rock models via 3D Swin Transformer binary pore-field prediction, matching prior methods on porosity, permeability, and Euler characteristics while running on consumer hardware.

  38. Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification

    eess.SY 2026-04 unverdicted novelty 4.0

    Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.

  39. NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The second NTIRE challenge on day and night raindrop removal for dual-focused images received 17 valid team submissions that demonstrated strong performance on the Raindrop Clarity dataset.

  40. NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge reports strong performance from 17 teams on raindrop removal for dual-focused day and night images using an adjusted real-world dataset with 14,139 training images.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 38 Pith papers · 6 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023.

  2. [2]

    Deep variational information bottleneck

    Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.

  3. [3]

    Topology and data

    Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

  4. [4]

    Semi-Supervised Learning

    Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.

  5. [5]

    Neural ordinary differential equations

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.

  6. [6]

    PixelFlow: Pixel-space generative models with flow

    Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. PixelFlow: Pixel-space generative models with flow. arXiv:2504.07963, 2025.

  7. [7]

    On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv:2301.10972, 2023.

  8. [8]

    Deconstructing denoising diffusion models for self-supervised learning

    Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He. Deconstructing denoising diffusion models for self-supervised learning. In ICLR, 2025.

  9. [9]

    Image denoising by sparse 3-D transform-domain collaborative filtering

    Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.

  10. [10]

    Inversion by direct iteration: An alternative to denoising diffusion for image restoration

    Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023.

  11. [11]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

  12. [12]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  14. [14]

    Image denoising via sparse and redundant representations over learned dictionaries

    Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

  15. [15]

    Scaling rectified flow Transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow Transformers for high-resolution image synthesis. In ICML, 2024.

  16. [16]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InNeurIPS, 2014

  17. [17]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Doll ´ar, Ross Girshick, Pieter Noord- huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour.arXiv:1706.02677, 2017

  18. [18] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv:2509.24527, 2025.

  19. [19] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for Transformers. In Findings of EMNLP, 2020.

  20. [20] Karl Heun. Neue Methoden zur approximativen Integration der Differentialgleichungen einer unabhängigen Veränderlichen. Z. Math. Phys., 45:23–38, 1900.

  21. [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

  22. [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021.

  23. [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

  24. [24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. DDPM GitHub repo, diffusion_utils_2.py, L155, 2020.

  25. [25] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023.

  26. [26] Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. In CVPR, 2025.

  27. [27] Ahmed Imtiaz Humayun, Ibtihel Amara, Cristina Vasconcelos, Deepak Ramachandran, Candice Schumann, Junfeng He, Katherine Heller, Golnoosh Farnadi, Negar Rostamzadeh, and Mohammad Havaei. What secrets do your manifolds hold? Understanding the local geometry of generative models. In ICLR, 2025.

  28. [28] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In ICML, 2023.

  29. [29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.

  30. [30] Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In NeurIPS, 2023.

  31. [31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

  32. [32] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.

  33. [33] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024.

  34. [34] Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end-to-end pixel space generative modeling via self-supervised pre-training. arXiv:2510.12586, 2025.

  35. [35] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024.

  36. [36] Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. arXiv:2502.17437, 2025.

  37. [37] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023.

  38. [38] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.

  39. [39] Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research, 2024.

  40. [40] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers. In ECCV, 2024.

  41. [41] Alireza Makhzani and Brendan Frey. k-Sparse autoencoders. arXiv:1312.5663, 2013.

  42. [42] Peyman Milanfar and Mauricio Delbracio. Denoising: a powerful building block for imaging, inverse problems and machine learning. Philosophical Transactions A, 383(2299):20240326, 2025.

  43. [43] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  44. [44] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.

  45. [45] Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023.

  46. [46] William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023.

  47. [47] Javier Portilla, Vasily Strela, Martin J Wainwright, and Eero P Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003.

  48. [48] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.

  49. [49] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  50. [50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

  51. [51] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

  52. [52] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.

  53. [53] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.

  54. [54] Noam Shazeer. GLU variants improve Transformer. arXiv:2002.05202, 2020.

  55. [55] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv:2510.15301, 2025.

  56. [56] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

  57. [57] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

  58. [58] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.

  59. [59] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.

  60. [60] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

  61. [61] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

  62. [62] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  63. [63] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

  64. [64] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv:physics/0004057, 2000.

  65. [65] Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: an autoregressive generative model of raw images and text. In ICLR, 2025.

  66. [66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

  67. [67] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

  68. [68] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

  69. [69] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.

  70. [70] Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. PixNerd: Pixel neural field diffusion. arXiv:2507.23268, 2025.

  71. [71] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion Transformer. arXiv:2504.05741, 2025.

  72. [72] Yutong Xie, Minne Yuan, Bin Dong, and Quanzheng Li. Diffusion model for generative image denoising. In ICCV, 2023.

  73. [73] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.

  74. [74] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion Transformers is easier than you think. In ICLR, 2025.

  75. [75] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.

  76. [77] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  77. [78] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion Transformers with representation autoencoders. arXiv:2510.11690, 2025.

  78. [79] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In ICCV, 2011.

  [Appendix residue: ImageNet class labels, likely from a figure legend] class 012: house finch, linnet, Carpodacus mexicanus; class 014: indigo bunting, indigo finch, indigo bird, Passerina cyanea; class 042: agama; class 081: ptarmigan; class 107: jellyfish; class 108: sea anemone, anemone; class 110: flatworm, ...