pith. machine review for the scientific record.

arxiv: 2511.13720 · v2 · submitted 2025-11-17 · 💻 cs.CV

Recognition: 3 theorem links


Back to Basics: Let Denoising Generative Models Denoise

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 22:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords: denoising diffusion, generative models, transformers, image generation, manifold assumption, ImageNet, pixel-level prediction, clean data prediction

The pith

Predicting clean images directly with simple Transformers on raw pixels produces competitive generative models for ImageNet at 256 and 512 resolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current denoising diffusion models avoid directly predicting clean images and instead forecast noise or other noised quantities. It argues that this choice ignores the manifold structure of natural data, where clean images occupy a low-dimensional surface while noisy versions fill the full high-dimensional space. By instead training models to map noisy inputs straight back to clean pixels, apparently limited networks can still succeed in pixel space without tokenizers, pre-training, or extra losses. The resulting JiT models, which are plain large-patch Transformers, reach competitive generation quality on ImageNet at both 256 and 512 resolution.

Core claim

Directly predicting the clean data from noised inputs, rather than predicting noise or a noised quantity, lets simple large-patch Transformers operate effectively as generative models on raw pixels. These JiT networks require no tokenizer, no pre-training, and no auxiliary loss, yet produce competitive samples on ImageNet at 256 and 512 resolution, where high-dimensional noise prediction tends to fail.
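The distinction between the two prediction targets can be made concrete with a small NumPy sketch. The schedule, variable names, and loss forms here are generic illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean image" batch: 4 flattened 32x32x3 pixel vectors.
x0 = rng.uniform(-1.0, 1.0, size=(4, 3072))

# Generic variance-preserving noising: x_t = alpha*x0 + sigma*eps.
alpha, sigma = 0.6, 0.8  # alpha**2 + sigma**2 == 1
eps = rng.standard_normal(x0.shape)
x_t = alpha * x0 + sigma * eps

# Target A (common practice): the network regresses the noise eps.
def eps_prediction_loss(pred_eps):
    return float(np.mean((pred_eps - eps) ** 2))

# Target B (advocated here): the network regresses the clean data x0.
def clean_prediction_loss(pred_x0):
    return float(np.mean((pred_x0 - x0) ** 2))

# The targets are algebraically interchangeable given x_t ...
assert np.allclose((x_t - sigma * eps) / alpha, x0)
# ... but the paper's point is that x0 lies on a low-dimensional
# manifold while eps fills the full ambient space, so which quantity
# a limited-capacity network must output matters.
```

Either target determines the other given x_t; the argument is about which one a network with limited capacity can represent well.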

What carries the argument

JiT, or Just image Transformers: large-patch Transformers applied directly to pixels that predict clean data from noised versions by exploiting the manifold structure of natural images.

If this is right

  • Networks with limited capacity can still generate high-resolution images when trained to recover points on the data manifold.
  • Generative performance remains competitive without tokenizers or pre-training when the prediction target is the clean image.
  • Large patch sizes of 16 and 32 become viable for Transformer-based diffusion on raw pixels.
  • A self-contained training paradigm for diffusion models on natural images is possible without auxiliary components.
  • Direct clean-image prediction avoids catastrophic failure modes observed when predicting high-dimensional noised quantities.
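The "large patch" point above is easy to quantify. A quick back-of-envelope using the resolutions and patch sizes reported in the abstract (the helper function itself is ours):

```python
def patch_stats(resolution: int, patch: int, channels: int = 3) -> tuple[int, int]:
    """Token count and per-token pixel dimensionality after patchifying."""
    tokens = (resolution // patch) ** 2
    dim_per_token = patch * patch * channels
    return tokens, dim_per_token

# Settings from the abstract:
print(patch_stats(256, 16))  # (256, 768)
print(patch_stats(256, 32))  # (64, 3072)
print(patch_stats(512, 32))  # (256, 3072)
```

At patch size 32, every token is a 3072-dimensional regression target, which is the regime where the review says predicting high-dimensional noised quantities tends to fail.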

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same direct-prediction strategy could reduce architectural complexity in generative models for other high-dimensional data such as audio or video.
  • Training dynamics might change when the network is explicitly encouraged to map back onto the manifold rather than into the ambient noise space.
  • Model-size requirements for high-resolution generation could be re-examined under the clean-prediction objective.
  • Classical signal-processing denoising ideas may map more directly onto modern diffusion training once the target is restored to clean data.

Load-bearing premise

Natural data lies on a low-dimensional manifold while noised data does not.
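A toy numerical illustration of this premise (our construction, not an experiment from the paper): data confined to a low-dimensional linear subspace concentrates its variance in a few principal components, while its noised version spreads variance across the full ambient space.

```python
import numpy as np

rng = np.random.default_rng(0)
ambient, intrinsic, n = 100, 5, 2000

# "Clean" data: lives on a random 5-dimensional subspace of R^100.
basis = rng.standard_normal((intrinsic, ambient))
clean = rng.standard_normal((n, intrinsic)) @ basis

# "Noised" data: the same points plus full-dimensional Gaussian noise.
noised = clean + rng.standard_normal((n, ambient))

def components_for_90pct(data: np.ndarray) -> int:
    """Principal components needed to capture 90% of the variance."""
    s = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
    ratios = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(ratios), 0.9)) + 1

print(components_for_90pct(clean))   # at most 5: low-dimensional
print(components_for_90pct(noised))  # far more: fills the ambient space
```

This is exactly the kind of intrinsic-dimension check (via PCA) the referee notes the paper does not run on real image data.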

What would settle it

A clean-data-predicting large-patch Transformer that produces visibly worse or incoherent samples than a noise-predicting baseline at 512 resolution on ImageNet would falsify the central claim.

read the original abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard denoising diffusion models predict noise or noised quantities rather than clean data, and that directly predicting clean images is fundamentally different because natural data lies on a low-dimensional manifold while noised quantities do not. This allows simple, under-capacity networks to operate in high-dimensional pixel space. The authors introduce JiT (Just image Transformers): large-patch pixel Transformers trained with no tokenizer, no pre-training, and no extra loss, and report competitive ImageNet results at 256 and 512 resolutions with patch sizes 16 and 32, where noise-prediction baselines fail catastrophically.

Significance. If the results hold, the work demonstrates that a back-to-basics clean-data prediction target can enable competitive generative performance with minimal architectural complexity on raw pixels. This provides an empirical existence proof for simple large-patch Transformers as generative models and highlights the modeling choice of prediction target as potentially more important than tokenization or pre-training in high-dimensional settings.

major comments (2)
  1. [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablation on manifold properties, or comparison of effective dimensionality at training noise levels are provided to ground this premise.
  2. [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID or other metrics cannot be confidently attributed to the manifold-based rationale.
minor comments (2)
  1. [§2] The precise mathematical formulation of the clean-data prediction objective (e.g., the training loss and how it differs from standard noise-prediction diffusion) should be stated explicitly with an equation for reproducibility.
  2. [Tables and figures] Ensure quantitative tables report both patch size and resolution explicitly and include error bars or multiple seeds for the ImageNet 256/512 results to allow direct comparison with noise-prediction baselines.
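Minor comment 1 asks for the objective in equation form. For reference, this is the generic shape such an equation would take in standard diffusion notation (our reconstruction, not the paper's own statement):

```latex
% Forward noising (generic schedule; notation ours):
%   x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
% Noise-prediction objective (standard diffusion training):
\mathcal{L}_{\epsilon}(\theta)
  = \mathbb{E}_{x_0,\, \epsilon,\, t}
    \bigl\| \epsilon_\theta(x_t, t) - \epsilon \bigr\|^2
% Clean-data objective (the target the paper advocates):
\mathcal{L}_{x}(\theta)
  = \mathbb{E}_{x_0,\, \epsilon,\, t}
    \bigl\| x_\theta(x_t, t) - x_0 \bigr\|^2
```

The two losses differ only in the regression target; the paper's actual loss weighting and schedule may differ from this generic form.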

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments below, providing clarifications and indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract and §1] The explanatory link between direct clean-data prediction and success in high-dimensional space rests on the untested manifold assumption (natural images occupy a low-dimensional manifold while noised quantities do not). No intrinsic-dimension estimates (PCA, MLE, or correlation dimension), ablation on manifold properties, or comparison of effective dimensionality at training noise levels are provided to ground this premise.

    Authors: We appreciate this observation. The manifold hypothesis for natural images is a standard assumption in the field, with substantial supporting evidence from prior studies on the low-dimensional structure of image data. Our work builds on this by demonstrating that direct prediction of clean data enables effective modeling in high-dimensional pixel space with simple architectures, in contrast to noise prediction. While we do not provide new intrinsic dimension calculations, the empirical results, particularly the failure of noise prediction at large patch sizes, serve as indirect validation. In the revised manuscript, we will expand the discussion in Section 1 to include references to key literature on image manifolds and clarify the role of this assumption. revision: partial

  2. Referee: [Experiments] The claim that noise/noised-quantity prediction 'fails catastrophically' at large patch sizes while clean prediction succeeds is load-bearing for the central argument, yet the manuscript does not report controlled ablations isolating the prediction target from other factors such as loss geometry, optimization dynamics, or network capacity. Without these, the reported competitive FID or other metrics cannot be confidently attributed to the manifold-based rationale.

    Authors: We agree that careful isolation of variables strengthens the argument. Our experiments compare clean-data prediction (JiT) against noise-prediction baselines using the exact same Transformer architecture, patch sizes, and training protocol on raw pixels, with the only difference being the prediction target. This setup controls for network capacity and largely for optimization dynamics, as the training procedure is identical. The loss geometry is inherently tied to the choice of target, which is the central modeling decision under investigation. We believe this provides sufficient evidence for the importance of the prediction target. However, we will add a note in the experiments section acknowledging potential confounding factors and discussing why the target choice is the primary variable. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical demonstration remains self-contained without reductions to fitted inputs or self-citations.

full rationale

The manuscript advances an empirical claim that direct clean-image prediction with large-patch pixel Transformers yields competitive ImageNet results at 256/512 resolution, without tokenizers or pre-training. The manifold assumption is invoked as an explanatory premise for why this modeling choice succeeds where noise prediction fails, but the paper presents no equations, derivations, or parameter fits that reduce the reported performance to the assumption by construction. No self-citation chains, uniqueness theorems, or ansatzes are used to justify core choices; results are benchmark numbers rather than forced predictions. The derivation chain is therefore independent of its inputs and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about data manifolds; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Natural data lies on a low-dimensional manifold, whereas noised quantities do not.
    Invoked in the abstract to justify why predicting clean data is fundamentally different and advantageous.

pith-pipeline@v0.9.0 · 5504 in / 1122 out tokens · 41294 ms · 2026-05-11T22:10:03.888395+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces.

  • Foundation.JCostCoshIdentity jcost_exp_cosh_form echoes

    Predicting clean data is fundamentally different from predicting noise or a noised quantity.

  • Foundation.DiscretenessForcing continuous_no_isolated_zero_defect echoes

    simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss.

What do these tags mean?
echoes
The paper passage has the same mathematical shape or conceptual pattern as the theorem, but is not a direct formal dependency.
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 7.0

    FluxFlow is a conservative pixel-space flow-matching framework for astronomical super-resolution that incorporates real atmospheric uncertainty and a training-free Wiener correction, outperforming baselines on a new 1...

  2. Binomial flows: Denoising and flow matching for discrete ordinal data

    cs.LG 2026-05 unverdicted novelty 7.0

    Binomial flows close the gap between continuous flow matching and discrete ordinal data by using binomial distributions to enable unified denoising, sampling, and exact likelihoods in diffusion models.

  3. Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement

    cs.CV 2026-04 unverdicted novelty 7.0

    A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

  4. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  5. Coevolving Representations in Joint Image-Feature Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

  6. FARM: Foundational Aerial Radio Map for Intelligent Low-Altitude Networking

    eess.SP 2026-04 unverdicted novelty 7.0

    FARM is a foundation model combining masked autoencoders and diffusion decoders to estimate high-resolution aerial radio maps from a new multi-band low-altitude dataset, claiming superior accuracy and generalization o...

  7. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  8. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  9. BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

  10. Generative climate downscaling enables high-resolution compound risk assessment by preserving multivariate dependencies

    physics.ao-ph 2026-05 unverdicted novelty 6.0

    A multivariate diffusion generative downscaling method preserves inter-variable correlations in climate data under large resolution increases, enabling more accurate compound risk assessment.

  11. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  12. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  13. Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

    cs.CV 2026-05 unverdicted novelty 6.0

    Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...

  14. FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

  15. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  16. A Few-Step Generative Model on Cumulative Flow Maps

    cs.LG 2026-05 unverdicted novelty 6.0

    Cumulative flow maps unify few-step generative modeling for diffusion and flow models via cumulative transport and parameterization with minimal changes to time embeddings and objectives.

  17. High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification

    cs.CV 2026-04 unverdicted novelty 6.0

    MSDiff maps degraded hyperspectral data to a low-dimensional manifold and uses diffusion to regularize features for more robust classification under complex degradations.

  18. CoreFlow: Low-Rank Matrix Generative Models

    cs.LG 2026-04 unverdicted novelty 6.0

    CoreFlow is a low-rank matrix generative model that trains normalizing flows on shared subspaces to improve efficiency and quality for high-dimensional limited-sample data, including incomplete matrices.

  19. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  20. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  21. VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport

    eess.IV 2026-04 unverdicted novelty 6.0

    VOLT is a probabilistic transport method with a 3D anisotropic network that improves wide-field microscopy volume reconstruction in lateral and axial directions while supplying voxel-wise credibility estimates.

  22. Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing

    cs.LG 2026-04 unverdicted novelty 6.0

    RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.

  23. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  24. FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronge...

  25. CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

    cs.CV 2026-04 unverdicted novelty 6.0

    CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.

  26. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  27. From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    LGCD creates pseudo-overlapping user data via LLM reasoning and uses conditional diffusion to generate target-domain user representations for inter-domain sequential recommendation without real overlapping users.

  28. ML-based approach to classification and generation of structured light propagation in turbulent media

    physics.optics 2026-04 unverdicted novelty 6.0

    ML models classify and generate structured light in turbulence using CNNs and diffusion models enhanced by Bregman distance minimization.

  29. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  30. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  31. Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...

  32. FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 5.0

    FluxFlow uses conservative pixel-space flow-matching with uncertainty weights and Wiener test-time correction to outperform baselines on photometric and scientific accuracy for ground-to-space super-resolution, valida...

  33. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  34. Scaling Properties of Continuous Diffusion Spoken Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

  35. UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement

    cs.CV 2026-04 unverdicted novelty 5.0

    UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.

  36. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  37. PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction

    cs.AI 2026-04 unverdicted novelty 5.0

    PoreDiT generates 1024^3 voxel digital rock models via 3D Swin Transformer binary pore-field prediction, matching prior methods on porosity, permeability, and Euler characteristics while running on consumer hardware.

  38. Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification

    eess.SY 2026-04 unverdicted novelty 4.0

    Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.

  39. NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The second NTIRE challenge on day and night raindrop removal for dual-focused images received 17 valid team submissions that demonstrated strong performance on the Raindrop Clarity dataset.

  40. NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge reports strong performance from 17 teams on raindrop removal for dual-focused day and night images using an adjusted real-world dataset with 14,139 training images.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 38 Pith papers · 6 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023.

  2. [2]

    Deep variational information bottleneck

    Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.

  3. [3]

    Topology and data

    Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

  4. [4]

    Semi-Supervised Learning

    Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.

  5. [5]

    Neural ordinary differential equations

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.

  6. [6]

    PixelFlow: Pixel-space generative models with flow

    Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. PixelFlow: Pixel-space generative models with flow. arXiv:2504.07963, 2025.

  7. [7]

    On the importance of noise scheduling for diffusion models

    Ting Chen. On the importance of noise scheduling for diffusion models. arXiv:2301.10972, 2023.

  8. [8]

    Deconstructing denoising diffusion models for self-supervised learning

    Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He. Deconstructing denoising diffusion models for self-supervised learning. In ICLR, 2025.

  9. [9]

    Image denoising by sparse 3-D transform-domain collaborative filtering

    Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.

  10. [10]

    Inversion by direct iteration: An alternative to denoising diffusion for image restoration

    Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023.

  11. [11]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

  12. [12]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.

  13. [13]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  14. [14]

    Image denoising via sparse and redundant representations over learned dictionaries

    Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

  15. [15]

    Scaling rectified flow Transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow Transformers for high-resolution image synthesis. In ICML, 2024.

  16. [16]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InNeurIPS, 2014

  17. [17]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Doll ´ar, Ross Girshick, Pieter Noord- huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour.arXiv:1706.02677, 2017

  18. [18] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv:2509.24527, 2025.

  19. [19] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for Transformers. In Findings of EMNLP, 2020.

  20. [20] Karl Heun. Neue Methoden zur approximativen Integration der Differentialgleichungen einer unabhängigen Veränderlichen. Z. Math. Phys., 45:23–38, 1900.

  21. [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

  22. [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshops, 2021.

  23. [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

  24. [24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. DDPM GitHub repo, diffusion_utils_2.py, L155, 2020.

  25. [25] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, 2023.

  26. [26] Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. In CVPR, 2025.

  27. [27] Ahmed Imtiaz Humayun, Ibtihel Amara, Cristina Vasconcelos, Deepak Ramachandran, Candice Schumann, Junfeng He, Katherine Heller, Golnoosh Farnadi, Negar Rostamzadeh, and Mohammad Havaei. What secrets do your manifolds hold? Understanding the local geometry of generative models. In ICLR, 2025.

  28. [28] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In ICML, 2023.

  29. [29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.

  30. [30] Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In NeurIPS, 2023.

  31. [31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

  32. [32] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019.

  33. [33] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024.

  34. [34] Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end-to-end pixel space generative modeling via self-supervised pre-training. arXiv:2510.12586, 2025.

  35. [35] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024.

  36. [36] Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. arXiv:2502.17437, 2025.

  37. [37] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In ICLR, 2023.

  38. [38] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.

  39. [39] Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L Caterini, and Jesse C Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research, 2024.

  40. [40] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant Transformers. In ECCV, 2024.

  41. [41] Alireza Makhzani and Brendan Frey. k-Sparse autoencoders. arXiv:1312.5663, 2013.

  42. [42] Peyman Milanfar and Mauricio Delbracio. Denoising: a powerful building block for imaging, inverse problems and machine learning. Philosophical Transactions A, 383(2299):20240326, 2025.

  43. [43] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  44. [44] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.

  45. [45] Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2023.

  46. [46] William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023.

  47. [47] Javier Portilla, Vasily Strela, Martin J Wainwright, and Eero P Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, 2003.

  48. [48] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.

  49. [49] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  50. [50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

  51. [51] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

  52. [52] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.

  53. [53] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.

  54. [54] Noam Shazeer. GLU variants improve Transformer. arXiv:2002.05202, 2020.

  55. [55] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv:2510.15301, 2025.

  56. [56] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

  57. [57] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

  58. [58] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.

  59. [59] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.

  60. [60] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

  61. [61] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

  62. [62] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

  63. [63] Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

  64. [64] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv:physics/0004057, 2000.

  65. [65] Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: an autoregressive generative model of raw images and text. In ICLR, 2025.

  66. [66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

  67. [67] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

  68. [68] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.

  69. [69] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.

  70. [70] Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. PixNerd: Pixel neural field diffusion. arXiv:2507.23268, 2025.

  71. [71] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion Transformer. arXiv:2504.05741, 2025.

  72. [72] Yutong Xie, Minne Yuan, Bin Dong, and Quanzheng Li. Diffusion model for generative image denoising. In ICCV, 2023.

  73. [73] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.

  74. [74] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion Transformers is easier than you think. In ICLR, 2025.

  75. [75] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.

  76. [77] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  77. [78] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion Transformers with representation autoencoders. arXiv:2510.11690, 2025.

  78. [79] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In ICCV, 2011.

  [Appendix residue: ImageNet class labels, likely from a figure legend] class 012: house finch, linnet, Carpodacus mexicanus; class 014: indigo bunting, indigo finch, indigo bird, Passerina cyanea; class 042: agama; class 081: ptarmigan; class 107: jellyfish; class 108: sea anemone, anemone; class 110: flatworm, ...