Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li, Kaiming He · 2025 · cs.CV · arXiv 2511.13720

Mixed citation behavior. Most common role is background (62%).

76 Pith papers citing it

Background 62% of classified citations

abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
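The distinction the abstract draws between predicting clean data and predicting a noised quantity can be made concrete with a small sketch. It assumes a linear interpolation z_t = t*x + (1 - t)*eps between clean data x and Gaussian noise eps; that schedule and the helper names below are illustrative assumptions, not taken from this page. Under that assumption, a clean-data prediction can be converted algebraically into a noise or velocity target, so the choice concerns what the network outputs rather than what the sampler can use.

```python
import numpy as np

def noisy_sample(x, eps, t):
    """Mix clean data and noise: z_t = t*x + (1 - t)*eps (assumed schedule)."""
    return t * x + (1.0 - t) * eps

def eps_from_x_pred(z_t, x_pred, t):
    """Recover the implied noise from a clean-data prediction (requires t < 1)."""
    return (z_t - t * x_pred) / (1.0 - t)

def v_from_x_pred(z_t, x_pred, t):
    """Recover the implied velocity v = x - eps from a clean-data prediction."""
    return (x_pred - z_t) / (1.0 - t)

# Toy check with random arrays standing in for images and a perfect predictor.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # stand-in for clean data
eps = rng.standard_normal((4, 8))  # Gaussian noise
t = 0.3
z_t = noisy_sample(x, eps, t)

eps_rec = eps_from_x_pred(z_t, x, t)  # equals eps when x_pred is exact
v_rec = v_from_x_pred(z_t, x, t)      # equals x - eps when x_pred is exact
```

Both conversions divide by (1 - t), so in practice they would need care as t approaches 1; the sketch leaves that out.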

citation-role summary

background: 12 · method: 3 · baseline: 1

citation-polarity summary

background: 10 · use method: 3 · support: 2 · baseline: 1

representative citing papers

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.

Masked Diffusion Decoding as $x$-Prediction Flow

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

Masked diffusion LMs can use continuous x-prediction flow with token-wise asynchronous updates and an RL policy network to reach 97% performance on HumanEval using only 25% of the usual decoding budget.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Attention in minimal transformers under corruption performs in-context empirical Bayes via a single kernel-weighted posterior mean step followed by depth-driven particle dynamics refinement.

Let EEG Models Learn EEG

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

JET is a conditional flow matching framework that generates EEG as continuous raw sequences with added constraints for spectral and temporal properties, achieving over 40% lower TS-FID than prior discrete denoising methods on three benchmarks.

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

cs.GR · 2026-05-19 · unverdicted · novelty 7.0

Proposes discretized Matérn process noise for triangulation-agnostic flow matching on meshes with PoissonNet denoiser, tested on elastic states and humanoid poses for meshes exceeding one million triangles.

Binomial flows: Denoising and flow matching for discrete ordinal data

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

Binomial flows close the gap between continuous flow matching and discrete ordinal data by using binomial distributions to enable unified denoising, sampling, and exact likelihoods in diffusion models.

Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives than grid-aligned methods.

Latent Generative Solvers for Generalizable Long-Term Physics Simulation

cs.AI · 2026-02-11 · unverdicted · novelty 7.0

LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

q-bio.QM · 2026-06-11 · unverdicted · novelty 6.0

OCOO-T is a flow-matching Transformer model that directly denoises continuous gene expression profiles to predict transcriptional responses to perturbations and reports state-of-the-art results on Tahoe100M, Replogle, and PBMC benchmarks.

GPIC: A Giant Permissive Image Corpus for Visual Generation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.

Colored Noise Diffusion Sampling

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

CNS is a plug-and-play stochastic sampler for diffusion models that uses timestep- and frequency-dependent colored noise to allocate energy to unresolved bands, producing lower FID scores than standard ODE/SDE baselines on ImageNet-256.

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

DiscoForcing introduces a causal diffusion-forcing model with a hybrid temporal schedule for stable real-time audio-to-motion generation under abrupt audio changes.

JLT: Clean-Latent Prediction in Latent Diffusion Transformers

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

JLT shows clean-latent prediction outperforms velocity prediction in a matched latent diffusion Transformer, reaching FID-50K 2.50 on ImageNet 256x256.

Learning Energy-Based Models from Stochastic Interpolants using Spatiotemporal Differences

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

stNCE learns the energy of the joint density over data and time via spatiotemporal differences, unifies prior methods, and reports competitive performance on image and molecule density estimation.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claimed better fidelity than cascaded baselines.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

HDFM adds a continuous heat-dissipation (blur) process to flow matching, aligns an interpolated path to fix ill-posed inverse heat dissipation, and uses x-prediction to ease high-dimensional regression, yielding better performance than most baselines on image datasets.
