hub Mixed citations

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li, Kaiming He · 2025 · cs.CV · arXiv 2511.13720

Mixed citation behavior. Most common role is background (62%).

90 Pith papers citing it

Background 62% of classified citations

abstract

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
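The distinction the abstract draws between predicting clean data and predicting a noised quantity can be illustrated with a small sketch. It assumes a linear interpolation z = t·x + (1 − t)·ε between noise (t = 0) and clean data (t = 1). That is a common flow-matching convention, and the abstract itself does not state the schedule. The function names are hypothetical. Under this assumption, a clean-data prediction determines the implied noise and velocity algebraically, so the choice of target changes what the network must output, not what the sampler can recover.

```python
import random

def noisy_sample(x, eps, t):
    """Assumed schedule: z = t * x + (1 - t) * eps, with t = 1 being clean data."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]

def eps_from_x(z, x_hat, t):
    """Noise implied by a clean-data prediction x_hat (requires t < 1)."""
    return [(zi - t * xi) / (1.0 - t) for zi, xi in zip(z, x_hat)]

def v_from_x(z, x_hat, t):
    """Velocity v = x - eps implied by a clean-data prediction x_hat (requires t < 1)."""
    return [(xi - zi) / (1.0 - t) for zi, xi in zip(z, x_hat)]

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(8)]   # stand-in for clean data
eps = [random.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise
t = 0.3
z = noisy_sample(x, eps, t)

# With a perfect clean-data prediction (x_hat = x), the conversions
# recover the true noise and the true velocity x - eps.
eps_rec = eps_from_x(z, x, t)
v_rec = v_from_x(z, x, t)
```

Both conversions divide by (1 − t), so an error in the clean-data prediction is amplified in the derived noise or velocity as t approaches 1. This toy example only shows the algebra and says nothing about which target is easier for a network to learn.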

citation-role summary

background: 12 · method: 3 · baseline: 1

citation-polarity summary

background: 10 · use method: 3 · support: 2 · baseline: 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.

Masked Diffusion Decoding as $x$-Prediction Flow

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

Masked diffusion LMs can use continuous x-prediction flow with token-wise asynchronous updates and an RL policy network to reach 97% performance on HumanEval using only 25% of the usual decoding budget.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

Reinforcement Learning for Flow-Matching Policies with Density Transport

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

RLDT fine-tunes pretrained flow-matching policies for continuous control by aligning them to a max-entropy RL transport field constructed via SVGD, using expected-target estimation for stable multi-step updates.

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

STREAM applies stochastic Riemannian flow matching on VFM-derived unit hypersphere latents with a novel anisotropic decoder to achieve SOTA reconstruction and generation on breast and colorectal cancer histopathology datasets.

Complexity-Balanced Diffusion Splitting

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

CBS partitions the diffusion timeline into segments of equal approximation burden via Dirichlet energy and trajectory acceleration monitors estimated by an auxiliary model, yielding higher synthesis quality at fixed per-step cost across SiT, JiT and UNet backbones.

MedSyn2: Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

MedSyn2 generates controllable high-resolution 3D CT volumes using optional text prompts and partial semantic segmentation masks via a modified diffusion transformer with gated attention.

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Attention in minimal transformers under corruption performs in-context empirical Bayes via a single kernel-weighted posterior mean step followed by depth-driven particle dynamics refinement.

Let EEG Models Learn EEG

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

JET is a conditional flow matching framework that generates EEG as continuous raw sequences with added constraints for spectral and temporal properties, achieving over 40% lower TS-FID than prior discrete denoising methods on three benchmarks.

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.

Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

cs.GR · 2026-05-19 · unverdicted · novelty 7.0

Proposes discretized Matérn process noise for triangulation-agnostic flow matching on meshes with PoissonNet denoiser, tested on elastic states and humanoid poses for meshes exceeding one million triangles.

Binomial flows: Denoising and flow matching for discrete ordinal data

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

Binomial flows close the gap between continuous flow matching and discrete ordinal data by using binomial distributions to enable unified denoising, sampling, and exact likelihoods in diffusion models.

Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives than grid-aligned methods.

Latent Generative Solvers for Generalizable Long-Term Physics Simulation

cs.AI · 2026-02-11 · unverdicted · novelty 7.0

LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.

AutoSpeed: Annotation-Free Stage-Adaptive Motion Speed Learning for Robot Manipulation

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

AutoSpeed optimizes visuomotor policies over candidate trajectories at varying speeds using a composite cost of prediction error versus horizon length, with DCT-based modulation, yielding shorter execution times and higher success rates while producing speeds that align with task stages.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

q-bio.QM · 2026-06-11 · unverdicted · novelty 6.0

OCOO-T is a flow-matching Transformer model that directly denoises continuous gene expression profiles to predict transcriptional responses to perturbations and reports state-of-the-art results on Tahoe100M, Replogle, and PBMC benchmarks.

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

cs.CV · 2026-06-07 · unverdicted · novelty 6.0

CSFlow derives inference-time timestep weights for flow matching by matching per-step frequency content to human CSF, yielding 4.7% FID reduction and smaller gains on IS and GenEval.

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

DRIFT adapts pretrained VLMs to continuous decoding via a base predictor plus residual flow matching, outperforming regression and generative baselines on grounding and robotic control tasks.

Representation Forcing for Bottleneck-Free Unified Multimodal Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.

citing papers explorer

Showing 40 of 90 citing papers.

VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport eess.IV · 2026-04-20 · unverdicted · none · ref 45 · internal anchor
VOLT is a probabilistic transport method with a 3D anisotropic network that improves wide-field microscopy volume reconstruction in lateral and axial directions while supplying voxel-wise credibility estimates.
Cross-Modal Generation: From Commodity WiFi to High-Fidelity mmWave and RFID Sensing cs.LG · 2026-04-17 · unverdicted · none · ref 22 · internal anchor
RF-CMG synthesizes high-quality mmWave and RFID signals from WiFi using a diffusion model with Modality-Guided Embedding for high-frequency details and Low-Frequency Modality Consistency to preserve physical structure.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 32 · internal anchor
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
FeaXDrive: Feasibility-aware Trajectory-Centric Diffusion Planning for End-to-End Autonomous Driving cs.RO · 2026-04-14 · unverdicted · none · ref 9 · internal anchor
FeaXDrive improves end-to-end autonomous driving by shifting diffusion planning to a trajectory-centric formulation with curvature-constrained training, drivable-area guidance, and GRPO post-training, yielding stronger closed-loop performance and feasibility on NAVSIM.
CoD-Lite: Real-Time Diffusion-Based Generative Image Compression cs.CV · 2026-04-14 · unverdicted · none · ref 10 · internal anchor
CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 37 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
From Clues to Generation: Language-Guided Conditional Diffusion for Cross-Domain Recommendation cs.IR · 2026-04-07 · unverdicted · none · ref 28 · internal anchor
LGCD creates pseudo-overlapping user data via LLM reasoning and uses conditional diffusion to generate target-domain user representations for inter-domain sequential recommendation without real overlapping users.
FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation cs.CL · 2026-04-06 · unverdicted · none · ref 8 · internal anchor
FlowLM converts diffusion LMs to flow matching via fine-tuning, achieving few-step generation that rivals or beats 2000-step diffusion and saturates faster than training flow models from scratch.
ML-based approach to classification and generation of structured light propagation in turbulent media physics.optics · 2026-04-04 · unverdicted · none · ref 31 · internal anchor
ML models classify and generate structured light in turbulence using CNNs and diffusion models enhanced by Bregman distance minimization.
What Does Flow Matching Bring To TD Learning? cs.LG · 2026-03-04 · conditional · none · ref 33 · internal anchor
Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving cs.RO · 2026-02-26 · unverdicted · none · ref 27 · internal anchor
The paper introduces Hyper Diffusion Planner (HDP), a diffusion-based E2E AD framework that identifies insights on loss space, trajectory representation and data scaling, adds RL post-training, and reports 10x performance gains over 200 km of real-world testing across 6 scenarios.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising cs.CL · 2026-02-18 · conditional · none · ref 42 · 2 links · internal anchor
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning cs.CV · 2026-02-11 · unverdicted · none · ref 26 · internal anchor
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
Protein Autoregressive Modeling via Multiscale Structure Generation cs.LG · 2026-02-04 · unverdicted · none · ref 29 · internal anchor
PAR is a multi-scale autoregressive transformer framework for protein backbone generation that uses coarse-to-fine prediction, noisy context learning, and flow-based decoding to achieve high-quality unconditional and zero-shot conditional outputs.
Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design cs.LG · 2026-02-04 · conditional · none · ref 12 · internal anchor
An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.
PixelGen: Improving Pixel Diffusion with Perceptual Supervision cs.CV · 2026-02-02 · accept · none · ref 11 · internal anchor
PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
Sampling-Free Diffusion Transformers for Low-Complexity MIMO Channel Estimation eess.SP · 2026-02-02 · unverdicted · none · ref 14 · internal anchor
A diffusion transformer directly maps noisy MIMO channel observations to clean estimates in a single pass by exploiting angular sparsity, achieving better accuracy and much lower complexity than iterative diffusion baselines.
PixelDiT: Pixel Diffusion Transformers for Image Generation cs.CV · 2025-11-25 · conditional · none · ref 16 · internal anchor
PixelDiT generates images in pixel space with a dual-level transformer and reaches 1.61 FID on ImageNet 256, outperforming prior pixel-space models.
Not All Prediction Targets Keep Training-Free Diffusion Guidance on the Manifold cs.CV · 2026-07-01 · unverdicted · none · ref 30 · internal anchor
x-prediction maintains manifold adherence during training-free diffusion guidance better than ε- or v-prediction, per theoretical analysis and experiments on bird classification and style transfer.
Learning Climate Variability from Scarce Data with Diffusion Models: A Test Case for ENSO physics.ao-ph · 2026-06-25 · unverdicted · none · ref 33 · internal anchor
Diffusion models recover known ENSO variability structure from synthetic LIM data when given enough samples, but require pre-training on CMIP6 plus fine-tuning to match observations with the ~700 samples available in ERSSTv5.
E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation cs.LG · 2026-06-01 · unverdicted · none · ref 24 · internal anchor
E4GEN is an explainable diffusion model using E-Activator, E-Predictor, and E-Control for extreme-event-aware time-series generation evaluated on six datasets.
Wall-OSS-0.5 Technical Report cs.RO · 2026-05-29 · unverdicted · none · ref 31 · internal anchor
Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.
Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent cs.LG · 2026-05-20 · unverdicted · none · ref 29 · 2 links · internal anchor
Stochastic MeanFlow Policies enable one-step generative control in off-policy mirror descent by mapping noise through a MeanFlow transform, yielding tractable entropy and improved MuJoCo performance over Gaussian and generative baselines.
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset cs.CV · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion cs.CV · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.
Nano World Models: A Minimalist Implementation of Future Video Prediction cs.CV · 2026-05-17 · unverdicted · none · ref 16 · internal anchor
Nano World Models supplies a unified minimalist codebase and evaluation framework for studying diffusion forcing in video prediction across control, games, and robot domains.
HDRFace: Rethinking Face Restoration with High-Dimensional Representation cs.CV · 2026-05-14 · unverdicted · none · ref 16 · internal anchor
HDRFace injects high-dimensional facial features from low-quality and intermediate images into diffusion models via SDFM fusion, reporting gains on SD V2.1-base and Qwen-Image.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 69 · internal anchor
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation physics.ins-det · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 85 · internal anchor
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
FluxFlow: Conservative Flow-Matching for Astronomical Image Super-Resolution cs.CV · 2026-05-05 · unverdicted · none · ref 8 · 2 links · internal anchor
FluxFlow uses conservative pixel-space flow-matching with uncertainty weights and Wiener test-time correction to outperform baselines on photometric and scientific accuracy for ground-to-space super-resolution, validated on a new real 19,500-pair DESI-HST dataset.
Scaling Properties of Continuous Diffusion Spoken Language Models cs.CL · 2026-04-27 · unverdicted · none · ref 59 · internal anchor
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement cs.CV · 2026-04-20 · unverdicted · none · ref 20 · internal anchor
UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework cs.CV · 2026-04-16 · unverdicted · none · ref 26 · internal anchor
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
PoreDiT: A Scalable Generative Model for Large-Scale Digital Rock Reconstruction cs.AI · 2026-04-11 · unverdicted · none · ref 23 · internal anchor
PoreDiT generates 1024^3 voxel digital rock models via 3D Swin Transformer binary pore-field prediction, matching prior methods on porosity, permeability, and Euler characteristics while running on consumer hardware.
Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling astro-ph.IM · 2026-05-17 · unverdicted · none · ref 60 · internal anchor
One-step pixel-MeanFlow models recover key galaxy morphology statistics at orders-of-magnitude lower computational cost than standard DDPM sampling while remaining weaker on fine-grained structure.
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion cs.CV · 2026-05-15 · unverdicted · none · ref 8 · 2 links · internal anchor
HyperDiT reports FID 1.56 on ImageNet 256x256 using hyper-connected cross-scale attention, SA-RoPE, and VFM registers in pixel space.
Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification eess.SY · 2026-04-19 · unverdicted · none · ref 12 · internal anchor
Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results cs.CV · 2026-04-12 · unverdicted · none · ref 35 · 2 links · internal anchor
The NTIRE 2026 challenge reports strong performance from 17 teams on raindrop removal for dual-focused day and night images using an adjusted real-world dataset with 14,139 training images.
FARM: Foundational Aerial Radio Map for Intelligent Low-Altitude Networking eess.SP · 2026-04-19 · unreviewed · ref 33 · internal anchor
