WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
hub
On the importance of noise scheduling for diffu- sion models
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SKILD unifies unconditional image generation and continuous super-resolution in one diffusion model via scale-invariant k-space dynamics where the reverse process handles both tasks by varying only the starting timestep.
Proposes generative pseudo-force fields trained on quadratic pseudo-potentials from noisy equilibria as a time-step-agnostic diffusion variant for efficient molecular conformation generation with high validity on QM9.
Derives a conditional-marginal entropy-rate objective for bridge-aware discretization that yields U-shaped schedules and improves low-NFE sample quality on 2D, CIFAR-10, and protein tasks.
Non-monotonic sampling schedules never improve upon monotonic baselines in diffusion models, with performance gaps ranging from substantial to negligible depending on the denoiser.
Proposes an advection-diffusion PDE corruption process with stochastic velocity fields and Lattice Boltzmann solver for diffusion models, generalizing prior PDE methods.
A generative model produces realistic and coherent 360 panoramic videos from in-the-wild perspective videos via curated online data and geometry-motion aware operations.
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
Proposes CFRG noise schedule for diffusion models that assigns larger noises to low-frequency classes to improve generation on imbalanced datasets.
WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
NoiseShift learns a resolution-specific mapping from scheduler noise to conditioning noise via lightweight calibration to restore consistency and improve low-resolution generation quality in models like SD3 and Flux.
Zero123++ produces high-quality 3D-consistent multi-view images from a single input by fine-tuning Stable Diffusion with targeted conditioning and training methods.
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
citing papers explorer
-
Is Monotonic Sampling Necessary in Diffusion Models?
Non-monotonic sampling schedules never improve upon monotonic baselines in diffusion models, with performance gaps ranging from substantial to negligible depending on the denoiser.
-
Back to Basics: Let Denoising Generative Models Denoise
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.