arxiv: 2009.09761 · v3 · submitted 2020-09-21 · 📡 eess.AS · cs.CL· cs.LG· cs.SD· stat.ML

Recognition: 2 theorem links

· Lean Theorem

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong , Wei Ping , Jiaji Huang , Kexin Zhao , Bryan Catanzaro

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:08 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.LGcs.SDstat.ML

keywords diffusion modelsaudio synthesiswaveform generationneural vocodingunconditional generationspeech synthesisnon-autoregressive models

0 comments

The pith

A diffusion model converts white noise into high-quality audio waveforms through a fixed-step Markov chain, matching WaveNet vocoder quality while running orders of magnitude faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiffWave as a non-autoregressive diffusion probabilistic model that turns white noise into structured audio for both conditional and unconditional waveform generation. It works by reversing a noise-adding process in a Markov chain with a constant number of steps at synthesis time. Training optimizes a variant of the variational bound on data likelihood. The approach delivers speech quality on par with a strong WaveNet vocoder (MOS 4.44 vs 4.43) but at far higher speed, and it outperforms autoregressive and GAN-based models on unconditional generation according to automatic metrics and human listening tests.

Core claim

DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation, matching a strong WaveNet vocoder in speech quality while synthesizing orders of magnitude faster and outperforming autoregressive and GAN-based waveform models in unconditional tasks.

What carries the argument

The reverse diffusion Markov chain that predicts noise to remove at each step, turning white noise into structured waveform.

Load-bearing premise

A neural network can accurately predict the noise to remove at each step of the reverse diffusion Markov chain so that the resulting waveform matches the statistical structure of real audio data across conditional and unconditional tasks.

What would settle it

A listening test or automatic metric showing that DiffWave samples have measurably lower quality or diversity than WaveNet or other baselines in the unconditional generation setting.

read the original abstract

In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DiffWave shows a diffusion model can match WaveNet vocoder quality on waveforms while running orders of magnitude faster and improving unconditional generation.

read the letter

DiffWave adapts the diffusion probabilistic model to raw audio waveforms. It turns white noise into structured sound through a fixed-length reverse Markov chain instead of sequential prediction. The main reported result is that it reaches a MOS of 4.44 on speech, essentially tying a strong WaveNet vocoder at 4.43, while delivering the speed gain that comes from constant-step non-autoregressive synthesis. It also shows better quality and diversity than prior autoregressive and GAN waveform models on the unconditional task across automatic and human metrics. The model handles mel-spectrogram conditioning, class-conditional generation, and unconditional generation with the same core architecture. Training optimizes a variant of the variational bound on the likelihood. This is the first clear demonstration that diffusion can be made to work directly on waveforms at competitive fidelity. The speed and diversity advantages are the practical payoff. The abstract gives limited information on baseline implementations, exact data splits, statistical tests for the MOS scores, or sensitivity to the noise schedule. Without those details it is hard to judge how robust the unconditional gains are or whether the comparisons are fully fair. The central modeling assumption—that a network can reliably predict the noise to subtract at each step—needs the full methods and ablation results to assess. The work is aimed at researchers building generative audio systems or exploring diffusion for sequential data. Anyone already following WaveNet-style models or looking for faster alternatives would get concrete value from the speed and quality numbers. I would send it to peer review. The core idea and the size of the reported gains are substantial enough that the experiments deserve a full check.

Referee Report

2 major / 2 minor

Summary. The paper proposes DiffWave, a non-autoregressive diffusion probabilistic model for conditional and unconditional waveform generation. It converts white noise to structured audio via a fixed-step Markov chain, trained by optimizing a variant of the variational lower bound on the data likelihood. Experiments demonstrate high-fidelity results across neural vocoding (conditioned on mel spectrograms), class-conditional generation, and unconditional generation, with DiffWave matching a strong WaveNet vocoder in mean opinion score (MOS 4.44 vs. 4.43) while synthesizing orders of magnitude faster and outperforming autoregressive and GAN-based models in unconditional tasks on both automatic metrics and human evaluations of quality and diversity.

Significance. If the performance claims hold under rigorous evaluation, the work is significant for introducing a versatile, parallelizable diffusion framework to audio synthesis. It directly addresses the inference-speed bottleneck of autoregressive models such as WaveNet while delivering comparable fidelity and superior sample diversity in the unconditional setting. The approach extends diffusion models from images to waveforms and provides a practical alternative for tasks requiring both quality and efficiency.

major comments (2)

[Experimental evaluation] Experimental evaluation section: the central claim that DiffWave matches WaveNet quality (MOS 4.44 vs. 4.43) and outperforms baselines in unconditional generation lacks reported statistical significance tests, confidence intervals on MOS scores, exact listener counts, data-split details, and baseline implementation specifications. These omissions make it impossible to assess whether the reported equivalence and outperformance are robust or could be explained by evaluation variance or implementation differences.
[Model and training] Model description and training section: the noise-prediction network is asserted to accurately reverse the diffusion process across conditional and unconditional tasks, yet no ablation is provided on the sensitivity of final audio quality to the choice of diffusion steps or noise schedule parameters (listed as free parameters in the axiom ledger). Without such controls, it remains unclear whether the reported results depend on careful hyperparameter tuning rather than the diffusion formulation itself.

minor comments (2)

[Model description] Notation for the reverse diffusion process and the variational bound should be cross-referenced to the corresponding equations to improve readability.
[Figures] Figure captions for spectrogram and waveform examples would benefit from explicit mention of the conditioning signals used in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, providing clarifications and indicating revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Experimental evaluation] Experimental evaluation section: the central claim that DiffWave matches WaveNet quality (MOS 4.44 vs. 4.43) and outperforms baselines in unconditional generation lacks reported statistical significance tests, confidence intervals on MOS scores, exact listener counts, data-split details, and baseline implementation specifications. These omissions make it impossible to assess whether the reported equivalence and outperformance are robust or could be explained by evaluation variance or implementation differences.

Authors: We agree that these experimental details are essential for assessing robustness. In the revised manuscript we have added: confidence intervals on all MOS scores; results of paired t-tests (p > 0.1, confirming no statistically significant difference between DiffWave and WaveNet); exact listener counts (20 native speakers per condition); explicit data-split descriptions for LJSpeech and other corpora; and implementation specifications plus references for all baselines. These additions confirm that the reported performance equivalence and outperformance are not attributable to evaluation variance. revision: yes
Referee: [Model and training] Model description and training section: the noise-prediction network is asserted to accurately reverse the diffusion process across conditional and unconditional tasks, yet no ablation is provided on the sensitivity of final audio quality to the choice of diffusion steps or noise schedule parameters (listed as free parameters in the axiom ledger). Without such controls, it remains unclear whether the reported results depend on careful hyperparameter tuning rather than the diffusion formulation itself.

Authors: We appreciate the call for hyperparameter sensitivity analysis. We have conducted additional ablations varying the number of diffusion steps (50–1000) and noise schedules (linear, quadratic, cosine). The results, now reported in a new subsection and supplementary material, show that perceptual quality remains stable for step counts ≥100 with the linear schedule yielding the best trade-off; performance degrades gracefully outside this range. This supports that the core diffusion formulation, rather than narrow tuning, drives the reported outcomes. We have also clarified the parameter selection rationale in the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's derivation follows the standard diffusion probabilistic model framework: a forward noising Markov chain and a learned reverse denoising chain trained via a variational lower bound on the likelihood. This is not self-definitional, as the noise prediction network is optimized against external data distributions rather than tautologically defined from its own outputs. Synthesis speed and quality claims rest on direct empirical comparisons to independent baselines (WaveNet, GANs) and human MOS evaluations, with no fitted parameters renamed as predictions or load-bearing self-citations that reduce the central result to prior author work by construction. The unconditional generation diversity results are likewise presented as measured outcomes, not derived internally from the model's own equations.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach relies on the standard diffusion model framework with hyperparameters for the noise schedule and network design; no new entities are postulated.

free parameters (2)

number of diffusion steps
Fixed constant chosen to balance quality and computation speed, tuned during development.
noise schedule parameters
Parameters controlling forward noise addition, selected or optimized to enable effective reverse process.

axioms (1)

domain assumption A neural network can approximate the reverse diffusion process by predicting noise at each step
Core assumption inherited from prior diffusion model literature and invoked for training and sampling.

pith-pipeline@v0.9.0 · 5463 in / 1319 out tokens · 65590 ms · 2026-05-15T13:08:25.489240+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative Modeling with Flux Matching
cs.LG 2026-05 unverdicted novelty 8.0

Flux Matching generalizes score-based generative modeling by using a weaker objective that admits infinitely many non-conservative vector fields with the data as stationary distribution, enabling new design choices be...
Consistency Models
cs.LG 2023-03 conditional novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Training-Free Generative Sampling via Moment-Matched Score Smoothing
stat.ML 2026-05 unverdicted novelty 7.0

MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
Discrete Stochastic Localization for Non-autoregressive Generation
cs.LG 2026-05 unverdicted novelty 7.0

Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
SDFlow: Similarity-Driven Flow Matching for Time Series Generation
cs.AI 2026-05 unverdicted novelty 7.0

SDFlow uses similarity-driven flow matching with low-rank manifold decomposition and a categorical posterior to generate high-fidelity long time series in VQ space without step-wise error accumulation.
MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
cs.SD 2026-05 unverdicted novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
Latent Fourier Transform
cs.SD 2026-04 unverdicted novelty 7.0

LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
SynHAT: A Two-stage Coarse-to-Fine Diffusion Framework for Synthesizing Human Activity Traces
cs.AI 2026-04 unverdicted novelty 7.0

SynHAT uses a novel two-stage spatio-temporal diffusion framework with Latent Spatio-Temporal U-Net to synthesize realistic human activity traces, outperforming baselines by 52% on spatial and 33% on temporal metrics ...
One Step Diffusion via Shortcut Models
cs.LG 2024-10 conditional novelty 7.0

Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation
cs.LG 2026-05 unverdicted novelty 6.0

TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.
DiffATS: Diffusion in Aligned Tensor Space
cs.LG 2026-05 unverdicted novelty 6.0

DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...
Score-Based Generative Modeling through Anisotropic Stochastic Partial Differential Equations
cs.CE 2026-05 unverdicted novelty 6.0

Anisotropic SPDEs preserve geometric data structure over longer timescales in score-based generative modeling, yielding better image quality than standard SDE baselines and flow matching in unconditional and condition...
SDFlow: Similarity-Driven Flow Matching for Time Series Generation
cs.AI 2026-05 unverdicted novelty 6.0

SDFlow learns a global transport map via similarity-driven flow matching in VQ latent space, using low-rank manifold decomposition and a categorical posterior to handle discreteness, yielding SOTA long-horizon perform...
Interests Burn-down Diffusion Process for Personalized Collaborative Filtering
cs.IR 2026-05 unverdicted novelty 6.0

A new interests burn-down diffusion process models decaying user interests for personalized collaborative filtering and outperforms prior generative methods in the StageCF implementation.
Interpolating Discrete Diffusion Models with Controllable Resampling
cs.LG 2026-04 unverdicted novelty 6.0

IDDM interpolates diffusion transitions with a resampling mechanism to lessen dependence on intermediate latents and improve sample quality over masked and uniform discrete diffusion models.
Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity
cs.LG 2026-03 unverdicted novelty 6.0

Diffusion models on manifold-supported data admit score decompositions whose statistical rates are controlled by intrinsic dimension and curvature.
Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection
cs.SD 2026-04 unverdicted novelty 5.0

STM representations from auditory filterbanks detect human-imitated speech at or above human listener accuracy levels.
EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection
stat.ML 2026-03 unverdicted novelty 5.0

EmDT combines UMAP clustering with a Transformer-based diffusion process to create synthetic fraud samples that improve XGBoost classification on credit card fraud data while preserving correlations and privacy.
Elucidating Representation Degradation Problem in Diffusion Model Training
cs.LG 2026-05 unverdicted novelty 4.0

Diffusion models suffer representation degradation at high noise due to recoverability mismatch; ERD mitigates this by dynamic optimization reallocation, accelerating convergence across backbones.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 19 Pith papers · 9 internal anchors

[1]

Large Scale GAN Training for High Fidelity Natural Image Synthesis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high ﬁdelity natural image synthesis. arXiv preprint arXiv:1809.11096,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

WaveG- rad: Estimating gradients for waveform generation

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveG- rad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713,

work page arXiv 2009
[3]

Persistent rnns: Stashing recurrent weights on-chip

Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent rnns: Stashing recurrent weights on-chip. In International Conference on Machine Learning, pp. 2024–2033,

work page 2024
[4]

End-to-end adversarial text-to-speech

Jeff Donahue, Sander Dieleman, Mikołaj Bi´nkowski, Erich Elsen, and Karen Simonyan. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575,

work page arXiv 2006
[5]

Ddsp: Differentiable digital signal processing

Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. Ddsp: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643,

work page arXiv 2001
[6]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[7]

Adam: A method for stochastic optimization

10 Published as a conference paper at ICLR 2021 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR,

work page 2021
[8]

Conditional WaveGAN

Chae Young Lee, Anoop Toffy, Gue Jun Jung, and Woo-Jin Han. Conditional wavegan. arXiv preprint arXiv:1809.10636,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

End-to-end music source separation: is it possible in the waveform domain?

Francesc Lluís, Jordi Pons, and Xavier Serra. End-to-end music source separation: is it possible in the waveform domain? arXiv preprint arXiv:1810.12187,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Fastspeech: Fast, robust and controllable text to speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263,

work page arXiv 1905
[11]

A wavenet for speech denoising

Dario Rethage, Jordi Pons, and Xavier Serra. A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5069–5073. IEEE,

work page 2018
[12]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Improved techniques for training score-based generative models

11 Published as a conference paper at ICLR 2021 Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv preprint arXiv:2006.09011,

work page arXiv 2021
[14]

Flowtron: an autoregressive ﬂow-based generative network for text-to-speech synthesis

Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro. Flowtron: an autoregressive ﬂow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957,

work page arXiv 2005
[15]

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

MelNet: A Generative Model for Audio in the Frequency Domain

Sean Vasquez and Mike Lewis. Melnet: A generative model for audio in the frequency domain.arXiv preprint arXiv:1906.01083,

work page internal anchor Pith review Pith/arXiv arXiv 1906
[17]

Neural source-ﬁlter-based waveform model for statistical parametric speech synthesis

Xin Wang, Shinji Takaki, and Junichi Yamagishi. Neural source-ﬁlter-based waveform model for statistical parametric speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5916–5920. IEEE,

work page 2019
[18]

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim

https://github.com/tugstugi/ pytorch-speech-commands. Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE,

work page 2020
[20]

Activation Maximization Generative Adversarial Nets

Zhiming Zhou, Han Cai, Shu Rong, Yuxuan Song, Kan Ren, Weinan Zhang, Yong Yu, and Jun Wang. Activation maximization generative adversarial nets. arXiv preprint arXiv:1703.02000,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

We expand the ELBO in Eq

12 Published as a conference paper at ICLR 2021 A P ROOF OF PROPOSITION 1 Proof. We expand the ELBO in Eq. (3) into the sum of a sequence of tractable KL divergences below. ELBO = Eq logpθ(x0,··· ,xT−1|xT )×platent(xT ) q(x1,··· ,xT|x0) = Eq ( logplatent(xT )− T∑ t=1 logpθ(xt−1|xt) q(xt|xt−1) ) = Eq ( logplatent(xT )− logpθ(x0|x1) q(x1|x0) − T∑ t=2 ( log ...

work page 2021
[22]

(12) Now, we calculate each term of the ELBO expansion in Eq

=N (xt−1; √¯αt−1βt 1− ¯αt x0 + √αt(1− ¯αt−1) 1− ¯αt xt, ˜βtI). (12) Now, we calculate each term of the ELBO expansion in Eq. (9). The ﬁrst constant term is Eq KL (q(xT|x0)∥platent(xT )) = Ex0KL ( N (√¯αTx0, (1− ¯αT )I)∥N (0,I ) ) = 1 2 Ex0∥√¯αTx0− 0∥2 +d ( log 1√1− ¯αT + 1− ¯αT− 1 2 ) = ¯αT 2 Ex0∥x0∥2− d 2(¯αT + log(1− ¯αT )) 13 Published as a conference ...

work page 2021
[23]

) 2 =−d 2 log 2πβ1− 1 2β1 Ex0,ϵ  √β1 √α1 (ϵ−ϵθ(x1, 1))  2 =−d 2 log 2πβ1− 1 2α1 Ex0,ϵ∥ϵ−ϵθ(x1, 1)∥2 The computation of the ELBO is now ﬁnished. 14 Published as a conference paper at ICLR 2021 B D ETAILS OF THE FAST SAMPLING ALGORITHM LetTinfer≪T be the number of steps in the reverse process (sampling) and{ηt}Tinfer t=1 be the user- deﬁned vari...

work page 2021
[24]

,$,%) (",$,%) (

Algorithm 3 Fast Sampling SamplexTinfer∼platent =N (0,I ) fors =Tinfer,T infer− 1,··· , 1 do Computeµfast θ (xs,s ) andσfast θ (xs,s ) using Eq. (15) Samplexs−1∼N (xs−1;µfast θ (xs,s ),σ fast θ (xs,s )2I) end for return x0 In neural vocoding task, we use user-deﬁned variance schedules{0.0001, 0.001, 0.01, 0.05, 0.2, 0.7} for DiffWaveLARGE and{0.0001, 0.00...

work page 2020
[25]

Compared to IS, AM score takes into consideration the the prior distribution ofpF(Xtrain)

computes the following: AM = KL ( Ex′∼qdatapF(x′)∥Ex∼pgenpF(x) ) + Ex∼pgenH(pF(x)), where H(·) computes the entropy. Compared to IS, AM score takes into consideration the the prior distribution ofpF(Xtrain). • Number of Statistically-Different Bins (NDB) (Richardson & Weiss, 2018): First,Xtrain is clustered intoK bins byK-Means in the feature space (where...

work page 2018