WaveNet: A Generative Model for Raw Audio
Pith reviewed 2026-05-12 20:23 UTC · model grok-4.3
The pith
WaveNet generates raw audio waveforms by predicting each sample from all previous ones and yields more natural text-to-speech than prior systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WaveNet is a fully probabilistic autoregressive deep neural network for raw audio waveforms, with the predictive distribution for each audio sample conditioned on all previous ones. It can be efficiently trained on data with tens of thousands of samples per second. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity by conditioning on speaker identity, and when trained to model music it generates novel and often highly realistic musical fragments.
What carries the argument
The autoregressive predictive distribution over each raw audio sample conditioned on all prior samples, realized in a deep neural network architecture.
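The autoregressive claim can be written as a product of per-sample conditionals. The notation below is an assumption made for illustration (a waveform x of T samples x_1, ..., x_T) and is not taken from this page:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Each factor is the predictive distribution for one sample given every earlier sample, which is the quantity the network is described as modeling.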
If this is right
- Text-to-speech systems can achieve higher naturalness as judged by human listeners.
- A single model can represent the voices of many different speakers through conditioning on speaker identity.
- The architecture can generate novel and realistic musical fragments when trained on music data.
- The same network can be repurposed for discriminative tasks such as phoneme recognition with promising results.
Where Pith is reading between the lines
- The autoregressive sample-by-sample approach might extend to other high-rate sequential signals such as video or sensor streams.
- Further conditioning inputs could allow finer control over generated audio content beyond speaker identity.
- Efficiency improvements could support real-time interactive audio generation applications.
Load-bearing premise
Human listener ratings of naturalness provide a reliable and unbiased measure of generated audio quality.
What would settle it
A controlled blind listening test in which average naturalness ratings for WaveNet audio are not higher than those for the best parametric or concatenative text-to-speech systems.
Original abstract
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WaveNet, a deep autoregressive neural network for raw audio waveform generation based on dilated causal convolutions. It demonstrates that the model can be trained efficiently on high-sample-rate audio data. When applied to text-to-speech, it claims state-of-the-art performance with human listeners rating the synthesized speech as significantly more natural than the best parametric and concatenative systems for both English and Mandarin. A single model captures multiple speakers via conditioning on speaker identity. Additional results include realistic music generation and promising phoneme recognition performance when used discriminatively.
Significance. If the human evaluation results hold under scrutiny, this represents a significant advance in audio generation by showing that direct probabilistic modeling of raw waveforms can outperform traditional TTS pipelines. The dilated convolution architecture efficiently captures long-range temporal structure, which is a key technical contribution. Credit is given for the explicit demonstration of efficient training despite the autoregressive formulation and for the multi-speaker conditioning results.
major comments (1)
- [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.
minor comments (2)
- [Abstract] The abstract states performance improvements via human evaluations but does not reference the specific quantitative results or tables that support the 'significantly more natural' claim.
- [Model architecture (Section 2)] The description of speaker conditioning could be strengthened by an explicit equation or diagram showing how the speaker embedding is injected into the dilated layers.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the need for greater transparency in the TTS evaluation protocol. We address the single major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.
  Authors: We agree that the manuscript would benefit from additional details on the human evaluation to allow readers to fully assess the strength of the SOTA claims. The current version does not sufficiently describe the listening test protocol, including participant numbers, rating scale, blinding procedures, sample selection, statistical testing, or objective corroborating metrics. In the revised manuscript we will expand the relevant section to specify the mean opinion score (MOS) protocol, the number of raters and their selection criteria, confirmation that samples were presented blindly in randomized order, the statistical tests used to establish significance, and any objective metrics (such as MCD) that were computed alongside the perceptual ratings. These additions will directly address the concern that the reported preference could stem from test artifacts rather than model quality.
  Revision: yes
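As an illustration of the kind of reporting discussed in this exchange, the sketch below computes a mean opinion score with a normal-approximation 95% interval. The function name `mos_with_ci`, the 1-to-5 scale, and both rating lists are hypothetical and are not data from the paper:

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a normal-approximation 95% interval."""
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


# Hypothetical 1-5 naturalness ratings for two systems (illustrative only).
system_a = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
system_b = [3, 4, 3, 4, 3, 3, 4, 3, 4, 3]

for name, ratings in (("A", system_a), ("B", system_b)):
    mean, half_width = mos_with_ci(ratings)
    print(f"system {name}: MOS = {mean:.2f} +/- {half_width:.2f}")
```

With so few ratings a normal approximation is rough; a real protocol would likely use many more raters and a paired or rank-based significance test, which this sketch does not attempt.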
Circularity Check
No circularity: WaveNet architecture and TTS claims rest on explicit model definition plus external human ratings
Full rationale
The paper defines the autoregressive dilated-convolution architecture, softmax output, and conditioning mechanisms directly from first principles (causal convolutions, residual/skip connections). Training maximizes the standard next-sample log-likelihood on external audio corpora. The central TTS claim (SOTA naturalness) is supported solely by separate human listening tests whose ratings are not algebraically or statistically forced by any fitted parameter inside the model equations. No self-citation chain, ansatz smuggling, or renaming of known results occurs for the performance assertions. The derivation chain is therefore self-contained and non-circular.
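The two ingredients named above, a causal dilated convolution and a next-sample log-likelihood under a softmax, can be sketched in NumPy. This is a minimal illustration and not the paper's implementation; the function names, the single-channel setup, and the zero padding at the start of the signal are assumptions:

```python
import numpy as np


def causal_dilated_conv(x, weights, dilation):
    """Single-channel causal convolution.

    y[t] = sum_k weights[k] * x[t - k * dilation], with samples before the
    start of the signal treated as zeros, so y[t] never depends on x[t + 1]
    or any later sample.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift == 0:
            y += w * x
        elif shift < len(x):
            y[shift:] += w * x[:-shift]
    return y


def next_sample_nll(logits, targets):
    """Mean negative log-likelihood of integer targets under a softmax."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


x = np.arange(8)
y = causal_dilated_conv(x, weights=[1.0, 1.0], dilation=2)
print(y.tolist())  # [0.0, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```

Changing a later input sample leaves all earlier outputs unchanged, which is the property that lets every position in a training sequence be scored in one pass.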
Axiom & Free-Parameter Ledger
free parameters (2)
- dilation schedule and network depth
- conditioning mechanisms for text and speaker
axioms (2)
- domain assumption: Raw audio waveforms can be modeled as autoregressive sequences where each sample depends statistically on all previous samples.
- domain assumption: Dilated convolutions efficiently capture long-range temporal dependencies in audio data at high sample rates.
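The second assumption is easier to weigh with a quick receptive-field calculation. The sketch below assumes a stack of stride-1 causal convolutions with kernel size 2; the doubling dilation schedule and the 16 kHz sample rate are illustrative choices, not values stated on this page:

```python
def receptive_field(dilations, kernel_size=2):
    """Input samples visible to one output of stacked stride-1 dilated convolutions."""
    # Each layer widens the window by (kernel_size - 1) * dilation samples.
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512 (illustrative)
rf = receptive_field(dilations)
print(rf)                 # 1024
print(rf * 1000 / 16000)  # 64.0 (milliseconds at an assumed 16 kHz)
```

Ten layers with doubling dilations cover 1024 samples, whereas ten undilated layers of the same kernel size would cover only 11, which is the efficiency argument the assumption relies on.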
Forward citations
Cited by 33 Pith papers
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
- Efficiently Modeling Long Sequences with Structured State Spaces
  S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
- DiffWave: A Versatile Diffusion Model for Audio Synthesis
  DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming pri...
- Denoising Diffusion Probabilistic Models
  Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
- Neural network modeling of many-body super- and sub-radiant dynamics
  Neural quantum states simulate dissipative many-body emission dynamics for approximately 40 atoms in dense 1D and 2D arrays, revealing prominent subradiant behavior at late times.
- MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
  MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
- DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
  DiffAnon introduces the first diffusion model for voice anonymization that supplies structured, continuous, inference-time control over prosody preservation via classifier-free guidance on RVQ semantic embeddings.
- Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
  Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
- ReLU Networks for Exact Generation of Similar Graphs
  Constant-depth ReLU networks of size O(n²d) exist that deterministically generate graphs within edit distance d from any given n-vertex input graph.
- Moshi: a speech-text foundation model for real-time dialogue
  Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
- Chronos: Learning the Language of Time Series
  Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
- High Fidelity Neural Audio Compression
  EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...
- A Generalist Agent
  Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
- Diffusion Models Beat GANs on Image Synthesis
  Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
- Generating Long Sequences with Sparse Transformers
  Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation
  Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
- Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
  Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
- AIBuildAI: An AI Agent for Automatically Building AI Models
  AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
- A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
  A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.
- Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space
  FOT-CFM generates turbulent fields in function space with superior high-order statistics and energy spectra on Navier-Stokes, Kolmogorov flow, and Hasegawa-Wakatani equations compared to baselines.
- Borderless Long Speech Synthesis
  Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.
- Is Conditional Generative Modeling all you need for Decision-Making?
  Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
- Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
  Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
- VideoGPT: Video Generation using VQ-VAE and Transformers
  VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
  GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
- Sessa: Selective State Space Attention
  Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
- Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
  CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...
- Applied AI-Enhanced RF Interference Rejection
  Autoregressive transformer decoders suppress OFDM interference in FM radio signals to restore intelligible speech with low latency on GPUs like Jetson AGX Orin.
- Federated Parameter-Efficient Adaptation for Interference Mitigation at the Wireless Edge
  Federated LoRA on TCNs for wireless interference suppression reduces per-round communication up to 20x while delivering 12.6% average BER improvement comparable to local adaptation.
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
  AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
- Dynamic Forecasting and Temporal Feature Evolution of Stock Repurchases in Listed Companies Using Attention-Based Deep Temporal Networks
  A TCN plus Attention-LSTM model trained on 2014-2024 Chinese A-share data outperforms static baselines and identifies prolonged undervaluation as the long-term driver and sudden cash-flow increases as the short-term t...
- Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers
  A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.