arxiv: 1609.03499 · v2 · submitted 2016-09-12 · 💻 cs.SD · cs.LG

WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu

Pith reviewed 2026-05-12 20:23 UTC · model grok-4.3

keywords: wavenet, generative model, raw audio, text-to-speech, autoregressive, music generation, neural network, phoneme recognition

The pith

WaveNet generates raw audio waveforms by predicting each sample from all previous ones and yields more natural text-to-speech than prior systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a neural network that creates audio waveforms by modeling the probability of each tiny sample given every sample that came before it. This autoregressive setup can be trained efficiently even at the high rates of real audio data. When applied to turning text into speech, human listeners rate the results as significantly more natural than the best existing parametric and concatenative methods, and this holds for both English and Mandarin. One trained model can also imitate many different speakers simply by receiving speaker identity as input and can produce realistic new music fragments when trained on music data.
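The sample-by-sample generation described above can be sketched as a simple loop. This is a minimal illustration, not the paper's network: `toy_next_sample_probs` is a hypothetical stand-in for a trained model, and the 8 quantization levels are an arbitrary choice made here for readability.

```python
import math
import random


def toy_next_sample_probs(history, num_levels=8):
    """Hypothetical stand-in for a trained network: returns a categorical
    distribution over the next quantized sample, favouring values near
    the previous one."""
    prev = history[-1] if history else num_levels // 2
    weights = [math.exp(-abs(level - prev)) for level in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, num_levels=8, seed=0):
    """Draw samples one at a time, each conditioned on everything drawn so far."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_samples):
        probs = toy_next_sample_probs(history, num_levels)
        nxt = rng.choices(range(num_levels), weights=probs, k=1)[0]
        history.append(nxt)
    return history


waveform = generate(16)
print(waveform)
```

The loop makes the cost of generation visible: every new sample needs one full evaluation of the predictor, which is why generation is sequential even when training is not.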

Core claim

WaveNet is a fully probabilistic autoregressive deep neural network for raw audio waveforms, with the predictive distribution for each audio sample conditioned on all previous ones. It can be efficiently trained on data with tens of thousands of samples per second. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity by conditioning on speaker identity, and when trained to model music it generates novel and often highly realistic musical fragments.

What carries the argument

The autoregressive predictive distribution over each raw audio sample conditioned on all prior samples, realized in a deep neural network architecture.
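Written out, the statement above corresponds to the standard chain-rule factorization of a joint distribution. The symbols are notation chosen here: a waveform of T samples is written as x = (x_1, ..., x_T).

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Each factor is the predictive distribution for one sample given all earlier ones, so the product is exact and involves no independence assumption.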

If this is right

Text-to-speech systems can achieve higher naturalness as judged by human listeners.
A single model can represent the voices of many different speakers through conditioning on speaker identity.
The architecture can generate novel and realistic musical fragments when trained on music data.
The same network can be repurposed for discriminative tasks such as phoneme recognition with promising results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The autoregressive sample-by-sample approach might extend to other high-rate sequential signals such as video or sensor streams.
Further conditioning inputs could allow finer control over generated audio content beyond speaker identity.
Efficiency improvements could support real-time interactive audio generation applications.

Load-bearing premise

Human listener ratings of naturalness provide a reliable and unbiased measure of generated audio quality.

What would settle it

A controlled blind listening test in which average naturalness ratings for WaveNet audio are not higher than those for the best parametric or concatenative text-to-speech systems.

Original abstract

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WaveNet shows that dilated causal convolutions let you train an autoregressive model directly on raw audio waveforms at scale, and the TTS results look promising but rest mostly on human ratings.

The main takeaway is that this architecture stacks dilated causal convolutions to give the model a large receptive field over past samples without recurrence, which makes training feasible on high-rate audio data. That is the concrete advance over earlier RNN-based audio models or spectrogram approaches. It also conditions on text features and speaker identity so one network can handle multiple voices and switch between them cleanly. The music generation examples and the side use as a phoneme classifier are nice extras that show the same model is flexible. Training efficiency is another plus they highlight, since the dilated structure avoids the usual sequential bottlenecks. The TTS claim is that listeners rate the output more natural than the best parametric and concatenative systems on both English and Mandarin. That is the headline result. The soft spot is that the evidence for this is human naturalness ratings, and the abstract gives no numbers on rater count, statistical tests, blinding, or objective measures such as mel-cepstral distortion to cross-check. If the test setup had small samples or weaker baselines, the gap could shrink. The assumption that speaker ID plus text conditioning captures timbre and prosody fully is also untested beyond those same ratings. Readers working on generative sequence models or speech synthesis will get the most from it, especially if they want to try the dilated convolution trick on other high-rate signals. It is worth sending to peer review because the core idea is new and the practical payoff is clear, even if the experimental details will need tightening.
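To make the receptive-field point in the letter concrete, here is a small calculation. It assumes kernel size 2 and a dilation that doubles at every layer (1, 2, 4, ..., 512); neither figure is stated on this page, so treat both as illustrative assumptions.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of past-and-present input samples visible to one output
    of a stack of causal convolutions with the given dilation factors."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed schedule: dilation doubles at every layer, 1, 2, 4, ..., 512.
doubling = [2 ** i for i in range(10)]
# Same depth without dilation, for comparison.
undilated = [1] * 10

print(receptive_field(doubling))   # 1024
print(receptive_field(undilated))  # 11
```

With doubling dilations the receptive field grows exponentially in depth, while the undilated stack of the same depth grows only linearly.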

Referee Report

1 major / 2 minor

Summary. The paper introduces WaveNet, a deep autoregressive neural network for raw audio waveform generation based on dilated causal convolutions. It demonstrates that the model can be trained efficiently on high-sample-rate audio data. When applied to text-to-speech, it claims state-of-the-art performance with human listeners rating the synthesized speech as significantly more natural than the best parametric and concatenative systems for both English and Mandarin. A single model captures multiple speakers via conditioning on speaker identity. Additional results include realistic music generation and promising phoneme recognition performance when used discriminatively.

Significance. If the human evaluation results hold under scrutiny, this represents a significant advance in audio generation by showing that direct probabilistic modeling of raw waveforms can outperform traditional TTS pipelines. The dilated convolution architecture efficiently captures long-range temporal structure, which is a key technical contribution. Credit is given for the explicit demonstration of efficient training despite the autoregressive formulation and for the multi-speaker conditioning results.

major comments (1)

[TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.

minor comments (2)

[Abstract] The abstract states performance improvements via human evaluations but does not reference the specific quantitative results or tables that support the 'significantly more natural' claim.
[Model architecture (Section 2)] The description of speaker conditioning could be strengthened by an explicit equation or diagram showing how the speaker embedding is injected into the dilated layers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the need for greater transparency in the TTS evaluation protocol. We address the single major comment below and will revise the manuscript accordingly.

Referee: [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.

Authors: We agree that the manuscript would benefit from additional details on the human evaluation to allow readers to fully assess the strength of the SOTA claims. The current version does not sufficiently describe the listening test protocol, including participant numbers, rating scale, blinding procedures, sample selection, statistical testing, or objective corroborating metrics. In the revised manuscript we will expand the relevant section to specify the mean opinion score (MOS) protocol, the number of raters and their selection criteria, confirmation that samples were presented blindly in randomized order, the statistical tests used to establish significance, and any objective metrics (such as MCD) that were computed alongside the perceptual ratings. These additions will directly address the concern that the reported preference could stem from test artifacts rather than model quality.

revision: yes
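As an illustration of the kind of reporting discussed in this exchange, a mean opinion score with a confidence interval can be computed as below. The ratings are made up for the example and are not data from the paper; the 1-to-5 scale and the normal-approximation interval are assumptions.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and a normal-approximation 95% confidence interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - half_width, mean + half_width)


# Made-up 1-5 ratings for two hypothetical systems; not data from the paper.
system_a = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
system_b = [3, 4, 3, 3, 4, 3, 2, 4, 3, 3]

for name, ratings in (("A", system_a), ("B", system_b)):
    mean, (low, high) = mos_with_ci(ratings)
    print(f"system {name}: MOS {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only ten ratings per system the normal approximation is crude, and comparing intervals by eye is not a significance test; a real evaluation would likely use many more ratings and a paired test.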

Circularity Check

0 steps flagged

No circularity: WaveNet architecture and TTS claims rest on explicit model definition plus external human ratings

The paper defines the autoregressive dilated-convolution architecture, softmax output, and conditioning mechanisms directly from first principles (causal convolutions, residual/skip connections). Training maximizes the standard next-sample log-likelihood on external audio corpora. The central TTS claim (SOTA naturalness) is supported solely by separate human listening tests whose ratings are not algebraically or statistically forced by any fitted parameter inside the model equations. No self-citation chain, ansatz smuggling, or renaming of known results occurs for the performance assertions. The derivation chain is therefore self-contained and non-circular.
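The training objective named above, next-sample log-likelihood under a softmax output, can be sketched in a few lines. This is an illustrative toy, not the paper's training code: the 4 amplitude levels and the hand-written logits are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def next_sample_nll(logits_per_step, targets):
    """Average negative log-likelihood of the observed next samples.

    logits_per_step[t] holds the unnormalized scores the model assigns to
    every quantized amplitude level at step t; targets[t] is the level
    that actually occurred.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


# Toy example with 4 amplitude levels and uniform (all-zero) logits.
uniform = [[0.0] * 4 for _ in range(3)]
print(next_sample_nll(uniform, [0, 2, 3]))  # log(4), about 1.386
```

Because the targets come from recorded audio rather than from the model, minimizing this quantity does not by itself guarantee anything about listener ratings, which is the separation the circularity check relies on.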

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of an autoregressive neural architecture with dilated convolutions for high-rate audio, plus several architectural and training choices that function as free parameters tuned empirically.

free parameters (2)

dilation schedule and network depth
Hyperparameters selected to achieve sufficient receptive field for audio dependencies while enabling efficient training.
conditioning mechanisms for text and speaker
Specific ways to incorporate external inputs chosen to enable multi-speaker and TTS functionality.
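One form such a conditioning mechanism can take is a gated activation in which a conditioning vector h (for example a speaker embedding) is added inside both the filter and the gate. The notation is introduced here for illustration: k indexes the layer, * is a dilated causal convolution, W and V are learned weights, σ is the logistic sigmoid, and ⊙ is element-wise multiplication.

```latex
\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{\top} \mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{\top} \mathbf{h}\right)
```

In this sketch h is constant across time steps, so the same speaker information biases every layer's output for the whole utterance.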

axioms (2)

domain assumption Raw audio waveforms can be modeled as autoregressive sequences where each sample depends statistically on all previous samples.
Stated directly in the abstract as the core generative approach.
domain assumption Dilated convolutions efficiently capture long-range temporal dependencies in audio data at high sample rates.
Invoked to justify training feasibility on tens of thousands of samples per second.
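A minimal sketch of a single dilated causal convolution may help make the second assumption concrete. It is written in plain Python for clarity rather than speed, and the two-tap filter and dilation of 2 are arbitrary choices for the example.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: y[t] depends only on x[t], x[t - d], x[t - 2d], ...

    weights[0] multiplies the current sample, weights[j] multiplies the
    sample j * dilation steps in the past. Positions before the start of
    the signal are treated as zero (left padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(causal_dilated_conv(signal, [1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

No output position reads an input from its future, so all positions can be computed in parallel during training while still respecting the autoregressive ordering.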

pith-pipeline@v0.9.0 · 5471 in / 1466 out tokens · 69206 ms · 2026-05-12T20:23:22.651711+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Efficiently Modeling Long Sequences with Structured State Spaces
cs.LG 2021-10 unverdicted novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
DiffWave: A Versatile Diffusion Model for Audio Synthesis
eess.AS 2020-09 unverdicted novelty 8.0

DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming pri...
Denoising Diffusion Probabilistic Models
cs.LG 2020-06 accept novelty 8.0

Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
Neural network modeling of many-body super- and sub-radiant dynamics
quant-ph 2026-05 unverdicted novelty 7.0

Neural quantum states simulate dissipative many-body emission dynamics for approximately 40 atoms in dense 1D and 2D arrays, revealing prominent subradiant behavior at late times.
MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
cs.SD 2026-05 unverdicted novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
eess.AS 2026-04 unverdicted novelty 7.0

DiffAnon introduces the first diffusion model for voice anonymization that supplies structured, continuous, inference-time control over prosody preservation via classifier-free guidance on RVQ semantic embeddings.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS 2026-04 unverdicted novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
ReLU Networks for Exact Generation of Similar Graphs
cs.LG 2026-04 unverdicted novelty 7.0

Constant-depth ReLU networks of size O(n²d) exist that deterministically generate graphs within edit distance d from any given n-vertex input graph.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Chronos: Learning the Language of Time Series
cs.LG 2024-03 conditional novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
High Fidelity Neural Audio Compression
eess.AS 2022-10 accept novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
Generating Long Sequences with Sparse Transformers
cs.LG 2019-04 unverdicted novelty 7.0

Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
Progressive Growing of GANs for Improved Quality, Stability, and Variation
cs.NE 2017-10 accept novelty 7.0

Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
eess.AS 2026-04 unverdicted novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
AIBuildAI: An AI Agent for Automatically Building AI Models
cs.AI 2026-04 unverdicted novelty 6.0

AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
cs.SD 2026-04 unverdicted novelty 6.0

A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.
Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space
cs.LG 2026-04 unverdicted novelty 6.0

FOT-CFM generates turbulent fields in function space with superior high-order statistics and energy spectra on Navier-Stokes, Kolmogorov flow, and Hasegawa-Wakatani equations compared to baselines.
Borderless Long Speech Synthesis
cs.SD 2026-03 unverdicted novelty 6.0

Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.
Is Conditional Generative Modeling all you need for Decision-Making?
cs.LG 2022-11 unverdicted novelty 6.0

Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
cs.LG 2021-04 accept novelty 6.0

Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
VideoGPT: Video Generation using VQ-VAE and Transformers
cs.CV 2021-04 accept novelty 6.0

VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
cs.CL 2020-06 unverdicted novelty 6.0

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
Sessa: Selective State Space Attention
cs.LG 2026-04 unverdicted novelty 5.0

Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
cs.LG 2026-04 unverdicted novelty 5.0

CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...
Applied AI-Enhanced RF Interference Rejection
eess.SP 2026-04 unverdicted novelty 5.0

Autoregressive transformer decoders suppress OFDM interference in FM radio signals to restore intelligible speech with low latency on GPUs like Jetson AGX Orin.
Federated Parameter-Efficient Adaptation for Interference Mitigation at the Wireless Edge
cs.NI 2026-04 unverdicted novelty 4.0

Federated LoRA on TCNs for wireless interference suppression reduces per-round communication up to 20x while delivering 12.6% average BER improvement comparable to local adaptation.
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
cs.SD 2026-04 unverdicted novelty 3.0

AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
Dynamic Forecasting and Temporal Feature Evolution of Stock Repurchases in Listed Companies Using Attention-Based Deep Temporal Networks
q-fin.ST 2026-03 unverdicted novelty 3.0

A TCN plus Attention-LSTM model trained on 2014-2024 Chinese A-share data outperforms static baselines and identifies prolonged undervaluation as the long-term driver and sudden cash-flow increases as the short-term t...
Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers
math.OC 2026-04 unverdicted novelty 2.0

A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 33 Pith papers · 1 internal anchor

[1]

Vocaine the vocoder and applications is speech synthesis

Agiomyrgiannakis, Yannis. Vocaine the vocoder and applications is speech synthesis. In ICASSP, pp.\ 4230--4234, 2015

work page 2015
[2]

Mixture density networks

Bishop, Christopher M. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994

work page 1994
[3]

Semantic image segmentation with deep convolutional nets and fully connected CRF s

Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRF s. In ICLR, 2015. URL http://arxiv.org/abs/1412.7062

work page arXiv 2015
[4]

The Vowel: I ts Nature and Structure

Chiba, Tsutomu and Kajiyama, Masato. The Vowel: I ts Nature and Structure . Tokyo-Kaiseikan, 1942

work page 1942
[5]

Remaking speech

Dudley, Homer. Remaking speech. The Journal of the Acoustical Society of America, 11 0 (2): 0 169--177, 1939

work page 1939
[6]

An implementation of the ``algorithme \`a trous'' to compute the wavelet transform

Dutilleux, Pierre. An implementation of the ``algorithme \`a trous'' to compute the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp.\ 298--304. Springer Berlin Heidelberg, 1989

work page 1989
[7]

TTS synthesis with bidirectional LSTM based recurrent neural networks

Fan, Yuchen, Qian, Yao, and Xie, Feng-Long, Soong Frank K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Interspeech, pp.\ 1964--1968, 2014

work page 1964
[8]

Acoustic Theory of Speech Production

Fant, Gunnar. Acoustic Theory of Speech Production. Mouton De Gruyter, 1970

work page 1970
[9]

DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM

Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathon G., and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM . NIST speech disc 1-1.1. NASA STI/Recon technical report, 93, 1993

work page 1993
[10]

Recent advances in G oogle real-time HMM -driven unit selection synthesizer

Gonzalvo, Xavi, Tazari, Siamak, Chan, Chun-an, Becker, Markus, Gutkin, Alexander, and Silen, Hanna. Recent advances in G oogle real-time HMM -driven unit selection synthesizer. In Interspeech, 2016. URL http://research.google.com/pubs/pub45564.html

work page 2016
[11]

Deep Residual Learning for Image Recognition

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[12]

and Schmidhuber, J

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9 0 (8): 0 1735--1780, 1997

work page 1997
[13]

A real-time algorithm for signal analysis with the help of the wavelet transform

Holschneider, Matthias, Kronland-Martinet, Richard, Morlet, Jean, and Tchamitchian, Philippe. A real-time algorithm for signal analysis with the help of the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp.\ 286--297. Springer Berlin Heidelberg, 1989

work page 1989
[14]

Speech acoustic modeling from raw multichannel waveforms

Hoshen, Yedid, Weiss, Ron J., and Wilson, Kevin W. Speech acoustic modeling from raw multichannel waveforms. In ICASSP, pp.\ 4624--4628. IEEE, 2015

work page 2015
[15]

and Black, Alan W

Hunt, Andrew J. and Black, Alan W. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP, pp.\ 373--376, 1996

work page 1996
[16]

Unbiased estimation of log spectrum

Imai, Satoshi and Furuichi, Chieko. Unbiased estimation of log spectrum. In EURASIP, pp.\ 203--206, 1988

work page 1988
[17]

Line spectrum representation of linear predictor coefficients of speech signals

Itakura, Fumitada. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoust. Society of America, 57 0 (S1): 0 S35--S35, 1975

work page 1975
[18]

A statistical method for estimation of speech spectral density and formant frequencies

Itakura, Fumitada and Saito, Shuzo. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53–A: 0 35--42, 1970

work page 1970
[19]

Recommendation G

ITU-T. Recommendation G . 711. Pulse Code Modulation (PCM) of voice frequencies, 1988

work page 1988
[20]

Exploring the Limits of Language Modeling

J \' o zefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410

work page Pith review arXiv 2016
[21]

Mixture autoregressive hidden M arkov models for speech signals

Juang, Biing-Hwang and Rabiner, Lawrence. Mixture autoregressive hidden M arkov models for speech signals. IEEE Trans. Acoust. Speech Signal Process., pp.\ 1404--1413, 1985

work page 1985
[22]

Speech analysis with multi-kernel linear prediction

Kameoka, Hirokazu, Ohishi, Yasunori, Mochihashi, Daichi, and Le Roux, Jonathan. Speech analysis with multi-kernel linear prediction. In Spring Conference of ASJ, pp.\ 499--502, 2010. (in Japanese)

work page 2010
[23]

Text-to-speech conversion with neural networks: A recurrent TDNN approach

Karaali, Orhan, Corrigan, Gerald, Gerson, Ira, and Massey, Noel. Text-to-speech conversion with neural networks: A recurrent TDNN approach. In Eurospeech, pp.\ 561--564, 1997

work page 1997
[24]

Kawahara, Hideki, Masuda-Katsuse, Ikuyo, and de Cheveign \'e , Alain. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f_ 0 extraction: possible role of a repetitive structure in sounds. Speech Commn., 27: 0 187--207, 1999

work page 1999
[25]

Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT

Kawahara, Hideki, Estill, Jo, and Fujimura, Osamu. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT . In MAVEBA, pp.\ 13--15, 2001

work page 2001
[26]

Input-agreement: a new mechanism for collecting data using human computation games

Law, Edith and Von Ahn, Luis. Input-agreement: a new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.\ 1197--1206. ACM, 2009

work page 2009
[27]

Maia, Ranniery, Zen, Heiga, and Gales, Mark J. F. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In ISCA SSW7, pp.\ 88--93, 2010

work page 2010
[28]

WORLD : A vocoder-based high-quality speech synthesis system for real-time applications

Morise, Masanori, Yokomori, Fumiya, and Ozawa, Kenji. WORLD : A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., E99-D 0 (7): 0 1877--1884, 2016

work page 2016
[29]

Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones

Moulines, Eric and Charpentier, Francis. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commn., 9: 0 453--467, 1990

work page 1990
[30]

and Black, Alan W

Muthukumar, P. and Black, Alan W. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014

work page arXiv 2014
[31]

Rectified linear units improve restricted B oltzmann machines

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted B oltzmann machines. In ICML, pp.\ 807--814, 2010

work page 2010
[32]

Integration of spectral feature extraction and modeling for HMM -based speech synthesis

Nakamura, Kazuhiro, Hashimoto, Kei, Nankaku, Yoshihiko, and Tokuda, Keiichi. Integration of spectral feature extraction and modeling for HMM -based speech synthesis. IEICE Trans. Inf. Syst., E97-D 0 (6): 0 1438--1448, 2014

work page 2014
[33]

Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks

Palaz, Dimitri, Collobert, Ronan, and Magimai-Doss, Mathew. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, pp.\ 1766--1770, 2013

work page 2013
[34]

Nonlinear filter design: methodologies and challenges

Peltonen, Sari, Gabbouj, Moncef, and Astola, Jaakko. Nonlinear filter design: methodologies and challenges. In IEEE ISPA, pp.\ 102--107, 2001

work page 2001
[35]

Linear predictive hidden M arkov models and the speech signal

Poritz, Alan B. Linear predictive hidden M arkov models and the speech signal. In ICASSP, pp.\ 1291--1294, 1982

work page 1982
[36]

Fundamentals of Speech Recognition

Rabiner, Lawrence and Juang, Biing-Hwang. Fundamentals of Speech Recognition. Prentice Hall, 1993
[37]

ATR ν-talk speech synthesis system

Sagisaka, Yoshinori, Kaiki, Nobuyoshi, Iwahashi, Naoto, and Mimura, Katsuhiko. ATR ν-talk speech synthesis system. In ICSLP, pp. 483--486, 1992
[38]

Learning the speech front-end with raw waveform CLDNNs

Sainath, Tara N., Weiss, Ron J., Senior, Andrew, Wilson, Kevin W., and Vinyals, Oriol. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, pp. 1--5, 2015
[39]

A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis

Takaki, Shinji and Yamagishi, Junichi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In ICASSP, pp. 5535--5539, 2016
[40]

Postfilters to modify the modulation spectrum for statistical parametric speech synthesis

Takamichi, Shinnosuke, Toda, Tomoki, Black, Alan W., Neubig, Graham, Sakriani, Sakti, and Nakamura, Satoshi. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process., 24(4):755--767, 2016
[41]

Generative image modeling using spatial LSTMs

Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTMs. In NIPS, pp. 1927--1935, 2015
[42]

A speech parameter generation algorithm considering global variance for HMM-based speech synthesis

Toda, Tomoki and Tokuda, Keiichi. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816--824, 2007
[43]

Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM

Toda, Tomoki and Tokuda, Keiichi. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In ICASSP, pp. 3925--3928, 2008
[44]

Speech synthesis as a statistical machine learning problem

Tokuda, Keiichi. Speech synthesis as a statistical machine learning problem. http://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf, 2011. Invited talk given at ASRU
[45]

Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis

Tokuda, Keiichi and Zen, Heiga. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In ICASSP, pp. 4215--4219, 2015
[46]

Directly modeling voiced and unvoiced components in speech waveforms by neural networks

Tokuda, Keiichi and Zen, Heiga. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In ICASSP, pp. 5640--5644, 2016
[47]

Speech synthesis using artificial neural networks trained on cepstral coefficients

Tuerk, Christine and Robinson, Tony. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pp. 1713--1716, 1993
[48]

Acoustic modeling with deep neural networks using raw time signal for LVCSR

Tüske, Zoltán, Golik, Pavel, Schlüter, Ralf, and Ney, Hermann. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Interspeech, pp. 890--894, 2014
[49]

Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE

Uria, Benigno, Murray, Iain, Renals, Steve, Valentini-Botinhao, Cassia, and Bridle, John. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In ICASSP, pp. 4465--4469, 2015
[50]

Pixel Recurrent Neural Networks

van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a
[51]

Conditional image generation with PixelCNN decoders

van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016b. URL http://arxiv.org/abs/1606.05328
[52]

Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis

Wu, Yi-Jian and Tokuda, Keiichi. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Interspeech, pp. 577--580, 2008
[53]

English multi-speaker corpus for CSTR voice cloning toolkit, 2012

Yamagishi, Junichi. English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html

[54]

Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems

Yoshimura, Takayoshi. Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Nagoya Institute of Technology, 2002
[55]

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. URL http://arxiv.org/abs/1511.07122

[56]

An example of context-dependent label format for HMM-based speech synthesis in English, 2006

Zen, Heiga. An example of context-dependent label format for HMM-based speech synthesis in English, 2006. URL http://hts.sp.nitech.ac.jp/?Download
[57]

Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features

Zen, Heiga, Tokuda, Keiichi, and Kitamura, Tadashi. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features. Comput. Speech Lang., 21(1):153--173, 2007
[58]

Statistical parametric speech synthesis

Zen, Heiga, Tokuda, Keiichi, and Black, Alan W. Statistical parametric speech synthesis. Speech Commn., 51(11):1039--1064, 2009
[59]

Statistical parametric speech synthesis using deep neural networks

Zen, Heiga, Senior, Andrew, and Schuster, Mike. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pp. 7962--7966, 2013
[60]

Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices

Zen, Heiga, Agiomyrgiannakis, Yannis, Egberts, Niels, Henderson, Fergus, and Szczepaniak, Przemysław. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Interspeech, 2016. URL https://arxiv.org/abs/1606.06061