pith. machine review for the scientific record.

arxiv: 1609.03499 · v2 · submitted 2016-09-12 · 💻 cs.SD · cs.LG

WaveNet: A Generative Model for Raw Audio

Pith reviewed 2026-05-12 20:23 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords WaveNet · generative model · raw audio · text-to-speech · autoregressive · music generation · neural network · phoneme recognition

The pith

WaveNet generates raw audio waveforms by predicting each sample from all previous ones and yields more natural text-to-speech than prior systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a neural network that creates audio waveforms by modeling the probability of each tiny sample given every sample that came before it. This autoregressive setup can be trained efficiently even at the high rates of real audio data. When applied to turning text into speech, human listeners rate the results as significantly more natural than the best existing parametric and concatenative methods, and this holds for both English and Mandarin. One trained model can also imitate many different speakers simply by receiving speaker identity as input and can produce realistic new music fragments when trained on music data.

Core claim

WaveNet is a fully probabilistic autoregressive deep neural network for raw audio waveforms, with the predictive distribution for each audio sample conditioned on all previous ones. It can be efficiently trained on data with tens of thousands of samples per second. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity by conditioning on speaker identity, and when trained to model music it generates novel and often highly realistic musical fragments.

What carries the argument

The autoregressive predictive distribution over each raw audio sample conditioned on all prior samples, realized in a deep neural network architecture.
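
As a compact restatement of the factorization described above, the joint density of a waveform can be written as a product of per-sample conditionals. The second line adds a conditioning input such as speaker identity or text-derived features. The symbols x, T, and h are notation chosen here for illustration.

```latex
\begin{align}
p(\mathbf{x}) &= \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right) \\
p(\mathbf{x} \mid \mathbf{h}) &= \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h}\right)
\end{align}
```

Taking the logarithm turns the product into a sum of per-sample terms, which is the next-sample log-likelihood that training maximizes.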

If this is right

  • Text-to-speech systems can achieve higher naturalness as judged by human listeners.
  • A single model can represent the voices of many different speakers through conditioning on speaker identity.
  • The architecture can generate novel and realistic musical fragments when trained on music data.
  • The same network can be repurposed for discriminative tasks such as phoneme recognition with promising results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The autoregressive sample-by-sample approach might extend to other high-rate sequential signals such as video or sensor streams.
  • Further conditioning inputs could allow finer control over generated audio content beyond speaker identity.
  • Efficiency improvements could support real-time interactive audio generation applications.

Load-bearing premise

Human listener ratings of naturalness provide a reliable and unbiased measure of generated audio quality.

What would settle it

A controlled blind listening test in which average naturalness ratings for WaveNet audio are not higher than those for the best parametric or concatenative text-to-speech systems.

Original abstract

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces WaveNet, a deep autoregressive neural network for raw audio waveform generation based on dilated causal convolutions. It demonstrates that the model can be trained efficiently on high-sample-rate audio data. When applied to text-to-speech, it claims state-of-the-art performance with human listeners rating the synthesized speech as significantly more natural than the best parametric and concatenative systems for both English and Mandarin. A single model captures multiple speakers via conditioning on speaker identity. Additional results include realistic music generation and promising phoneme recognition performance when used discriminatively.

Significance. If the human evaluation results hold under scrutiny, this represents a significant advance in audio generation by showing that direct probabilistic modeling of raw waveforms can outperform traditional TTS pipelines. The dilated convolution architecture efficiently captures long-range temporal structure, which is a key technical contribution. Credit is given for the explicit demonstration of efficient training despite the autoregressive formulation and for the multi-speaker conditioning results.

major comments (1)
  1. [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.
minor comments (2)
  1. [Abstract] The abstract states performance improvements via human evaluations but does not reference the specific quantitative results or tables that support the 'significantly more natural' claim.
  2. [Model architecture (Section 2)] The description of speaker conditioning could be strengthened by an explicit equation or diagram showing how the speaker embedding is injected into the dilated layers.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the need for greater transparency in the TTS evaluation protocol. We address the single major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [TTS experiments (Section 4)] The central SOTA claim for TTS (abstract and experiments section) rests on human naturalness ratings being significantly higher than parametric/concatenative baselines. The manuscript provides no details on the number of raters, statistical significance testing, rating scale or protocol (e.g., blind presentation, sample selection), or objective corroborating metrics such as MCD or PESQ. This information is load-bearing for verifying that the preference reflects model quality rather than test artifacts.

    Authors: We agree that the manuscript would benefit from additional details on the human evaluation to allow readers to fully assess the strength of the SOTA claims. The current version does not sufficiently describe the listening test protocol, including participant numbers, rating scale, blinding procedures, sample selection, statistical testing, or objective corroborating metrics. In the revised manuscript we will expand the relevant section to specify the mean opinion score (MOS) protocol, the number of raters and their selection criteria, confirmation that samples were presented blindly in randomized order, the statistical tests used to establish significance, and any objective metrics (such as MCD) that were computed alongside the perceptual ratings. These additions will directly address the concern that the reported preference could stem from test artifacts rather than model quality.

    revision: yes
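
The mean opinion score (MOS) summary that the simulated response refers to can be illustrated as follows. This is a sketch with made-up ratings, not data from the paper: the function name `mos_with_ci`, the 1 to 5 scale, and the normal-approximation 95% interval are assumptions chosen for illustration.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, (low, high)) for a list of listener ratings.

    Uses a normal approximation for the confidence interval, which is an
    assumption of this sketch and is only reasonable for large rater counts.
    """
    n = len(ratings)
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(n)
    return mean, (mean - z * sem, mean + z * sem)


# Hypothetical ratings on a 1-5 naturalness scale (not from the paper).
system_a = [4, 5, 4, 5, 4, 5]
system_b = [3, 4, 3, 4, 3, 3]

for name, ratings in (("system A", system_a), ("system B", system_b)):
    mean, (low, high) = mos_with_ci(ratings)
    print(f"{name}: MOS {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Comparing two intervals by eye is not a significance test. A real study would typically report a paired test over matched utterances alongside the MOS values.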

Circularity Check

0 steps flagged

No circularity: WaveNet architecture and TTS claims rest on explicit model definition plus external human ratings

full rationale

The paper defines the autoregressive dilated-convolution architecture, softmax output, and conditioning mechanisms directly from first principles (causal convolutions, residual/skip connections). Training maximizes the standard next-sample log-likelihood on external audio corpora. The central TTS claim (SOTA naturalness) is supported solely by separate human listening tests whose ratings are not algebraically or statistically forced by any fitted parameter inside the model equations. No self-citation chain, ansatz smuggling, or renaming of known results occurs for the performance assertions. The derivation chain is therefore self-contained and non-circular.
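
The causal, dilated convolution that the rationale refers to can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: it omits the gated nonlinearity, the residual and skip connections, and the softmax output. The function names `causal_dilated_conv` and `stack`, the kernel weights, and the dilations `[1, 2, 4]` are assumptions made for this example.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: y[t] only uses x[t], x[t - d], x[t - 2d], ...

    Samples before the start of the sequence are treated as zero.
    """
    k = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - (k - 1 - j) * dilation  # never looks ahead of t
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def stack(x, dilations, weights=(0.5, 0.5)):
    """Apply one causal dilated convolution per entry in `dilations`."""
    h = list(x)
    for d in dilations:
        h = causal_dilated_conv(h, weights, d)
    return h


# An impulse at t=0 spreads over exactly 8 output positions with
# kernel size 2 and dilations 1, 2, 4 (receptive field 1 + 1 + 2 + 4).
impulse = [1.0] + [0.0] * 9
print(stack(impulse, [1, 2, 4], weights=(1.0, 1.0)))
# eight 1.0 values followed by two 0.0 values
```

In this sketch the output at position t depends on inputs up to and including t, so a predictive model would use it to score the sample at t + 1. That shift is what keeps the next-sample likelihood free of information from the future.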

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of an autoregressive neural architecture with dilated convolutions for high-rate audio, plus several architectural and training choices that function as free parameters tuned empirically.

free parameters (2)
  • dilation schedule and network depth
    Hyperparameters selected to achieve sufficient receptive field for audio dependencies while enabling efficient training.
  • conditioning mechanisms for text and speaker
    Specific ways to incorporate external inputs chosen to enable multi-speaker and TTS functionality.
axioms (2)
  • domain assumption: Raw audio waveforms can be modeled as autoregressive sequences where each sample depends statistically on all previous samples.
    Stated directly in the abstract as the core generative approach.
  • domain assumption: Dilated convolutions efficiently capture long-range temporal dependencies in audio data at high sample rates.
    Invoked to justify training feasibility on tens of thousands of samples per second.
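
To make the link between the dilation schedule and the receptive field concrete, the following sketch computes how many past samples a stack of causal dilated convolutions can see. The helper names `receptive_field` and `doubling_schedule`, the kernel size of 2, the doubling schedule, and the 16 kHz sample rate are assumptions for illustration and are not taken from this review.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def doubling_schedule(max_exponent, repeats):
    """Dilations 1, 2, 4, ..., 2**max_exponent, repeated `repeats` times."""
    block = [2 ** i for i in range(max_exponent + 1)]
    return block * repeats


one_block = doubling_schedule(9, 1)     # 1, 2, 4, ..., 512
three_blocks = doubling_schedule(9, 3)  # the same block repeated three times

print(receptive_field(one_block))     # 1024
print(receptive_field(three_blocks))  # 3070

# Converted to time at an assumed 16 kHz sample rate.
sample_rate_hz = 16000
print(receptive_field(three_blocks) / sample_rate_hz, "seconds")
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth within a block, while repeating blocks adds to it only linearly. This is why the schedule and the depth act as a coupled pair of tuning knobs.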

pith-pipeline@v0.9.0 · 5471 in / 1466 out tokens · 69206 ms · 2026-05-12T20:23:22.651711+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  2. Efficiently Modeling Long Sequences with Structured State Spaces

    cs.LG 2021-10 unverdicted novelty 8.0

    S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...

  3. DiffWave: A Versatile Diffusion Model for Audio Synthesis

    eess.AS 2020-09 unverdicted novelty 8.0

    DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming pri...

  4. Denoising Diffusion Probabilistic Models

    cs.LG 2020-06 accept novelty 8.0

    Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.

  5. Neural network modeling of many-body super- and sub-radiant dynamics

    quant-ph 2026-05 unverdicted novelty 7.0

    Neural quantum states simulate dissipative many-body emission dynamics for approximately 40 atoms in dense 1D and 2D arrays, revealing prominent subradiant behavior at late times.

  6. MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

    cs.SD 2026-05 unverdicted novelty 7.0

    MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

  7. DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

    eess.AS 2026-04 unverdicted novelty 7.0

    DiffAnon introduces the first diffusion model for voice anonymization that supplies structured, continuous, inference-time control over prosody preservation via classifier-free guidance on RVQ semantic embeddings.

  8. Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

    eess.AS 2026-04 unverdicted novelty 7.0

    Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.

  9. ReLU Networks for Exact Generation of Similar Graphs

    cs.LG 2026-04 unverdicted novelty 7.0

    Constant-depth ReLU networks of size O(n²d) exist that deterministically generate graphs within edit distance d from any given n-vertex input graph.

  10. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  11. Chronos: Learning the Language of Time Series

    cs.LG 2024-03 conditional novelty 7.0

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  12. High Fidelity Neural Audio Compression

    eess.AS 2022-10 accept novelty 7.0

    EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...

  13. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  14. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  15. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

  16. Progressive Growing of GANs for Improved Quality, Stability, and Variation

    cs.NE 2017-10 accept novelty 7.0

    Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.

  17. Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

    eess.AS 2026-04 unverdicted novelty 6.0

    Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

  18. AIBuildAI: An AI Agent for Automatically Building AI Models

    cs.AI 2026-04 unverdicted novelty 6.0

    AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.

  19. A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

    cs.SD 2026-04 unverdicted novelty 6.0

    A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.

  20. Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space

    cs.LG 2026-04 unverdicted novelty 6.0

    FOT-CFM generates turbulent fields in function space with superior high-order statistics and energy spectra on Navier-Stokes, Kolmogorov flow, and Hasegawa-Wakatani equations compared to baselines.

  21. Borderless Long Speech Synthesis

    cs.SD 2026-03 unverdicted novelty 6.0

    Borderless Long Speech Synthesis unifies voice design, multi-speaker TTS, and long-form generation via Global-Sentence-Token annotations, CoT reasoning, and a Structured Semantic Interface for agent-centric control.

  22. Is Conditional Generative Modeling all you need for Decision-Making?

    cs.LG 2022-11 unverdicted novelty 6.0

    Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

  23. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    cs.LG 2021-04 accept novelty 6.0

    Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.

  24. VideoGPT: Video Generation using VQ-VAE and Transformers

    cs.CV 2021-04 accept novelty 6.0

    VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.

  25. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    cs.CL 2020-06 unverdicted novelty 6.0

    GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.

  26. Sessa: Selective State Space Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  27. Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

    cs.LG 2026-04 unverdicted novelty 5.0

    CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...

  28. Applied AI-Enhanced RF Interference Rejection

    eess.SP 2026-04 unverdicted novelty 5.0

    Autoregressive transformer decoders suppress OFDM interference in FM radio signals to restore intelligible speech with low latency on GPUs like Jetson AGX Orin.

  29. Federated Parameter-Efficient Adaptation for Interference Mitigation at the Wireless Edge

    cs.NI 2026-04 unverdicted novelty 4.0

    Federated LoRA on TCNs for wireless interference suppression reduces per-round communication up to 20x while delivering 12.6% average BER improvement comparable to local adaptation.

  30. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  31. AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

    cs.SD 2026-04 unverdicted novelty 3.0

    AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

  32. Dynamic Forecasting and Temporal Feature Evolution of Stock Repurchases in Listed Companies Using Attention-Based Deep Temporal Networks

    q-fin.ST 2026-03 unverdicted novelty 3.0

    A TCN plus Attention-LSTM model trained on 2014-2024 Chinese A-share data outperforms static baselines and identifies prolonged undervaluation as the long-term driver and sudden cash-flow increases as the short-term t...

  33. Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers

    math.OC 2026-04 unverdicted novelty 2.0

    A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 33 Pith papers · 1 internal anchor

  1. [1]

    Vocaine the vocoder and applications is speech synthesis

    Agiomyrgiannakis, Yannis. Vocaine the vocoder and applications is speech synthesis. In ICASSP, pp.\ 4230--4234, 2015

  2. [2]

    Mixture density networks

    Bishop, Christopher M. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994

  3. [3]

    Semantic image segmentation with deep convolutional nets and fully connected CRF s

    Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRF s. In ICLR, 2015. URL http://arxiv.org/abs/1412.7062

  4. [4]

    The Vowel: I ts Nature and Structure

    Chiba, Tsutomu and Kajiyama, Masato. The Vowel: I ts Nature and Structure . Tokyo-Kaiseikan, 1942

  5. [5]

    Remaking speech

    Dudley, Homer. Remaking speech. The Journal of the Acoustical Society of America, 11 0 (2): 0 169--177, 1939

  6. [6]

    An implementation of the ``algorithme \`a trous'' to compute the wavelet transform

    Dutilleux, Pierre. An implementation of the ``algorithme \`a trous'' to compute the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp.\ 298--304. Springer Berlin Heidelberg, 1989

  7. [7]

    TTS synthesis with bidirectional LSTM based recurrent neural networks

    Fan, Yuchen, Qian, Yao, and Xie, Feng-Long, Soong Frank K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Interspeech, pp.\ 1964--1968, 2014

  8. [8]

    Acoustic Theory of Speech Production

    Fant, Gunnar. Acoustic Theory of Speech Production. Mouton De Gruyter, 1970

  9. [9]

    DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM

    Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathon G., and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM . NIST speech disc 1-1.1. NASA STI/Recon technical report, 93, 1993

  10. [10]

    Recent advances in G oogle real-time HMM -driven unit selection synthesizer

    Gonzalvo, Xavi, Tazari, Siamak, Chan, Chun-an, Becker, Markus, Gutkin, Alexander, and Silen, Hanna. Recent advances in G oogle real-time HMM -driven unit selection synthesizer. In Interspeech, 2016. URL http://research.google.com/pubs/pub45564.html

  11. [11]

    Deep Residual Learning for Image Recognition

    He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015

  12. [12]

    and Schmidhuber, J

    Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9 0 (8): 0 1735--1780, 1997

  13. [13]

    A real-time algorithm for signal analysis with the help of the wavelet transform

    Holschneider, Matthias, Kronland-Martinet, Richard, Morlet, Jean, and Tchamitchian, Philippe. A real-time algorithm for signal analysis with the help of the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp.\ 286--297. Springer Berlin Heidelberg, 1989

  14. [14]

    Speech acoustic modeling from raw multichannel waveforms

    Hoshen, Yedid, Weiss, Ron J., and Wilson, Kevin W. Speech acoustic modeling from raw multichannel waveforms. In ICASSP, pp.\ 4624--4628. IEEE, 2015

  15. [15]

    and Black, Alan W

    Hunt, Andrew J. and Black, Alan W. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP, pp.\ 373--376, 1996

  16. [16]

    Unbiased estimation of log spectrum

    Imai, Satoshi and Furuichi, Chieko. Unbiased estimation of log spectrum. In EURASIP, pp.\ 203--206, 1988

  17. [17]

    Line spectrum representation of linear predictor coefficients of speech signals

    Itakura, Fumitada. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoust. Society of America, 57 0 (S1): 0 S35--S35, 1975

  18. [18]

    A statistical method for estimation of speech spectral density and formant frequencies

    Itakura, Fumitada and Saito, Shuzo. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53–A: 0 35--42, 1970

  19. [19]

    Recommendation G

    ITU-T. Recommendation G . 711. Pulse Code Modulation (PCM) of voice frequencies, 1988

  20. [20]

    Exploring the Limits of Language Modeling

    J \' o zefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410

  21. [21]

    Mixture autoregressive hidden M arkov models for speech signals

    Juang, Biing-Hwang and Rabiner, Lawrence. Mixture autoregressive hidden M arkov models for speech signals. IEEE Trans. Acoust. Speech Signal Process., pp.\ 1404--1413, 1985

  22. [22]

    Speech analysis with multi-kernel linear prediction

    Kameoka, Hirokazu, Ohishi, Yasunori, Mochihashi, Daichi, and Le Roux, Jonathan. Speech analysis with multi-kernel linear prediction. In Spring Conference of ASJ, pp.\ 499--502, 2010. (in Japanese)

  23. [23]

    Text-to-speech conversion with neural networks: A recurrent TDNN approach

    Karaali, Orhan, Corrigan, Gerald, Gerson, Ira, and Massey, Noel. Text-to-speech conversion with neural networks: A recurrent TDNN approach. In Eurospeech, pp.\ 561--564, 1997

  24. [24]

    Kawahara, Hideki, Masuda-Katsuse, Ikuyo, and de Cheveign \'e , Alain. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f_ 0 extraction: possible role of a repetitive structure in sounds. Speech Commn., 27: 0 187--207, 1999

  25. [25]

    Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT

    Kawahara, Hideki, Estill, Jo, and Fujimura, Osamu. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT . In MAVEBA, pp.\ 13--15, 2001

  26. [26]

    Input-agreement: a new mechanism for collecting data using human computation games

    Law, Edith and Von Ahn, Luis. Input-agreement: a new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.\ 1197--1206. ACM, 2009

  27. [27]

    Maia, Ranniery, Zen, Heiga, and Gales, Mark J. F. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In ISCA SSW7, pp.\ 88--93, 2010

  28. [28]

    WORLD : A vocoder-based high-quality speech synthesis system for real-time applications

    Morise, Masanori, Yokomori, Fumiya, and Ozawa, Kenji. WORLD : A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., E99-D 0 (7): 0 1877--1884, 2016

  29. [29]

    Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones

    Moulines, Eric and Charpentier, Francis. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commn., 9: 0 453--467, 1990

  30. [30]

    and Black, Alan W

    Muthukumar, P. and Black, Alan W. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014

  31. [31]

    Rectified linear units improve restricted B oltzmann machines

    Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted B oltzmann machines. In ICML, pp.\ 807--814, 2010

  32. [32]

    Integration of spectral feature extraction and modeling for HMM -based speech synthesis

    Nakamura, Kazuhiro, Hashimoto, Kei, Nankaku, Yoshihiko, and Tokuda, Keiichi. Integration of spectral feature extraction and modeling for HMM -based speech synthesis. IEICE Trans. Inf. Syst., E97-D 0 (6): 0 1438--1448, 2014

  33. [33]

    Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks

    Palaz, Dimitri, Collobert, Ronan, and Magimai-Doss, Mathew. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, pp.\ 1766--1770, 2013

  34. [34]

    Nonlinear filter design: methodologies and challenges

    Peltonen, Sari, Gabbouj, Moncef, and Astola, Jaakko. Nonlinear filter design: methodologies and challenges. In IEEE ISPA, pp.\ 102--107, 2001

  35. [35]

    Linear predictive hidden M arkov models and the speech signal

    Poritz, Alan B. Linear predictive hidden M arkov models and the speech signal. In ICASSP, pp.\ 1291--1294, 1982

  36. [36]

    Fundamentals of Speech Recognition

    Rabiner, Lawrence and Juang, Biing-Hwang. Fundamentals of Speech Recognition. PrenticeHall, 1993

  37. [37]

    ATR -talk speech synthesis system

    Sagisaka, Yoshinori, Kaiki, Nobuyoshi, Iwahashi, Naoto, and Mimura, Katsuhiko. ATR -talk speech synthesis system. In ICSLP, pp.\ 483--486, 1992

  38. [38]

    Learning the speech front-end with raw waveform CLDNN s

    Sainath, Tara N., Weiss, Ron J., Senior, Andrew, Wilson, Kevin W., and Vinyals, Oriol. Learning the speech front-end with raw waveform CLDNN s. In Interspeech, pp.\ 1--5, 2015

  39. [39]

    A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis

    Takaki, Shinji and Yamagishi, Junichi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In ICASSP, pp.\ 5535--5539, 2016

  40. [40]

    Postfilters to modify the modulation spectrum for statistical parametric speech synthesis

    Takamichi, Shinnosuke, Toda, Tomoki, Black, Alan W., Neubig, Graham, Sakriani, Sakti, and Nakamura, Satoshi. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process., 24 0 (4): 0 755--767, 2016

  41. [41]

    Generative image modeling using spatial LSTM s

    Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTM s. In NIPS, pp.\ 1927--1935, 2015

  42. [42]

    A speech parameter generation algorithm considering global variance for HMM-based speech synthesis

    Toda, Tomoki and Tokuda, Keiichi. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816–824, 2007

  43. [43]

    Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM

    Toda, Tomoki and Tokuda, Keiichi. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In ICASSP, pp. 3925–3928, 2008

  44. [44]

    Speech synthesis as a statistical machine learning problem

    Tokuda, Keiichi. Speech synthesis as a statistical machine learning problem. http://www.sp.nitech.ac.jp/ tokuda/tokuda_asru2011_for_pdf.pdf, 2011. Invited talk given at ASRU

  45. [45]

    Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis

    Tokuda, Keiichi and Zen, Heiga. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In ICASSP, pp. 4215–4219, 2015

  46. [46]

    Directly modeling voiced and unvoiced components in speech waveforms by neural networks

    Tokuda, Keiichi and Zen, Heiga. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In ICASSP, pp. 5640–5644, 2016

  47. [47]

    Speech synthesis using artificial neural networks trained on cepstral coefficients

    Tuerk, Christine and Robinson, Tony. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pp. 1713–1716, 1993

  48. [48]

    Acoustic modeling with deep neural networks using raw time signal for LVCSR

    Tüske, Zoltán, Golik, Pavel, Schlüter, Ralf, and Ney, Hermann. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Interspeech, pp. 890–894, 2014

  49. [49]

    Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE

    Uria, Benigno, Murray, Iain, Renals, Steve, Valentini-Botinhao, Cassia, and Bridle, John. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In ICASSP, pp. 4465–4469, 2015

  50. [50]

    Pixel Recurrent Neural Networks

    van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a

  51. [51]

    Conditional image generation with PixelCNN decoders

    van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016b. URL http://arxiv.org/abs/1606.05328

  52. [52]

    Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis

    Wu, Yi-Jian and Tokuda, Keiichi. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Interspeech, pp. 577–580, 2008

  53. [53]

    English multi-speaker corpus for CSTR voice cloning toolkit, 2012

    Yamagishi, Junichi. English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html

  54. [54]

    Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems

    Yoshimura, Takayoshi. Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Nagoya Institute of Technology, 2002

  55. [55]

    Multi-Scale Context Aggregation by Dilated Convolutions

    Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. URL http://arxiv.org/abs/1511.07122

  56. [56]

    An example of context-dependent label format for HMM-based speech synthesis in English, 2006

    Zen, Heiga. An example of context-dependent label format for HMM-based speech synthesis in English, 2006. URL http://hts.sp.nitech.ac.jp/?Download

  57. [57]

    Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features

    Zen, Heiga, Tokuda, Keiichi, and Kitamura, Tadashi. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features. Comput. Speech Lang., 21(1):153–173, 2007

  58. [58]

    Statistical parametric speech synthesis

    Zen, Heiga, Tokuda, Keiichi, and Black, Alan W. Statistical parametric speech synthesis. Speech Commn., 51(11):1039–1064, 2009

  59. [59]

    Statistical parametric speech synthesis using deep neural networks

    Zen, Heiga, Senior, Andrew, and Schuster, Mike. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pp. 7962–7966, 2013

  60. [60]

    Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices

    Zen, Heiga, Agiomyrgiannakis, Yannis, Egberts, Niels, Henderson, Fergus, and Szczepaniak, Przemysław. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Interspeech, 2016. URL https://arxiv.org/abs/1606.06061