arxiv: 1907.07769 · v1 · pith:OG3OOOANnew · submitted 2019-07-15 · 📡 eess.AS · cs.LG · cs.SD · stat.ML

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

Pith reviewed 2026-05-24 21:28 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD · stat.ML
keywords: voice conversion · sequence to sequence · hierarchical encoder · attention mechanism · autoencoder pretraining · mel spectrogram · wavenet vocoder · limited data

The pith

A seq2seq model pretrained as an autoencoder on single-speaker data adapts to perform voice conversion on limited multispeaker datasets using mel spectrograms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a voice conversion method based on sequence-to-sequence recurrent networks. It employs a hierarchical encoder to process input audio and an attention-based decoder inspired by TTS systems. Because large multispeaker voice conversion datasets are scarce, the model is first trained as an autoencoder on a large single-speaker corpus and then adapted to smaller parallel datasets. This approach uses only mel spectrograms rather than explicit pitch, duration, or linguistic features, with a WaveNet vocoder for waveform synthesis. A sympathetic reader would care because it addresses the data scarcity problem in voice conversion by leveraging transfer from larger single-speaker resources.

Core claim

Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.

What carries the argument

A hierarchical encoder that summarizes input audio frames, paired with an attention-based decoder. The model is pretrained as an autoencoder and then adapted to parallel voice conversion pairs.

If this is right

  • Voice conversion operates directly on mel spectrograms without explicit F0, duration or linguistic features.
  • Pretraining on large single-speaker data followed by adaptation succeeds on smaller parallel multispeaker sets.
  • A WaveNet vocoder converts output mel frames back to audio waveforms.
  • The system works in the parallel setting where source-target audio pairs are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining-plus-adaptation pattern could reduce data needs for other paired audio translation tasks.
  • Hierarchical frame summarization may separate content from speaker traits more readily than flat encoders.
  • Replacing hand-crafted features with mel spectrograms simplifies the overall conversion pipeline.

Load-bearing premise

Pretraining the network as an autoencoder on a large single-speaker dataset enables effective adaptation to smaller multispeaker voice conversion datasets.
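
The premise can be made concrete with a minimal sketch of how the two training stages differ in the pairs they feed the network. This is an illustration, not the authors' code: the helper names `autoencoder_pairs` and `parallel_pairs` and the toy two-bin "mel" frames are hypothetical, and the only facts taken from the paper are that pretraining uses the same utterance as source and target, and that adaptation uses parallel source-target pairs.

```python
from typing import List, Sequence, Tuple

# A mel spectrogram as a list of frames; each frame is a list of mel-bin values.
Mel = List[List[float]]
Pair = Tuple[Mel, Mel]


def autoencoder_pairs(utterances: Sequence[Mel]) -> List[Pair]:
    """Stage 1 (pretraining): source and target are the same utterance."""
    return [(mel, mel) for mel in utterances]


def parallel_pairs(source: Sequence[Mel], target: Sequence[Mel]) -> List[Pair]:
    """Stage 2 (adaptation): pair each source utterance with the target
    speaker's recording of the same sentence."""
    if len(source) != len(target):
        raise ValueError("parallel data needs one target utterance per source utterance")
    return list(zip(source, target))


# Toy data (hypothetical values): 2-bin frames, purely illustrative.
large_single_speaker = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
small_source = [[[1.0, 1.1], [1.2, 1.3]]]
small_target = [[[2.0, 2.1], [2.2, 2.3], [2.4, 2.5]]]

pretrain_set = autoencoder_pairs(large_single_speaker)
adapt_set = parallel_pairs(small_source, small_target)
```

Note that the paired utterances in `adapt_set` may differ in length; in a seq2seq model the attention mechanism, rather than a fixed frame-to-frame mapping, is what relates source frames to target frames.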

What would settle it

Listening tests or speaker similarity scores on held-out parallel pairs showing the adapted outputs fail to match target speaker identity would falsify the adaptation claim.

Figures

Figures reproduced from arXiv: 1907.07769 by Francois Charette, Gint Puskorius, Praveen Narayanan, Punarjay Chakravarty.

Figure 1: System Diagram: Our Attention based Encoder-Decoder architecture for Voice Conversion takes in a mel-spectrogram for the source speaker and outputs the mel-spectrogram for the target speaker.
Figure 2: The Pre-net and the CBH layers that are used to process the input mel-spectrogram frames. Output tensor sizes at each step of processing are indicated by the side of the unit.
Figure 3: Hierarchical Bi-directional Recurrent Encoder with an indication of the tensor sizes at each step. The number of hidden units in each GRU is 150. Each pyramidal GRU unit (GRU 1 and 2) decreases the sequence length by 1/2. Left-right and right-left GRU units each output a 150xT matrix, that are concatenated to give a 300xT matrix, with T as input sequence length.
Figure 5: Feature extractor, depicted through attention alignment and mel spectrograms produced by training the network to produce ljspeech voices, with source and target being the same.
Figure 6: Voice conversion from male (bdl) to female (slt) voice, depicted through attention alignment and mel spectrograms produced by adapting to small CMU Arctic voice corpus.
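
The tensor sizes quoted in the Figure 3 caption can be checked with a little shape arithmetic. The sketch below only reproduces what the caption states: 150 hidden units per direction concatenated to 300 features, and two pyramidal GRU units that each halve the sequence length. The function names are hypothetical, and rounding odd lengths down is an assumption, since the caption does not say how odd lengths are handled.

```python
from typing import List


def bidirectional_feature_size(hidden_units: int = 150) -> int:
    """Left-right and right-left outputs are concatenated along the feature axis."""
    return 2 * hidden_units


def pyramidal_lengths(num_frames: int, layers: int = 2) -> List[int]:
    """Sequence length before the pyramid and after each pyramidal layer.

    Assumption: an odd length is rounded down when halved.
    """
    lengths = [num_frames]
    for _ in range(layers):
        lengths.append(lengths[-1] // 2)
    return lengths


T = 400  # hypothetical number of input mel frames
print(bidirectional_feature_size(), pyramidal_lengths(T))
# prints: 300 [400, 200, 100]
```

With two halving layers, the attention mechanism has roughly a quarter as many encoder positions to attend over as there are input frames.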
Original abstract

We present a voice conversion solution using recurrent sequence to sequence modeling for DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting when <source, target> audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a hierarchical sequence-to-sequence architecture for parallel voice conversion. A hierarchical encoder summarizes input audio frames while an attention-based decoder (inspired by recent TTS models) generates output. To address the scarcity of large multispeaker voice conversion datasets, the network is first pretrained as an autoencoder on a large single-speaker corpus and then adapted to smaller parallel multispeaker data. The system operates directly on mel spectrograms (rather than F0, duration or linguistic features) and uses a WaveNet vocoder to synthesize waveforms.

Significance. If the pretraining-plus-adaptation strategy reliably produces transferable representations, the work would offer a practical route to high-quality voice conversion under limited parallel data by exploiting abundant single-speaker corpora, providing an alternative to conventional feature-engineering pipelines.

major comments (2)
  1. [Abstract] Abstract: the load-bearing claim that pretraining as a single-speaker autoencoder followed by adaptation yields effective multispeaker voice conversion is stated without any description of the adaptation procedure (which layers are updated, learning-rate schedule, auxiliary losses, or whether speaker embeddings are introduced). This omission directly prevents assessment of whether the reconstruction objective actually encourages the required speaker-invariant content representations.
  2. [Abstract] Abstract: no quantitative results, baselines, objective metrics (e.g., MCD, WER), subjective scores, or dataset statistics are supplied, so the central assertion that the method solves the limited-data problem cannot be evaluated against evidence.
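
For readers unfamiliar with the objective metric named in the second comment, here is a minimal sketch of mel-cepstral distortion (MCD) in its commonly used form, (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) averaged over frames. This definition comes from general practice, not from the paper or the report. The sketch assumes the two sequences are already time-aligned frame by frame and that the caller has dropped the energy (0th) coefficient; the function name is hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(reference: Sequence[Sequence[float]],
                            converted: Sequence[Sequence[float]]) -> float:
    """Mean MCD in dB over frames of two time-aligned mel-cepstral sequences."""
    if not reference or len(reference) != len(converted):
        raise ValueError("sequences must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, conv_frame in zip(reference, converted):
        if len(ref_frame) != len(conv_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance between the two coefficient vectors.
        squared = sum((r - c) ** 2 for r, c in zip(ref_frame, conv_frame))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)


# Toy check (hypothetical values): identical sequences give zero distortion.
same = mel_cepstral_distortion([[0.5, -0.2]], [[0.5, -0.2]])
one_unit = mel_cepstral_distortion([[1.0]], [[0.0]])
```

Lower values mean the converted speech is spectrally closer to the reference; the metric says nothing about naturalness, which is why the referee also asks for subjective scores.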

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major comment below and will revise the abstract accordingly.

  1. Referee: [Abstract] Abstract: the load-bearing claim that pretraining as a single-speaker autoencoder followed by adaptation yields effective multispeaker voice conversion is stated without any description of the adaptation procedure (which layers are updated, learning-rate schedule, auxiliary losses, or whether speaker embeddings are introduced). This omission directly prevents assessment of whether the reconstruction objective actually encourages the required speaker-invariant content representations.

    Authors: We agree the abstract is too terse on the adaptation step. The manuscript body details the procedure (pretrain full autoencoder on single-speaker data, then fine-tune selected decoder layers on parallel multispeaker pairs while freezing the encoder). We will expand the abstract with one sentence summarizing the adaptation (layers updated, learning-rate reduction, no auxiliary losses or speaker embeddings) so readers can immediately assess the claim. revision: yes

  2. Referee: [Abstract] Abstract: no quantitative results, baselines, objective metrics (e.g., MCD, WER), subjective scores, or dataset statistics are supplied, so the central assertion that the method solves the limited-data problem cannot be evaluated against evidence.

    Authors: Abstracts conventionally omit numbers for brevity. The manuscript reports MCD, WER, MOS scores, and dataset sizes in the experiments section with comparisons to baselines. To address the concern we will add a concise results clause to the abstract citing the key objective and subjective improvements on the limited-data setting. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architecture is standard adaptation without self-referential reduction

The paper describes a hierarchical seq2seq model with attention decoder, pretrained as autoencoder on single-speaker data then adapted to multispeaker VC using mel spectrograms. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the text. The approach relies on external advances in NMT/TTS/ASR without reducing any claim to its own inputs by construction. This is a typical non-circular empirical adaptation strategy.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters, invented entities or new axioms; relies on standard assumptions of recurrent seq2seq models, attention mechanisms and autoencoder transfer from prior literature in NMT/TTS/ASR.

axioms (1)
  • Domain assumption: standard assumptions of recurrent neural networks and attention mechanisms apply to audio sequences for voice conversion.
    The paper builds directly on advances in NMT, TTS and ASR without stating new axioms.

