Hierarchical Sequence to Sequence Voice Conversion with Limited Data

Francois Charette; Gint Puskorius; Praveen Narayanan; Punarjay Chakravarty

arxiv: 1907.07769 · v1 · pith:OG3OOOANnew · submitted 2019-07-15 · 📡 eess.AS · cs.LG · cs.SD · stat.ML

Praveen Narayanan, Punarjay Chakravarty, Francois Charette, Gint Puskorius

Pith reviewed 2026-05-24 21:28 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD · stat.ML

keywords voice conversion · sequence to sequence · hierarchical encoder · attention mechanism · autoencoder pretraining · mel spectrogram · wavenet vocoder · limited data

The pith

A seq2seq model pretrained as an autoencoder on single-speaker data adapts to perform voice conversion on limited multispeaker datasets using mel spectrograms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a voice conversion method based on sequence-to-sequence recurrent networks. It employs a hierarchical encoder to process input audio and an attention-based decoder inspired by TTS systems. Because large multispeaker voice conversion datasets are scarce, the model is first trained as an autoencoder on a large single-speaker corpus and then adapted to smaller parallel datasets. This approach uses only mel spectrograms rather than explicit pitch, duration, or linguistic features, with a wavenet vocoder for waveform synthesis. A sympathetic reader would care because it addresses the data scarcity problem in voice conversion by leveraging transfer from larger single-speaker resources.

Core claim

Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.

What carries the argument

Hierarchical encoder to summarize input audio frames paired with an attention-based decoder, pretrained as autoencoder then adapted to parallel voice conversion pairs.
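
As a rough illustration of what an attention-based decoder does with the encoder's summary, the sketch below computes attention weights and a context vector for a single decoder step. It uses plain dot-product scoring, which is an assumption made for brevity: the text above does not say which scoring function the paper uses, and the names `attend`, `query`, and `encoder_states` are hypothetical.

```python
import math


def attend(query, encoder_states):
    """One attention step: softmax over dot-product scores, then a weighted sum.

    query          -- decoder state for the current step (list of floats)
    encoder_states -- one vector per summarized input frame (list of lists)
    Dot-product scoring is an assumption, not the paper's confirmed choice.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# The query points along the first axis, so the first encoder state dominates.
weights, context = attend([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
```

The weights form a distribution over encoder positions, so a plot of them across decoder steps gives the kind of attention alignment shown in the figures.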

If this is right

Voice conversion operates directly on mel spectrograms without explicit F0, duration or linguistic features.
Pretraining on large single-speaker data followed by adaptation succeeds on smaller parallel multispeaker sets.
WaveNet vocoder converts output mel frames back to audio waveforms.
The system works in the parallel setting where source-target audio pairs are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pretraining-plus-adaptation pattern could reduce data needs for other paired audio translation tasks.
Hierarchical frame summarization may separate content from speaker traits more readily than flat encoders.
Replacing hand-crafted features with mel spectrograms simplifies the overall conversion pipeline.

Load-bearing premise

Pretraining the network as an autoencoder on a large single-speaker dataset enables effective adaptation to smaller multispeaker voice conversion datasets.

What would settle it

Listening tests or speaker similarity scores on held-out parallel pairs showing the adapted outputs fail to match target speaker identity would falsify the adaptation claim.

Figures

Figures reproduced from arXiv: 1907.07769 by Francois Charette, Gint Puskorius, Praveen Narayanan, Punarjay Chakravarty.

**Figure 1.** System diagram: our attention-based encoder-decoder architecture for voice conversion takes in a mel spectrogram for the source speaker and outputs the mel spectrogram for the target speaker.

**Figure 2.** The Pre-net and the CBH layers that are used to process the input mel-spectrogram frames. Output tensor sizes at each step of processing are indicated by the side of the unit.

**Figure 3.** Hierarchical bi-directional recurrent encoder, with tensor sizes indicated at each step. The number of hidden units in each GRU is 150. Each pyramidal GRU unit (GRU 1 and 2) halves the sequence length. The left-to-right and right-to-left GRU units each output a 150×T matrix; these are concatenated to give a 300×T matrix, where T is the input sequence length.

**Figure 5.** Feature extractor, depicted through attention alignment and mel spectrograms produced by training the network to produce LJSpeech voices, with source and target being the same.

**Figure 6.** Voice conversion from male (bdl) to female (slt) voice, depicted through attention alignment and mel spectrograms produced by adapting to the small CMU Arctic voice corpus.
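
The Figure 3 caption says each pyramidal GRU unit halves the sequence length. One common way to get that effect, used in the Listen, Attend and Spell encoder, is to concatenate adjacent frames before each recurrent layer. The sketch below shows that bookkeeping only; whether the paper pairs frames in exactly this way is an assumption, and `pair_adjacent` and `trace_lengths` are hypothetical helper names.

```python
def pair_adjacent(frames):
    """Concatenate each pair of neighbouring frames, halving the sequence length.

    An odd-length sequence is zero-padded by one frame first (an assumption).
    """
    if len(frames) % 2 == 1:
        frames = frames + [[0.0] * len(frames[0])]
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


def trace_lengths(seq_len, pyramid_layers=2):
    """Sequence length before and after each pyramidal layer."""
    lengths = [seq_len]
    for _ in range(pyramid_layers):
        seq_len = (seq_len + 1) // 2  # ceiling division matches the zero-padding
        lengths.append(seq_len)
    return lengths


frames = [[1.0], [2.0], [3.0], [4.0]]
halved = pair_adjacent(frames)  # two frames, each twice as wide
print(trace_lengths(100))       # [100, 50, 25]
```

With two pyramidal layers, as in the caption (GRU 1 and 2), the attention mechanism has to align over a quarter as many encoder positions as there are input frames.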

Original abstract

We present a voice conversion solution using recurrent sequence to sequence modeling for DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting when <source, target> audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use $F_0$, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Seq2seq voice conversion via hierarchical encoder and attention decoder, pretrained as single-speaker autoencoder then adapted to small parallel multispeaker data using only mels.

The main point is that this paper adapts a hierarchical seq2seq encoder plus attention decoder from TTS work to parallel voice conversion, pretraining the model as an autoencoder on a large single-speaker set before fine-tuning on limited multispeaker pairs, and it does this while using only mel spectrograms instead of F0, duration, or linguistic features. The pretraining step is their explicit response to the data scarcity problem in VC. That combination is new relative to the cited prior work, and the paper gives a clear description of how the pieces fit together. Staying in the mel domain keeps the system simpler and avoids extra feature extractors, which is a practical choice. The motivation around limited data is handled honestly.

The soft spot is the transfer itself. The abstract gives no detail on which layers get updated during adaptation, what learning rates are used, or whether any auxiliary losses prevent speaker leakage from the pretraining stage. The stress-test concern about single-speaker reconstruction not pushing for invariant content representations is reasonable and needs checking against actual results. Without those experiments visible here, it is difficult to know whether the pretraining delivers the claimed benefit or just adds a step that could be replaced by direct training.

This is aimed at speech researchers already working on seq2seq models for synthesis or conversion who face small datasets. A reader in that group would get a usable architecture sketch and the pretraining tactic. It deserves a serious referee because the idea is a direct, grounded extension of existing methods to a real constraint in the subfield, even if the scope stays narrow. Send it to review so the experiments can be evaluated.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a hierarchical sequence-to-sequence architecture for parallel voice conversion. A hierarchical encoder summarizes input audio frames while an attention-based decoder (inspired by recent TTS models) generates output. To address the scarcity of large multispeaker voice conversion datasets, the network is first pretrained as an autoencoder on a large single-speaker corpus and then adapted to smaller parallel multispeaker data. The system operates directly on mel spectrograms (rather than F0, duration or linguistic features) and uses a WaveNet vocoder to synthesize waveforms.

Significance. If the pretraining-plus-adaptation strategy reliably produces transferable representations, the work would offer a practical route to high-quality voice conversion under limited parallel data by exploiting abundant single-speaker corpora, providing an alternative to conventional feature-engineering pipelines.

major comments (2)

[Abstract] The load-bearing claim that pretraining as a single-speaker autoencoder followed by adaptation yields effective multispeaker voice conversion is stated without any description of the adaptation procedure (which layers are updated, learning-rate schedule, auxiliary losses, or whether speaker embeddings are introduced). This omission directly prevents assessment of whether the reconstruction objective actually encourages the required speaker-invariant content representations.
[Abstract] No quantitative results, baselines, objective metrics (e.g., MCD, WER), subjective scores, or dataset statistics are supplied, so the central assertion that the method solves the limited-data problem cannot be evaluated against evidence.
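
For readers unfamiliar with the MCD metric named in the second comment, the sketch below computes mel-cepstral distortion in its commonly used form, (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames. It assumes the reference and converted sequences are already time-aligned and that the energy coefficient c0 has been dropped; the function name and toy inputs are hypothetical, and nothing here is taken from the paper's own evaluation.

```python
import math


def mel_cepstral_distortion(reference, converted):
    """Average MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Both arguments are lists of frames; each frame is a list of coefficients
    with c0 excluded. Alignment (e.g. by DTW) is assumed to have been done.
    """
    if len(reference) != len(converted) or not reference:
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref_frame, conv_frame in zip(reference, converted):
        squared = sum((r - c) ** 2 for r, c in zip(ref_frame, conv_frame))
        total += math.sqrt(squared)
    return scale * total / len(reference)


# Toy usage: identical sequences have zero distortion.
same = mel_cepstral_distortion([[0.1, 0.2]], [[0.1, 0.2]])
unit = mel_cepstral_distortion([[0.0]], [[1.0]])
```

Lower values mean the converted spectra are closer to the target recording; the metric says nothing about naturalness, which is why it is usually reported alongside listening tests.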

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major comment below and will revise the abstract accordingly.

Referee: [Abstract] The load-bearing claim that pretraining as a single-speaker autoencoder followed by adaptation yields effective multispeaker voice conversion is stated without any description of the adaptation procedure (which layers are updated, learning-rate schedule, auxiliary losses, or whether speaker embeddings are introduced). This omission directly prevents assessment of whether the reconstruction objective actually encourages the required speaker-invariant content representations.

Authors: We agree the abstract is too terse on the adaptation step. The manuscript body details the procedure (pretrain full autoencoder on single-speaker data, then fine-tune selected decoder layers on parallel multispeaker pairs while freezing the encoder). We will expand the abstract with one sentence summarizing the adaptation (layers updated, learning-rate reduction, no auxiliary losses or speaker embeddings) so readers can immediately assess the claim. revision: yes

Referee: [Abstract] No quantitative results, baselines, objective metrics (e.g., MCD, WER), subjective scores, or dataset statistics are supplied, so the central assertion that the method solves the limited-data problem cannot be evaluated against evidence.

Authors: Abstracts conventionally omit numbers for brevity. The manuscript reports MCD, WER, MOS scores, and dataset sizes in the experiments section with comparisons to baselines. To address the concern we will add a concise results clause to the abstract citing the key objective and subjective improvements on the limited-data setting. revision: yes
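
The first response above describes adaptation as fine-tuning selected decoder layers while the encoder stays frozen. A minimal sketch of that idea follows, using scalar stand-ins for weight tensors and a plain gradient step. The parameter names, gradient values, and learning rate are all hypothetical, and the procedure is the one stated in the simulated rebuttal, not one confirmed from the paper.

```python
def sgd_step(params, grads, trainable, lr):
    """Return updated parameters; names outside `trainable` stay frozen."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }


# Hypothetical pretrained weights and gradients from one parallel-data batch.
params = {"encoder.gru1": 0.50, "decoder.attention_rnn": 0.20, "decoder.out": -0.10}
grads = {"encoder.gru1": 1.00, "decoder.attention_rnn": 0.40, "decoder.out": -0.30}

# Only decoder parameters are adapted; the encoder keeps its pretrained values.
trainable = {name for name in params if name.startswith("decoder.")}
adapted = sgd_step(params, grads, trainable, lr=0.1)
```

Keeping the encoder fixed means the small parallel corpus only has to fit the decoder, which is one way to limit overfitting when the adaptation data is scarce.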

Circularity Check

0 steps flagged

No significant circularity; architecture is standard adaptation without self-referential reduction

The paper describes a hierarchical seq2seq model with attention decoder, pretrained as autoencoder on single-speaker data then adapted to multispeaker VC using mel spectrograms. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the text. The approach relies on external advances in NMT/TTS/ASR without reducing any claim to its own inputs by construction. This is a typical non-circular empirical adaptation strategy.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters, invented entities or new axioms; relies on standard assumptions of recurrent seq2seq models, attention mechanisms and autoencoder transfer from prior literature in NMT/TTS/ASR.

axioms (1)

Domain assumption: Standard assumptions of recurrent neural networks and attention mechanisms apply to audio sequences for voice conversion.
The paper builds directly on advances in NMT, TTS and ASR without stating new axioms.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 34 internal anchors

[1]

Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures

Introduction. Recently, sequence to sequence models have been adapted with great success in producing realistic sounding speech in TTS systems [1, 2, 3, 4, 5]. Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures. In TTS, the system takes in a text or phoneme sequence and outputs a speech representation as out...
[2]

Related Work. The traditional pipeline for parallel voice conversion is through use of Gaussian Mixture Models (GMMs) [6, 7, 8] or Deep Neural Networks (DNNs) [9, 10, 11, 12]. After first aligning source and target features using Dynamic Time Warping (DTW) [13], the model is trained so that it learns to produce the target given the source features for each...
[3]

The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]

Architecture. We use an attention based encoder-decoder network for our voice conversion task. The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]. The system takes in an audio representation (mel-spectrogram) as input, and encodes it into a hidden representation in recurrent fashion. This hidden representation is th...
[4]

However, before doing so, it is useful to have in mind an overall picture of how the data flows through the decoder stack

Decoder RNNs with residuality. We describe the components in more detail below. However, before doing so, it is useful to have in mind an overall picture of how the data flows through the decoder stack. To that end, we present a brief description of the calculations at a high level. The decoder’s task is to transform linguistic content from the source spe...
[5]

To get around this problem, we first pretrain our network as an autoencoder with a large single speaker TTS corpus [46], with the source and target voices being the same

Autoencoder pretraining and transfer learning. Voice conversion with DNNs for parallel data is a difficult undertaking owing to the lack of availability of large multispeaker voice conversion datasets. To get around this problem, we first pretrain our network as an autoencoder with a large single speaker TTS corpus [46], with the source and target voice...
[6]

We first pretrain the network with a large single-speaker corpus in which the source and the target are the same

Experimental setup. Our experimental procedure consists of two steps, as mentioned in section 4. We first pretrain the network with a large single-speaker corpus in which the source and the target are the same. After this, we allow the network to adapt to the desired source and target data. 5.1. Datasets: For autoencoder pretraining, we use the LJSpeech dat...
[7]

for text, PixelCNN [50] for images and Video Pixel Net

[8]

for videos. This type of architecture, at a high level, works on a temporal (in the sense that there is a certain temporal ordering of data) basis by stacking dilated convolutions with exponentially growing receptive field sizes (e.g. 2, 4, 8, 16). Masking is carried out so as to only allow information from the past. In wavenet, instead of masking, on...
[9]

These features serve as a useful starting point for transfer learning in the limited data corpus

Conclusions. In this work, we demonstrated a way to overcome data limitations (an all too common malady in the speech world) with a trick to extract linguistic features by pretraining with a large corpus so that it learns to reconstruct the input voice. These features serve as a useful starting point for transfer learning in the limited data corpus. Th...
[10]

Tacotron: Towards End-to-End Speech Synthesis

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end to end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017
[11]

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

J. Shen, R. Pang, R. J. Weiss, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017
[12]

Deep Voice: Real-time Neural Text-to-Speech

S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep voice: Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, 2017
[13]

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

S. O. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, and Y. Zhou, “Deep voice 2: Multi-speaker text-to-speech,” arXiv preprint arXiv:1705.08947, 2017
[14]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

W. Ping, K. Peng, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017
[15]

Spectral voice conversion for text-to-speech synthesis

A. Kain and M. Macon, “Spectral voice conversion for text-to-speech synthesis,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1. IEEE, 1998, pp. 285–288
[16]

Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory

T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” Trans. Audio, Speech and Lang. Proc., vol. 15, no. 8, pp. 2222–2235, Nov. 2007. [Online]. Available: https://doi.org/10.1109/TASL.2007.907344
[17]

Voice conversion using artificial neural networks

S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, “Voice conversion using artificial neural networks,” in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 3893–3896. [Online]. Available: https://doi.org/10....
[18]

Voice conversion using artificial neural networks

S. Desai, A. W. Black, and B. Yegnanarayana, “Voice conversion using artificial neural networks,” in IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, July 2010
[19]

Voice conversion using deep bidirectional long short-term memory

L. Sun, S. Yang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory,” in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 4869–4873
[20]

L. Sun, K. Li, S. Kang, and H. Meng, in IEEE International Conference on Multimedia and Expo, 2016
[21]

Müller, Information Retrieval for Music and Motion

M. Müller, Information Retrieval for Music and Motion. Springer, 2007
[22]

An overview of voice conversion systems

S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Commun., vol. 88, no. C, pp. 65–82, Apr. 2017. [Online]. Available: https://doi.org/10.1016/j.specom.2017.01.008
[23]

Robust hierarchical learning for non-negative matrix factorization with outliers

Y. Li, M. Sun, H. Van Hamme, X. Zhang, and J. Yang, “Robust hierarchical learning for non-negative matrix factorization with outliers,” IEEE Access, vol. 7, pp. 10546–10558, 2019
[24]

Exemplar-based voice conversion using sparse representation in noisy environments

R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 96, no. 10, pp. 1946–1953, 2013
[25]

Sequence-to-sequence acoustic modeling for voice conversion

J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” arXiv preprint arXiv:1810.06865, 2018
[26]

Improving sequence-to-sequence acoustic modeling by adding text-supervision

J. Zhang, Z. Ling, Y. Jiang, L. Liu, C. Liang, and L. Dai, “Improving sequence-to-sequence acoustic modeling by adding text-supervision,” CoRR, vol. abs/1811.08111, 2018. [Online]. Available: http://arxiv.org/abs/1811.08111
[27]

Listen, Attend and Spell

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015
[28]

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “Atts2s-vc: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” arXiv preprint arXiv:1811.04076, 2018
[29]

Convs2s-vc: Fully convolutional sequence-to-sequence voice conversion

H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, “Convs2s-vc: Fully convolutional sequence-to-sequence voice conversion,” arXiv preprint arXiv:1811.01609, 2018
[30]

Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention

H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” CoRR, vol. abs/1710.08969, 2017. [Online]. Available: http://arxiv.org/abs/1710.08969
[31]

Unpaired image-to-image translation using cycle-consistent adversarial networks

J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” CoRR, vol. abs/1703.10593, 2017. [Online]. Available: http://arxiv.org/abs/1703.10593
[32]

Auto-Encoding Variational Bayes

D. Kingma and M. Welling, “Autoencoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013

[33]

Generative Adversarial Networks

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014
[34]

Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks

T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in INTERSPEECH, 2017
[35]

Autoencoding beyond pixels using a learned similarity metric

A. B. L. Larsen, S. K. Sønderby, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” CoRR, vol. abs/1512.09300, 2015. [Online]. Available: http://arxiv.org/abs/1512.09300
[36]

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” CoRR, vol. abs/1704.00849, 2017. [Online]. Available: http://arxiv.org/abs/1704.00849
[37]

Wasserstein GAN

M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017
[38]

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017
[39]

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

H. Kameoka and T. Kaneko, “Stargan-vc: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv preprint arXiv:1806.02169, 2018
[41]

Sample Efficient Adaptive Text-to-Speech

[Online]. Available: http://arxiv.org/abs/1809.10460

[42]

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” CoRR, vol. abs/1806.04558, 2018. [Online]. Available: http://arxiv.org/abs/1806.04558
[44]

Neural Voice Cloning with a Few Samples

[Online]. Available: http://arxiv.org/abs/1802.06006

[45]

VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voice synthesis for in-the-wild speakers via a phonological loop,” CoRR, vol. abs/1707.06588, 2017. [Online]. Available: http://arxiv.org/abs/1707.06588
[46]

Fitting New Speakers Based on a Short Untranscribed Sample

E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” CoRR, vol. abs/1802.06984, 2018. [Online]. Available: http://arxiv.org/abs/1802.06984
[47]

Unsupervised Polyglot Text To Speech

E. Nachmani and L. Wolf, “Unsupervised polyglot text to speech,” CoRR, vol. abs/1902.02263, 2019. [Online]. Available: http://arxiv.org/abs/1902.02263

[48]

Wavenet vocoder

R. Yamamoto, “Wavenet vocoder,” 2018. [Online]. Available: https://github.com/r9y9/wavenet_vocoder
[49]

Dropout: a simple way to prevent neural networks from overfitting

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
[50]

A learning algorithm for continually running fully recurrent neural networks

R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.2.270
[51]

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” arXiv preprint arXiv:1506.03099, 2015
[52]

Professor Forcing: A New Algorithm for Training Recurrent Networks

A. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” arXiv preprint arXiv:1610.09038, 2016
[53]

Fully Character-Level Neural Machine Translation without Explicit Segmentation

J. Lee, K. Cho, and T. Hofmann, “Fully character-level neural machine translation without explicit segmentation,” arXiv preprint arXiv:1610.03017, 2016
[54]

Effective Approaches to Attention-based Neural Machine Translation

M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015

[55]

Attention-Based Models for Speech Recognition

J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015
[56]

The lj speech dataset

K. Ito, “The lj speech dataset,” 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[57]

Cmu arctic databases for speech synthesis

J. Kominek and A. W. Black, “Cmu arctic databases for speech synthesis,” Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA, 2003. [Online]. Available: http://festvox.org/cmuarctic/index.html
[58]

WaveNet: A Generative Model for Raw Audio

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499

[59]

Neural Machine Translation in Linear Time

N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, “Neural machine translation in linear time,” CoRR, vol. abs/1610.10099, 2016. [Online]. Available: http://arxiv.org/abs/1610.10099

[60]

Conditional Image Generation with PixelCNN Decoders

A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” CoRR, vol. abs/1606.05328, 2016. [Online]. Available: http://arxiv.org/abs/1606.05328

[61]

Video Pixel Networks

N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” CoRR, vol. abs/1610.00527, 2016. [Online]. Available: http://arxiv.org/abs/1610.00527

[62]

Variational Inference with Normalizing Flows

D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015
[63]

Improving Variational Inference with Inverse Autoregressive Flow

D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with inverse autoregressive flow,” CoRR, vol. abs/1606.04934, 2016. [Online]. Available: http://arxiv.org/abs/1606.04934
[65]

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

[Online]. Available: http://arxiv.org/abs/1711.10433

[66]

WaveGlow: A Flow-based Generative Network for Speech Synthesis

R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” CoRR, vol. abs/1811.00002, 2018. [Online]. Available: http://arxiv.org/abs/1811.00002

[1] [1]

Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures

Introduction. Recently, sequence to sequence models have been adapted with great success in producing realistic sounding speech in TTS systems [1, 2, 3, 4, 5]. Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures. In TTS, the system takes in a text or phoneme sequence and outputs a speech representation as out...

[2] [2]

Related Work. The traditional pipeline for parallel voice conversion uses Gaussian Mixture Models (GMMs) [6, 7, 8] or Deep Neural Networks (DNNs) [9, 10, 11, 12]. After first aligning source and target features using Dynamic Time Warping (DTW) [13], the model is trained so that it learns to produce the target given the source features for each...
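The DTW alignment step mentioned in this excerpt can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the paper or in the works it cites: the function name `dtw_path`, the Euclidean frame distance, and the three-way step pattern are all assumptions chosen for illustration.

```python
def dtw_path(source, target):
    """Align two sequences of feature vectors with dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) pairs saying
    that source frame i is aligned with target frame j.
    Assumption: Euclidean distance between frames, steps of (1,0), (0,1), (1,1).
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning the first i source frames
    # with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional "features": the target holds the middle frame twice.
total, path = dtw_path([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
print(total, path)
```

In a conventional parallel voice conversion pipeline, the aligned frame pairs in `path` would then serve as input and target pairs for a frame-wise GMM or DNN mapping.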


[3] [3]

The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]

Architecture. We use an attention based encoder-decoder network for our voice conversion task. The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]. The system takes in an audio representation (mel-spectrogram) as input, and encodes it into a hidden representation in recurrent fashion. This hidden representation is th...

[4] [4]

However, before doing so, it is useful to have in mind an overall picture of how the data flows through the decoder stack

Decoder RNNs with residuality. We describe the components in more detail below. However, before doing so, it is useful to have in mind an overall picture of how the data flows through the decoder stack. To that end, we present a brief description of the calculations at a high level. The decoder’s task is to transform linguistic content from the source spe...

[5] [5]

To get around this problem, we first pretrain our network as an autoencoder with a large single speaker TTS corpus [46], with the source and target voices being the same

Autoencoder pretraining and transfer learning. Voice conversion with DNNs for parallel data is a difficult undertaking owing to the lack of availability of large multispeaker voice conversion datasets. To get around this problem, we first pretrain our network as an autoencoder with a large single speaker TTS corpus [46], with the source and target voice...

[6] [6]

We first pretrain the network with a large single-speaker corpus in which the source and the target are the same

Experimental setup. Our experimental procedure consists of two steps, as mentioned in section 4. We first pretrain the network with a large single-speaker corpus in which the source and the target are the same. After this, we allow the network to adapt to the desired source and target data. 5.1. Datasets. For autoencoder pretraining, we use the LJSpeech dat...

[7] [7]

for text, PixelCNN [50] for images and Video Pixel Net


[8] [8]

for videos. This type of architecture, at a high level, works on a temporal basis (in the sense that there is a certain temporal ordering of the data) by stacking dilated convolutions with exponentially growing receptive field sizes (e.g. 2, 4, 8, 16). Masking is carried out so as to only allow information from the past. In wavenet, instead of masking, on...
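The way stacked dilated convolutions reach receptive fields of 2, 4, 8, 16 while only looking at the past can be sketched in a few lines of plain Python. This is an illustrative toy, not the WaveNet vocoder used in the paper: the kernel size of 2, the dilation schedule 1, 2, 4, 8, the all-ones weights, and the helper names `receptive_field`, `causal_dilated_conv`, and `stack` are assumptions.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples (current and past) seen by the top of a causal stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution over a list of floats.

    weights[0] multiplies the current sample and weights[k] the sample
    k * dilation steps in the past; samples before the start count as zero,
    so no output ever depends on a future input.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def stack(x, dilations, weights=(1.0, 1.0)):
    """Apply one causal dilated convolution per entry in `dilations`."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


dilations = [1, 2, 4, 8]  # assumed schedule: doubles at every layer
print([receptive_field(2, dilations[:n]) for n in range(1, 5)])

# An impulse at t = 0 spreads forward over exactly 16 time steps and no further.
impulse = [1.0] + [0.0] * 19
print(stack(impulse, dilations))
```

With kernel size 2, each added layer doubles the receptive field, which is why depth grows only logarithmically with the amount of past context the network can see.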


[9] [9]

These features serve as a useful starting point for transfer learning in the limited data corpus

Conclusions. In this work, we demonstrated a way to overcome data limitations (an all too common malady in the speech world) with a trick to extract linguistic features by pretraining with a large corpus so that it learns to reconstruct the input voice. These features serve as a useful starting point for transfer learning in the limited data corpus. Th...

[10] [10]

Tacotron: Towards End-to-End Speech Synthesis

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017

[11] [11]

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

J. Shen, R. Pang, R. J. Weiss, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017

[12] [12]

Deep Voice: Real-time Neural Text-to-Speech

S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep voice: Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, 2017

[13] [13]

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

S. O. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, and Y. Zhou, “Deep voice 2: Multi-speaker neural text-to-speech,” arXiv preprint arXiv:1705.08947, 2017

[14] [14]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

W. Ping, K. Peng, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017

[15] [15]

Spectral voice conversion for text-to-speech synthesis

A. Kain and M. Macon, “Spectral voice conversion for text-to-speech synthesis,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1. IEEE, 1998, pp. 285–288

[16] [16]

Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory

T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” Trans. Audio, Speech and Lang. Proc., vol. 15, no. 8, pp. 2222–2235, Nov. 2007. [Online]. Available: https://doi.org/10.1109/TASL.2007.907344

[17] [17]

Voice conversion using artificial neural networks

S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, “Voice conversion using artificial neural networks,” in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 3893–3896. [Online]. Available: https://doi.org/10....

[18] [18]

Voice conversion using artificial neural networks

S. Desai, A. W. Black, and B. Yegnanarayana, “Voice conversion using artificial neural networks,” in IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, July 2010

[19] [19]

Voice conversion using deep bidirectional long short-term memory

L. Sun, S. Yang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory,” in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 4869–4873

[20] [20]

L. Sun, K. Li, S. Kang, and H. Meng, in IEEE International Conference on Multimedia and Expo, 2016

[21] [21]

Information Retrieval for Music and Motion

M. Müller, Information Retrieval for Music and Motion. Springer, 2007

[22] [22]

An overview of voice conversion systems

S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Commun., vol. 88, no. C, pp. 65–82, Apr. 2017. [Online]. Available: https://doi.org/10.1016/j.specom.2017.01.008

[23] [23]

Robust hierarchical learning for non-negative matrix factorization with outliers

Y. Li, M. Sun, H. Van Hamme, X. Zhang, and J. Yang, “Robust hierarchical learning for non-negative matrix factorization with outliers,” IEEE Access, vol. 7, pp. 10 546–10 558, 2019

[24] [24]

Exemplar-based voice conversion using sparse representation in noisy environments

R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 96, no. 10, pp. 1946–1953, 2013

[25] [25]

Sequence-to-sequence acoustic modeling for voice conversion

J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” arXiv preprint arXiv:1810.06865, 2018

[26] [26]

Improving sequence-to-sequence acoustic modeling by adding text-supervision

J. Zhang, Z. Ling, Y. Jiang, L. Liu, C. Liang, and L. Dai, “Improving sequence-to-sequence acoustic modeling by adding text-supervision,” CoRR, vol. abs/1811.08111, 2018. [Online]. Available: http://arxiv.org/abs/1811.08111

[27] [27]

Listen, Attend and Spell

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015

[28] [28]

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” arXiv preprint arXiv:1811.04076, 2018

[29] [29]

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, “ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,” arXiv preprint arXiv:1811.01609, 2018

[30] [30]

Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention

H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” CoRR, vol. abs/1710.08969, 2017. [Online]. Available: http://arxiv.org/abs/1710.08969

[31] [31]

Unpaired image-to-image translation using cycle-consistent adversarial networks

J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” CoRR, vol. abs/1703.10593, 2017. [Online]. Available: http://arxiv.org/abs/1703.10593

[32] [32]

Auto-Encoding Variational Bayes

D. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013

[33] [33]

Generative Adversarial Networks

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014

[34] [34]

Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks

T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in INTERSPEECH, 2017

[35] [35]

Autoencoding beyond pixels using a learned similarity metric

A. B. L. Larsen, S. K. Sønderby, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” CoRR, vol. abs/1512.09300, 2015. [Online]. Available: http://arxiv.org/abs/1512.09300

[36] [36]

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” CoRR, vol. abs/1704.00849, 2017. [Online]. Available: http://arxiv.org/abs/1704.00849

[37] [37]

Wasserstein GAN

M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017

[38] [38]

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017

[39] [39]

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

H. Kameoka and T. Kaneko, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv preprint arXiv:1806.02169, 2018

[40] [41]

Sample Efficient Adaptive Text-to-Speech

[Online]. Available: http://arxiv.org/abs/1809.10460


[41] [42]

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” CoRR, vol. abs/1806.04558, 2018. [Online]. Available: http://arxiv.org/abs/1806.04558

[42] [44]

Neural Voice Cloning with a Few Samples

[Online]. Available: http://arxiv.org/abs/1802.06006


[43] [45]

VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voice synthesis for in-the-wild speakers via a phonological loop,” CoRR, vol. abs/1707.06588, 2017. [Online]. Available: http://arxiv.org/abs/1707.06588

[44] [46]

Fitting New Speakers Based on a Short Untranscribed Sample

E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” CoRR, vol. abs/1802.06984, 2018. [Online]. Available: http://arxiv.org/abs/1802.06984

[45] [47]

Unsupervised Polyglot Text To Speech

E. Nachmani and L. Wolf, “Unsupervised polyglot text to speech,” CoRR, vol. abs/1902.02263, 2019. [Online]. Available: http://arxiv.org/abs/1902.02263


[46] [48]

WaveNet vocoder

R. Yamamoto, “Wavenet vocoder,” 2018. [Online]. Available: https://github.com/r9y9/wavenet_vocoder

[47] [49]

Dropout: a simple way to prevent neural networks from overfitting

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html

[48] [50]

A learning algorithm for continually running fully recurrent neural networks

R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.2.270
