Hierarchical Sequence to Sequence Voice Conversion with Limited Data
Pith reviewed 2026-05-24 21:28 UTC · model grok-4.3
The pith
A seq2seq model pretrained as an autoencoder on single-speaker data adapts to perform voice conversion on limited multispeaker datasets using mel spectrograms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a WaveNet vocoder.
What carries the argument
A hierarchical encoder that summarizes input audio frames, paired with an attention-based decoder. The network is pretrained as an autoencoder and then adapted to parallel voice conversion pairs.
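One way to picture hierarchical frame summarization is the pyramidal time reduction used in Listen, Attend and Spell, which the paper cites as an architectural influence. The sketch below illustrates that general technique and is an assumption, not the authors' encoder: the names `pyramid_reduce` and `hierarchical_summary` are hypothetical, and the recurrent layers that would normally sit between reduction steps are left out.

```python
from typing import List

Frame = List[float]


def pyramid_reduce(frames: List[Frame]) -> List[Frame]:
    """Concatenate each pair of adjacent frames, halving the sequence length."""
    if len(frames) % 2 == 1:
        # Zero-pad with one extra frame so every frame has a partner.
        frames = frames + [[0.0] * len(frames[0])]
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


def hierarchical_summary(frames: List[Frame], num_levels: int) -> List[Frame]:
    """Apply the pairwise reduction num_levels times.

    A real encoder would run a recurrent layer on the frames before each
    reduction; that part is omitted here (assumption for illustration).
    """
    for _ in range(num_levels):
        frames = pyramid_reduce(frames)
    return frames


# Toy input: 8 frames with 2 "mel bins" each (hypothetical values).
mel = [[float(t), float(t) + 0.5] for t in range(8)]
summary = hierarchical_summary(mel, num_levels=2)
print(len(summary), len(summary[0]))
```

Each level halves the number of time steps the attention mechanism has to scan while widening the per-step feature vector, which is the usual motivation for this kind of encoder on long audio sequences.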
If this is right
- Voice conversion operates directly on mel spectrograms without explicit F0, duration or linguistic features.
- Pretraining on large single-speaker data followed by adaptation succeeds on smaller parallel multispeaker sets.
- WaveNet vocoder converts output mel frames back to audio waveforms.
- The system works in the parallel setting where source-target audio pairs are available.
Where Pith is reading between the lines
- The same pretraining-plus-adaptation pattern could reduce data needs for other paired audio translation tasks.
- Hierarchical frame summarization may separate content from speaker traits more readily than flat encoders.
- Replacing hand-crafted features with mel spectrograms simplifies the overall conversion pipeline.
Load-bearing premise
Pretraining the network as an autoencoder on a large single-speaker dataset enables effective adaptation to smaller multispeaker voice conversion datasets.
What would settle it
Listening tests or speaker similarity scores on held-out parallel pairs showing the adapted outputs fail to match target speaker identity would falsify the adaptation claim.
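A speaker similarity score of the kind mentioned above is often computed as the cosine similarity between fixed-length speaker embeddings. The following is a minimal sketch under that assumption. The embedding vectors are toy values standing in for the output of some speaker-verification model, which the text does not specify, and `cosine_similarity` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy speaker embeddings (hypothetical values, not measured data).
converted = [0.9, 0.1, 0.2]
target = [1.0, 0.0, 0.2]
source = [0.1, 1.0, 0.3]

sim_to_target = cosine_similarity(converted, target)
sim_to_source = cosine_similarity(converted, source)
print(sim_to_target > sim_to_source)
```

Under this framing, the adaptation claim would be in trouble if converted utterances consistently scored closer to the source speaker than to the target speaker.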
Original abstract
We present a voice conversion solution using recurrent sequence to sequence modeling for DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting when <source, target> audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This is then adapted for the smaller multispeaker voice conversion datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a WaveNet vocoder.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hierarchical sequence-to-sequence architecture for parallel voice conversion. A hierarchical encoder summarizes input audio frames while an attention-based decoder (inspired by recent TTS models) generates output. To address the scarcity of large multispeaker voice conversion datasets, the network is first pretrained as an autoencoder on a large single-speaker corpus and then adapted to smaller parallel multispeaker data. The system operates directly on mel spectrograms (rather than F0, duration or linguistic features) and uses a WaveNet vocoder to synthesize waveforms.
Significance. If the pretraining-plus-adaptation strategy reliably produces transferable representations, the work would offer a practical route to high-quality voice conversion under limited parallel data by exploiting abundant single-speaker corpora, providing an alternative to conventional feature-engineering pipelines.
major comments (2)
- [Abstract] The load-bearing claim that pretraining as a single-speaker autoencoder followed by adaptation yields effective multispeaker voice conversion is stated without any description of the adaptation procedure (which layers are updated, learning-rate schedule, auxiliary losses, or whether speaker embeddings are introduced). This omission directly prevents assessment of whether the reconstruction objective actually encourages the required speaker-invariant content representations.
- [Abstract] No quantitative results, baselines, objective metrics (e.g., MCD, WER), subjective scores, or dataset statistics are supplied, so the central assertion that the method solves the limited-data problem cannot be evaluated against evidence.
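For readers unfamiliar with MCD (mel-cepstral distortion), the objective metric named in the second comment, the sketch below shows the standard per-frame formula averaged over an utterance. It assumes the reference and converted sequences are already time-aligned and expressed as mel-cepstral coefficient vectors, and it skips the 0th (energy) coefficient, which is a common convention. The function name and toy inputs are hypothetical and nothing here is taken from the paper's evaluation.

```python
import math
from typing import List, Sequence


def mel_cepstral_distortion(
    ref: List[Sequence[float]], conv: List[Sequence[float]]
) -> float:
    """Mean MCD in dB over time-aligned frames, ignoring coefficient 0."""
    if len(ref) != len(conv) or not ref:
        raise ValueError("sequences must be non-empty and time-aligned")
    # Standard scaling constant: (10 / ln 10) * sqrt(2).
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, c in zip(ref, conv):
        squared_error = sum((a - b) ** 2 for a, b in zip(r[1:], c[1:]))
        total += const * math.sqrt(squared_error)
    return total / len(ref)


# Toy example: one frame that differs by 1.0 in a single coefficient.
reference = [[5.0, 1.0, 2.0]]
converted = [[9.0, 1.0, 3.0]]  # coefficient 0 differs too, but is ignored
mcd = mel_cepstral_distortion(reference, converted)
print(round(mcd, 4))
```

Lower values indicate that the converted spectral envelope is closer to the target's. In a parallel setting the alignment step (for example via dynamic time warping) has to happen before this function is called.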
Simulated Author's Rebuttal
We thank the referee for the comments. We address each major comment below and will revise the abstract accordingly.
- Referee: [Abstract] The load-bearing claim that pretraining as a single-speaker autoencoder followed by adaptation yields effective multispeaker voice conversion is stated without any description of the adaptation procedure (which layers are updated, learning-rate schedule, auxiliary losses, or whether speaker embeddings are introduced). This omission directly prevents assessment of whether the reconstruction objective actually encourages the required speaker-invariant content representations.
  Authors: We agree the abstract is too terse on the adaptation step. The manuscript body details the procedure (pretrain full autoencoder on single-speaker data, then fine-tune selected decoder layers on parallel multispeaker pairs while freezing the encoder). We will expand the abstract with one sentence summarizing the adaptation (layers updated, learning-rate reduction, no auxiliary losses or speaker embeddings) so readers can immediately assess the claim. Revision: yes.
- Referee: [Abstract] No quantitative results, baselines, objective metrics (e.g., MCD, WER), subjective scores, or dataset statistics are supplied, so the central assertion that the method solves the limited-data problem cannot be evaluated against evidence.
  Authors: Abstracts conventionally omit numbers for brevity. The manuscript reports MCD, WER, MOS scores, and dataset sizes in the experiments section with comparisons to baselines. To address the concern we will add a concise results clause to the abstract citing the key objective and subjective improvements on the limited-data setting. Revision: yes.
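The adaptation procedure described in the simulated rebuttal (freeze the encoder, fine-tune decoder layers at a reduced learning rate) can be sketched without any deep learning framework. This is an illustration of that description only and is not confirmed by the paper text quoted on this page. The parameter names, the `decoder.` prefix convention, and the function `adapt_step` are all hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def adapt_step(
    params: Params, grads: Params, lr: float, trainable_prefix: str = "decoder."
) -> Params:
    """One plain SGD step that updates only parameters under trainable_prefix.

    Every other parameter (here, the encoder) is copied through unchanged,
    which is what "freezing" means in practice.
    """
    updated: Params = {}
    for name, values in params.items():
        if name.startswith(trainable_prefix):
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: gradient is ignored
    return updated


# Toy weights standing in for a pretrained autoencoder (hypothetical values).
pretrained = {"encoder.w": [1.0, 2.0], "decoder.w": [3.0, 4.0]}
gradients = {"encoder.w": [0.5, 0.5], "decoder.w": [1.0, -1.0]}

adapted = adapt_step(pretrained, gradients, lr=0.1)
print(adapted["encoder.w"], adapted["decoder.w"])
```

Freezing the encoder keeps the representation learned from the large single-speaker corpus intact, so the small parallel dataset only has to move the decoder toward the target voice.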
Circularity Check
No significant circularity; architecture is standard adaptation without self-referential reduction
The paper describes a hierarchical seq2seq model with attention decoder, pretrained as autoencoder on single-speaker data then adapted to multispeaker VC using mel spectrograms. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the text. The approach relies on external advances in NMT/TTS/ASR without reducing any claim to its own inputs by construction. This is a typical non-circular empirical adaptation strategy.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions of recurrent neural networks and attention mechanisms apply to audio sequences for voice conversion.
Reference graph
Works this paper leans on
-
[1]
Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures
Introduction. Recently, sequence to sequence models have been adapted with great success in producing realistic sounding speech in TTS systems [1, 2, 3, 4, 5]. Likewise, it has been demonstrated that ASR can be handled excellently by seq2seq architectures. In TTS, the system takes in a text or phoneme sequence and outputs a speech representation as out...
-
[2]
Related Work. The traditional pipeline for parallel voice conversion is through use of Gaussian Mixture Models (GMMs) [6, 7, 8] or Deep Neural Networks (DNNs) [9, 10, 11, 12]. After first aligning source and target features using Dynamic Time Warping (DTW) [13], the model is trained so that it learns to produce the target given the source features for each...
-
[3]
The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]
Architecture. We use an attention based encoder-decoder network for our voice conversion task. The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]. The system takes in an audio representation (mel-spectrogram) as input, and encodes it into a hidden representation in recurrent fashion. This hidden representation is th...
-
[4]
Decoder RNNs with residuality We describe the components in more detail below. However, before doing so, it is useful to have in mind an overall picture of how the data flows through the decoder stack. To that end, we present a brief description of the calculations at a high level. The decoder’s task is to transform linguistic content from the source spe...
-
[5]
Autoencoder pretraining and transfer learning. Voice conversion with DNNs for parallel data is a difficult undertaking owing to the lack of availability of large multispeaker voice conversion datasets. To get around this problem, we first pretrain our network as an autoencoder with a large single speaker TTS corpus [46], with the source and target voice...
-
[6]
Experimental setup. Our experimental procedure consists of two steps, as mentioned in section 4. We first pretrain the network with a large single-speaker corpus in which the source and the target are the same. After this, we allow the network to adapt to the desired source and target data. 5.1. Datasets For autoencoder pretraining, we use the LJSpeech dat...
-
[7]
for text, PixelCNN [50] for images and Video Pixel Net
-
[8]
for videos. This type of architecture, at a high level, works on a temporal (in the sense that there is a certain temporal ordering of data) basis by stacking dilated convolutions with exponentially growing receptive field sizes (e.g. 2, 4, 8, 16). Masking is carried out so as to only allow information from the past. In wavenet, instead of masking, on...
-
[9]
These features serve as a useful starting point for transfer learning in the limited data corpus
Conclusions. In this work, we demonstrated a way to overcome data limitations (an all too common malady in the speech world) with a trick to extract linguistic features by pretraining with a large corpus so that it learns to reconstruct the input voice. These features serve as a useful starting point for transfer learning in the limited data corpus. Th...
-
[10]
Tacotron: Towards End-to-End Speech Synthesis
Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end to end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017
-
[11]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
J. Shen, R. Pang, R. J. Weiss, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017
-
[12]
Deep Voice: Real-time Neural Text-to-Speech
S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep voice: Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, 2017
-
[13]
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
S. O. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, and Y. Zhou, “Deep voice 2: Multi-speaker text-to-speech,” arXiv preprint arXiv:1705.08947, 2017
-
[14]
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
W. Ping, K. Peng, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017
-
[15]
Spectral voice conversion for text-to-speech synthesis,
A. Kain and M. Macon, “Spectral voice conversion for text-to-speech synthesis,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1. IEEE, 1998, pp. 285–288
-
[16]
Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,
T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” Trans. Audio, Speech and Lang. Proc., vol. 15, no. 8, pp. 2222–2235, Nov. 2007. [Online]. Available: https://doi.org/10.1109/TASL.2007.907344
-
[17]
Voice conversion using artificial neural networks,
S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, “Voice conversion using artificial neural networks,” in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 3893–3896. [Online]. Available: https://doi.org/10....
-
[18]
Voice conversion using artificial neural networks,
S. Desai, A. W. Black, and B. Yegnanarayana, “Voice conversion using artificial neural networks,” in IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, July 2010
-
[19]
Voice conversion using deep bidirectional long short-term memory,
L. Sun, S. Yang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory,” in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 4869–4873
-
[20]
L. Sun, K. Li, S. Kang, and H. Meng, in IEEE International Conference on Multimedia and Expo, 2016
-
[21]
Information Retrieval for Music and Motion
M. Müller, Information Retrieval for Music and Motion. Springer, 2007
-
[22]
An overview of voice conversion systems,
S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Commun., vol. 88, no. C, pp. 65–82, Apr. 2017. [Online]. Available: https://doi.org/10.1016/j.specom.2017.01.008
-
[23]
Robust hierarchical learning for non-negative matrix factorization with outliers,
Y. Li, M. Sun, H. Van Hamme, X. Zhang, and J. Yang, “Robust hierarchical learning for non-negative matrix factorization with outliers,” IEEE Access, vol. 7, pp. 10546–10558, 2019
-
[24]
Exemplar-based voice conversion using sparse representation in noisy environments,
R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 96, no. 10, pp. 1946–1953, 2013
-
[25]
Sequence-to-sequence acoustic modeling for voice conversion,
J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” arXiv preprint arXiv:1810.06865, 2018
-
[26]
Improving sequence-to-sequence acoustic modeling by adding text-supervision,
J. Zhang, Z. Ling, Y. Jiang, L. Liu, C. Liang, and L. Dai, “Improving sequence-to-sequence acoustic modeling by adding text-supervision,” CoRR, vol. abs/1811.08111, 2018. [Online]. Available: http://arxiv.org/abs/1811.08111
-
[27]
W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015
-
[28]
AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms
K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “Atts2s-vc: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” arXiv preprint arXiv:1811.04076, 2018
-
[29]
ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,
H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, “ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,” arXiv preprint arXiv:1811.01609, 2018
-
[30]
H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” CoRR, vol. abs/1710.08969, 2017. [Online]. Available: http://arxiv.org/abs/1710.08969
-
[31]
Unpaired image-to-image translation using cycle-consistent adversarial networks,
J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” CoRR, vol. abs/1703.10593, 2017. [Online]. Available: http://arxiv.org/abs/1703.10593
-
[32]
Auto-Encoding Variational Bayes
D. Kingma and M. Welling, “Autoencoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013
-
[33]
Generative Adversarial Networks
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014
-
[34]
T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in INTERSPEECH, 2017
-
[35]
Autoencoding beyond pixels using a learned similarity metric
A. B. L. Larsen, S. K. Sønderby, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” CoRR, vol. abs/1512.09300, 2015. [Online]. Available: http://arxiv.org/abs/1512.09300
-
[36]
C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” CoRR, vol. abs/1704.00849, 2017. [Online]. Available: http://arxiv.org/abs/1704.00849
-
[37]
M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017
-
[38]
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017
-
[39]
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
H. Kameoka and T. Kaneko, “Stargan-vc: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv preprint arXiv:1806.02169, 2018
-
[41]
Sample Efficient Adaptive Text-to-Speech
[Online]. Available: http://arxiv.org/abs/1809.10460
-
[42]
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” CoRR, vol. abs/1806.04558, 2018. [Online]. Available: http://arxiv.org/abs/1806.04558
-
[44]
Neural Voice Cloning with a Few Samples
[Online]. Available: http://arxiv.org/abs/1802.06006
-
[45]
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voice synthesis for in-the-wild speakers via a phonological loop,” CoRR, vol. abs/1707.06588, 2017. [Online]. Available: http://arxiv.org/abs/1707.06588
-
[46]
Fitting New Speakers Based on a Short Untranscribed Sample
E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” CoRR, vol. abs/1802.06984, 2018. [Online]. Available: http://arxiv.org/abs/1802.06984
-
[47]
Unsupervised Polyglot Text To Speech
E. Nachmani and L. Wolf, “Unsupervised polyglot text to speech,” CoRR, vol. abs/1902.02263, 2019. [Online]. Available: http://arxiv.org/abs/1902.02263
-
[48]
R. Yamamoto, “Wavenet vocoder,” 2018. [Online]. Available: https://github.com/r9y9/wavenet_vocoder
-
[49]
Dropout: a simple way to prevent neural networks from overfitting,
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
-
[50]
A learning algorithm for continually running fully recurrent neural networks,
R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.2.270
-
[51]
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” arXiv preprint arXiv:1506.03099, 2015
-
[52]
Professor Forcing: A New Algorithm for Training Recurrent Networks
A. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” arXiv preprint arXiv:1610.09038, 2016
-
[53]
Fully Character-Level Neural Machine Translation without Explicit Segmentation
J. Lee, K. Cho, and T. Hofmann, “Fully character-level neural machine translation without explicit segmentation,” arXiv preprint arXiv:1610.03017, 2016
-
[54]
Effective Approaches to Attention-based Neural Machine Translation
M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015
-
[55]
Attention-Based Models for Speech Recognition
J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015
-
[56]
K. Ito, “The lj speech dataset,” 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
-
[57]
Cmu arctic databases for speech synthesis,
J. Kominek and A. W. Black, “Cmu arctic databases for speech synthesis,” Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA, 2003. [Online]. Available: http://festvox.org/cmuarctic/index.html
-
[58]
WaveNet: A Generative Model for Raw Audio
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
-
[59]
Neural Machine Translation in Linear Time
N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, “Neural machine translation in linear time,” CoRR, vol. abs/1610.10099, 2016. [Online]. Available: http://arxiv.org/abs/1610.10099
-
[60]
Conditional Image Generation with PixelCNN Decoders
A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” CoRR, vol. abs/1606.05328, 2016. [Online]. Available: http://arxiv.org/abs/1606.05328
-
[61]
N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” CoRR, vol. abs/1610.00527, 2016. [Online]. Available: http://arxiv.org/abs/1610.00527
-
[62]
Variational Inference with Normalizing Flows
D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015
-
[63]
Improving Variational Inference with Inverse Autoregressive Flow
D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with inverse autoregressive flow,” CoRR, vol. abs/1606.04934, 2016. [Online]. Available: http://arxiv.org/abs/1606.04934
-
[65]
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
[Online]. Available: http://arxiv.org/abs/1711.10433
-
[66]
WaveGlow: A Flow-based Generative Network for Speech Synthesis
R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” CoRR, vol. abs/1811.00002, 2018. [Online]. Available: http://arxiv.org/abs/1811.00002