WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
Pith reviewed 2026-06-28 08:16 UTC · model grok-4.3
The pith
WavTTS demonstrates that direct raw-waveform diffusion can approach the quality of latent-space zero-shot TTS systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WavTTS is the first raw-waveform generative TTS model built on flow matching with a Diffusion Transformer; it applies a simple patchification strategy to the waveform, adds multi-scale mel-spectrogram supervision for perceptual guidance, and uses a tailored noise schedule, allowing it to approach the performance of state-of-the-art latent generative zero-shot TTS models on open benchmarks while substantially outperforming previous end-to-end speech generation models.
What carries the argument
Flow matching with Diffusion Transformer applied to patchified raw waveforms, guided by multi-scale mel-spectrogram supervision.
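To make the patchification idea concrete, here is a minimal sketch, assuming non-overlapping fixed-size patches with zero-padding at the tail. The summary does not state WavTTS's patch size or padding rule, so `patch_size`, `patchify`, and `unpatchify` are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import List


def patchify(waveform: List[float], patch_size: int) -> List[List[float]]:
    """Split a mono waveform into non-overlapping patches (assumed layout).

    The tail is zero-padded so the length becomes a multiple of patch_size.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    pad = (-len(waveform)) % patch_size
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches: List[List[float]], length: int) -> List[float]:
    """Flatten patches back into a waveform and trim the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


# Toy input: 10 samples, hypothetical patch size of 4.
wave = [0.1 * i for i in range(10)]
patches = patchify(wave, patch_size=4)   # 3 patches of 4 samples each
restored = unpatchify(patches, len(wave))
```

Each patch would then be projected to one transformer token, so the model attends over `len(patches)` positions rather than over individual samples.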
If this is right
- Raw-waveform diffusion becomes a viable alternative that avoids the information bottleneck of latent or mel representations.
- End-to-end training in the waveform domain can approach the results of two-stage latent pipelines.
- An optimized noise schedule and multi-scale supervision suffice to stabilize training on extremely long audio sequences.
- Scaling diffusion-based TTS directly in waveform space opens a new direction for fully end-to-end speech generation.
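The long-sequence problem behind these points can be put in rough numbers. The following back-of-the-envelope sketch assumes a 24 kHz sample rate, a 10-second utterance, and a patch size of 256; all three values are illustrative assumptions and are not taken from the paper.

```python
import math


def num_tokens(duration_s: float, sample_rate: int, patch_size: int) -> int:
    """Number of transformer positions after non-overlapping patchification."""
    num_samples = int(round(duration_s * sample_rate))
    return math.ceil(num_samples / patch_size)


# Illustrative values only; not taken from the paper.
raw_len = num_tokens(10.0, 24_000, patch_size=1)        # 240000, one position per sample
patched_len = num_tokens(10.0, 24_000, patch_size=256)  # 938 tokens
```

Because self-attention cost grows roughly quadratically with sequence length, a reduction of this size is what makes a transformer over waveform patches tractable at all.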
Where Pith is reading between the lines
- The method could eliminate the need for a separate vocoder stage in the inference pipeline.
- Direct waveform modeling may reduce cumulative artifacts that arise when decoding from compressed latents.
- The same patchification-plus-supervision recipe could be tested on related long-sequence audio tasks such as music or environmental sound generation.
Load-bearing premise
The combination of patchification, multi-scale mel supervision, and the chosen noise schedule overcomes long-sequence and information-loss problems without introducing new quality-degrading artifacts.
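For orientation, one common way to write multi-scale mel-spectrogram supervision is shown below. This is a generic form, not the paper's stated loss: the set of scales $\mathcal{S}$, the use of log-mel features with an $L_1$ norm, the waveform estimate $\hat{x}$, and the weight $\lambda$ are all assumptions, since the summary gives no formula.

```latex
\mathcal{L}_{\text{mel}}(x, \hat{x})
  = \sum_{s \in \mathcal{S}}
    \left\lVert \log \mathrm{Mel}_s(x) - \log \mathrm{Mel}_s(\hat{x}) \right\rVert_1,
\qquad
\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda \, \mathcal{L}_{\text{mel}}
```

Here $x$ is the reference waveform, $\mathrm{Mel}_s$ is a mel-spectrogram computed with the window and hop size of scale $s$, and $\mathcal{L}_{\text{FM}}$ stands for the flow-matching loss. Whether such a term introduces artifacts of its own is exactly what the premise above leaves open.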
What would settle it
A set of listening tests or standard TTS metrics on the same benchmarks showing that WavTTS's naturalness or speaker-similarity scores remain clearly below those of the leading latent models.
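Speaker similarity in zero-shot TTS evaluations is usually reported as the cosine similarity between speaker embeddings of the prompt and of the generated audio. The sketch below shows only that scoring step; the embeddings are placeholder vectors, and the choice of embedding model is left open because the summary does not name one.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder embeddings; a real evaluation would extract them from
# a speaker-verification model applied to prompt and generated audio.
prompt_emb = [1.0, 0.0, 1.0]
same_direction = cosine_similarity(prompt_emb, [2.0, 0.0, 2.0])
orthogonal = cosine_similarity(prompt_emb, [0.0, 1.0, 0.0])
```

Scores near 1 indicate that the generated voice sits close to the prompt speaker in embedding space; a consistent gap to the latent baselines on this score would count against the core claim.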
Original abstract
Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon the flow matching with Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
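As background for the abstract's terminology, the standard conditional flow-matching objective with a linear interpolation path can be written as follows. This is the textbook formulation, offered as a sketch: the abstract says the authors study different prediction targets and noise schedules, so the objective actually used in WavTTS may differ. The symbols $x_0$ (Gaussian noise), $x_1$ (the data waveform), $c$ (text and speaker-prompt conditioning), and $v_\theta$ (the DiT) are notation introduced here.

```latex
x_t = (1 - t)\, x_0 + t\, x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad t \in [0, 1],
\\[4pt]
\mathcal{L}_{\text{FM}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \left\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \right\rVert_2^2
```

At inference, a sample is obtained by integrating $\mathrm{d}x_t / \mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ (noise) to $t = 1$ (waveform) with an ODE solver.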
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WavTTS, which the authors present as the first raw-waveform generative TTS model to substantially narrow the gap with latent-space generative models. It builds a flow-matching Diffusion Transformer (DiT) that directly models speech waveforms via patchification, augments training with multi-scale mel-spectrogram supervision, and introduces a tailored noise schedule. The central empirical claim is that this architecture approaches the performance of current latent-space SOTA zero-shot TTS systems while substantially outperforming prior end-to-end speech generation models on open-source benchmarks.
Significance. If the reported benchmark results hold, the work establishes that direct waveform-space diffusion is viable for high-quality zero-shot TTS, removing the information-loss penalty of VAE or mel compression and thereby opening a genuinely end-to-end generative pathway for speech synthesis.
major comments (2)
- [Abstract] Abstract: the performance claims ("closely approaches … state-of-the-art latent generative zero-shot TTS models" and "substantially outperforming previous end-to-end speech generation models") are stated without any numerical metrics, confidence intervals, dataset identifiers, or reference to the experimental tables; because these numbers are load-bearing for the central claim, their absence prevents verification that the data actually support the stated conclusions.
- [§4] §4 (Experiments) and associated tables: the manuscript must supply the concrete MOS, WER, speaker-similarity, and RTF numbers together with the exact baselines, training data, and number of listeners so that the "approaches SOTA while outperforming end-to-end" statement can be evaluated; without these the empirical support remains unassessable.
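Two of the requested metrics have simple, standard definitions, sketched here so the reader can see what the missing numbers would measure. WER is the word-level edit distance divided by the reference length, and RTF is synthesis time divided by the duration of the generated audio. The function names and example strings are illustrative; a real evaluation would first transcribe the generated speech with an ASR model and normalize the text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent generating divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative inputs only; not results from the paper.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")  # one deletion out of six words
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
```

An RTF below 1 means synthesis runs faster than real time, which matters for a model that operates on far longer sequences than its latent-space counterparts.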
minor comments (2)
- [§3.3] The description of the noise schedule in §3.3 would benefit from an explicit equation or pseudocode block showing the chosen σ(t) schedule and its relation to the flow-matching objective.
- [Figure 2] Figure 2 (or equivalent architecture diagram) should label the patch size, the multi-scale mel heads, and the conditioning injection points for clarity.
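To illustrate the kind of pseudocode block the first minor comment asks for, here is a sketch of a timestep-shift schedule of the form used in rectified-flow image models. It is explicitly not the WavTTS schedule, which this summary does not specify; the functions `shift_noise_level` and `sampling_grid` and the `shift` value are hypothetical. The convention assumed here is that the noise level runs from 1.0 (pure noise) to 0.0 (clean signal).

```python
from typing import List


def shift_noise_level(u: float, shift: float) -> float:
    """Map a uniform noise level u in [0, 1] to a shifted one.

    With shift > 1 the output is pushed toward 1.0, so more of the
    schedule is spent at high noise; shift == 1 leaves u unchanged.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    if shift <= 0.0:
        raise ValueError("shift must be positive")
    return shift * u / (1.0 + (shift - 1.0) * u)


def sampling_grid(num_steps: int, shift: float) -> List[float]:
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean) for an ODE sampler."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return [shift_noise_level(1.0 - i / num_steps, shift) for i in range(num_steps + 1)]


# Hypothetical setting: 4 sampler steps, shift of 3.
grid = sampling_grid(num_steps=4, shift=3.0)
```

A block of this size in §3.3, with the paper's actual function and constants, would let readers reproduce the schedule and see how it interacts with the flow-matching objective.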
Simulated Author's Rebuttal
We thank the referee for highlighting the need for explicit numerical support in the abstract and experiments. We will revise the manuscript to strengthen the presentation of results while preserving the core claims.
- Referee: [Abstract] Abstract: the performance claims ("closely approaches … state-of-the-art latent generative zero-shot TTS models" and "substantially outperforming previous end-to-end speech generation models") are stated without any numerical metrics, confidence intervals, dataset identifiers, or reference to the experimental tables; because these numbers are load-bearing for the central claim, their absence prevents verification that the data actually support the stated conclusions.
  Authors: We agree the abstract would be stronger with explicit metrics. In revision we will insert the key quantitative results (MOS, WER, speaker similarity) with dataset names and table references while keeping the abstract concise.
  Revision: yes
- Referee: [§4] §4 (Experiments) and associated tables: the manuscript must supply the concrete MOS, WER, speaker-similarity, and RTF numbers together with the exact baselines, training data, and number of listeners so that the "approaches SOTA while outperforming end-to-end" statement can be evaluated; without these the empirical support remains unassessable.
  Authors: We will expand §4 and the tables to explicitly list all requested values (MOS, WER, speaker similarity, RTF), the precise baselines, training datasets, listener counts, and any confidence intervals, ensuring every claim is directly traceable to the reported numbers.
  Revision: yes
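The confidence intervals mentioned in this response are typically reported alongside MOS as a mean with a 95% half-width. A minimal sketch of that computation follows, assuming a normal approximation and independent ratings; the ratings list is made up for illustration and does not come from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of a normal-approximation 95% CI)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    score = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return score, half_width


# Made-up listener ratings on a 1-5 scale; not results from the paper.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, half_width = mos_with_ci(ratings)
```

Reporting the number of listeners and ratings next to each interval lets readers judge whether a small MOS gap between WavTTS and a latent baseline is within noise.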
Circularity Check
No significant circularity identified
The paper presents an empirical architecture (patchification + multi-scale mel supervision + flow-matching DiT on raw waveforms) whose performance claims rest on benchmark evaluations rather than any derivation chain. No equations, fitted parameters renamed as predictions, self-definitional relations, or load-bearing self-citations appear in the provided text. The central result is that the proposed components enable competitive zero-shot TTS; this rests on external benchmark evaluations and does not reduce to an input quantity defined by the authors' own prior work. The argument is therefore not circular: it stands or falls on those external benchmarks.