ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
Pith reviewed 2026-06-28 21:10 UTC · model grok-4.3
The pith
ImmersiveTTS generates speech integrated with environmental audio by fusing latents in a multimodal diffusion transformer and aligning domain-specific representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ImmersiveTTS builds a multimodal diffusion transformer that fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention and adds a domain-specific representation alignment objective that draws complementary self-supervised features from speech and audio encoders to enforce semantic consistency, yielding speech that integrates seamlessly with surrounding audio.
What carries the argument
Multimodal diffusion transformer performing joint attention between speech and environmental latents, plus domain-specific representation alignment objective.
If this is right
- Generated speech scores higher on naturalness, intelligibility, and audio fidelity than prior text-to-speech systems when environmental context is present.
- Joint attention explicitly models cross-modal interactions that separate pipelines cannot capture.
- The alignment objective reduces acoustic and temporal mismatches between speech and background sounds.
- Objective metrics and human evaluations both improve across tested environments.
Where Pith is reading between the lines
- The same fusion approach could be tested on real-time streaming audio where environment changes mid-utterance.
- Extending the alignment loss to video-derived visual context might further tighten multimodal consistency without new labeled data.
- If the alignment generalizes, it offers a route to parameter-efficient fine-tuning of existing diffusion audio models for immersive tasks.
Load-bearing premise
The domain-specific representation alignment objective produces semantic consistency and seamless integration between speech and environmental audio.
What would settle it
A controlled listening test in which listeners consistently judge the generated speech as semantically mismatched to the described environment (for example, urgent speech during a quiet library scene) would falsify the consistency claim.
Figures
read the original abstract
Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ImmersiveTTS, an environment-aware TTS system built on a multimodal diffusion transformer. It fuses transcript-aligned speech latents with text-conditioned environmental context via joint attention and introduces a domain-specific representation alignment objective that leverages complementary self-supervised representations from speech and audio encoders to promote semantic consistency. The central claim is that this architecture yields higher naturalness, intelligibility, and audio fidelity than prior methods, as measured by objective metrics and human listening tests.
Significance. If the performance claims are substantiated, the work would address a recognized gap in text-guided audio generation by enabling seamless integration of speech with environmental sounds despite their differing acoustic and temporal characteristics. The multimodal diffusion transformer with joint attention and the tailored alignment objective could supply a reusable framework for immersive audio synthesis applications.
major comments (1)
- Abstract: The assertion of superior performance across objective metrics and human listening tests is presented without any experimental details, baselines, error bars, datasets, or statistical analysis; the central empirical claim therefore cannot be evaluated from the supplied text.
Simulated Author's Rebuttal
We thank the referee for the review and the positive assessment of the work's potential significance, conditional on the empirical claims. We address the single major comment point by point below.
read point-by-point responses
-
Referee: [—] Abstract: The assertion of superior performance across objective metrics and human listening tests is presented without any experimental details, baselines, error bars, datasets, or statistical analysis; the central empirical claim therefore cannot be evaluated from the supplied text.
Authors: We agree that the abstract is written at a high level and omits specific experimental details, which is standard practice to maintain brevity (typically under 200 words). The full manuscript substantiates the claims in Section 4 (Experiments), which details the datasets, baseline systems, objective metrics with numerical results and error bars, human listening test design (including participant numbers and protocols), and statistical analysis. These elements enable evaluation of the performance assertions. We do not believe the abstract itself requires expansion, as doing so would violate length conventions without adding substantive value. revision: no
Circularity Check
No significant circularity detected
full rationale
The provided abstract and description contain no equations, derivations, or first-principles claims that could reduce to inputs by construction. The model is described at a high architectural level (multimodal diffusion transformer with joint attention and a domain-specific alignment objective), with performance asserted via experimental results rather than any mathematical chain. No self-citations, fitted inputs renamed as predictions, or ansatzes are present in the given text. The derivation chain is therefore self-contained against external benchmarks, as there is no internal reduction to inspect.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
online" 'onlinestring :=
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[2]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, and 1 others. 2022. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. In Proc. Annu. Meet. Assoc. Comput. Linguist. (ACL), pages 5723--5738
2022
-
[4]
Marcely Zanon Boito, Vivek Iyer, Nikolaos Lagos, Laurent Besacier, and Ioan Calapodescu. 2024. mhubert-147: A compact multilingual hubert model. In Ann. Conf. Int. Speech Commun. Assoc. (INTERSPEECH)
2024
-
[5]
Heng-Jui Chang, Saurabhchand Bhati, James Glass, and Alexander H Liu. 2025. Usad: Universal speech and audio representation via distillation. In IEEE Autom. Speech Recognit. Underst. Workshop (ASRU)
2025
-
[6]
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and 1 others. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, pages 1505--1518
2022
-
[7]
Sanyuan Chen, Chengyi Wang, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, and 1 others. 2025 a . Neural codec language models are zero-shot text to speech synthesizers. IEEE/ACM Trans. Audio, Speech, Lang. Process. (TASLP)
2025
-
[8]
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. 2025 b . F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. In Proc. Annu. Meet. Assoc. Comput. Linguist. (ACL)
2025
-
[9]
Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. 2025. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR)
2025
-
[10]
Ha-Yeong Choi, Sang-Hoon Lee, and Seong-Whan Lee. 2024. Dddm-vc: Decoupled denoising diffusion models with disentangled representation and prior mixup for verified robust voice conversion. In Proc. AAAI Conf. Artificial Intelligence (AAAI), volume 38, pages 17862--17870
2024
-
[11]
Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, and Xie Chen. 2025. Accelerating diffusion-based text-to-speech model training with dual modality alignment. In Ann. Conf. Int. Speech Commun. Assoc. (INTERSPEECH)
2025
-
[12]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, and 1 others. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1--53
2024
-
[13]
Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In IEEE Autom. Speech Recognit. Underst. Workshop (ASRU), pages 244--250
2021
-
[14]
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, and 1 others. 2025. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, and 1 others. 2024. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M \"u ller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, and 1 others. 2024. Scaling rectified flow transformers for high-resolution image synthesis, 2024. In Proc. Int. Conf. Mach. Learn. (ICML)
2024
- [17]
- [18]
- [19]
-
[20]
Alex Graves, Santiago Fern \'a ndez, Faustino Gomez, and J \"u rgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. Int. Conf. Mach. Learn. (ICML), pages 369--376
2006
-
[21]
Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, and 1 others. 2017. Cnn architectures for large-scale audio classification. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2017
-
[22]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), volume 33, pages 6840--6851
2020
-
[23]
Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech, Lang. Process. (TASLP), 29:3451--3460
2021
-
[25]
Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. 2023. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proc. Int. Conf. Mach. Learn. (ICML)
2023
-
[26]
Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. 2024. Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization. arXiv preprint arXiv:2412.21037
-
[27]
Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, and Joon Son Chung. 2025. Voicedit: Dual-condition diffusion transformer for environment-aware speech synthesis. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2025
-
[28]
Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), volume 33, pages 8067--8077
2020
-
[29]
Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021 a . Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proc. Int. Conf. Mach. Learn. (ICML), pages 5530--5540
2021
-
[30]
Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, and Seong-Whan Lee. 2021 b . Fre-gan: Adversarial frequency-consistent audio synthesis
2021
-
[31]
Seung-Bin Kim, Jun-Hyeok Cha, Hyung-Seok Oh, Heejin Choi, and Seong-Whan Lee. 2025. Fillerspeech: Towards human-like text-to-speech synthesis with filler insertion and filler style control. In Proc. Conf. Empir. Methods Nat. Lang. Process. (EMNLP), pages 34096--34113
2025
-
[32]
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pages 17022--17033
2020
-
[33]
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D \'e fossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. 2023. Audiogen: Textually guided audio generation. In Proc. Int. Conf. Learn. Represent. (ICLR)
2023
-
[34]
Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, and Seong-Whan Lee. 2025. Hierspeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis. IEEE Transactions on Neural Networks and Learning Systems
2025
-
[35]
Sang-Hoon Lee, Seung-Bin Kim, Ji-Hyun Lee, Eunwoo Song, Min-Jae Hwang, and Seong-Whan Lee. 2022. Hierspeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pages 16624--16636
2022
-
[36]
Yeonghyeon Lee, Inmo Yeon, Juhan Nam, and Joon Son Chung. 2024. Voiceldm: Text-to-speech with environmental context. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2024
-
[37]
Xian Li, Nian Shao, and Xiaofei Li. 2024. Self-supervised audio teacher-student transformer for both clip-level and frame-level tasks. IEEE/ACM Trans. Audio, Speech, Lang. Process. (TASLP), pages 1336--1351
2024
- [38]
- [39]
-
[40]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. 2023. F low matching for generative modeling. In Proc. Int. Conf. Learn. Represent. (ICLR)
2023
-
[41]
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. 2023 a . Audio LDM : Text-to-audio generation with latent diffusion models. In Proc. Int. Conf. Mach. Learn. (ICML)
2023
-
[42]
Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. 2024. A udio LDM 2: L earning holistic audio generation with self-supervised pretraining. IEEE/ACM Trans. Audio, Speech, Lang. Process. (TASLP), pages 2871--2883
2024
-
[43]
Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, and Zhou Zhao. 2023 b . Vit-tts: visual text-to-speech with scalable diffusion transformer. In Proc. Conf. Empir. Methods Nat. Lang. Process. (EMNLP), pages 15957--15969
2023
-
[44]
Huadai Liu, Kaicheng Luo, Jialei Wang, Wen Wang, Qian Chen, Zhou Zhao, and Wei Xue. 2025 a . Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing. In Proc. Adv. Neural Inf. Process. Syst. (NeurIPS)
2025
-
[45]
Rui Liu, Shuwei He, Yifan Hu, and Haizhou Li. 2025 b . Multi-modal and multi-scale spatial environment understanding for immersive visual text-to-speech. In Proc. AAAI Conf. Artificial Intelligence (AAAI), volume 39, pages 24632--24640
2025
-
[46]
Xingchao Liu, Chengyue Gong, and Qiang Liu. 2023 c . Flow straight and fast: Learning to generate and transfer data with rectified flow. In Proc. Int. Conf. Learn. Represent. (ICLR)
2023
-
[47]
Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai, and Zhen-Hua Ling. 2025 a . Incremental disentanglement for environment-aware zero-shot text-to-speech synthesis. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2025
- [48]
-
[49]
Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. 2024. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Trans. Audio, Speech, Lang. Process. (TASLP)
2024
-
[50]
Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, and Alaaeldin El-Nouby. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 4195--4205
2023
-
[52]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and 1 others. 2021. Learning transferable visual models from natural language supervision. In Proc. Int. Conf. Mach. Learn. (ICML)
2021
-
[53]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proc. Int. Conf. Mach. Learn. (ICML), pages 28492--28518
2023
-
[54]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, pages 1--67
2020
-
[55]
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. Fastspeech 2: Fast and high-quality end-to-end text to speech. In Proc. Int. Conf. Learn. Represent. (ICLR)
2021
-
[56]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (CVPR), pages 10684--10695
2022
-
[57]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234--241
2015
- [58]
- [59]
-
[60]
Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, and 1 others. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2018
-
[61]
Yang Song and 1 others. 2021. Score-based generative modeling through stochastic differential equations. In Proc. Int. Conf. Learn. Represent. (ICLR)
2021
-
[62]
Daxin Tan, Guangyan Zhang, and Tan Lee. 2022. Environment aware text-to-speech synthesis. In Ann. Conf. Int. Speech Commun. Assoc. (INTERSPEECH)
2022
- [63]
- [64]
-
[65]
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2023
-
[66]
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and 1 others. 2025. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. In Proc. Int. Conf. Learn. Represent. (ICLR)
2025
-
[67]
Jinlong Xue, Yayue Deng, Yingming Gao, and Ya Li. 2024. Auffusion: Leveraging the power of diffusion and large language models for text-to-audio generation. IEEE/ACM Trans. Audio, Speech, Lang. Process. (TASLP)
2024
-
[68]
Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, and 1 others. 2024. Uniaudio: An audio foundation model toward universal audio generation. In Proc. Int. Conf. Mach. Learn. (ICML)
2024
-
[69]
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. 2025. Representation alignment for generation: Training diffusion transformers is easier than you think. In Proc. Int. Conf. Learn. Represent. (ICLR)
2025
-
[70]
Jun-Hak Yun, Seung-Bin Kim, and Seong-Whan Lee. 2025. Flowhigh: Towards efficient and high-quality audio super-resolution with single-step flow matching. In IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP)
2025
-
[71]
Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882
work page internal anchor Pith review Pith/arXiv arXiv 2019
- [72]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.