arxiv: 2605.30965 · v1 · pith:4QCMU5VRnew · submitted 2026-05-29 · 📡 eess.AS · cs.AI · cs.CL

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Pith reviewed 2026-06-28 21:10 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.CL
keywords text-to-speech · environmental audio · multimodal diffusion transformer · representation alignment · cross-modal fusion · immersive audio · domain-specific alignment

The pith

ImmersiveTTS generates speech integrated with environmental audio by fusing latents in a multimodal diffusion transformer and aligning domain-specific representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that jointly modeling speech and environmental audio through cross-modal attention produces more natural and intelligible output than separate generation pipelines. It centers on a diffusion transformer that aligns transcript-driven speech features with text-guided environment context while adding a specialized alignment loss drawn from separate speech and audio encoders. This integration matters because most real-world audio scenes contain overlapping speech and background sounds that current TTS systems handle poorly. If the method holds, it directly improves fidelity in settings where audio must feel embedded rather than layered. The core test is whether the added alignment step actually enforces semantic fit between voice and surroundings.

Core claim

ImmersiveTTS builds a multimodal diffusion transformer that fuses transcript-aligned speech latents with text-conditioned environmental context via joint attention. It adds a domain-specific representation alignment objective that draws complementary self-supervised features from speech and audio encoders to enforce semantic consistency, with the aim of producing speech that integrates seamlessly with the surrounding audio.
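
A minimal sketch of what joint attention between two token streams can look like, written in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the single attention head, the parameter names (`wq_s`, `wq_e`, and so on), and the helper `init_params` are hypothetical, and the summary does not specify details such as normalization, timestep modulation, or the number of heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(speech, env, params):
    """Single-head joint attention over two token streams.

    speech: (T_s, d) speech latent tokens.
    env:    (T_e, d) environment context tokens.
    params: dict of per-stream projection matrices (hypothetical names).
    """
    d = speech.shape[1]
    # Each stream keeps its own Q/K/V projections (dual-stream design).
    q = np.concatenate([speech @ params["wq_s"], env @ params["wq_e"]], axis=0)
    k = np.concatenate([speech @ params["wk_s"], env @ params["wk_e"]], axis=0)
    v = np.concatenate([speech @ params["wv_s"], env @ params["wv_e"]], axis=0)
    # One attention map over the concatenated sequence.
    weights = softmax(q @ k.T / np.sqrt(d), axis=-1)
    out = weights @ v
    t_s = speech.shape[0]
    # Split back into streams and apply per-stream output projections.
    return out[:t_s] @ params["wo_s"], out[t_s:] @ params["wo_e"]

def init_params(d, seed=0):
    # Random weights for illustration only; a real model would learn these.
    rng = np.random.default_rng(seed)
    names = [f"w{p}_{s}" for p in "qkvo" for s in "se"]
    return {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in names}
```

Because the attention map spans the concatenated sequence, each speech token can attend to environment tokens and vice versa, which is the cross-modal interaction that separately run pipelines cannot provide.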

What carries the argument

Multimodal diffusion transformer performing joint attention between speech and environmental latents, plus domain-specific representation alignment objective.

If this is right

  • Generated speech scores higher on naturalness, intelligibility, and audio fidelity than prior text-to-speech systems when environmental context is present.
  • Joint attention explicitly models cross-modal interactions that separate pipelines cannot capture.
  • The alignment objective reduces acoustic and temporal mismatches between speech and background sounds.
  • Objective metrics and human evaluations both improve across tested environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion approach could be tested on real-time streaming audio where environment changes mid-utterance.
  • Extending the alignment loss to video-derived visual context might further tighten multimodal consistency without new labeled data.
  • If the alignment generalizes, it offers a route to parameter-efficient fine-tuning of existing diffusion audio models for immersive tasks.

Load-bearing premise

The domain-specific representation alignment objective produces semantic consistency and seamless integration between speech and environmental audio.
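
One common way to express a representation alignment objective is a cosine-similarity loss between projected transformer states and features from frozen self-supervised encoders. The sketch below assumes that form; the function names, the two projection heads, the loss weights, and the assumption that all sequences are already resampled to the same number of frames are illustrative choices, not details confirmed by the summary.

```python
import numpy as np

def cosine_alignment_loss(hidden, target, proj):
    """Mean negative cosine similarity, computed frame by frame.

    hidden: (T, d_model) intermediate transformer states.
    target: (T, d_enc) features from a frozen self-supervised encoder.
    proj:   (d_model, d_enc) trainable projection (hypothetical).
    """
    z = hidden @ proj
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + 1e-8)
    return float(-(z * t).sum(axis=-1).mean())

def alignment_objective(hidden, speech_feats, audio_feats,
                        proj_speech, proj_audio,
                        w_speech=1.0, w_audio=1.0):
    """Weighted sum of alignment terms against two frozen encoders.

    Assumption: one speech encoder and one general-audio encoder each
    supply a target, and each target has its own projection head.
    """
    return (w_speech * cosine_alignment_loss(hidden, speech_feats, proj_speech)
            + w_audio * cosine_alignment_loss(hidden, audio_feats, proj_audio))
```

In training, a term like this would be added to the main generative loss with a small weight. Whether minimizing it yields the semantic consistency the paper claims is exactly the premise under question here.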

What would settle it

A controlled listening test in which listeners consistently judge the generated speech as semantically mismatched to the described environment (for example, urgent speech during a quiet library scene) would falsify the consistency claim.
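
To make "consistently" concrete, one option is a one-sided exact binomial test on binary match/mismatch judgments. The sketch below assumes each judgment is independent and that chance level is 0.5; the function name and the example counts are hypothetical and do not come from the paper.

```python
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

if __name__ == "__main__":
    # Hypothetical outcome: 41 of 50 judgments say "mismatched".
    p_value = binomial_tail(41, 50)
    print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value here would indicate that listeners report mismatches more often than chance. Repeated judgments from the same listener are not independent, so a real study would need a design that accounts for that.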

Figures

Figures reproduced from arXiv: 2605.30965 by Jun-Hak Yun, Seong-Whan Lee, Seung-Bin Kim.

Figure 1: Overview of ImmersiveTTS. A dual-stream MM-DiT backbone conditions the speech stream on content
Figure 2: Comparison across different NFEs. … environment-related metrics, achieving the highest CLAP score and improving FAD over the Base, suggesting better alignment with environmental context. Because FAD is measured on the mixed waveform, stronger prompt alignment does not always translate into lower FAD, and we observe that WavLM tends to achieve lower FAD than ATST-Frame in this setting. USAD improves all thre…
Figure 3: Illustration of the double-stream DiT blocks.
Figure 4: Objective evaluation results on AudioCaps
Figure 5: Objective evaluation results on AudioCaps
Figure 6: Detailed information on listener requirements and the SN-MOS evaluation interfaces.
Figure 7: Detailed information on listener requirements and EC-MOS evaluation interfaces.
Figure 8: Detailed information on listener requirements and ON-MOS evaluation interfaces.
Figure 9: Detailed information on listener requirements and S-MOS evaluation interfaces.

Original abstract

Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes ImmersiveTTS, an environment-aware TTS system built on a multimodal diffusion transformer. It fuses transcript-aligned speech latents with text-conditioned environmental context via joint attention and introduces a domain-specific representation alignment objective that leverages complementary self-supervised representations from speech and audio encoders to promote semantic consistency. The central claim is that this architecture yields higher naturalness, intelligibility, and audio fidelity than prior methods, as measured by objective metrics and human listening tests.

Significance. If the performance claims are substantiated, the work would address a recognized gap in text-guided audio generation by enabling seamless integration of speech with environmental sounds despite their differing acoustic and temporal characteristics. The multimodal diffusion transformer with joint attention and the tailored alignment objective could supply a reusable framework for immersive audio synthesis applications.

major comments (1)
  1. Abstract: The assertion of superior performance across objective metrics and human listening tests is presented without any experimental details, baselines, error bars, datasets, or statistical analysis; the central empirical claim therefore cannot be evaluated from the supplied text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the positive assessment of the work's potential significance, conditional on the empirical claims. We address the single major comment point by point below.

Point-by-point responses
  1. Referee: Abstract: The assertion of superior performance across objective metrics and human listening tests is presented without any experimental details, baselines, error bars, datasets, or statistical analysis; the central empirical claim therefore cannot be evaluated from the supplied text.

    Authors: We agree that the abstract is written at a high level and omits specific experimental details, which is standard practice to maintain brevity (typically under 200 words). The full manuscript substantiates the claims in Section 4 (Experiments), which details the datasets, baseline systems, objective metrics with numerical results and error bars, human listening test design (including participant numbers and protocols), and statistical analysis. These elements enable evaluation of the performance assertions. We do not believe the abstract itself requires expansion, as doing so would violate length conventions without adding substantive value.

    Revision: no
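
For readers unfamiliar with how error bars on listening-test scores are usually reported, here is a minimal sketch of a mean opinion score with a normal-approximation 95% confidence interval. The supplied text does not state the procedure the paper uses, so the function name, the 1.96 multiplier, and the example ratings are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    ratings: list of numeric listener scores (for example, 1 to 5).
    z:       critical value; 1.96 corresponds to roughly 95% coverage.
    """
    n = len(ratings)
    half_width = z * stdev(ratings) / sqrt(n)
    return mean(ratings), half_width

if __name__ == "__main__":
    # Hypothetical ratings, not data from the paper.
    score, half = mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4])
    print(f"MOS = {score:.2f} ± {half:.2f}")
```

With few listeners, a Student's t critical value would be a more careful choice than 1.96.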

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description contain no equations, derivations, or first-principles claims that could reduce to inputs by construction. The model is described at a high architectural level (multimodal diffusion transformer with joint attention and a domain-specific alignment objective), with performance asserted via experimental results rather than any mathematical chain. No self-citations, fitted inputs renamed as predictions, or ansatzes are present in the given text. The derivation chain is therefore self-contained against external benchmarks, as there is no internal reduction to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the technical details needed to populate this ledger are absent from the supplied text.

pith-pipeline@v0.9.1-grok · 5689 in / 1048 out tokens · 26622 ms · 2026-06-28T21:10:20.369613+00:00 · methodology

