Longcat-audiodit: High-fidelity diffusion text-to-speech in the waveform latent space

Detai Xin, Shujie Hu, Chengzuo Yang, Chen Huang, Guoqiao Yu, Guanglu Wan, Xunliang Cai · 2026 · arXiv 2603.29339

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

dots.tts Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 6.0

dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.

VoxCPM2 Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 5.0

VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.

citing papers explorer

Showing 3 of 3 citing papers.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling · eess.AS · 2026-06-02 · unverdicted · none · ref 92
dots.tts Technical Report · cs.SD · 2026-06-05 · unverdicted · none · ref 2
VoxCPM2 Technical Report · cs.SD · 2026-06-05 · unverdicted · none · ref 36
