dots.tts Technical Report
Pith reviewed 2026-06-27 21:08 UTC · model grok-4.3
The pith
dots.tts shows that a 2B-parameter continuous autoregressive TTS model can reach top multilingual benchmark scores by structuring speech latents with an AudioVAE, full-history conditioning, and self-corrective post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
dots.tts is a 2B-parameter continuous autoregressive TTS foundation model that models speech in a continuous latent space. Trained on a large-scale multilingual corpus, it uses an AudioVAE with multiple objectives to create a semantically structured and prediction-friendly speech space, applies full-history conditioning in the flow-matching head to preserve long-range consistency, and performs reward-free self-corrective post-training on the flow-matching head to improve robustness and quality. On Seed-TTS-Eval this yields WERs of 0.94 percent, 1.30 percent, and 6.60 percent together with SIM scores of 81.0, 77.1, and 79.5 on the zh, en, and zh-hard sets, respectively, while also showing ope
What carries the argument
The AudioVAE trained with multiple objectives to produce a semantically structured continuous speech latent space, together with full-history conditioning inside the flow-matching head and reward-free self-corrective post-training.
If this is right
- The model produces speech with strong generation stability, voice cloning fidelity, and emotional expressiveness on multiple languages.
- CFG-aware MeanFlow distillation reduces first-packet latency to 85 ms in output streaming mode and 54 ms in dual-streaming mode.
- Releasing the full training and inference code plus pretrained, post-trained, and distilled checkpoints under Apache 2.0 supports direct reproduction and deployment.
- The same continuous-space approach consistently outperforms prior open-source systems across additional TTS benchmarks.
Where Pith is reading between the lines
- Similar continuous latent modeling with structured VAEs could be tested on other autoregressive audio tasks such as music generation to avoid tokenization artifacts.
- Full-history conditioning may reduce error accumulation in long-sequence models outside speech, such as video or text generation.
- The self-corrective post-training step could be applied to other flow-matching or diffusion heads to improve quality without requiring separate reward models.
- If the innovations remain effective at larger scales, the same recipe might support even higher-fidelity multilingual TTS foundation models.
Load-bearing premise
The three listed innovations are the main reason for the benchmark gains rather than model scale, training data volume, or other unmentioned implementation choices.
What would settle it
An ablation that removes one or more of the three innovations while keeping model size, data, and other details fixed and then measures whether Seed-TTS-Eval scores fall by a comparable margin would directly test the claim.
Figures
read the original abstract
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents dots.tts, a 2B-parameter continuous autoregressive TTS foundation model. Its three claimed innovations are: (1) multi-objective training of an AudioVAE to produce a semantically structured continuous speech latent space, (2) full-history conditioning inside the flow-matching head, and (3) reward-free self-corrective post-training of that head. After training on a large-scale multilingual corpus the model is reported to reach the best average scores on Seed-TTS-Eval (WER 0.94/1.30/6.60 %, SIM 81.0/77.1/79.5 on the zh/en/zh-hard partitions) and to outperform prior open-source systems on additional stability, cloning and expressiveness benchmarks. The work also describes CFG-aware MeanFlow distillation for low-latency streaming and releases training code together with pretrained, post-trained and distilled checkpoints under Apache 2.0.
Significance. If the reported benchmark numbers are reproducible and the three innovations can be shown to be responsible for the gains, the release of a 2B-parameter continuous autoregressive TTS model with full training code and checkpoints would constitute a useful, immediately usable open-source foundation for the field.
major comments (2)
- [Abstract] Abstract: the central claim that the three listed innovations produce the reported SOTA numbers is not supported by any controlled ablation. No comparison is given between the full model and matched-scale baselines that omit the multi-objective AudioVAE losses, the full-history conditioning, or the self-corrective post-training step; therefore the performance edge cannot be attributed to the innovations rather than corpus size, parameter count or unreported implementation choices.
- [Abstract] Abstract / Experiments section: the Seed-TTS-Eval numbers are presented without error bars, without the number of evaluation runs, and without any statement of statistical significance; this weakens the claim of “best average performance.”
minor comments (2)
- The manuscript should clarify the precise weighting schedule used for the multiple AudioVAE objectives and whether those weights were tuned on a held-out validation set.
- Figure captions and table headers should explicitly state the evaluation protocol (e.g., number of speakers, prompt length, temperature) used for the reported WER and SIM figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the three listed innovations produce the reported SOTA numbers is not supported by any controlled ablation. No comparison is given between the full model and matched-scale baselines that omit the multi-objective AudioVAE losses, the full-history conditioning, or the self-corrective post-training step; therefore the performance edge cannot be attributed to the innovations rather than corpus size, parameter count or unreported implementation choices.
Authors: We agree that the manuscript contains no controlled ablations that isolate the contribution of each of the three innovations. Training multiple matched 2B-parameter models for such ablations is beyond our compute budget. The innovations are presented as the core design decisions of the model. We will revise the abstract to remove the implication of direct causal attribution and instead state that dots.tts, which incorporates these three design elements, reaches the reported benchmark numbers. A brief limitations paragraph will also be added noting the absence of component ablations. revision: partial
-
Referee: [Abstract] Abstract / Experiments section: the Seed-TTS-Eval numbers are presented without error bars, without the number of evaluation runs, and without any statement of statistical significance; this weakens the claim of “best average performance.”
Authors: The reported Seed-TTS-Eval figures come from a single evaluation run following the benchmark protocol. We will update the experiments section to state the number of runs explicitly (one) and add a sentence noting that error bars and statistical significance tests are not provided because only a single run was performed. The abstract will be revised to say “achieves the highest reported average scores” rather than “best average performance.” revision: yes
Circularity Check
No circularity; empirical performance claims rest on external benchmarks
full rationale
The paper is an empirical technical report describing a 2B-parameter TTS model trained on a large-scale multilingual corpus. The three listed innovations (AudioVAE multi-objective training, full-history conditioning in the flow-matching head, reward-free self-corrective post-training) are presented as design choices whose effects are measured via standard benchmark metrics (WER and SIM on Seed-TTS-Eval and other external sets). No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. The reported numbers are direct evaluation outcomes on held-out test sets, not quantities derived by construction from the model inputs. Absence of component ablations is a separate methodological limitation but does not create circularity in any derivation chain.
Axiom & Free-Parameter Ledger
free parameters (2)
- 2B model size
- AudioVAE training objectives weights
axioms (1)
- domain assumption Multiple objectives on AudioVAE produce a semantically structured and prediction-friendly continuous speech latent space
Reference graph
Works this paper leans on
-
[1]
F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL): Long Papers , pages 6255–6271, 2025a. Han Zhu, Lingxuan Y e, Wei Kang, Zengwei Y ao, ...
-
[2]
Longcat-audiodit: High-fidelity diffusion text-to-speech in the waveform latent space
Detai Xin, Shujie Hu, Chengzuo Y ang, Chen Huang, Guoqiao Yu, Guanglu Wan, and Xunliang Cai. Longcat-audiodit: High-fidelity diffusion text-to-speech in the waveform latent space. arXiv preprint arXiv:2603.29339,
-
[3]
Oral. Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, et al. Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training. arXiv preprint arXiv:2505.17589,
-
[4]
Qwen Team. Qwen3-tts technical report. arXiv preprint arXiv:2601.15621,
-
[7]
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Y aoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, et al. Vibevoice technical report. arXiv preprint arXiv:2508.19205,
-
[8]
Voxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning
Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Y e, Weiyue Sun, Jiancheng Gui, Kehan Li, et al. Voxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning. arXiv preprint arXiv:2509.24650,
-
[9]
Autoregressive diffusion transformer for text-to-speech synthesis
Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, and Haizhou Li. Autoregressive diffusion transformer for text-to-speech synthesis. arXiv preprint arXiv:2406.05551,
-
[10]
Any2speech: Borderless long speech synthesis
Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding, Yunchao He, Jie Wang, Shengfan Shen, Sixiang Lv, Lichun Fan, et al. Any2speech: Borderless long speech synthesis. arXiv preprint arXiv:2603.19798,
-
[11]
Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, and Kai Yu. Holitok: A continuous holistic tokenization with robust dual capabilities of speech generation and understanding. arXiv preprint arXiv:2605.29948,
-
[12]
Soar: Self-correction for optimal alignment and refinement in diffusion models
Y ou Qin, Linqing Wang, Hao Fei, Roger Zimmermann, Liefeng Bo, Qinglin Lu, and Chunyu Wang. Soar: Self-correction for optimal alignment and refinement in diffusion models. arXiv preprint arXiv:2604.12617,
-
[13]
Seed Team, ByteDance. Seed-tts-eval benchmark. Introduced in Seed-TTS, arXiv:2406.02430, 2024b. MiniMax Team. Minimax-speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder. arXiv preprint arXiv:2505.07916,
-
[14]
Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,
-
[15]
Cosyvoice 2: Scalable streaming speech synthesis with large lan- guage models
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Y exin Y ang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable streaming speech synthesis with large lan- guage models. arXiv preprint arXiv:2412.10117,
-
[16]
Fireredtts-2: Towards long con- versational speech generation for podcast and chatbot
Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, and Y ao Hu. Fireredtts-2: Towards long con- versational speech generation for podcast and chatbot. arXiv preprint arXiv:2509.02020,
-
[17]
Fish audio s2 technical report
Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, et al. Fish audio s2 technical report. arXiv preprint arXiv:2603.08823,
-
[18]
Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis
Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Y e, Chen Zhang, Jionghao Bai, Xiaoda Y ang, Jialong Zuo, et al. Megatts 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis. arXiv preprint arXiv:2502.18924,
-
[19]
Xy-tokenizer: Mitigating the semantic-acoustic conflict in low-bitrate speech codecs
Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, and Xipeng Qiu. Xy-tokenizer: Mitigating the semantic-acoustic conflict in low-bitrate speech codecs. arXiv preprint arXiv:2506.23325,
-
[20]
Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling
Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Y ang, Xize Cheng, Zehan Wang, Ruiqi Li, et al. Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532,
-
[21]
Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis
Zhen Y e, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, et al. Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis. arXiv preprint arXiv:2502.04128, 2025b. Wenxi Chen, Xinsheng Wang, Ruiqi Y an, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiquan Li, Yuzhe 21 Liang, Hanlin We...
-
[22]
Canxiang Y an, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, et al. Ming-uniaudio: Speech llm for joint understanding, generation and editing with unified representation. arXiv preprint arXiv:2511.05516,
-
[23]
vllm-omni: Fully disaggregated serving for any-to-any multimodal models
Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Y ongxiang Huang, Taichang Zhou, Ruirui Y ang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.