pith. machine review for the scientific record.

arxiv: 2604.17958 · v1 · submitted 2026-04-20 · 📡 eess.AS · cs.SD

Recognition: unknown

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:00 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords instruction-following TTS · multilingual TTS benchmark · controllable speech synthesis · TTS evaluation · paralinguistic controls · compositional instructions

The pith

A new benchmark for instruction-following text-to-speech reveals that current systems still struggle with complex compositional and paralinguistic controls across ten languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MINT-Bench to evaluate how well TTS systems follow detailed instructions across multiple languages. It relies on a hierarchical taxonomy of instruction types, a pipeline for creating diverse test cases, and a combined evaluation approach covering content accuracy, instruction adherence, and perceptual quality. Testing leading commercial and open-source systems shows that while top commercial models perform best overall, open-source alternatives can surpass them in languages such as Chinese. The results point to compositional instructions and paralinguistic elements as the main areas where current systems fall short. The authors release the benchmark data and tools to enable more targeted improvements in controllable TTS.
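To make the design concrete, a minimal sketch of what a single test item might contain follows; the field names, taxonomy path, and example values are illustrative assumptions, not the released MINT-Bench schema.

```python
# Hypothetical structure of one instruction-following TTS test item.
# Field names and values are illustrative assumptions, not the actual
# MINT-Bench data format.
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    item_id: str
    language: str                 # one of the ten evaluated languages
    taxonomy_path: str            # node in the hierarchical multi-axis taxonomy
    difficulty: str               # e.g. "easy" or "hard"
    instruction: str              # the control instruction given to the TTS system
    text: str                     # the content the system must speak
    control_targets: dict = field(default_factory=dict)  # attribute -> intended value

item = BenchmarkItem(
    item_id="demo-0001",
    language="ja",
    taxonomy_path="paralinguistic/explicit_nv",
    difficulty="hard",
    instruction="Speak in Japanese. Add a brief sigh before the final sentence.",
    text="(Japanese sentence to be synthesized)",
    control_targets={"nonverbal_event": "sigh"},
)
```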

Core claim

MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that current systems remain far from solved: frontier commercial systems lead overall, while leading open-source models become highly competitive and can even outperform commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems.
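How a hierarchical hybrid protocol of this kind might combine the three measurements is sketched below; the gating rule, score normalizations, and threshold are assumptions made for illustration, not the paper's actual protocol.

```python
# Illustrative scoring for a hierarchical hybrid evaluation. Assumption
# (not taken from the paper): content consistency gates instruction-following
# credit, and the three axes are reported separately.
def evaluate_sample(wer: float, following_level: int, quality_mos: float,
                    wer_gate: float = 0.3) -> dict:
    """
    wer:             ASR word error rate against the reference text (content consistency)
    following_level: judge rating in {1, 2, 3} for instruction following
    quality_mos:     predicted mean opinion score in [1, 5] (perceptual quality)
    """
    content_ok = wer <= wer_gate
    return {
        "content_consistency": max(0.0, 1.0 - wer),
        # no instruction-following credit if the spoken content itself is wrong
        "instruction_following": (following_level - 1) / 2 if content_ok else 0.0,
        "perceptual_quality": (quality_mos - 1) / 4,
    }

# Averaging these per system and per taxonomy node is what would expose
# bottlenecks such as compositional or paralinguistic controls.
```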

What carries the argument

The hierarchical multi-axis taxonomy, scalable multi-stage data construction pipeline, and hierarchical hybrid evaluation protocol of MINT-Bench.

Load-bearing premise

The hierarchical multi-axis taxonomy, scalable multi-stage data construction pipeline, and hierarchical hybrid evaluation protocol together provide a complete, unbiased, and diagnostically useful measure of instruction-following TTS capabilities.

What would settle it

If independent human raters consistently disagree with MINT-Bench rankings on which systems best follow complex instructions, or if high-scoring systems fail on similar instructions outside the benchmark's test set.
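One concrete form of that check, sketched with placeholder numbers and assuming scipy is available: rank the systems by their benchmark scores and by independent human ratings, and measure the rank agreement.

```python
# Sketch of the settling test: rank agreement between benchmark scores and
# independent human ratings. System names and numbers are placeholders.
from scipy.stats import spearmanr

bench_scores = {"system_A": 0.81, "system_B": 0.74, "system_C": 0.69, "system_D": 0.55}
human_scores = {"system_A": 4.1,  "system_B": 3.9,  "system_C": 3.2,  "system_D": 2.8}

systems = sorted(bench_scores)
rho, p_value = spearmanr([bench_scores[s] for s in systems],
                         [human_scores[s] for s in systems])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# Consistently low or negative rho on held-out instructions would count
# against the benchmark's ability to rank instruction-following systems.
```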

Figures

Figures reproduced from arXiv: 2604.17958 by Bengu Wu, Chuan Xie, Dake Guo, Guobin Ma, Hanke Xie, Huakang Chen, Jingbin Hu, Lei Xie, Linhan Ma, Liumeng Xue, Pengyuan Xie, Qiang Zhang, Qirui Zhan, Wenhao Li, Yuepeng Jiang.

Figure 1: Overview of MINT-Bench, consisting of a hierarchical multi-axis taxonomy, a three-stage data construction pipeline, …
Figure 2: Hierarchical hybrid evaluation protocol of MINT-Bench. The pipeline progressively evaluates content consistency, …
Figure 3: The prompt template for Structured Label Planning (Part 1). This part establishes the general generation boundaries, …
Figure 4: The prompt template for Structured Label Planning (Part 2). This part defines node-specific parsing logic, fine-grained …
Figure 5: The prompt template for Instruction-Text Pair Construction (Part 1). This part establishes the overarching generation …
Figure 6: The prompt template for Instruction-Text Pair Construction (Part 2). This part details the specialized realization …
Figure 7: The LALM evaluation prompt used for Easy taxonomy nodes, focusing on conservative verification of atomic and …
Figure 8: The LALM evaluation prompt used for Hard taxonomy nodes. It introduces strict guidelines for structural realization, …
Figure 9: The LALM evaluation prompt used for Special (Extra-Vocal) taxonomy nodes. It applies a two-step verification process …
Figure 10: Screenshot of the custom web-based human annotation platform. The interface displays the instruction, reference text, …
Original abstract

Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, and insufficient multilingual support. We present MINT-Bench, a comprehensive multilingual benchmark for instruction-following TTS. MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that current systems remain far from solved: frontier commercial systems lead overall, while leading open-source models become highly competitive and can even outperform commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems. We release MINT-Bench together with the data construction and evaluation toolkit to support future research on controllable, multilingual, and diagnostically grounded TTS evaluation. The leaderboard and demo are available at https://longwaytog0.github.io/MINT-Bench/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents MINT-Bench, a multilingual benchmark for instruction-following TTS built on a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol assessing content consistency, instruction following, and perceptual quality. Experiments across ten languages indicate that frontier commercial systems lead overall while leading open-source models are highly competitive and can outperform commercial counterparts in localized settings such as Chinese; harder compositional and paralinguistic controls remain major bottlenecks. The benchmark, data construction toolkit, and evaluation toolkit are released publicly with a leaderboard.

Significance. If the taxonomy, pipeline, and protocol are shown to be reliable, MINT-Bench would offer diagnostically granular, multilingual evaluation that addresses key limitations in prior TTS benchmarks and could meaningfully guide development of controllable speech systems. The public release of the construction and evaluation toolkit is a clear strength supporting reproducibility.

major comments (2)
  1. [Data construction pipeline (methods section)] The abstract and methods description of the hierarchical multi-axis taxonomy and multi-stage data construction pipeline provide no inter-annotator agreement statistics, bias analysis, or external validation of the taxonomy axes. This is load-bearing for the central claim that the benchmark supplies an unbiased and diagnostically useful measure across languages.
  2. [Evaluation protocol (methods and experiments sections)] The hierarchical hybrid evaluation protocol is presented without reported reliability metrics (e.g., agreement between automated metrics and human raters, consistency of scores across the ten languages, or ablation of the hybrid components). This directly affects confidence in the reported performance gaps between commercial and open-source systems and the identification of control-type bottlenecks.
minor comments (1)
  1. [Abstract] The abstract states results across ten languages but does not indicate the number of instructions, samples per language, or total test cases; adding these figures would improve immediate context for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the absence of explicit reliability and validation statistics weakens confidence in the benchmark's claims and have revised the manuscript to incorporate the requested analyses and metrics.

Point-by-point responses
  1. Referee: [Data construction pipeline (methods section)] The abstract and methods description of the hierarchical multi-axis taxonomy and multi-stage data construction pipeline provide no inter-annotator agreement statistics, bias analysis, or external validation of the taxonomy axes. This is load-bearing for the central claim that the benchmark supplies an unbiased and diagnostically useful measure across languages.

    Authors: We acknowledge that the original submission omitted these statistics. The taxonomy axes were derived from established linguistic and paralinguistic frameworks in the speech literature and iteratively refined by the author team; however, we have now added inter-annotator agreement (Fleiss' kappa) for the annotation stages of the data pipeline, a language-wise bias analysis of the generated instructions, and a comparison of the taxonomy against prior TTS control taxonomies. These results appear in the revised Section 3.2 and new Appendix C. revision: yes

  2. Referee: [Evaluation protocol (methods and experiments sections)] The hierarchical hybrid evaluation protocol is presented without reported reliability metrics (e.g., agreement between automated metrics and human raters, consistency of scores across the ten languages, or ablation of the hybrid components). This directly affects confidence in the reported performance gaps between commercial and open-source systems and the identification of control-type bottlenecks.

    Authors: We agree that reliability evidence is essential. The revised manuscript now includes (i) Pearson and Spearman correlations between the automated metrics and human ratings on a 500-sample subset, (ii) per-language score consistency statistics, and (iii) an ablation removing each hybrid component in turn. These analyses are reported in Section 4.3, Table 6, and Figure 4, confirming that the observed gaps and bottleneck conclusions remain stable. revision: yes
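The reliability statistics invoked in these responses are standard; a toy illustration of both kinds (Fleiss' kappa over annotation labels, and automated-metric vs. human-rating correlation on a rated subset) follows, with made-up numbers and assuming numpy, scipy, and statsmodels are available.

```python
# Toy illustration of the cited reliability checks; all numbers are made up.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from statsmodels.stats.inter_rater import fleiss_kappa

# Inter-annotator agreement: rows are annotated items, columns are label
# categories, entries count how many of the 3 annotators chose each category.
ratings = np.array([
    [3, 0, 0],
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
    [0, 0, 3],
])
print(f"Fleiss' kappa = {fleiss_kappa(ratings, method='fleiss'):.2f}")

# Automated-metric vs. human agreement on a rated subset.
auto_scores = np.array([0.9, 0.7, 0.4, 0.8, 0.3])
human_mos   = np.array([4.5, 3.8, 2.1, 4.0, 1.9])
print(f"Pearson r    = {pearsonr(auto_scores, human_mos)[0]:.2f}")
print(f"Spearman rho = {spearmanr(auto_scores, human_mos)[0]:.2f}")
```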

Circularity Check

0 steps flagged

No significant circularity in benchmark construction

Full rationale

The paper constructs MINT-Bench via a hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation protocol, with all claims grounded in external data sources and standard TTS evaluation practices across ten languages. No mathematical derivations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the derivation chain; experimental results on commercial vs. open-source performance and control bottlenecks are direct outputs of the benchmark rather than inputs redefined as predictions. The work is validated against external benchmarks and data, with no reduction of claims to internal definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution centers on empirical benchmark design rather than theoretical constructs.

pith-pipeline@v0.9.0 · 5540 in / 1048 out tokens · 42602 ms · 2026-05-10T04:00:12.003847+00:00 · methodology

