TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
Pith reviewed 2026-05-08 11:59 UTC · model grok-4.3
The pith
TTS-PRISM embeds a 12-dimensional schema into a model that reasons about and scores fine-grained perceptual flaws in Mandarin text-to-speech output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTS-PRISM establishes a perceptual reasoning model for Mandarin TTS by embedding a 12-dimensional diagnostic schema through instruction tuning on a dataset built from adversarial perturbations and expert anchors. On a 1,600-sample test set, it achieves better alignment with human judgments than generalist models, and when applied to six TTS paradigms, it produces diagnostic flags that distinguish their capabilities at a fine-grained level.
What carries the argument
The 12-dimensional perceptual schema, populated via targeted synthesis with adversarial perturbations and expert anchors, then embedded into an end-to-end model through schema-driven instruction tuning that forces explicit scoring and reasoning.
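The schema-driven tuning step can be pictured with a small sketch. Everything below is illustrative: the dimension names, rubric wording, and sample format are invented for this sketch, not taken from the paper's actual template.

```python
# Hypothetical sketch of a schema-driven instruction-tuning sample.
# Dimension names and score anchors are illustrative, not from the paper.
SCHEMA = {
    "pronunciation_accuracy": "5 = no mispronounced syllables; 1 = >3 errors per sentence",
    "prosodic_naturalness": "5 = human-like pausing and stress; 1 = flat or erratic",
    "emotion_expression": "5 = matches target arousal/valence; 1 = neutral collapse",
}

def build_training_sample(audio_id: str, labels: dict) -> dict:
    """Pack one (instruction, response) pair that forces per-dimension
    scoring plus a one-sentence rationale, as schema-driven tuning requires."""
    criteria = "\n".join(f"- {d}: {rubric}" for d, rubric in SCHEMA.items())
    instruction = (
        f"Rate audio <{audio_id}> on each dimension (1-5) using these anchors:\n"
        f"{criteria}\nExplain each score in one sentence."
    )
    response = "\n".join(
        f"{d}: {labels[d]['score']} -- {labels[d]['reason']}" for d in SCHEMA
    )
    return {"instruction": instruction, "response": response}

sample = build_training_sample(
    "utt_001",
    {
        "pronunciation_accuracy": {"score": 4, "reason": "one soft tone error"},
        "prosodic_naturalness": {"score": 5, "reason": "natural phrase breaks"},
        "emotion_expression": {"score": 3, "reason": "arousal slightly muted"},
    },
)
```

Forcing the response to enumerate every dimension with an explicit rationale is what makes the resulting model's scores interpretable rather than a single opaque scalar.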
If this is right
- TTS developers can run any new system through the model to receive dimension-by-dimension flags instead of a single score.
- Training loops can use the per-dimension scores to focus optimization on the weakest perceptual areas.
- The open-source checkpoints allow direct comparison of future TTS variants against the same diagnostic baseline.
- Production pipelines gain an automated way to reject outputs that fail on specific stability or expressiveness flags.
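The production-gating idea can be sketched minimally; the dimension names, 1-to-5 score scale, and threshold policy below are assumptions for illustration, not TTS-PRISM's actual output format.

```python
# Hypothetical per-dimension gate: reject an utterance if any
# critical dimension falls below its floor score (1-5 scale assumed).
FLOORS = {"stability": 4, "pronunciation_accuracy": 4, "expressiveness": 3}

def passes_gate(scores: dict) -> tuple[bool, list]:
    """Return (ok, failed_dimensions) for one diagnostic score vector."""
    failed = [d for d, floor in FLOORS.items() if scores.get(d, 0) < floor]
    return (not failed, failed)

# pronunciation_accuracy misses its floor of 4, so this output is rejected.
ok, failed = passes_gate(
    {"stability": 5, "pronunciation_accuracy": 3, "expressiveness": 4}
)
```

Because the gate names the failing dimension, a pipeline can route rejected outputs to targeted resynthesis instead of a blanket retry.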
Where Pith is reading between the lines
- The same schema-plus-tuning pattern could be recreated for other languages by repeating the perturbation and anchoring steps with native listeners.
- If the diagnostic flags prove stable, they could serve as auxiliary rewards inside reinforcement learning from human feedback for TTS.
- Integrating the model as a lightweight checker before deployment might reduce the need for large-scale human listening tests.
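The auxiliary-reward idea could look like the following sketch. The dimension names and weights are invented; how TTS-PRISM's scores would actually be wired into an RLHF loop is not specified in the paper.

```python
def auxiliary_reward(scores: dict, weights: dict) -> float:
    """Map per-dimension diagnostic scores (assumed 1-5) to a scalar in
    [0, 1] suitable for adding to an RLHF reward signal."""
    total = sum(weights.values())
    return sum(weights[d] * (scores[d] - 1) / 4 for d in weights) / total

# Illustrative weighting that emphasizes stability over expressiveness.
r = auxiliary_reward(
    {"stability": 5, "prosody": 3, "emotion": 4},
    {"stability": 0.5, "prosody": 0.3, "emotion": 0.2},
)
```

Weighting the dimensions separately is what would let a training loop steer optimization toward whichever perceptual area is currently weakest.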
Load-bearing premise
The twelve dimensions, derived from adversarial perturbations and expert anchors, capture every important aspect of human judgment of Mandarin speech without gaps or systematic bias.
What would settle it
The claim would fail if human listeners on a new Mandarin speech set identified a recurring quality problem that none of the twelve dimensions accounts for, or if TTS-PRISM scores diverged sharply from human ratings on additional test samples.
Original abstract
While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin TTS. It first establishes a 12-dimensional perceptual schema spanning stability to advanced expressiveness, then designs a targeted synthesis pipeline using adversarial perturbations and expert anchors to create a diagnostic dataset, and finally applies schema-driven instruction tuning to embed explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set are claimed to show that TTS-PRISM outperforms generalist models in human alignment, while profiling six TTS paradigms yields intuitive diagnostic flags revealing fine-grained capability differences. The work is released open-source with code and checkpoints.
Significance. If the empirical results hold after proper validation, TTS-PRISM could meaningfully advance TTS evaluation by replacing monolithic metrics with interpretable, fine-grained perceptual diagnostics, particularly for Mandarin speech where current tools are limited. The open-source release of code and checkpoints would further support reproducibility and adoption in both research and industry settings for targeted model improvement.
major comments (2)
- [Abstract] The central claims, outperformance in human alignment on the 1,600-sample Gold Test Set and the establishment of diagnostic flags for six TTS paradigms, are asserted without quantitative metrics, baseline details, statistical tests, ablation results, or effect sizes, leaving the empirical support for the framework's utility invisible in the provided text.
- [Schema construction] The 12-dimensional schema (introduced via adversarial perturbations and expert anchors) is load-bearing for all downstream claims of human alignment and fine-grained profiling. Yet no coverage analysis, inter-rater agreement with naïve listeners, or correlation with open-ended perceptual reports is reported, raising the risk that the schema omits critical dimensions or embeds selection/anchoring biases that would make the reported alignment metrics tautological rather than independently predictive.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the presentation of empirical results and the validation of the perceptual schema. We address each major comment point-by-point below and have prepared revisions to the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The central claims, outperformance in human alignment on the 1,600-sample Gold Test Set and the establishment of diagnostic flags for six TTS paradigms, are asserted without quantitative metrics, baseline details, statistical tests, ablation results, or effect sizes, leaving the empirical support for the framework's utility invisible in the provided text.
Authors: We agree that the abstract, as a high-level summary, would be strengthened by including key quantitative indicators. In the revised version we have added concise references to the main human-alignment metrics (correlation with expert ratings on the Gold Test Set), the primary baselines (generalist LLMs), and the statistical significance of the observed improvements, while preserving the abstract's brevity. Full tables of results, ablation studies, and effect sizes remain in the Experiments section. revision: yes
-
Referee: [Schema construction] The 12-dimensional schema (introduced via adversarial perturbations and expert anchors): this construction is load-bearing for all downstream claims of human alignment and fine-grained profiling, yet no coverage analysis, inter-rater agreement metrics with naïve listeners, or correlation with open-ended perceptual reports are reported, raising the risk that the schema omits critical dimensions or embeds selection/anchoring biases that would make reported alignment metrics tautological rather than independently predictive.
Authors: The schema was constructed from a systematic review of TTS perceptual literature combined with iterative expert consultation to span stability through advanced expressiveness. Expert anchors and adversarial perturbations were employed to isolate dimensions during dataset creation. We acknowledge that coverage analysis, naïve-listener agreement, and correlation with open-ended reports were not reported. In the revision we will add: (i) a mapping of the 12 dimensions to commonly reported perceptual issues in Mandarin TTS, (ii) inter-rater reliability statistics obtained from a supplementary panel of naïve listeners on a held-out subset, and (iii) Pearson correlations between schema-based scores and free-form listener descriptions. These additions will further demonstrate that the Gold Test Set labels were collected independently of schema construction, thereby avoiding circularity. revision: yes
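The promised correlation analysis is straightforward to sketch. The score vectors below are fabricated placeholders, not data from the paper, and the pure-Python Pearson implementation stands in for whatever statistics package the authors would actually use.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Fabricated example: model scores vs. human ratings for one dimension.
model_scores = [4, 3, 5, 2, 4, 1]
human_scores = [5, 3, 4, 2, 4, 1]
r = pearson(model_scores, human_scores)
```

Run per dimension against held-out naïve-listener ratings, this kind of check is what would separate genuine human alignment from circularity with the schema's own construction.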
Circularity Check
No circularity detected; the framework is externally grounded, and its claims do not reduce to their inputs by construction.
Full rationale
The paper describes a sequential construction: a 12-dimensional schema is established, a synthesis pipeline with adversarial perturbations and expert anchors creates the diagnostic dataset, schema-driven tuning produces the model, and performance is reported on a distinct 1,600-sample Gold Test Set for human alignment. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described chain that would make any result equivalent to its inputs by definition. The claims rest on the external test set and labels rather than tautological self-reference, so the evaluation chain is not circular.
Axiom & Free-Parameter Ledger
invented entities (1)
-
12-dimensional perceptual schema
no independent evidence
Reference graph
Works this paper leans on
-
[1]
black box
Introduction: Driven by the rapid evolution of large-scale generative models, modern Text-to-Speech (TTS) [1, 2, 3, 4, 5, 6] systems have achieved human-level capabilities. However, the traditional Mean Opinion Score (MOS) [7] faces a “black box” dilemma: its single scalar obscures real capabilities in pronunciation, prosody, and emotion, and fails to c...
-
[2]
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
Methodology: To enable fine-grained diagnosis of generative speech, we propose TTS-PRISM, a framework comprising a hierarchical evaluation schema, a targeted data synthesis pipeline, and a diagnostic scoring model. Crucially, to eliminate subjective ambiguity, we anchor each score level to explicit tolerance thresholds (e.g., defining specific artifact...
-
[3]
Compute-matched
Experimental Setup, 3.1. Dataset & Training Configuration: To evaluate alignment precision, we build a stratified 1,600-sample Mandarin Gold Test Set, strictly disjoint from training data, with 20% out-of-distribution (OOD) samples (unseen TTS and real recordings) and all labels validated via consensus-based expert annotation. For training, we perform full...
-
[4]
Results, 4.1. Fine-grained Accuracy and Rationale Quality: Table 1 shows TTS-PRISM’s superior alignment on the 1,600-sample Gold Test Set. Noise-injected training enables acute sensitivity to physical noise and artifacts. For Emotion Expression, our expert-anchored samples mitigate over-smoothing in generalist models, enabling precise high-arousal quan...
-
[5]
Experiments demonstrate superior human alignment and leading TTS profiling over generalist models
Conclusion: We propose TTS-PRISM, a fine-grained Mandarin speech diagnostic framework. Experiments demonstrate superior human alignment and leading TTS profiling over generalist models. However, Pronunciation Accuracy limitations reveal the inherent intelligibility tolerance of ASR backbones—a bias difficult to override via instruction tuning. Future wor...
-
[6]
These tools were not used to generate any core scientific ideas, experimental data, or technical contributions
Generative AI Use Disclosure: During the preparation of this manuscript, the authors used generative AI tools exclusively for the purpose of language editing and manuscript polishing to improve readability. These tools were not used to generate any core scientific ideas, experimental data, or technical contributions. All authors have thoroughly reviewed ...
-
[7]
Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., “CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” arXiv preprint arXiv:2505.17589, 2025
-
[8]
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,
Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2025, pp. 6255–6271
-
[9]
MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,
Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in International Conference on Learning Representations (ICLR), 2025
-
[10]
Qwen3-TTS technical report,
H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., “Qwen3-TTS technical report,” arXiv preprint arXiv:2601.15621, 2026
-
[11]
S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu, “IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025
-
[12]
FireRedTTS-2: Towards long conversational speech generation for podcast and chatbot,
K. Xie, F. Shen, J. Li, F. Xie, X. Tang, and Y. Hu, “FireRedTTS-2: Towards long conversational speech generation for podcast and chatbot,” arXiv preprint arXiv:2509.02020, 2025
-
[13]
Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives,
R. C. Streijl, S. Winkler, and D. S. Hands, “Mean opinion score (MOS) revisited: Methods and applications, limitations and alternatives,” Multimedia Systems, vol. 22, no. 2, pp. 213–227, 2016
-
[14]
SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics,
T. Saeki, S. Maiti, S. Takamichi, S. Watanabe, and H. Saruwatari, “SpeechBERTScore: Reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics,” in Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2024, pp. 4943–4947
-
[15]
SpeechAlign: Aligning speech generation to human preferences,
D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y. Zhou, and X. Qiu, “SpeechAlign: Aligning speech generation to human preferences,” in Advances in Neural Information Processing Systems (NeurIPS), 2024, pp. 50343–50360
-
[16]
Emo-DPO: Controllable emotional speech synthesis through direct preference optimization,
X. Gao, C. Zhang, Y. Chen, H. Zhang, and N. F. Chen, “Emo-DPO: Controllable emotional speech synthesis through direct preference optimization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
-
[17]
SpeechJudge: Towards human-level judgment for speech naturalness,
X. Zhang, C. Wang, H. Liao, Z. Li, Y. Wang, L. Wang, D. Jia, Y. Chen, X. Li, Z. Chen et al., “SpeechJudge: Towards human-level judgment for speech naturalness,” arXiv preprint arXiv:2511.07931, 2025
-
[18]
WavReward: Spoken dialogue models with generalist reward evaluators,
S. Ji, T. Liang, Y. Li, J. Zuo, M. Fang, J. He, Y. Chen, Z. Liu, Z. Jiang, X. Cheng et al., “WavReward: Spoken dialogue models with generalist reward evaluators,” arXiv preprint arXiv:2505.09558, 2025
-
[19]
AudioJudge: Understanding what works in large audio model based speech evaluation,
P. Manakul, W. H. Gan, M. J. Ryan, A. S. Khan, W. Sirichotedumrong, K. Pipatanakul, W. Held, and D. Yang, “AudioJudge: Understanding what works in large audio model based speech evaluation,” arXiv preprint arXiv:2507.12705, 2025
-
[20]
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
H. Wang, J. Zhao, Y. Yang, S. Liu, J. Chen, Y. Zhang, S. Zhao, J. Li, J. Zhou, H. Sun et al., “SpeechLLM-as-Judges: Towards general and interpretable speech quality evaluation,” arXiv preprint arXiv:2510.14664, 2025
-
[21]
Read to hear: A zero-shot pronunciation assessment using textual descriptions and LLMs,
Y.-W. Chen, M. Ma, and J. Hirschberg, “Read to hear: A zero-shot pronunciation assessment using textual descriptions and LLMs,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025, pp. 2682–2694
-
[22]
VStyle: A benchmark for voice style adaptation with spoken instructions,
J. Zhan, M. Han, Y. Xie, C. Wang, D. Zhang, K. Huang, H. Shi, D. Wang, T. Song, Q. Cheng et al., “VStyle: A benchmark for voice style adaptation with spoken instructions,” arXiv preprint arXiv:2509.09716, 2025
-
[23]
MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues,
G. Bai, J. Liu, X. Bu, Y. He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng et al., “MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 7421–7454
-
[24]
X. Wang, Z. Zhao, S. Ren, S. Zhang, S. Li, X. Li, Z. Wang, L. Qiu, G. Wan, X. Cao et al., “Audio Turing test: Benchmarking the human-likeness of large language model-based text-to-speech systems in Chinese,” arXiv preprint arXiv:2505.11200, 2025
-
[25]
NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,
G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2021, pp. 2127–2131
-
[26]
SOMOS: The Samsung open MOS dataset for the evaluation of neural text-to-speech synthesis,
G. Maniati, A. Vioni, N. Ellinas, K. Nikitaras, K. Klapsas, J. S. Sung, G. Jho, A. Chalamandaris, and P. Tsiakoulis, “SOMOS: The Samsung open MOS dataset for the evaluation of neural text-to-speech synthesis,” in Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2022, pp. 2388–2392
-
[27]
How do voices from past speech synthesis challenges compare today?
E. Cooper and J. Yamagishi, “How do voices from past speech synthesis challenges compare today?” in ISCA Speech Synthesis Workshop (SSW), 2021, pp. 184–189
-
[28]
WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,
B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “WenetSpeech: A 10000+ hours multi-domain Mandarin corpus for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6182–6186
-
[29]
AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines,
Y. Shi, H. Bu, X. Xu, S. Zhang, and M. Li, “AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines,” in Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, 2021, pp. 2756–2760
-
[30]
Consistent and specific multi-view subspace clustering,
S. Luo, C. Zhang, W. Zhang, and X. Cao, “Consistent and specific multi-view subspace clustering,” in AAAI Conference on Artificial Intelligence (AAAI), 2018
-
[31]
H. Liao, Q. Ni, Y. Wang, Y. Lu, H. Zhan, P. Xie, Q. Zhang, and Z. Wu, “NVSpeech: An integrated and scalable pipeline for human-like speech modeling with paralinguistic vocalizations,” arXiv preprint arXiv:2508.04195, 2025
-
[32]
Advancing zero-shot text-to-speech intelligibility across diverse domains via preference alignment,
X. Zhang, Y. Wang, C. Wang, Z. Li, Z. Chen, and Z. Wu, “Advancing zero-shot text-to-speech intelligibility across diverse domains via preference alignment,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2025, pp. 12251–12270
-
[33]
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025
-
[34]
INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback,
W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Wang, and L. Li, “INSTRUCTSCORE: Towards explainable text generation evaluation with automatic feedback,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023, pp. 5967–5994
-
[35]
Y. Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Li et al., “VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning,” arXiv preprint arXiv:2509.24650, 2025
-
[36]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024
-
[37]
MiMo-Audio: Audio language models are few-shot learners,
D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuang et al., “MiMo-Audio: Audio language models are few-shot learners,” arXiv preprint arXiv:2512.23808, 2025
-
[38]
Step-Audio-R1 technical report,
F. Tian, X. T. Zhang, Y. Zhang, H. Zhang, Y. Li, D. Liu, Y. Deng, D. Wu, J. Chen, L. Zhao et al., “Step-Audio-R1 technical report,” arXiv preprint arXiv:2511.15848, 2025
-
[39]
J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025