
arxiv: 2606.23190 · v1 · pith:RTVADOAUnew · submitted 2026-06-22 · 📡 eess.AS · cs.SD

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

Pith reviewed 2026-06-26 07:00 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords flow-matching · text-to-speech · reinforcement learning · online RL · TTS · SDE paths · multi-objective reward · fine-tuning

The pith

Converting ODE trajectories to SDE paths enables direct RL fine-tuning of flow-matching TTS models without auxiliary models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that flow-matching based text-to-speech models can be fine-tuned with online reinforcement learning once their ODE trajectories are converted into SDE paths. The conversion lets rewards be applied directly, with no auxiliary models. The authors report that a weighted multi-objective reward combination converges faster than a probabilistic one, and they propose three practical training optimizations. This matters to readers who want to improve speaker similarity and perceptual quality in existing open-source TTS systems.

Core claim

FlowTTS-GRPO converts ordinary differential equation (ODE) trajectories into stochastic differential equation (SDE) paths to enable online reinforcement learning fine-tuning of flow-matching TTS models. This permits direct optimization with multi-objective rewards on the open-source models CosyVoice 3.0 and F5-TTS. Weighted rewards converge faster than a probabilistic scheme; omitting classifier-free guidance (CFG) during training accelerates convergence; synthesizing hard cases improves robustness; and applying RL to the FM component improves audio-detail metrics.
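Why omitting CFG saves compute can be seen in a minimal sketch. It assumes one common CFG convention, v = v_cond + w (v_cond - v_uncond); this page does not give the paper's exact formulation, and `guided_velocity`, `CountingToyModel`, and the toy velocity rule are hypothetical names made up for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def guided_velocity(model: VelocityFn, x: Vector, t: float,
                    cond: Vector, cfg_weight: float) -> Vector:
    """Velocity with optional classifier-free guidance (assumed convention)."""
    v_cond = model(x, t, cond)
    if cfg_weight == 0.0:
        # No guidance: a single forward pass per sampling step.
        return v_cond
    # Guidance needs a second, unconditional forward pass.
    v_uncond = model(x, t, None)
    return [vc + cfg_weight * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]


class CountingToyModel:
    """Hypothetical stand-in for an FM network; counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, t: float, cond: Optional[Vector]) -> Vector:
        self.calls += 1
        # Toy rule: pull x toward the mean of the condition (or toward 0).
        shift = 0.0 if cond is None else sum(cond) / len(cond)
        return [shift - xi for xi in x]


if __name__ == "__main__":
    for w in (0.0, 2.0):
        model = CountingToyModel()
        v = guided_velocity(model, [1.0, 2.0], 0.5, [3.0, 3.0], w)
        print(f"cfg_weight={w}: velocity={v}, forward passes={model.calls}")
```

In this toy setting, guidance doubles the number of forward passes per step, which is one plausible reason for the reported speed-up; the sketch says nothing about the effect on reward curves.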

What carries the argument

The ODE-to-SDE conversion that turns deterministic flow paths into stochastic ones for RL-based sampling and optimization.
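The idea can be illustrated on a toy problem. For an ODE dx = v(x, t) dt with marginals p_t, the SDE dx = [v + (σ²/2) ∇log p_t(x)] dt + σ dW has the same marginals, so sampling becomes stochastic without changing what is generated. The sketch below checks this on a hypothetical 1-D Gaussian path where velocity and score are known in closed form. It is not the paper's sampler: `MU`, `S`, the constant noise level, and all function names are assumptions, and a real FM model would have to derive the score from its learned velocity field.

```python
import math
import random

# Hypothetical 1-D target: data ~ N(MU, S^2), noise ~ N(0, 1).
# Path x_t = (1 - t) * noise + t * data, so p_t = N(t * MU, var_t(t)).
MU, S = 2.0, 0.5


def mean_t(t: float) -> float:
    return t * MU


def var_t(t: float) -> float:
    return (1.0 - t) ** 2 + (t * S) ** 2


def velocity(x: float, t: float) -> float:
    """Closed-form marginal velocity of the Gaussian path."""
    a = (t * S * S - (1.0 - t)) / var_t(t)
    return MU + a * (x - mean_t(t))


def score(x: float, t: float) -> float:
    """Closed-form score, d/dx log p_t(x)."""
    return -(x - mean_t(t)) / var_t(t)


def sample_endpoints(n: int, steps: int, sigma: float, rng: random.Random):
    """Euler (sigma = 0) or Euler-Maruyama (sigma > 0) from t = 0 to t = 1."""
    dt = 1.0 / steps
    out = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        for k in range(steps):
            t = k * dt
            drift = velocity(x, t) + 0.5 * sigma * sigma * score(x, t)
            noise = rng.gauss(0.0, 1.0) if sigma > 0.0 else 0.0
            x += drift * dt + sigma * math.sqrt(dt) * noise
        out.append(x)
    return out


def mean_std(xs):
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


if __name__ == "__main__":
    rng = random.Random(0)
    print("ODE:", mean_std(sample_endpoints(2000, 100, 0.0, rng)))
    print("SDE:", mean_std(sample_endpoints(2000, 100, 0.5, rng)))
```

Both samplers should end close to mean 2.0 and standard deviation 0.5. The stochastic version makes each step a Gaussian transition, which is what a policy-gradient method can attach a likelihood to; whether this carries over to high-dimensional speech features without artifacts is the open question flagged under "Load-bearing premise".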

If this is right

  • Weighted reward combination leads to faster convergence than a probabilistic scheme.
  • Omitting CFG during RL training accelerates convergence.
  • Synthesizing hard cases improves robustness.
  • RL on the FM component enhances audio-detail metrics.
  • Fine-tuning yields objective and subjective gains in speaker similarity and perceptual quality.
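The weighted-combination point can be sketched as follows. The Figure 2 excerpt on this page notes that reward standard deviations differ, so a plain weighted sum lets the widest-spread reward dominate. The code normalizes each reward within a sampling group before weighting. This is an illustrative guess at the mechanism, not the paper's formula: the function names, reward names, weights, and numbers are all hypothetical.

```python
import math
from typing import Dict, List


def group_zscore(values: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale one reward across the samples of a group."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]


def weighted_advantages(rewards: Dict[str, List[float]],
                        weights: Dict[str, float],
                        std_norm: bool = True) -> List[float]:
    """Weighted sum of per-reward group-relative terms (hypothetical scheme)."""
    n = len(next(iter(rewards.values())))
    total = [0.0] * n
    for name, vals in rewards.items():
        if std_norm:
            contrib = group_zscore(vals)
        else:
            mean = sum(vals) / n
            contrib = [v - mean for v in vals]  # centered only
        for i in range(n):
            total[i] += weights[name] * contrib[i]
    return total


if __name__ == "__main__":
    # Made-up group of two samples: one wins on similarity, one on DNSMOS.
    rewards = {"speaker_sim": [0.80, 0.83], "dnsmos": [3.6, 3.0]}
    weights = {"speaker_sim": 1.0, "dnsmos": 1.0}
    print("with std norm:   ", weighted_advantages(rewards, weights, True))
    print("without std norm:", weighted_advantages(rewards, weights, False))
```

With normalization the two samples tie under equal weights; without it the DNSMOS term, which has the larger numeric spread, decides the outcome regardless of the nominal weights.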

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique might extend to fine-tuning other types of generative flow models beyond speech synthesis.
  • The identified optimizations could be tested in non-RL fine-tuning scenarios for flow models.

Load-bearing premise

The conversion of ODE trajectories to SDE paths preserves the generative capabilities of the flow-matching model sufficiently for effective online RL fine-tuning without introducing artifacts or instability.

What would settle it

If experiments showed that models fine-tuned with FlowTTS-GRPO had worse or equal performance in speaker similarity and perceptual quality compared to the base models, the effectiveness of the method would be questioned.

Figures

Figures reproduced from arXiv: 2606.23190 by Biao Tian, Han Zhao, Haoxu Wang, Weiqing Li, Xiangang Li, Xiang Lv.

Figure 1: The pipeline of our FlowTTS-GRPO post-training for CosyVoice 3.0 and F5-TTS. Prompt speech tokens, generated speech tokens, and prompt speaker embedding are only used for CosyVoice 3.0.
… human-perception metrics. In the TTS domain, RL work has largely focused on LLM-based models. Seed-TTS [9] applies proximal policy optimization (PPO) [18] to optimize Word Error Rate (WER) and speaker similarity (SS), but P…
Figure 2: The standard deviation of three rewards in batch during training.
… according to different weights; this enables more targeted optimization of particular downstream metrics and allows different emphasis across objectives. 2.4.2. Weighted combination. As shown in Fig. 2, the standard deviations (std) of different rewards are not equal. If multiple rewards are combined directly by a weighted sum, their contri…
Figure 3: Proxy reward curves for CV3 on the dev-easy set during RL training with different numbers of GPUs.
3.4. Evaluation Sets. We evaluate on Seed-TTS-Eval [9] (Chinese test-zh: 2,020 samples; English test-en: 1,088 samples; challenging test-hard: 400 samples) and CV3-Eval [6]. CV3-Eval uses the Multilingual Voice Cloning subset, which contains nine languages with 500 samples each: Chinese (zh), English (en), Ja…
Figure 4: Proxy reward curves for CV3 and F5-TTS on the dev-hard set during RL training with Robust Training via Hard Cases. train-hard-20k represents the hard case training set.
[Plot text following the caption: x-axis Training Step; y-axis Speaker Similarity; legend Weighted, Probabilistic, Weighted w/o std Norm.]
Figure 5: Proxy reward curves for CV3 on the dev-easy set during RL training with different multi-objective reward combination strategies.
[Plot text following the caption: x-axis Training Step; y-axes Speaker Similarity and DNSMOS P.835 score; legend values 0.2, 0.4, 0.6, 0.8, 1.0 for a weight whose symbol is garbled.]
Figure 6: Proxy reward curves for CV3 on the dev-easy set during RL training with different DNSMOS weights.
[Plot text following the caption: x-axis Training Step; y-axis Speaker Similarity; legend w/o CFG, w/ CFG; annotated values 0.837 and 0.831.]
Figure 7: Effect of omitting CFG during RL training on the dev-easy set.
… TTS synthesizes directly from text as an FM-only architecture and achieves ASR improvements with the train-easy-40k set. This indicates that the FM can independently adapt distribution modeling to enhance intelligibility. Additionally, using train-hard-20k samples accelerates ASR reward growth, supporting the observation that optimization on ha…
Figure 9: Subjective A/B preference test between the baseline and our FlowTTS-GRPO model on the English subjective evaluation set.
… approach, the weighted combination method yields faster and more stable growth. We therefore adopt the weighted combination method as our final multi-objective strategy. We also conduct an ablation study on standard deviation (std) normalization during weighted combination. As shown by…
Original abstract

Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajectories into stochastic differential equation (SDE) paths, our method enables direct fine-tuning of open-source FM models without auxiliary models. We show that a weighted reward combination converges faster than a probabilistic scheme, and identify three practical optimizations: omitting classifier-free guidance (CFG) during training accelerates convergence; synthesizing hard cases improves robustness; and applying RL to the FM component enhances audio-detail metrics. Experiments on CosyVoice 3.0 and F5-TTS demonstrate objective and subjective preference gains in speaker similarity and perceptual quality, with F5-TTS also improving intelligibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces FlowTTS-GRPO, an online RL framework for flow-matching (FM) based TTS. By converting ODE trajectories into SDE paths, it enables direct fine-tuning of open-source FM models (CosyVoice 3.0, F5-TTS) without auxiliary networks. It claims a weighted multi-objective reward combination converges faster than a probabilistic scheme and identifies three optimizations (omitting CFG during training, synthesizing hard cases, applying RL to the FM component). Experiments are said to yield objective and subjective gains in speaker similarity, perceptual quality, and (for F5-TTS) intelligibility.

Significance. If the ODE-to-SDE conversion is shown to preserve generative fidelity without introducing instability or distribution shift, the approach could meaningfully extend RL fine-tuning techniques from LLM-based TTS to the FM setting, allowing direct improvement of open-source models. The practical optimizations and weighted-reward finding would be of interest to TTS practitioners if supported by reproducible ablations.

major comments (2)
  1. Abstract: the central claim that ODE-to-SDE conversion 'enables direct fine-tuning ... without auxiliary models' and 'preserves the generative capabilities' is load-bearing, yet the abstract supplies no derivation, sampling procedure, or fidelity metric (e.g., distribution-shift or reconstruction error) to substantiate that the conversion does not introduce artifacts; this must be addressed with explicit equations and verification in the methods section before the empirical claims can be evaluated.
  2. Abstract: objective and subjective 'preference gains' are asserted without any numerical values, error bars, dataset sizes, or baseline comparisons, rendering the magnitude and reliability of the reported improvements impossible to assess; this is a load-bearing gap for the experimental contribution.
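The request for error bars in comment 2 can be made concrete for an A/B preference test such as the one in Figure 9. The sketch below computes a Wilson score interval for the fraction of listeners preferring the fine-tuned model. The counts are hypothetical placeholders, not results from the paper, and `wilson_interval` is a name chosen here.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical: 60 of 100 non-tied judgments prefer the fine-tuned model.
    low, high = wilson_interval(60, 100)
    print(f"preference rate 0.60, interval [{low:.3f}, {high:.3f}]")
```

If the lower bound stays above 0.5, the preference is unlikely to be a coin flip at that confidence level; ties and per-listener correlation would need separate handling.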

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will make targeted revisions to strengthen the abstract and methods presentation.

Point-by-point responses
  1. Referee: Abstract: the central claim that ODE-to-SDE conversion 'enables direct fine-tuning ... without auxiliary models' and 'preserves the generative capabilities' is load-bearing, yet the abstract supplies no derivation, sampling procedure, or fidelity metric (e.g., distribution-shift or reconstruction error) to substantiate that the conversion does not introduce artifacts; this must be addressed with explicit equations and verification in the methods section before the empirical claims can be evaluated.

    Authors: The full manuscript's Methods section already derives the ODE-to-SDE conversion, specifies the sampling procedure, and reports fidelity verification (distribution shift and reconstruction error metrics) confirming no artifacts or instability. To address the referee's concern directly, we will revise the abstract to include a concise reference to these elements and ensure the methods section explicitly highlights the verification results with equations. revision: yes

  2. Referee: Abstract: objective and subjective 'preference gains' are asserted without any numerical values, error bars, dataset sizes, or baseline comparisons, rendering the magnitude and reliability of the reported improvements impossible to assess; this is a load-bearing gap for the experimental contribution.

    Authors: We agree the abstract is too high-level. The results section contains the full numerical results, error bars, dataset details, and baseline comparisons. We will revise the abstract to report key quantitative preference gains (with error bars and dataset sizes) so the magnitude of improvements is immediately clear. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and context describe an RL framework for flow-matching TTS that converts ODE trajectories to SDE paths to enable direct fine-tuning without auxiliary models, compares weighted vs. probabilistic reward schemes, and lists three practical optimizations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations are exhibited in the given material. The central claims rest on empirical results from external models (CosyVoice 3.0, F5-TTS) and standard RL techniques rather than reducing to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are at summary level only.



Reference graph

Works this paper leans on

48 extracted references · 23 canonical work pages · 12 internal anchors

  1. [1]

    FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

    Introduction. Text-to-speech (TTS) converts input text into audible speech and plays a key role in human–computer interaction. Recently, researchers have incorporated large language models (LLMs) [1] into TTS [2–9]. One line of work uses LLMs’ ability to model discrete speech tokens with in-context learning (ICL) [2, 7, 8]: a speech codec [10, 11] prod...

  2. [2]

    The TTS model used for RL finetuning. We select CosyVoice 3.0 [6] (CV3) and F5-TTS [13] as pre-trained TTS models for RL finetuning

    Methods. 2.1. The TTS model used for RL finetuning. We select CosyVoice 3.0 [6] (CV3) and F5-TTS [13] as pre-trained TTS models for RL finetuning. Zero-shot voice-cloning refers to generating speech in a target speaker’s voice using only a short prompt audio without requiring explicit speaker adaptation or fine-tuning. To refine zero-shot voice-cloning ...

  3. [3]

    Training Dataset. We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets

    Experimental Setup. 3.1. Training Dataset. We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets. Audio files serve as prompt waveforms with transcripts as prompt text. We randomly shuffle the original text corpus to produce target texts for voice cloning. We construct 20k samples each for Chinese and English (40k ...
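The pairing described in this excerpt (prompt audio with its own transcript, plus a target text drawn by shuffling the corpus) might look like the sketch below. The record fields `wav` and `text`, the function name, and the fixed seed are assumptions; the excerpt does not say whether a prompt may be paired with its own transcript, and this sketch does not prevent it.

```python
import random
from typing import Dict, List


def build_cloning_pairs(utterances: List[Dict[str, str]],
                        seed: int = 0) -> List[Dict[str, str]]:
    """Pair each prompt utterance with a target text from a shuffled copy."""
    rng = random.Random(seed)
    targets = [u["text"] for u in utterances]
    rng.shuffle(targets)  # shuffle only the copy; prompts keep their order
    return [
        {"prompt_wav": u["wav"], "prompt_text": u["text"], "target_text": t}
        for u, t in zip(utterances, targets)
    ]


if __name__ == "__main__":
    # Hypothetical three-utterance corpus.
    corpus = [
        {"wav": "a.wav", "text": "first sentence"},
        {"wav": "b.wav", "text": "second sentence"},
        {"wav": "c.wav", "text": "third sentence"},
    ]
    for pair in build_cloning_pairs(corpus):
        print(pair)
```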

  4. [4]

    Results and Discussion. 4.1. Evaluation Metrics. Following CV3 [6], we evaluate the effect of FlowTTS-GRPO fine-tuning using three objective metrics: • Content consistency (CER/WER): measures the intelligibility of synthesized speech. We report Character Error Rate (CER) for Chinese and other non-English languages, and Word Error Rate (WER) for English. F...
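CER and WER are both edit distance divided by reference length, computed over characters or words respectively. A minimal sketch follows, assuming the hypothesis string is the ASR transcript of the synthesized audio; text normalization and the ASR system itself are out of scope, and the function names are chosen here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions cost 1)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    print(cer("你好世界", "你好世"))
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.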

  5. [5]

    Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid models by converting ODE trajectories to SDE paths

    Conclusion. We introduce FlowTTS-GRPO, the first application of Flow-GRPO to text-to-speech models. Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid models by converting ODE trajectories to SDE paths. Our framework simplifies prior RL approaches by eliminating value networks, preference pairs, and token-to-reward models...

  6. [6]

    The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results

    Generative AI Use Disclosure. We use ChatGPT (OpenAI) and Qwen3-Max (Alibaba) for English grammar checking, sentence polishing in the manuscript. The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results. All authors are fully responsible for the...

  7. [7]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.

  8. [8]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.

  9. [9]

    Lauragpt: Listen, attend, understand, and regenerate audio with gpt

    Z. Du, J. Wang, Q. Chen, Y. Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Ma et al., “Lauragpt: Listen, attend, understand, and regenerate audio with gpt,” arXiv preprint arXiv:2310.04673, 2023.

  10. [10]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” arXiv preprint arXiv:2407.05407, 2024.

  11. [11]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024.

  12. [12]

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., “Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” arXiv preprint arXiv:2505.17589, 2025.

  13. [13]

    Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis

    Z. Ye, X. Zhu, C.-M. Chan, X. Wang, X. Tan, J. Lei, Y. Peng, H. Liu, Y. Jin, Z. Dai et al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,” arXiv preprint arXiv:2502.04128, 2025.

  14. [14]

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

    X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025.

  15. [15]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    P. Anastassiou, J. Chen, J. Chen, Y. Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gao et al., “Seed-tts: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024.

  16. [16]

    Soundstream: An end-to-end neural audio codec

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.

  17. [17]

    High Fidelity Neural Audio Compression

    A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.

  18. [18]

    Recent advances in discrete speech tokens: A review

    Y. Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu, “Recent advances in discrete speech tokens: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  19. [19]

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 6255–6271.

  20. [20]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in Proc. ICLR. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=XVjTT1nw5z

  21. [21]

    Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech

    S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025.

  22. [22]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Proc. NeurIPS, vol. 35, pp. 27730–27744, 2022.

  23. [23]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong, “Imagereward: Learning and evaluating human preferences for text-to-image generation,” Proc. NeurIPS, vol. 36, pp. 15903–15935, 2023.

  24. [24]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

  25. [25]

    Preference alignment improves language model-based tts

    J. Tian, C. Zhang, J. Shi, H. Zhang, J. Yu, S. Watanabe, and D. Yu, “Preference alignment improves language model-based tts,” in Proc. ICASSP. IEEE, 2025, pp. 1–5.

  26. [26]

    Emo-dpo: Controllable emotional speech synthesis through direct preference optimization

    X. Gao, C. Zhang, Y. Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” in Proc. ICASSP. IEEE, 2025, pp. 1–5.

  27. [27]

    Koel-tts: Enhancing llm based speech generation with preference alignment and classifier free guidance

    S. S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, R. Fejgin, M. T. Desta, R. Valle, and J. Li, “Koel-tts: Enhancing llm based speech generation with preference alignment and classifier free guidance,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 21230–21245.

  28. [28]

    Direct preference optimization: Your language model is secretly a reward model

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Proc. NeurIPS, vol. 36, pp. 53728–53741, 2023.

  29. [29]

    Differentiable Reward Optimization for LLM based TTS system

    C. Gao, Z. Du, and S. Zhang, “Differentiable Reward Optimization for LLM based TTS system,” in Proc. Interspeech, 2025, pp. 2450–2454.

  30. [30]

    Rrpo: Robust reward policy optimization for llm-based emotional tts

    C. Wang, C. Gao, Y. Xiang, Z. Du, K. An, H. Zhao, Q. Chen, X. Li, Y. Gao, and Y. Li, “Rrpo: Robust reward policy optimization for llm-based emotional tts,” arXiv preprint arXiv:2512.04552, 2025.

  31. [31]

    Group relative policy optimization for text-to-speech with large language models

    C. Liu, Y.-J. Hu, Y.-Y. Gao, S.-L. Zhang, and Z.-H. Ling, “Group relative policy optimization for text-to-speech with large language models,” arXiv preprint arXiv:2509.18798, 2025.

  32. [32]

    F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization

    X. Sun, R. Xiao, J. Mo, B. Wu, Q. Yu, and B. Wang, “F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,” arXiv preprint arXiv:2504.02407, 2025.

  33. [33]

    Flow-GRPO: Training Flow Matching Models via Online RL

    J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,” arXiv preprint arXiv:2505.05470, 2025.

  34. [34]

    Flowse-grpo: Training flow matching speech enhancement via online reinforcement learning

    H. Wang, B. Tian, Y. Jiang, Z. Pan, S. Zhao, B. Ma, D. Chen, and X. Li, “Flowse-grpo: Training flow matching speech enhancement via online reinforcement learning,” arXiv preprint arXiv:2601.16483, 2026.

  35. [35]

    HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

    J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” in Proc. NeurIPS, 2020. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/c5d736809766d46260d816d8dbc9eb44-Abstract.html

  36. [36]

    Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis

    H. Siuzdak, “Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” arXiv preprint arXiv:2306.00814, 2023.

  37. [37]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde,” arXiv preprint arXiv:2507.21802, 2025.

  38. [38]

    An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

    Y. Chen, S. Zheng, H. Wang, L. Cheng, Q. Chen, and J. Qi, “An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,” in Proc. Interspeech, 2023, pp. 2228–2232.

  39. [39]

    Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition

    Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition,” in Interspeech. ISCA, 2022, pp. 2063–2067.

  40. [40]

    Faster whisper large v3

    Systran, “Faster whisper large v3,” 2023. [Online]. Available: https://huggingface.co/Systran/faster-whisper-large-v3

  41. [41]

    Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    C. K. Reddy, V. Gopal, and R. Cutler, “Dnsmos p.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP. IEEE, 2022, pp. 886–890.

  42. [42]

    Promptrl: Prompt matters in rl for flow-based image generation

    F.-Y. Wang, H. Zhang, M. Gharbi, H. Li, and T. Park, “Promptrl: Prompt matters in rl for flow-based image generation,” arXiv preprint arXiv:2602.01382, 2026.

  43. [43]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer

    Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” arXiv preprint arXiv:2409.00750, 2024.

  44. [44]

    Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark

    L. Ma, D. Guo, K. Song, Y. Jiang, S. Wang, L. Xue, W. Xu, H. Zhao, B. Zhang, and L. Xie, “Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark,” arXiv preprint arXiv:2406.05763, 2024.

  45. [45]

    Libritts: A corpus derived from librispeech for text-to-speech

    H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in Proc. Interspeech, 2019, pp. 1526–1530.

  46. [46]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022.

  47. [47]

    Large-scale self-supervised speech representation learning for automatic speaker verification

    Z. Chen, S. Chen, Y. Wu, Y. Qian, C. Wang, S. Liu, Y. Qian, and M. Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6147–6151.

  48. [48]

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors

    C. K. Reddy, V. Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP. IEEE, 2021, pp. 6493–6497.