FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

Biao Tian; Han Zhao; Haoxu Wang; Weiqing Li; Xiangang Li; Xiang Lv

arxiv: 2606.23190 · v1 · pith:RTVADOAUnew · submitted 2026-06-22 · 📡 eess.AS · cs.SD

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

Haoxu Wang , Biao Tian , Weiqing Li , Xiang Lv , Han Zhao , Xiangang Li This is my paper

Pith reviewed 2026-06-26 07:00 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords flow-matchingtext-to-speechreinforcement learningonline RLTTSSDE pathsmulti-objective rewardfine-tuning

0 comments

The pith

Converting ODE trajectories to SDE paths enables direct RL fine-tuning of flow-matching TTS models without auxiliary models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that flow-matching based text-to-speech models can be fine-tuned using online reinforcement learning by converting their ODE trajectories into SDE paths. This conversion allows direct application of rewards without needing auxiliary models. The authors demonstrate that a weighted multi-objective reward combination converges faster than a probabilistic one and propose three practical optimizations for training. Readers would care if this makes it easier to enhance existing open-source TTS systems for better voice similarity and sound quality.

Core claim

FlowTTS-GRPO converts ordinary differential equation trajectories into stochastic differential equation paths to enable online reinforcement learning fine-tuning of flow-matching TTS models. This permits direct optimization using multi-objective rewards on open-source models like CosyVoice and F5-TTS. The method shows faster convergence with weighted rewards, and optimizations such as skipping classifier-free guidance speed up training while improving robustness and detail metrics through hard case synthesis and targeted RL application.

What carries the argument

The ODE-to-SDE conversion that turns deterministic flow paths into stochastic ones for RL-based sampling and optimization.

If this is right

Weighted reward combination leads to faster convergence than probabilistic scheme.
Omitting CFG during RL training accelerates convergence.
Synthesizing hard cases improves robustness.
RL on the FM component enhances audio-detail metrics.
Results in objective and subjective gains in speaker similarity and perceptual quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This technique might extend to fine-tuning other types of generative flow models beyond speech synthesis.
The identified optimizations could be tested in non-RL fine-tuning scenarios for flow models.

Load-bearing premise

The conversion of ODE trajectories to SDE paths preserves the generative capabilities of the flow-matching model sufficiently for effective online RL fine-tuning without introducing artifacts or instability.

What would settle it

If experiments showed that models fine-tuned with FlowTTS-GRPO had worse or equal performance in speaker similarity and perceptual quality compared to the base models, the effectiveness of the method would be questioned.

Figures

Figures reproduced from arXiv: 2606.23190 by Biao Tian, Han Zhao, Haoxu Wang, Weiqing Li, Xiangang Li, Xiang Lv.

**Figure 1.** Figure 1: The pipeline of our FlowTTS-GRPO post-training for CosyVoice 3.0 and F5-TTS. Prompt speech tokens, generated speech tokens, and prompt speaker embedding are only used for CosyVoice 3.0. human-perception metrics. In the TTS domain, RL work has largely focused on LLM-based models. Seed-TTS [9] applies proximal policy optimization (PPO) [18] to optimize Word Error Rate (WER) and speaker similarity (SS), but P… view at source ↗

**Figure 2.** Figure 2: The standard deviation of three rewards in batch during training. according to different weights; this enables more targeted optimization of particular downstream metrics and allows different emphasis across objectives. 2.4.2. Weighted combination As shown in Fig.2, the standard deviations (std) of different rewards are not equal. If multiple rewards are combined directly by a weighted sum, their contri… view at source ↗

**Figure 3.** Figure 3: Proxy reward curves for CV3 on the dev-easy set during RL training with different numbers of GPUs. 3.4. Evaluation Sets We evaluate on Seed-TTS-Eval [9] (Chinese test-zh: 2,020 samples; English test-en: 1,088 samples; challenging test-hard: 400 samples) and CV3-Eval [6]. CV3-Eval uses the Multilingual Voice Cloning subset, which contains nine languages with 500 samples each: Chinese (zh), English (en), Ja… view at source ↗

**Figure 4.** Figure 4: Proxy reward curves for CV3 and F5-TTS on the dev-hard set during RL training with Robust Training via Hard Cases. train-hard-20k represents the hard case training set. 0 2000 4000 6000 8000 10000 Training Step 0.800 0.805 0.810 0.815 0.820 0.825 0.830 0.835 Speaker Similarity Speaker Similarity Weighted Probabilistic Weighted w/o std Norm. 0 2000 4000 6000 8000 10000 Training Step 3.25 3.30 3.35 3.40 3.45… view at source ↗

**Figure 5.** Figure 5: Proxy reward curves for CV3 on the dev-easy set during RL training with different multi-objective reward combination strategies. 0 2000 4000 6000 8000 10000 Training Step 0.805 0.810 0.815 0.820 0.825 0.830 Speaker Similarity Speaker Similarity 3 3 = 0.2 3 = 0.4 3 = 0.6 3 = 0.8 3 = 1.0 0 2000 4000 6000 8000 10000 Training Step 3.26 3.28 3.30 3.32 3.34 3.36 P835. DNSMOS Score P835. DNSMOS Score 3 3 = 0.2 3 … view at source ↗

**Figure 6.** Figure 6: Proxy reward curves for CV3 on the dev-easy set during RL training with different DNSMOS weights. 0 1000 2000 3000 4000 5000 6000 7000 8000 Training Step 0.805 0.810 0.815 0.820 0.825 0.830 0.835 Speaker Similarity 0.837 0.831 w/o CFG w CFG [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Effect of omitting CFG during RL training on the deveasy set. TTS synthesizes directly from text as an FM-only architecture and achieves ASR improvements with the train-easy-40k set. This indicates that the FM can independently adapt distribution modeling to enhance intelligibility. Additionally, using trainhard-20k samples accelerates ASR reward growth, supporting the observation that optimization on ha… view at source ↗

**Figure 9.** Figure 9: Subjective A/B preference test between the baseline and our FlowTTS-GRPO model on the English subjective evaluation set. approach, the weighted combination method yields faster and more stable growth. We therefore adopt the weighted combination method as our final multi-objective strategy. We also conduct an ablation study on standard deviation (std) normalization during weighted combination. As shown by… view at source ↗

read the original abstract

Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajectories into stochastic differential equation (SDE) paths, our method enables direct fine-tuning of open-source FM models without auxiliary models. We show that a weighted reward combination converges faster than a probabilistic scheme, and identify three practical optimizations: omitting classifier-free guidance (CFG) during training accelerates convergence; synthesizing hard cases improves robustness; and applying RL to the FM component enhances audio-detail metrics. Experiments on CosyVoice 3.0 and F5-TTS demonstrate objective and subjective preference gains in speaker similarity and perceptual quality, with F5-TTS also improving intelligibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows how to apply online RL to flow-matching TTS by converting ODE trajectories to SDE paths for direct fine-tuning, plus three practical optimizations that improve convergence and quality on two open models.

read the letter

The main point is that they convert ODE sampling paths in flow-matching TTS into SDE paths so they can run online RL directly on open models like CosyVoice 3.0 and F5-TTS without extra networks. They also find that a weighted reward combination converges faster than a probabilistic one and list three optimizations: skipping classifier-free guidance during training, synthesizing hard cases for robustness, and applying RL specifically to the FM component to boost detail metrics.

What stands out as new is the targeted framework for FM-based TTS, an area that has seen little RL work compared to LLM approaches. The experiments report objective and subjective gains in speaker similarity and perceptual quality, with added intelligibility on F5-TTS.

The paper does a solid job at filling that gap with concrete technical choices and practical tips that could be reused. The results on two different models give some indication the approach is not tied to one architecture.

The soft spots are that the abstract supplies no numbers, error bars, or dataset specifics, so the size of the gains is unclear from the summary alone. The key assumption—that the ODE-to-SDE shift preserves the original generative distribution without major artifacts or instability—needs checking against the full derivations and any ablations on distribution shift. If those hold up, the rest follows.

This is for TTS researchers who want to fine-tune flow-matching models with RL. It deserves a serious referee because it brings a workable method to an under-explored combination, even if the gains are incremental rather than transformative.

Referee Report

2 major / 0 minor

Summary. The paper introduces FlowTTS-GRPO, an online RL framework for flow-matching (FM) based TTS. By converting ODE trajectories into SDE paths, it enables direct fine-tuning of open-source FM models (CosyVoice 3.0, F5-TTS) without auxiliary networks. It claims a weighted multi-objective reward combination converges faster than a probabilistic scheme and identifies three optimizations (omitting CFG during training, synthesizing hard cases, applying RL to the FM component). Experiments are said to yield objective and subjective gains in speaker similarity, perceptual quality, and (for F5-TTS) intelligibility.

Significance. If the ODE-to-SDE conversion is shown to preserve generative fidelity without introducing instability or distribution shift, the approach could meaningfully extend RL fine-tuning techniques from LLM-based TTS to the FM setting, allowing direct improvement of open-source models. The practical optimizations and weighted-reward finding would be of interest to TTS practitioners if supported by reproducible ablations.

major comments (2)

Abstract: the central claim that ODE-to-SDE conversion 'enables direct fine-tuning ... without auxiliary models' and 'preserves the generative capabilities' is load-bearing, yet the abstract supplies no derivation, sampling procedure, or fidelity metric (e.g., distribution-shift or reconstruction error) to substantiate that the conversion does not introduce artifacts; this must be addressed with explicit equations and verification in the methods section before the empirical claims can be evaluated.
Abstract: objective and subjective 'preference gains' are asserted without any numerical values, error bars, dataset sizes, or baseline comparisons, rendering the magnitude and reliability of the reported improvements impossible to assess; this is a load-bearing gap for the experimental contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will make targeted revisions to strengthen the abstract and methods presentation.

read point-by-point responses

Referee: Abstract: the central claim that ODE-to-SDE conversion 'enables direct fine-tuning ... without auxiliary models' and 'preserves the generative capabilities' is load-bearing, yet the abstract supplies no derivation, sampling procedure, or fidelity metric (e.g., distribution-shift or reconstruction error) to substantiate that the conversion does not introduce artifacts; this must be addressed with explicit equations and verification in the methods section before the empirical claims can be evaluated.

Authors: The full manuscript's Methods section already derives the ODE-to-SDE conversion, specifies the sampling procedure, and reports fidelity verification (distribution shift and reconstruction error metrics) confirming no artifacts or instability. To address the referee's concern directly, we will revise the abstract to include a concise reference to these elements and ensure the methods section explicitly highlights the verification results with equations. revision: yes
Referee: Abstract: objective and subjective 'preference gains' are asserted without any numerical values, error bars, dataset sizes, or baseline comparisons, rendering the magnitude and reliability of the reported improvements impossible to assess; this is a load-bearing gap for the experimental contribution.

Authors: We agree the abstract is too high-level. The results section contains the full numerical results, error bars, dataset details, and baseline comparisons. We will revise the abstract to report key quantitative preference gains (with error bars and dataset sizes) so the magnitude of improvements is immediately clear. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and context describe an RL framework for flow-matching TTS that converts ODE trajectories to SDE paths to enable direct fine-tuning without auxiliary models, compares weighted vs. probabilistic reward schemes, and lists three practical optimizations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations are exhibited in the given material. The central claims rest on empirical results from external models (CosyVoice 3.0, F5-TTS) and standard RL techniques rather than reducing to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all technical details are at summary level only.

pith-pipeline@v0.9.1-grok · 5704 in / 1071 out tokens · 26911 ms · 2026-06-26T07:00:11.437302+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 23 canonical work pages · 12 internal anchors

[1]

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

Introduction Text-to-speech (TTS) converts input text into audible speech and plays a key role in human–computer interaction. Re- cently, researchers have incorporated large language models (LLMs) [1] into TTS [2–9]. One line of work uses LLMs’ ability to model discrete speech tokens with in-context learn- ing (ICL) [2, 7, 8]: a speech codec [10, 11] prod...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

The TTS model used for RL finetuning We select CosyV oice 3.0 [6] (CV3) and F5-TTS [13] as pre- trained TTS models for RL finetuning

Methods 2.1. The TTS model used for RL finetuning We select CosyV oice 3.0 [6] (CV3) and F5-TTS [13] as pre- trained TTS models for RL finetuning. Zero-shot voice-cloning refers to generating speech in a target speaker’s voice using only a short prompt audio without requiring explicit speaker adapta- tion or fine-tuning. To refine zero-shot voice-cloning ...

2000
[3]

Training Dataset We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets

Experimental Setup 3.1. Training Dataset We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets. Audio files serve as prompt waveforms with transcripts as prompt text. We ran- domly shuffle the original text corpus to produce target texts for voice cloning. We construct 20k samples each for Chinese and English (40k ...

2000
[4]

Results and Discussion 4.1. Evaluation Metrics Following CV3 [6], we evaluate the effect of FlowTTS-GRPO fine-tuning using three objective metrics: • Content consistency (CER/WER): measures the intelligibil- ity of synthesized speech. We report Character Error Rate (CER) for Chinese and other non-English languages, and Word Error Rate (WER) for English. F...

2000
[5]

Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid mod- els by converting ODE trajectories to SDE paths

Conclusion We introduce FlowTTS-GRPO, the first application of Flow- GRPO to text-to-speech models. Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid mod- els by converting ODE trajectories to SDE paths. Our frame- work simplifies prior RL approaches by eliminating value net- works, preference pairs, and token-to-reward models...
[6]

The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results

Generative AI Use Disclosure We use ChatGPT (OpenAI) and Qwen3-Max (Alibaba) for En- glish grammar checking, sentence polishing in the manuscript. The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results. All authors are fully re- sponsible for the...
[7]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[8]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Liet al., “Neural codec language mod- els are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Lauragpt: Listen, attend, understand, and re- generate audio with gpt,

Z. Du, J. Wang, Q. Chen, Y . Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Maet al., “Lauragpt: Listen, attend, understand, and re- generate audio with gpt,”arXiv preprint arXiv:2310.04673, 2023

work page arXiv 2023
[10]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y . Yang, H. Hu, S. Zheng, Y . Gu, Z. Maet al., “Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,”arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z. Du, Y . Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y . Yang, C. Gao, H. Wanget al., “Cosyvoice 2: Scalable stream- ing speech synthesis with large language models,”arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shiet al., “Cosyvoice 3: Towards in-the- wild speech generation via scaling-up and post-training,”arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,

Z. Ye, X. Zhu, C.-M. Chan, X. Wang, X. Tan, J. Lei, Y . Peng, H. Liu, Y . Jin, Z. Daiet al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,”arXiv preprint arXiv:2502.04128, 2025

work page arXiv 2025
[14]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

P. Anastassiou, J. Chen, J. Chen, Y . Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gaoet al., “Seed-tts: A family of high-quality versatile speech generation models,”arXiv preprint arXiv:2406.02430, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Soundstream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,”IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, vol. 30, pp. 495–507, 2021

2021
[17]

High Fidelity Neural Audio Compression

A. D ´efossez, J. Copet, G. Synnaeve, and Y . Adi, “High fidelity neural audio compression,”arXiv preprint arXiv:2210.13438, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Recent advances in discrete speech tokens: A review,

Y . Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu, “Recent advances in discrete speech tokens: A review,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[19]

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,

Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2025, pp. 6255–6271

2025
[20]

Flow straight and fast: Learning to generate and transfer data with rectified flow,

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” inProc. ICLR. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=XVjTT1nw5z

2023
[21]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,

S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025

work page arXiv 2025
[22]

Train- ing language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Train- ing language models to follow instructions with human feedback,” Proc. NeurIPS, vol. 35, pp. 27 730–27 744, 2022

2022
[23]

Imagereward: Learning and evaluating human prefer- ences for text-to-image generation,

J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human prefer- ences for text-to-image generation,”Proc. NeurIPS, vol. 36, pp. 15 903–15 935, 2023

2023
[24]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Preference alignment improves language model-based tts,

J. Tian, C. Zhang, J. Shi, H. Zhang, J. Yu, S. Watanabe, and D. Yu, “Preference alignment improves language model-based tts,” in Proc. ICASSP. IEEE, 2025, pp. 1–5

2025
[26]

Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,

X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” inProc. ICASSP. IEEE, 2025, pp. 1–5

2025
[27]

Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,

S. S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, R. Fejgin, M. T. Desta, R. Valle, and J. Li, “Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 21 230–21 245

2025
[28]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Proc. NeurIPS, vol. 36, pp. 53 728–53 741, 2023

2023
[29]

Differentiable Reward Optimiza- tion for LLM based TTS system,

C. Gao, Z. Du, and S. Zhang, “Differentiable Reward Optimiza- tion for LLM based TTS system,” inProc. Interspeech, 2025, pp. 2450–2454

2025
[30]

Rrpo: Robust reward policy optimization for llm-based emotional tts,

C. Wang, C. Gao, Y . Xiang, Z. Du, K. An, H. Zhao, Q. Chen, X. Li, Y . Gao, and Y . Li, “Rrpo: Robust reward policy optimization for llm-based emotional tts,”arXiv preprint arXiv:2512.04552, 2025

work page arXiv 2025
[31]

Group relative policy optimization for text-to-speech with large language models,

C. Liu, Y .-J. Hu, Y .-Y . Gao, S.-L. Zhang, and Z.-H. Ling, “Group relative policy optimization for text-to-speech with large language models,”arXiv preprint arXiv:2509.18798, 2025

work page arXiv 2025
[32]

F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,

X. Sun, R. Xiao, J. Mo, B. Wu, Q. Yu, and B. Wang, “F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,”arXiv preprint arXiv:2504.02407, 2025

work page arXiv 2025
[33]

Flow-GRPO: Training Flow Matching Models via Online RL

J. Liu, G. Liu, J. Liang, Y . Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,”arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Flowse-grpo: Training flow matching speech en- hancement via online reinforcement learning,

H. Wang, B. Tian, Y . Jiang, Z. Pan, S. Zhao, B. Ma, D. Chen, and X. Li, “Flowse-grpo: Training flow matching speech en- hancement via online reinforcement learning,”arXiv preprint arXiv:2601.16483, 2026

work page arXiv 2026
[35]

HiFi-GAN: Genera- tive Adversarial Networks for Efficient and High Fi- delity Speech Synthesis,

J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Genera- tive Adversarial Networks for Efficient and High Fi- delity Speech Synthesis,” inProc. NeurIPS, 2020. [On- line]. Available: https://proceedings.neurips.cc/paper/2020/hash/ c5d736809766d46260d816d8dbc9eb44-Abstract.html

2020
[36]

V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,

H. Siuzdak, “V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” arXiv preprint arXiv:2306.00814, 2023

work page arXiv 2023
[37]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

J. Li, Y . Cui, T. Huang, Y . Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode- sde,”arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,

Y . Chen, S. Zheng, H. Wang, L. Cheng, Q. Chen, and J. Qi, “An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,” inProc. Interspeech, 2023, pp. 2228–2232

2023
[39]

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,

Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,” inInterspeech. ISCA, 2022, pp. 2063– 2067

2022
[40]

Faster whisper large v3,

Systran, “Faster whisper large v3,” 2023. [Online]. Available: https://huggingface.co/Systran/faster-whisper-large-v3

2023
[41]

Dnsmos p. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos p. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,” inProc. ICASSP. IEEE, 2022, pp. 886–890

2022
[42]

Promptrl: Prompt matters in rl for flow-based image generation,

F.-Y . Wang, H. Zhang, M. Gharbi, H. Li, and T. Park, “Promptrl: Prompt matters in rl for flow-based image generation,”arXiv preprint arXiv:2602.01382, 2026

work page arXiv 2026
[43]

Maskgct: Zero-shot text-to- speech with masked generative codec transformer,

Y . Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to- speech with masked generative codec transformer,”arXiv preprint arXiv:2409.00750, 2024

work page arXiv 2024
[44]

Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model bench- mark,

L. Ma, D. Guo, K. Song, Y . Jiang, S. Wang, L. Xue, W. Xu, H. Zhao, B. Zhang, and L. Xie, “Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model bench- mark,”arXiv preprint arXiv:2406.05763, 2024

work page arXiv 2024
[45]

Libritts: A corpus derived from librispeech for text- to-speech,

H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. Weiss, Y . Jia, Z. Chen, and Y . Wu, “Libritts: A corpus derived from librispeech for text- to-speech,” inProc. Interspeech, 2019, pp. 1526–1530

2019
[46]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022

2022
[47]

Large-scale self-supervised speech representation learning for automatic speaker verification,

Z. Chen, S. Chen, Y . Wu, Y . Qian, C. Wang, S. Liu, Y . Qian, and M. Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” inICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6147–6151

2022
[48]

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,

C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,” inProc. ICASSP. IEEE, 2021, pp. 6493–6497

2021

[1] [1]

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

Introduction Text-to-speech (TTS) converts input text into audible speech and plays a key role in human–computer interaction. Re- cently, researchers have incorporated large language models (LLMs) [1] into TTS [2–9]. One line of work uses LLMs’ ability to model discrete speech tokens with in-context learn- ing (ICL) [2, 7, 8]: a speech codec [10, 11] prod...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

The TTS model used for RL finetuning We select CosyV oice 3.0 [6] (CV3) and F5-TTS [13] as pre- trained TTS models for RL finetuning

Methods 2.1. The TTS model used for RL finetuning We select CosyV oice 3.0 [6] (CV3) and F5-TTS [13] as pre- trained TTS models for RL finetuning. Zero-shot voice-cloning refers to generating speech in a target speaker’s voice using only a short prompt audio without requiring explicit speaker adapta- tion or fine-tuning. To refine zero-shot voice-cloning ...

2000

[3] [3]

Training Dataset We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets

Experimental Setup 3.1. Training Dataset We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets. Audio files serve as prompt waveforms with transcripts as prompt text. We ran- domly shuffle the original text corpus to produce target texts for voice cloning. We construct 20k samples each for Chinese and English (40k ...

2000

[4] [4]

Results and Discussion 4.1. Evaluation Metrics Following CV3 [6], we evaluate the effect of FlowTTS-GRPO fine-tuning using three objective metrics: • Content consistency (CER/WER): measures the intelligibil- ity of synthesized speech. We report Character Error Rate (CER) for Chinese and other non-English languages, and Word Error Rate (WER) for English. F...

2000

[5] [5]

Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid mod- els by converting ODE trajectories to SDE paths

Conclusion We introduce FlowTTS-GRPO, the first application of Flow- GRPO to text-to-speech models. Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid mod- els by converting ODE trajectories to SDE paths. Our frame- work simplifies prior RL approaches by eliminating value net- works, preference pairs, and token-to-reward models...

[6] [6]

The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results

Generative AI Use Disclosure We use ChatGPT (OpenAI) and Qwen3-Max (Alibaba) for En- glish grammar checking, sentence polishing in the manuscript. The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results. All authors are fully re- sponsible for the...

[7] [7]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[8] [8]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Liet al., “Neural codec language mod- els are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Lauragpt: Listen, attend, understand, and re- generate audio with gpt,

Z. Du, J. Wang, Q. Chen, Y . Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Maet al., “Lauragpt: Listen, attend, understand, and re- generate audio with gpt,”arXiv preprint arXiv:2310.04673, 2023

work page arXiv 2023

[10] [10]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y . Yang, H. Hu, S. Zheng, Y . Gu, Z. Maet al., “Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,”arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z. Du, Y . Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y . Yang, C. Gao, H. Wanget al., “Cosyvoice 2: Scalable stream- ing speech synthesis with large language models,”arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shiet al., “Cosyvoice 3: Towards in-the- wild speech generation via scaling-up and post-training,”arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,

Z. Ye, X. Zhu, C.-M. Chan, X. Wang, X. Tan, J. Lei, Y . Peng, H. Liu, Y . Jin, Z. Daiet al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,”arXiv preprint arXiv:2502.04128, 2025

work page arXiv 2025

[14] [14]

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

P. Anastassiou, J. Chen, J. Chen, Y . Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gaoet al., “Seed-tts: A family of high-quality versatile speech generation models,”arXiv preprint arXiv:2406.02430, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Soundstream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,”IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, vol. 30, pp. 495–507, 2021

2021

[17] [17]

High Fidelity Neural Audio Compression

A. D ´efossez, J. Copet, G. Synnaeve, and Y . Adi, “High fidelity neural audio compression,”arXiv preprint arXiv:2210.13438, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Recent advances in discrete speech tokens: A review,

Y . Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu, “Recent advances in discrete speech tokens: A review,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[19] [19]

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,

Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2025, pp. 6255–6271

2025

[20] [20]

Flow straight and fast: Learning to generate and transfer data with rectified flow,

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” inProc. ICLR. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=XVjTT1nw5z

2023

[21] [21]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,

S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025

work page arXiv 2025

[22] [22]

Train- ing language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Train- ing language models to follow instructions with human feedback,” Proc. NeurIPS, vol. 35, pp. 27 730–27 744, 2022

2022

[23] [23]

Imagereward: Learning and evaluating human prefer- ences for text-to-image generation,

J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human prefer- ences for text-to-image generation,”Proc. NeurIPS, vol. 36, pp. 15 903–15 935, 2023

2023

[24] [24]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [25]

Preference alignment improves language model-based tts,

J. Tian, C. Zhang, J. Shi, H. Zhang, J. Yu, S. Watanabe, and D. Yu, “Preference alignment improves language model-based tts,” in Proc. ICASSP. IEEE, 2025, pp. 1–5

2025

[26] [26]

Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,

X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” inProc. ICASSP. IEEE, 2025, pp. 1–5

2025

[27] [27]

Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,

S. S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, R. Fejgin, M. T. Desta, R. Valle, and J. Li, “Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 21 230–21 245

2025

[28] [28]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Proc. NeurIPS, vol. 36, pp. 53 728–53 741, 2023

2023

[29] [29]

Differentiable Reward Optimiza- tion for LLM based TTS system,

C. Gao, Z. Du, and S. Zhang, “Differentiable Reward Optimiza- tion for LLM based TTS system,” inProc. Interspeech, 2025, pp. 2450–2454

2025

[30] [30]

Rrpo: Robust reward policy optimization for llm-based emotional tts,

C. Wang, C. Gao, Y . Xiang, Z. Du, K. An, H. Zhao, Q. Chen, X. Li, Y . Gao, and Y . Li, “Rrpo: Robust reward policy optimization for llm-based emotional tts,”arXiv preprint arXiv:2512.04552, 2025

work page arXiv 2025

[31] [31]

Group relative policy optimization for text-to-speech with large language models,

C. Liu, Y .-J. Hu, Y .-Y . Gao, S.-L. Zhang, and Z.-H. Ling, “Group relative policy optimization for text-to-speech with large language models,”arXiv preprint arXiv:2509.18798, 2025

work page arXiv 2025

[32] [32]

F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,

X. Sun, R. Xiao, J. Mo, B. Wu, Q. Yu, and B. Wang, “F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,”arXiv preprint arXiv:2504.02407, 2025

work page arXiv 2025

[33] [33]

Flow-GRPO: Training Flow Matching Models via Online RL

J. Liu, G. Liu, J. Liang, Y . Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,”arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Flowse-grpo: Training flow matching speech en- hancement via online reinforcement learning,

H. Wang, B. Tian, Y . Jiang, Z. Pan, S. Zhao, B. Ma, D. Chen, and X. Li, “Flowse-grpo: Training flow matching speech en- hancement via online reinforcement learning,”arXiv preprint arXiv:2601.16483, 2026

work page arXiv 2026

[35] [35]

HiFi-GAN: Genera- tive Adversarial Networks for Efficient and High Fi- delity Speech Synthesis,

J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Genera- tive Adversarial Networks for Efficient and High Fi- delity Speech Synthesis,” inProc. NeurIPS, 2020. [On- line]. Available: https://proceedings.neurips.cc/paper/2020/hash/ c5d736809766d46260d816d8dbc9eb44-Abstract.html

2020

[36] [36]

V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,

H. Siuzdak, “V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” arXiv preprint arXiv:2306.00814, 2023

work page arXiv 2023

[37] [37]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

J. Li, Y . Cui, T. Huang, Y . Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode- sde,”arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,

Y . Chen, S. Zheng, H. Wang, L. Cheng, Q. Chen, and J. Qi, “An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,” inProc. Interspeech, 2023, pp. 2228–2232

2023

[39] [39]

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,

Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,” inInterspeech. ISCA, 2022, pp. 2063– 2067

2022

[40] [40]

Faster whisper large v3,

Systran, “Faster whisper large v3,” 2023. [Online]. Available: https://huggingface.co/Systran/faster-whisper-large-v3

2023

[41] [41]

Dnsmos p. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos p. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,” inProc. ICASSP. IEEE, 2022, pp. 886–890

2022

[42] [42]

Promptrl: Prompt matters in rl for flow-based image generation,

F.-Y . Wang, H. Zhang, M. Gharbi, H. Li, and T. Park, “Promptrl: Prompt matters in rl for flow-based image generation,”arXiv preprint arXiv:2602.01382, 2026

work page arXiv 2026

[43] [43]

Maskgct: Zero-shot text-to- speech with masked generative codec transformer,

Y . Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to- speech with masked generative codec transformer,”arXiv preprint arXiv:2409.00750, 2024

work page arXiv 2024

[44] [44]

Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model bench- mark,

L. Ma, D. Guo, K. Song, Y . Jiang, S. Wang, L. Xue, W. Xu, H. Zhao, B. Zhang, and L. Xie, “Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model bench- mark,”arXiv preprint arXiv:2406.05763, 2024

work page arXiv 2024

[45] [45]

Libritts: A corpus derived from librispeech for text- to-speech,

H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. Weiss, Y . Jia, Z. Chen, and Y . Wu, “Libritts: A corpus derived from librispeech for text- to-speech,” inProc. Interspeech, 2019, pp. 1526–1530

2019

[46] [46]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022

2022

[47] [47]

Large-scale self-supervised speech representation learning for automatic speaker verification,

Z. Chen, S. Chen, Y . Wu, Y . Qian, C. Wang, S. Liu, Y . Qian, and M. Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” inICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6147–6151

2022

[48] [48]

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,

C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,” inProc. ICASSP. IEEE, 2021, pp. 6493–6497

2021