pith. machine review for the scientific record.

arxiv: 2605.09386 · v1 · submitted 2026-05-10 · 📡 eess.AS · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

Dong Yang, Haoyu Zhang, Hiroshi Saruwatari, Yiyi Cai, Yuki Saito

Pith reviewed 2026-05-12 02:59 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.LG
keywords discrete flow matching · zero-shot TTS · kinetic-optimal scheduler · moment correction · CTMC solver · speaker similarity · naturalness evaluation · codec-based synthesis

The pith

Kinetic-optimal scheduling and moment correction fix path errors in discrete flow matching for zero-shot TTS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two barriers to using metric-induced discrete flow matching in practice: schedulers that need manual tuning and small-step errors that accumulate when a first-order continuous-time Markov chain solver follows the probability path. It derives a scheduler that moves the system at constant speed under the Fisher-Rao metric without any training, and it adds a correction that tweaks jump probabilities at each finite step while leaving the final destination distribution unchanged. When these two pieces are inserted into a codec-based zero-shot TTS pipeline, the resulting system records the highest objective naturalness scores and wins subjective preference tests against other masked discrete generators; it also matches or exceeds dedicated TTS systems on speaker similarity across multiple test sets.
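
To make "constant Fisher-Rao speed" concrete, here is a worked instance for the simplest case, in our notation rather than the paper's: a per-token mask/data mixture path, for which the constant-speed requirement pins the schedule down in closed form.

```latex
% Illustrative derivation (ours, not reproduced from the paper).
% Per-token masked path: p_t = (1 - \beta_t)\,\delta_M + \beta_t\,\delta_{x_1}.
% The Fisher--Rao line element of this one-parameter (Bernoulli-like)
% family gives the path speed
\[
  \lVert \dot p_t \rVert_{\mathrm{FR}}
    = \frac{\lvert \dot\beta_t \rvert}{\sqrt{\beta_t\,(1 - \beta_t)}} .
\]
% Requiring constant speed c with boundary conditions \beta_0 = 0 and
% \beta_1 = 1 forces c = \pi and
\[
  \beta_t = \sin^2\!\left( \frac{\pi t}{2} \right),
\]
% i.e. the cosine schedule, matching the known Fisher--Rao optimality of
% the cosine schedule for masked discrete diffusion; the paper's scheduler
% generalizes this to arbitrary scalar-parameterized paths.
```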

Core claim

Metric-induced discrete flow matching can be made practical for zero-shot TTS by two training-free changes: replacing heuristic schedulers with a kinetic-optimal numerical schedule that traverses any scalar-parameterized probability path at constant Fisher-Rao speed, and inserting a finite-step moment correction that adjusts jump probabilities while exactly preserving the CTMC destination distribution. The resulting GibbsTTS model attains the best objective naturalness and preferred subjective quality among the evaluated discrete baselines, together with top-tier speaker similarity.

What carries the argument

The kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, instantiated as a training-free numerical schedule that holds the Fisher-Rao speed constant, together with the finite-step moment correction that adjusts CTMC jump probabilities while preserving the jump destination distribution.
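
A minimal sketch (ours, not the paper's implementation) of where these two pieces sit in a sampler, assuming the masked mixture path above; `model_posterior`, the token ids, and the unmasking rule are hypothetical stand-ins.

```python
# Minimal sketch of a first-order CTMC sampler for a masked discrete path
# on a constant-Fisher-Rao-speed time grid. `model_posterior` is a
# hypothetical stand-in for the trained network that returns
# p(x_1 | current tokens) of shape (seq_len, vocab).
import numpy as np

MASK = -1  # hypothetical mask sentinel outside the vocab ids 0..vocab-1

def cosine_schedule(t):
    # beta_t = sin^2(pi t / 2): the constant-Fisher-Rao-speed schedule
    # derived above, so uniform steps in t are uniform in arc length.
    return np.sin(0.5 * np.pi * t) ** 2

def sample(model_posterior, seq_len, vocab, n_steps, rng):
    x = np.full(seq_len, MASK)               # fully masked at t = 0
    ts = np.linspace(0.0, 1.0, n_steps + 1)  # the schedule does the warping
    for t, t_next in zip(ts[:-1], ts[1:]):
        b, b_next = cosine_schedule(t), cosine_schedule(t_next)
        # First-order jump probability for a still-masked token: the mass
        # the path moves in (t, t_next], conditioned on being masked at t.
        # A moment correction would rescale this scalar, not the
        # destination distribution sampled below.
        p_jump = (b_next - b) / max(1.0 - b, 1e-8)
        probs = model_posterior(x)           # (seq_len, vocab) posterior
        for i in np.where(x == MASK)[0]:
            if rng.random() < p_jump:
                x[i] = rng.choice(vocab, p=probs[i])  # jump destination
    return x
```

Usage would look like `sample(model_posterior, seq_len=256, vocab=1024, n_steps=32, rng=np.random.default_rng(0))`; the point of the constant-speed grid is that `n_steps` is the only sampling knob left.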

If this is right

  • GibbsTTS records the highest objective naturalness among the masked discrete generative baselines tested.
  • Listeners prefer GibbsTTS outputs over those baselines in direct subjective comparisons.
  • Speaker similarity reaches the highest value on three of the four evaluation sets and second place on the remaining set.
  • Both the scheduler and the correction operate without additional training or hyperparameter search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scheduler derivation could be reused for other discrete token geometries beyond speech codecs.
  • Because the correction preserves the exact destination distribution, it may be added to existing CTMC samplers with little change to their theoretical guarantees.
  • Constant-speed traversal reduces sensitivity to the number of sampling steps, potentially allowing fewer steps for real-time applications.

Load-bearing premise

Adjusting jump probabilities at each finite step can correct the first-order CTMC path-tracking error without changing the destination distribution or adding new biases to the generated speech.
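
One way to see why a jump-probability adjustment can be destination-preserving at all, written in generic CTMC-Euler notation (our reading of the premise, not the paper's derivation): a one-step kernel factors into a stay/jump mixture, and the correction only touches the mixture weight.

```latex
% Generic stay/jump factorization of a one-step CTMC kernel (ours):
\[
  P_h(y \mid z)
    = \bigl(1 - \lambda_h(z)\bigr)\,\delta_z(y)
      + \lambda_h(z)\,\pi_h(y \mid z),
\]
% where \lambda_h(z) is the probability of jumping within a step of size h
% and \pi_h(\cdot \mid z) is the law of where a jump lands. Correcting
% \lambda_h \mapsto \tilde\lambda_h changes how often the chain jumps
% (and hence which first-order path statistics the finite steps track)
% while leaving \pi_h(\cdot \mid z) untouched; that is the
% exact-preservation property this premise relies on.
```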

What would settle it

An ablation that removes only the moment correction, keeps the kinetic-optimal schedule and all other components fixed, and then shows a statistically significant drop in naturalness metrics or subjective preference scores.

Figures

Figures reproduced from arXiv: 2605.09386 by Dong Yang, Haoyu Zhang, Hiroshi Saruwatari, Yiyi Cai, Yuki Saito.

Figure 1: Architecture of the proposed model. View at source ↗
original abstract

Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher-Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth. Project page: https://ydqmkkx.github.io/GibbsTTSProject

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GibbsTTS for codec-based zero-shot TTS by deriving a kinetic-optimal, training-free scheduler for metric-induced discrete flow matching (MI-DFM) that traverses prescribed scalar-parameterized paths at constant Fisher-Rao speed, together with a finite-step moment correction that adjusts CTMC jump probabilities while exactly preserving the per-jump destination distribution. It reports that the resulting method achieves the best objective naturalness, is subjectively preferred over masked discrete baselines, and attains top speaker similarity (highest on three of four test sets) against evaluated SOTA TTS systems under controlled comparisons on a unified architecture and large-scale dataset.

Significance. If the central claims hold, the work supplies a principled, hyperparameter-free scheduler and a distribution-preserving correction for first-order CTMC solvers in discrete flow matching. These address two practical bottlenecks (heuristic scheduling and path-tracking error) and could improve controllability and quality in discrete generative models for TTS without additional training.

major comments (2)
  1. [Method (moment correction subsection)] The finite-step moment correction (described after the scheduler derivation): the claim that it adjusts jump probabilities while exactly preserving the CTMC jump destination distribution is load-bearing for the naturalness and similarity results. However, the manuscript does not show that the sequence of corrected jumps still integrates to the prescribed scalar-parameterized probability path or that the Fisher-Rao speed remains constant after correction. In a metric-induced discrete setting this leaves open the possibility that the effective measure along the path is altered, which could shift token statistics and perceptual quality even while marginal destinations are unchanged.
  2. [Experiments] Experiments section, performance tables: the strongest claims (best objective naturalness, subjective preference, and top speaker similarity) are attributed to the combination of kinetic-optimal scheduling and moment correction, yet no ablation isolates the contribution of the correction, no error bars or statistical tests are reported, and no verification (e.g., empirical path integral or speed measurement) confirms that the corrected trajectory matches the target path. This weakens the causal link between the proposed components and the reported gains.
minor comments (2)
  1. [Abstract] Abstract: the validation statement supplies no dataset names, metrics, or number of test sets, making it difficult for readers to assess the scope of the controlled comparisons.
  2. [Method] Notation: the scalar parameterization of the probability path and the precise definition of Fisher-Rao speed for the discrete metric-induced setting should be stated explicitly with an equation reference to allow independent verification of the kinetic-optimal property.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the theoretical foundations and strengthen the experimental validation of GibbsTTS. We address each major comment below and outline the revisions we will make to the manuscript.

point-by-point responses
  1. Referee: [Method (moment correction subsection)] The finite-step moment correction (described after the scheduler derivation): the claim that it adjusts jump probabilities while exactly preserving the CTMC jump destination distribution is load-bearing for the naturalness and similarity results. However, the manuscript does not show that the sequence of corrected jumps still integrates to the prescribed scalar-parameterized probability path or that the Fisher-Rao speed remains constant after correction. In a metric-induced discrete setting this leaves open the possibility that the effective measure along the path is altered, which could shift token statistics and perceptual quality even while marginal destinations are unchanged.

    Authors: We agree that the manuscript does not explicitly prove that the sequence of finite-step corrections integrates exactly to the target scalar-parameterized path or preserves constant Fisher-Rao speed throughout. The derivation shows that each individual correction exactly preserves the per-jump destination distribution under the CTMC, but the cumulative path-tracking property after repeated corrections is not formally established in the current text. We will revise the moment-correction subsection to include a short proof that, for the metric-induced setting and under the kinetic-optimal scheduler, the corrected jumps remain consistent with the prescribed path measure in the small-step limit, together with a bound on the deviation of the effective speed. We will also add an empirical verification plot of the integrated path error. revision: yes

  2. Referee: [Experiments] Experiments section, performance tables: the strongest claims (best objective naturalness, subjective preference, and top speaker similarity) are attributed to the combination of kinetic-optimal scheduling and moment correction, yet no ablation isolates the contribution of the correction, no error bars or statistical tests are reported, and no verification (e.g., empirical path integral or speed measurement) confirms that the corrected trajectory matches the target path. This weakens the causal link between the proposed components and the reported gains.

    Authors: We acknowledge that the current experiments do not isolate the moment correction via ablation, nor do they report error bars, statistical significance tests, or direct verification of path fidelity after correction. These omissions limit the strength of the causal attribution. We will add a dedicated ablation table comparing (i) the baseline MI-DFM, (ii) kinetic-optimal scheduler alone, and (iii) the full GibbsTTS with moment correction. We will also include standard deviations across seeds, paired statistical tests on the key metrics, and new figures showing empirical Fisher-Rao speed and path-integral error for the corrected trajectories. These additions will be placed in the revised Experiments section. revision: yes
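
The verification promised above is cheap to run; below is a minimal sketch (ours, not the authors' code) of the two diagnostics, integrated path error and per-step Fisher-Rao speed, under the assumption that the empirical mask fraction identifies β_t on a masked path. `traj` and `mask_id` are hypothetical interfaces.

```python
# Hypothetical diagnostics sketch (ours): compare the empirical unmasking
# trajectory of a sampler against the prescribed schedule, and check that
# the traversal speed under the Fisher-Rao metric is roughly constant.
import numpy as np

def path_diagnostics(traj, ts, schedule, mask_id=-1):
    # traj: sequence of n_steps+1 token arrays, one per solver step.
    emp = np.array([(x != mask_id).mean() for x in traj])  # empirical beta
    tgt = schedule(ts)                                     # target beta_t
    # Integrated L1 path-tracking error (left Riemann sum over the grid).
    err = float(np.sum(np.abs(emp - tgt)[:-1] * np.diff(ts)))
    # Per-step Fisher-Rao speed |d beta| / (sqrt(beta(1-beta)) dt) at step
    # midpoints; a flat profile indicates constant-speed traversal.
    mid = 0.5 * (emp[1:] + emp[:-1])
    speed = np.abs(np.diff(emp)) / (
        np.sqrt(np.clip(mid * (1.0 - mid), 1e-8, None)) * np.diff(ts))
    return err, speed
```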

Circularity Check

0 steps flagged

No circularity: scheduler derivation and moment correction are independent contributions

full rationale

The paper derives a kinetic-optimal scheduler from prescribed scalar-parameterized probability paths and introduces a finite-step moment correction that preserves CTMC jump destinations by explicit construction. Neither step reduces to a fitted parameter renamed as a prediction, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The abstract and method description present these as new, training-free numerical procedures instantiated for MI-DFM, with performance claims resting on controlled empirical comparisons rather than tautological redefinitions. No load-bearing uniqueness theorem or self-referential equation is invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are stated; the scheduler is presented as a numerical instantiation of a derived optimum and the correction as a distribution-preserving adjustment.

pith-pipeline@v0.9.0 · 5533 in / 1102 out tokens · 50572 ms · 2026-05-12T02:59:54.984860+00:00 · methodology

discussion (0)

