Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

Jason Roche; Junjie Lai; Muyang Du

arxiv: 2606.21882 · v1 · pith:VSKR2XGKnew · submitted 2026-06-20 · 💻 cs.SD · cs.AI

Pith reviewed 2026-06-26 11:40 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords streaming TTS · T5-TTS · low-latency synthesis · lookahead-causal masking · zero-shot TTS · conversational AI · incremental speech synthesis

The pith

S5-TTS performs streaming T5-based text-to-speech with limited lookahead while matching full-context quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S5-TTS as a streaming variant of T5-TTS that generates speech word by word after receiving only the first few input words. It relies on encoder-decoder language modeling and monotonic alignment learning to support immediate incremental output instead of waiting for complete text. To handle the restricted future context, the approach adds a lookahead-causal masking mechanism, Conv-based auxiliary attention, and interleaved multi-source distillation. These changes let the model keep intelligibility, naturalness, and speaker similarity close to the original full-context version while cutting end-to-end latency. A reader would care because the resulting low latency supports more responsive conversational AI applications.

Core claim

S5-TTS begins generating speech immediately after receiving the first few words through encoder-decoder language modeling and monotonic alignment learning; a lookahead-causal masking mechanism with Conv-based auxiliary attention preserves intelligibility and speaker similarity, while interleaved multi-source distillation restores naturalness, so the model achieves quality comparable to full-context T5-TTS and supports zero-shot synthesis with high speaker similarity.

What carries the argument

The argument rests on a lookahead-causal masking mechanism with Conv-based auxiliary attention, plus interleaved multi-source distillation; together they are meant to maintain quality under limited future context.
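
To make the masking idea concrete, here is a minimal sketch of what a lookahead-causal attention mask could look like. It assumes each encoder token is tagged with the index of the word it belongs to and may attend to all earlier words plus a fixed number of future words. The function name, the `word_ids` layout, and the word-level granularity are illustrative assumptions; the paper's exact mask definition is not reproduced on this page.

```python
import numpy as np

def lookahead_causal_mask(word_ids, lookahead_words):
    """Return a boolean matrix M where M[i, j] is True if token i may attend to token j.

    word_ids: word index of each token (hypothetical layout, e.g. one entry per phoneme).
    lookahead_words: how many future words each token is allowed to see (assumed parameter).
    """
    word_ids = np.asarray(word_ids)
    # Token i sees token j when j's word lies at most `lookahead_words` ahead of i's word.
    return word_ids[None, :] <= word_ids[:, None] + lookahead_words

# Three words made of 2, 1, and 2 tokens, with one word of lookahead.
mask = lookahead_causal_mask([0, 0, 1, 2, 2], lookahead_words=1)
```

With `lookahead_words=0` and one token per word, the mask reduces to the usual lower-triangular causal mask; larger values trade latency for more future context.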

If this is right

S5-TTS achieves comparable quality to full-context T5-TTS.
S5-TTS supports zero-shot synthesis with high speaker similarity.
S5-TTS significantly reduces end-to-end latency for practical conversational AI systems.
The model can initiate synthesis after the first few words without requiring the full input text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The masking and distillation techniques could be adapted to other autoregressive TTS architectures to add streaming support.
Lower latency may enable new real-time interactive voice interfaces that were previously limited by full-context delays.
The method might be tested with varying lookahead sizes to find the minimal context needed for acceptable quality.

Load-bearing premise

The lookahead-causal masking combined with Conv-based auxiliary attention and interleaved multi-source distillation is sufficient to preserve intelligibility, naturalness, and speaker similarity when only limited future context is available.

What would settle it

A set of subjective listening tests or objective metrics showing substantially lower naturalness or intelligibility scores for S5-TTS than for full-context T5-TTS on the same inputs would falsify the comparable-quality claim.
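
One objective check of this kind is word error rate (WER) between the input text and an ASR transcript of the synthesized audio. The sketch below is a plain edit-distance WER under simple assumptions (lowercasing and whitespace tokenization); the function name is hypothetical, and this page does not state which ASR system or text normalization the paper uses.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)
```

Running the same function on transcripts from S5-TTS and from full-context T5-TTS for identical inputs would give a like-for-like intelligibility comparison.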

Figures

Figures reproduced from arXiv: 2606.21882 by Jason Roche, Junjie Lai, Muyang Du.

**Figure 1.** (Left) The overall architecture of S5-TTS. (Right) Lookahead-Causal Masks and Conv-based Auxiliary Attention. Accompanying text from the paper mentions a Finite Scalar Quantization (FSQ) [31] audio codec: for each codebook, an embedding table is used, and its codec token is predicted using a dedicated linear projection head.
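
The paper's model overview describes the decoder input as the sum of the embeddings of the K codec tokens predicted at the previous step, with one embedding table and one linear projection head per codebook. A minimal NumPy sketch of that bookkeeping follows; the sizes, random weights, and function names are hypothetical, and the Transformer decoder itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
K, VOCAB, D = 4, 16, 8  # hypothetical: codebooks, tokens per codebook, model width

embed_tables = rng.normal(size=(K, VOCAB, D))  # one embedding table per codebook
head_weights = rng.normal(size=(K, D, VOCAB))  # one linear head per codebook

def decoder_input(prev_tokens):
    """Sum the K codebook embeddings of the previous frame's codec tokens."""
    return sum(embed_tables[k, token] for k, token in enumerate(prev_tokens))

def predict_tokens(hidden):
    """Apply each codebook's linear head and pick the argmax (greedy, for illustration)."""
    logits = np.einsum("d,kdv->kv", hidden, head_weights)
    return logits.argmax(axis=-1)

x = decoder_input([3, 0, 7, 15])
# Stand-in only: a real decoder would transform x before the heads are applied.
tokens = predict_tokens(x)
```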

Original abstract

Streaming text-to-speech synthesis in cascaded LLM-TTS systems still faces latency challenges as most TTS models require full context before initiating generation. We present S5-TTS, a streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech immediately after receiving the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention that preserves intelligibility and speaker similarity, and employ interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves comparable quality to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical conversational AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S5-TTS adds concrete masking and distillation steps to let T5-TTS start early, but the abstract gives no numbers to show the quality holds up.

The paper's main move is to turn T5-TTS into a streaming model called S5-TTS that begins synthesis after only the first few words. It does this with encoder-decoder language modeling plus monotonic alignment, then adds lookahead-causal masking, a Conv-based auxiliary attention path, and interleaved multi-source distillation to try to keep intelligibility and naturalness when future context is short.

Those three additions are the actual new pieces. They target a real bottleneck in cascaded LLM-TTS setups where full-sentence latency hurts conversational use. The description is clear enough that someone could implement the masking and distillation schedule from the text.

The soft spot is the evidence. The abstract states that quality stays comparable to full-context T5-TTS and that speaker similarity holds in zero-shot cases, yet it supplies no WER, MOS, or similarity scores, no baseline tables, and no ablation on the masking or distillation choices. Without those numbers it is impossible to judge whether the mechanisms actually close the gap or just mitigate it.

The work is aimed at groups already running T5-style TTS who want lower end-to-end latency for voice agents. A reader who needs a drop-in streaming variant and is willing to run their own tests could get value from the implementation details.

I would send it to peer review. The problem is practical, the proposed fixes are specific, and referees can check whether the experiments actually support the claims once the full results are in front of them.

Referee Report

2 major / 0 minor

Summary. The paper introduces S5-TTS, a streaming variant of T5-TTS for low-latency text-to-speech synthesis. It uses encoder-decoder language modeling with monotonic alignment learning to begin speech generation after the first few words. To preserve quality under limited lookahead, it proposes a lookahead-causal masking mechanism combined with Conv-based auxiliary attention and interleaved multi-source distillation. The central claims are that S5-TTS achieves comparable intelligibility, naturalness, and speaker similarity to full-context T5-TTS, supports zero-shot synthesis, and substantially reduces end-to-end latency.

Significance. If the experimental claims hold, the work would address a key bottleneck in cascaded LLM-TTS systems by enabling practical streaming synthesis for conversational AI without requiring full text context upfront.

major comments (2)

[Abstract] The claims that 'experiments show that S5-TTS achieves comparable quality to full-context T5-TTS' and 'significantly reduces end-to-end latency' are asserted without any quantitative metrics, baselines, ablation studies, or error analysis. This absence makes it impossible to evaluate whether the lookahead-causal masking and distillation approach actually preserves the claimed properties under limited lookahead.
[Abstract] The weakest assumption identified (that the combination of lookahead-causal masking, Conv-based auxiliary attention, and interleaved multi-source distillation is sufficient to maintain intelligibility, naturalness, and speaker similarity) is presented as resolved by experiments, yet no supporting data, tables, or figures are referenced to substantiate this.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the feedback on the abstract. The comments correctly identify that the abstract summarizes results without quantitative support or references to data. We address each point below and will revise accordingly.

Referee: [Abstract] The claims that 'experiments show that S5-TTS achieves comparable quality to full-context T5-TTS' and 'significantly reduces end-to-end latency' are asserted without any quantitative metrics, baselines, ablation studies, or error analysis. This absence makes it impossible to evaluate whether the lookahead-causal masking and distillation approach actually preserves the claimed properties under limited lookahead.

Authors: We agree the abstract would benefit from explicit quantitative support. The full manuscript contains the requested metrics, baselines, ablations, and error analysis in the Experiments section. We will revise the abstract to incorporate key numerical results (e.g., specific WER, MOS, similarity scores, and latency reductions) to substantiate the claims.

revision: yes

Referee: [Abstract] The weakest assumption identified (that the combination of lookahead-causal masking, Conv-based auxiliary attention, and interleaved multi-source distillation is sufficient to maintain intelligibility, naturalness, and speaker similarity) is presented as resolved by experiments, yet no supporting data, tables, or figures are referenced to substantiate this.

Authors: We acknowledge that the abstract does not reference supporting data or figures. The manuscript body includes the relevant tables and figures demonstrating the contribution of each component. We will revise the abstract to include summary statistics and, where feasible, references to the key results that validate the techniques.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces an architectural variant S5-TTS of T5-TTS for streaming synthesis via lookahead-causal masking, Conv-based auxiliary attention, and interleaved multi-source distillation. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described construction. Claims rest on experimental comparisons of intelligibility, naturalness, and speaker similarity rather than reducing to inputs by definition or prior self-referential results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, background axioms, or newly postulated entities.

Reference graph

Works this paper leans on

59 extracted references · 2 canonical work pages

[1]

Introduction. Large language models (LLMs) have become the cornerstone of generative AI, with state-of-the-art models adopting either decoder-only [1–3] or encoder-decoder architectures [4–6]. Recent advances in neural audio codecs [7,8] have enabled speech to be represented as discrete tokens, paving the way for LLM-based text-to-speech (TTS) models in...

2026
[2]

with Griffin-Lim [22] and shows that lookback acoustic context is crucial for natural prosody. Inspired by the prefix-to-prefix framework for simultaneous translation [23], subsequent work [24] introduces the lookahead policy into Tacotron 2 [25] with Parallel WaveGAN [26], showing that even a single future word substantially improves naturalness. Furt...
[3]

Model Overview. S5-TTS adopts a T5-based architecture, consisting of a parallel Transformer encoder and an autoregressive Transformer decoder

Proposed Method. 2.1. Model Overview. S5-TTS adopts a T5-based architecture, consisting of a parallel Transformer encoder and an autoregressive Transformer decoder. The encoder takes as input a phoneme sequence obtained via G2P conversion. At each decoding step, the decoder consumes the sum of the embeddings of the K codec tokens predicted at the prev...

Pith/arXiv arXiv 2026
[4]

Experiments. 3.1. Experimental Setup. Datasets. For initial training, both S5-TTS and T5-TTS are trained on the full training splits of LibriTTS [34] and Hi-FiTTS [35] speech datasets, comprising 845.04 hours of speech from 2,319 speakers. For distillation, we use the same speech datasets together with additional conversational text sampled from UltraChat-...

arXiv
[5]

S5-TTS achieves speech quality comparable to full-context T5-TTS, supports zero-shot synthesis, and significantly reduces response latency in cascaded LLM-TTS systems

Conclusion. We presented S5-TTS, a streaming text-to-speech model based on language modeling that demonstrates low-latency, word-by-word speech synthesis under limited lookahead. S5-TTS achieves speech quality comparable to full-context T5-TTS, supports zero-shot synthesis, and significantly reduces response latency in cascaded LLM-TTS systems. Experiment...
[6]

Use of Generative AI Disclosure Generative AI tools were used solely for grammar checking and language polishing to improve the clarity of the manuscript
[7]

Llama: Open and efficient foundation language models,

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” ArXiv preprint, vol. abs/2302.13971, 2023

Pith/arXiv arXiv 2023
[8]

Gemma: Open models based on gemini research and technology,

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” ArXiv preprint, vol. abs/2403.08295, 2024

Pith/arXiv arXiv 2024
[9]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” ArXiv preprint, vol. abs/2505.09388, 2025

Pith/arXiv arXiv 2025
[10]

Scaling instruction-finetuned language models,

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024

2024
[11]

Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation,

B. Zhang, F. Moiseev, J. Ainslie, P. Suganthan, M. Ma, S. Bhupatiraju, F. Lebron, O. Firat, A. Joulin, and Z. Dong, “Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation,” ArXiv preprint, vol. abs/2504.06225, 2025

arXiv 2025
[12]

T5gemma 2: Seeing, reading, and understanding longer,

B. Zhang, P. Suganthan, G. Liu, I. Philippov, S. Dua, B. Hora, K. Black, G. Martins, O. Sanseviero, S. Pathak et al., “T5gemma 2: Seeing, reading, and understanding longer,” ArXiv preprint, vol. abs/2512.14856, 2025

arXiv 2025
[13]

Soundstream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021

2021
[14]

High-fidelity audio compression with improved RVQGAN,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and...

2023
[15]

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,

E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023

2023
[16]

Neural codec language models are zero-shot text to speech synthesizers,

S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” IEEE Transactions on Audio, Speech and Language Processing, 2025

2025
[17]

Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025

Pith/arXiv arXiv 2025
[18]

SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,

J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Vill...

2022
[19]

Improving robustness of llm-based speech synthesis by learning monotonic alignment,

P. Neekhara, S. Hussain, S. Ghosh, J. Li, and B. Ginsburg, “Improving robustness of llm-based speech synthesis by learning monotonic alignment,” in Proc. Interspeech 2024, 2024, pp. 3425–3429

2024
[20]

Robust and unbounded length generalization in autoregressive transformer-based text-to-speech,

E. Battenberg, R. Skerry-Ryan, D. Stanton, S. Mariooryad, M. Shannon, J. Salazar, and D. T.-H. Kao, “Robust and unbounded length generalization in autoregressive transformer-based text-to-speech,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (...

2025
[21]

Fastpitch: Parallel text-to-speech with pitch prediction,

A. Lancucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. IEEE, 2021, pp. 6588–6592

2021
[22]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds. PMLR, 2021, pp. 5530–5540

2021
[24]

Mini-omni: Language models can hear, talk while thinking in streaming,

Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,” ArXiv preprint, vol. abs/2408.16725, 2024

arXiv 2024
[25]

Llama-omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng, “Llama-omni: Seamless speech interaction with large language models,” ArXiv preprint, vol. abs/2409.06666, 2024

arXiv 2024
[26]

Qwen2.5-omni technical report,

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-omni technical report,” ArXiv preprint, vol. abs/2503.20215, 2025

Pith/arXiv arXiv 2025
[27]

Neural itts: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework,

T. Yanagita, S. Sakti, and S. Nakamura, “Neural itts: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework,” in Proceedings of the 10th ISCA speech synthesis workshop, 2019, pp. 183–188

2019
[28]

Tacotron: Towards end-to-end speech synthesis,

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 201...

2017
[29]

Signal estimation from modified short-time fourier transform,

D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transactions on acoustics, speech, and signal processing, vol. 32, no. 2, pp. 236–243, 1984

1984
[30]

STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,

M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li, H. Wu, and H. Wang, “STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. M`arque...

2019
[31]

Incremental text-to-speech synthesis with prefix-to-prefix framework,

M. Ma, B. Zheng, K. Liu, R. Zheng, H. Liu, K. Peng, K. Church, and L. Huang, “Incremental text-to-speech synthesis with prefix-to-prefix framework,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, 2020, pp. 3886–3896

2020
[32]

Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April ...

2018
[33]

Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,

R. Yamamoto, E. Song, and J. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, 2020, pp. 6199–6203

2020
[34]

What the future brings: Investigating the impact of lookahead for incremental neural TTS,

B. Stephenson, L. Besacier, L. Girin, and T. Hueber, “What the future brings: Investigating the impact of lookahead for incremental neural TTS,” in Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, H. Meng, B. Xu, and T. F. Zheng, Eds. ISCA, 2020, pp. 215–219

2020
[35]

Speech-t: Transducer for text to speech and beyond,

J. Chen, X. Tan, Y. Leng, J. Xu, G. Wen, T. Qin, and T. Liu, “Speech-t: Transducer for text to speech and beyond,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, E...

2021
[36]

Instantspeech: Instant synchronous text-to-speech synthesis for llm-driven voice chatbots,

M. Du, C. Liu, and J. Lai, “Instantspeech: Instant synchronous text-to-speech synthesis for llm-driven voice chatbots,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025
[37]

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,

J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020

2020
[38]

Finite scalar quantization: VQ-VAE made simple,

F. Mentzer, D. Minnen, E. Agustsson, and M. Tschannen, “Finite scalar quantization: VQ-VAE made simple,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[39]

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, ser. ACM International Conference Proceeding ...

2006
[40]

Glow-tts: A generative flow for text-to-speech via monotonic alignment search,

J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020

2020
[41]

Libritts: A corpus derived from librispeech for text-to-speech,

H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 1526–1530

2019
[42]

Hi-fi multi-speaker english TTS dataset,

E. Bakhturina, V. Lavrukhin, B. Ginsburg, and Y. Zhang, “Hi-fi multi-speaker english TTS dataset,” in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, Eds. ISCA, 2021, pp. 2776–2780

2021
[43]

Enhancing chat language models by scaling high-quality instructional conversations,

N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou, “Enhancing chat language models by scaling high-quality instructional conversations,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, 2023, pp. 3029–3051

2023
[44]

Efficient sequence transduction by jointly predicting tokens and durations,

H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, an...

2023
[45]

Mel codec 22khz medium,

NVIDIA, “Mel codec 22khz medium,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/mel codec 22khz medium, 2024, accessed: 2026-02-17

2024
[46]

Phonemizer: Text to phones transcription for multiple languages in python,

M. Bernard and H. Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,”Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021. [Online]. Available: https://doi.org/10.21105/joss.03958

work page doi:10.21105/joss.03958 2021
[47]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019

2019
[48]

Prolific.ac—a subject pool for online experiments,

S. Palan and C. Schitter, “Prolific.ac—a subject pool for online experiments,” Journal of behavioral and experimental finance, vol. 17, pp. 22–27, 2018

2018
[49]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[50]

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,

J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),” 2019, sound dataset. [Online]. Available: https://doi.org/10.7488/ds/2645

work page doi:10.7488/ds/2645 2019
[51]

UTMOS: utokyo-sarulab system for voicemos challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: utokyo-sarulab system for voicemos challenge 2022,” in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022, H. Ko and J. H. L. Hansen, Eds. ISCA, 2022, pp. 4521–4525

2022
[52]

E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,

S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan et al., “E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 682–689

2024
[53]

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications,

H.-H. Guo, Y. Hu, K. Liu, F.-Y. Shen, X. Tang, Y.-C. Wu, F.-L. Xie, K. Xie, and K.-T. Xu, “Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications,” ArXiv preprint, vol. abs/2409.03283, 2024

arXiv 2024
[54]

Maskgct: Zero-shot text-to-speech with masked generative codec transformer,

Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” in ICLR. OpenReview.net, 2025

2025
[55]

Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,

Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” ArXiv preprint, vol. abs/2407.05407, 2024

Pith/arXiv arXiv 2024
[56]

A short-time objective intelligibility measure for time-frequency weighted noisy speech,

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217

2010
[57]

Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,

A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2, 2001, pp. 749–752 vol.2

2001
[58]

The llama 3 herd of models,

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,”ArXiv preprint, vol. abs/2407.21783, 2024

Pith/arXiv arXiv 2024
[59]

[Online]

Ollama, “Ollama,” 2026, accessed: 2026-02-17. [Online]. Available: https://ollama.com

2026

[1] [1]

Introduction. Large language models (LLMs) have become the cornerstone of generative AI, with state-of-the-art models adopting either decoder-only [1–3] or encoder-decoder architectures [4–6]. Recent advances in neural audio codecs [7, 8] have enabled speech to be represented as discrete tokens, paving the way for LLM-based text-to-speech (TTS) models in...

2026

[2] [2]

with Griffin-Lim [22] and shows that lookback acoustic context is crucial for natural prosody. Inspired by the prefix-to-prefix framework for simultaneous translation [23], subsequent work [24] introduces the lookahead policy into Tacotron 2 [25] with Parallel WaveGAN [26], showing that even a single future word substantially improves naturalness. Furt...

[3] [3]

Model Overview. S5-TTS adopts a T5-based architecture, consisting of a parallel Transformer encoder and an autoregressive Transformer decoder

Proposed Method. 2.1. Model Overview. S5-TTS adopts a T5-based architecture, consisting of a parallel Transformer encoder and an autoregressive Transformer decoder. The encoder takes as input a phoneme sequence obtained via G2P conversion. At each decoding step, the decoder consumes the sum of the embeddings of the K codec tokens predicted at the prev...

arXiv 2026

[4] [4]

Experiments. 3.1. Experimental Setup. Datasets. For initial training, both S5-TTS and T5-TTS are trained on the full training splits of LibriTTS [34] and HiFiTTS [35] speech datasets, comprising 845.04 hours of speech from 2,319 speakers. For distillation, we use the same speech datasets together with additional conversational text sampled from UltraChat-...

arXiv

[5] [5]

S5-TTS achieves speech quality comparable to full-context T5-TTS, supports zero-shot synthesis, and significantly reduces response latency in cascaded LLM-TTS systems

Conclusion. We presented S5-TTS, a streaming text-to-speech model based on language modeling that demonstrates low-latency, word-by-word speech synthesis under limited lookahead. S5-TTS achieves speech quality comparable to full-context T5-TTS, supports zero-shot synthesis, and significantly reduces response latency in cascaded LLM-TTS systems. Experiment...

[6] [6]

Use of Generative AI Disclosure Generative AI tools were used solely for grammar checking and language polishing to improve the clarity of the manuscript

[7] [7]

Llama: Open and efficient foundation language models,

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” ArXiv preprint, vol. abs/2302.13971, 2023

arXiv 2023

[8] [8]

Gemma: Open models based on gemini research and technology,

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” ArXiv preprint, vol. abs/2403.08295, 2024

arXiv 2024

[9] [9]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” ArXiv preprint, vol. abs/2505.09388, 2025

arXiv 2025

[10] [10]

Scaling instruction-finetuned language models,

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024

2024

[11] [11]

Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation,

B. Zhang, F. Moiseev, J. Ainslie, P. Suganthan, M. Ma, S. Bhupatiraju, F. Lebron, O. Firat, A. Joulin, and Z. Dong, “Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation,” ArXiv preprint, vol. abs/2504.06225, 2025

arXiv 2025

[12] [12]

T5gemma 2: Seeing, reading, and understanding longer,

B. Zhang, P. Suganthan, G. Liu, I. Philippov, S. Dua, B. Hora, K. Black, G. Martins, O. Sanseviero, S. Pathak et al., “T5gemma 2: Seeing, reading, and understanding longer,” ArXiv preprint, vol. abs/2512.14856, 2025

arXiv 2025

[13] [13]

Soundstream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021

2021

[14] [14]

High-fidelity audio compression with improved RVQGAN,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and...

2023

[15] [15]

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,

E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023

2023

[16] [16]

Neural codec language models are zero-shot text to speech synthesizers,

S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” IEEE Transactions on Audio, Speech and Language Processing, 2025

2025

[17] [17]

Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025

arXiv 2025

[18] [18]

SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,

J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Vill...

2022

[19] [19]

Improving robustness of llm-based speech synthesis by learning monotonic alignment,

P. Neekhara, S. Hussain, S. Ghosh, J. Li, and B. Ginsburg, “Improving robustness of llm-based speech synthesis by learning monotonic alignment,” in Proc. Interspeech 2024, 2024, pp. 3425–3429

2024

[20] [20]

Robust and unbounded length generalization in autoregressive transformer-based text-to-speech,

E. Battenberg, R. Skerry-Ryan, D. Stanton, S. Mariooryad, M. Shannon, J. Salazar, and D. T.-H. Kao, “Robust and unbounded length generalization in autoregressive transformer-based text-to-speech,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (...

2025

[21] [21]

Fastpitch: Parallel text-to-speech with pitch prediction,

A. Lancucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. IEEE, 2021, pp. 6588–6592

2021

[22] [22]

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol

2021

[23] [23]

5530–5540

PMLR, 2021, pp. 5530–5540

2021

[24] [24]

Mini-omni: Language models can hear, talk while thinking in streaming,

Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,” ArXiv preprint, vol. abs/2408.16725, 2024

arXiv 2024

[25] [25]

Llama-omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng, “Llama-omni: Seamless speech interaction with large language models,” ArXiv preprint, vol. abs/2409.06666, 2024

arXiv 2024

[26] [26]

Qwen2.5-omni technical report,

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-omni technical report,” ArXiv preprint, vol. abs/2503.20215, 2025

arXiv 2025

[27] [27]

Neural itts: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework,

T. Yanagita, S. Sakti, and S. Nakamura, “Neural itts: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework,” in Proceedings of the 10th ISCA speech synthesis workshop, 2019, pp. 183–188

2019

[28] [28]

Tacotron: Towards end-to-end speech synthesis,

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 201...

2017

[29] [29]

Signal estimation from modified short-time fourier transform,

D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transactions on acoustics, speech, and signal processing, vol. 32, no. 2, pp. 236–243, 1984

1984

[30] [30]

STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,

M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li, H. Wu, and H. Wang, “STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrque...

2019

[31] [31]

Incremental text-to-speech synthesis with prefix-to-prefix framework,

M. Ma, B. Zheng, K. Liu, R. Zheng, H. Liu, K. Peng, K. Church, and L. Huang, “Incremental text-to-speech synthesis with prefix-to-prefix framework,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, 2020, pp. 3886–3896

2020

[32] [32]

Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April ...

2018

[33] [33]

Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,

R. Yamamoto, E. Song, and J. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, 2020, pp. 6199–6203

2020

[34] [34]

What the future brings: Investigating the impact of lookahead for incremental neural TTS,

B. Stephenson, L. Besacier, L. Girin, and T. Hueber, “What the future brings: Investigating the impact of lookahead for incremental neural TTS,” in Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, H. Meng, B. Xu, and T. F. Zheng, Eds. ISCA, 2020, pp. 215–219

2020

[35] [35]

Speech-t: Transducer for text to speech and beyond,

J. Chen, X. Tan, Y. Leng, J. Xu, G. Wen, T. Qin, and T. Liu, “Speech-t: Transducer for text to speech and beyond,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, E...

2021

[36] [36]

Instantspeech: Instant synchronous text-to-speech synthesis for llm-driven voice chatbots,

M. Du, C. Liu, and J. Lai, “Instantspeech: Instant synchronous text-to-speech synthesis for llm-driven voice chatbots,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025

[37] [37]

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,

J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020

2020

[38] [38]

Finite scalar quantization: VQ-VAE made simple,

F. Mentzer, D. Minnen, E. Agustsson, and M. Tschannen, “Finite scalar quantization: VQ-VAE made simple,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024

[39] [39]

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, ser. ACM International Conference Proceeding ...

2006

[40] [40]

Glow-tts: A generative flow for text-to-speech via monotonic alignment search,

J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020

2020

[41] [41]

Libritts: A corpus derived from librispeech for text-to-speech,

H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 1526–1530

2019

[42] [42]

Hi-fi multi-speaker english TTS dataset,

E. Bakhturina, V. Lavrukhin, B. Ginsburg, and Y. Zhang, “Hi-fi multi-speaker english TTS dataset,” in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, Eds. ISCA, 2021, pp. 2776–2780

2021

[43] [43]

Enhancing chat language models by scaling high-quality instructional conversations,

N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou, “Enhancing chat language models by scaling high-quality instructional conversations,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, 2023, pp. 3029–3051

2023

[44] [44]

Efficient sequence transduction by jointly predicting tokens and durations,

H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, an...

2023

[45] [45]

Mel codec 22khz medium,

NVIDIA, “Mel codec 22khz medium,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/mel_codec_22khz_medium, 2024, accessed: 2026-02-17

2024

[46] [46]

Phonemizer: Text to phones transcription for multiple languages in python,

M. Bernard and H. Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,” Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021. [Online]. Available: https://doi.org/10.21105/joss.03958

doi:10.21105/joss.03958 2021

[47] [47]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019

2019

[48] [48]

Prolific.ac—a subject pool for online experiments,

S. Palan and C. Schitter, “Prolific.ac—a subject pool for online experiments,” Journal of behavioral and experimental finance, vol. 17, pp. 22–27, 2018

2018

[49] [49]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022

[50] [50]

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,

J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),” 2019, sound dataset. [Online]. Available: https://doi.org/10.7488/ds/2645

doi:10.7488/ds/2645 2019

[51] [51]

UTMOS: utokyo-sarulab system for voicemos challenge 2022,

T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: utokyo-sarulab system for voicemos challenge 2022,” in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022, H. Ko and J. H. L. Hansen, Eds. ISCA, 2022, pp. 4521–4525

2022

[52] [52]

E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,

S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan et al., “E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 682–689

2024

[53] [53]

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications,

H.-H. Guo, Y. Hu, K. Liu, F.-Y. Shen, X. Tang, Y.-C. Wu, F.-L. Xie, K. Xie, and K.-T. Xu, “Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications,” ArXiv preprint, vol. abs/2409.03283, 2024

arXiv 2024

[54] [54]

Maskgct: Zero-shot text-to-speech with masked generative codec transformer,

Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” in ICLR. OpenReview.net, 2025

2025

[55] [55]

Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,

Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” ArXiv preprint, vol. abs/2407.05407, 2024

arXiv 2024

[56] [56]

A short-time objective intelligibility measure for time-frequency weighted noisy speech,

C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217

2010

[57] [57]

Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,

A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2, 2001, pp. 749–752 vol.2

2001

[58] [58]

The llama 3 herd of models,

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” ArXiv preprint, vol. abs/2407.21783, 2024

arXiv 2024

[59] [59]

[Online]

Ollama, “Ollama,” 2026, accessed: 2026-02-17. [Online]. Available: https://ollama.com

2026