Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead
Pith reviewed 2026-06-26 11:40 UTC · model grok-4.3
The pith
S5-TTS performs streaming T5-based text-to-speech with limited lookahead while matching full-context quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S5-TTS begins generating speech immediately after receiving the first few words, using encoder-decoder language modeling and monotonic alignment learning. A lookahead-causal masking mechanism with Conv-based auxiliary attention preserves intelligibility and speaker similarity, and interleaved multi-source distillation restores naturalness. As a result, the model is reported to reach quality comparable to full-context T5-TTS and to support zero-shot synthesis with high speaker similarity.
What carries the argument
Lookahead-causal masking with Conv-based auxiliary attention, combined with interleaved multi-source distillation, maintains quality under limited future context.
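As an illustration only, the sketch below builds one plausible form of a lookahead-causal attention mask at the word level: position i may attend to every earlier word and to at most `lookahead` future words. The abstract does not specify the paper's actual mask (its granularity, or how it interacts with the Conv-based auxiliary attention), so the function name `lookahead_causal_mask` and its parameters are assumptions.

```python
from typing import List


def lookahead_causal_mask(num_words: int, lookahead: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule (not taken from the paper): position i sees all past
    positions plus at most `lookahead` future positions.
    """
    if num_words < 0 or lookahead < 0:
        raise ValueError("num_words and lookahead must be non-negative")
    return [
        [j <= i + lookahead for j in range(num_words)]
        for i in range(num_words)
    ]


if __name__ == "__main__":
    # Print the mask as rows of 1/0 flags for a 4-word input.
    for row in lookahead_causal_mask(4, 1):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this rule, `lookahead=0` gives an ordinary causal mask, and any `lookahead` of at least `num_words - 1` recovers full-context attention.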
If this is right
- S5-TTS achieves quality comparable to full-context T5-TTS.
- S5-TTS supports zero-shot synthesis with high speaker similarity.
- S5-TTS significantly reduces end-to-end latency for practical conversational AI systems.
- The model can initiate synthesis after the first few words without requiring the full input text.
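The incremental behavior in the last bullet can be sketched as a wait-k style schedule, in the spirit of the prefix-to-prefix framework from simultaneous translation: a word is synthesized once `lookahead` later words have arrived, and the remaining words are flushed when the text ends. This is a hypothetical scheduling sketch; `stream_schedule` and `lookahead` are illustrative names, not the paper's interface.

```python
from typing import Iterable, Iterator, List, Tuple


def stream_schedule(
    words: Iterable[str], lookahead: int
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word_to_synthesize, visible_text) pairs as words arrive.

    A word is released once `lookahead` later words have been received;
    after the input ends, the remaining words are flushed with whatever
    context exists.
    """
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    received: List[str] = []
    next_to_speak = 0
    for word in words:
        received.append(word)
        # Release every word that now has enough future context.
        while next_to_speak + lookahead < len(received):
            yield received[next_to_speak], list(received)
            next_to_speak += 1
    # End of text: no more lookahead will arrive, so flush the tail.
    while next_to_speak < len(received):
        yield received[next_to_speak], list(received)
        next_to_speak += 1


if __name__ == "__main__":
    for spoken, visible in stream_schedule("how are you today".split(), 1):
        print(f"speak {spoken!r} given {visible}")
```

With `lookahead=1`, the first word is released as soon as the second word arrives, which is the sense in which synthesis can start before the full input text exists.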
Where Pith is reading between the lines
- The masking and distillation techniques could be adapted to other autoregressive TTS architectures to add streaming support.
- Lower latency may enable new real-time interactive voice interfaces that were previously limited by full-context delays.
- The method might be tested with varying lookahead sizes to find the minimal context needed for acceptable quality.
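The lookahead sweep suggested in the last bullet could be summarized with a small helper like the one below. It is a sketch under stated assumptions: scores are "higher is better" (for example a MOS-like scale), `minimal_lookahead` is a hypothetical name, and any numbers passed in would have to come from real evaluations, which this page does not provide.

```python
from typing import Dict, Optional


def minimal_lookahead(
    scores: Dict[int, float], full_context_score: float, tolerance: float
) -> Optional[int]:
    """Smallest lookahead whose score is within `tolerance` of full context.

    Assumes higher scores are better. Returns None when no tested
    lookahead is close enough to the full-context score.
    """
    for lookahead in sorted(scores):
        if full_context_score - scores[lookahead] <= tolerance:
            return lookahead
    return None
```

The `tolerance` is a judgment call about what counts as acceptable quality loss, so it is left as an explicit parameter rather than a built-in default.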
Load-bearing premise
The lookahead-causal masking combined with Conv-based auxiliary attention and interleaved multi-source distillation is sufficient to preserve intelligibility, naturalness, and speaker similarity when only limited future context is available.
What would settle it
A set of subjective listening tests or objective metrics showing substantially lower naturalness or intelligibility scores for S5-TTS than for full-context T5-TTS on the same inputs would falsify the comparable-quality claim.
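On the objective side of such a comparison, intelligibility is commonly reported as word error rate (WER) between the input text and an ASR transcript of the synthesized audio. The sketch below computes WER with a standard word-level edit distance. The ASR step is out of scope, and the ASR model and text normalization used in the paper are not stated on this page, so the lowercasing and whitespace tokenization here are assumptions.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row 0: turning an empty reference into hyp[:j] takes j insertions.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Scoring both systems on the same sentences keeps the comparison paired, so per-utterance differences can be examined rather than only corpus averages.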
Original abstract
Streaming text-to-speech synthesis in cascaded LLM-TTS systems still faces latency challenges as most TTS models require full context before initiating generation. We present S5-TTS, a streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech immediately after receiving the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention that preserves intelligibility and speaker similarity, and employ interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves comparable quality to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical conversational AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces S5-TTS, a streaming variant of T5-TTS for low-latency text-to-speech synthesis. It uses encoder-decoder language modeling with monotonic alignment learning to begin speech generation after the first few words. To preserve quality under limited lookahead, it proposes a lookahead-causal masking mechanism combined with Conv-based auxiliary attention and interleaved multi-source distillation. The central claims are that S5-TTS achieves comparable intelligibility, naturalness, and speaker similarity to full-context T5-TTS, supports zero-shot synthesis, and substantially reduces end-to-end latency.
Significance. If the experimental claims hold, the work would address a key bottleneck in cascaded LLM-TTS systems by enabling practical streaming synthesis for conversational AI without requiring full text context upfront.
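A rough way to see where the latency saving would come from, under a deliberately simple model: if the LLM emits words at a steady rate, a full-context TTS must wait for the whole response before starting, whereas a streaming TTS waits only for the first `lookahead + 1` words. The function and the numbers below are illustrative assumptions, not measurements from the paper.

```python
def time_to_first_audio(
    words_needed: int, llm_words_per_sec: float, tts_startup_sec: float
) -> float:
    """Seconds until the first audio chunk under a constant-rate LLM.

    Simplifying assumptions: words arrive at a fixed rate and the TTS
    needs a fixed startup time once its required text is available.
    """
    if words_needed < 0 or llm_words_per_sec <= 0 or tts_startup_sec < 0:
        raise ValueError("invalid timing parameters")
    return words_needed / llm_words_per_sec + tts_startup_sec


if __name__ == "__main__":
    # Hypothetical numbers chosen only to make the arithmetic concrete.
    response_words, lookahead = 40, 2
    rate, startup = 20.0, 0.25
    full = time_to_first_audio(response_words, rate, startup)
    streaming = time_to_first_audio(lookahead + 1, rate, startup)
    print(f"full-context: {full:.2f}s, streaming: {streaming:.2f}s")
```

In this model the saving grows with response length, since the full-context wait scales with the number of words while the streaming wait depends only on the lookahead.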
Major comments (2)
- [Abstract] The claims that 'experiments show that S5-TTS achieves comparable quality to full-context T5-TTS' and 'significantly reduces end-to-end latency' are asserted without any quantitative metrics, baselines, ablation studies, or error analysis. This absence makes it impossible to evaluate whether the lookahead-causal masking and distillation approach actually preserves the claimed properties under limited lookahead.
- [Abstract] The weakest assumption identified (that the combination of lookahead-causal masking, Conv-based auxiliary attention, and interleaved multi-source distillation is sufficient to maintain intelligibility, naturalness, and speaker similarity) is presented as resolved by experiments, yet no supporting data, tables, or figures are referenced to substantiate this.
Simulated Author's Rebuttal
We thank the referee for the feedback on the abstract. The comments correctly identify that the abstract summarizes results without quantitative support or references to data. We address each point below and will revise accordingly.
- Referee: [Abstract] The claims that 'experiments show that S5-TTS achieves comparable quality to full-context T5-TTS' and 'significantly reduces end-to-end latency' are asserted without any quantitative metrics, baselines, ablation studies, or error analysis. This absence makes it impossible to evaluate whether the lookahead-causal masking and distillation approach actually preserves the claimed properties under limited lookahead.
  Authors: We agree the abstract would benefit from explicit quantitative support. The full manuscript contains the requested metrics, baselines, ablations, and error analysis in the Experiments section. We will revise the abstract to incorporate key numerical results (e.g., specific WER, MOS, similarity scores, and latency reductions) to substantiate the claims.
  Revision: yes
- Referee: [Abstract] The weakest assumption identified (that the combination of lookahead-causal masking, Conv-based auxiliary attention, and interleaved multi-source distillation is sufficient to maintain intelligibility, naturalness, and speaker similarity) is presented as resolved by experiments, yet no supporting data, tables, or figures are referenced to substantiate this.
  Authors: We acknowledge that the abstract does not reference supporting data or figures. The manuscript body includes the relevant tables and figures demonstrating the contribution of each component. We will revise the abstract to include summary statistics and, where feasible, references to the key results that validate the techniques.
  Revision: yes
Circularity Check
No significant circularity identified
The paper introduces an architectural variant S5-TTS of T5-TTS for streaming synthesis via lookahead-causal masking, Conv-based auxiliary attention, and interleaved multi-source distillation. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described construction. Claims rest on experimental comparisons of intelligibility, naturalness, and speaker similarity rather than reducing to inputs by definition or prior self-referential results. The derivation chain is self-contained against external benchmarks.
Reference graph
Works this paper leans on
-
[1]
Introduction. Large language models (LLMs) have become the cornerstone of generative AI, with state-of-the-art models adopting either decoder-only [1–3] or encoder-decoder architectures [4–6]. Recent advances in neural audio codecs [7,8] have enabled speech to be represented as discrete tokens, paving the way for LLM-based text-to-speech (TTS) models in...
2026
-
[2]
with Griffin-Lim [22] and shows that lookback acoustic context is crucial for natural prosody. Inspired by the prefix-to-prefix framework for simultaneous translation [23], subsequent work [24] introduces the lookahead policy into Tacotron 2 [25] with Parallel WaveGAN [26], showing that even a single future word substantially improves naturalness. Furt...
-
[3]
Proposed Method. 2.1. Model Overview. S5-TTS adopts a T5-based architecture, consisting of a parallel Transformer encoder and an autoregressive Transformer decoder. The encoder takes as input a phoneme sequence obtained via G2P conversion. At each decoding step, the decoder consumes the sum of the embeddings of the K codec tokens predicted at the prev...
Pith/arXiv arXiv 2026
-
[4]
Experiments. 3.1. Experimental Setup. Datasets. For initial training, both S5-TTS and T5-TTS are trained on the full training splits of LibriTTS [34] and Hi-FiTTS [35] speech datasets, comprising 845.04 hours of speech from 2,319 speakers. For distillation, we use the same speech datasets together with additional conversational text sampled from UltraChat-...
-
[5]
S5-TTS achieves speech quality comparable to full-context T5-TTS, supports zero-shot synthesis, and significantly reduces response latency in cascaded LLM-TTS systems
Conclusion. We presented S5-TTS, a streaming text-to-speech model based on language modeling that demonstrates low-latency, word-by-word speech synthesis under limited lookahead. S5-TTS achieves speech quality comparable to full-context T5-TTS, supports zero-shot synthesis, and significantly reduces response latency in cascaded LLM-TTS systems. Experiment...
-
[6]
Use of Generative AI Disclosure Generative AI tools were used solely for grammar checking and language polishing to improve the clarity of the manuscript
-
[7]
Llama: Open and efficient foundation language models,
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” ArXiv preprint, vol. abs/2302.13971, 2023
Pith/arXiv arXiv 2023
-
[8]
Gemma: Open models based on gemini research and technology,
G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” ArXiv preprint, vol. abs/2403.08295, 2024
Pith/arXiv arXiv 2024
-
[9]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” ArXiv preprint, vol. abs/2505.09388, 2025
Pith/arXiv arXiv 2025
-
[10]
Scaling instruction-finetuned language models,
H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024
2024
-
[11]
Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation,
B. Zhang, F. Moiseev, J. Ainslie, P. Suganthan, M. Ma, S. Bhupatiraju, F. Lebron, O. Firat, A. Joulin, and Z. Dong, “Encoder-decoder gemma: Improving the quality-efficiency trade-off via adaptation,” ArXiv preprint, vol. abs/2504.06225, 2025
arXiv 2025
-
[12]
T5gemma 2: Seeing, reading, and understanding longer,
B. Zhang, P. Suganthan, G. Liu, I. Philippov, S. Dua, B. Hora, K. Black, G. Martins, O. Sanseviero, S. Pathak et al., “T5gemma 2: Seeing, reading, and understanding longer,” ArXiv preprint, vol. abs/2512.14856, 2025
arXiv 2025
-
[13]
Soundstream: An end-to-end neural audio codec,
N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021
2021
-
[14]
High-fidelity audio compression with improved RVQGAN,
R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and...
2023
-
[15]
Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,
E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023
2023
-
[16]
Neural codec language models are zero-shot text to speech synthesizers,
S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” IEEE Transactions on Audio, Speech and Language Processing, 2025
2025
-
[17]
Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,
X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025
Pith/arXiv arXiv 2025
-
[18]
SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,
J. Ao, R. Wang, L. Zhou, C. Wang, S. Ren, Y. Wu, S. Liu, T. Ko, Q. Li, Y. Zhang, Z. Wei, Y. Qian, J. Li, and F. Wei, “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Vill...
2022
-
[19]
Improving robustness of llm-based speech synthesis by learning monotonic alignment,
P. Neekhara, S. Hussain, S. Ghosh, J. Li, and B. Ginsburg, “Improving robustness of llm-based speech synthesis by learning monotonic alignment,” in Proc. Interspeech 2024, 2024, pp. 3425–3429
2024
-
[20]
Robust and unbounded length generalization in autoregressive transformer-based text-to-speech,
E. Battenberg, R. Skerry-Ryan, D. Stanton, S. Mariooryad, M. Shannon, J. Salazar, and D. T.-H. Kao, “Robust and unbounded length generalization in autoregressive transformer-based text-to-speech,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (...
2025
-
[21]
Fastpitch: Parallel text-to-speech with pitch prediction,
A. Lancucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. IEEE, 2021, pp. 6588–6592
2021
-
[22]
Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,
J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol
2021
-
[23]
5530–5540
PMLR, 2021, pp. 5530–5540
2021
-
[24]
Mini-omni: Language models can hear, talk while thinking in streaming,
Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,” ArXiv preprint, vol. abs/2408.16725, 2024
arXiv 2024
-
[25]
Llama-omni: Seamless speech interaction with large language models,
Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng, “Llama-omni: Seamless speech interaction with large language models,” ArXiv preprint, vol. abs/2409.06666, 2024
arXiv 2024
-
[26]
Qwen2.5-omni technical report,
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-omni technical report,” ArXiv preprint, vol. abs/2503.20215, 2025
Pith/arXiv arXiv 2025
-
[27]
Neural itts: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework,
T. Yanagita, S. Sakti, and S. Nakamura, “Neural itts: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework,” in Proceedings of the 10th ISCA speech synthesis workshop, 2019, pp. 183–188
2019
-
[28]
Tacotron: Towards end-to-end speech synthesis,
Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 201...
2017
-
[29]
Signal estimation from modified short-time fourier transform,
D. Griffin and J. Lim, “Signal estimation from modified short-time fourier transform,” IEEE Transactions on acoustics, speech, and signal processing, vol. 32, no. 2, pp. 236–243, 1984
1984
-
[30]
STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,
M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li, H. Wu, and H. Wang, “STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrque...
2019
-
[31]
Incremental text-to-speech synthesis with prefix-to-prefix framework,
M. Ma, B. Zheng, K. Liu, R. Zheng, H. Liu, K. Peng, K. Church, and L. Huang, “Incremental text-to-speech synthesis with prefix-to-prefix framework,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, 2020, pp. 3886–3896
2020
-
[32]
Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,
J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April ...
2018
-
[33]
Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,
R. Yamamoto, E. Song, and J. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, 2020, pp. 6199–6203
2020
-
[34]
What the future brings: Investigating the impact of lookahead for incremental neural TTS,
B. Stephenson, L. Besacier, L. Girin, and T. Hueber, “What the future brings: Investigating the impact of lookahead for incremental neural TTS,” in Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, H. Meng, B. Xu, and T. F. Zheng, Eds. ISCA, 2020, pp. 215–219
2020
-
[35]
Speech-t: Transducer for text to speech and beyond,
J. Chen, X. Tan, Y. Leng, J. Xu, G. Wen, T. Qin, and T. Liu, “Speech-t: Transducer for text to speech and beyond,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, E...
2021
-
[36]
Instantspeech: Instant synchronous text-to-speech synthesis for llm-driven voice chatbots,
M. Du, C. Liu, and J. Lai, “Instantspeech: Instant synchronous text-to-speech synthesis for llm-driven voice chatbots,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
2025
-
[37]
Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,
J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020
2020
-
[38]
Finite scalar quantization: VQ-VAE made simple,
F. Mentzer, D. Minnen, E. Agustsson, and M. Tschannen, “Finite scalar quantization: VQ-VAE made simple,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024
2024
-
[39]
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,
A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, ser. ACM International Conference Proceeding ...
2006
-
[40]
Glow-tts: A generative flow for text-to-speech via monotonic alignment search,
J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020
2020
-
[41]
Libritts: A corpus derived from librispeech for text-to-speech,
H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 1526–1530
2019
-
[42]
Hi-fi multi-speaker english TTS dataset,
E. Bakhturina, V. Lavrukhin, B. Ginsburg, and Y. Zhang, “Hi-fi multi-speaker english TTS dataset,” in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlícek, Eds. ISCA, 2021, pp. 2776–2780
2021
-
[43]
Enhancing chat language models by scaling high-quality instructional conversations,
N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou, “Enhancing chat language models by scaling high-quality instructional conversations,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, 2023, pp. 3029–3051
2023
-
[44]
Efficient sequence transduction by jointly predicting tokens and durations,
H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, an...
2023
-
[45]
Mel codec 22khz medium,
NVIDIA, “Mel codec 22khz medium,” https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/mel_codec_22khz_medium, 2024, accessed: 2026-02-17
2024
-
[46]
Phonemizer: Text to phones transcription for multiple languages in python,
M. Bernard and H. Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,” Journal of Open Source Software, vol. 6, no. 68, p. 3958, 2021. [Online]. Available: https://doi.org/10.21105/joss.03958
-
[47]
Decoupled weight decay regularization,
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019
2019
-
[48]
Prolific. ac—a subject pool for online experiments,
S. Palan and C. Schitter, “Prolific. ac—a subject pool for online experiments,”Journal of behavioral and experimental finance, vol. 17, pp. 22–27, 2018
2018
-
[49]
Wavlm: Large-scale self-supervised pre-training for full stack speech processing,
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
2022
-
[50]
CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,
J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),” 2019, sound dataset. [Online]. Available: https://doi.org/10.7488/ds/2645
-
[51]
UTMOS: utokyo-sarulab system for voicemos challenge 2022,
T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: utokyo-sarulab system for voicemos challenge 2022,” in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022, H. Ko and J. H. L. Hansen, Eds. ISCA, 2022, pp. 4521–4525
2022
-
[52]
E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,
S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan et al., “E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 682–689
2024
-
[53]
H.-H. Guo, Y. Hu, K. Liu, F.-Y. Shen, X. Tang, Y.-C. Wu, F.-L. Xie, K. Xie, and K.-T. Xu, “Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications,” ArXiv preprint, vol. abs/2409.03283, 2024
arXiv 2024
-
[54]
Maskgct: Zero-shot text-to-speech with masked generative codec transformer,
Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” in ICLR. OpenReview.net, 2025
2025
-
[55]
Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” ArXiv preprint, vol. abs/2407.05407, 2024
Pith/arXiv arXiv 2024
-
[56]
A short-time objective intelligibility measure for time-frequency weighted noisy speech,
C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217
2010
-
[57]
Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,
A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), vol. 2, 2001, pp. 749–752 vol.2
2001
-
[58]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” ArXiv preprint, vol. abs/2407.21783, 2024
Pith/arXiv arXiv 2024
-
[59]
[Online]
Ollama, “Ollama,” 2026, accessed: 2026-02-17. [Online]. Available: https://ollama.com
2026