Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Damien Vincent; Eugene Kharitonov; Marco Tagliasacchi; Matt Sharifi; Neil Zeghidour; Olivier Pietquin; Raphaël Marinier; Sertan Girgin; Zalán Borsos

arxiv: 2302.03540 · v1 · pith:V3PJK6TWnew · submitted 2023-02-07 · 💻 cs.SD · eess.AS

keywords data · spear-tts · tokens · acoustic · minimal · only · parallel · reading

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.
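
The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the SPEAR-TTS implementation: the function names `read`, `speak`, and `tts`, the integer token encoding, and the way the prompt influences the output are all assumptions made for illustration. The only point is the structure: text is mapped to semantic tokens ("reading"), semantic tokens are mapped to acoustic tokens ("speaking"), and an optional prompt acts as a prefix that conditions generation without any speaker label.

```python
from typing import List, Optional


def read(text: str) -> List[int]:
    """Toy 'reading' stage: text -> semantic tokens.

    Placeholder mapping (code point modulo 100); a real system would use a
    trained sequence-to-sequence model here.
    """
    return [ord(ch) % 100 for ch in text]


def speak(semantic: List[int], prompt: Optional[List[int]] = None) -> List[int]:
    """Toy 'speaking' stage: semantic tokens -> acoustic tokens.

    The prompt (hypothetical acoustic tokens from a short speech sample) is
    used as a prefix; only the newly generated continuation is returned.
    """
    prefix = list(prompt or [])
    # Hypothetical conditioning: the prompt shifts every generated token.
    offset = sum(prefix) % 7
    generated = prefix + [2 * s + offset for s in semantic]
    return generated[len(prefix):]


def tts(text: str, prompt: Optional[List[int]] = None) -> List[int]:
    """Compose the two stages: text -> semantic -> acoustic."""
    return speak(read(text), prompt)


if __name__ == "__main__":
    print(tts("hi"))                    # no prompt
    print(tts("hi", prompt=[1, 2, 3]))  # same text, conditioned on a prompt
```

Because the stages only meet at the semantic-token interface, the "speaking" function in this sketch never sees text. That is the property the abstract relies on when it says the "speaking" module can be trained on audio-only data.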

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

AudioPaLM: A Large Language Model That Can Speak and Listen
cs.CL 2023-06 unverdicted novelty 6.0

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.

SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings
cs.CL 2026-05 unverdicted novelty 4.0

SPARCLE builds speaker-aware grapheme representations by contrastively aligning characters with Wav2Vec2 acoustic embeddings conditioned on speaker identity, replacing G2P for TTS and halving WER in low-resource cases.