Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

Chao-Han Huck Yang; Haoran Wang; Jinchuan Tian; Jin Sakuma; Keita Goto; Shinji Watanabe; Siddhant Arora; Takashi Maekaku; Yusuke Shinohara

arxiv: 2606.22811 · v1 · pith:IZ345MUDnew · submitted 2026-06-22 · 💻 cs.CL · cs.AI

Jinchuan Tian, Haoran Wang, Siddhant Arora, Takashi Maekaku, Keita Goto, Jin Sakuma, Yusuke Shinohara, Chao-Han Huck Yang, Shinji Watanabe

keywords: synthesis, bagpiper-tts, language, natural, speech, applications, caption, classical

Classical TTS systems typically rely on rigid input formats and predefined metadata slots, limiting their ability to fulfill flexible user requirements. This paper introduces Bagpiper-TTS, a universal speech synthesis system that handles diverse natural language user requests. Given a natural language prompt, Bagpiper-TTS first reasons over the user's intent to derive a rich caption, i.e., a comprehensive textual blueprint encompassing both the transcription and nuanced metadata. This caption then guides the synthesis of the target speech. The model inherently supports a broad spectrum of tasks beyond classical TTS applications, including multi-talker synthesis, intent-to-speech, role-play synthesis, singing voice synthesis, and more. Experimental results demonstrate that Bagpiper-TTS achieves a 1.7% Word Error Rate (WER) on the Seed-TTS-Eval benchmark and matches the performance of dedicated models in both LLM-as-a-judge and human subjective evaluations across multiple applications.
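For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the paper's evaluation script: the function name `word_error_rate` is hypothetical, and the whitespace tokenization is an assumption, since the abstract does not describe the text normalization or the speech recognizer used for Seed-TTS-Eval.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real benchmarks usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a 1.7% WER corresponds to roughly 17 word errors per 1,000 reference words.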
