A Survey on Neural Speech Synthesis
Text-to-speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech from text, is an active research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey of neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders, and on several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS. We further summarize resources related to TTS (e.g., datasets and open-source implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.
Forward citations
Cited by 11 Pith papers
-
Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis
Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.
-
NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech
NüshuVoice releases the first sentence-level Nüshu TTS dataset and shows that an F0-conditioned VITS model using five-level pitch notation outperforms baselines on spectral fidelity, pitch accuracy, and intelligibility.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
-
DisSpeech: Low-Resource Controllable Mandarin Stuttered Speech Synthesis for ASR Augmentation
DisSpeech synthesizes controllable stuttered Mandarin speech via discrete tokens and stuttering event labels to augment ASR datasets, improving recognition to 4.19% CER on stuttered tasks with minimal impact on fluent speech.
-
Asymmetric Phase Coding Audio Watermarking
APC embeds compact Ed25519 signatures into audio phase data with error correction to achieve 97.5-98.3% cryptographic verification under eight attack types at mean PESQ 3.02.
-
Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative
RuASD is a comprehensive Russian speech anti-spoofing dataset featuring 37 synthesis systems and a robustness evaluation pipeline for real-world channel distortions.
-
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
A system based on OmniVoice with multi-model ensemble distillation for fine-tuning shows consistent gains in intelligibility metrics while keeping speaker similarity for cross-lingual scientific speech.
-
XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI
XR-CareerAssist fuses XR and five AI modules into a Unity-based immersive platform for multilingual, personalized career guidance via 3D avatars and dynamic Sankey diagrams, reporting 78.3% user satisfaction in a 23-p...
-
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
Authors submit a cross-lingual voice cloning system to IWSLT 2026 using OmniVoice fine-tuned on ensemble-distilled synthetic data, reporting gains in WER, CER, and speaker similarity for scientific texts in three languages.
-
Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment
Voice range indicates TTS model capability with VITS highest, Glow-TTS best at soft phonation, and CPPs of 7-8 dB marking natural quality while values over 10 dB sound robotic.