Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
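The abstract's framing of TTS as conditional language modeling over discrete codec codes can be illustrated with a toy sketch. It is not the Vall-E implementation: the function names, the codebook size of 1024, and the random stand-in for the trained model are all assumptions made for illustration. It only shows the shape of the task, where text tokens and the enrolled speaker's codes form the context and new codes are produced one at a time.

```python
import random


def build_context(text_tokens, prompt_codes):
    """Concatenate text tokens with the enrolled speaker's codec codes.

    Hypothetical layout: a real system would keep separate vocabularies
    or embeddings for text and audio tokens.
    """
    return list(text_tokens) + list(prompt_codes)


def toy_next_code(context, codebook_size, rng):
    """Stand-in for a trained codec language model (assumption).

    A real model would score every code given the context; this one
    ignores the context and samples uniformly.
    """
    return rng.randrange(codebook_size)


def generate_codes(text_tokens, prompt_codes, n_new, codebook_size=1024, seed=0):
    """Autoregressively append n_new discrete codes to the context."""
    rng = random.Random(seed)
    context = build_context(text_tokens, prompt_codes)
    new_codes = []
    for _ in range(n_new):
        code = toy_next_code(context, codebook_size, rng)
        context.append(code)  # each new code conditions the next step
        new_codes.append(code)
    return new_codes


if __name__ == "__main__":
    text = [5, 17, 42]       # hypothetical phoneme/text token ids
    prompt = [100, 101, 102]  # hypothetical codes from the acoustic prompt
    print(generate_codes(text, prompt, n_new=8))
```

In an actual system the generated codes would be passed to the codec's decoder to reconstruct a waveform; that step is omitted here.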
citation-role summary
roles
background 9
representative citing papers
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Probing-guided selection of depth zones from frozen SSL speech models yields compact classifiers with 28% relative EER improvement on cross-domain deepfake detection tasks.
DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
Instruction-based vector steering redirects temporal attention in large audio-language models (LALMs) to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
CodecAttack perturbs audio in codec latent space with multi-bitrate Expectation over Transformation (EoT) to achieve an 85.5% average attack success rate (ASR) against Audio LLMs on Opus-compressed audio, versus under 26% for waveform baselines, with transfer to MP3 and AAC.
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.
PhySE combines VLM pre-training for fast social context profiling with a dynamic psychological agent to overcome delays and static tactics in AR-LLM social engineering attacks, tested in a 60-person user study.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
A code-mixing guided preference-learning method for TTS produces synthetic data that lowers mixed error rate when fine-tuning Whisper on the SEAME Mandarin-English corpus.
Self-guidance adds a lightweight feature-mapping loss to align decoder manifolds in VQ-VAE speech codecs, raising reconstruction metrics and allowing 4x codebook reduction with no fidelity loss.
UR-BERT scales multilingual TTS encoders to 495 languages via Romanization unification and speech token prediction, outperforming baselines with better generalization.
EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.
citing papers explorer
-
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
-
DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection
DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
-
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
-
X-VC: Zero-shot Streaming Voice Conversion in Codec Space
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
BareWave: Waveform-Native Flow-Matching Text-to-Speech
BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.
-
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
-
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
-
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.
-
StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection
StreamMark trains an Encoder-Distortion-Decoder network to embed semi-fragile watermarks that remain recoverable after benign audio transformations but drop to random accuracy under voice conversion and editing attacks.
-
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
-
Perceptual implications of automatic anonymization in pathological speech
Listeners detect automatic anonymization in pathological speech at 91-93% accuracy with a 30-point perceived quality drop, yet clinical severity ratings stay nearly unchanged for dysarthria, dysglossia, and dysphonia.
-
Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning
Proxy-Anchor metric learning on Wav2Vec2-BERT embeddings with architecture merging achieves 99.76% closed-set accuracy and 2.04% FPR@95 OOD detection on MLAAD v9, doubling prior OOD accuracy on v5 splits.
-
FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation
FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.
-
UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.
-
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
-
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.
-
VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization
VoCodec achieves better performance than baselines at 1.1 kbps on LibriTTS by embedding voicing-driven quantization that reduces bitrate by ~27% versus uniform allocation.
-
MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables
MELD jointly optimizes a discrete latent variable encoder on mel-spectrograms with an autoregressive speech LM, claiming gains over codec and mel baselines on zero-shot TTS/STT plus fewer autoregressive artifacts.
-
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
The authors submit a cross-lingual voice cloning system to IWSLT 2026 using OmniVoice fine-tuned on ensemble-distilled synthetic data, reporting gains in WER, CER, and speaker similarity for scientific texts in three languages.