hub Canonical reference

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu · 2023 · cs.CL · arXiv 2301.02111

Canonical reference. 89% of citing Pith papers cite this work as background.

91 Pith papers citing it

Background 89% of classified citations

open full Pith review browse 91 citing papers arXiv PDF

abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9

citation-polarity summary

background 8 support 1

representative citing papers

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

cs.SD · 2026-06-22 · unverdicted · novelty 8.0

Watermarking only synthetic audio leads deepfake detectors to use the watermark as a spurious shortcut, causing generalization failure, evasion by removing watermarks, and false positives on watermarked real audio.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Probing-Guided Layer Selection from Self-Supervised Speech Models for Generalizable Audio Deepfake Detection

cs.SD · 2026-06-29 · unverdicted · novelty 7.0

Probing-guided selection of depth zones from frozen SSL speech models yields compact classifiers with 28% relative EER improvement on cross-domain deepfake detection tasks.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

FlowEdit stores pronunciation corrections for flow-matching TTS as token perturbations in a Modern Hopfield Network, cutting target-word phoneme error rate by 92.7% on a 312-word multilingual benchmark while preserving general speech quality.

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

eess.AS · 2026-06-17 · accept · novelty 7.0

A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

cs.SD · 2026-05-02 · unverdicted · novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.

PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

PhySE combines VLM pre-training for fast social context profiling with a dynamic psychological agent to overcome delays and static tactics in AR-LLM social engineering attacks, tested in a 60-person user study.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

eess.AS · 2026-04-14 · unverdicted · novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

Moshi: a speech-text foundation model for real-time dialogue

eess.AS · 2024-09-17 · accept · novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

citing papers explorer

Showing 29 of 29 citing papers after filters.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling eess.AS · 2026-06-02 · unverdicted · none · ref 83 · internal anchor
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection eess.AS · 2026-06-28 · unverdicted · none · ref 12 · internal anchor
DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation eess.AS · 2026-06-22 · unverdicted · none · ref 27 · internal anchor
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine eess.AS · 2026-06-17 · accept · none · ref 10 · internal anchor
A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis eess.AS · 2026-06-08 · unverdicted · none · ref 64 · internal anchor
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech eess.AS · 2026-05-10 · unverdicted · none · ref 28 · internal anchor
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding eess.AS · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages eess.AS · 2026-04-21 · unverdicted · none · ref 86 · internal anchor
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
X-VC: Zero-shot Streaming Voice Conversion in Codec Space eess.AS · 2026-04-14 · unverdicted · none · ref 37 · internal anchor
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
Moshi: a speech-text foundation model for real-time dialogue eess.AS · 2024-09-17 · accept · none · ref 103 · internal anchor
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
DDPO-VC: Speaker De-Identification via Diffusion Denoising Policy Optimization eess.AS · 2026-06-13 · unverdicted · none · ref 66 · internal anchor
DDPO-VC applies diffusion denoising policy optimization with dual-teacher rewards to improve speaker de-identification while preserving cognitive utility on dementia speech benchmarks.
BareWave: Waveform-Native Flow-Matching Text-to-Speech eess.AS · 2026-06-08 · unverdicted · none · ref 26 · internal anchor
BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.
SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue eess.AS · 2026-05-29 · unverdicted · none · ref 44 · internal anchor
SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation eess.AS · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare eess.AS · 2026-04-19 · unverdicted · none · ref 28 · internal anchor
HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.
StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection eess.AS · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
StreamMark trains an Encoder-Distortion-Decoder network to embed semi-fragile watermarks that remain recoverable after benign audio transformations but drop to random accuracy under voice conversion and editing attacks.
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching eess.AS · 2025-07-12 · conditional · none · ref 4 · internal anchor
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
Perceptual implications of automatic anonymization in pathological speech eess.AS · 2025-05-01 · conditional · none · ref 93 · internal anchor
Listeners detect automatic anonymization in pathological speech at 91-93% accuracy with a 30-point perceived quality drop, yet clinical severity ratings stay nearly unchanged for dysarthria, dysglossia, and dysphonia.
FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech eess.AS · 2026-06-22 · unverdicted · none · ref 8 · internal anchor
FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations eess.AS · 2026-06-18 · unverdicted · none · ref 7 · internal anchor
MOS models match humans on acoustic degradation but are insensitive to prosodic errors and show a double dissociation on speaker characteristics like mean F0 bias and insensitivity to rate and F0 variability.
Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning eess.AS · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
Proxy-Anchor metric learning on Wav2Vec2-BERT embeddings with architecture merging achieves 99.76% closed-set accuracy and 2.04% FPR@95 OOD detection on MLAAD v9, doubling prior OOD accuracy on v5 splits.
FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation eess.AS · 2026-06-08 · unverdicted · none · ref 9 · internal anchor
FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.
UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion eess.AS · 2026-05-29 · unverdicted · none · ref 48 · internal anchor
UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech eess.AS · 2026-05-20 · unverdicted · none · ref 26 · internal anchor
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 70 · internal anchor
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 145 · internal anchor
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization eess.AS · 2026-06-04 · unverdicted · none · ref 10 · internal anchor
VoCodec achieves better performance than baselines at 1.1 kbps on LibriTTS by embedding voicing-driven quantization that reduces bitrate by ~27% versus uniform allocation.
MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables eess.AS · 2026-05-28 · unverdicted · none · ref 7 · internal anchor
MELD jointly optimizes a discrete latent variable encoder on mel-spectrograms with an autoregressive speech LM, claiming gains over codec and mel baselines on zero-shot TTS/STT plus fewer autoregressive artifacts.
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech eess.AS · 2026-04-28 · unverdicted · none · ref 9 · internal anchor
Authors submit a cross-lingual voice cloning system to IWSLT 2026 using OmniVoice fine-tuned on ensemble-distilled synthetic data, reporting gains in WER, CER, and speaker similarity for scientific texts in three languages.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer