hub Canonical reference

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu · 2023 · cs.CL · arXiv 2301.02111

Canonical reference. 89% of citing Pith papers cite this work as background.

91 Pith papers citing it

Background 89% of classified citations

open full Pith review browse 91 citing papers arXiv PDF

abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9

citation-polarity summary

background 8 support 1

representative citing papers

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

cs.SD · 2026-06-22 · unverdicted · novelty 8.0

Watermarking only synthetic audio leads deepfake detectors to use the watermark as a spurious shortcut, causing generalization failure, evasion by removing watermarks, and false positives on watermarked real audio.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Probing-Guided Layer Selection from Self-Supervised Speech Models for Generalizable Audio Deepfake Detection

cs.SD · 2026-06-29 · unverdicted · novelty 7.0

Probing-guided selection of depth zones from frozen SSL speech models yields compact classifiers with 28% relative EER improvement on cross-domain deepfake detection tasks.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

FlowEdit stores pronunciation corrections for flow-matching TTS as token perturbations in a Modern Hopfield Network, cutting target-word phoneme error rate by 92.7% on a 312-word multilingual benchmark while preserving general speech quality.

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

eess.AS · 2026-06-17 · accept · novelty 7.0

A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Tibetan-TTS:Low-Resource Tibetan Speech Synthesis with Large Model Adaptation

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

cs.SD · 2026-05-02 · unverdicted · novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

V.O.I.C.E is a new taxonomy that organizes synthetic voice risks into five categories and shows how they interact with exposure, visibility, and legal context using empirical incident data.

PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

PhySE combines VLM pre-training for fast social context profiling with a dynamic psychological agent to overcome delays and static tactics in AR-LLM social engineering attacks, tested in a 60-person user study.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

eess.AS · 2026-04-14 · unverdicted · novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

Moshi: a speech-text foundation model for real-time dialogue

eess.AS · 2024-09-17 · accept · novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

citing papers explorer

Showing 1 of 1 citing paper after filters.

NAC: Neural Action Codec for Vision-Language-Action Models cs.RO · 2026-06-19 · unverdicted · none · ref 13 · internal anchor
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer