Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen · 2024 · eess.AS · arXiv 2406.02430

Mixed citation behavior. Most common role is dataset (33%).

73 Pith papers citing it

Dataset 33% of classified citations

abstract

We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.

citation-role summary

background 4 · dataset 4 · method 2 · baseline 1 · extension 1
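
The role counts above can be turned into percentage shares. The following is a minimal sketch, assuming the "33% of classified citations" figure is a role's count divided by the total of the listed counts (12 here); the page does not state how the figure is computed, and the names `role_counts` and `role_shares` are illustrative only.

```python
# Role counts copied from the citation-role summary on this page.
role_counts = {
    "background": 4,
    "dataset": 4,
    "method": 2,
    "baseline": 1,
    "extension": 1,
}


def role_shares(counts):
    """Return each role's share of all classified citations, in whole percent."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
for role, pct in shares.items():
    print(f"{role}: {pct}%")
```

Under this assumption, dataset and background each account for 4 of 12 classified citations, so both come out at the same share.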

citation-polarity summary

use dataset 4 · background 3 · use method 2 · baseline 1 · extend 1 · unclear 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

cs.SD · 2026-06-24 · unverdicted · novelty 7.0

Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

cs.SD · 2026-06-23 · unverdicted · novelty 7.0

ParaPairAudioBench is a new pairwise benchmark showing LALM judges lag human paralinguistic judgments by 32 percentage points with poor tie calibration across style, rate, emphasis, age, and gender.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

M*: A Modular, Extensible, Serving System for Multimodal Models

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

M* introduces the Walk Graph abstraction to serve arbitrary compositions of multimodal model components and reports latency and throughput gains over vLLM-Omni and other baselines on text-to-image, text-to-speech, and robotic planning workloads.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

Native Audio-Visual Alignment for Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 21.1% Macro-F1 and 13% accuracy gains over baselines.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

eess.AS · 2026-04-14 · unverdicted · novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

eess.AS · 2026-06-26 · unverdicted · novelty 6.0

HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

eess.AS · 2026-06-20 · unverdicted · novelty 6.0

ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.

Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption

cs.SD · 2026-06-19 · unverdicted · novelty 6.0

Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

eess.AS · 2026-06-18 · unverdicted · novelty 6.0

RTFree-F5 replaces reference transcripts with mapped self-supervised speech representations in F5-TTS, cutting WER on dysarthric speech from 24.6% to 10.4% without any transcript at inference.

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

cs.SD · 2026-06-08 · unverdicted · novelty 6.0

TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.

dots.tts Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 6.0

dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

cs.SD · 2026-06-04 · unverdicted · novelty 6.0

HybridCodec unifies SSL distillation and dual-stream design in a neural audio codec for improved semantic specialization, competitive reconstruction, and faster inference.

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

cs.SD · 2026-06-04 · unverdicted · novelty 6.0

GLASS enables composable acoustic style control in zero-shot TTS by training independent GRPO-optimized LoRA adapters on style rewards that can be linearly combined.

citing papers explorer

Showing 21 of 21 citing papers after filters.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · none · ref 1 · internal anchor

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · none · ref 2 · internal anchor

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · none · ref 2 · internal anchor

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · none · ref 1 · internal anchor

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

eess.AS · 2026-04-14 · unverdicted · none · ref 1 · internal anchor

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

eess.AS · 2026-06-26 · unverdicted · none · ref 1 · internal anchor

HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

eess.AS · 2026-06-20 · unverdicted · none · ref 9 · internal anchor

ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

eess.AS · 2026-06-18 · unverdicted · none · ref 35 · internal anchor

RTFree-F5 replaces reference transcripts with mapped self-supervised speech representations in F5-TTS, cutting WER on dysarthric speech from 24.6% to 10.4% without any transcript at inference.

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

eess.AS · 2026-05-29 · unverdicted · none · ref 3 · internal anchor

SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.

SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

eess.AS · 2026-05-16 · unverdicted · none · ref 44 · internal anchor

SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · none · ref 48 · internal anchor

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

eess.AS · 2025-07-12 · conditional · none · ref 1 · internal anchor

ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.

Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS

eess.AS · 2026-06-24 · unverdicted · none · ref 10 · 2 links · internal anchor

Introduces joint residual reweighting that disentangles speaker and joint residuals in CFG to improve speaker fidelity while preserving text accuracy in zero-shot TTS.

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

eess.AS · 2026-06-22 · unverdicted · none · ref 15 · internal anchor

FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

eess.AS · 2026-06-18 · unverdicted · none · ref 11 · internal anchor

MOS models match humans on acoustic degradation but are insensitive to prosodic errors and show a double dissociation on speaker characteristics like mean F0 bias and insensitivity to rate and F0 variability.

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

eess.AS · 2026-06-08 · unverdicted · none · ref 22 · internal anchor

FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

eess.AS · 2026-05-29 · unverdicted · none · ref 87 · internal anchor

UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

eess.AS · 2026-05-20 · unverdicted · none · ref 2 · internal anchor

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement

eess.AS · 2025-05-20 · unverdicted · none · ref 29 · internal anchor

SeamlessEdit introduces a noise-resilient zero-shot speech editing system with frequency-band-aware suppression and in-context refinement that outperforms prior methods on noisy audio.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · none · ref 80 · internal anchor

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

eess.AS · 2026-06-08 · unverdicted · none · ref 30 · internal anchor

MeanVC 2 introduces future-receptive chunking and a universal timbre token encoder to achieve lower-latency and more robust streaming zero-shot voice conversion than the original MeanVC.
