Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang · 2026 · cs.SD · arXiv 2601.15621

Mixed citation behavior. The most common role is background (50% of classified citations).

53 Pith papers citing it

abstract

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
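As a rough illustration of the two tokenizer configurations named in the abstract, the sketch below works out frame duration and token throughput from the stated frame rates and codebook counts. The `TokenizerSpec` class and its field names are hypothetical and not part of any Qwen3-TTS release. The 25 Hz rate is read off the tokenizer's name, and the codebook size passed to `bitrate_bps` is a placeholder assumption, since the abstract does not state it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical container for the figures quoted in the abstract."""
    frame_rate_hz: float   # frames emitted per second of audio
    num_codebooks: int     # tokens emitted per frame (one per codebook layer)

    @property
    def frame_duration_ms(self) -> float:
        # Audio duration covered by a single frame.
        return 1000.0 / self.frame_rate_hz

    @property
    def tokens_per_second(self) -> float:
        # Each frame carries one token per codebook layer.
        return self.frame_rate_hz * self.num_codebooks

    def bitrate_bps(self, codebook_size: int) -> float:
        # codebook_size is an ASSUMPTION supplied by the caller;
        # the abstract does not give this value.
        return self.tokens_per_second * math.log2(codebook_size)


# Single-codebook tokenizer; 25 Hz is assumed from the name "Tokenizer-25Hz".
tokenizer_25hz = TokenizerSpec(frame_rate_hz=25.0, num_codebooks=1)

# 12.5 Hz, 16-layer multi-codebook design, as stated in the abstract.
tokenizer_12hz = TokenizerSpec(frame_rate_hz=12.5, num_codebooks=16)

for name, spec in [("25Hz", tokenizer_25hz), ("12Hz", tokenizer_12hz)]:
    print(name, spec.frame_duration_ms, "ms/frame,",
          spec.tokens_per_second, "tokens/s")
```

Under these assumptions a 12.5 Hz frame spans 80 ms of audio, so the reported 97 ms first-packet latency is on the order of one frame plus processing overhead. This reading is an inference from the quoted numbers, not a claim made in the abstract.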

citation-role summary

background 3 · baseline 2 · method 1

citation-polarity summary

background 3 · baseline 2 · use method 1

representative citing papers

LambdaMark: Semantic Audio Watermarking for Robustness and Radioactivity

cs.SD · 2026-06-19 · unverdicted · novelty 8.0

LambdaMark is the first generic radioactive audio watermark that injects multi-bit messages into semantic latent representations, achieving robustness to distortions and removal attacks even after downstream model finetuning.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

cs.SD · 2026-06-24 · unverdicted · novelty 7.0

Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

SpeechDx is a multi-task benchmark with 12 datasets and 27 tasks across health conditions, structured by conceptualization, formulation, and articulation stages, showing that no current audio encoder generalizes reliably.

JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

JAVEdit-100k is the first large-scale dataset for instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

cs.SD · 2026-06-30 · unverdicted · novelty 6.0

UniSAE unifies speaker, emotion, and multi-granularity content editing in speech via a new discrete phonetic posteriorgram representation and diffusion-based rendering.

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

eess.AS · 2026-06-26 · unverdicted · novelty 6.0

HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.

An Evaluation Framework for Text-to-Speech Voice Reconstruction

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

The paper introduces a subjective-objective evaluation framework using Best Worst Scaling and a novel dual-reference distributional measure to better assess intelligibility versus speaker identity trade-offs in TTS voice reconstruction.

Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption

cs.SD · 2026-06-19 · unverdicted · novelty 6.0

Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

cs.SD · 2026-06-08 · unverdicted · novelty 6.0

Controllable neural TTS reveals that loudness drives human sarcasm perception while a model prioritizes speech rate.

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.

dots.tts Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 6.0

dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.
