hub Mixed citations

Qwen3-TTS Technical Report

Mixed citation behavior. Most common role is background (50%).

53 Pith papers citing it
Background 50% of classified citations
abstract

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
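
As a rough illustration of the two tokenizer configurations named in the abstract, the sketch below derives the frame duration and the number of discrete tokens per second of speech from the stated frame rates and codebook counts. The `TokenizerSpec` class is hypothetical and not part of any released Qwen3-TTS code, and the 25 Hz rate for the first tokenizer is an assumption read off its name. Bitrate is not computed because the abstract does not give codebook sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical container for the figures quoted in the abstract."""
    name: str
    frame_rate_hz: float  # token frames emitted per second of audio
    codebooks: int        # parallel codebook layers per frame

    @property
    def frame_ms(self) -> float:
        # Duration of audio covered by one token frame, in milliseconds.
        return 1000.0 / self.frame_rate_hz

    @property
    def tokens_per_second(self) -> float:
        # Total discrete tokens per second across all codebook layers.
        return self.frame_rate_hz * self.codebooks


SPECS = [
    # Assumption: 25 Hz is inferred from the tokenizer's name.
    TokenizerSpec("Qwen-TTS-Tokenizer-25Hz", 25.0, 1),
    # 12.5 Hz and 16 layers are stated in the abstract.
    TokenizerSpec("Qwen-TTS-Tokenizer-12Hz", 12.5, 16),
]

for spec in SPECS:
    print(f"{spec.name}: {spec.frame_ms:.0f} ms/frame, "
          f"{spec.tokens_per_second:.0f} tokens/s")
```

Under these assumptions, the 12.5 Hz tokenizer covers 80 ms of audio per frame but carries 16 tokens in each frame, while the 25 Hz tokenizer covers 40 ms per frame with a single token.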

citation-role summary

background 3 · baseline 2 · method 1

citation-polarity summary

years

2026 53

verdicts

UNVERDICTED 53

representative citing papers

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

SpeechDx is a multi-task benchmark with 12 datasets and 27 tasks across health conditions, structured by conceptualization, formulation, and articulation stages, showing that no current audio encoder generalizes reliably.

An Evaluation Framework for Text-to-Speech Voice Reconstruction

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

The paper introduces a subjective-objective evaluation framework using Best Worst Scaling and a novel dual-reference distributional measure to better assess intelligibility versus speaker identity trade-offs in TTS voice reconstruction.

dots.tts Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 6.0

dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.
