2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
  Block-diffusion decoder with prior-calibrated scoring and early stopping produces streaming zero-shot TTS at quality comparable to AR and NAR baselines with lower real-time factor.
- A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
  Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.