PromptTTS 2: Describing and Generating Voices with Text Prompt
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
roles: background (2)
polarities: background (2)
citing papers explorer
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech
  Emo-LiPO applies listwise preference optimization to model global emotion intensity ordering in LLM TTS, yielding better accuracy and controllability than supervised or DPO baselines on a new multi-speaker dataset.
- SpeakerCard-1M: An Evidence-Grounded Corpus for In-the-Wild Speaker Verification
  SpeakerCard-1M supplies 56.7k evidence-grounded speaker cards, 1.78M captions, and new cross-modal protocols showing audio LMs lag a dual-encoder baseline on attribute-conditioned verification while joint training barely hurts standard EER.
- Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey