pith. machine review for the scientific record.

arxiv: 2509.04072 · v2 · submitted 2025-09-04 · 📡 eess.AS · cs.CL · cs.SD

Recognition: unknown

Computational Narrative Understanding for Expressive Text-to-Speech

Authors on Pith: no claims yet
classification 📡 eess.AS · cs.CL · cs.SD
keywords expressive, speech, libriquote, character, found, model, text-to-speech, across
0 comments
read the original abstract

Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building on this observation, we introduce LibriQuote, a large-scale dataset of 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., "he whispered softly"). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances the expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expressive speech. We publicly release the dataset, code, and evaluation resources to facilitate reproducibility. Audio samples can be found at https://libriquote.github.io/.
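The abstract's pseudo-labeling idea — tagging each quote with the speech verb and manner adverb from the surrounding narration (e.g., "he whispered softly") — can be sketched roughly as below. This is a minimal illustration only: the verb list, function name, and matching heuristic are assumptions, not the paper's actual pipeline.

```python
import re

# Hypothetical verb inventory; the paper's label set is not specified here.
SPEECH_VERBS = {"said", "whispered", "shouted", "murmured", "cried", "replied"}

def pseudo_label(context: str):
    """Return (speech_verb, adverb) found in a narration snippet, or (None, None)."""
    tokens = re.findall(r"[a-zA-Z]+", context.lower())
    for i, tok in enumerate(tokens):
        if tok in SPEECH_VERBS:
            # Heuristic: treat an immediately following "-ly" word as the manner adverb.
            adverb = None
            if i + 1 < len(tokens) and tokens[i + 1].endswith("ly"):
                adverb = tokens[i + 1]
            return tok, adverb
    return None, None

print(pseudo_label('"Come closer," he whispered softly.'))  # ('whispered', 'softly')
```

Real pseudo-labeling would need POS tagging or dependency parsing to attach the verb to the correct quote; the point here is only the shape of the (verb, adverb) delivery label each quote carries.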

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

    eess.AS · 2026-04 · unverdicted · novelty 6.0

    Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.