Computational Narrative Understanding for Expressive Text-to-Speech
Pith reviewed 2026-05-18 19:26 UTC · model grok-4.3
The pith
A dataset of 5.3K hours of audiobook character quotes labeled by speech verbs and adverbs improves expressive TTS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce LibriQuote, a 5.3K-hour corpus of expressive speech taken from character quotations in human-narrated audiobooks, each paired with contextual pseudo-labels for speech verbs and adverbs that indicate intended prosody. Fine-tuning a flow-matching model on LibriQuote yields substantial gains in expressivity and intelligibility, while training an autoregressive TTS model from scratch on the dataset enhances its ability to produce expressive speech. Benchmarking on a held-out LibriQuote-test set reveals large differences across systems in handling expressive generation.
What carries the argument
LibriQuote dataset of character quotations augmented with contextual pseudo-labels from speech verbs and adverbs in the surrounding narrative text.
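To make the idea of a contextual pseudo-label concrete, here is a minimal sketch of attaching a speech verb and an adverb to a quotation from the narrative text around it. It is a toy heuristic for illustration, not the paper's extraction pipeline: the `SPEECH_VERBS` set, the `QuoteRecord` type, and the "-ly" adverb rule are all assumptions introduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny verb list used only for this illustration.
SPEECH_VERBS = {"whispered", "shouted", "muttered", "cried", "snapped"}

QUOTE_RE = re.compile(r'"([^"]+)"')


@dataclass
class QuoteRecord:
    quote: str
    verb: Optional[str]    # expressive speech verb found in the context, if any
    adverb: Optional[str]  # adverb found in the context, if any


def label_quote(passage: str) -> Optional[QuoteRecord]:
    """Label the first quotation in `passage` using the text outside the quote."""
    match = QUOTE_RE.search(passage)
    if match is None:
        return None
    # Narrative context is everything outside the quotation marks.
    context = passage[:match.start()] + " " + passage[match.end():]
    words = re.findall(r"[a-z]+", context.lower())
    verb = next((w for w in words if w in SPEECH_VERBS), None)
    # Crude assumption: treat "-ly" words as adverbs (this misfires on "only", "family").
    adverb = next((w for w in words if w.endswith("ly") and len(w) > 3), None)
    return QuoteRecord(match.group(1), verb, adverb)


record = label_quote('"Stay close to me," he whispered softly.')
print(record)
```

A neutral attribution such as `'"Fine," she said.'` yields a record with both labels set to `None`, which is the kind of quotation an expressivity-focused corpus would likely want to filter out.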
If this is right
- Fine-tuned flow-matching models become more expressive and intelligible when trained on LibriQuote.
- Autoregressive TTS models gain expressiveness when trained from scratch on the dataset.
- Different TTS architectures exhibit substantial variability when asked to generate expressive speech from the same inputs.
- Public release of the dataset, code, and evaluation resources enables direct replication and extension by other researchers.
Where Pith is reading between the lines
- Narrative text cues could be used at inference time to condition TTS models without retraining.
- The same verb-adverb labeling approach might transfer to non-fiction or dialogue-heavy domains such as podcasts or plays.
- Scaling the dataset size or adding explicit prosody predictors could further amplify the observed gains.
- Computational narrative understanding may become a standard preprocessing step for any expressive synthesis task.
Load-bearing premise
The contextual pseudo-labels derived from speech verbs and adverbs in the source text accurately capture the prosodic intent of the original human narration.
What would settle it
The performance claim would be falsified by a side-by-side listening test on LibriQuote-test sentences in which listeners rate the expressivity and intelligibility of outputs from LibriQuote-trained models no higher than those of the same architectures trained only on standard neutral corpora.
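One way to analyze such a listening test is a paired sign-flip permutation test on per-sentence rating differences. The sketch below is an assumption about how the comparison could be scored, not the paper's protocol. The function name, the rating scale, and the example ratings are invented for illustration.

```python
import random
from statistics import mean


def paired_permutation_pvalue(treated, baseline, n_resamples=10_000, seed=0):
    """One-sided p-value for mean(treated - baseline) > 0 using random sign flips."""
    if len(treated) != len(baseline):
        raise ValueError("ratings must be paired sentence by sentence")
    diffs = [t - b for t, b in zip(treated, baseline)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Made-up ratings for eight test sentences (not data from the paper).
libriquote_model = [4, 4, 5, 3, 4, 5, 4, 4]
neutral_model = [3, 4, 4, 3, 3, 4, 4, 3]
gain, p_value = paired_permutation_pvalue(libriquote_model, neutral_model)
print(f"mean gain = {gain:.3f}, one-sided p = {p_value:.3f}")
```

A large p-value here would mean the test found no evidence of a gain, which is weaker than evidence of no gain. A real study would also need enough sentences and listeners to detect a difference of practical size.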
Original abstract
Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building from this observation, we introduce LibriQuote, a large-scale corpus of 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., "he whispered softly"). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expressive speech. We publicly release the dataset, code, and evaluation resources to facilitate reproducibility. Audio samples can be found at https://libriquote.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LibriQuote, a 5.3K-hour dataset of expressive speech extracted from character quotations in human-narrated audiobooks, augmented with contextual pseudo-labels derived from speech verbs and adverbs (e.g., 'he whispered softly'). It reports that fine-tuning a flow-matching TTS model on LibriQuote produces substantial gains in expressivity and intelligibility, while training an autoregressive TTS model from scratch on the same data enhances expressiveness. Benchmarking on a held-out LibriQuote-test set reveals significant variability across systems in generating expressive speech, and the work publicly releases the dataset, code, and evaluation resources.
Significance. If the gains prove attributable to the narrative alternation and pseudo-labels rather than data volume alone, the work would meaningfully advance expressive TTS by showing how computational narrative understanding can scalably mine prosodic cues from fictional audiobooks. The public release of LibriQuote would provide a valuable resource for reproducibility and further research on context-aware prosody modeling.
major comments (2)
- [Abstract] The central claim of 'substantial improvements in expressivity and intelligibility' from fine-tuning a flow-matching model (and enhanced expressiveness for the autoregressive model) is stated without any quantitative metrics, error bars, statistical tests, or details on the evaluation protocol, listener studies, or baseline comparisons. This absence leaves the empirical support for the primary result only weakly grounded.
- [Experimental results] The attribution of improvements to the quotation structure, speech-verb pseudo-labels, and narrative alternation lacks necessary controls. No ablation is reported that compares fine-tuning on LibriQuote against fine-tuning on an equal-duration neutral corpus (e.g., LibriSpeech) or a shuffled-label control, leaving open the possibility that observed benefits arise from data scaling rather than computational narrative understanding.
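The two controls named in the second major comment can be sketched as data-preparation steps. This is a hedged illustration only: the record fields (`id`, `verb`, `adverb`, `duration`) and both function names are hypothetical, and the actual LibriQuote schema may differ.

```python
import random


def shuffle_labels(records, seed=0):
    """Shuffled-label control: permute (verb, adverb) pairs across quotations."""
    rng = random.Random(seed)
    labels = [(r["verb"], r["adverb"]) for r in records]
    rng.shuffle(labels)
    # Audio and text stay in place; only the pseudo-labels move.
    return [{**r, "verb": v, "adverb": a} for r, (v, a) in zip(records, labels)]


def sample_to_duration(records, target_seconds, seed=0):
    """Equal-duration control: draw random utterances until the target is reached."""
    rng = random.Random(seed)
    pool = list(records)
    rng.shuffle(pool)
    subset, total = [], 0.0
    for r in pool:
        if total >= target_seconds:
            break
        subset.append(r)
        total += r["duration"]
    return subset, total


# Toy records, invented for this example.
records = [
    {"id": i, "verb": verb, "adverb": None, "duration": 2.0}
    for i, verb in enumerate(["whispered", "shouted", "cried", "muttered"])
]
control = shuffle_labels(records, seed=1)
subset, seconds = sample_to_duration(records, target_seconds=5.0)
print(len(control), len(subset), seconds)
```

If a model trained on the shuffled labels matches one trained on the true labels, the labels themselves are unlikely to explain the gains. If a neutral corpus sampled to the same duration matches LibriQuote, data volume is the more likely explanation.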
minor comments (2)
- [Dataset] Provide more detail on the extraction pipeline, filtering criteria, and validation that the selected segments indeed correspond to direct character speech rather than narration.
- [Evaluation] Clarify the exact metrics used for 'expressivity' and 'intelligibility' and whether they include both objective and subjective measures.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript introducing LibriQuote. We address each major comment point by point below and will revise the manuscript to strengthen the presentation of results and controls.
- Referee: [Abstract] The central claim of 'substantial improvements in expressivity and intelligibility' from fine-tuning a flow-matching model (and enhanced expressiveness for the autoregressive model) is stated without any quantitative metrics, error bars, statistical tests, or details on the evaluation protocol, listener studies, or baseline comparisons. This absence leaves the empirical support for the primary result only weakly grounded.
  Authors: We agree that the abstract would be strengthened by including specific quantitative support. In the revision we will add key metrics from the experimental section (e.g., expressivity and intelligibility gains with confidence intervals), a brief statement of the evaluation protocol, and reference to the baseline comparisons, while preserving conciseness.
  Revision: yes
- Referee: [Experimental results] The attribution of improvements to the quotation structure, speech-verb pseudo-labels, and narrative alternation lacks necessary controls. No ablation is reported that compares fine-tuning on LibriQuote against fine-tuning on an equal-duration neutral corpus (e.g., LibriSpeech) or a shuffled-label control, leaving open the possibility that observed benefits arise from data scaling rather than computational narrative understanding.
  Authors: The referee is correct that the current experiments lack the suggested ablations. While LibriQuote is constructed specifically around quotation structure and pseudo-labels, and gains are shown relative to models trained on conventional corpora, direct controls against equal-volume neutral data or shuffled labels are absent. We will add these ablations in the revision to better isolate the contribution of narrative elements.
  Revision: yes
Circularity Check
No circularity: empirical dataset release and TTS fine-tuning results are self-contained
The paper constructs LibriQuote by extracting quoted speech segments and pseudo-labels from existing audiobooks, then reports empirical improvements from fine-tuning flow-matching and autoregressive TTS models on this data versus baselines. No equations, derivations, or first-principles claims appear; performance numbers rest on standard held-out test-set evaluation rather than any self-referential fitting or parameter renaming. Any self-citations are incidental and not load-bearing for the central empirical claims, which remain falsifiable against external TTS benchmarks and do not reduce to the paper's own inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
  Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.