arxiv: 2509.04072 · v2 · submitted 2025-09-04 · 📡 eess.AS · cs.CL · cs.SD

Computational Narrative Understanding for Expressive Text-to-Speech

Pith reviewed 2026-05-18 19:26 UTC · model grok-4.3

classification 📡 eess.AS · cs.CL · cs.SD
keywords text-to-speech · expressive speech · audiobooks · LibriQuote · prosody · narrative understanding · flow-matching · autoregressive TTS

The pith

A dataset of 5.3K hours of audiobook character quotes labeled by speech verbs and adverbs improves expressive TTS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates LibriQuote by extracting direct speech from fictional audiobooks and attaching pseudo-labels drawn from nearby verbs and adverbs that describe delivery, such as whispered or shouted. It then demonstrates that fine-tuning a flow-matching TTS model on this data produces noticeably more expressive and intelligible output, while training an autoregressive model from scratch on the same data increases its expressiveness. A sympathetic reader would care because most current TTS systems still sound flat when rendering stories, even though large volumes of naturally expressive narration already exist in audiobooks. The work also releases a test set and shows that different synthesis architectures vary widely in how well they exploit the new cues.

Core claim

We introduce LibriQuote, a 5.3K-hour corpus of expressive speech taken from character quotations in human-narrated audiobooks, each paired with contextual pseudo-labels for speech verbs and adverbs that indicate intended prosody. Fine-tuning a flow-matching model on LibriQuote yields substantial gains in expressivity and intelligibility, while training an autoregressive TTS model from scratch on the dataset enhances its ability to produce expressive speech. Benchmarking on a held-out LibriQuote-test set reveals large differences across systems in handling expressive generation.

What carries the argument

The LibriQuote dataset: character quotations augmented with contextual pseudo-labels drawn from speech verbs and adverbs in the surrounding narrative text.

If this is right

  • Flow-matching models fine-tuned on LibriQuote become more expressive and intelligible.
  • Autoregressive TTS models gain expressiveness when trained from scratch on the dataset.
  • Different TTS architectures exhibit substantial variability when asked to generate expressive speech from the same inputs.
  • Public release of the dataset, code, and evaluation resources enables direct replication and extension by other researchers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Narrative text cues could be used at inference time to condition TTS models without retraining.
  • The same verb-adverb labeling approach might transfer to non-fiction or dialogue-heavy domains such as podcasts or plays.
  • Scaling the dataset size or adding explicit prosody predictors could further amplify the observed gains.
  • Computational narrative understanding may become a standard preprocessing step for any expressive synthesis task.

Load-bearing premise

The contextual pseudo-labels derived from speech verbs and adverbs in the source text accurately capture the prosodic intent of the original human narration.

What would settle it

A side-by-side listening test on LibriQuote-test sentences would falsify the performance claim if listeners rated the expressivity and intelligibility of outputs from LibriQuote-trained models no higher than outputs from the same architectures trained only on standard neutral corpora.

Figures

Figures reproduced from arXiv: 2509.04072 by Christophe Cerisara, Elena V. Epure, Gaspard Michel.

Figure 1: Overview of the alignment process. We match Lib…
Figure 2: t-SNE projection of emotion vector representations computed with…
Figure 1: Qualitative analysis of the data reveals that Lib…
Figure 3: Utterance duration (s) of LibriQuote-train.
  … book text. Then, a coarse alignment is produced by taking the longest chain of pairs (i_1, j_1), ..., (i_N, j_N), such that i_1 ≤ ... ≤ i_N and j_1 ≤ ... ≤ j_N. The second stage produces a final alignment by concatenating Levenshtein alignments (Levenshtein 1966) between the transcription and the text segment produced by the longest chain of pairs. We refer t…
Figure 4: Average pitch standard deviations (red squares) per…
Figure 5: t-SNE projection of accent representations of…
Figure 6: Example prompt used with Phi-4 (with self-reported confidence) to extract narrative information.
Figure 7: Screenshot of the platform used for the CMOS experiment.
Figure 8: Screenshot of the platform used for the MOS experiment.
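
The coarse alignment stage described in the text captured alongside Figure 3 amounts to finding a longest chain of pairs that is non-decreasing in both coordinates. A minimal Python sketch follows. It assumes each pair (i, j) is a candidate match between a transcription position i and a book-text position j; that reading, the function name `longest_monotone_chain`, and the quadratic dynamic program are illustrative assumptions, not the authors' implementation. The second stage (Levenshtein alignment inside the chained segments) is not shown.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # hypothetical (transcript_index, book_index) match


def longest_monotone_chain(pairs: List[Pair]) -> List[Pair]:
    """Return a longest chain with i_1 <= ... <= i_N and j_1 <= ... <= j_N."""
    if not pairs:
        return []
    # Sorting by (i, j) guarantees the i-coordinates are non-decreasing,
    # so only the j-coordinates need checking in the dynamic program.
    ordered = sorted(pairs)
    n = len(ordered)
    best = [1] * n    # best[k]: length of the longest chain ending at ordered[k]
    prev = [-1] * n   # prev[k]: predecessor index in that chain, or -1
    for k in range(n):
        for m in range(k):
            if ordered[m][1] <= ordered[k][1] and best[m] + 1 > best[k]:
                best[k] = best[m] + 1
                prev[k] = m
    # Backtrack from the end of a longest chain.
    k = max(range(n), key=best.__getitem__)
    chain: List[Pair] = []
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return chain


if __name__ == "__main__":
    candidates = [(0, 0), (1, 5), (2, 1), (3, 2), (4, 3)]
    print(longest_monotone_chain(candidates))
```

In this toy input the pair (1, 5) is dropped because keeping it would force every later match to have a book index of at least 5. The quadratic loop is chosen for readability; a patience-sorting variant would be the usual choice for long inputs.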
Original abstract

Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building from this observation, we introduce LibriQuote, a large-scale 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., "he whispered softly"). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expressive speech. We publicly release the dataset, code, and evaluation resources to facilitate reproducibility. Audio samples can be found at https://libriquote.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LibriQuote, a 5.3K-hour dataset of expressive speech extracted from character quotations in human-narrated audiobooks, augmented with contextual pseudo-labels derived from speech verbs and adverbs (e.g., 'he whispered softly'). It reports that fine-tuning a flow-matching TTS model on LibriQuote produces substantial gains in expressivity and intelligibility, while training an autoregressive TTS model from scratch on the same data enhances expressiveness. Benchmarking on a held-out LibriQuote-test set reveals significant variability across systems in generating expressive speech, and the work publicly releases the dataset, code, and evaluation resources.

Significance. If the gains prove attributable to the narrative alternation and pseudo-labels rather than data volume alone, the work would meaningfully advance expressive TTS by showing how computational narrative understanding can scalably mine prosodic cues from fictional audiobooks. The public release of LibriQuote would provide a valuable resource for reproducibility and further research on context-aware prosody modeling.

major comments (2)
  1. [Abstract] The central claim of 'substantial improvements in expressivity and intelligibility' from fine-tuning a flow-matching model (and enhanced expressiveness for the autoregressive model) is stated without any quantitative metrics, error bars, statistical tests, or details on the evaluation protocol, listener studies, or baseline comparisons. This absence leaves the empirical support for the primary result only weakly grounded.
  2. [Experimental results] The attribution of improvements to the quotation structure, speech-verb pseudo-labels, and narrative alternation lacks necessary controls. No ablation is reported that compares fine-tuning on LibriQuote against fine-tuning on an equal-duration neutral corpus (e.g., LibriSpeech) or a shuffled-label control, leaving open the possibility that observed benefits arise from data scaling rather than computational narrative understanding.
minor comments (2)
  1. [Dataset] Provide more detail on the extraction pipeline, filtering criteria, and validation that the selected segments indeed correspond to direct character speech rather than narration.
  2. [Evaluation] Clarify the exact metrics used for 'expressivity' and 'intelligibility' and whether they include both objective and subjective measures.
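
The shuffled-label control raised in major comment 2 could be set up roughly as follows. This is a sketch under stated assumptions: the record keys `quote`, `verb`, and `adverb` are hypothetical and not the released LibriQuote schema, and the function name `shuffle_pseudo_labels` is invented for illustration. The idea is to permute the (verb, adverb) pseudo-labels across utterances so that the audio and text stay the same while the labels no longer match their quotes.

```python
import random
from typing import Dict, List, Optional

# Hypothetical record layout: {"quote": ..., "verb": ..., "adverb": ...}
Example = Dict[str, Optional[str]]


def shuffle_pseudo_labels(examples: List[Example], seed: int = 0) -> List[Example]:
    """Return copies of the examples with (verb, adverb) labels permuted."""
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    labels = [(ex["verb"], ex["adverb"]) for ex in examples]
    rng.shuffle(labels)
    # Build new dicts so the original training set is left untouched.
    return [
        {**ex, "verb": verb, "adverb": adverb}
        for ex, (verb, adverb) in zip(examples, labels)
    ]


if __name__ == "__main__":
    train = [
        {"quote": "Stay back!", "verb": "shouted", "adverb": None},
        {"quote": "Come closer.", "verb": "whispered", "adverb": "softly"},
        {"quote": "I suppose so.", "verb": "said", "adverb": "wearily"},
    ]
    for ex in shuffle_pseudo_labels(train, seed=1):
        print(ex)
```

Because the label distribution is preserved, any gap between a model trained on the original labels and one trained on the shuffled labels would point to the label-quote pairing rather than to data volume.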

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript introducing LibriQuote. We address each major comment point by point below and will revise the manuscript to strengthen the presentation of results and controls.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'substantial improvements in expressivity and intelligibility' from fine-tuning a flow-matching model (and enhanced expressiveness for the autoregressive model) is stated without any quantitative metrics, error bars, statistical tests, or details on the evaluation protocol, listener studies, or baseline comparisons. This absence leaves the empirical support for the primary result only weakly grounded.

    Authors: We agree that the abstract would be strengthened by including specific quantitative support. In the revision we will add key metrics from the experimental section (e.g., expressivity and intelligibility gains with confidence intervals), a brief statement of the evaluation protocol, and reference to the baseline comparisons, while preserving conciseness. revision: yes

  2. Referee: [Experimental results] The attribution of improvements to the quotation structure, speech-verb pseudo-labels, and narrative alternation lacks necessary controls. No ablation is reported that compares fine-tuning on LibriQuote against fine-tuning on an equal-duration neutral corpus (e.g., LibriSpeech) or a shuffled-label control, leaving open the possibility that observed benefits arise from data scaling rather than computational narrative understanding.

    Authors: The referee is correct that the current experiments lack the suggested ablations. While LibriQuote is constructed specifically around quotation structure and pseudo-labels, and gains are shown relative to models trained on conventional corpora, direct controls against equal-volume neutral data or shuffled labels are absent. We will add these ablations in the revision to better isolate the contribution of narrative elements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset release and TTS fine-tuning results are self-contained

full rationale

The paper constructs LibriQuote by extracting quoted speech segments and pseudo-labels from existing audiobooks, then reports empirical improvements from fine-tuning flow-matching and autoregressive TTS models on this data versus baselines. No equations, derivations, or first-principles claims appear; performance numbers rest on standard held-out test-set evaluation rather than any self-referential fitting or parameter renaming. Any self-citations are incidental and not load-bearing for the central empirical claims, which remain falsifiable against external TTS benchmarks and do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Work is data-driven; no explicit free parameters, axioms, or invented entities are introduced beyond standard assumptions of supervised fine-tuning on labeled speech data.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

    eess.AS · 2026-04 · unverdicted · novelty 6.0

    Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

    IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4): 335–359. Cao, H.; Cooper, D. G.; Keutmann, M. K.; Gur, R. C.; Nenkova, A.; and Verma, R. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4): 377–390. Chen, Y.; Niu, Z.; Ma, Z.; Deng, K.;...

  2. [2]

    Hatzel, H

    Miami, Florida, USA: Association for Computational Linguistics. Hatzel, H. O.; and Biemann, C. 2024. Story Embeddings — Narrative-Focused Representations of Fictional Stories. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.-N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 5931–5943. Miami, Florida, USA: Asso...

  3. [3]

    In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1–8

    The singing voice conversion challenge 2023. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1–8. IEEE. Jiang, Z.; Liu, J.; Ren, Y.; He, J.; Ye, Z.; Ji, S.; Yang, Q.; Zhang, C.; Wei, P.; Wang, C.; et al. 2023a. Mega-tts 2: Boosting prompting mechanisms for zero-shot speech synthesis. arXiv preprint arXiv:2307.07218. Jiang...

  4. [4]

    Advances in neural information processing systems, 36: 14005–14034

    Voicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36: 14005–14034. Livingstone, S. R.; and Russo, F. A. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PloS one...

  5. [5]

    Finite Scalar Quantization: VQ-VAE Made Simple

    Finite Scalar Quantization: VQ-VAE Made Simple. arXiv:2309.15505. Michel, G.; Epure, E. V.; Hennequin, R.; and Cerisara, C

  6. [6]

    [TARGET]

    Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3. In Chiruzzo, L.; Ritter, A.; and Wang, L., eds., Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 742–755. Albuquerque, New Mexico: Associa...

  7. [7]

    We include all quotations that have a non-empty adverb

  8. [8]

    Using the remaining quotations, we include all utterances that have an extracted verb that falls into a predefined list of speech verbs, S

  9. [9]

    To build the list of speech verbs, S, we started by extracting a large list of 201 potential speech verbs from the web along with their descriptions

    We discard all other quotations. To build the list of speech verbs, S, we started by extracting a large list of 201 potential speech verbs from the web along with their descriptions. Then, based on verb descriptions, we discarded every verb that might indicate a neutral way of speaking. The resulting verb list contains 89 speech verbs and can be found ...

  10. [10]

    These features are then quantized using single-codebook vector quantization, using factorized codes as done in DAC (Kumar et al

    (Pasad, Shi, and Livescu 2023). These features are then quantized using single-codebook vector quantization, using factorized codes as done in DAC (Kumar et al. 2023a). Tokens are converted back to raw audio using a Convolutional Neural Network (CNN) Decoder composed of ConvNeXt blocks. The language modeling framework is used to learn to generate sema...

  11. [11]

    Ella handed the notebook to Jay, eyes uncertain

  12. [12]

    [QUOTE 1] She nodded

    Jay flipped through the sketches, pausing at one. [QUOTE 1] She nodded

  13. [13]

    Target quotation: [TARGET] Answer: { "whispered": { "id": "2", "type": "verb", "confidence": 10 }, "slowly": { "id": "2", "type": "adverb", "confidence": 10 } } Passage:

    [TARGET] whispered Ella slowly. Target quotation: [TARGET] Answer: { "whispered": { "id": "2", "type": "verb", "confidence": 10 }, "slowly": { "id": "2", "type": "adverb", "confidence": 10 } } Passage:

  14. [14]

    She went on, half laughing

  15. [15]

    Passage:

    [TARGET] Then we went to the park, and he said [QUOTE 1] Target quotation: [TARGET] Answer: { "went": { "id": "0", "type": "verb", "confidence": 9 }, "laughing": { "id": "0", "type": "verb", "confidence": 9 } } . . . Passage:

  16. [16]

    So then she started for the house, leading me by the hand, and the children tagging after

    I said I had got it on the boat. So then she started for the house, leading me by the hand, and the children tagging after. When we got there she set me down in a split-bottomed chair, and set herself down on a little low stool in front of me, holding both of my hands, and says:

  17. [17]

    Figure 7: Screenshot of the platform used for the CMOS experiment

    [QUOTE 1] Target quotation: [TARGET] Answer: Figure 6: Example prompt used with Phi-4 (with self-reported confidence) to extract narrative information. Figure 7: Screenshot of the platform used for the CMOS experiment. Figure 8: Screenshot of the platform used for the MOS experiment. Admit Announce Argue Assure Babble Bark Bawl Beg Bellow Bemoan Blabber B...
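
The selection rule quoted in entries 7 to 9 above (keep any quotation with a non-empty adverb; otherwise keep it only if its extracted verb is in the speech-verb list S; discard the rest) can be sketched in a few lines of Python. This is an illustration under assumptions: `SPEECH_VERBS` is a tiny stand-in and not the paper's 89-verb list, the function name `keep_quotation` is hypothetical, and matching on lowercased base forms is a guess, since the excerpt does not say how verbs are normalized.

```python
from typing import Optional, Set

# Illustrative stand-in for the list S; the actual 89-verb list is not
# reproduced in the excerpt above.
SPEECH_VERBS: Set[str] = {"whisper", "shout", "bellow"}


def keep_quotation(
    verb: Optional[str],
    adverb: Optional[str],
    speech_verbs: Set[str] = SPEECH_VERBS,
) -> bool:
    """Apply the two-step inclusion rule to one quotation's pseudo-labels."""
    # Step 1: any quotation with a non-empty adverb is kept.
    if adverb and adverb.strip():
        return True
    # Step 2: otherwise keep it only if the verb is a listed speech verb.
    # Assumption: verbs arrive as base forms; real data may need lemmatizing.
    return bool(verb) and verb.strip().lower() in speech_verbs


if __name__ == "__main__":
    print(keep_quotation("say", "softly"))   # kept by the adverb rule
    print(keep_quotation("whisper", None))   # kept by the speech-verb rule
    print(keep_quotation("say", None))       # discarded
```

Checking the adverb first mirrors the order in the quoted text, where the verb list is only consulted for the quotations that remain after the adverb rule.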