Audio-Based Understanding of Audiobook Narration Appeal

Emmanouil Benetos; Mariano Beguerisse-Díaz; Shahar Elisha

arxiv: 2607.02473 · v1 · pith:NXQXYHNNnew · submitted 2026-07-02 · 💻 cs.CL · cs.SD · eess.AS

Shahar Elisha, Mariano Beguerisse-Díaz, Emmanouil Benetos

Pith reviewed 2026-07-03 14:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS

keywords audiobook narration · acoustic features · appeal prediction · view-rate metrics · LibriVox · narrator casting · genre effects · consumption data

The pith

Acoustic features from audiobook narration link to listener appeal even after controlling for title effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vocal qualities such as tone, pace, and loudness shape how appealing listeners find an audiobook. Researchers extract these features from LibriVox recordings with pre-trained audio models and compare them to consumption signals like view-rate while holding genre and title constant. They report that the acoustic signals retain a clear association with appeal measures. The same pattern appears when checked against proprietary engagement data. The work frames this as the first systematic computational link between narration acoustics, title, genre, and consumption.

Core claim

Acoustic information alone has a robust association with appeal, even after accounting for title effects, as shown by vocal and acoustic features extracted via pre-trained models from LibriVox and tested against view-rate plus proprietary engagement metrics.

What carries the argument

Extraction of vocal and acoustic features (tone, pace, loudness) via pre-trained audio models, correlated against view-rate and engagement metrics while controlling for title and genre.
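
As a rough illustration of what "loudness" and "pace" could look like as numbers, the sketch below computes an RMS level in decibels and a syllables-per-second rate with NumPy. It is not the paper's pipeline, which relies on pre-trained audio models; the function names, the synthetic sine-wave input, and the syllable count are all hypothetical stand-ins.

```python
import numpy as np


def rms_loudness_db(signal, eps=1e-12):
    """RMS level in dB relative to an amplitude of 1.0 (a crude loudness proxy)."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20.0 * np.log10(rms + eps)


def speaking_rate(n_syllables, duration_s):
    """Syllables per second (a crude pace proxy)."""
    return n_syllables / duration_s


# Hypothetical input: one second of a 220 Hz sine at amplitude 0.5, 16 kHz sampling.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)

loudness_db = rms_loudness_db(signal)  # RMS of a sine is amplitude / sqrt(2)
pace = speaking_rate(n_syllables=12, duration_s=3.0)  # assumed syllable count

print(f"loudness: {loudness_db:.2f} dB, pace: {pace:.1f} syllables/s")
```

Features of this kind would then be compared against a consumption signal such as view-rate; real narration audio would need silence handling and a proper syllable or word count from a transcript.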

If this is right

Narration qualities can be matched to titles for higher consumption.
Data on acoustic features can inform narrator casting choices.
Genre-specific acoustic preferences become identifiable for personalization.
Computational methods can supplement human judgment in audiobook production.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms could use acoustic profiles to recommend narrators to users with similar past preferences.
The approach might extend to training or evaluating synthetic voices for appeal.
Longitudinal listener data could reveal whether acoustic appeal changes over repeated listens.

Load-bearing premise

View-rate and proprietary engagement metrics serve as reliable proxies for narration appeal without substantial confounding from content, marketing, or listener demographics.

What would settle it

An experiment that swaps different narrations for identical titles and measures resulting changes in view-rate or engagement would test whether the acoustic association is causal.

Original abstract

Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Acoustic features from narration show an association with view-rate and engagement metrics even after title controls, but the proxy measures and lack of detail on controls leave the isolation of narration quality uncertain.

The main takeaway is that this paper extracts vocal and acoustic features from LibriVox audiobooks using pre-trained models and reports a link to consumption metrics that survives title-level controls. They also check interactions with genre and back the result with proprietary engagement data.

What stands out as new is the attempt to treat narration appeal as a measurable signal tied to real listening data rather than just subjective ratings. The work does a reasonable job framing the problem around commercial use cases like narrator casting and noting that effects vary by title and genre.

The soft spot is the reliance on view-rate as the main outcome. The abstract claims the association holds after accounting for title effects, but supplies no description of the controls, no sample sizes, and no checks against marketing spend or listener demographics. Proprietary metrics are mentioned for validation, yet without external grounding against human narration judgments it is difficult to know whether the signal is truly about narration or residual title popularity. Limited data is flagged, which makes the robustness claim harder to assess.

This is aimed at people working on audio analysis for media recommendation or personalization. A reader in speech processing or digital publishing might find the feature extraction and consumption tie-in useful as a starting point.

I would send it for peer review so the methods can be examined directly; the core idea has practical relevance if the controls and metrics hold up under scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript extracts vocal and acoustic features (tone, pace, loudness) from LibriVox audiobooks via pre-trained audio models and reports a robust association between these features and narration appeal, measured via view-rate and proprietary engagement metrics. The association is claimed to persist after accounting for title effects and to vary by genre and title; the work positions itself as the first systematic computational study linking narration qualities to consumption data.

Significance. If the reported association is shown to be isolated from title popularity, marketing, and demographic confounders, the result would be significant for audiobook recommendation systems and narrator casting, as it supplies the first quantitative evidence that acoustic properties alone carry predictive signal for engagement.

major comments (2)

[Methods] Methods section: the description of how title effects are controlled (fixed effects, matching, or regression covariates) is insufficient to determine whether acoustic features are isolated from residual title-level popularity, marketing spend, or content-driven selection; without these details the central claim that the association is 'robust even after accounting for title effects' cannot be evaluated.
[Results] Results section: no sample sizes, confidence intervals, or model specifications (e.g., regression coefficients, R² values, or cross-validation details) are provided for the view-rate or proprietary-metric analyses, preventing assessment of whether the reported robustness exceeds what would be expected from imperfect title controls.

minor comments (2)

[Abstract] The abstract and introduction should explicitly state the number of titles, narrations, and listeners in the LibriVox and proprietary datasets.
[Methods] Clarify whether the pre-trained audio models were fine-tuned on any audiobook data or used zero-shot; this affects reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional clarity is needed, and we will revise the manuscript to address them directly. Below we respond point by point.

Referee: [Methods] Methods section: the description of how title effects are controlled (fixed effects, matching, or regression covariates) is insufficient to determine whether acoustic features are isolated from residual title-level popularity, marketing spend, or content-driven selection; without these details the central claim that the association is 'robust even after accounting for title effects' cannot be evaluated.

Authors: We agree that the current methods description is too brief. In the revision we will expand the relevant subsection to specify that title fixed effects were included in the linear regression models relating acoustic features to view-rate (and separately to the proprietary metrics). This specification absorbs all time-invariant title-level factors. We will also explicitly note the absence of marketing-spend or time-varying selection variables in the LibriVox-derived data and discuss this as a limitation of the design.

Revision: yes

Referee: [Results] Results section: no sample sizes, confidence intervals, or model specifications (e.g., regression coefficients, R² values, or cross-validation details) are provided for the view-rate or proprietary-metric analyses, preventing assessment of whether the reported robustness exceeds what would be expected from imperfect title controls.

Authors: We accept that these quantitative details were omitted. The revised results section will report the exact sample sizes used for each analysis, the regression coefficients with 95% confidence intervals, R² values, and any cross-validation or robustness checks performed. These additions will allow readers to evaluate the magnitude and stability of the reported associations.

Revision: yes
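
The title fixed-effects specification named in the first response can be sketched on simulated data. The example below is a minimal illustration, not the authors' model: it assumes several narrations per title, a single hypothetical acoustic feature, and an invented true slope of 0.5. It uses the within transformation (subtracting each title's mean), which gives the same slope as a regression with one dummy variable per title, and reports an approximate 95% confidence interval of the kind the second response promises.

```python
import numpy as np


def demean_by_group(values, group_ids):
    """Subtract each group's mean from its members (the within transformation)."""
    values = np.asarray(values, dtype=float)
    _, inverse = np.unique(np.asarray(group_ids), return_inverse=True)
    group_means = np.bincount(inverse, weights=values) / np.bincount(inverse)
    return values - group_means[inverse]


# Simulated data (all values are assumptions made for illustration).
rng = np.random.default_rng(0)
n_titles, per_title = 200, 4
title = np.repeat(np.arange(n_titles), per_title)
title_effect = rng.normal(0.0, 2.0, n_titles)      # unobserved title popularity
# The acoustic feature is correlated with title popularity, so pooled OLS is biased.
feature = 0.8 * title_effect[title] + rng.normal(0.0, 1.0, title.size)
true_beta = 0.5
appeal = title_effect[title] + true_beta * feature + rng.normal(0.0, 0.5, title.size)

# Pooled slope: ignores titles and absorbs the popularity confound.
f_c, a_c = feature - feature.mean(), appeal - appeal.mean()
beta_pooled = float(f_c @ a_c / (f_c @ f_c))

# Fixed-effects slope: compares narrations only within the same title.
f_w, a_w = demean_by_group(feature, title), demean_by_group(appeal, title)
beta_fe = float(f_w @ a_w / (f_w @ f_w))

# Approximate 95% interval; degrees of freedom account for the title means.
resid = a_w - beta_fe * f_w
dof = title.size - n_titles - 1
se = float(np.sqrt((resid @ resid) / dof / (f_w @ f_w)))
ci_low, ci_high = beta_fe - 1.96 * se, beta_fe + 1.96 * se

print(f"pooled: {beta_pooled:.2f}  fixed effects: {beta_fe:.2f} "
      f"[{ci_low:.2f}, {ci_high:.2f}]")
```

In this simulation the pooled slope is pulled well above the true value while the within-title slope stays close to it. The fixed effects remove only factors that are constant within a title, so confounders that vary across narrations of the same title would remain.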

Circularity Check

0 steps flagged

No significant circularity; empirical associations rely on external models and data

The paper extracts vocal/acoustic features via pre-trained audio models (external to the study) and performs statistical analysis of associations with view-rate and proprietary engagement metrics, including title-effect controls. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce any claim to its own inputs by construction. The central finding is an observed correlation after controls, not a self-referential prediction or uniqueness theorem. This is a standard observational study whose validity rests on data quality rather than definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; ledger populated from stated elements only. View-rate is treated as an appeal proxy without visible justification. Pre-trained models are assumed to extract relevant narration qualities.

axioms (2)

domain assumption: View-rate is a valid proxy for audiobook appeal
Used as the consumption metric in the analysis; the abstract notes limited data but does not validate the proxy.
domain assumption: Pre-trained audio models extract narration qualities independent of textual content
Core to the feature extraction step; the abstract gives no details on content controls.

pith-pipeline@v0.9.1-grok · 5682 in / 1331 out tokens · 22913 ms · 2026-07-03T14:17:51.035665+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Introduction. Narration style and acoustic presentation are important components of audiobooks; they have the power to either elevate or undermine a listener’s experience, understanding, and engagement with the story [1]. While the narration alone may not be the determining factor in audiobook selection amongst users, it has a significant impact on whe...
[2]

Related Works. 2.1. Computational Paralinguistics and Voice Perception. Human voices carry paralinguistic information from which a listener perceives qualities about the speaker’s identity and intention [6]. Researchers have developed computational models for paralinguistic tasks such as perceived gender and age classification, health predictors, emotio...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Experimental Setup. 3.1. LibriVox catalogue. LibriVox [27] is a catalogue of public domain audiobooks, read and recorded by volunteers, with multiple titles and genres. The metadata (e.g., title, author, narrator, genres, text-source) and audio files are available to download freely. The Internet Archive keeps track of the number of page views, favourites,...
[4]

Results. 4.1. Statistical Modelling Results. Global modelling of consumption: The GLM attains a pseudo-R² of 0.09, indicating that narration-related properties explain a measurable portion of variation in appeal despite the coarse proxy (see Sec. 3.1) and omission of title, genre, and promotional factors. In a large and noisy real-world dataset, explaining ...
[5]

Conclusion We examined the relationship between audiobook narration, genres, title, and consumption, and consistently found that acoustic features of narration influence appeal. The robustness of these results, despite coarse consumption data and mixed recording quality, validates our hypothesis that narration styles influence appeal, and point the way to...
[6]

Acknowledgments. We thank R. Dall, R. Jones, D. Korkinof, A. Lima, A. McDowell, S. Reddy, B. Regan, A. Torrisi, L. Vongsathorn, J. Walker, H. Zhang, E. zu Erbach for their useful feedback
[7]

Generative AI Use Disclosure. Generative AI tools were used to assist with language editing, formatting, and improving clarity of the manuscript. All experimental design, analysis, and results were conducted and verified by the authors
[8]

Why do we listen to audiobooks? the role of narrator performance, bgm, telepresence, and emotional connectedness,

D. Ji, B. Liu, J. Xu, and J. Gong, “Why do we listen to audiobooks? the role of narrator performance, bgm, telepresence, and emotional connectedness,” Sage Open, vol. 14, no. 2, 2024

2024
[9]

Preferences and attitudes of audiobook users in Sweden: Surveying Swedish audiobook groups on Facebook,

M. Dakic, “Preferences and attitudes of audiobook users in Sweden: Surveying Swedish audiobook groups on Facebook,” Master’s thesis, University of Borås, Faculty of Librarianship, Information, Education and IT, 2019

2019
[10]

Experiencing literary audiobooks: A framework for theoretical and empirical investigations of the auditory reception of literature,

L. Kosch, A. Schwabe, H. Boomgaarden, and G. Stocker, “Experiencing literary audiobooks: A framework for theoretical and empirical investigations of the auditory reception of literature,” Journal of Literary Theory, vol. 18, no. 1, pp. 67–88,
[11]

Available: https://doi.org/10.1515/jlt-2024-2005

[Online]. Available: https://doi.org/10.1515/jlt-2024-2005

work page doi:10.1515/jlt-2024-2005 2024
[12]

Generalized user representations for large-scale recommendations and downstream tasks,

G. Fazelnia, S. Gupta, C. Keum, M. Koh, T. Heath, G. Carrasco Hernández, S. Xie, N. Singh, I. Anderson, M. Hristakeva, P. Pehrson Skidén, and M. Lalmas, “Generalized user representations for large-scale recommendations and downstream tasks,” in Proceedings of the Nineteenth ACM Conference on Recommender Systems, ser. RecSys ’25. New York, NY, USA:...

work page doi:10.1145/3705328.3748132 2025
[13]

The Netflix recommender system: Algorithms, business value, and innovation,

C. A. Gomez-Uribe and N. Hunt, “The Netflix recommender system: Algorithms, business value, and innovation,” ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, Dec. 2016. [Online]. Available: https://doi.org/10.1145/2843948

work page doi:10.1145/2843948 2016
[14]

Neurocomputational models of voice and speech perception,

B. J. Kröger, “Neurocomputational models of voice and speech perception,” in The Oxford Handbook of Voice Perception, S. Frühholz and P. Belin, Eds. Oxford University Press, 12 2018. [Online]. Available: https://doi.org/10.1093/oxfordhb/9780198743187.013.34

work page doi:10.1093/oxfordhb/9780198743187.013.34 2018
[15]

Affective and behavioural computing: Lessons learnt from the first computational paralinguistics challenge,

B. Schuller, F. Weninger, Y. Zhang, F. Ringeval, A. Batliner, S. Steidl, F. Eyben, E. Marchi, A. Vinciarelli, K. Scherer, M. Chetouani, and M. Mortillaro, “Affective and behavioural computing: Lessons learnt from the first computational paralinguistics challenge,” Computer Speech & Language, vol. 53, pp. 156–180, 2019. [Online]. Available: https://www. sc...

2019
[16]

Improving domain generalization in speech emotion recognition with Whisper,

E. Goron, L. Asai, E. Rut, and M. Dinov, “Improving domain generalization in speech emotion recognition with Whisper,” in ICASSP 2024, 2024, pp. 11631–11635

2024
[17]

Multidimensional Mapping of Voice Attractiveness and Listener’s Preference: Optimization and Estimation from Audio Signal

Y. Obuchi, Multidimensional Mapping of Voice Attractiveness and Listener’s Preference: Optimization and Estimation from Audio Signal. Singapore: Springer Singapore, 2021, pp. 281–295. [Online]. Available: https://doi.org/10.1007/978-981-15-6627-1_15

work page doi:10.1007/978-981-15-6627-1_15 2021
[18]

Classification of spontaneous and scripted speech for multilingual audio,

S. Elisha, A. McDowell, M. Beguerisse-Díaz, and E. Benetos, “Classification of spontaneous and scripted speech for multilingual audio,” in 2024 SLT, 2024, pp. 489–495

2024
[19]

Acoustic analysis and digital signal processing for the assessment of voice quality,

F. Jalali-najafabadi, C. Gadepalli, D. Jarchi, and B. M. Cheetham, “Acoustic analysis and digital signal processing for the assessment of voice quality,” Biomedical Signal Processing and Control, vol. 70, p. 103018, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1746809421006157

2021
[20]

Discrimination of male and female voice using occurrence pattern of spectral flux,

G. Yasmin, S. Dutta, and A. Ghosal, “Discrimination of male and female voice using occurrence pattern of spectral flux,” in 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), 2017, pp. 576–581

2017
[21]

Automatic speech-based charisma recognition and the impact of integrating auxiliary characteristics,

A. Kathan, S. Amiriparian, L. Christ, S. Eulitz, and B. W. Schuller, “Automatic speech-based charisma recognition and the impact of integrating auxiliary characteristics,” in 2024 IEEE Conference on Telepresence, 2024, pp. 148–153

2024
[22]

Speech-based depression assessment: A comprehensive survey,

S. S. Leal, S. Ntalampiras, and R. Sassi, “Speech-based depression assessment: A comprehensive survey,” IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1318–1333, 2025

2025
[23]

Computational paralinguistics: emotion, affect and personality in speech and language processing

B. Schuller and A. Batliner, Computational paralinguistics: emotion, affect and personality in speech and language processing. John Wiley & Sons, 2013

2013
[24]

Ethical awareness in paralinguistics: A taxonomy of applications,

A. Batliner, M. Neumann, F. Burkhardt, A. Baird, S. Meyer, N. T. Vu, and B. W. Schuller, “Ethical awareness in paralinguistics: A taxonomy of applications,” International Journal of Human–Computer Interaction, vol. 39, no. 9, pp. 1904–1921, 2023. [Online]. Available: https://doi.org/10.1080/10447318.2022.2140385

work page doi:10.1080/10447318.2022.2140385 2023
[25]

Emotionally enhanced audiobook reader with character voice differentiation,

B. Manoj, J. Jiji, R. Dileep, and N. Manohar, “Emotionally enhanced audiobook reader with character voice differentiation,” in 2025 International Conference on Computing Technologies (ICOCT), 2025, pp. 1–6

2025
[26]

Investigating inter- and intra-speaker voice conversion using audiobooks,

A. Sini, D. Lolive, N. Barbot, and P. Alain, “Investigating inter- and intra-speaker voice conversion using audiobooks,” in Proc. of the 13th LREC. Marseille, France: European Language Resources Association, Jun. 2022, pp. 7305–7313. [Online]. Available: https://aclanthology.org/2022.lrec-1.794/

2022
[27]

Synthetic versus human voices in audiobooks: The human emotional intimacy effect,

E. Rodero and I. Lucas, “Synthetic versus human voices in audiobooks: The human emotional intimacy effect,” New Media & Society, vol. 25, no. 7, pp. 1746–1764, 2023. [Online]. Available: https://doi.org/10.1177/14614448211024142

work page doi:10.1177/14614448211024142 2023
[28]

Evaluating expressive speech synthesis from audiobook corpora for conversational phrases,

É. Székely, J. P. Cabral, M. Abou-Zleikha, P. Cahill, and J. Carson-Berndsen, “Evaluating expressive speech synthesis from audiobook corpora for conversational phrases,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A....

2012
[29]

Available: https://aclanthology.org/L12-1513/

[Online]. Available: https://aclanthology.org/L12-1513/
[30]

The role of prosody and voice quality in indirect storytelling speech: Annotation methodology and expressive categories,

R. Montaño and F. Alías, “The role of prosody and voice quality in indirect storytelling speech: Annotation methodology and expressive categories,” Speech Communication, vol. 85, pp. 8–18, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639315300108

2016
[31]

The role of prosody and voice quality in indirect storytelling speech: A cross-narrator perspective in four European languages,

R. Montaño and F. Alías, “The role of prosody and voice quality in indirect storytelling speech: A cross-narrator perspective in four European languages,” Speech Communication, vol. 88, pp. 1–16, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639315300418

2017
[32]

Prosody analysis of audiobooks,

C. Pethe, B. Pham, F. D. Childress, Y. Yin, and S. Skiena, “Prosody analysis of audiobooks,” in 2025 19th International Conference on Semantic Computing (ICSC), 2025, pp. 217–221

2025
[33]

Clustering expressive speech styles in audiobooks using glottal source parameters

É. Székely, J. P. Cabral, P. Cahill, and J. Carson-Berndsen, “Clustering expressive speech styles in audiobooks using glottal source parameters.” Proc. Interspeech 2011, pp. 2409–2412, 2011

2011
[34]

Representing voices using convolutional neural network embeddings,

N. Embretsén, “Representing voices using convolutional neural network embeddings,” Master’s thesis, KTH, School of Electrical Engineering and Computer Science (EECS), 2019

2019
[35]

Narrative aesthetic absorption in audiobooks is predicted by blink rate and acoustic features

E. B. Lange, D. Thiele, and M. M. Kuijpers, “Narrative aesthetic absorption in audiobooks is predicted by blink rate and acoustic features.” Psychology of Aesthetics, Creativity, and the Arts, vol. 16, no. 1, pp. 110–124, 2022. [Online]. Available: https://doi.org/10.1037/aca0000321

work page doi:10.1037/aca0000321 2022
[36]

LibriVox: Free public domain audiobooks,

LibriVox, “LibriVox: Free public domain audiobooks,” https://librivox.org, 2025

2025
[37]

LibriVox audio collection,

Internet Archive, “LibriVox audio collection,” https://archive.org/details/librivoxaudio, 2025

2025
[38]

The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,

F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016

2016
[39]

Opensmile: the munich versatile and fast open-source audio feature extractor,

F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM ’10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 1459–1462. [Online]. Available: https://doi.org/10.1145/1873951.1874246

work page doi:10.1145/1873951.1874246 2010
[40]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[41]

Audio set: An ontology and human-labeled dataset for audio events,

J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780

2017
[42]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023

2023
[43]

syllables: A simple syllable counting package for Python,

K. Gorman, “syllables: A simple syllable counting package for Python,” https://pypi.org/project/syllables/, 2025

2025
[44]

Controlling the false discovery rate: A practical and powerful approach to multiple testing,

Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995. [Online]. Available: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1995.tb02031.x

work page doi:10.1111/j.2517-6161.1995.tb02031.x 1995
[45]

D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis, 6th ed. Wiley, 2021

2021
[46]

A new look at the statistical model identification,

H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716– 723, 1974

1974
[47]

XGBoost documentation,

XGBoost Developers, “XGBoost documentation,” https://xgboost.readthedocs.io, 2025

2025
[48]

Learning to rank with nonsmooth cost functions,

C. J. C. Burges, R. Ragno, and Q. V. Le, “Learning to rank with nonsmooth cost functions,” in Proceedings of the 20th International Conference on Neural Information Processing Systems, ser. NIPS’06. Cambridge, MA, USA: MIT Press, 2006, pp. 193–200

2006
[49]

Learning to rank using gradient descent,

C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in Proceedings of the 22nd International Conference on Machine Learning, ser. ICML ’05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 89–96. [Online]. Available: https://doi.org/10.1145/1102351.1102363

work page doi:10.1145/1102351.1102363 2005
[50]

LightGBM documentation,

LightGBM Developers, “LightGBM documentation,” https://lightgbm.readthedocs.io, 2025

2025
[51]

From RankNet to LambdaRank to LambdaMART: An overview,

C. J. Burges, “From RankNet to LambdaRank to LambdaMART: An overview,” Tech. Rep. MSR-TR-2010-82, June 2010. [Online]. Available: https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/

2010
[52]

DiffKendall: a novel approach for few-shot learning with differentiable kendall’s rank correlation,

K. Zheng, H. Zhang, and W. Huang, “DiffKendall: a novel approach for few-shot learning with differentiable kendall’s rank correlation,” in Proc. of the 37th NeurIPS, ser. NIPS ’23. Red Hook, NY, USA: Curran Associates Inc., 2023

2023
[53]

The effect of levels and types of experience on judgment of synthesized voice quality,

J. L. Sofranko and R. A. Prosek, “The effect of levels and types of experience on judgment of synthesized voice quality,” Journal of Voice, vol. 28, no. 1, pp. 24–35, 2014. [Online]. Available: https://www.jvoice.org/article/S0892-1997(13)00103-3/abstract

2014
[54]

Acoustic features distinguishing emotions in Swedish speech,

M. Ekberg, G. Stavrinos, J. Andin, S. Stenfelt, and Ö. Dahlström, “Acoustic features distinguishing emotions in Swedish speech,” Journal of Voice, vol. 39, no. 6, pp. 1699.e11–1699.e20, 2025. [Online]. Available: https://doi.org/10.1016/j.jvoice.2023.03.010

work page doi:10.1016/j.jvoice.2023.03.010 2025

[1] [1]

Introduction Narration style and acoustic presentation are important compo- nents of audiobooks; they have the power to either elevate or undermine a listener’s experience, understanding, and engage- ment with the story [1]. While the narration alone may not be the determining factor in audiobook selection amongst users, it has a significant impact on whe...

[2] [2]

Related Works 2.1. Computational Paralinguistics and Voice Perception Human voices carry paralinguistic information from which a listener perceives qualities about the speaker’s identity and in- tention [6]. Researchers have developed computational mod- els for paralinguistic tasks such as perceived gender and age classification, health predictors, emotio...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

LibriVox catalogue LibriV ox [27] is a catalogue of public domain audiobooks, read and recorded by volunteers, with multiple titles and genres

Experimental Setup 3.1. LibriVox catalogue LibriV ox [27] is a catalogue of public domain audiobooks, read and recorded by volunteers, with multiple titles and genres. The metadata (e.g., title, author, narrator, genres, text-source) and audio files are available to download freely. The Internet Archive keeps track of the number of page views, favourites,...

[4] [4]

Results 4.1. Statistical Modelling Results Global modelling of consumption: The GLM attains a pseudo-R2 of 0.09, indicating that narration-related properties explain a measurable portion of variation in appeal despite the coarse proxy (see Sec. 3.1) and omission of title, genre, and promotional factors. In a large and noisy real-world dataset, explaining ...

[5] [5]

Conclusion We examined the relationship between audiobook narration, genres, title, and consumption, and consistently found that acoustic features of narration influence appeal. The robustness of these results, despite coarse consumption data and mixed recording quality, validates our hypothesis that narration styles influence appeal, and point the way to...

[6] [6]

Acknowledgments We thank R. Dall, R. Jones, D. Korkinof, A. Lima, A. McDow- ell, S. Reddy, B. Regan, A. Torrisi, L. V ongsathorn, J. Walker, H. Zhang, E. zu Erbach for their useful feedback

[7] [7]

All experi- mental design, analysis, and results were conducted and verified by the authors

Generative AI Use Disclosure Generative AI tools were used to assist with language editing, formatting, and improving clarity of the manuscript. All experi- mental design, analysis, and results were conducted and verified by the authors

[8] D. Ji, B. Liu, J. Xu, and J. Gong, “Why do we listen to audiobooks? the role of narrator performance, bgm, telepresence, and emotional connectedness,” Sage Open, vol. 14, no. 2, 2024.

[9] M. Dakic, “Preferences and attitudes of audiobook users in Sweden: Surveying Swedish audiobook groups on Facebook,” Master’s thesis, University of Borås, Faculty of Librarianship, Information, Education and IT, 2019.

[10] L. Kosch, A. Schwabe, H. Boomgaarden, and G. Stocker, “Experiencing literary audiobooks: A framework for theoretical and empirical investigations of the auditory reception of literature,” Journal of Literary Theory, vol. 18, no. 1, pp. 67–88, 2024. [Online]. Available: https://doi.org/10.1515/jlt-2024-2005

[12] G. Fazelnia, S. Gupta, C. Keum, M. Koh, T. Heath, G. Carrasco Hernández, S. Xie, N. Singh, I. Anderson, M. Hristakeva, P. Pehrson Skidén, and M. Lalmas, “Generalized user representations for large-scale recommendations and downstream tasks,” in Proceedings of the Nineteenth ACM Conference on Recommender Systems, ser. RecSys ’25. New York, NY, USA:... doi:10.1145/3705328.3748132, 2025.

[13] C. A. Gomez-Uribe and N. Hunt, “The Netflix recommender system: Algorithms, business value, and innovation,” ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, Dec. 2016. [Online]. Available: https://doi.org/10.1145/2843948

[14] B. J. Kröger, “Neurocomputational models of voice and speech perception,” in The Oxford Handbook of Voice Perception, S. Frühholz and P. Belin, Eds. Oxford University Press, Dec. 2018. [Online]. Available: https://doi.org/10.1093/oxfordhb/9780198743187.013.34

[15] B. Schuller, F. Weninger, Y. Zhang, F. Ringeval, A. Batliner, S. Steidl, F. Eyben, E. Marchi, A. Vinciarelli, K. Scherer, M. Chetouani, and M. Mortillaro, “Affective and behavioural computing: Lessons learnt from the first computational paralinguistics challenge,” Computer Speech & Language, vol. 53, pp. 156–180, 2019. [Online]. Available: https://www. sc...

[16] E. Goron, L. Asai, E. Rut, and M. Dinov, “Improving domain generalization in speech emotion recognition with Whisper,” in ICASSP 2024, 2024, pp. 11 631–11 635.

[17] Y. Obuchi, Multidimensional Mapping of Voice Attractiveness and Listener’s Preference: Optimization and Estimation from Audio Signal. Singapore: Springer Singapore, 2021, pp. 281–295. [Online]. Available: https://doi.org/10.1007/978-981-15-6627-1_15

[18] S. Elisha, A. McDowell, M. Beguerisse-Díaz, and E. Benetos, “Classification of spontaneous and scripted speech for multilingual audio,” in 2024 SLT, 2024, pp. 489–495.

[19] F. Jalali-najafabadi, C. Gadepalli, D. Jarchi, and B. M. Cheetham, “Acoustic analysis and digital signal processing for the assessment of voice quality,” Biomedical Signal Processing and Control, vol. 70, p. 103018, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1746809421006157

[20] G. Yasmin, S. Dutta, and A. Ghosal, “Discrimination of male and female voice using occurrence pattern of spectral flux,” in 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), 2017, pp. 576–581.

[21] A. Kathan, S. Amiriparian, L. Christ, S. Eulitz, and B. W. Schuller, “Automatic speech-based charisma recognition and the impact of integrating auxiliary characteristics,” in 2024 IEEE Conference on Telepresence, 2024, pp. 148–153.

[22] S. S. Leal, S. Ntalampiras, and R. Sassi, “Speech-based depression assessment: A comprehensive survey,” IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1318–1333, 2025.

[23] B. Schuller and A. Batliner, Computational paralinguistics: emotion, affect and personality in speech and language processing. John Wiley & Sons, 2013.

[24] A. Batliner, M. Neumann, F. Burkhardt, A. Baird, S. Meyer, N. T. Vu, and B. W. Schuller, “Ethical awareness in paralinguistics: A taxonomy of applications,” International Journal of Human–Computer Interaction, vol. 39, no. 9, pp. 1904–1921, 2023. [Online]. Available: https://doi.org/10.1080/10447318.2022.2140385

[25] B. Manoj, J. Jiji, R. Dileep, and N. Manohar, “Emotionally enhanced audiobook reader with character voice differentiation,” in 2025 International Conference on Computing Technologies (ICOCT), 2025, pp. 1–6.

[26] A. Sini, D. Lolive, N. Barbot, and P. Alain, “Investigating inter- and intra-speaker voice conversion using audiobooks,” in Proc. of the 13th LREC. Marseille, France: European Language Resources Association, Jun. 2022, pp. 7305–7313. [Online]. Available: https://aclanthology.org/2022.lrec-1.794/

[27] E. Rodero and I. Lucas, “Synthetic versus human voices in audiobooks: The human emotional intimacy effect,” New Media & Society, vol. 25, no. 7, pp. 1746–1764, 2023. [Online]. Available: https://doi.org/10.1177/14614448211024142

[28] É. Székely, J. P. Cabral, M. Abou-Zleikha, P. Cahill, and J. Carson-Berndsen, “Evaluating expressive speech synthesis from audiobook corpora for conversational phrases,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A.... [Online]. Available: https://aclanthology.org/L12-1513/

[30] R. Montaño and F. Alías, “The role of prosody and voice quality in indirect storytelling speech: Annotation methodology and expressive categories,” Speech Communication, vol. 85, pp. 8–18, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639315300108

[31] R. Montaño and F. Alías, “The role of prosody and voice quality in indirect storytelling speech: A cross-narrator perspective in four European languages,” Speech Communication, vol. 88, pp. 1–16, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167639315300418

[32] C. Pethe, B. Pham, F. D. Childress, Y. Yin, and S. Skiena, “Prosody analysis of audiobooks,” in 2025 19th International Conference on Semantic Computing (ICSC), 2025, pp. 217–221.

[33] É. Székely, J. P. Cabral, P. Cahill, and J. Carson-Berndsen, “Clustering expressive speech styles in audiobooks using glottal source parameters,” Proc. Interspeech 2011, pp. 2409–2412, 2011.

[34] N. Embretsén, “Representing voices using convolutional neural network embeddings,” Master’s thesis, KTH, School of Electrical Engineering and Computer Science (EECS), 2019.

[35] E. B. Lange, D. Thiele, and M. M. Kuijpers, “Narrative aesthetic absorption in audiobooks is predicted by blink rate and acoustic features,” Psychology of Aesthetics, Creativity, and the Arts, vol. 16, no. 1, pp. 110–124, 2022. [Online]. Available: https://doi.org/10.1037/aca0000321

[36] LibriVox, “LibriVox: Free public domain audiobooks,” https://librivox.org, 2025.

[37] Internet Archive, “LibriVox audio collection,” https://archive.org/details/librivoxaudio, 2025.

[38] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.

[39] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM ’10. New York, NY, USA: Association for Computing Machinery, 2010, p. 1459–1462. [Online]. Available: https://doi.org/10.1145/1873951.1874246

[40] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[41] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780.

[42] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023.

[43] K. Gorman, “syllables: A simple syllable counting package for Python,” https://pypi.org/project/syllables/, 2025.

[44] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995. [Online]. Available: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1995.tb02031.x

[45] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis, 6th ed. Wiley, 2021.

[46] H. Akaike, “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.

[47] XGBoost Developers, “XGBoost documentation,” https://xgboost.readthedocs.io, 2025.

[48] C. J. C. Burges, R. Ragno, and Q. V. Le, “Learning to rank with nonsmooth cost functions,” in Proceedings of the 20th International Conference on Neural Information Processing Systems, ser. NIPS’06. Cambridge, MA, USA: MIT Press, 2006, p. 193–200.

[49] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to rank using gradient descent,” in Proceedings of the 22nd International Conference on Machine Learning, ser. ICML ’05. New York, NY, USA: Association for Computing Machinery, 2005, p. 89–96. [Online]. Available: https://doi.org/10.1145/1102351.1102363

[50] LightGBM Developers, “LightGBM documentation,” https://lightgbm.readthedocs.io, 2025.

[51] C. J. Burges, “From RankNet to LambdaRank to LambdaMART: An overview,” Tech. Rep. MSR-TR-2010-82, June 2010. [Online]. Available: https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/

[52] K. Zheng, H. Zhang, and W. Huang, “DiffKendall: a novel approach for few-shot learning with differentiable kendall’s rank correlation,” in Proc. of the 37th NeurIPS, ser. NIPS ’23. Red Hook, NY, USA: Curran Associates Inc., 2023.

[53] J. L. Sofranko and R. A. Prosek, “The effect of levels and types of experience on judgment of synthesized voice quality,” Journal of Voice, vol. 28, no. 1, pp. 24–35, 2014. [Online]. Available: https://www.jvoice.org/article/S0892-1997(13)00103-3/abstract

[54] M. Ekberg, G. Stavrinos, J. Andin, S. Stenfelt, and Ö. Dahlström, “Acoustic features distinguishing emotions in Swedish speech,” Journal of Voice, vol. 39, no. 6, pp. 1699.e11–1699.e20, 2025. [Online]. Available: https://doi.org/10.1016/j.jvoice.2023.03.010