AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
Pith reviewed 2026-05-18 12:58 UTC · model grok-4.3
The pith
A dataset of over one million TV character dialogues trains voice models to better match both vocal style and spoken content in role-playing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the AudioRole collection of synchronized audio-text pairs from TV series, annotated with speaker identities and metadata, enables training of voice LLMs that learn character-specific voice and dialogue patterns. The fine-tuned model reaches an acoustic personalization score of 0.31, above both the base GLM-4-Voice model and the more capable MiniCPM-O-2.6, and a content personalization score of 0.36, about 38% above the base model and on par with MiniCPM-O-2.6.
What carries the argument
The AudioRole dataset of 1K+ hours of TV audio paired with 1M+ character-grounded dialogues, together with the ARP-Eval dual-aspect scoring system that separately measures acoustic and content fidelity.
If this is right
- Voice models can be specialized to reproduce both the sound and the typical speech content of individual characters.
- Multiple ARP-Models can be created, each dedicated to a different role from the dataset.
- ARP-Eval supplies a consistent yardstick for comparing future audio role-playing systems on two separate dimensions.
- The scale of one million annotated pairs supports further scaling of training for improved generalization.
Where Pith is reading between the lines
- The same approach could support training assistants that mimic the voices of public figures or historical characters when suitable audio archives become available.
- Interactive storytelling or game applications might adopt the dataset to let players converse with audio versions of fictional personas.
- Longer-term training on mixed scripted and spontaneous audio could test whether the observed gains hold beyond entertainment dialogue.
Load-bearing premise
That dialogues taken from scripted TV series supply representative examples that let models acquire general audio role-playing skills rather than merely memorizing show-specific patterns.
What would settle it
A drop in personalization scores when the trained ARP-Model is tested on new characters and dialogue situations drawn from sources outside the original thirteen TV series.
Original abstract
The creation of high-quality multimodal datasets remains fundamental for advancing role-playing capabilities in large language models (LLMs). While existing works predominantly focus on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges due to the need for synchronized alignment of semantic content and vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. In addition, to demonstrate the effectiveness of the dataset, we introduce ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation shows that GLM-4-Voice trained on AudioRole (which we call ARP-Model) achieves an average Acoustic Personalization score of 0.31, significantly outperforming the original GLM-4-Voice and the more powerful model MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and maintaining the same level as MiniCPM-O-2.6. AudioRole features dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, they provide an essential resource for advancing audio-grounded role-playing research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AudioRole, a curated multimodal dataset of over 1M character-grounded dialogues extracted from 13 TV series (1K+ hours of audio-text pairs with speaker identities and metadata). It introduces ARP-Eval, a dual-aspect framework measuring acoustic and content personalization, and shows that fine-tuning GLM-4-Voice on the dataset yields an ARP-Model with average Acoustic Personalization of 0.31 (outperforming the base GLM-4-Voice and MiniCPM-O-2.6) and Content Personalization of 0.36 (38% above the untrained base model).
Significance. If the empirical claims hold after proper validation, AudioRole would be a substantial resource for audio-grounded role-playing research, filling a gap left by text-only persona datasets. The scale (115+ characters, 6 trained models) and concrete outperformance numbers against a one-shot baseline are strengths; reproducible evaluation protocols would further increase impact.
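The "38% above the untrained base model" figure can be unpacked with simple arithmetic. The sketch below assumes the 38% is a relative gain over the base model's content score; that baseline is not quoted in this summary, so the derived value is an estimate rather than a reported number, and the function names are illustrative.

```python
def implied_baseline(score: float, relative_gain: float) -> float:
    """Return the baseline b such that score = b * (1 + relative_gain)."""
    return score / (1.0 + relative_gain)


def relative_gain(score: float, baseline: float) -> float:
    """Return the relative improvement of score over baseline."""
    return (score - baseline) / baseline


# Reported content personalization score and the approximate relative gain.
arp_content = 0.36
reported_gain = 0.38

# Estimated base-model score; an inference, not a figure from the paper.
baseline = implied_baseline(arp_content, reported_gain)
print(f"implied base-model content score: {baseline:.2f}")
```

Under that assumption the base model would sit near 0.26, so the absolute gap is roughly 0.10 on the ARP-Eval content scale.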
major comments (3)
- [Abstract and §4] Abstract and §4 (ARP-Eval): the headline scores (0.31 acoustic, 0.36 content) are presented without any description of test-set size, data splits, inter-annotator agreement, or statistical tests for the reported outperformance over GLM-4-Voice and MiniCPM-O-2.6. This directly undermines confidence in the central empirical claim.
- [§3] §3 (Dataset Curation): the claim that dialogues from 13 TV series provide representative training examples for general audio role-playing is load-bearing, yet the manuscript provides no analysis of potential confounds such as dramatic prosody bias, character selection criteria, or how well the distribution matches everyday conversational role-play.
- [§5] §5 (Experiments): the one-shot comparison to MiniCPM-O-2.6 is sensitive to whether test prompts remain within the same TV-series distribution; the paper must clarify whether held-out series or characters were used and report per-character or per-series breakdowns.
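One way to supply the statistical test requested in the first major comment is a paired bootstrap over test items. The sketch below is a generic version of that technique, not the paper's procedure; the per-utterance scores are invented for illustration, and `paired_bootstrap` is a hypothetical helper name.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Resample paired test items with replacement and return the observed
    mean gap plus the fraction of resamples in which A does not beat B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired item by item")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if mean(sample) <= 0:
            not_better += 1
    return mean(diffs), not_better / n_resamples


# Hypothetical per-utterance scores; these are NOT values from the paper.
arp_model = [0.35, 0.28, 0.33, 0.30, 0.31, 0.29, 0.34, 0.27]
base_model = [0.22, 0.25, 0.20, 0.24, 0.19, 0.26, 0.21, 0.23]

observed_gap, p_value = paired_bootstrap(arp_model, base_model)
print(f"mean gap = {observed_gap:.3f}, bootstrap p ~ {p_value:.4f}")
```

Pairing matters here because both systems would be scored on the same prompts, so resampling item-level differences removes variance due to prompt difficulty.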
minor comments (2)
- [§4] Notation for ARP-Eval scores should be defined explicitly (e.g., how acoustic and content components are normalized and aggregated) rather than left implicit in the abstract.
- [§3] The manuscript would benefit from a table summarizing dataset statistics (hours per series, number of unique speakers, dialogue length distribution) to allow readers to assess coverage.
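The statistics table requested in the second minor comment could be produced from utterance-level metadata along the following lines. This is a minimal sketch: the field names (`series`, `speaker`, `duration_s`, `text`) and the sample rows are assumptions made for illustration, not the AudioRole schema.

```python
from collections import defaultdict


def summarize_by_series(utterances):
    """Aggregate hours, unique speakers and mean words per utterance by series."""
    stats = defaultdict(
        lambda: {"seconds": 0.0, "speakers": set(), "words": 0, "count": 0}
    )
    for u in utterances:
        s = stats[u["series"]]
        s["seconds"] += u["duration_s"]
        s["speakers"].add(u["speaker"])
        s["words"] += len(u["text"].split())
        s["count"] += 1
    return {
        series: {
            "hours": s["seconds"] / 3600.0,
            "unique_speakers": len(s["speakers"]),
            "mean_words": s["words"] / s["count"],
        }
        for series, s in stats.items()
    }


# Placeholder rows; series and speaker names are hypothetical.
utterances = [
    {"series": "series_A", "speaker": "char_1", "duration_s": 1800.0, "text": "hello there friend"},
    {"series": "series_A", "speaker": "char_2", "duration_s": 1800.0, "text": "good morning"},
    {"series": "series_B", "speaker": "char_3", "duration_s": 900.0, "text": "where are we going today"},
]

summary = summarize_by_series(utterances)
for series, row in sorted(summary.items()):
    print(f"{series}: {row['hours']:.2f} h, {row['unique_speakers']} speakers, "
          f"{row['mean_words']:.1f} words/utterance")
```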
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us identify areas to improve the clarity and rigor of our empirical claims. We address each major comment below.
- Referee: [Abstract and §4] Abstract and §4 (ARP-Eval): the headline scores (0.31 acoustic, 0.36 content) are presented without any description of test-set size, data splits, inter-annotator agreement, or statistical tests for the reported outperformance over GLM-4-Voice and MiniCPM-O-2.6. This directly undermines confidence in the central empirical claim.
  Authors: We agree that additional details on the evaluation protocol are necessary to support the reported scores. In the revised version, we will expand §4 to include the size of the test set used for ARP-Eval, the specific data splitting strategy (including any held-out portions), inter-annotator agreement metrics for the personalization scores, and results from statistical tests comparing ARP-Model against the baselines. These additions will provide better context for the 0.31 and 0.36 scores.
  Revision: yes
- Referee: [§3] §3 (Dataset Curation): the claim that dialogues from 13 TV series provide representative training examples for general audio role-playing is load-bearing, yet the manuscript provides no analysis of potential confounds such as dramatic prosody bias, character selection criteria, or how well the distribution matches everyday conversational role-play.
  Authors: The selection of 13 TV series was intended to capture a wide variety of character interactions and vocal styles across different genres. We acknowledge the potential for biases inherent in scripted dramatic content, such as heightened prosody. In the revision, we will add an analysis subsection in §3 discussing character selection criteria, potential confounds including dramatic prosody bias, and a comparison of the dialogue distribution to more naturalistic conversations. We will also note limitations regarding generalizability to everyday role-play scenarios.
  Revision: yes
- Referee: [§5] §5 (Experiments): the one-shot comparison to MiniCPM-O-2.6 is sensitive to whether test prompts remain within the same TV-series distribution; the paper must clarify whether held-out series or characters were used and report per-character or per-series breakdowns.
  Authors: To ensure fair comparison, our experiments in §5 utilized held-out series and characters not seen during training of ARP-Model. We will clarify this in the revised manuscript and include per-character and per-series performance breakdowns for both acoustic and content personalization scores. This will demonstrate the robustness of the outperformance across different distributions.
  Revision: yes
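The held-out evaluation and per-character breakdown discussed in the last response can be illustrated with a small group-wise split. The sketch assumes each scored record carries `series`, `character`, and `score` fields; those names, the helper functions, and all values are hypothetical and only show the shape of the check.

```python
from collections import defaultdict
from statistics import mean


def split_by_series(records, held_out_series):
    """Place every record from a held-out series in the test set."""
    train = [r for r in records if r["series"] not in held_out_series]
    test = [r for r in records if r["series"] in held_out_series]
    return train, test


def per_character_means(scored):
    """Average the score of each character over its scored records."""
    by_char = defaultdict(list)
    for r in scored:
        by_char[r["character"]].append(r["score"])
    return {char: mean(values) for char, values in by_char.items()}


# Invented records; not data from the paper.
records = [
    {"series": "series_A", "character": "char_1", "score": 0.30},
    {"series": "series_A", "character": "char_2", "score": 0.28},
    {"series": "series_B", "character": "char_3", "score": 0.34},
    {"series": "series_B", "character": "char_3", "score": 0.30},
    {"series": "series_B", "character": "char_4", "score": 0.26},
]

train, test = split_by_series(records, {"series_B"})

# Characters appearing on both sides would indicate leakage.
leaked = {r["character"] for r in train} & {r["character"] for r in test}
breakdown = per_character_means(test)
print("leaked characters:", sorted(leaked))
print("per-character test means:", breakdown)
```

Splitting by series rather than by utterance keeps a show's recurring phrasing and recording conditions out of the test set, which is the concern the referee raises.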
Circularity Check
Empirical dataset and training results with no derivational circularity
The paper is an empirical contribution: it curates the AudioRole dataset from 13 TV series (1M+ dialogues, 1K+ hours) with speaker annotations, introduces the ARP-Eval dual-aspect framework, trains ARP-Model variants on the data, and reports measured scores (acoustic 0.31, content 0.36) against external baselines such as the original GLM-4-Voice and MiniCPM-O-2.6. No equations, first-principles derivations, or predictions appear in the provided text that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims rest on direct comparisons to independent models rather than internal renormalization or ansatz smuggling. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: dialogues extracted from TV series provide suitable and generalizable training examples for audio-grounded character role-playing in LLMs.