arXiv: 2509.23435 · v2 · submitted 2025-09-27 · cs.SD · cs.AI · cs.MM · eess.AS

AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo

Pith reviewed 2026-05-18 12:58 UTC · model grok-4.3

keywords audio dataset · role-playing · voice models · multimodal training · TV series dialogues · personalization metrics · character simulation · evaluation framework

The pith

A dataset of over one million TV character dialogues trains voice models to better match both vocal style and spoken content in role-playing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AudioRole, a dataset drawn from thirteen TV series that supplies more than one million synchronized audio-text dialogue pairs labeled by speaker and context. It shows that fine-tuning a model such as GLM-4-Voice on this data produces the ARP-Model, which reaches an acoustic personalization score of 0.31 and a content personalization score of 0.36. The latter figure represents a thirty-eight percent improvement over the same model before training and equals the score of a stronger baseline that already supports role-playing. An accompanying ARP-Eval framework judges both acoustic traits and semantic fidelity. The work therefore supplies both the training resource and the measurement method needed to advance audio-grounded character simulation.
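The relative-improvement figure can be unpacked with a little arithmetic. The Python sketch below back-computes the base model's content score implied by the reported 0.36 and the roughly 38% gain. The implied value is an inference from those two rounded numbers, not a figure stated in the abstract, and the function names are hypothetical.

```python
def relative_gain(new_score: float, base_score: float) -> float:
    """Relative improvement of new_score over base_score, as a fraction."""
    return (new_score - base_score) / base_score


def implied_base(new_score: float, gain: float) -> float:
    """Base score implied by a new score and a fractional relative gain."""
    return new_score / (1.0 + gain)


# Reported in the abstract: content score 0.36, about 38% above the base model.
base = implied_base(0.36, 0.38)
print(f"implied base content score: {base:.3f}")
print(f"round-trip relative gain:   {relative_gain(0.36, base):.2%}")
```

Because both inputs are rounded, the implied base score of roughly 0.26 is only approximate.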

Core claim

The central claim is that the AudioRole collection of synchronized audio-text pairs from TV series, annotated with speaker identities and metadata, enables training of voice LLMs that learn character-specific voice and dialogue patterns. The resulting ARP-Model reaches an acoustic personalization score of 0.31, above both the base GLM-4-Voice model and the more capable MiniCPM-O-2.6, and a content personalization score of 0.36, about 38% above the base model and on par with MiniCPM-O-2.6.

What carries the argument

The AudioRole dataset of 1K+ hours of TV audio paired with 1M+ character-grounded dialogues, together with the ARP-Eval dual-aspect scoring system that separately measures acoustic and content fidelity.

If this is right

Voice models can be specialized to reproduce both the sound and the typical speech content of individual characters.
Multiple ARP-Models can be created, each dedicated to a different role from the dataset.
ARP-Eval supplies a consistent yardstick for comparing future audio role-playing systems on two separate dimensions.
The scale of one million annotated pairs supports further scaling of training for improved generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same approach could support training assistants that mimic the voices of public figures or historical characters when suitable audio archives become available.
Interactive storytelling or game applications might adopt the dataset to let players converse with audio versions of fictional personas.
Longer-term training on mixed scripted and spontaneous audio could test whether the observed gains hold beyond entertainment dialogue.

Load-bearing premise

That dialogues taken from scripted TV series supply representative examples that let models acquire general audio role-playing skills rather than merely memorizing show-specific patterns.

What would settle it

A drop in personalization scores when the trained ARP-Model is tested on new characters and dialogue situations drawn from sources outside the original thirteen TV series.
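One way to picture this test is to tag every evaluation sample by whether its series was excluded from training and then compare mean scores across the two groups. The Python sketch below does that on toy records; the field names, series and character labels, and scores are all made up for illustration and do not come from the paper.

```python
from collections import defaultdict


def tag_split(samples, held_out_series):
    """Label each sample 'held_out' if its series was excluded from training."""
    held_out = set(held_out_series)
    return [
        {**s, "split": "held_out" if s["series"] in held_out else "in_distribution"}
        for s in samples
    ]


def mean_score_by_split(samples):
    """Average the per-sample score separately for each split label."""
    totals = defaultdict(lambda: [0.0, 0])
    for s in samples:
        totals[s["split"]][0] += s["score"]
        totals[s["split"]][1] += 1
    return {split: total / count for split, (total, count) in totals.items()}


# Toy records with made-up scores; series and character names are placeholders.
scored = [
    {"series": "show_a", "character": "char_1", "score": 0.50},
    {"series": "show_a", "character": "char_2", "score": 0.25},
    {"series": "show_b", "character": "char_3", "score": 0.25},
    {"series": "show_b", "character": "char_4", "score": 0.25},
]
means = mean_score_by_split(tag_split(scored, held_out_series=["show_b"]))
gap = means["in_distribution"] - means["held_out"]
print(means, f"gap={gap:.3f}")
```

A large positive gap on real data would point toward show-specific memorization; a small one would support the generalization premise.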

Figures

Figures reproduced from arXiv: 2509.23435 by Guangyan Zhang, Wenyu Li, Xiaoqi Jiao, Yi Chang, Yiwen Guo.

**Figure 1.** Audio Role-Playing case of Sheldon. The answer should satisfy not only lexical similarity but also acoustic similarity, which are the two characteristics of Sheldon. Adjacent text in the source: "… remarkable proficiency in textual persona simulation, serving as personalized assistants, emotional companions, and social interaction proxies (Park et al., 2023). However, this progress remains fundamentally constrained by unimoda…"

**Figure 2.** The pipeline of AudioRole construction. Adjacent text in the source: "… operations. By calculating exact frame positions from RTTM timestamps and sample rates, we precisely slice the concatenated audio stream while maintaining original quality. These segments are then merged per speaker identity through PyTorch's tensor concatenation: $F_k = \texttt{torch.cat}([C^{(k)}_1, \ldots, C^{(k)}_M])$ (2), where $F_k$ represents the merged audio for speaker $k$, and C …"
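The slicing-and-merging step quoted with Figure 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the waveform is a plain list rather than a tensor, list extension stands in for `torch.cat`, the helper names are hypothetical, and the field positions follow the standard RTTM layout (onset in field 4, duration in field 5, speaker name in field 8).

```python
from collections import defaultdict


def parse_rttm(lines):
    """Yield (speaker, onset_sec, duration_sec) from RTTM SPEAKER records."""
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        yield fields[7], float(fields[3]), float(fields[4])


def merge_by_speaker(waveform, sample_rate, rttm_lines):
    """Slice waveform at RTTM boundaries and concatenate segments per speaker."""
    merged = defaultdict(list)
    for speaker, onset, duration in parse_rttm(rttm_lines):
        # Convert timestamps to sample indices; round() avoids off-by-one
        # errors from floating-point products such as 0.7 * 10.
        start = round(onset * sample_rate)
        end = round((onset + duration) * sample_rate)
        merged[speaker].extend(waveform[start:end])  # stands in for torch.cat
    return dict(merged)


# Toy input: a 1-second "signal" at 10 Hz, so sample values equal sample indices.
waveform = list(range(10))
rttm = [
    "SPEAKER ep01 1 0.0 0.3 <NA> <NA> spk_a <NA> <NA>",
    "SPEAKER ep01 1 0.3 0.4 <NA> <NA> spk_b <NA> <NA>",
    "SPEAKER ep01 1 0.7 0.3 <NA> <NA> spk_a <NA> <NA>",
]
merged = merge_by_speaker(waveform, 10, rttm)
print(merged)
```

Here the two `spk_a` turns end up in one contiguous buffer, which mirrors the role of $F_k$ in the quoted equation.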

**Figure 3.** One typical case of Sheldon, and the words …

Original abstract

The creation of high-quality multimodal datasets remains fundamental for advancing role-playing capabilities in large language models (LLMs). While existing works predominantly focus on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges due to the need for synchronized alignment of semantic content and vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. In addition, to demonstrate the effectiveness of the dataset, we introduced ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation showing GLM-4-Voice trained on AudioRole (which we called ARP-Model) achieve an average Acoustic Personalization score of 0.31, significantly outperforming the original GLM-4-voice and the more powerful model MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and maintaining the same level as MiniCPM-O-2.6. AudioRole features dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, they provide an essential resource for advancing audio-grounded role-playing research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AudioRole introduces a useful TV-sourced audio dataset for LLM role-playing with initial results, but evaluation details are missing to fully support the claims.

The punchline for this paper is a new dataset of synchronized audio and text from TV series for training LLMs on character role-playing, plus some reported gains after fine-tuning, though the details on how those gains were measured are sparse.

They have done solid work by assembling over a million dialogues from 115 characters across more than 1,000 hours, with speaker identities and metadata. This moves beyond text-only persona datasets and gives a concrete resource for audio-grounded role play. The ARP-Eval framework that looks at both acoustic and content aspects is a reasonable addition for assessing role fidelity.

The soft spots are in the empirical validation. The abstract mentions specific improvements like an acoustic score of 0.31 outperforming baselines and a content score of 0.36 up 38 percent, but it does not cover data splits, statistical significance, or how they handled potential biases from selecting dramatic TV dialogues. Those choices could mean the results are stronger on similar material than on general cases, especially in the one-shot tests against MiniCPM-O-2.6.

This paper targets researchers in multimodal models and speech applications who want data for consistent voice personas. Readers who need a starting point for audio role-playing experiments will get practical value from the dataset release. It shows clear thinking on the problem and engages with existing persona work, so it deserves a serious referee. I would send it to peer review to get feedback on strengthening the evaluation and checking for curation effects.

Referee Report

3 major / 2 minor

Summary. The paper presents AudioRole, a curated multimodal dataset of over 1M character-grounded dialogues extracted from 13 TV series (1K+ hours of audio-text pairs with speaker identities and metadata). It introduces ARP-Eval, a dual-aspect framework measuring acoustic and content personalization, and shows that fine-tuning GLM-4-Voice on the dataset yields an ARP-Model with average Acoustic Personalization of 0.31 (outperforming the base GLM-4-Voice and MiniCPM-O-2.6) and Content Personalization of 0.36 (38% above the untrained base model).

Significance. If the empirical claims hold after proper validation, AudioRole would be a substantial resource for audio-grounded role-playing research, filling a gap left by text-only persona datasets. The scale (115+ characters, 6 trained models) and concrete outperformance numbers against a one-shot baseline are strengths; reproducible evaluation protocols would further increase impact.

major comments (3)

[Abstract and §4 (ARP-Eval)] The headline scores (0.31 acoustic, 0.36 content) are presented without any description of test-set size, data splits, inter-annotator agreement, or statistical tests for the reported outperformance over GLM-4-Voice and MiniCPM-O-2.6. This directly undermines confidence in the central empirical claim.
[§3 (Dataset Curation)] The claim that dialogues from 13 TV series provide representative training examples for general audio role-playing is load-bearing, yet the manuscript provides no analysis of potential confounds such as dramatic prosody bias, character selection criteria, or how well the distribution matches everyday conversational role-play.
[§5 (Experiments)] The one-shot comparison to MiniCPM-O-2.6 is sensitive to whether test prompts remain within the same TV-series distribution; the paper must clarify whether held-out series or characters were used and report per-character or per-series breakdowns.
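As an illustration of the kind of statistical test the first comment asks for, the Python sketch below runs an exact paired sign-flip (permutation) test on the mean score difference between two systems evaluated on the same prompts. The choice of test is an assumption made here, not something the paper reports, and the scores are toy values invented for the example.

```python
from itertools import product


def sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided paired sign-flip test on the mean score difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely to
    # have either sign, so enumerate all 2**n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        extreme += stat >= observed - 1e-12  # tolerance for float rounding
        total += 1
    return extreme / total


# Toy per-prompt scores for two hypothetical systems on the same 8 prompts.
system_a = [0.50, 0.62, 0.48, 0.55, 0.51, 0.60, 0.47, 0.58]
system_b = [0.45, 0.55, 0.44, 0.49, 0.48, 0.52, 0.42, 0.52]
p_value = sign_flip_p_value(system_a, system_b)
print(f"two-sided p-value: {p_value:.4f}")
```

Exact enumeration is only practical for small samples; with hundreds of prompts a random subset of sign assignments, or a paired bootstrap, would be the usual substitute.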

minor comments (2)

[§4] Notation for ARP-Eval scores should be defined explicitly (e.g., how acoustic and content components are normalized and aggregated) rather than left implicit in the abstract.
[§3] The manuscript would benefit from a table summarizing dataset statistics (hours per series, number of unique speakers, dialogue length distribution) to allow readers to assess coverage.
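The summary table requested in the last comment could be produced from per-segment metadata along the following lines. This Python sketch assumes a hypothetical record layout with `series`, `speaker`, and `duration_sec` fields; the actual AudioRole schema is not described in the text available here, and the records below are placeholders.

```python
from collections import defaultdict


def summarize(segments):
    """Aggregate hours and unique speakers per series from segment records."""
    seconds = defaultdict(float)
    speakers = defaultdict(set)
    for seg in segments:
        seconds[seg["series"]] += seg["duration_sec"]
        speakers[seg["series"]].add(seg["speaker"])
    return {
        series: {"hours": seconds[series] / 3600.0, "speakers": len(speakers[series])}
        for series in sorted(seconds)
    }


# Placeholder segment records, not real dataset entries.
segments = [
    {"series": "show_a", "speaker": "char_1", "duration_sec": 1800.0},
    {"series": "show_a", "speaker": "char_2", "duration_sec": 1800.0},
    {"series": "show_b", "speaker": "char_3", "duration_sec": 900.0},
    {"series": "show_b", "speaker": "char_3", "duration_sec": 900.0},
]
table = summarize(segments)
for series, row in table.items():
    print(f"{series:<8} {row['hours']:>6.2f} h  {row['speakers']:>3} speakers")
```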

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas to improve the clarity and rigor of our empirical claims. We address each major comment below.

Referee: [Abstract and §4 (ARP-Eval)] The headline scores (0.31 acoustic, 0.36 content) are presented without any description of test-set size, data splits, inter-annotator agreement, or statistical tests for the reported outperformance over GLM-4-Voice and MiniCPM-O-2.6. This directly undermines confidence in the central empirical claim.

Authors: We agree that additional details on the evaluation protocol are necessary to support the reported scores. In the revised version, we will expand §4 to include the size of the test set used for ARP-Eval, the specific data splitting strategy (including any held-out portions), inter-annotator agreement metrics for the personalization scores, and results from statistical tests comparing ARP-Model against the baselines. These additions will provide better context for the 0.31 and 0.36 scores.

revision: yes

Referee: [§3 (Dataset Curation)] The claim that dialogues from 13 TV series provide representative training examples for general audio role-playing is load-bearing, yet the manuscript provides no analysis of potential confounds such as dramatic prosody bias, character selection criteria, or how well the distribution matches everyday conversational role-play.

Authors: The selection of 13 TV series was intended to capture a wide variety of character interactions and vocal styles across different genres. We acknowledge the potential for biases inherent in scripted dramatic content, such as heightened prosody. In the revision, we will add an analysis subsection in §3 discussing character selection criteria, potential confounds including dramatic prosody bias, and a comparison of the dialogue distribution to more naturalistic conversations. We will also note limitations regarding generalizability to everyday role-play scenarios.

revision: yes

Referee: [§5 (Experiments)] The one-shot comparison to MiniCPM-O-2.6 is sensitive to whether test prompts remain within the same TV-series distribution; the paper must clarify whether held-out series or characters were used and report per-character or per-series breakdowns.

Authors: To ensure fair comparison, our experiments in §5 utilized held-out series and characters not seen during training of ARP-Model. We will clarify this in the revised manuscript and include per-character and per-series performance breakdowns for both acoustic and content personalization scores. This will demonstrate the robustness of the outperformance across different distributions.

revision: yes

Circularity Check

0 steps flagged

Empirical dataset and training results with no derivational circularity

The paper is an empirical contribution: it curates the AudioRole dataset from 13 TV series (1M+ dialogues, 1K+ hours) with speaker annotations, introduces the ARP-Eval dual-aspect framework, trains ARP-Model variants on the data, and reports measured scores (acoustic 0.31, content 0.36) against external baselines such as the original GLM-4-Voice and MiniCPM-O-2.6. No equations, first-principles derivations, or predictions appear in the provided text that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Performance claims rest on direct comparisons to independent models rather than internal renormalization or ansatz smuggling. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of scripted TV dialogues for general role-playing training and the validity of the two personalization metrics as proxies for role fidelity; these are domain assumptions rather than derived quantities.

axioms (1)

domain assumption: Dialogues extracted from TV series provide suitable and generalizable training examples for audio-grounded character role-playing in LLMs.
The dataset construction and model training assume that media dialogues capture the necessary semantic and vocal characteristics for the target task.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 5 internal anchors

[1] Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, and Jingren Zhou. 2024a. Socialbench: Sociality evaluation of role-playing conversational agents. Preprint, arXiv:2403.13679. https://arxiv.org/abs/2403.13679

[2] Nuo Chen, Yan Wang, Haiyun Jiang, Deng Cai, Yuhan Li, Ziyang Chen, Longyue Wang, and Jia Li. 2023. Large language models meet harry potter: A dataset for aligning dialogue agents with characters. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8506–8520.

[3] Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. 2024b. Voicebench: Benchmarking llm-based voice assistants. Preprint, arXiv:2410.17196. https://arxiv.org/abs/2410.17196

[4] Alex S Cohen, Thomas J Dinzeo, Neila J Donovan, Caitlin E Brown, and Sean C Morrison. 2015. Vocal acoustic analysis as a biometric indicator of information processing: Implications for neurological and psychiatric disorders. Psychiatry Research, 226(1):235–241.

[5] Jean Decety and Claus Lamm. 2006. Human empathy through the lens of social neuroscience. The scientific World journal, 6(1):1146–1163.

[6] Zhouhong Gu, Xiaoxuan Zhu, Haoran Guo, Lin Zhang, Yin Cai, Hao Shen, Jiangjie Chen, Zheyu Ye, Yifei Dai, Yan Gao, and 1 others. 2024. Agent group chat: An interactive group chat simulacra for better eliciting collective emergent behavior. arXiv e-prints, pages arXiv--2403.

[7] Fang Guo, Wenyu Li, Honglei Zhuang, Yun Luo, Yafu Li, Le Yan, Qi Zhu, and Yue Zhang. 2025. Mcranker: Generating diverse criteria on-the-fly to improve pointwise llm rankers. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, WSDM '25, page 944–953, New York, NY, USA. Associ... https://doi.org/10.1145/3701551.3703583

[8] Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, and Yossi Adi. 2024. Textually pretrained speech language models. Preprint, arXiv:2305.13009. https://arxiv.org/abs/2305.13009

[9] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang. 2017. Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849.

[10] Chun-Yi Kuan, Chen-An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-Yiin Chang, and Hung-yi Lee. 2023. Towards general-purpose text-instruction-guided voice conversion. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE.

[11] Paul Lerner, Juliette Bergoënd, Camille Guinaudeau, Hervé Bredin, Benjamin Maurice, Sharleyne Lefevre, Martin Bouteiller, Aman Berhe, Léo Galmant, Ruiqing Yin, and Claude Barras. 2022. Bazinga! a dataset for multi-party dialogues structuring. In Proceedings of the Thirteenth Language Resources and Eval... https://aclanthology.org/2022.lrec-1.367/

[12] Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi Mi, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, and 1 others. 2023. Chatharuhi: Reviving anime character in reality via large language model. arXiv preprint arXiv:2308.09597.

[13] Juntao Li, Chang Liu, Chongyang Tao, Zhangming Chan, Dongyan Zhao, Min Zhang, and Rui Yan. 2021. Dialogue history matters! personalized response selection in multi-turn retrieval-based chatbots. ACM Transactions on Information Systems (TOIS), 39(4):1–25.

[14] Wenyu Li, Yinuo Zhu, Xin Lin, Ming Li, Ziyue Jiang, and Ziqian Zeng. 2024. Zero-shot explainable mental health analysis on social media by incorporating mental scales. In Companion Proceedings of the ACM Web Conference 2024, WWW '24, page 959–962, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/3589335.3651584

[15] Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, and Yu Wang. 2025. Vocalbench: Benchmarking the vocal conversational abilities for speech interaction models. Preprint, arXiv:2505.15727. https://arxiv.org/abs/2505.15727

[16] Hieu-Thi Luong and Junichi Yamagishi. 2020. Nautilus: a versatile voice cloning system. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2967–2981.

[17] Xinlei Niu, Jing Zhang, and Charles Patrick Martin. 2024. Hybridvc: Efficient voice style conversion with text and audio prompts. arXiv preprint arXiv:2404.15637.

[18] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. Preprint, arXiv:2304.03442. https://arxiv.org/abs/2304.03442

[19] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. 2019. Autovc: Zero-shot voice style transfer with only autoencoder loss. In International Conference on Machine Learning, pages 5210–5219. PMLR.

[20] Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-llm: A trainable agent for role-playing. arXiv preprint arXiv:2310.10158.

[21] Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, and Wei-Ning Hsu. 2025. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. Preprint, arXiv:2502.05139. https://arxiv.org/abs/2502.05139

[22] Quan Tu, Chuanqi Chen, Jinpeng Li, Yanran Li, Shuo Shang, Dongyan Zhao, Ran Wang, and Rui Yan. 2023. Characterchat: Learning towards conversational ai with personalized social support. arXiv preprint arXiv:2308.10278.

[23] Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan. 2024. Charactereval: A chinese benchmark for role-playing conversational agent evaluation. arXiv preprint arXiv:2401.01275.

[24] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612.

[25] Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, and Bo Zheng. 2025. Vstyle: A benchmark for voice style adaptation with spoken instructions. Preprint, arXiv:2509.09716. https://arxiv.org/abs/2509.09716

[26] Li Zhao and Feifan Chen. 2020. Research on voice cloning with a few samples. In 2020 International Conference on Computer Network, Electronic and Automation (ICCNEA), pages 323–328. IEEE.

[27] Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, and 1 others. 2023. Characterglm: Customizing chinese conversational ai characters with large language models. arXiv preprint arXiv:2311.16832.