Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Abhinav Jain; K V Vijay Girish; Prabhat Pandey; Shanmukha Sahith; Soumyajit Mitra

arxiv: 2606.13544 · v2 · pith:GQZ2RVAAnew · submitted 2026-06-11 · eess.AS · cs.AI · cs.CL

Pith reviewed 2026-06-27 05:30 UTC · model grok-4.3

classification: eess.AS · cs.AI · cs.CL

keywords: turn-taking · multi-party conversation · voice agents · speech LLM · role conditioning · chain-of-thought

The pith

Explicit role conditioning improves turn-taking precision by over 40% in multi-party voice agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ModeratorLM, a streaming speech large language model for multi-party conversations that bases its turn-taking decisions on an explicitly assigned role. It includes an optional chain-of-thought variant that reasons over both context and role. Experiments on real meeting recordings and the new RolePlayConv synthetic dataset report more than 40% higher precision, over 70% higher recall, and fewer false interruptions than non-role baselines. A reader would care because voice agents in group settings often fail to judge when to speak or interrupt inappropriately. The work shows that feeding an assigned role into the model changes when the agent decides to take the floor.

Core claim

ModeratorLM conditions turn-taking behavior on an explicitly assigned role in multi-party settings using a chunk-wise streaming speech LLM, with a reasoning-augmented variant that adds chain-of-thought over context and role, yielding over 40% precision and over 70% recall gains plus fewer false-positive interruptions on real-world meeting data and RolePlayConv compared to non-role baselines.

What carries the argument

ModeratorLM, the role-conditioned speech LLM that takes an assigned role as explicit input to guide turn-taking decisions in streaming multi-party audio.

If this is right

Role diversity in a conversation directly affects the agent's ability to manage floor competition.
Adding chain-of-thought reasoning over role and context further reduces unwanted interruptions.
The same conditioning approach scales to real-time deployment because the model processes audio in chunks.
Synthetic role-play data can train agents for a wider range of assistant roles without new recordings.
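
As a rough illustration of the chunk-wise, role-conditioned decision loop described in the points above, the sketch below feeds the same stream of chunks to two agents that differ only in their assigned role. Everything in it is hypothetical: `StreamingTurnTaker`, `toy_policy`, the role strings, and the use of text chunks in place of audio are assumptions made for illustration, and the keyword rule merely stands in for the speech LLM. It is not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical decision function: (role, chunks heard so far) -> True to take the floor.
Policy = Callable[[str, List[str]], bool]


@dataclass
class StreamingTurnTaker:
    """Keeps a running history and asks the policy for a decision after every chunk."""
    role: str
    policy: Policy
    history: List[str] = field(default_factory=list)

    def step(self, chunk: str) -> bool:
        self.history.append(chunk)
        return self.policy(self.role, self.history)


def toy_policy(role: str, history: List[str]) -> bool:
    # Stand-in for the model: a note-taker never speaks,
    # any other role speaks only when the latest chunk addresses the moderator.
    if "note-taker" in role:
        return False
    return "moderator" in history[-1].lower()


chunks = ["Alice: I think we ship Friday.", "Bob: Moderator, can you recap?"]

moderator = StreamingTurnTaker("You are the meeting moderator.", toy_policy)
note_taker = StreamingTurnTaker("You are a silent note-taker.", toy_policy)

moderator_decisions = [moderator.step(c) for c in chunks]
note_taker_decisions = [note_taker.step(c) for c in chunks]
print(moderator_decisions, note_taker_decisions)
```

The point of the design is that the decision is made once per incoming chunk, and the role is an explicit argument to the policy, so identical input can lead to different floor-taking behavior.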

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Role conditioning could be applied to other agent behaviors such as response length or topic choice in the same architecture.
Voice agents trained this way might integrate more seamlessly into existing meeting tools without per-scenario hand-tuning.
Extending the method to live human-only meetings with implicit role inference would test robustness beyond assigned roles.

Load-bearing premise

The measured gains in precision and recall come from conditioning on the assigned role rather than from other modeling or data choices.

What would settle it

An ablation that removes the role input while holding model architecture, training, and test data fixed; absence of the reported gains would falsify the central claim.
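
A minimal sketch of how such an ablation could be scored, assuming turn-taking is evaluated as a binary decision per chunk (take the floor or stay silent). The paper's exact metric definitions are not given in the abstract, so the definitions here are the standard classification ones, and the labels and predictions are made-up toy values, not results from the paper.

```python
def turn_taking_metrics(gold, pred):
    """Precision, recall, and false-positive rate for binary per-chunk decisions."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)          # spoke when it should
    fp = sum(1 for g, p in zip(gold, pred) if p and not g)      # false interruption
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)      # missed turn
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)  # correctly silent
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy data: the same test chunks scored for two runs that differ only in the role input.
gold      = [1, 0, 0, 1, 0, 1, 0, 0]
with_role = [1, 0, 0, 1, 0, 0, 0, 0]
no_role   = [1, 1, 0, 0, 1, 0, 0, 0]

m_role = turn_taking_metrics(gold, with_role)
m_base = turn_taking_metrics(gold, no_role)
print(m_role)
print(m_base)
```

Because both runs are scored against the same gold labels, any difference between `m_role` and `m_base` can only come from the predictions, which is the comparison the ablation needs.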

Figures

Figures reproduced from arXiv: 2606.13544 by Abhinav Jain, K V Vijay Girish, Prabhat Pandey, Shanmukha Sahith, Soumyajit Mitra.

**Figure 1.** Example input–output sequence of the LLM for the “ModeratorLM-Think” model. No reasoning trace is produced in Chunk 1. A reasoning trace appears in Chunk 2 without turn-taking, while in Chunk 3 the assistant takes the floor. Since this work focuses on modeling turn-taking behavior, the assistant’s responses are generated in text form rather than speech codes. A streaming TTS module (e.…

Original abstract

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Role conditioning is a straightforward extension but the experiments do not isolate it as the cause of the reported turn-taking gains.

The paper's core move is to feed an explicit role token into a streaming speech LLM so the model knows its part in a multi-party conversation, plus an optional chain-of-thought variant. They also release RolePlayConv, a synthetic multi-party dataset built around varied assistant roles. On real meeting recordings and their own data they report large lifts in turn-taking precision and recall plus fewer false interruptions versus non-role baselines.

That is a usable engineering pattern for anyone shipping voice agents that have to share the floor. The chunk-wise streaming design matches real deployment constraints, and constructing a role-diverse synthetic set is a reasonable way to get scale when natural multi-party speech data is scarce.

The soft spot is the missing isolation. The abstract compares against non-role-conditioned baselines, yet gives no evidence that those baselines matched ModeratorLM in training data, prompt length, fine-tuning steps, and architecture, so that only the role token changed. If RolePlayConv embeds correlations between role labels and turn statistics, any model that sees the role could exploit them without the conditioning mechanism itself being responsible. No ablation that holds data and training fixed while toggling only the role input is described. Precision and recall for turn-taking are also left undefined, and the abstract supplies no error bars or statistical tests.

This is for practitioners who need a concrete starting point for multi-party voice agents rather than for readers seeking a new theoretical result or a tightly controlled causal demonstration. The work shows clear engineering thinking and honest engagement with a practical gap, so it deserves a serious referee even though the current evidence leaves the central claim under-supported.

Referee Report

3 major / 1 minor

Summary. The paper proposes ModeratorLM, a role-conditioned speech LLM for turn-taking in multi-party voice conversations that operates chunk-wise in streaming mode, with an optional chain-of-thought reasoning variant over context and role. It introduces the synthetic RolePlayConv dataset of multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv report >40% gains in turn-taking precision and >70% in recall, plus reduced false-positive interruptions, relative to non-role-conditioned baselines.

Significance. If the reported gains can be shown to be causally attributable to explicit role conditioning (rather than dataset artifacts or unablated modeling choices) and if the turn-taking metrics are rigorously defined with appropriate statistical support, the work would advance real-time multi-party voice agents. The construction of a large-scale synthetic multi-party dataset is a constructive contribution that could support future research.
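
One possible form of the statistical support mentioned above is a paired bootstrap confidence interval on the precision gain. The sketch below is an assumption about how this could be done, not a procedure taken from the paper: the function names, the per-decision resampling unit, and the default of 2000 resamples are all illustrative choices.

```python
import random


def precision(gold, pred):
    """Fraction of floor-taking decisions that were correct (0.0 if none were made)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if p and not g)
    return tp / (tp + fp) if tp + fp else 0.0


def bootstrap_precision_gain(gold, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for precision(A) - precision(B) on paired decisions."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        # Resample decision indices with replacement; both systems share the same indices.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(precision(g, a) - precision(g, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy paired decisions (illustrative values only).
gold      = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
with_role = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
no_role   = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

interval = bootstrap_precision_gain(gold, with_role, no_role)
print(interval)
```

If the interval excludes zero, the precision gain is unlikely to be a resampling accident. In practice, decisions within one conversation are correlated, so resampling whole conversations rather than individual decisions would probably be the more defensible unit.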

major comments (3)

[Abstract] Precision and recall for turn-taking are never defined, baseline architectures/training procedures are not described, and no error bars or statistical tests are mentioned. These omissions make it impossible to evaluate whether the claimed >40% precision and >70% recall improvements support the central claim.
[Abstract] The comparison is only to 'non-role-conditioned baselines' without stating whether those baselines differ solely in the presence/absence of the role token or also in training data composition, prompt length, or fine-tuning procedure. An ablation that holds data, architecture, and training fixed while toggling only the role input is required to isolate the causal contribution of role conditioning.
[Abstract] RolePlayConv is a synthetic dataset constructed with diverse assistant roles. Without explicit checks that role labels are uncorrelated with other conversational statistics (turn lengths, interruption patterns, etc.) or without independent validation on held-out real data, gains may reduce to exploitation of dataset artifacts rather than the proposed conditioning mechanism.
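
The artifact check requested in the last comment could be approximated with a correlation ratio (eta squared) between the categorical role label and a numeric conversational statistic. The sketch below is one assumed way to run such a check; the role names and turn lengths are invented toy values and do not come from RolePlayConv.

```python
from collections import defaultdict


def correlation_ratio(labels, values):
    """Eta squared: share of the variance in `values` explained by the group `labels`.

    0.0 means the statistic is the same on average for every label,
    1.0 means the label fully determines the statistic.
    """
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    grand_mean = sum(values) / len(values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total if ss_total else 0.0


roles = ["moderator", "moderator", "note-taker", "note-taker"]

# Turn lengths (seconds) that do not depend on the role.
eta_independent = correlation_ratio(roles, [2.0, 4.0, 2.0, 4.0])

# Turn lengths that are fully determined by the role.
eta_leaky = correlation_ratio(roles, [5.0, 5.0, 1.0, 1.0])

print(eta_independent, eta_leaky)
```

A value near 1.0 for statistics such as turn length or interruption rate would suggest that a model could predict turn-taking from the role label alone, which is the shortcut the referee is worried about.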

minor comments (1)

[Abstract] Clarify the precise definition of the speech LLM backbone, the chunk size used for streaming, and how role tokens are injected into the input stream.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address each of the major comments below and will revise the manuscript accordingly to improve clarity and address concerns about causal attribution and dataset validity.

Referee: [Abstract] Precision and recall for turn-taking are never defined, baseline architectures/training procedures are not described, and no error bars or statistical tests are mentioned. These omissions make it impossible to evaluate whether the claimed >40% precision and >70% recall improvements support the central claim.

Authors: We agree that the abstract lacks explicit definitions of the turn-taking precision and recall metrics. These are defined in the methods section of the manuscript as standard classification metrics applied to turn-taking decisions. We will revise the abstract to include concise definitions. The baseline architectures are the same speech LLM without role conditioning, using identical training procedures and data. We will add error bars and mention statistical significance tests in the experimental results section of the revised manuscript.

revision: partial

Referee: [Abstract] The comparison is only to 'non-role-conditioned baselines' without stating whether those baselines differ solely in the presence/absence of the role token or also in training data composition, prompt length, or fine-tuning procedure. An ablation that holds data, architecture, and training fixed while toggling only the role input is required to isolate the causal contribution of role conditioning.

Authors: The non-role-conditioned baselines are constructed by removing only the role token from the input while keeping the model architecture, training data composition, prompt structure, and fine-tuning procedure identical. This isolates the effect of role conditioning. We will explicitly clarify this in the revised manuscript and ensure an ablation study isolating the role input is presented if not already detailed.

revision: yes

Referee: [Abstract] RolePlayConv is a synthetic dataset constructed with diverse assistant roles. Without explicit checks that role labels are uncorrelated with other conversational statistics (turn lengths, interruption patterns, etc.) or without independent validation on held-out real data, gains may reduce to exploitation of dataset artifacts rather than the proposed conditioning mechanism.

Authors: The manuscript reports results on both the synthetic RolePlayConv dataset and real-world meeting data, providing independent validation on held-out real conversations. To further address potential dataset artifacts in RolePlayConv, we will include an analysis in the revision examining correlations between assigned roles and conversational features such as turn lengths and interruption rates.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on direct evaluation

The manuscript presents an empirical system (ModeratorLM) with role conditioning, a synthetic dataset (RolePlayConv), and reported gains versus baselines on both synthetic and real-world data. No equations, derivations, or first-principles predictions appear in the provided text. Performance metrics are obtained from explicit experiments rather than any quantity fitted on the evaluation set and then re-labeled as a prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core modeling choices. The central claim therefore remains self-contained against external benchmarks and does not reduce to its inputs by construction under any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Central claim rests on the assumption that role labels are available at inference time and that the synthetic RolePlayConv distribution matches real multi-party dynamics sufficiently for the reported gains to transfer.

axioms (1)

domain assumption: Explicit role assignment is provided to the model and causally affects turn-taking decisions.
The system is defined as conditioning behavior on the assigned role.

invented entities (2)

ModeratorLM (no independent evidence)
purpose: Role-conditioned streaming speech LLM for turn-taking
New system introduced in the work.
RolePlayConv (no independent evidence)
purpose: Large-scale synthetic multi-party conversation dataset with role labels
Constructed for training and evaluation in this paper.

pith-pipeline@v0.9.1-grok · 5675 in / 1177 out tokens · 23048 ms · 2026-06-27T05:30:32.196047+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 7 linked inside Pith

[1]

Introduction. Recent advances in large language models (LLMs) have driven rapid progress in the development of voice-based conversational agents [1, 2, 3, 4]. Modern spoken dialogue systems typically combine low-latency streaming speech processing modules with a core conversational component responsible for dialogue management and response generation...
[2]

ModeratorLM-Think

Methodology. 2.1. ModeratorLM: System Architecture. ModeratorLM consists of a speech encoder and a backbone LLM. The speech encoder processes each incoming audio chunk independently and produces chunk-level embeddings. These embeddings are projected into the LLM embedding space via a trainable linear projection layer, following prior work [18, 19]. The multi...

Pith/arXiv arXiv 2026
[3]

assistant

Experimental Setup. 3.1. Training Setup. We use Qwen3-4B-Instruct-2507 [27] as the backbone LLM for ModeratorLM, and Qwen3-4B-Thinking-2507 for ModeratorLM-Think. For speech representation, we employ an in-house speech encoder trained with variable lookahead similar to [28, 29], enabling block-wise attention during the inference on variable-sized chunks. To ...
[4]

assistant

(NSF-1), which consists of real recordings of formal meetings with approximately four speakers per session and an average duration of six minutes. As NSF-1 lacks role labels, we designate one speaker as the “assistant” using a hybrid procedure that aggregates LLM-based rankings of assistant-like behavior with independent human evaluations. A role desc...
[5]

Main Results. Table 2 compares ModeratorLM variants against non-role-conditioned baselines on NSF-1 and RolePlayConv datasets

Results. 4.1. Main Results. Table 2 compares ModeratorLM variants against non-role-conditioned baselines on NSF-1 and RolePlayConv datasets. Moshi, trained on dyadic conversations, fails to generalize to multi-speaker settings, exhibiting very low recall and high false positive rates. The MP-Baseline, trained on multi-party conversations but without role condit...
[6]

Conclusions. In this work, we introduced a role-playing voice agent for multi-party conversations that modulates turn-taking behavior based on an assigned role. Experimental results show that role-conditioned fine-tuning yields turn-taking decisions better aligned with configured preferences, and that incorporating chain-of-thought reasoning further im...
[7]

Acknowledgments. We would like to thank Ajay Srinivasamurthy, Volker Leutnant, Adam Kaplan, Andreas Schwarz, Raghavendra Bilgi and Sri Garimella for their support and valuable feedback
[8]

Generative AI was not used in the conceptualization, experimental design, or generation of the core scientific content

Generative AI Use Disclosure. The authors acknowledge the use of generative AI tools during the preparation of this paper strictly for the purposes of editing, polishing, and improving the readability of the manuscript. Generative AI was not used in the conceptualization, experimental design, or generation of the core scientific content. All co-authors...
[9]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

X. Wang, Y. Li, C. Fu, Y. Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” arXiv preprint arXiv:2411.00774, 2024

arXiv 2024
[10]

Mini-omni: Language models can hear, talk while thinking in streaming

Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,” URL https://arxiv.org/abs/2408.16725, 2024

arXiv 2024
[11]

Minmo: A multimodal large language model for seamless voice interaction

Q. Chen, Y. Chen, Y. Chen, M. Chen, Y. Chen, C. Deng, Z. Du, R. Gao, C. Gao, Z. Gao et al., “Minmo: A multimodal large language model for seamless voice interaction,” arXiv preprint arXiv:2501.06282, 2025

arXiv 2025
[12]

Kimi-audio technical report

D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025

Pith/arXiv arXiv 2025
[13]

Moshi: a speech-text foundation model for real-time dialogue

A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

Pith/arXiv arXiv 2024
[14]

Language model can listen while speaking

Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen, “Language model can listen while speaking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24831–24839

2025
[15]

Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation

W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024

arXiv 2024
[16]

Talking turns: Benchmarking audio foundation models on turn-taking dynamics

S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” arXiv preprint arXiv:2503.01174, 2025

arXiv 2025
[17]

Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models

G.-T. Lin, S.-Y. S. Kuan, Q. Wang, J. Lian, T. Li, and H.-y. Lee, “Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models,” arXiv preprint arXiv:2507.23159, 2025

Pith/arXiv arXiv 2025
[18]

Multi-party conversational agents: A survey

S. Sapkota, M. S. Hasan, M. Shah, and S. Karmaker, “Multi-party conversational agents: A survey,” arXiv preprint arXiv:2505.18845, 2025

arXiv 2025
[19]

The design and implementation of xiaoice, an empathetic social chatbot

L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and implementation of xiaoice, an empathetic social chatbot,” Computational Linguistics, vol. 46, no. 1, pp. 53–93, 2020

2020
[20]

From persona to personalization: A survey on role-playing language agents

J. Chen, X. Wang, R. Xu, S. Yuan, Y. Zhang, W. Shi, J. Xie, S. Li, R. Yang, T. Zhu et al., “From persona to personalization: A survey on role-playing language agents,” arXiv [cs.CL], April 2024

2024
[21]

Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models

N. Wang, Z. Peng, H. Que, J. Liu, W. Zhou, Y. Wu, H. Guo, R. Gan, Z. Ni, J. Yang et al., “Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 14743–14777

2024
[22]

Characterglm: Customizing social characters with large language models

J. Zhou, Z. Chen, D. Wan, B. Wen, Y. Song, J. Yu, Y. Huang, P. Ke, G. Bi, L. Peng et al., “Characterglm: Customizing social characters with large language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024, pp. 1457–1476

2024
[23]

Capturing minds, not just words: Enhancing role-playing language models with personality-indicative data

Y. Ran, X. Wang, R. Xu, X. Yuan, J. Liang, Y. Xiao, and D. Yang, “Capturing minds, not just words: Enhancing role-playing language models with personality-indicative data,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 14566–14576

2024
[24]

Character is destiny: Can role-playing language agents make persona-driven decisions?

R. Xu, X. Wang, J. Chen, S. Yuan, X. Yuan, J. Liang, Z. Chen, X. Dong, and Y. Xiao, “Character is destiny: Can role-playing language agents make persona-driven decisions?” arXiv preprint arXiv:2404.12138, 2024

arXiv 2024
[25]

Chain-of-thought prompting elicits reasoning in large language models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022

2022
[26]

Sift-50m: A large-scale multilingual dataset for speech instruction fine-tuning

P. Pandey, R. V. Swaminathan, K. Girish, A. Sen, J. Xie, G. P. Strimel, and A. Schwarz, “Sift-50m: A large-scale multilingual dataset for speech instruction fine-tuning,” arXiv preprint arXiv:2504.09081, 2025

arXiv 2025
[27]

Efficient streaming llm for speech recognition

J. Jia, G. Keren, W. Zhou, E. Lakomkin, X. Zhang, C. Wu, F. Seide, J. Mahadeokar, and O. Kalinli, “Efficient streaming llm for speech recognition,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025
[28]

Omniflatten: An end-to-end gpt model for seamless voice conversation

Q. Zhang, L. Cheng, C. Deng, Q. Chen, W. Wang, S. Zheng, J. Liu, H. Yu, C. Tan, Z. Du et al., “Omniflatten: An end-to-end gpt model for seamless voice conversation,” arXiv preprint arXiv:2410.17799, 2024

arXiv 2024
[29]

Qwen2.5-omni technical report

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

Pith/arXiv arXiv 2025
[30]

Coser: Coordinating llm-based persona simulation of established roles

X. Wang, H. Wang, Y. Zhang, X. Yuan, R. Xu, J.-t. Huang, S. Yuan, H. Guo, J. Chen, S. Zhou et al., “Coser: Coordinating llm-based persona simulation of established roles,” in Forty-second International Conference on Machine Learning, 2025

2025
[31]

Meld: A multimodal multi-party dataset for emotion recognition in conversations

S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” in Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 527–536

2019
[32]

Pachat: Persona-aware speech assistant for multi-party dialogue

D. Fu, X. Cheng, L. Li, X. Yang, L. Yang, and T. Jin, “Pachat: Persona-aware speech assistant for multi-party dialogue,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 29313–29330

2025
[33]

The amazon nova family of models: Technical report and model card

A. A. G. Intelligence, “The amazon nova family of models: Technical report and model card,” 2024

2024
[34]

Zonos: an open-weight multilingual text-to-speech model (zonos-v0.1),

Zyphra, “Zonos: an open-weight multilingual text-to-speech model (zonos-v0.1),” https://github.com/Zyphra/Zonos, 2025

2025
[35]

Qwen3 technical report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[36]

Variable attention masking for configurable transformer transducer speech recognition

P. Swietojanski, S. Braun, D. Can, T. F. Da Silva, A. Ghoshal, T. Hori, R. Hsiao, H. Mason, E. McDermott, H. Silovsky et al., “Variable attention masking for configurable transformer transducer speech recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[37]

Durep: Dual-mode speech representation learning via asr-aware distillation

P. R. Male, S. N. Ray, H. Arsikere, A. Jaiswal, P. Swarup, P. Sen, D. Chakrabarty, K. V. V. Girish, N. Bhave, F. Weber, S. Bhattacharya, and S. Garimella, “Durep: Dual-mode speech representation learning via asr-aware distillation,” 2025. [Online]. Available: https://arxiv.org/abs/2505.19774

arXiv 2025
[38]

Montreal forced aligner: Trainable text-speech alignment using kaldi

M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi,” in Interspeech, vol. 2017, 2017, pp. 498–502

2017
[39]

Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in ACL, 2021

2021
[40]

Mls: A large-scale multilingual dataset for scalable speech research

V. Pratap, Q. Xu, S. Aniceto et al., “Mls: A large-scale multilingual dataset for scalable speech research,” in Interspeech, 2020

2020
[41]

Common voice: A massively-multilingual speech corpus

R. Ardila, M. Branson, K. Davis et al., “Common voice: A massively-multilingual speech corpus,” in LREC, 2020

2020
[42]

The people’s speech: A large-scale diverse english speech recognition dataset for commercial usage

D. Galvez et al., “The people’s speech: A large-scale diverse english speech recognition dataset for commercial usage,” arXiv preprint arXiv:2111.09340, 2021

arXiv 2021
[43]

The ami meeting corpus: A pre-announcement

J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The ami meeting corpus: A pre-announcement,” in International workshop on machine learning for multimodal interaction. Springer, 2005, pp. 28–39

2005
[44]

The fisher corpus: A resource for the next generations of speech-to-text

C. Cieri, D. Miller, and K. Walker, “The fisher corpus: A resource for the next generations of speech-to-text.”
[45]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021

Pith/arXiv arXiv 2021
[46]

Notsofar-1 challenge: New datasets, baseline, and tasks for distant meeting transcription

A. Vinnikov, A. I. Mark, A. Hurvitz, I. Abramovski, S. Koubi, I. Gurvich, S. Peer, X. Xiao, B. Elizalde, N. Kanda et al., “Notsofar-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proc. CHiME 2024, 2024

2024
[47]

Qwq-32b: Embracing the power of reinforcement learning,

Q. Team, “Qwq-32b: Embracing the power of reinforcement learning,” 2025

2025
[48]

Judging llm-as-a-judge with mt-bench and chatbot arena

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in neural information processing systems, vol. 36, pp. 46595–46623, 2023

2023
[49]

Claude 3.5 Sonnet

Anthropic, “Claude 3.5 Sonnet,” https://www.anthropic.com/news/claude-3-5-sonnet, 2024

2024
[50]

Streaming sequence-to-sequence learning with delayed streams modeling

N. Zeghidour, E. Kharitonov, M. Orsini, V. Volhejn, G. de Marmiesse, E. Grave, P. Pérez, L. Mazaré, and A. Défossez, “Streaming sequence-to-sequence learning with delayed streams modeling,” arXiv preprint arXiv:2509.08753, 2025

arXiv 2025

[1] [1]

Introduction Recent advances in large language models (LLMs) have driven rapid progress in the development of voice-based conversational agents [1, 2, 3, 4]. Modern spoken dialogue systems typi- cally combine low-latency streaming speech processing mod- ules with a core conversational component responsible for di- alogue management and response generation...

[2] [2]

ModeratorLM-Think

Methodology 2.1. ModeratorLM: System Architecture ModeratorLMconsists of a speech encoder and a backbone LLM. The speech encoder processes each incoming audio chunk independently and produces chunk-level embeddings. These embeddings are projected into the LLM embedding space via a trainable linear projection layer, following prior work [18, 19]. The multi...

Pith/arXiv arXiv 2026

[3] [3]

as- sistant

Experimental Setup 3.1. Training Setup We useQwen3-4B-Instruct-2507[27] as the backbone LLM forModeratorLM, andQwen3-4B-Thinking-2507for ModeratorLM-Think. For speech representation, we employ an in-house speech encoder trained with variable lookahead simi- lar to [28, 29], enabling block-wise attention during the infer- ence on variable-sized chunks. To ...

[4] [4]

assistant

(NSF-1), which consists of real recordings of formal meet- ings with approximately four speakers per session and an aver- age duration of six minutes. AsNSF-1lacks role labels, we des- ignate one speaker as the “assistant” using a hybrid procedure that aggregates LLM-based rankings of assistant-like behavior with independent human evaluations. A role desc...

[5] [5]

Main Results Table 2 comparesModeratorLMvariants against non-role- conditioned baselines onNSF-1andRolePlayConvdatasets

Results 4.1. Main Results Table 2 comparesModeratorLMvariants against non-role- conditioned baselines onNSF-1andRolePlayConvdatasets. Moshi, trained on dyadic conversations, fails to generalize to multi-speaker settings, exhibiting very low recall and high false positive rates. TheMP-Baseline, trained on multi-party conver- sations but without role condit...

[6] [6]

Conclusions In this work, we introduced a role-playing voice agent for multi-party conversations that modulates turn-taking behavior based on an assigned role. Experimental results show that role-conditioned fine-tuning yields turn-taking decisions bet- ter aligned with configured preferences, and that incorporat- ing chain-of-thought reasoning further im...

[7] [7]

Acknowledgments We would like to thank Ajay Srinivasamurthy, V olker Leutnant, Adam Kaplan, Andreas Schwarz, Raghavendra Bilgi and Sri Garimella for their support and valuable feedback

Generative AI Use Disclosure

The authors acknowledge the use of generative AI tools during the preparation of this paper strictly for the purposes of editing, polishing, and improving the readability of the manuscript. Generative AI was not used in the conceptualization, experimental design, or generation of the core scientific content. All co-authors...

[9] X. Wang, Y. Li, C. Fu, Y. Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” arXiv preprint arXiv:2411.00774, 2024.

[10] Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,” 2024. [Online]. Available: https://arxiv.org/abs/2408.16725

[11] Q. Chen, Y. Chen, Y. Chen, M. Chen, Y. Chen, C. Deng, Z. Du, R. Gao, C. Gao, Z. Gao et al., “Minmo: A multimodal large language model for seamless voice interaction,” arXiv preprint arXiv:2501.06282, 2025.

[12] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025.

[13] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024.

[14] Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen, “Language model can listen while speaking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24831–24839.

[15] W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024.

[16] S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” arXiv preprint arXiv:2503.01174, 2025.

[17] G.-T. Lin, S.-Y. S. Kuan, Q. Wang, J. Lian, T. Li, and H.-y. Lee, “Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models,” arXiv preprint arXiv:2507.23159, 2025.

[18] S. Sapkota, M. S. Hasan, M. Shah, and S. Karmaker, “Multi-party conversational agents: A survey,” arXiv preprint arXiv:2505.18845, 2025.

[19] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and implementation of xiaoice, an empathetic social chatbot,” Computational Linguistics, vol. 46, no. 1, pp. 53–93, 2020.

[20] J. Chen, X. Wang, R. Xu, S. Yuan, Y. Zhang, W. Shi, J. Xie, S. Li, R. Yang, T. Zhu et al., “From persona to personalization: A survey on role-playing language agents,” arXiv [cs.CL], April 2024.

[21] N. Wang, Z. Peng, H. Que, J. Liu, W. Zhou, Y. Wu, H. Guo, R. Gan, Z. Ni, J. Yang et al., “Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 14743–14777.

[22] J. Zhou, Z. Chen, D. Wan, B. Wen, Y. Song, J. Yu, Y. Huang, P. Ke, G. Bi, L. Peng et al., “Characterglm: Customizing social characters with large language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024, pp. 1457–1476.

[23] Y. Ran, X. Wang, R. Xu, X. Yuan, J. Liang, Y. Xiao, and D. Yang, “Capturing minds, not just words: Enhancing role-playing language models with personality-indicative data,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 14566–14576.

[24] R. Xu, X. Wang, J. Chen, S. Yuan, X. Yuan, J. Liang, Z. Chen, X. Dong, and Y. Xiao, “Character is destiny: Can role-playing language agents make persona-driven decisions?” arXiv preprint arXiv:2404.12138, 2024.

[25] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.

[26] P. Pandey, R. V. Swaminathan, K. Girish, A. Sen, J. Xie, G. P. Strimel, and A. Schwarz, “Sift-50m: A large-scale multilingual dataset for speech instruction fine-tuning,” arXiv preprint arXiv:2504.09081, 2025.

[27] J. Jia, G. Keren, W. Zhou, E. Lakomkin, X. Zhang, C. Wu, F. Seide, J. Mahadeokar, and O. Kalinli, “Efficient streaming llm for speech recognition,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[28] Q. Zhang, L. Cheng, C. Deng, Q. Chen, W. Wang, S. Zheng, J. Liu, H. Yu, C. Tan, Z. Du et al., “Omniflatten: An end-to-end gpt model for seamless voice conversation,” arXiv preprint arXiv:2410.17799, 2024.

[29] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025.

[30] X. Wang, H. Wang, Y. Zhang, X. Yuan, R. Xu, J.-t. Huang, S. Yuan, H. Guo, J. Chen, S. Zhou et al., “Coser: Coordinating llm-based persona simulation of established roles,” in Forty-second International Conference on Machine Learning, 2025.

[31] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” in Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 527–536.

[32] D. Fu, X. Cheng, L. Li, X. Yang, L. Yang, and T. Jin, “Pachat: Persona-aware speech assistant for multi-party dialogue,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 29313–29330.

[33] Amazon Artificial General Intelligence, “The amazon nova family of models: Technical report and model card,” 2024.

[34] Zyphra, “Zonos: an open-weight multilingual text-to-speech model (zonos-v0.1),” https://github.com/Zyphra/Zonos, 2025.

[35] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.

[36] P. Swietojanski, S. Braun, D. Can, T. F. Da Silva, A. Ghoshal, T. Hori, R. Hsiao, H. Mason, E. McDermott, H. Silovsky et al., “Variable attention masking for configurable transformer transducer speech recognition,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[37] P. R. Male, S. N. Ray, H. Arsikere, A. Jaiswal, P. Swarup, P. Sen, D. Chakrabarty, K. V. V. Girish, N. Bhave, F. Weber, S. Bhattacharya, and S. Garimella, “Durep: Dual-mode speech representation learning via asr-aware distillation,” 2025. [Online]. Available: https://arxiv.org/abs/2505.19774

[38] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi,” in Interspeech, vol. 2017, 2017, pp. 498–502.

[39] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in ACL, 2021.

[40] V. Pratap, Q. Xu, S. Aniceto et al., “Mls: A large-scale multilingual dataset for scalable speech research,” in Interspeech, 2020.

[41] R. Ardila, M. Branson, K. Davis et al., “Common voice: A massively-multilingual speech corpus,” in LREC, 2020.

[42] D. Galvez et al., “The people’s speech: A large-scale diverse english speech recognition dataset for commercial usage,” arXiv preprint arXiv:2111.09340, 2021.

[43] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The ami meeting corpus: A pre-announcement,” in International workshop on machine learning for multimodal interaction. Springer, 2005, pp. 28–39.

[44] C. Cieri, D. Miller, and K. Walker, “The fisher corpus: A resource for the next generations of speech-to-text.”

[45] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.

[46] A. Vinnikov, A. I. Mark, A. Hurvitz, I. Abramovski, S. Koubi, I. Gurvich, S. Peer, X. Xiao, B. Elizalde, N. Kanda et al., “Notsofar-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proc. CHiME 2024, 2024.

[47] Qwen Team, “Qwq-32b: Embracing the power of reinforcement learning,” 2025.

[48] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in neural information processing systems, vol. 36, pp. 46595–46623, 2023.

[49] Anthropic, “Claude 3.5 Sonnet,” https://www.anthropic.com/news/claude-3-5-sonnet, 2024.

[50] N. Zeghidour, E. Kharitonov, M. Orsini, V. Volhejn, G. de Marmiesse, E. Grave, P. Pérez, L. Mazaré, and A. Défossez, “Streaming sequence-to-sequence learning with delayed streams modeling,” arXiv preprint arXiv:2509.08753, 2025.