Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Atsunori Ogawa; Atsushi Ando; Marc Delcroix; Naohiro Tawara; Ryo Fukuda; Shinji Watanabe; Siddhant Arora; Takatomo Kano; William Chen; Yuya Chiba

arxiv: 2606.17542 · v1 · pith:D3AABQ3Xnew · submitted 2026-06-16 · 💻 cs.CL

Pith reviewed 2026-06-27 01:11 UTC · model grok-4.3

keywords large language models · turn-taking · next speaker prediction · addressee detection · multimodal meetings · AMI corpus

The pith

LLMs outperform supervised models and humans at next speaker prediction in meetings using only text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evaluation setup to test large language models on three turn-taking tasks in multi-party meetings: identifying who is being addressed, whether the speaker will change, and who will speak next. It runs these tests on the AMI corpus and pits text-only LLMs, multimodal LLMs, task-specific supervised models, and human judges against one another. Text-based LLMs reach higher accuracy than the other three groups on next-speaker prediction even though they receive no domain-specific training and no audio or video input. Multimodal LLMs improve results on addressee and turn-change detection yet still fall short of human performance, showing limited ability to use raw audiovisual signals. Ablation checks confirm that surrounding conversation history drives most of the predictive power.

Core claim

Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction, and that human and LLM prediction patterns were similar.

What carries the argument

The three-task evaluation framework that feeds meeting transcripts to LLMs, supervised models, and humans and measures accuracy on addressee detection, turn-change prediction, and next speaker prediction.
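
As a rough illustration of what such a framework computes, the sketch below scores one system's predictions against gold labels for each of the three tasks. The record layout, the label strings, and the function name `per_task_accuracy` are assumptions made for illustration; the paper's actual data format and scoring code are not shown in this summary.

```python
TASKS = ("addressee", "turn_change", "next_speaker")


def per_task_accuracy(gold, pred):
    """Return {task: accuracy} for two parallel lists of per-utterance label dicts."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    correct = {task: 0 for task in TASKS}
    total = {task: 0 for task in TASKS}
    for g, p in zip(gold, pred):
        for task in TASKS:
            if task not in g:  # skip utterances with no gold label for this task
                continue
            total[task] += 1
            correct[task] += int(p.get(task) == g[task])
    # Only report tasks that have at least one scored item.
    return {task: correct[task] / total[task] for task in TASKS if total[task]}


# Hypothetical toy data, not drawn from the AMI corpus.
gold = [
    {"addressee": "Group", "turn_change": "Shift", "next_speaker": "B"},
    {"addressee": "A", "turn_change": "Hold", "next_speaker": "B"},
]
pred = [
    {"addressee": "Group", "turn_change": "Shift", "next_speaker": "C"},
    {"addressee": "Group", "turn_change": "Hold", "next_speaker": "B"},
]
scores = per_task_accuracy(gold, pred)
print(scores)
```

Running the same function over the outputs of each system (LLM, supervised model, human judge) would give directly comparable per-task numbers, provided all of them are scored on the same items.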

If this is right

Conversational context supplies the main signal for next speaker prediction across all systems tested.
Humans and LLMs exhibit similar difficulty on stretches of rapid turn changes.
Multimodal LLMs extract some benefit from audio-visual input on addressee and turn-change tasks but not enough to match people.
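
A context ablation of this kind amounts to varying how many preceding utterances a system is shown. The sketch below is one hedged way to do that for a text-only input; the `(speaker, text)` tuple layout, the `speaker: text` line format, and the function name `format_context` are assumptions for illustration and are not taken from the paper's prompts.

```python
def format_context(history, context_size):
    """Render the last `context_size` utterances as 'speaker: text' lines.

    history: list of (speaker, text) tuples in chronological order.
    context_size: number of most recent utterances to keep (0 keeps none).
    """
    if context_size < 0:
        raise ValueError("context_size must be non-negative")
    # history[-0:] would return the whole list, so handle zero explicitly.
    kept = history[-context_size:] if context_size > 0 else []
    return "\n".join(f"{speaker}: {text}" for speaker, text in kept)


# Hypothetical three-utterance history.
history = [("A", "hi"), ("B", "hello"), ("C", "ok")]
short_context = format_context(history, 2)
no_context = format_context(history, 0)
print(short_context)
```

Sweeping `context_size` over several values and re-scoring each task would show how much accuracy depends on the surrounding conversation.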

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support lightweight meeting-assistant tools that run on transcripts alone and require no camera or microphone arrays.
Error-pattern overlap suggests LLMs may be learning some of the same implicit turn-taking rules that people use.
Repeating the study on non-English meetings would test whether the observed advantage generalizes beyond the AMI data.

Load-bearing premise

The text prompts supplied to the LLMs contain information comparable in kind and completeness to the features and context given to the supervised models and human annotators.

What would settle it

Running the same next-speaker prediction test on a second meeting corpus and finding that LLMs no longer exceed human or supervised accuracy would undermine the central result.

Figures

Figures reproduced from arXiv: 2606.17542 by Atsunori Ogawa, Atsushi Ando, Marc Delcroix, Naohiro Tawara, Ryo Fukuda, Shinji Watanabe, Siddhant Arora, Takatomo Kano, William Chen, Yuya Chiba.

**Figure 1.** Screenshot of the experiment tool.

**Figure 2.** Effect of context size on Qwen3-14B performance.

**Figure 3.** Temporal variation of addressee detection and next speaker prediction accuracies across 1-minute windows. Shaded region indicates the min–max range across participants. [Plot text recovered from the extraction: panel title "Multisimo (S02)"; panels "Addressee" and "Next Speaker"; series "Human (mean)", "Human (min-max)", "Gemini 2.5 Pro"; x-axis "Time (s)".]

**Figure 4.** Temporal variation on the Multisimo subset using 15-second windows.
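
The per-window curves in Figures 3 and 4 can be thought of as accuracy computed over fixed time bins. A minimal sketch follows, assuming each prediction carries a timestamp in seconds and a correctness flag; the `(time_s, is_correct)` layout and the function name `windowed_accuracy` are hypothetical and not taken from the paper.

```python
def windowed_accuracy(records, window_s=60.0):
    """Group (time_s, is_correct) pairs into fixed windows and score each one.

    Returns {window_index: accuracy}, where window i covers
    [i * window_s, (i + 1) * window_s).
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    hits, counts = {}, {}
    for time_s, is_correct in records:
        index = int(time_s // window_s)
        counts[index] = counts.get(index, 0) + 1
        hits[index] = hits.get(index, 0) + int(bool(is_correct))
    return {index: hits[index] / counts[index] for index in sorted(counts)}


# Hypothetical predictions spread over three minutes.
records = [(5, True), (30, False), (61, True), (119.9, True), (120, False)]
per_minute = windowed_accuracy(records, window_s=60.0)
print(per_minute)
```

Passing `window_s=15.0` would give the finer 15-second granularity mentioned for the Multisimo subset.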

Original abstract

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs beat humans and supervised models on next-speaker prediction from text alone on AMI, but the input match across conditions still needs explicit confirmation.

The main result is that text-only LLMs outperformed both supervised models and humans on next speaker prediction in AMI meetings, even without domain training or audio-visual input. Multimodal LLMs helped on addressee detection and turn-change prediction but stayed below human levels. The paper also shows through ablations that conversational context matters most for next speaker, and that LLM and human error patterns line up, with both struggling on stretches of rapid turn changes.

This is a straightforward empirical extension: three tasks evaluated together on a standard corpus with direct baselines. The next-speaker win is the clearest new data point because it comes from general models with less information than the alternatives.

The soft spot is the input-equivalence question flagged above as the load-bearing premise. The abstract notes that LLMs had no audio or visual access, but it does not state whether the supervised models were also text-only or whether the exact speaker history and timing details in the LLM prompts matched what humans saw during annotation. Without that check, the performance gap could partly reflect differences in what each system received rather than model ability. Neither the prompt wording nor any statistical tests are visible in the summary.

This is for people working on dialogue systems and turn-taking. It has enough concrete comparisons and a public corpus to deserve peer review, though it will probably need some added detail on the prompt construction and baseline feature sets to hold up cleanly.

Referee Report

1 major / 2 minor

Summary. The paper constructs an evaluation framework for three turn-taking tasks (addressee detection, turn-change prediction, next speaker prediction) in multimodal multi-party meetings. On the AMI corpus it compares text-based LLMs, multimodal LLMs, supervised models trained on the tasks, and human subjects. It claims that text-only LLMs outperform both supervised models and humans on next speaker prediction despite lacking domain training and audio-visual input, and that an MM-LLM improves on text-based LLMs for the other two tasks but stays below humans. Ablations show that conversational context is critical, and humans and LLMs show similar prediction patterns.

Significance. If the input representations are shown to be equivalent, the result would indicate that general-purpose LLMs possess strong zero-shot ability to model conversational dynamics from text alone. The public-corpus experiments, human baselines, and context ablations are positive features that would make the work useful for dialogue-system research.

major comments (1)

[Evaluation framework (abstract and experimental setup)] The headline claim in the abstract that LLMs outperformed supervised models and humans on next speaker prediction rests on the assumption that the textual prompts supplied to the LLMs contain exactly the same conversational history, speaker identities, and timing cues that the supervised baselines received as input features and that human annotators saw. The abstract notes that LLMs were evaluated “without access to audio or visual information” but does not confirm that the supervised models were likewise restricted to text-only features or that the prompt wording matches the annotation instructions; any mismatch could make the performance gap an artifact of the experimental setup.

minor comments (2)

Add explicit details on prompt templates, feature sets used by the supervised baselines, data splits, and statistical significance tests.
Include a short error analysis or example predictions to substantiate the claim that human and LLM prediction patterns are similar.
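
The significance testing requested in the first minor comment could take several forms; this summary does not say which, if any, the paper used. As one hedged illustration, an exact McNemar test compares two systems scored on the same items by looking only at the items where exactly one of them is correct. The function name `mcnemar_exact` and the toy labels below are hypothetical.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar test for two systems scored on the same items.

    Returns (only_a_correct, only_b_correct, p_value).
    """
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    n = only_a + only_b
    if n == 0:  # the systems never disagree in correctness
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Under the null, discordant items split like fair coin flips: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical next-speaker labels for six items.
gold = ["A", "A", "A", "A", "A", "A"]
system_a = ["A", "A", "A", "A", "A", "B"]
system_b = ["B", "B", "B", "B", "A", "A"]
only_a, only_b, p_value = mcnemar_exact(gold, system_a, system_b)
print(only_a, only_b, p_value)
```

A paired test of this kind fits the setup described here because every system is evaluated on the same utterances, so per-item agreement carries information that comparing two overall accuracies would discard.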

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation framework. We address the single major comment below and agree that additional clarifications are warranted.

Referee: [Evaluation framework (abstract and experimental setup)] The headline claim in the abstract that LLMs outperformed supervised models and humans on next speaker prediction rests on the assumption that the textual prompts supplied to the LLMs contain exactly the same conversational history, speaker identities, and timing cues that the supervised baselines received as input features and that human annotators saw. The abstract notes that LLMs were evaluated “without access to audio or visual information” but does not confirm that the supervised models were likewise restricted to text-only features or that the prompt wording matches the annotation instructions; any mismatch could make the performance gap an artifact of the experimental setup.

Authors: We agree the abstract is insufficiently explicit on input equivalence and will revise it. In Sections 3.2 and 4.1 the supervised baselines are trained solely on textual features (speaker IDs, utterance history, and turn-boundary timestamps extracted from the AMI transcripts); no acoustic or visual features are used. The LLM prompts are constructed from the identical transcript segments and speaker labels. Human annotators received the same text-only transcripts. We will add an explicit statement to the abstract, a feature-comparison table in Section 4, and the full prompt templates plus annotation instructions to the appendix. Regarding timing cues, any additional pause-duration information available only to supervised models would make the LLM outperformance result stronger rather than weaker; we will note this explicitly.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on public corpus

The paper reports direct experimental comparisons of LLMs, supervised models, and humans on the AMI corpus for three turn-taking tasks. Performance claims rest on measured accuracies rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or first-principles results are presented that reduce to their own inputs by construction. The evaluation framework is external to the models tested, satisfying the default expectation of non-circularity for empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical evaluation paper; no new theoretical entities or fitted parameters are introduced in the abstract. Relies on standard assumptions of machine learning benchmarking.

axioms (1)

domain assumption: Standard machine learning evaluation assumptions, including representative data splits and consistent task labeling across systems.
Implicit in any comparative study on a fixed corpus like AMI.

pith-pipeline@v0.9.1-grok · 5721 in / 1176 out tokens · 40598 ms · 2026-06-27T01:11:39.447060+00:00 · methodology

Reference graph

Works this paper leans on

61 extracted references · 8 canonical work pages · 4 internal anchors

[1]

Introduction Advances in large language models (LLMs) have substantially improved the ability of conversational agents to understand and generate natural language. With the emergence of multimodal LLMs (MM-LLMs) capable of processing audio and visual in- puts in addition to text [1,2], it is becoming possible to integrate linguistic and non-linguistic inf...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Evaluation of LLMs Several recent studies have examined the ability of LLMs to un- derstand turn-taking in MPCs (Table 1)

Related Work 2.1. Evaluation of LLMs Several recent studies have examined the ability of LLMs to un- derstand turn-taking in MPCs (Table 1). Inoue et al. [27] con- structed a benchmark for addressee detection and next speaker prediction using three-party conversations. They reported that LLM performance with ground-truth transcriptions was close to chance...
[3]

In addition to evaluating models, we also measure human performance on the same tasks to clarify the gap between humans and current mod- els for these tasks

Task Definition In this study, we evaluate turn-taking prediction in MPCs through three tasks: (1)addressee detection, (2)turn-change prediction, and (3)next speaker prediction. In addition to evaluating models, we also measure human performance on the same tasks to clarify the gap between humans and current mod- els for these tasks. In our experiments, s...
[4]

We used the AMI corpus, which consists of 100 hours of meeting record- ings, as in previous studies [24, 28]

Dataset We constructed an evaluation set for the above tasks. We used the AMI corpus, which consists of 100 hours of meeting record- ings, as in previous studies [24, 28]. The AMI corpus provides synchronized audio recordings, video streams, and manual tran- scriptions. This corpus includes scenario-based meetings where four participants, each playing dif...
[5]

As a naive baseline, we report majority or chance-level strate- gies for each task

Model Evaluation We evaluate three classes of models: conventional supervised learning models, and off-the-shelf text-based and MM-LLMs. As a naive baseline, we report majority or chance-level strate- gies for each task. For addressee detection, the naive baseline always predictsGrouplabel. For turn-change prediction, it al- ways predictsShiftlabel. For n...
[6]

Participants simultaneously performed addressee detection, turn-change prediction, and next speaker prediction in an online setting, without access to future utterances

Human Evaluation To compare human and model performance, we conducted a human evaluation under the same task formulation described in Section 3. Participants simultaneously performed addressee detection, turn-change prediction, and next speaker prediction in an online setting, without access to future utterances. We developed a web-based interface (Figure...
[7]

Model comparison Supervised models vs

Results 7.1. Model comparison Supervised models vs. LLMs:Table 5 shows the overall per- formance of models. SVM achieved the highest accuracy in addressee detection. In turn-change prediction, it also outper- formed all LLMs except Gemini 2.5 Pro. These results indicate that task-specific supervised models can surpass general LLMs in these tasks, even wit...
[8]

easy” intervals where accuracy is high for both, and “difficult

Analysis 8.1. Important features Table 7 and 8 include ablation studies examining the contribu- tion of input features. Firstly, removing conversational context ((a) vs. (d) in Table 7 and 8) led to a substantial performance degradation for Qwen3-14B and Gemini 2.5 Pro, particularly in addressee detection and next speaker prediction. These re- sults indic...
[9]

First, humans and MM-LLMs performed the tasks by watching fixed-angle recorded videos, which do not reflect the first-person perspective of a meeting participant

Limitation Our evaluation differs from natural conversational participation in several respects. First, humans and MM-LLMs performed the tasks by watching fixed-angle recorded videos, which do not reflect the first-person perspective of a meeting participant. Second, textual transcripts and explicit current speaker infor- mation were provided. Such inform...
[10]

Conclusion We conducted a unified evaluation of turn-taking in multimodal MPCs. We compared supervised models, text-based LLMs, multimodal LLMs, and human participants on addressee detec- tion, turn-change prediction, and next speaker prediction under online constraints. Our results showed that multimodal LLMs underperformed humans in addressee detection ...
[11]

Generative AI models were also used as compar- ison systems in the experimental evaluation

Generative AI Use Disclosure This manuscript was edited and polished with the assistance of generative AI. Generative AI models were also used as compar- ison systems in the experimental evaluation. All experimental design, implementation, and analysis were conducted by the au- thors who take full responsibility for the content
[12]

Video-LLaMA: An instruction- tuned audio-visual language model for video understanding,

H. Zhang, X. Li, and L. Bing, “Video-LLaMA: An instruction- tuned audio-visual language model for video understanding,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023, pp. 543–553

2023
[13]

Natural language super- vision for general-purpose audio representations,

B. Elizalde, S. Deshmukh, and H. Wang, “Natural language super- vision for general-purpose audio representations,” inProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), 2024, pp. 336–340

2024
[14]

A four-participant group facilitation framework for conversational robots,

Y . Matsuyama, I. Akiba, A. Saito, and T. Kobayashi, “A four-participant group facilitation framework for conversational robots,” inProceedings of the SIGDIAL 2013 Conference, 2013, pp. 284–293

2013
[15]

Exploring turn-taking cues in multi-party human-robot discussions about objects,

G. Skantze, M. Johansson, and J. Beskow, “Exploring turn-taking cues in multi-party human-robot discussions about objects,” in Proceedings of the 2015 ACM International Conference on Mul- timodal Interaction, 2015, pp. 67–74

2015
[16]

The ICSI meeting corpus,

A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” inProceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2003), 2003

2003
[17]

Modeling collaborative mul- timodal behavior in group dialogues: The MULTISIMO corpus,

M. Koutsombogera and C. V ogel, “Modeling collaborative mul- timodal behavior in group dialogues: The MULTISIMO corpus,” inProceedings of the Eleventh International Conference on Lan- guage Resources and Evaluation (LREC 2018), 2018

2018
[18]

CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,

S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V . Manohar, D. Povey, D. Raj, D. Sny- der, A. S. Subramanian, J. Trmal, B. B. Yair, C. Boeddeker, Z. Ni, Y . Fujita, S. Horiguchi, N. Kanda, T. Yoshioka, and N. Ryant, “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” inProceedings of ...

2020
[19]

NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,

A. Vinnikov, A. Ivry, A. Hurvitz, I. Abramovski, S. Koubi, I. Gur- vich, S. Peer, X. Xiao, B. M. Elizalde, N. Kanda, X. Wang, S. Shaer, S. Yagev, Y . Asher, S. Sivasankaran, Y . Gong, M. Tang, H. Wang, and E. Krupka, “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proceedings of the 25th Annual Conference of...

2024
[20]

A cocktail-party benchmark: Multi- modal dataset and comparative evaluation results,

T.-B. Nguyen, K. Zmolikova, P. Ma, N. Q. Pham, C. Fue- gen, and A. Waibel, “A cocktail-party benchmark: Multi- modal dataset and comparative evaluation results,”arXiv preprint arXiv:2510.23276, Feb. 2026

work page arXiv 2026
[21]

Issues in multiparty dialogues,

D. Traum, “Issues in multiparty dialogues,” inProceedings of the Workshop on Agent Communication Languages, 2003, pp. 201– 211

2003
[22]

Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,

M. Johansson and G. Skantze, “Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,” inProceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2015, pp. 305–314

2015
[23]

Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks,

G. Skantze, “Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks,” in Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2017), 2017, pp. 220–230

2017
[24]

Multimodal continuous turn-taking prediction using multiscale RNNs,

M. Roddy, G. Skantze, and N. Harte, “Multimodal continuous turn-taking prediction using multiscale RNNs,” inProceedings of the 20th ACM International Conference on Multimodal Interac- tion (ICMI 2018), 2018, pp. 186–190

2018
[25]

V oice activity projection: Self- supervised learning of turn-taking events,

E. Ekstedt and G. Skantze, “V oice activity projection: Self- supervised learning of turn-taking events,” inProceedings of the 23rd Annual Conference of the International Speech Communica- tion Association (INTERSPEECH 2022), 2022, pp. 5190–5194

2022
[26]

TurnGPT: a transformer-based language model for pre- dicting turn-taking in spoken dialog,

——, “TurnGPT: a transformer-based language model for pre- dicting turn-taking in spoken dialog,” inFindings of the Associa- tion for Computational Linguistics: EMNLP 2020, Stroudsburg, PA, USA, 2020, pp. 2981–2990

2020
[27]

Talk- ing turns: Benchmarking audio foundation models on turn-taking dynamics,

S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talk- ing turns: Benchmarking audio foundation models on turn-taking dynamics,” inInternational Conference on Learning Representa- tions, Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 52 754–52 781

2025
[28]

Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities.arXiv preprint arXiv:2503.04721, 2025a

G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full- duplex spoken dialogue models on turn-taking capabilities,”arXiv preprint arXiv:2503.04721, 2025

work page arXiv 2025
[29]

A survey of recent advances on turn-taking modeling in spoken dialogue systems,

G. Castillo-L ´opez, G. de Chalendar, and N. Semmar, “A survey of recent advances on turn-taking modeling in spoken dialogue systems,” inProceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, 2025, pp. 254–271

2025
[30]

Towards automatic addressee identification in multi-party dialogues,

N. Jovanovic and R. o. den Akker, “Towards automatic addressee identification in multi-party dialogues,” inProceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, 2004, pp. 89–92

2004
[31]

Modeling norms of turn-taking in multi-party conversation,

K. Laskowski, “Modeling norms of turn-taking in multi-party conversation,” inProceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 999–1008

2010
[32]

Predicting next speaker and timing from gaze transition patterns in multi-party meetings,

R. Ishii, K. Otsuka, S. Kumano, M. Matsuda, and J. Yamato, “Predicting next speaker and timing from gaze transition patterns in multi-party meetings,” inProceedings of the 15th ACM Inter- national Conference on Multimodal Interaction, New York, NY , USA, 2013

2013
[33]

Investiga- tion of the relationship between turn-taking and prosodic features in spontaneous dialogue,

T. Ohsuga, M. Nishida, Y . Horiuchi, and A. Ichikawa, “Investiga- tion of the relationship between turn-taking and prosodic features in spontaneous dialogue,” inProceedings of the 6th Annual Con- ference of the International Speech Communication Association (INTERSPEECH 2005), 2005, pp. 33–36

2005
[34]

Multimodal end-of-turn prediction in multi-party meetings,

I. de Kok and D. Heylen, “Multimodal end-of-turn prediction in multi-party meetings,” inProceedings of the 2009 International Conference on Multimodal Interfaces (ICMI 2009), 2009, pp. 91– 98

2009
[35]

A generic machine learning based approach for addressee detec- tion in multiparty interaction,

U. Malik, M. Barange, N. Ghannad, J. Saunier, and A. Pauchet, “A generic machine learning based approach for addressee detec- tion in multiparty interaction,” inProceedings of the 19th ACM In- ternational Conference on Intelligent Virtual Agents. New York, NY , USA: ACM, Jul. 2019

2019
[36]

Gaze- enhanced multimodal turn-taking prediction in triadic conversa- tions,

S. Heo, C. Miller, C. Murdock, and M. Proulx, “Gaze- enhanced multimodal turn-taking prediction in triadic conversa- tions,” inProceedings of the 26th Annual Conference of the In- ternational Speech Communication Association (INTERSPEECH 2025), 2025, pp. 1068–1072

2025
[37]

Triadic multi- party voice activity projection for turn-taking in spoken dialogue systems,

M. Elmers, K. Inoue, D. Lala, and T. Kawahara, “Triadic multi- party voice activity projection for turn-taking in spoken dialogue systems,” inProceedings of the 26th Annual Conference of the In- ternational Speech Communication Association (INTERSPEECH 2025), 2025

2025
[38]

An LLM benchmark for addressee recognition in multi-modal multi- party dialogue,

K. Inoue, D. Lala, M. Elmers, K. Ochi, and T. Kawahara, “An LLM benchmark for addressee recognition in multi-modal multi- party dialogue,” inProceedings of the 15th International Work- shop on Spoken Dialogue Systems Technology, 2025, pp. 330– 334

2025
[39]

Next speaker prediction for multi- speaker dialogue with large language models,

L. Hilgert and J. Niehues, “Next speaker prediction for multi- speaker dialogue with large language models,” inProceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025), 2025, pp. 60–71

2025
[40]

Analysing next speaker prediction in multi-party conversation using multi- modal large language models,

T. Mori, K. Inoue, D. Lala, K. Ochi, and T. Kawahara, “Analysing next speaker prediction in multi-party conversation using multi- modal large language models,” inProceedings of the 16th Inter- national Workshop on Spoken Dialogue System Technology, 2026, pp. 83–94

2026
[41]

DiPCo — dinner party corpus,

M. Van Segbroeck, A. Zaid, K. Kutsenko, C. Huerta, T. Nguyen, X. Luo, B. Hoffmeister, J. Trmal, M. Omologo, and R. Maas, “DiPCo — dinner party corpus,” inProceedings of the 21st An- nual Conference of the International Speech Communication As- sociation (INTERSPEECH 2020), Oct. 2020, pp. 434–436

2020
[42]

Multi-party chat: Conversational agents in group settings with humans and models,

J. Wei, K. Shuster, A. Szlam, J. Weston, J. Urbanek, and M. Komeili, “Multi-party chat: Conversational agents in group settings with humans and models,”arXiv preprint arXiv:2304.13835, Apr. 2023

work page arXiv 2023
[43]

Multimodal conversation structure understanding,

K. K. Chang, M. H. Cramer, A. Ho, T. T. Nguyen, Y . Yuan, and D. Bamman, “Multimodal conversation structure understanding,” arXiv preprint arXiv:2505.17536, 2025

work page arXiv 2025
[44]

The AMI meet- ing corpus,

W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The AMI meet- ing corpus,” inProceedings of the 5th International Conference on Methods and Techniques in Behavioral Research (Measuring Behavior 2005), 2005

2005
[45]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Dang, B. Zhang, X. Wang, Y . Chu, and J. Lin, “Qwen2.5-omni technical report,”arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Qwen3-Omni Technical Report

J. Xu, Z. Guo, H. Hu, Y . Chu, X. Wang, J. He, Y . Wang, X. Shi, T. He, X. Zhu, Y . Lv, Y . Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3-omni technical report,”arXiv...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

A comparison of addressee detec- tion methods for multiparty conversations,

R. o. d. Akker and D. Traum, “A comparison of addressee detec- tion methods for multiparty conversations,” inProc. DiaHolmia 2009, 2009, pp. 99–106

2009
[49]

Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation,

J.-P. de Ruiter, H. Mitterer, and N. J. Enfield, “Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation,”Lan- guage, vol. 82, no. 3, pp. 515–535, 2006

2006
[50] M. Casillas and M. Frank, “The development of predictive processes in children’s discourse understanding,” in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 35, no. 35, 2013
[51] I. Rish, “An empirical study of the naïve bayes classifier,” Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001
[52] A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002
[53] R. Kruse, S. Mostaghim, C. Borgelt, C. Braune, and M. Steinbrecher, “Multi-layer perceptrons,” in Computational Intelligence: A Methodological Introduction, 2022, pp. 53–124
[54] I. Steinwart and A. Christmann, Support Vector Machines, 2008
[55] O. Kramer, “Scikit-learn,” in Machine Learning for Evolution Strategies, 2016, pp. 45–53
[56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017
[57] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, 2023, pp. 28 492–28 518
[58] N. Jovanovic, R. o. den Akker, and A. Nijholt, “Addressee identification in face-to-face meetings,” in 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 169–176
[59] H. Sacks, E. A. Schegloff, and G. Jefferson, “A simplest systematics for the organization of turn-taking for conversation,” Language, vol. 50, no. 4, pp. 696–735, 1974
[60] C. Cox, R. Fusaroli, Y. A. Nielsen, S. Cho, R. Rocca, A. Simonsen, A. Knox, M. Lyons, M. Liberman, C. Cieri et al., “Social context matters for turn-taking dynamics: A comparative study of autistic and typically developing children,” Cognitive Science, vol. 49, no. 10, p. e70124, 2025
[61] K. Onishi, H. Ohnaka, and K. Yoshino, “Modeling turn-taking speed and speaker characteristics,” in Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2025), 2025, pp. 21–31

Introduction

Advances in large language models (LLMs) have substantially improved the ability of conversational agents to understand and generate natural language. With the emergence of multimodal LLMs (MM-LLMs) capable of processing audio and visual inputs in addition to text [1,2], it is becoming possible to integrate linguistic and non-linguistic inf...

Related Work

2.1. Evaluation of LLMs

Several recent studies have examined the ability of LLMs to understand turn-taking in MPCs (Table 1). Inoue et al. [27] constructed a benchmark for addressee detection and next speaker prediction using three-party conversations. They reported that LLM performance with ground-truth transcriptions was close to chance...

Task Definition

In this study, we evaluate turn-taking prediction in MPCs through three tasks: (1) addressee detection, (2) turn-change prediction, and (3) next speaker prediction. In addition to evaluating models, we also measure human performance on the same tasks to clarify the gap between humans and current models for these tasks. In our experiments, s...

Dataset

We constructed an evaluation set for the above tasks. We used the AMI corpus, which consists of 100 hours of meeting recordings, as in previous studies [24, 28]. The AMI corpus provides synchronized audio recordings, video streams, and manual transcriptions. This corpus includes scenario-based meetings where four participants, each playing dif...

Model Evaluation

We evaluate three classes of models: conventional supervised learning models, off-the-shelf text-based LLMs, and off-the-shelf MM-LLMs. As a naive baseline, we report majority or chance-level strategies for each task. For addressee detection, the naive baseline always predicts the Group label. For turn-change prediction, it always predicts the Shift label. For n...
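The constant-label baselines described above can be illustrated with a minimal sketch. It assumes gold labels are available as plain Python lists; the helper names `majority_label` and `constant_baseline_accuracy`, the toy label sequences, and the `Hold`, `A`, and `B` labels are hypothetical and do not come from the paper. Only the `Group` and `Shift` labels are taken from the text.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def constant_baseline_accuracy(gold, constant):
    """Accuracy obtained by predicting `constant` for every item."""
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(label == constant for label in gold) / len(gold)


# Hypothetical toy labels, not taken from the AMI corpus.
addressee_gold = ["Group", "Group", "A", "Group", "B"]
turn_change_gold = ["Shift", "Hold", "Shift", "Shift"]

# Baselines that always predict one fixed label per task.
addressee_acc = constant_baseline_accuracy(addressee_gold, "Group")
turn_change_acc = constant_baseline_accuracy(turn_change_gold, "Shift")

print(f"addressee baseline accuracy: {addressee_acc:.2f}")
print(f"turn-change baseline accuracy: {turn_change_acc:.2f}")
```

A constant predictor of this kind gives the accuracy floor that any trained or prompted model should exceed, which is why it is a useful reference point when label distributions are skewed.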

Human Evaluation

To compare human and model performance, we conducted a human evaluation under the same task formulation described in Section 3. Participants simultaneously performed addressee detection, turn-change prediction, and next speaker prediction in an online setting, without access to future utterances. We developed a web-based interface (Figure...

Results

7.1. Model comparison

Supervised models vs. LLMs: Table 5 shows the overall performance of the models. SVM achieved the highest accuracy in addressee detection. In turn-change prediction, it also outperformed all LLMs except Gemini 2.5 Pro. These results indicate that task-specific supervised models can surpass general LLMs in these tasks, even wit...

Analysis

8.1. Important features

Tables 7 and 8 include ablation studies examining the contribution of input features. First, removing conversational context ((a) vs. (d) in Tables 7 and 8) led to a substantial performance degradation for Qwen3-14B and Gemini 2.5 Pro, particularly in addressee detection and next speaker prediction. These results indicat...

Limitation

Our evaluation differs from natural conversational participation in several respects. First, humans and MM-LLMs performed the tasks by watching fixed-angle recorded videos, which do not reflect the first-person perspective of a meeting participant. Second, textual transcripts and explicit current speaker information were provided. Such inform...

Conclusion

We conducted a unified evaluation of turn-taking in multimodal MPCs. We compared supervised models, text-based LLMs, multimodal LLMs, and human participants on addressee detection, turn-change prediction, and next speaker prediction under online constraints. Our results showed that multimodal LLMs underperformed humans in addressee detection ...

Generative AI Use Disclosure

This manuscript was edited and polished with the assistance of generative AI. Generative AI models were also used as comparison systems in the experimental evaluation. All experimental design, implementation, and analysis were conducted by the authors, who take full responsibility for the content.

[12] H. Zhang, X. Li, and L. Bing, “Video-LLaMA: An instruction-tuned audio-visual language model for video understanding,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023, pp. 543–553

[13] B. Elizalde, S. Deshmukh, and H. Wang, “Natural language supervision for general-purpose audio representations,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), 2024, pp. 336–340

[14] Y. Matsuyama, I. Akiba, A. Saito, and T. Kobayashi, “A four-participant group facilitation framework for conversational robots,” in Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 284–293

[15] G. Skantze, M. Johansson, and J. Beskow, “Exploring turn-taking cues in multi-party human-robot discussions about objects,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, pp. 67–74

[16] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2003), 2003

[17] M. Koutsombogera and C. Vogel, “Modeling collaborative multimodal behavior in group dialogues: The MULTISIMO corpus,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018

[18] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Snyder, A. S. Subramanian, J. Trmal, B. B. Yair, C. Boeddeker, Z. Ni, Y. Fujita, S. Horiguchi, N. Kanda, T. Yoshioka, and N. Ryant, “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proceedings of ..., 2020

[19] A. Vinnikov, A. Ivry, A. Hurvitz, I. Abramovski, S. Koubi, I. Gurvich, S. Peer, X. Xiao, B. M. Elizalde, N. Kanda, X. Wang, S. Shaer, S. Yagev, Y. Asher, S. Sivasankaran, Y. Gong, M. Tang, H. Wang, and E. Krupka, “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proceedings of the 25th Annual Conference of..., 2024

[20] T.-B. Nguyen, K. Zmolikova, P. Ma, N. Q. Pham, C. Fuegen, and A. Waibel, “A cocktail-party benchmark: Multi-modal dataset and comparative evaluation results,” arXiv preprint arXiv:2510.23276, Feb. 2026

[21] D. Traum, “Issues in multiparty dialogues,” in Proceedings of the Workshop on Agent Communication Languages, 2003, pp. 201–211

[22] M. Johansson and G. Skantze, “Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,” in Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2015, pp. 305–314

[23] G. Skantze, “Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks,” in Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2017), 2017, pp. 220–230

[24] M. Roddy, G. Skantze, and N. Harte, “Multimodal continuous turn-taking prediction using multiscale RNNs,” in Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018), 2018, pp. 186–190

[25] E. Ekstedt and G. Skantze, “Voice activity projection: Self-supervised learning of turn-taking events,” in Proceedings of the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022), 2022, pp. 5190–5194

[26] ——, “TurnGPT: a transformer-based language model for predicting turn-taking in spoken dialog,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Stroudsburg, PA, USA, 2020, pp. 2981–2990

[27] S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” in International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 52 754–52 781

[28] G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025

[29] G. Castillo-López, G. de Chalendar, and N. Semmar, “A survey of recent advances on turn-taking modeling in spoken dialogue systems,” in Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, 2025, pp. 254–271

[30] N. Jovanovic and R. o. den Akker, “Towards automatic addressee identification in multi-party dialogues,” in Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, 2004, pp. 89–92

[31] K. Laskowski, “Modeling norms of turn-taking in multi-party conversation,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 999–1008

[32] R. Ishii, K. Otsuka, S. Kumano, M. Matsuda, and J. Yamato, “Predicting next speaker and timing from gaze transition patterns in multi-party meetings,” in Proceedings of the 15th ACM International Conference on Multimodal Interaction, New York, NY, USA, 2013

[33] T. Ohsuga, M. Nishida, Y. Horiuchi, and A. Ichikawa, “Investigation of the relationship between turn-taking and prosodic features in spontaneous dialogue,” in Proceedings of the 6th Annual Conference of the International Speech Communication Association (INTERSPEECH 2005), 2005, pp. 33–36

[34] I. de Kok and D. Heylen, “Multimodal end-of-turn prediction in multi-party meetings,” in Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI 2009), 2009, pp. 91–98

[35] U. Malik, M. Barange, N. Ghannad, J. Saunier, and A. Pauchet, “A generic machine learning based approach for addressee detection in multiparty interaction,” in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents. New York, NY, USA: ACM, Jul. 2019

[36] S. Heo, C. Miller, C. Murdock, and M. Proulx, “Gaze-enhanced multimodal turn-taking prediction in triadic conversations,” in Proceedings of the 26th Annual Conference of the International Speech Communication Association (INTERSPEECH 2025), 2025, pp. 1068–1072

[37] M. Elmers, K. Inoue, D. Lala, and T. Kawahara, “Triadic multi-party voice activity projection for turn-taking in spoken dialogue systems,” in Proceedings of the 26th Annual Conference of the International Speech Communication Association (INTERSPEECH 2025), 2025

[38] K. Inoue, D. Lala, M. Elmers, K. Ochi, and T. Kawahara, “An LLM benchmark for addressee recognition in multi-modal multi-party dialogue,” in Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, 2025, pp. 330–334

[39] L. Hilgert and J. Niehues, “Next speaker prediction for multi-speaker dialogue with large language models,” in Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025), 2025, pp. 60–71

[40] T. Mori, K. Inoue, D. Lala, K. Ochi, and T. Kawahara, “Analysing next speaker prediction in multi-party conversation using multimodal large language models,” in Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, 2026, pp. 83–94

[41] M. Van Segbroeck, A. Zaid, K. Kutsenko, C. Huerta, T. Nguyen, X. Luo, B. Hoffmeister, J. Trmal, M. Omologo, and R. Maas, “DiPCo — dinner party corpus,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH 2020), Oct. 2020, pp. 434–436

[42] J. Wei, K. Shuster, A. Szlam, J. Weston, J. Urbanek, and M. Komeili, “Multi-party chat: Conversational agents in group settings with humans and models,” arXiv preprint arXiv:2304.13835, Apr. 2023
