Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

2026 · cs.CL · arXiv 2606.17542

1 Pith paper cites this work.


abstract

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

representative citing papers

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

cs.CL · 2026-06-16 · unverdicted · novelty 5.0

LLMs beat humans and supervised models at next speaker prediction in meetings using only text, while multimodal LLMs improve on addressee and turn-change tasks but remain below human performance.
