LLMs beat humans and supervised models at next speaker prediction in meetings using only text, while multimodal LLMs improve on addressee and turn-change tasks but remain below human performance.
Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings
1 Pith paper cites this work.
abstract
We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)