Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings
Pith reviewed 2026-06-27 01:11 UTC · model grok-4.3
The pith
LLMs outperform supervised models and humans at next speaker prediction in meetings using only text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction, and that human and LLM prediction patterns were similar.
What carries the argument
The three-task evaluation framework that feeds meeting transcripts to LLMs, supervised models, and humans and measures accuracy on addressee detection, turn-change prediction, and next speaker prediction.
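As a rough illustration of how such a framework could score a single system, the sketch below computes per-task accuracy from gold and predicted labels. The `Labels` record, the attribute names, and the `Hold` label are assumptions made for this example; the page does not describe the paper's actual data format.

```python
from dataclasses import dataclass

# Task names mirror the three tasks in the text; attribute names are illustrative.
TASKS = ("addressee", "turn_change", "next_speaker")


@dataclass(frozen=True)
class Labels:
    """Labels for one evaluation point (hypothetical label vocabulary)."""
    addressee: str     # e.g. "Group" or a participant ID
    turn_change: str   # e.g. "Shift" or "Hold" ("Hold" is an assumed name)
    next_speaker: str  # a participant ID


def per_task_accuracy(gold, pred):
    """Return {task: fraction of points where the prediction equals the gold label}."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one evaluation point")
    return {
        task: sum(getattr(g, task) == getattr(p, task) for g, p in zip(gold, pred)) / len(gold)
        for task in TASKS
    }


# Toy data, invented for illustration only.
gold = [
    Labels("Group", "Shift", "B"),
    Labels("A", "Hold", "C"),
    Labels("Group", "Shift", "D"),
    Labels("B", "Hold", "B"),
]
pred = [
    Labels("Group", "Shift", "B"),
    Labels("Group", "Hold", "A"),
    Labels("Group", "Hold", "A"),
    Labels("B", "Hold", "B"),
]
scores = per_task_accuracy(gold, pred)
print(scores)
```

Scoring humans, supervised models, and LLMs with the same function on the same evaluation points is what makes the resulting accuracies comparable.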
If this is right
- Conversational context supplies the main signal for next speaker prediction across all systems tested.
- Humans and LLMs exhibit similar difficulty on stretches of rapid turn changes.
- Multimodal LLMs extract some benefit from audio-visual input on addressee and turn-change tasks but not enough to match people.
Where Pith is reading between the lines
- The approach could support lightweight meeting-assistant tools that run on transcripts alone and require no camera or microphone arrays.
- Error-pattern overlap suggests LLMs may be learning some of the same implicit turn-taking rules that people use.
- Repeating the study on non-English meetings would test whether the observed advantage generalizes beyond the AMI data.
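One way to put a number on the error-pattern overlap mentioned above is to compare human and LLM predictions directly. The sketch below is an assumed analysis, not the paper's method: it computes Cohen's kappa between two prediction sequences and the Jaccard overlap of the items each side gets wrong. All function names and the toy data are hypothetical.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


def error_jaccard(gold, human, model):
    """Jaccard overlap between the sets of items that humans and the model get wrong."""
    if not (len(gold) == len(human) == len(model)):
        raise ValueError("all sequences must have the same length")
    human_err = {i for i, (g, h) in enumerate(zip(gold, human)) if g != h}
    model_err = {i for i, (g, m) in enumerate(zip(gold, model)) if g != m}
    union = human_err | model_err
    return len(human_err & model_err) / len(union) if union else 1.0


# Toy next-speaker labels, invented for illustration only.
gold = ["A", "B", "C", "D"]
human = ["A", "C", "C", "A"]
model = ["A", "B", "D", "A"]
print(cohens_kappa(human, model), error_jaccard(gold, human, model))
```

A high kappa with a high error overlap would support the reading that both groups stumble on the same intervals; a high kappa alone could simply reflect that both are mostly correct.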
Load-bearing premise
The text prompts supplied to the LLMs contain information comparable in kind and completeness to the features and context given to the supervised models and human annotators.
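To make the premise concrete, the sketch below shows one way a text-only next-speaker prompt could be assembled from a transcript under an online constraint, so that no future utterance leaks into the input. The prompt wording, the `Utterance` type, and the `context_size` parameter are all hypothetical; the paper's actual templates are not reproduced on this page.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    speaker: str
    text: str


def build_prompt(history, current_index, context_size, speakers):
    """Build a text-only prompt from utterances up to and including current_index.

    context_size is the number of preceding utterances to include;
    0 corresponds to a no-context ablation.
    """
    if not 0 <= current_index < len(history):
        raise IndexError("current_index out of range")
    if context_size < 0:
        raise ValueError("context_size must be non-negative")
    start = max(0, current_index - context_size)
    context = history[start:current_index]  # strictly past utterances only
    current = history[current_index]
    lines = ["Participants: " + ", ".join(speakers), "Conversation so far:"]
    lines += [f"{u.speaker}: {u.text}" for u in context]
    lines.append(f"Current utterance -> {current.speaker}: {current.text}")
    lines.append("Who speaks next? Answer with one participant label.")
    return "\n".join(lines)


# Toy meeting, invented for illustration only.
meeting = [
    Utterance("A", "Shall we start?"),
    Utterance("B", "Yes."),
    Utterance("C", "I have the slides."),
    Utterance("A", "Great."),
]
prompt = build_prompt(meeting, current_index=2, context_size=1, speakers=["A", "B", "C", "D"])
print(prompt)
```

Checking input equivalence then amounts to confirming that the supervised features and the material shown to human annotators are derived from the same window of utterances and the same speaker labels.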
What would settle it
Running the same next-speaker prediction test on a second meeting corpus and finding that LLMs no longer exceed human or supervised accuracy would undermine the central result.
Original abstract
We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs an evaluation framework for three turn-taking tasks (addressee detection, turn-change prediction, next speaker prediction) in multimodal multi-party meetings. On the AMI corpus it compares text-based LLMs, multimodal LLMs, supervised models trained on the tasks, and human subjects. It claims that text-only LLMs outperform both supervised models and humans on next speaker prediction despite lacking domain training and audio-visual input, and that an MM-LLM improves on text-based LLMs for the first two tasks but stays below humans. Ablations show that conversational context is critical, and humans and LLMs show similar error patterns.
Significance. If the input representations are shown to be equivalent, the result would indicate that general-purpose LLMs possess strong zero-shot ability to model conversational dynamics from text alone. The public-corpus experiments, human baselines, and context ablations are positive features that would make the work useful for dialogue-system research.
major comments (1)
- [Evaluation framework (abstract and experimental setup)] The headline claim in the abstract that LLMs outperformed supervised models and humans on next speaker prediction rests on the assumption that the textual prompts supplied to the LLMs contain exactly the same conversational history, speaker identities, and timing cues that the supervised baselines received as input features and that human annotators saw. The abstract notes that LLMs had “no access to audio or visual information” but does not confirm that the supervised models were likewise restricted to text-only features or that prompt wording matches annotation instructions; any mismatch would make the performance gap an artifact of experimental setup.
minor comments (2)
- Add explicit details on prompt templates, feature sets used by the supervised baselines, data splits, and statistical significance tests.
- Include a short error analysis or example predictions to substantiate the claim that human and LLM prediction patterns are similar.
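For the significance tests requested above, one common option for paired classifier comparisons is an exact McNemar test on the items where the two systems disagree. The sketch below is an assumption about how such a test could be run; the page does not say which test, if any, the paper uses, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two systems scored on the same items."""
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("all sequences must have the same length")
    # Discordant pairs: exactly one of the two systems is correct.
    a_only = sum(a == g and b != g for g, a, b in zip(gold, pred_a, pred_b))
    b_only = sum(a != g and b == g for g, a, b in zip(gold, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(a_only, b_only)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy labels, invented for illustration only.
gold = ["A"] * 10
system_a = ["A"] * 10
system_b = ["A"] * 4 + ["B"] * 6
print(mcnemar_exact(gold, system_a, system_b))
```

Because the test uses only discordant items, it suits comparisons such as LLM versus supervised model on the same evaluation points, where most predictions may agree.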
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation framework. We address the single major comment below and agree that additional clarifications are warranted.
Point-by-point responses
-
Referee: [Evaluation framework (abstract and experimental setup)] The headline claim in the abstract that LLMs outperformed supervised models and humans on next speaker prediction rests on the assumption that the textual prompts supplied to the LLMs contain exactly the same conversational history, speaker identities, and timing cues that the supervised baselines received as input features and that human annotators saw. The abstract notes that LLMs had “no access to audio or visual information” but does not confirm that the supervised models were likewise restricted to text-only features or that prompt wording matches annotation instructions; any mismatch would make the performance gap an artifact of experimental setup.
Authors: We agree the abstract is insufficiently explicit on input equivalence and will revise it. In Sections 3.2 and 4.1 the supervised baselines are trained solely on textual features (speaker IDs, utterance history, and turn-boundary timestamps extracted from the AMI transcripts); no acoustic or visual features are used. The LLM prompts are constructed from the identical transcript segments and speaker labels. Human annotators received the same text-only transcripts. We will add an explicit statement to the abstract, a feature-comparison table in Section 4, and the full prompt templates plus annotation instructions to the appendix. Regarding timing cues, any additional pause-duration information available only to supervised models would make the LLM outperformance result stronger rather than weaker; we will note this explicitly.
Revision: yes
Circularity Check
No circularity: empirical evaluation on public corpus
The paper reports direct experimental comparisons of LLMs, supervised models, and humans on the AMI corpus for three turn-taking tasks. Performance claims rest on measured accuracies rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or first-principles results are presented that reduce to their own inputs by construction. The evaluation framework is external to the models tested, satisfying the default expectation of non-circularity for empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard machine learning evaluation assumptions, including representative data splits and consistent task labeling across systems
Reference graph
Works this paper leans on
-
[1]
Introduction. Advances in large language models (LLMs) have substantially improved the ability of conversational agents to understand and generate natural language. With the emergence of multimodal LLMs (MM-LLMs) capable of processing audio and visual inputs in addition to text [1,2], it is becoming possible to integrate linguistic and non-linguistic inf...
-
[2]
Related Work. 2.1. Evaluation of LLMs. Several recent studies have examined the ability of LLMs to understand turn-taking in MPCs (Table 1). Inoue et al. [27] constructed a benchmark for addressee detection and next speaker prediction using three-party conversations. They reported that LLM performance with ground-truth transcriptions was close to chance...
-
[3]
Task Definition. In this study, we evaluate turn-taking prediction in MPCs through three tasks: (1) addressee detection, (2) turn-change prediction, and (3) next speaker prediction. In addition to evaluating models, we also measure human performance on the same tasks to clarify the gap between humans and current models for these tasks. In our experiments, s...
-
[4]
Dataset. We constructed an evaluation set for the above tasks. We used the AMI corpus, which consists of 100 hours of meeting recordings, as in previous studies [24, 28]. The AMI corpus provides synchronized audio recordings, video streams, and manual transcriptions. This corpus includes scenario-based meetings where four participants, each playing dif...
-
[5]
Model Evaluation. We evaluate three classes of models: conventional supervised learning models, and off-the-shelf text-based and MM-LLMs. As a naive baseline, we report majority or chance-level strategies for each task. For addressee detection, the naive baseline always predicts Group label. For turn-change prediction, it always predicts Shift label. For n...
-
[6]
Human Evaluation. To compare human and model performance, we conducted a human evaluation under the same task formulation described in Section 3. Participants simultaneously performed addressee detection, turn-change prediction, and next speaker prediction in an online setting, without access to future utterances. We developed a web-based interface (Figure...
-
[7]
Results. 7.1. Model comparison. Supervised models vs. LLMs: Table 5 shows the overall performance of models. SVM achieved the highest accuracy in addressee detection. In turn-change prediction, it also outperformed all LLMs except Gemini 2.5 Pro. These results indicate that task-specific supervised models can surpass general LLMs in these tasks, even wit...
-
[8]
easy” intervals where accuracy is high for both, and “difficult
Analysis. 8.1. Important features. Tables 7 and 8 include ablation studies examining the contribution of input features. Firstly, removing conversational context ((a) vs. (d) in Tables 7 and 8) led to a substantial performance degradation for Qwen3-14B and Gemini 2.5 Pro, particularly in addressee detection and next speaker prediction. These results indic...
-
[9]
Limitation. Our evaluation differs from natural conversational participation in several respects. First, humans and MM-LLMs performed the tasks by watching fixed-angle recorded videos, which do not reflect the first-person perspective of a meeting participant. Second, textual transcripts and explicit current speaker information were provided. Such inform...
-
[10]
Conclusion. We conducted a unified evaluation of turn-taking in multimodal MPCs. We compared supervised models, text-based LLMs, multimodal LLMs, and human participants on addressee detection, turn-change prediction, and next speaker prediction under online constraints. Our results showed that multimodal LLMs underperformed humans in addressee detection ...
-
[11]
Generative AI Use Disclosure. This manuscript was edited and polished with the assistance of generative AI. Generative AI models were also used as comparison systems in the experimental evaluation. All experimental design, implementation, and analysis were conducted by the authors, who take full responsibility for the content
-
[12]
H. Zhang, X. Li, and L. Bing, “Video-LLaMA: An instruction-tuned audio-visual language model for video understanding,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023, pp. 543–553
-
[13]
B. Elizalde, S. Deshmukh, and H. Wang, “Natural language supervision for general-purpose audio representations,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), 2024, pp. 336–340
-
[14]
Y. Matsuyama, I. Akiba, A. Saito, and T. Kobayashi, “A four-participant group facilitation framework for conversational robots,” in Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 284–293
-
[15]
G. Skantze, M. Johansson, and J. Beskow, “Exploring turn-taking cues in multi-party human-robot discussions about objects,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, pp. 67–74
-
[16]
A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, “The ICSI meeting corpus,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2003), 2003
-
[17]
M. Koutsombogera and C. Vogel, “Modeling collaborative multimodal behavior in group dialogues: The MULTISIMO corpus,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
-
[18]
S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Snyder, A. S. Subramanian, J. Trmal, B. B. Yair, C. Boeddeker, Z. Ni, Y. Fujita, S. Horiguchi, N. Kanda, T. Yoshioka, and N. Ryant, “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proceedings of ...
2020
-
[19]
A. Vinnikov, A. Ivry, A. Hurvitz, I. Abramovski, S. Koubi, I. Gurvich, S. Peer, X. Xiao, B. M. Elizalde, N. Kanda, X. Wang, S. Shaer, S. Yagev, Y. Asher, S. Sivasankaran, Y. Gong, M. Tang, H. Wang, and E. Krupka, “NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription,” in Proceedings of the 25th Annual Conference of...
2024
-
[20]
T.-B. Nguyen, K. Zmolikova, P. Ma, N. Q. Pham, C. Fuegen, and A. Waibel, “A cocktail-party benchmark: Multimodal dataset and comparative evaluation results,” arXiv preprint arXiv:2510.23276, Feb. 2026
-
[21]
D. Traum, “Issues in multiparty dialogues,” in Proceedings of the Workshop on Agent Communication Languages, 2003, pp. 201–211
-
[22]
M. Johansson and G. Skantze, “Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,” in Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2015, pp. 305–314
-
[23]
G. Skantze, “Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks,” in Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2017), 2017, pp. 220–230
-
[24]
M. Roddy, G. Skantze, and N. Harte, “Multimodal continuous turn-taking prediction using multiscale RNNs,” in Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018), 2018, pp. 186–190
-
[25]
E. Ekstedt and G. Skantze, “Voice activity projection: Self-supervised learning of turn-taking events,” in Proceedings of the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022), 2022, pp. 5190–5194
-
[26]
E. Ekstedt and G. Skantze, “TurnGPT: a transformer-based language model for predicting turn-taking in spoken dialog,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Stroudsburg, PA, USA, 2020, pp. 2981–2990
-
[27]
S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” in International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 52 754–52 781
-
[28]
G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025
-
[29]
G. Castillo-López, G. de Chalendar, and N. Semmar, “A survey of recent advances on turn-taking modeling in spoken dialogue systems,” in Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, 2025, pp. 254–271
-
[30]
N. Jovanovic and R. o. den Akker, “Towards automatic addressee identification in multi-party dialogues,” in Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, 2004, pp. 89–92
-
[31]
K. Laskowski, “Modeling norms of turn-taking in multi-party conversation,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 999–1008
-
[32]
R. Ishii, K. Otsuka, S. Kumano, M. Matsuda, and J. Yamato, “Predicting next speaker and timing from gaze transition patterns in multi-party meetings,” in Proceedings of the 15th ACM International Conference on Multimodal Interaction, New York, NY, USA, 2013
-
[33]
T. Ohsuga, M. Nishida, Y. Horiuchi, and A. Ichikawa, “Investigation of the relationship between turn-taking and prosodic features in spontaneous dialogue,” in Proceedings of the 6th Annual Conference of the International Speech Communication Association (INTERSPEECH 2005), 2005, pp. 33–36
-
[34]
I. de Kok and D. Heylen, “Multimodal end-of-turn prediction in multi-party meetings,” in Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI 2009), 2009, pp. 91–98
-
[35]
U. Malik, M. Barange, N. Ghannad, J. Saunier, and A. Pauchet, “A generic machine learning based approach for addressee detection in multiparty interaction,” in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents. New York, NY, USA: ACM, Jul. 2019
-
[36]
S. Heo, C. Miller, C. Murdock, and M. Proulx, “Gaze-enhanced multimodal turn-taking prediction in triadic conversations,” in Proceedings of the 26th Annual Conference of the International Speech Communication Association (INTERSPEECH 2025), 2025, pp. 1068–1072
-
[37]
M. Elmers, K. Inoue, D. Lala, and T. Kawahara, “Triadic multi-party voice activity projection for turn-taking in spoken dialogue systems,” in Proceedings of the 26th Annual Conference of the International Speech Communication Association (INTERSPEECH 2025), 2025
-
[38]
K. Inoue, D. Lala, M. Elmers, K. Ochi, and T. Kawahara, “An LLM benchmark for addressee recognition in multi-modal multi-party dialogue,” in Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology, 2025, pp. 330–334
-
[39]
L. Hilgert and J. Niehues, “Next speaker prediction for multi-speaker dialogue with large language models,” in Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025), 2025, pp. 60–71
-
[40]
T. Mori, K. Inoue, D. Lala, K. Ochi, and T. Kawahara, “Analysing next speaker prediction in multi-party conversation using multimodal large language models,” in Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, 2026, pp. 83–94
-
[41]
M. Van Segbroeck, A. Zaid, K. Kutsenko, C. Huerta, T. Nguyen, X. Luo, B. Hoffmeister, J. Trmal, M. Omologo, and R. Maas, “DiPCo — dinner party corpus,” in Proceedings of the 21st Annual Conference of the International Speech Communication Association (INTERSPEECH 2020), Oct. 2020, pp. 434–436
-
[42]
J. Wei, K. Shuster, A. Szlam, J. Weston, J. Urbanek, and M. Komeili, “Multi-party chat: Conversational agents in group settings with humans and models,” arXiv preprint arXiv:2304.13835, Apr. 2023
-
[43]
K. K. Chang, M. H. Cramer, A. Ho, T. T. Nguyen, Y. Yuan, and D. Bamman, “Multimodal conversation structure understanding,” arXiv preprint arXiv:2505.17536, 2025
-
[44]
W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The AMI meeting corpus,” in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research (Measuring Behavior 2005), 2005
-
[45]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...
-
[46]
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025
-
[47]
J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3-omni technical report,” arXiv...
-
[48]
R. o. d. Akker and D. Traum, “A comparison of addressee detection methods for multiparty conversations,” in Proc. DiaHolmia 2009, 2009, pp. 99–106
-
[49]
J.-P. de Ruiter, H. Mitterer, and N. J. Enfield, “Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation,” Language, vol. 82, no. 3, pp. 515–535, 2006
-
[50]
M. Casillas and M. Frank, “The development of predictive processes in children’s discourse understanding,” in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 35, no. 35, 2013
-
[51]
I. Rish, “An empirical study of the naïve Bayes classifier,” Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001
-
[52]
A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002
-
[53]
R. Kruse, S. Mostaghim, C. Borgelt, C. Braune, and M. Steinbrecher, “Multi-layer perceptrons,” in Computational Intelligence: A Methodological Introduction, 2022, pp. 53–124
-
[54]
I. Steinwart and A. Christmann, Support Vector Machines, 2008
-
[55]
O. Kramer, “Scikit-learn,” in Machine Learning for Evolution Strategies, 2016, pp. 45–53
-
[56]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017
-
[57]
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, 2023, pp. 28 492–28 518
-
[58]
N. Jovanovic, R. o. den Akker, and A. Nijholt, “Addressee identification in face-to-face meetings,” in 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 169–176
-
[59]
H. Sacks, E. A. Schegloff, and G. Jefferson, “A simplest systematics for the organization of turn-taking for conversation,” Language, vol. 50, no. 4, pp. 696–735, 1974
-
[60]
C. Cox, R. Fusaroli, Y. A. Nielsen, S. Cho, R. Rocca, A. Simonsen, A. Knox, M. Lyons, M. Liberman, C. Cieri et al., “Social context matters for turn-taking dynamics: A comparative study of autistic and typically developing children,” Cognitive Science, vol. 49, no. 10, p. e70124, 2025
-
[61]
K. Onishi, H. Ohnaka, and K. Yoshino, “Modeling turn-taking speed and speaker characteristics,” in Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2025), 2025, pp. 21–31