pith. machine review for the scientific record.

arxiv: 2604.08125 · v2 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal reaction generation · polyadic interaction · speaking-listening generation · pose fusion · social cue encoding · body motion synthesis · group conversation modeling · online generation

The pith

PolySLGen generates contextually appropriate multimodal reactions including speech, body motion, and speaking state for a target participant in group interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a real-time system that produces speech, gestures, and speaking-listening decisions for one person in a multi-person conversation, drawing on the recent history of motions and words from everyone present. Existing methods handle only two-person exchanges or single output types and ignore the nonverbal signals that keep group talk coherent. A working version would let embodied AI join natural group conversations instead of staying limited to scripted one-on-one exchanges. The authors introduce a pose fusion module paired with a social cue encoder that together combine motion and social information across the whole group to drive the output.

Core claim

Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant that includes speech, body motion, and a speaking state score. The framework uses a pose fusion module and a social cue encoder to jointly aggregate motion and social signals from the entire group, yielding reactions that are more contextually appropriate and temporally coherent than those from adapted and state-of-the-art baselines on measures of motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.
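
Read as a contract, the claim fixes what the model consumes and what it must emit. A minimal sketch of that interface, under Pith's own naming (the dataclasses, fields, and shapes below are illustrative assumptions, not the authors' code):

# A hypothetical sketch of the task interface as Pith reads the core claim.
# All names, fields, and shapes are illustrative assumptions, not the authors' code.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ParticipantHistory:
    participant_id: str
    utterances: List[str]   # past transcribed speech, oldest first
    poses: np.ndarray       # (T, J, 3): T past frames of J body joints


@dataclass
class GeneratedReaction:
    speech: str             # generated utterance (may be empty when listening)
    motion: np.ndarray      # (T_future, J, 3): generated body motion
    speaking_state: float   # score in [0, 1]; high = speak, low = listen


def generate_reaction(history: List[ParticipantHistory], target_id: str) -> GeneratedReaction:
    """Map the recent multimodal history of all participants to the target
    participant's next speaking-or-listening reaction; the model fills this in."""
    raise NotImplementedError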

What carries the argument

The pose fusion module and social cue encoder, which together aggregate motion and social signals from the full group to drive the target participant's multimodal reaction.
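
One plausible reading of these two modules is cross-participant attention over pose features plus a small encoder for speaking and listening cues. The sketch below is written under that assumption; every class name, shape, and the shared projection are invented for illustration, and the speech-generation side is deliberately left out:

# Hedged sketch of how a pose fusion module and social cue encoder could
# aggregate group signals, as Pith reads the paper. Module names, shapes,
# and the attention-based fusion are assumptions, not the authors' code.
import torch
import torch.nn as nn


class PoseFusion(nn.Module):
    """Fuse per-participant pose features into a target-centric motion context."""

    def __init__(self, pose_dim: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, target_pose: torch.Tensor, group_poses: torch.Tensor) -> torch.Tensor:
        # target_pose: (B, T, pose_dim); group_poses: (B, P, T, pose_dim) for P participants
        B, P, T, D = group_poses.shape
        q = self.proj(target_pose)                        # (B, T, d_model)
        kv = self.proj(group_poses.reshape(B, P * T, D))  # (B, P*T, d_model)
        fused, _ = self.attn(q, kv, kv)                   # target attends to the whole group
        return fused                                      # (B, T, d_model)


class SocialCueEncoder(nn.Module):
    """Encode simple social cues, e.g. per-participant speaking indicators."""

    def __init__(self, n_cues: int, d_model: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_cues, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, cues: torch.Tensor) -> torch.Tensor:
        # cues: (B, T, n_cues) -> (B, T, d_model)
        return self.mlp(cues)


class ReactionHead(nn.Module):
    """Decode a combined context (e.g. fused poses plus encoded cues) into
    future motion and a speaking-state score. Speech generation, for instance
    via a language-model backbone, is omitted from this sketch."""

    def __init__(self, d_model: int = 256, pose_dim: int = 75):
        super().__init__()
        self.motion_out = nn.Linear(d_model, pose_dim)
        self.state_out = nn.Linear(d_model, 1)

    def forward(self, ctx: torch.Tensor):
        motion = self.motion_out(ctx)                           # (B, T, pose_dim)
        state = torch.sigmoid(self.state_out(ctx.mean(dim=1)))  # (B, 1): speak vs. listen
        return motion, state

The point of the sketch is only the data flow: the target's pose stream queries everyone else's, social cues are encoded separately, and the two contexts are combined before being decoded into motion and a speak-or-listen score.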

If this is right

  • Real-time multimodal reaction generation becomes feasible for polyadic rather than only dyadic interactions.
  • Generated body motion aligns more closely with accompanying speech than in prior single-modality or two-person systems.
  • Speaking state prediction improves, allowing the system to decide when the target should speak or listen.
  • Human observers rate the resulting reactions as more realistic than those from adapted baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion approach could be applied to virtual meeting avatars that must respond to multiple remote participants at once.
  • Extending the encoder to include facial micro-expressions or gaze direction would likely tighten the temporal coherence further.
  • Testing the model in live human-AI group games would reveal whether the online constraint still holds under unpredictable turn-taking.

Load-bearing premise

The pose fusion module and social cue encoder can combine motion and social signals from the whole group to produce coherent reactions without overfitting to particular datasets or interaction styles.

What would settle it

Running the model on a held-out dataset with different group sizes, cultural interaction norms, or speaking rates and checking whether motion-speech alignment and human realism scores drop below the reported baselines.
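
A minimal sketch of such a check, assuming hypothetical held-out splits keyed by group size and a placeholder alignment metric (none of these names come from the paper):

# Illustrative generalization check: score the model on held-out splits that
# differ in group size and flag splits whose average motion-speech alignment
# falls below a baseline. Metric, split names, and model.generate are assumed.
from statistics import mean


def motion_speech_alignment(sample) -> float:
    """Placeholder for a beat-alignment-style score in [0, 1]; swap in a real metric."""
    raise NotImplementedError


def generalization_report(model, splits: dict, baseline_scores: dict) -> dict:
    """Score each held-out split (e.g. {"3-person": [...], "5-person": [...]})
    and flag any split whose average alignment drops below its baseline."""
    report = {}
    for name, samples in splits.items():
        scores = [motion_speech_alignment(model.generate(s)) for s in samples]
        avg = mean(scores)
        report[name] = {
            "alignment": avg,
            "below_baseline": avg < baseline_scores.get(name, 0.0),
        }
    return report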

Figures

Figures reproduced from arXiv: 2604.08125 by Jouh Yeong Chew, Thomas Markhorst, Xucong Zhang, Zhi-Yi Lin.

Figure 1: Overview of the online reaction generation task in …
Figure 2: Overview of PolySLGen. Past multimodal group interactions are encoded into text …
Figure 3: Visual comparison between PolySLGen and SOLAMI. The first column presents speaking reactions, and the second col…
Figure 4: Transcription workflow. Utterances are extracted by …
Figure 5: Three examples of the conversational chunks extracted from the DnD Gesture Dataset …
Figure 6: Multimodal instruction template for polyadic interac…
read the original abstract

Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces PolySLGen, an online framework for generating multimodal (speech, body motion, speaking state) speaking and listening reactions for a target participant in polyadic group interactions. It takes past conversation and motion from all participants as input and proposes a pose fusion module together with a social cue encoder to jointly aggregate motion and social signals across the group. The central claim, backed by experiments the authors describe as extensive, is that the model produces contextually appropriate and temporally coherent reactions that outperform adapted and state-of-the-art baselines on motion quality, motion-speech alignment, speaking-state prediction, and human-perceived realism.

Significance. If the central claims hold, the work would address a clear gap in existing dyadic or single-modality reaction-generation methods and could enable more realistic embodied AI for group conversations. The proposed group-aggregation modules represent a potentially useful technical direction, but the absence of verifiable experimental details, ablations, or generalization tests in the manuscript prevents a firm assessment of whether the gains truly stem from the proposed components rather than capacity or training choices.

major comments (2)
  1. [Abstract] The claim that PolySLGen 'outperforms several adapted and state-of-the-art baselines' on multiple metrics is load-bearing for the paper's contribution, yet the abstract (and the provided manuscript excerpt) supplies no quantitative tables, data splits, ablation studies, or cross-dataset results. Without these, it is impossible to verify that the pose fusion module and social cue encoder are responsible for the reported coherence and generalization rather than other factors.
  2. [Abstract] The central modeling assumption—that the pose fusion module and social cue encoder jointly capture general polyadic interaction dynamics—remains untested in the presented material. No evidence of held-out interaction styles, cross-dataset evaluation, or module-specific ablations isolating their contribution to temporal coherence or motion-speech alignment is supplied, leaving open the possibility that reported gains are dataset-specific.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the need for clearer experimental details and validation of the proposed modules. We address each major comment below. The full manuscript contains quantitative results, ablations, and generalization tests, but we have revised the abstract and expanded relevant sections to make these more immediately verifiable.

read point-by-point responses
  1. Referee: [Abstract] The claim that PolySLGen 'outperforms several adapted and state-of-the-art baselines' on multiple metrics is load-bearing for the paper's contribution, yet the abstract (and the provided manuscript excerpt) supplies no quantitative tables, data splits, ablation studies, or cross-dataset results. Without these, it is impossible to verify that the pose fusion module and social cue encoder are responsible for the reported coherence and generalization rather than other factors.

    Authors: Abstracts are length-constrained and conventionally omit tables or specific numbers; these appear in the manuscript body. Section 4 reports quantitative comparisons to baselines on motion quality, motion-speech alignment, speaking-state prediction, and human realism. Data splits are described in Section 3.1. Module ablations isolating the pose fusion and social cue encoder contributions appear in Section 4.4. We have revised the abstract to summarize key gains and added explicit cross-references to tables and sections for easier verification. revision: partial

  2. Referee: [Abstract] The central modeling assumption—that the pose fusion module and social cue encoder jointly capture general polyadic interaction dynamics—remains untested in the presented material. No evidence of held-out interaction styles, cross-dataset evaluation, or module-specific ablations isolating their contribution to temporal coherence or motion-speech alignment is supplied, leaving open the possibility that reported gains are dataset-specific.

    Authors: The manuscript already contains module-specific ablations (Section 4.4) that isolate the pose fusion module and social cue encoder, showing their impact on temporal coherence and motion-speech alignment metrics. Held-out evaluation on interaction styles is performed via splits by group size and conversation context within the dataset. Cross-dataset testing is limited by the lack of comparable public polyadic multimodal datasets; we have added explicit discussion of this limitation and the within-dataset generalization results. We have expanded the ablation and generalization subsections to make the evidence more prominent. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation relies on standard ML training and evaluation without self-referential reduction

full rationale

The paper describes a standard encoder-decoder style architecture with proposed pose fusion and social cue encoder modules trained end-to-end on interaction data to generate reactions. No equations or claims reduce a prediction or uniqueness result to a fitted parameter or self-citation by construction. The abstract and available description frame the work as building on conventional aggregation techniques for multimodal signals, with performance claims resting on empirical comparisons to baselines rather than definitional equivalence. No load-bearing self-citation chains, ansatz smuggling, or renaming of known results appear in the provided text. The central claims remain falsifiable via held-out data and human evaluations, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The framework appears to rest on standard assumptions of neural generative models for motion and speech synthesis.

pith-pipeline@v0.9.0 · 5488 in / 1094 out tokens · 142126 ms · 2026-05-10T18:23:37.131203+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  2. [2] Samer Al Moubayed, Jonas Beskow, Gabriel Skantze, and Björn Granström. Furhat: a back-projected human-like robot head for multiparty human-machine interaction. In Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, February 21-26, 2011, Revised Selected Papers, pages 114-130. Springer, 2012.
  3. [3] Gene M Alarcon, Anthony M Gibson, Sarah A Jessup, and August Capiola. Exploring the differential effects of trust violations in human-human and human-robot interactions. Applied Ergonomics, 93:103350, 2021.
  4. [4] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716-23736.
  5. [5] Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, and Shinji Watanabe. Talking turns: Benchmarking audio foundation models on turn-taking dynamics. ICLR.
  6. [6] Ambra Bisio, Alessandra Sciutti, Francesco Nori, Giorgio Metta, Luciano Fadiga, Giulio Sandini, and Thierry Pozzo. Motor contagion during human-human and human-robot interaction. PLoS ONE, 9(8):e106172, 2014.
  7. [7] Hervé Bredin and Antoine Laurent. End-to-end speaker segmentation for overlap-aware resegmentation. arXiv preprint arXiv:2104.04045, 2021.
  8. [8] Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, et al. Digital Life Project: Autonomous 3D characters with social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 582-592, 2024.
  9. [9] Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Ready-to-react: Online reaction policy for two-character interaction generation. arXiv preprint arXiv:2502.20370, 2025.
  10. [10] Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3D human motion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6200-6211, 2025.
  11. [11] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505-1518, 2022.
  12. [12] Abhinav Dahiya, Alexander M. Aroyo, Kerstin Dautenhahn, and Stephen L. Smith. A survey of multi-agent human-robot interaction systems. Robotics and Autonomous Systems, 161:104335, 2023.
  13. [13] Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, and Michael J Black. ChatPose: Chatting about 3D human pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2093-2103.
  14. [14] Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. ReMoS: 3D motion-conditioned reaction synthesis for two-person interactions. In European Conference on Computer Vision (ECCV), 2024.
  15. [15] Kehong Gong, Dongze Lian, Heng Chang, Chuan Guo, Zihang Jiang, Xinxin Zuo, Michael Bi Mi, and Xinchao Wang. TM2D: Bimodality driven 3D dance generation via music-text integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9942-9952, 2023.
  16. [16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  17. [17] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3D human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152-5161, 2022.
  18. [18] Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. In European Conference on Computer Vision, pages 580-597. Springer, 2022.
  19. [19] Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. OneLLM: One framework to align all modalities with language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26584-26595, 2024.

  20. [20] Seokhyeon Hong, Chaelin Kim, Serin Yoon, Junghyun Nam, Sihun Cha, and Junyong Noh. SALAD: Skeleton-aware latent diffusion for text-driven motion generation and editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7158-7168, 2025.
  21. [21] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  22. [22] Ryo Ishii, Xutong Ren, Michal Muszynski, and Louis-Philippe Morency. Trimodal prediction of speaking and listening willingness to help improve turn-changing modeling. Frontiers in Psychology, 13:774547, 2022.
  23. [23] Deok-Kyeong Jang, Soomin Park, and Sung-Hee Lee. Motion Puzzle: Arbitrary motion style transfer by body part. ACM Transactions on Graphics (TOG), 41(3):1-16, 2022.
  24. [24] Jaewoo Jeong, Daehee Park, and Kuk-Jin Yoon. Multi-agent long-term 3D human pose forecasting via interaction-aware trajectory conditioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1617-1628, 2024.
  25. [25] Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. SOLAMI: Social vision-language-action modeling for immersive interaction with 3D autonomous characters. arXiv preprint arXiv:2412.00174.
  27. [27] Wen Jiang, Nikos Kolotouros, Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Coherent reconstruction of multiple humans from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579-5588, 2020.
  28. [28] Kristiina Jokinen, Hirohisa Furukawa, Masafumi Nishida, and Seiichi Yamamoto. Gaze and turn-taking behavior in casual conversational interactions. ACM Transactions on Interactive Intelligent Systems (TiiS), 3(2):1-30, 2013.
  29. [29] Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, pages 1-8, 2020.
  30. [30] Hanbyul Joo, Tomas Simon, Mina Cikara, and Yaser Sheikh. Towards social artificial intelligence: Nonverbal social signal prediction in a triadic interaction. In CVPR, 2019.
  31. [31] Michelle R Kandalaft, Nyaz Didehbani, Daniel C Krawczyk, Tandra T Allen, and Sandra B Chapman. Virtual reality social cognition training for young adults with high-functioning autism. Journal of Autism and Developmental Disorders, 43:34-44, 2013.
  32. [32] Kobin H Kendrick, Judith Holler, and Stephen C Levinson. Turn-taking in human face-to-face interaction is multimodal: gaze direction and manual gestures aid the coordination of turn transitions. Philosophical Transactions of the Royal Society B, 378(1875):20210473, 2023.
  33. [33] Boeun Kim, Jungho Kim, Hyung Jin Chang, and Jin Young Choi. MoST: Motion style transformer between diverse action contents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1705-1714, 2024.
  34. [34] Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 763-772.
  35. [35] Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, and James M Rehg. Modeling multimodal social interactions: new challenges and baselines with densely aligned representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14585-14595, 2024.
  36. [36] Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, and James M. Rehg. Modeling multimodal social interactions: New challenges and baselines with densely aligned representations. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14585-14595, 2024.
  37. [37] Julian Leff, Geoffrey Williams, Mark A Huckvale, Maurice Arbuthnot, and Alex P Leff. Computer-assisted therapy for medication-resistant auditory hallucinations: proof-of-concept study. The British Journal of Psychiatry, 202(6):428-433, 2013.
  38. [38] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730-19742. PMLR, 2023.
  39. [39] Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M Rehg, and Yapeng Tian. Towards online multi-modal social interaction understanding. arXiv preprint arXiv:2503.19851, 2025.
  40. [40] Yinghao Aaron Li, Cong Han, Vinay Raghavan, Gavin Mischler, and Nima Mesgarani. StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models. NeurIPS, 36:19594-19621, 2023.

  41. [41] Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In European Conference on Computer Vision, pages 612-630. Springer, 2022.
  42. [42] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
  43. [43] Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, and Changxing Ding. Towards variable and coordinated holistic co-speech motion generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1566-1576, 2024.
  44. [44] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  45. [45] Mingshuang Luo, Ruibing Hou, Zhuo Li, Hong Chang, Zimo Liu, Yaowei Wang, and Shiguang Shan. M3GPT: An advanced multimodal, multitask framework for motion comprehension and generation. Advances in Neural Information Processing Systems, 37:28051-28077, 2024.
  46. [46] Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, Jake Sandakly, Julia Buffalini, Neham Jain, Steven Krenn, Moneish Kumar, Dejan Markovic, Evonne Ng, Fabian Prada, Andrew Saba, Siwei Zhang, Vasu Agrawal, Tim Godisart, Alexander Richard, and Mic... Embody 3D: A large-scale multimodal motion and behavior dataset, 2025.
  47. [47] Andrew Cameron Morris, Viktoria Maier, and Phil D Green. From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. In Interspeech, pages 2765-2768, 2004.
  48. [48] Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. ConvoFusion: Multi-modal conversational diffusion for co-speech gesture synthesis. In CVPR, pages 1388-1398, 2024.
  49. [49] Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion. In CVPR, pages 20395-20405, 2022.
  50. [50] Eley Ng, Ziang Liu, and Monroe Kennedy. It takes two: Learning to plan for human-robot cooperative carrying. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7526-7532. IEEE, 2023.
  51. [51] Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, and Shiry Ginosar. Can language models learn to listen? In ICCV, 2023.
  52. [52] Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In CVPR, pages 1001-1010, 2024.
  53. [53] Curtis G Northcutt, Shengxin Zha, Steven Lovegrove, and Richard Newcombe. EgoCom: A multi-person multi-modal egocentric communications dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):6783-6793, 2020.
  54. [54] Catharine Oertel, Ginevra Castellano, Mohamed Chetouani, Jauwairia Nasir, Mohammad Obaid, Catherine Pelachaud, and Christopher Peters. Engagement in human-agent interaction: An overview. Frontiers in Robotics and AI, Volume 7, 2020.
  55. [55] OpenAI. ChatGPT (May 6 version). https://chat.openai.com/, 2025.
  56. [56] Jeongeun Park, Sungjoon Choi, and Sangdoo Yun. A unified framework for motion reasoning and generation in human interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10698-10707, 2025.
  57. [57] Xiaogang Peng, Siyuan Mao, and Zizhao Wu. Trajectory-aware body interaction transformer for multi-person pose forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17121-17130, 2023.
  58. [58] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492-28518. PMLR, 2023.
  59. [59] Chirag Raman, Jose Vargas Quiros, Stephanie Tan, Ashraful Islam, Ekin Gedik, and Hayley Hung. ConfLab: A data collection concept, dataset, and benchmark for machine analysis of free-standing social interactions in the wild. Advances in Neural Information Processing Systems, 35:23701-23715.
  60. [60] Marlou Rasenberg, Wim Pouw, Asli Özyürek, and Mark Dingemanse. The multimodal nature of communicative efficiency in social interaction. Scientific Reports, 12, 2022.

  61. [61] Davis Rempe, Leonidas J Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, and Jimei Yang. Contact and human dynamics from monocular video. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part V 16, pages 71-87. Springer.
  62. [62] Jose Ribeiro-Gomes, Tianhui Cai, Zoltán A Milacski, Chen Wu, Aayush Prakash, Shingo Takagi, Amaury Aubel, Daeil Kim, Alexandre Bernardino, and Fernando De La Torre. MotionGPT: Human motion synthesis with improved diversity and realism via GPT-3 prompting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5070-50...
  63. [63] Sam O'Connor Russell and Naomi Harte. Visual cues enhance predictive turn-taking for two-party human interaction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 209-221, 2025.
  64. [64] Maksim Siniukov, Di Chang, Minh Tran, Hongkun Gong, Ashutosh Chaubey, and Mohammad Soleymani. DiTaiListener: Controllable high fidelity listener video generation with diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11991-12001.
  65. [65] Li Siyao, Tianpei Gu, Zhitao Yang, Zhengyu Lin, Ziwei Liu, Henghui Ding, Lei Yang, and Chen Change Loy. Duolando: Follower GPT with off-policy reinforcement learning for dance accompaniment. In ICLR, 2024.
  66. [66] Micol Spitale, Maria Teresa Parreira, Maia Stiber, Minja Axelsson, Neval Kara, Garima Kankariya, Chien-Ming Huang, Malte Jung, Wendy Ju, and Hatice Gunes. ERR@HRI 2024 challenge: Multimodal detection of errors and failures in human-robot interactions. In Proceedings of the 26th International Conference on Multimodal Interaction, pages 652-656, New York,...
  67. [67] stable-ts. https://github.com/jianfch/stable-ts. Accessed: 2025-05-15.
  68. [68] Julian Tanke, Oh-Hun Kwon, Felix B Mueller, Andreas Doering, and Juergen Gall. Humans in Kitchens: a dataset for multi-person human motion forecasting with scene context. Advances in Neural Information Processing Systems, 36:10184-10196, 2023.
  69. [69] Julian Tanke, Linguang Zhang, Amy Zhao, Chengcheng Tang, Yujun Cai, Lezi Wang, Po-Chen Wu, Juergen Gall, and Cem Keskin. Social Diffusion: Long-term multiple human motion anticipation. In ICCV, pages 9601-9611, 2023.
  70. [70] Rim Trabelsi, Jagannadan Varadarajan, Le Zhang, Issam Jabri, Yong Pei, Fethi Smach, Ammar Bouallegue, and Pierre Moulin. Understanding the dynamics of social interactions: A multi-modal multi-view approach. ACM Trans. Multimedia Comput. Commun. Appl., 15(1s), 2019.
  71. [71] Muhammad Umair, Vasanth Sarathy, and JP de Ruiter. Large language models know what to say but not when to speak. arXiv preprint arXiv:2410.16044, 2024.
  72. [72] Jacqueline Urakami and Katie Seaborn. Nonverbal cues in human-robot interaction: A communication studies perspective. J. Hum.-Robot Interact., 12(2), 2023.
  73. [73] Edward Vendrow, Duy Tho Le, Jianfei Cai, and Hamid Rezatofighi. JRDB-Pose: A large-scale dataset for multi-person pose estimation and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4811-4820, 2023.
  74. [74] Daniela Vogel, Marco Meyer, and Sigrid Harendza. Verbal and non-verbal communication skills including empathy during history taking of undergraduate medical students. BMC Medical Education, 18, 2018.
  75. [75] Evdokia Voultsiou and Lefteris Moussiades. A systematic review of AI, VR, and LLM applications in special education: Opportunities, challenges, and future directions. Education and Information Technologies, pages 1-41, 2025.
  76. [76] Aldert Vrij, Maria Hartwig, and Pär Anders Granhag. Reading lies: Nonverbal communication and deception. Annual Review of Psychology, 70:295-317, 2019.
  77. [77] Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Move as you say interact as you can: Language-guided human motion generation with scene affordance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 433-444, 2024.
  78. [78] John M Wiemann and Mark L Knapp. Turn-taking in conversations. Communication Theory, pages 226-245, 2017.
  79. [79] J.P. Woodard and J.T. Nelson. An information theoretic measure of speech recognition performance. Workshop on Standardisation for Speech I/O Technology, 1982.
  80. [80] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM. In Forty-first International Conference on Machine Learning, 2024.

Showing first 80 references.