Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

Chloé Clavel; Lucie Galland; Magalie Ochs

arxiv: 2605.19798 · v1 · pith:KHZDVY6Unew · submitted 2026-05-19 · 💻 cs.CL

Pith reviewed 2026-05-20 06:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords socially interactive agents, trust calibration, large language models, multimodal behavior generation, gender stereotypes, ability and benevolence, trustworthiness dimensions

The pith

Large language models can generate coherent multimodal behaviors reflecting different levels of ability and benevolence for social agents, while reproducing gender stereotypes when gender is specified.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores using large language models to automatically create behaviors for socially interactive agents that vary in ability and benevolence, two core dimensions of trustworthiness. This matters because calibrated trust could help users interact with agents at appropriate levels of reliance instead of over-trusting or under-using them. The authors test a prompting method to produce aligned outputs in text, intonation, facial expression, and gesture. Analysis of generated data and a user study show the behaviors match theoretical expectations and intended trait levels. The work also reveals that adding gender to prompts causes the models to link male agents with high ability and female agents with high benevolence.

Core claim

GPT-5.4 produces coherent multimodal behaviors across text, intonation, facial expression, and gesture that align with theoretical expectations for ability and benevolence. Random Forest feature importance confirms this alignment. When gender is specified in prompts, the outputs reproduce societal stereotypes, associating male agents with high ability and female agents with high benevolence. A within-subjects user study on Prolific confirms that participants perceive different levels of ability and benevolence in line with the prompt instructions.

What carries the argument

A prompt-based method for automatically generating multimodal behaviors aligned with specific levels of ability and benevolence, which produces outputs in verbal, vocal, gestural, and facial modalities.
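As a rough illustration of what such a prompt-based setup might look like, the sketch below enumerates a hypothetical condition grid (ability level, benevolence level, optional gender) and renders one instruction string per condition. The level names, the wording inside `render_prompt`, and the grid itself are assumptions made for illustration; the paper's actual template is only partially reproduced on this page.

```python
from itertools import product

# Hypothetical level names; the paper's actual scale is not given on this page.
LEVELS = ("low", "high")
# None stands for a prompt that does not mention gender.
GENDERS = (None, "male", "female")


def render_prompt(ability, benevolence, gender=None):
    """Render an illustrative instruction for one condition."""
    agent = "an agent" if gender is None else f"a {gender} agent"
    return (
        f"Generate a short multimodal transcript for {agent} "
        f"with {ability} ability and {benevolence} benevolence. "
        "Annotate the text with intonation, facial expression, and gesture."
    )


def condition_grid():
    """Enumerate every (ability, benevolence, gender) combination."""
    return [
        {"ability": a, "benevolence": b, "gender": g,
         "prompt": render_prompt(a, b, g)}
        for a, b, g in product(LEVELS, LEVELS, GENDERS)
    ]


if __name__ == "__main__":
    for condition in condition_grid():
        print(condition["prompt"])
```

Keeping the gender-free condition in the same grid makes it possible to compare gendered outputs against an unmarked baseline rather than only against each other.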

If this is right

Multimodal behaviors generated this way could support trust calibration in socially interactive agents.
LLMs can control specific trustworthiness dimensions through targeted prompting across modalities.
Including gender in prompts for behavior generation leads to stereotypical associations in the outputs.
User perceptions of the generated behaviors match the designed levels of ability and benevolence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the method to other traits or contexts could support more varied and context-appropriate agent personalities.
The gender stereotype pattern points to the value of testing debiasing prompts or post-processing steps before deployment.
Real-world interaction tests would reveal whether these generated behaviors actually produce better-calibrated trust and usage decisions.

Load-bearing premise

The prompts to the LLMs can isolate and control the intended levels of ability and benevolence without other uncontrolled factors shaping the outputs or how people perceive them.

What would settle it

A follow-up analysis or user study in which the generated behaviors show no statistical alignment with theoretical expectations for ability and benevolence, or in which participants fail to perceive the intended trait differences.

Figures

Figures reproduced from arXiv: 2605.19798 by Chloé Clavel, Lucie Galland, Magalie Ochs.

**Figure 1.** First, we present our method to generate, from LLMs, ...

**Figure 2.** Most important features in random forest classification of ability and benevolence levels. Features are ranked by ...

**Figure 3.** Most important features in Random Forest classification of gender for male-generated behaviors (symmetrical patterns ...

Original abstract

As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs can generate coherent multimodal behaviors aligned with ability and benevolence levels per their analysis, but specifying gender triggers stereotypical outputs and the prompt controls may not be as clean as claimed.

Hi, the core result is that GPT-5.4 produces multimodal outputs (text, intonation, face, gesture) that a Random Forest flags as matching theoretical ability and benevolence dimensions, and a Prolific user study shows participants rate them in line with the prompt instructions. When gender is added to the prompt the model shifts toward male-high-ability and female-high-benevolence patterns. That is the main thing to take away.

The work is new in the narrow sense that it tries to automate generation of trust-calibrated behaviors across four modalities instead of scripting them by hand. They built a sizable generated dataset and ran a within-subjects human validation, which is a reasonable first empirical step for the SIA trust-calibration angle. The gender-stereotype finding is straightforward and worth noting for anyone building these agents.

The soft spot is exactly the one the stress-test flags. The prompts explicitly tell the model the target trait levels, so the Random Forest feature importance could simply be recovering those explicit instructions or other correlated prompt language rather than showing that the model has independently mapped the traits to behavior. If that is happening, the claim of theoretical alignment and the cross-modal coherence both rest on weaker ground. The paper would benefit from clearer ablation of the prompt wording and from reporting how much variance the intended trait labels actually explain versus other prompt elements.

This is aimed at researchers working on human-AI trust and socially interactive agents who want concrete generation examples rather than a finished framework. It is a solid enough empirical sketch to deserve referee time, even though the controls need tightening before it would be ready for a top venue.

Referee Report

3 major / 3 minor

Summary. The paper explores using LLMs (specifically GPT-5.4) to generate multimodal behaviors (text, intonation, facial expression, gesture) for socially interactive agents that reflect varying levels of ability and benevolence. It proposes a prompt-based generation method, analyzes a large dataset of outputs with Random Forest feature importance to claim alignment with trustworthiness theory, reports that gender-specified prompts reproduce societal stereotypes (male agents high-ability, female high-benevolence), and validates via a within-subjects Prolific user study that participants perceive the intended trait levels.

Significance. If the central claims hold after addressing confounds, the work offers a practical step toward automated trust calibration in SIAs and surfaces important gender-bias issues in LLM multimodal generation. The empirical pipeline combining large-scale generation, feature analysis, and human judgment data provides a replicable template, though its value hinges on demonstrating that outputs are driven by the targeted trait dimensions rather than prompt artifacts.

major comments (3)

[Methods (prompt-based generation)] The prompt-based generation method (described in the methods for aligning behaviors with specific ability/benevolence levels) assumes prompts can isolate these traits independently. However, without explicit controls or ablation tests for correlated prompt wording, default model biases, or implicit cross-modal consistency rules, the Random Forest alignment may reflect prompt artifacts rather than theoretical mapping, undermining both the coherence claim and the gender-stereotype interpretation.
[Results (Random Forest analysis)] In the Random Forest feature importance analysis, the reported alignment with ability and benevolence theory lacks detail on the exact multimodal features extracted, baseline comparisons (e.g., neutral or random prompts), or cross-validation metrics. This makes it impossible to assess whether importance scores genuinely track the intended dimensions or simply capture surface-level prompt elements.
[User Study] The user study section reports that participant perceptions align with intended instructions but provides no sample size, statistical tests, effect sizes, or confidence intervals. These omissions prevent evaluation of whether the within-subjects design reliably validates the generation method or merely shows weak directional trends.
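One lightweight way to probe the second concern, whether a feature genuinely separates trait levels or only appears to, is a label-permutation baseline. The sketch below is not the paper's Random Forest analysis; it is a simpler per-feature check, written with the standard library only, that compares the observed high-versus-low difference in one feature against the differences obtained after shuffling the labels. The feature values and the "high"/"low" label names are invented for illustration.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Mean feature value for 'high' transcripts minus mean for 'low' ones."""
    high = [v for v, lab in zip(values, labels) if lab == "high"]
    low = [v for v, lab in zip(values, labels) if lab == "low"]
    return mean(high) - mean(low)


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the high-vs-low mean difference."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Invented counts of a hypothetical feature (e.g. hedging words per transcript).
    values = [0, 1, 0, 1, 0, 1, 4, 5, 4, 5, 4, 5]
    labels = ["high"] * 6 + ["low"] * 6
    print(permutation_p_value(values, labels))
```

A small p-value here only says the feature tracks the label; it does not rule out that the label is recoverable from explicit prompt wording, which is the point the first major comment raises.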

minor comments (3)

[Abstract] The abstract contains a grammatical error: 'Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions' should be rephrased for clarity.
The model is referred to as 'GPT-5.4'; clarify whether this is a hypothetical future model, a specific fine-tuned variant, or a typographical reference to an existing GPT-4 variant to avoid reader confusion.
[Results] Figure or table captions for the Random Forest results should explicitly list the top features per modality and their importance scores to improve interpretability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where additional rigor and transparency will strengthen our claims about LLM-generated multimodal behaviors for trust calibration in SIAs. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Methods (prompt-based generation)] The prompt-based generation method (described in the methods for aligning behaviors with specific ability/benevolence levels) assumes prompts can isolate these traits independently. However, without explicit controls or ablation tests for correlated prompt wording, default model biases, or implicit cross-modal consistency rules, the Random Forest alignment may reflect prompt artifacts rather than theoretical mapping, undermining both the coherence claim and the gender-stereotype interpretation.

Authors: We agree that the absence of explicit ablation tests leaves open the possibility that observed alignments partly reflect prompt artifacts or model biases rather than a clean mapping to ability and benevolence. In the revised manuscript we will add a dedicated ablation subsection that (a) systematically varies prompt phrasing while holding trait levels constant, (b) compares outputs against neutral and random-prompt baselines, and (c) examines cross-modal consistency rules. These analyses will be used to qualify both the coherence results and the gender-stereotype findings, making clear which effects persist after controlling for surface-level prompt elements. revision: yes
Referee: [Results (Random Forest analysis)] In the Random Forest feature importance analysis, the reported alignment with ability and benevolence theory lacks detail on the exact multimodal features extracted, baseline comparisons (e.g., neutral or random prompts), or cross-validation metrics. This makes it impossible to assess whether importance scores genuinely track the intended dimensions or simply capture surface-level prompt elements.

Authors: We acknowledge that the current description of the Random Forest analysis is insufficiently detailed for readers to evaluate the source of the reported feature importances. The revised version will expand this section to list all extracted multimodal features (lexical, prosodic, facial, and gestural descriptors), include explicit baseline comparisons with neutral and random-prompt conditions, and report 5-fold cross-validation performance together with stability metrics for the importance rankings. These additions will allow direct assessment of whether the importance scores reflect the targeted theoretical dimensions. revision: yes
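One way to make "stability metrics for the importance rankings" concrete is the mean pairwise Spearman correlation between the per-fold importance vectors. The sketch below is a stdlib-only illustration under that assumption; the function names and the toy importance values are not taken from the paper, and a constant importance vector would need separate handling because its rank variance is zero.

```python
from itertools import combinations
from statistics import mean


def ranks(scores):
    """Rank scores from 1 (largest) downward, averaging ranks over ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))


def ranking_stability(importances_per_fold):
    """Mean pairwise Spearman correlation across folds."""
    pairs = combinations(importances_per_fold, 2)
    return mean(spearman(a, b) for a, b in pairs)


if __name__ == "__main__":
    # Invented importance scores for three features over three folds.
    folds = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.45, 0.35, 0.2]]
    print(ranking_stability(folds))
```

A value near 1 indicates that the folds agree on the ordering of features; values near 0 would suggest the reported top features depend heavily on the particular split.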
Referee: [User Study] The user study section reports that participant perceptions align with intended instructions but provides no sample size, statistical tests, effect sizes, or confidence intervals. These omissions prevent evaluation of whether the within-subjects design reliably validates the generation method or merely shows weak directional trends.

Authors: We thank the referee for noting these reporting gaps. The revised manuscript will supply the exact sample size, the statistical tests performed (repeated-measures ANOVA or paired comparisons), effect sizes, and 95% confidence intervals for the key contrasts. These additions will enable readers to judge the reliability and magnitude of the alignment between intended trait levels and participant perceptions. revision: yes
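A paired comparison of this kind could be summarized along the following lines. This is a minimal sketch, assuming each participant rated both a high-trait and a low-trait behavior on the same scale. The ratings are invented, Cohen's d_z is only one of several possible effect sizes, and the interval uses a normal approximation where a t quantile would be more appropriate for a small sample.

```python
from statistics import NormalDist, mean, stdev


def paired_summary(high_ratings, low_ratings, confidence=0.95):
    """Summarize per-participant differences between two conditions."""
    if len(high_ratings) != len(low_ratings):
        raise ValueError("each participant needs one rating per condition")
    diffs = [h - l for h, l in zip(high_ratings, low_ratings)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    # Cohen's d_z: mean difference scaled by the SD of the differences.
    d_z = mean_diff / sd_diff
    # Normal approximation to the sampling distribution of the mean difference.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd_diff / n ** 0.5
    return {
        "n": n,
        "mean_diff": mean_diff,
        "d_z": d_z,
        "ci": (mean_diff - half_width, mean_diff + half_width),
    }


if __name__ == "__main__":
    # Invented ratings from six hypothetical participants.
    high = [5, 6, 7, 6, 5, 7]
    low = [3, 4, 4, 5, 3, 4]
    print(paired_summary(high, low))
```

Reporting the interval alongside the effect size lets readers distinguish a reliable perceptual difference from a weak directional trend, which is the distinction the referee asks for.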

Circularity Check

0 steps flagged

No significant circularity in empirical LLM behavior generation study

The paper is an empirical study that generates multimodal behavior transcripts via LLM prompts specifying ability/benevolence levels, analyzes them with Random Forest feature importance, and validates via a human user study on Prolific. No equations, derivations, or mathematical claims exist. No self-citation chains or ansatzes reduce any result to its own inputs by construction. The central claims rest on external human judgments and theoretical expectations rather than internal fitting or renaming, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities are introduced in the abstract; the work builds on established concepts in trust research and LLM capabilities.

axioms (1)

domain assumption Ability and benevolence are key dimensions of trustworthiness
Invoked as the basis for generating behaviors reflecting varying levels of these traits.

pith-pipeline@v0.9.0 · 5768 in / 1374 out tokens · 70121 ms · 2026-05-20T06:33:26.084777+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose a novel method for automatically generating behaviors aligned with specific levels of these traits... Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

A. Prompt template

Role: You are a High-Fidelity Multimodal Persona Engine. You specialize in translating psychological frameworks into synchronized verbal and...

Tag Syntax & Placement.
Audio Tags: [tag]. Place immediately before or after the dialogue segment. Focus only on vocal delivery or non-verbal vocal sounds.
Facial Tags: f: expression. Place at the exact moment the facial expression should trigger.
Gesture Tags: g: gesture. Place at the exact moment the physical movement should begin.
Emphasis: Use CAPITA...

Approved Tag Lists. [List of approved tags]

Workflow.
Analyze Personality: Read the Ability scores.
Create the text: Match the text’s with the provided intention and ability score and oral style. The text is going to be read
Keep the text short: 3 sentences at most
Apply Facial/Gesture Tags: Insert f: and g: tags where the movement naturally starts.
Apply Audio Tags: Insert [] tags to guide the voice ...

Other fragments recovered from the elided portions of the template: "the extent to which a trustee is believed to want to do good to the trustor, aside from an egocentric profit motive"; "He sighed,"; "Gesture name".
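The template's audio tags use a square-bracket syntax (`[tag]`), which makes them easy to count when turning generated transcripts into features. The sketch below covers only that bracketed form; how facial (`f:`) and gesture (`g:`) tags are delimited is not recoverable from the truncated template, so they are deliberately left out. The function names and the example tag names are assumptions for illustration, not items from the paper's approved tag list.

```python
import re
from collections import Counter

# Matches the square-bracket audio tags named in the template, e.g. "[sighs]".
AUDIO_TAG = re.compile(r"\[([^\[\]]+)\]")


def audio_tag_counts(transcript):
    """Count each audio tag in a transcript, ignoring case and padding."""
    return Counter(tag.strip().lower() for tag in AUDIO_TAG.findall(transcript))


def strip_audio_tags(transcript):
    """Return the spoken text with audio tags removed and spaces collapsed."""
    return " ".join(AUDIO_TAG.sub(" ", transcript).split())


if __name__ == "__main__":
    # Invented example; the tag names are not taken from the approved list.
    sample = "[sighs] I am not sure this will work. [hesitant] Maybe we can try."
    print(audio_tag_counts(sample))
    print(strip_audio_tags(sample))
```

Per-tag counts of this kind are one plausible input to a feature-importance analysis, alongside lexical features computed on the stripped text.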

[1] [1]

Naeimeh Anzabi and Hiroyuki Umemuro. 2023. Effect of different listening behav- iors of social robots on perceived trust in human-robot interactions.International Journal of Social Robotics15, 6 (2023), 931–951

work page 2023

[2] [2]

Marjorie Armando, Magalie Ochs, and Isabelle Régner. 2022. The impact of pedagogical agents’ gender on academic learning: A systematic review.Frontiers in Artificial Intelligence5 (2022), 862997

work page 2022

[3] Agnes Axelsson and Gabriel Skantze. 2022. Multimodal user feedback during adaptive robot-human presentations. Frontiers in Computer Science 3 (2022), 741148.

[4] Daniel Balliet and Paul AM Van Lange. 2013. Trust, conflict, and cooperation: a meta-analysis. Psychological bulletin 139, 5 (2013), 1090.

[5] Shreyas Bhat, Joseph B Lyons, Cong Shi, and X Jessie Yang. 2024. Value alignment and trust in human-robot interaction: Insights from simulation and user study. In Discovering the frontiers of human-robot interaction: Insights and innovations in collaboration, communication, and control. Springer, 39–63.

[6] Beatrice Biancardi, Angelo Cafaro, and Catherine Pelachaud. 2017. Analyzing first impressions of warmth and competence from observable nonverbal cues in expert-novice interactions. In Proceedings of the 19th ACM international conference on multimodal interaction. 341–349.

[7] Christina Breuer, Joachim Hüffmeier, and Guido Hertel. 2016. Does trust matter more in virtual teams? A meta-analysis of trust and team effectiveness considering virtuality and documentation as moderators. Journal of Applied Psychology 101, 8 (2016), 1151.

[8] Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2015. The role of social media in affective trust building in customer–supplier relationships. Electronic Commerce Research 15, 4 (2015), 453–482.

[9] Maureen A Craig and Galen V Bodenhausen. 2018. Category (non) fit modulates extrapolative stereotyping of multiply categorizable social targets. Social Cognition 36, 5 (2018), 559–588.

[10] Bart A De Jong, Kurt T Dirks, and Nicole Gillespie. 2016. Trust and team performance: A meta-analysis of main effects, moderators, and covariates. Journal of applied psychology 101, 8 (2016), 1134.

[11] David DeSteno, Cynthia Breazeal, Robert H Frank, David Pizarro, Jolie Baumann, Leah Dickens, and Jin Joo Lee. 2012. Detecting the trustworthiness of novel partners in economic exchange. Psychological science 23, 12 (2012), 1549–1556.

[12] Weihua Du, Yiming Yang, and Sean Welleck. 2025. Optimizing temperature for language models with multi-sample inference. arXiv preprint arXiv:2502.05234 (2025).

[13] Wen Duan, Shiwen Zhou, Matthew J Scalia, Xiaoyun Yin, Nan Weng, Ruihao Zhang, Guo Freeman, Nathan McNeese, Jamie Gorman, and Michael Tolston. 2024. Understanding the evolvement of trust over time within Human-AI teams. Proceedings of the ACM on Human-Computer Interaction 8, CSCW2 (2024), 1–31.

[15] K. Easton, Stephen Potter, R. Bec, M. Bennion, H. Christensen, C. Grindell, Bahman Mirheidari, S. Weich, L. D. de Witte, D. Wolstenholme, and M. Hawley. 2019. A Virtual Agent to Support Individuals Living With Physical and Mental Comorbidities: Co-Design and Acceptability Testing. Journal of Medical Internet Research 21 (2019). https://api.semanticscholar.org/CorpusId:171093436

[17] Paul Ekman, Tim Dalgleish, and M Power. 1999. Basic emotions. San Francisco, USA 1 (1999).

[18] Siska Fitrianie, Merijn Bruijnes, Deborah Richards, Andrea Bönsch, and Willem-Paul Brinkman. 2020. The 19 unifying questionnaire constructs of artificial social agents: An IVA community analysis. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8.

[19] Lucie Galland, Catherine Pelachaud, and Florian Pecune. 2025. SMART-DREAM: To Condition or Not to Condition; A Study on the Impact of LLM Conditioning on Motivational Interview Dialog Virtual Agent. In Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents. 1–9.

[20] Yuan Gao, Elena Sibirtseva, Ginevra Castellano, and Danica Kragic. 2019. Fast adaptation with meta-reinforcement learning for trust modelling in human-robot interaction. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 305–312.

[21] Jonas Gonzalez-Billandon, Alexander M Aroyo, Alessia Tonelli, Dario Pasquali, Alessandra Sciutti, Monica Gori, Giulio Sandini, and Francesco Rea. 2019. Can a robot catch you lying? A machine learning system to detect lies during interactions. Frontiers in Robotics and AI 6 (2019), 64.

[22] F. Grivokostopoulou, Konstantinos Kovas, and I. Perikos. 2020. The Effectiveness of Embodied Pedagogical Agents and Their Impact on Students Learning in Virtual Worlds. Applied Sciences (2020). https://api.semanticscholar.org/CorpusId:216241082

[23] Rosanna E Guadagno, Jim Blascovich, Jeremy N Bailenson, and Cade McCall. 2007. Virtual humans and persuasion: The effects of agency and behavioral realism. Media Psychology 10, 1 (2007), 1–22.

[25] Abhay Gupta, Arjun D’Cunha, Kamal Awasthi, and Vineeth Balasubramanian. 2016. Daisee: Towards user engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016).

[27] Bin Han, Deuksin Kwon, Spencer Lin, Kaleen Shrestha, and Jonathan Gratch. 2025. Can LLMs Generate Behaviors for Embodied Virtual Agents Based on Personality Traits? In Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents. 1–10.

[29] Craig J Johnson, Mustafa Demir, Nathan J McNeese, Jamie C Gorman, Alexandra T Wolff, and Nancy J Cooke. 2023. The impact of training on human–autonomy team communications and trust calibration. Human factors 65, 7 (2023), 1554–1570.

[30] Sai Shashank Kalakonda, Shubh Maheshwari, and Ravi Kiran Sarvadevabhatla. 2023. Action-gpt: Leveraging large-scale language models for improved and generalized action generation. In 2023 IEEE international conference on multimedia and expo (ICME). IEEE, 31–36.

[32] Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, and Youngjae Yu. 2025. Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2247–2265.

[33] Yanghee Kim and Quan Wei. 2011. The impact of learner attributes and learner choice in an agent-based environment. Computers & Education 56, 2 (2011), 505–514.

[34] Jennifer T Kubota, Samuel A Venezia, Richa Gautam, Andrea L Wilhelm, Bradley D Mattan, and Jasmin Cloutier. 2023. Distrust as a form of inequality. Scientific Reports 13, 1 (2023), 9901.

[35] John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human factors 46, 1 (2004), 50–80.

[36] Jin Joo Lee, Brad Knox, Jolie Baumann, Cynthia Breazeal, and David DeSteno. 2013. Computationally modeling interpersonal trust. Frontiers in psychology 4 (2013), 56004.

[38] Chang Liu, Qunfen Lin, Zijiao Zeng, and Ye Pan. 2024. Emoface: Audio-driven emotional 3d face animation. In 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR). IEEE, 387–397.

[39] Ziyi Liu, Zhengzhe Zhu, Lijun Zhu, Enze Jiang, Xiyun Hu, Kylie A Peppler, and K. Ramani. 2024. ClassMeta: Designing Interactive Virtual Classmate to Promote VR Classroom Participation. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (2024). https://api.semanticscholar.org/CorpusId:269748691

[40] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf

[41] Syaheerah Lebai Lutfi, Badr Lahasan, Cristina Luna-Jiménez, Zaher A Bamasood, and Zahid Akhtar. 2023. Effects of Facial Expressions and Gestures on the Trustworthiness of a Person. IEEE Access 11 (2023), 133891–133902.

[42] Roger C Mayer and James H Davis. 1999. The effect of the performance appraisal system on trust for management: A field quasi-experiment. Journal of applied psychology 84, 1 (1999), 123.

[43] Roger C Mayer, James H Davis, and F David Schoorman. 1995. An integrative model of organizational trust. Academy of management review 20, 3 (1995), 709–734.

[44] Luise Metzger, Linda Miller, Martin Baumann, and Johannes Kraus. 2024. Empowering calibrated (dis-)trust in conversational agents: A user study on the persuasive power of limitation disclaimers vs. authoritative style. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–19.

[45] Jay F Nunamaker, Douglas C Derrick, Aaron C Elkins, Judee K Burgoon, and Mark W Patton. 2011. Embodied conversational agent-based kiosk for automated interviewing. Journal of Management Information Systems 28, 1 (2011), 17–48.

[46] Krzysztof Opolski, Piotr Modzelewski, and Agata Kocia. 2019. Interorganizational trust and effectiveness perception in a collaborative service delivery network. Sustainability 11, 19 (2019), 5217.

[47] Yaniv Oshrat, Yonatan Aumann, Tal Hollander, Oleg Maksimov, Anita Ostroumov, Natali Shechtman, and Sarit Kraus. 2022. Efficient customer service combining human operators and virtual agents. arXiv preprint arXiv:2209.05226 (2022).

[48] Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. 2021. BABEL: Bodies, action and behavior with English labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 722–731.

[49] Andrew Reece, Gus Cooney, Peter Bull, Christine Chung, Bryn Dawson, Casey Fitzpatrick, Tamara Glazer, Dean Knox, Alex Liebscher, and Sebastian Marin. 2023. The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Science advances 9, 13 (2023), eadf3197.

[50] Radhika Santhanagopalan, Isobel A Heck, and Katherine D Kinzler. 2022. Leadership, gender, and colorism: Children in India use social category information to guide leadership cognition. Developmental Science 25, 3 (2022), e13212.

[51] Su-Mae Tan and Tze Wei Liew. 2020. Designing embodied virtual agents as product specialists in a multi-product category E-commerce: The roles of source credibility and social presence. International Journal of Human–Computer Interaction 36, 12 (2020), 1136–1149.

[52] Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. 2024. Motion-agent: A conversational framework for human motion generation with LLMs. arXiv preprint arXiv:2405.17013 (2024).

[54] Zeyi Zhang, Yanju Zhou, Heyuan Yao, Tenglong Ao, Xiaohang Zhan, and Libin Liu. 2025. Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–12.

[55] Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, and Tim Kraska. 2017. Controlling false discoveries during interactive data exploration. In Proceedings of the 2017 ACM international conference on management of data. 527–540.

[56] Qingxiao Zheng, Zhuoer Chen, and Yun Huang. 2025. Learning through AI-clones: Enhancing self-perception and presentation performance. Computers in Human Behavior: Artificial Humans 3 (2025), 100117.

A Prompt template

Role: You are a High-Fidelity Multimodal Persona Engine. You specialize in translating psychological frameworks into synchronized verbal and...

... the extent to which a trustee is believed to want to do good to the trustor, aside from an egocentric profit motive ...

Tag Syntax & Placement.
Audio Tags: [tag]. Place immediately before or after the dialogue segment. Focus only on vocal delivery or non-verbal vocal sounds.
Facial Tags: f: expression. Place at the exact moment the facial expression should trigger.
Gesture Tags: g: gesture. Place at the exact moment the physical movement should begin.
Emphasis: Use CAPITA...

... He sighed, ...

Approved Tag Lists. [List of approved tags]

... Gesture name ...

Workflow.
Analyze Personality: Read the Ability scores.
Create the text: Match the text with the provided intention, ability score, and oral style. The text is going to be read.
Keep the text short: 3 sentences at most.
Apply Facial/Gesture Tags: Insert f: and g: tags where the movement naturally starts.
Apply Audio Tags: Insert [] tags to guide the voice ...
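To make the tag syntax in the prompt template more concrete, the following Python sketch separates a tagged utterance into plain text and a list of positioned tags. It is an illustration only and not the authors' pipeline. The square-bracket audio tags and the f: and g: prefixes come from the template; the angle-bracket delimiters around facial and gesture tags, the names `parse_tagged_utterance`, `emphasized_words`, and `Tag`, and the reading of capitalised words as emphasis (the Emphasis rule is truncated in the template) are assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Assumed inline forms. Only [tag], f: and g: appear in the template;
# the <...> delimiters are hypothetical and chosen for this sketch.
#   [sighs]          audio tag
#   <f: frown>       facial tag
#   <g: shrug>       gesture tag
TAG_PATTERN = re.compile(
    r"\[(?P<audio>[^\]]+)\]|<f:\s*(?P<face>[^>]+)>|<g:\s*(?P<gesture>[^>]+)>"
)


@dataclass
class Tag:
    kind: str      # "audio", "face", or "gesture"
    value: str     # label inside the tag
    position: int  # offset in the cleaned text where the tag triggers


def parse_tagged_utterance(raw: str) -> tuple[str, list[Tag]]:
    """Return the utterance without tags, plus the tags with their offsets."""
    parts: list[str] = []
    tags: list[Tag] = []
    cursor = 0  # read position in the raw string
    length = 0  # length of the cleaned text built so far
    for match in TAG_PATTERN.finditer(raw):
        chunk = raw[cursor:match.start()]
        parts.append(chunk)
        length += len(chunk)
        kind = match.lastgroup  # name of the alternative that matched
        tags.append(Tag(kind, match.group(kind).strip(), length))
        cursor = match.end()
    parts.append(raw[cursor:])
    return "".join(parts), tags


def emphasized_words(text: str) -> list[str]:
    """Return fully capitalised words of two or more letters."""
    return re.findall(r"\b[A-Z]{2,}\b", text)


raw = "[sighs] I am <f: frown> not SURE about this. <g: shrug>"
text, tags = parse_tagged_utterance(raw)
```

Each offset marks where the behavior would start in the cleaned text, which mirrors the template's instruction to place facial and gesture tags at the moment the movement begins. A real renderer would also need the approved tag lists, which the template leaves as a placeholder.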