Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs
Pith reviewed 2026-05-20 06:33 UTC · model grok-4.3
The pith
Large language models can generate coherent multimodal behaviors reflecting different levels of ability and benevolence for social agents, while reproducing gender stereotypes when gender is specified.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-5.4 produces coherent multimodal behaviors across text, intonation, facial expression, and gesture that align with theoretical expectations for ability and benevolence. Random Forest feature importance confirms this alignment. When gender is specified in prompts, the outputs reproduce societal stereotypes, associating male agents with high ability and female agents with high benevolence. A within-subjects user study on Prolific confirms that participants perceive different levels of ability and benevolence in line with the prompt instructions.
What carries the argument
A prompt-based method for automatically generating multimodal behaviors aligned with specific levels of ability and benevolence, which produces outputs in verbal, vocal, gestural, and facial modalities.
If this is right
- Multimodal behaviors generated this way could support trust calibration in socially interactive agents.
- Targeted prompting lets LLMs vary specific trustworthiness dimensions consistently across modalities.
- Including gender in prompts for behavior generation leads to stereotypical associations in the outputs.
- User perceptions of the generated behaviors match the designed levels of ability and benevolence.
Where Pith is reading between the lines
- Extending the method to other traits or contexts could support more varied and context-appropriate agent personalities.
- The gender stereotype pattern points to the value of testing debiasing prompts or post-processing steps before deployment.
- Real-world interaction tests would reveal whether these generated behaviors actually produce better-calibrated trust and usage decisions.
Load-bearing premise
The prompts to the LLMs can isolate and control the intended levels of ability and benevolence without other uncontrolled factors shaping the outputs or how people perceive them.
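One way to probe this premise is to cross the trait levels with gender and with several paraphrases of the same instruction, then check whether outputs change with wording alone. The sketch below builds such a factorial grid. The factor levels, the paraphrase templates, and the `build_conditions` helper are hypothetical illustrations, not the paper's actual prompts.

```python
from itertools import product

# Hypothetical factor levels; the paper's actual levels and wording may differ.
ABILITY = ["low", "high"]
BENEVOLENCE = ["low", "high"]
GENDER = [None, "female", "male"]  # None means gender is left unspecified

# Hypothetical paraphrases of one instruction, used to test wording sensitivity.
PARAPHRASES = [
    "Portray an agent with {ability} ability and {benevolence} benevolence.",
    "The agent's ability is {ability}; its benevolence is {benevolence}.",
]


def build_conditions():
    """Return one prompt per cell of the full factorial design."""
    conditions = []
    for ability, benevolence, gender, (pid, template) in product(
        ABILITY, BENEVOLENCE, GENDER, enumerate(PARAPHRASES)
    ):
        prompt = template.format(ability=ability, benevolence=benevolence)
        if gender is not None:
            prompt += f" The agent is {gender}."
        conditions.append(
            {
                "ability": ability,
                "benevolence": benevolence,
                "gender": gender,
                "paraphrase_id": pid,
                "prompt": prompt,
            }
        )
    return conditions


conditions = build_conditions()
print(len(conditions))  # 2 * 2 * 3 * 2 = 24 cells
```

Comparing outputs across `paraphrase_id` within a fixed trait cell gives a rough indication of how much of the behavior is driven by wording rather than by the intended trait level.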
What would settle it
A follow-up analysis or user study in which the generated behaviors show no statistical alignment with theoretical expectations for ability and benevolence, or in which participants fail to perceive the intended trait differences.
Original abstract
As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores using LLMs (specifically GPT-5.4) to generate multimodal behaviors (text, intonation, facial expression, gesture) for socially interactive agents that reflect varying levels of ability and benevolence. It proposes a prompt-based generation method, analyzes a large dataset of outputs with Random Forest feature importance to claim alignment with trustworthiness theory, reports that gender-specified prompts reproduce societal stereotypes (male agents high-ability, female high-benevolence), and validates via a within-subjects Prolific user study that participants perceive the intended trait levels.
Significance. If the central claims hold after addressing confounds, the work offers a practical step toward automated trust calibration in SIAs and surfaces important gender-bias issues in LLM multimodal generation. The empirical pipeline combining large-scale generation, feature analysis, and human judgment data provides a replicable template, though its value hinges on demonstrating that outputs are driven by the targeted trait dimensions rather than prompt artifacts.
major comments (3)
- [Methods (prompt-based generation)] The prompt-based generation method (described in the methods for aligning behaviors with specific ability/benevolence levels) assumes prompts can isolate these traits independently. However, without explicit controls or ablation tests for correlated prompt wording, default model biases, or implicit cross-modal consistency rules, the Random Forest alignment may reflect prompt artifacts rather than theoretical mapping, undermining both the coherence claim and the gender-stereotype interpretation.
- [Results (Random Forest analysis)] In the Random Forest feature importance analysis, the reported alignment with ability and benevolence theory lacks detail on the exact multimodal features extracted, baseline comparisons (e.g., neutral or random prompts), or cross-validation metrics. This makes it impossible to assess whether importance scores genuinely track the intended dimensions or simply capture surface-level prompt elements.
- [User Study] The user study section reports that participant perceptions align with intended instructions but provides no sample size, statistical tests, effect sizes, or confidence intervals. These omissions prevent evaluation of whether the within-subjects design reliably validates the generation method or merely shows weak directional trends.
minor comments (3)
- [Abstract] The abstract contains a grammatical error: 'Participants perceived different levels of ability and benevolence in the generated behaviors align with the intended instructions' should be rephrased for clarity.
- The model is referred to as 'GPT-5.4'; clarify whether this is a hypothetical future model, a specific fine-tuned variant, or a typographical reference to an existing GPT-4 variant to avoid reader confusion.
- [Results] Figure or table captions for the Random Forest results should explicitly list the top features per modality and their importance scores to improve interpretability.
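The reporting requested in the comments above (top features per modality, and some evidence that importance rankings are stable) could be produced along the following lines. This is a minimal sketch that assumes importance scores are already available per cross-validation fold as a dictionary keyed by `modality:feature`. The feature names and scores are invented for illustration, and Spearman rank correlation is only one possible stability measure.

```python
from collections import defaultdict


def top_features_per_modality(importances, k=2):
    """Group 'modality:feature' keys by modality and keep the k highest scores."""
    by_modality = defaultdict(list)
    for name, score in importances.items():
        modality, feature = name.split(":", 1)
        by_modality[modality].append((feature, score))
    return {
        modality: sorted(feats, key=lambda fs: fs[1], reverse=True)[:k]
        for modality, feats in by_modality.items()
    }


def ranks(values):
    """Rank values from 1 (largest) downward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation of two equal-length score lists without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical importance scores from two folds (not results from the paper).
fold_a = {"text:hedging": 0.30, "voice:pitch_rise": 0.25, "face:smile": 0.20,
          "gesture:open_palms": 0.15, "text:jargon": 0.10}
fold_b = {"text:hedging": 0.28, "voice:pitch_rise": 0.18, "face:smile": 0.24,
          "gesture:open_palms": 0.16, "text:jargon": 0.14}

names = list(fold_a)
stability = spearman([fold_a[n] for n in names], [fold_b[n] for n in names])
print(top_features_per_modality(fold_a, k=1))
print(round(stability, 3))
```

A rank correlation near 1 across folds would suggest that the reported ordering is not an artifact of a single split.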
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional rigor and transparency will strengthen our claims about LLM-generated multimodal behaviors for trust calibration in SIAs. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Methods (prompt-based generation)] The prompt-based generation method (described in the methods for aligning behaviors with specific ability/benevolence levels) assumes prompts can isolate these traits independently. However, without explicit controls or ablation tests for correlated prompt wording, default model biases, or implicit cross-modal consistency rules, the Random Forest alignment may reflect prompt artifacts rather than theoretical mapping, undermining both the coherence claim and the gender-stereotype interpretation.
  Authors: We agree that the absence of explicit ablation tests leaves open the possibility that observed alignments partly reflect prompt artifacts or model biases rather than a clean mapping to ability and benevolence. In the revised manuscript we will add a dedicated ablation subsection that (a) systematically varies prompt phrasing while holding trait levels constant, (b) compares outputs against neutral and random-prompt baselines, and (c) examines cross-modal consistency rules. These analyses will be used to qualify both the coherence results and the gender-stereotype findings, making clear which effects persist after controlling for surface-level prompt elements.
  Revision: yes
- Referee: [Results (Random Forest analysis)] In the Random Forest feature importance analysis, the reported alignment with ability and benevolence theory lacks detail on the exact multimodal features extracted, baseline comparisons (e.g., neutral or random prompts), or cross-validation metrics. This makes it impossible to assess whether importance scores genuinely track the intended dimensions or simply capture surface-level prompt elements.
  Authors: We acknowledge that the current description of the Random Forest analysis is insufficiently detailed for readers to evaluate the source of the reported feature importances. The revised version will expand this section to list all extracted multimodal features (lexical, prosodic, facial, and gestural descriptors), include explicit baseline comparisons with neutral and random-prompt conditions, and report 5-fold cross-validation performance together with stability metrics for the importance rankings. These additions will allow direct assessment of whether the importance scores reflect the targeted theoretical dimensions.
  Revision: yes
- Referee: [User Study] The user study section reports that participant perceptions align with intended instructions but provides no sample size, statistical tests, effect sizes, or confidence intervals. These omissions prevent evaluation of whether the within-subjects design reliably validates the generation method or merely shows weak directional trends.
  Authors: We thank the referee for noting these reporting gaps. The revised manuscript will supply the exact sample size, the statistical tests performed (repeated-measures ANOVA or paired comparisons), effect sizes, and 95% confidence intervals for the key contrasts. These additions will enable readers to judge the reliability and magnitude of the alignment between intended trait levels and participant perceptions.
  Revision: yes
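For a within-subjects contrast such as "high ability" versus "low ability" ratings from the same participants, the promised effect size and interval could be computed roughly as follows. This is a sketch under stated assumptions: the ratings are invented, `paired_summary` is a hypothetical helper, the effect size is Cohen's d_z (mean difference divided by the standard deviation of the differences), and the interval uses a normal approximation, which is too narrow for small samples, where a t-based interval would be preferable.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_summary(high, low, confidence=0.95):
    """Summarize a paired contrast: mean difference, Cohen's d_z, and a CI.

    The confidence interval relies on a large-sample normal approximation.
    """
    diffs = [h - l for h, l in zip(high, low)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation of the differences
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd_diff / sqrt(n)
    return {
        "n": n,
        "mean_diff": mean_diff,
        "d_z": mean_diff / sd_diff,
        "ci": (mean_diff - half_width, mean_diff + half_width),
    }


# Hypothetical 7-point ratings from six participants (not data from the study).
high_condition = [6, 5, 7, 6, 5, 6]
low_condition = [3, 4, 4, 2, 3, 4]

summary = paired_summary(high_condition, low_condition)
print(summary)
```

Reporting `n`, `d_z`, and the interval together lets a reader distinguish a reliable perceived difference from a weak directional trend.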
Circularity Check
No significant circularity in empirical LLM behavior generation study
The paper is an empirical study that generates multimodal behavior transcripts via LLM prompts specifying ability/benevolence levels, analyzes them with Random Forest feature importance, and validates via a human user study on Prolific. No equations, derivations, or mathematical claims exist. No self-citation chains or ansatzes reduce any result to its own inputs by construction. The central claims rest on external human judgments and theoretical expectations rather than internal fitting or renaming, making the work self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ability and benevolence are key dimensions of trustworthiness.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We propose a novel method for automatically generating behaviors aligned with specific levels of these traits... Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A: Prompt template
Role: You are a High-Fidelity Multimodal Persona Engine. You specialize in translating psychological frameworks into synchronized verbal and...
Tag Syntax & Placement.
- Audio Tags: [tag]. Place immediately before or after the dialogue segment. Focus only on vocal delivery or non-verbal vocal sounds.
- Facial Tags: f: expression. Place at the exact moment the facial expression should trigger.
- Gesture Tags: g: gesture. Place at the exact moment the physical movement should begin.
- Emphasis: Use CAPITA...
Approved Tag Lists. [List of approved tags]
Workflow.
- Analyze Personality: Read the Ability scores.
- Create the text: Match the text with the provided intention, ability score, and oral style. The text is going to be read aloud.
- Keep the text short: 3 sentences at most.
- Apply Facial/Gesture Tags: Insert f: and g: tags where the movement naturally starts.
- Apply Audio Tags: Insert [] tags to guide the voice ...