Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior
Pith reviewed 2026-05-08 18:23 UTC · model grok-4.3
The pith
Foundation model agents exhibit emergent collaborative behaviors like perspective-taking and theory of mind in a human-AI color-matching game without explicit training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embodied foundation model agents consistently exhibit emergent collaborative behaviors (perspective-taking, collaborator-aware planning, introspection, theory of mind, and clarification) without being explicitly trained to do so. An automated behavior detection system using LLM-based judges, which achieves fair to substantial agreement with human annotations, identifies these behaviors in a 2D collaborative color-matching game. The behaviors show distinct patterns across different LLMs and collaboration stages, and user studies report positive human satisfaction.
What carries the argument
The automated behavior detection system that uses LLM-based judges to identify five collaborative behaviors (perspective-taking, collaborator-aware planning, introspection, theory of mind, and clarification) in the 2D game environment.
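To make that pipeline concrete, here is a minimal sketch of what a per-utterance LLM-judge loop could look like. The prompt wording, the detect_behaviors helper, and the OpenAI-style client are illustrative assumptions rather than the paper's actual implementation; only the five behavior labels come from the paper.

```python
# Minimal sketch of an LLM-as-judge behavior detector. Prompt text,
# model choice, and trace format are assumptions, not the paper's details.
from openai import OpenAI

BEHAVIORS = [
    "perspective-taking",
    "collaborator-aware planning",
    "introspection",
    "theory of mind",
    "clarification",
]

JUDGE_PROMPT = (
    "You are annotating a collaborative game transcript. "
    "Does the agent's utterance below exhibit {behavior}? "
    "Answer strictly YES or NO.\n\nUtterance: {utterance}"
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def detect_behaviors(utterance: str, model: str = "gpt-4o") -> dict[str, bool]:
    """Ask one LLM judge for a binary verdict on each predefined behavior."""
    verdicts = {}
    for behavior in BEHAVIORS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(behavior=behavior,
                                               utterance=utterance),
            }],
            temperature=0,  # deterministic verdicts, so reruns agree
        )
        answer = reply.choices[0].message.content.strip().upper()
        verdicts[behavior] = answer.startswith("YES")
    return verdicts
```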
If this is right
- Foundation models can serve as interactive emergent representations of human-like collaborative behavior in embodied settings.
- Collaborative behaviors occur at varying frequencies during different stages of collaboration.
- Distinct patterns of these behaviors appear across different large language models.
- Human participants report positive collaboration experiences with such agents, appreciating task focus and plan verbalization.
- The experimental framework enables further assessment of collaboration effectiveness in human-AI teams.
Where Pith is reading between the lines
- If these behaviors indicate mental models, then scaling up models or refining prompts could enhance coordination without additional training data.
- The approach might extend to real-world robotic applications where AI must adapt to human partners dynamically.
- Discrepancies between LLM judges and humans could highlight limitations in current models' understanding of social cues.
- This work opens the door to using game-based evaluations for measuring theory-of-mind capabilities in generative AI.
Load-bearing premise
That the five predefined collaborative behaviors reliably indicate underlying mental models of collaborators, and that LLM-based judges detect them accurately in a manner that generalizes beyond this game.
What would settle it
A replication study where human annotators disagree substantially with the LLM judges on behavior detection in new game sessions or where no emergent behaviors are observed in a wider range of foundation models.
Original abstract
Human-AI collaboration requires AI agents to understand human behavior for effective coordination. While advances in foundation models show promising capabilities in understanding and showing human-like behavior, their application in embodied collaborative settings needs further investigation. This work examines whether embodied foundation model agents exhibit emergent collaborative behaviors indicating underlying mental models of their collaborators, which is an important aspect of effective coordination. This paper develops a 2D collaborative game environment where large language model agents and humans complete color-matching tasks requiring coordination. We define five collaborative behaviors as indicators of emergent mental model representation: perspective-taking, collaborator-aware planning, introspection, theory of mind, and clarification. An automated behavior detection system using LLM-based judges identifies these behaviors, achieving fair to substantial agreement with human annotations. Results from the automated behavior detection system show that foundation models consistently exhibit emergent collaborative behaviors without being explicitly trained to do so. These behaviors occur at varying frequencies during collaboration stages, with distinct patterns across different LLMs. A user study was also conducted to evaluate human satisfaction and perceived collaboration effectiveness, with the results indicating positive collaboration experiences. Participants appreciated the agents' task focus, plan verbalization, and initiative, while suggesting improvements in response times and human-like interactions. This work provides an experimental framework for human-AI collaboration, empirical evidence of collaborative behaviors in embodied LLM agents, a validated behavioral analysis methodology, and an assessment of collaboration effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a 2D color-matching collaborative game in which LLM agents interact with humans or other agents. It defines five behaviors (perspective-taking, collaborator-aware planning, introspection, theory of mind, and clarification) as indicators of emergent mental models, deploys an automated LLM-judge detection system that achieves fair-to-substantial agreement with human annotations, reports that foundation models exhibit these behaviors at varying frequencies without explicit training, and presents a user study showing positive human perceptions of collaboration effectiveness.
Significance. If the detection methodology proves robust, the work supplies a concrete experimental framework and initial quantitative evidence that embodied LLM agents can display human-like collaborative behaviors in a coordination task. This could help evaluate and improve human-AI teaming systems. The current manuscript, however, leaves the reliability of the LLM judges and the mapping from observed behaviors to mental models insufficiently substantiated.
major comments (3)
- [Abstract] Abstract: the claim that foundation models 'consistently exhibit emergent collaborative behaviors' rests on an automated detector whose agreement with humans is described only as 'fair to substantial.' Without per-behavior kappa values or a breakdown showing that no behavior falls in the fair range (0.21-0.40), the reliability of the frequency counts used to support consistency cannot be evaluated.
- [Behavior detection system] Behavior detection system (described after the game definition): the paper provides no details on judge prompting, model choice for the judges, bias-mitigation steps, or any baseline detector (e.g., rule-based or non-LLM). Because every quantitative result flows through these LLM judges, the absence of such controls is load-bearing for the central emergence claim and leaves open the possibility that outputs reflect shared training-data priors rather than independent observation of agent traces.
- [Results and interpretation] Results and interpretation sections: the five behaviors are treated as direct indicators of 'underlying mental models,' yet the manuscript reports no additional validation such as direct probing of the agents, human mental-model elicitation, or comparison against human-human play traces. Frequency patterns alone in a single game do not establish that the behaviors reflect genuine collaborator modeling rather than surface-level response patterns.
minor comments (2)
- The abstract and methods should report the exact LLMs tested, number of trials per condition, and any data-exclusion criteria or statistical tests applied to the behavior frequencies.
- Figure captions and table legends should explicitly state whether error bars represent standard error, standard deviation, or confidence intervals.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional methodological transparency and interpretive caution would strengthen the work. We address each major comment below and will incorporate revisions to improve the reliability assessment of the LLM judges and to moderate claims about mental models.
Point-by-point responses
Referee: [Abstract] Abstract: the claim that foundation models 'consistently exhibit emergent collaborative behaviors' rests on an automated detector whose agreement with humans is described only as 'fair to substantial.' Without per-behavior kappa values or a breakdown showing that no behavior falls in the fair range (0.21-0.40), the reliability of the frequency counts used to support consistency cannot be evaluated.
Authors: We agree that the abstract claim would be more robust with granular agreement metrics. In the revised manuscript we will add a table reporting per-behavior Cohen's kappa values (computed from our existing human annotations) and will adjust the wording from 'consistently exhibit' to 'frequently exhibit' if any behavior falls in the fair range. This change will be reflected in both the abstract and the results summary. revision: yes
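A per-behavior agreement table of the kind promised here could be produced along these lines. The annotation vectors are placeholder data; the banding function encodes the Landis-Koch labels ("fair" through "almost perfect") the referee cites, and sklearn's cohen_kappa_score is one standard implementation of Cohen's (1960) statistic.

```python
# Hedged sketch: per-behavior Cohen's kappa between one LLM judge and one
# human annotator. The annotation arrays below are invented placeholders.
from sklearn.metrics import cohen_kappa_score

def kappa_band(kappa: float) -> str:
    """Qualitative bands following Landis & Koch (1977)."""
    if kappa <= 0.20:
        return "slight or worse"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Placeholder labels: 1 = behavior present, 0 = absent, one per utterance.
human = {"theory of mind": [1, 0, 1, 1, 0, 0, 1, 0]}
judge = {"theory of mind": [1, 0, 1, 0, 0, 0, 1, 1]}

for behavior in human:
    k = cohen_kappa_score(human[behavior], judge[behavior])
    print(f"{behavior}: kappa={k:.2f} ({kappa_band(k)})")
```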
Referee: [Behavior detection system] Behavior detection system (described after the game definition): the paper provides no details on judge prompting, model choice for the judges, bias-mitigation steps, or any baseline detector (e.g., rule-based or non-LLM). Because every quantitative result flows through these LLM judges, the absence of such controls is load-bearing for the central emergence claim and leaves open the possibility that outputs reflect shared training-data priors rather than independent observation of agent traces.
Authors: We acknowledge the omission of these details. The revised manuscript will expand the Behavior Detection System section (and add an appendix) with: the exact judge prompts for each behavior, the specific model used (GPT-4o), bias-mitigation steps including multiple independent judges and majority voting, and a comparison to a keyword-based rule detector on the same traces. These additions will allow readers to evaluate whether detections exceed surface priors. revision: yes
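The two promised controls are simple to sketch. Below, majority voting aggregates independent judge verdicts, and a deliberately naive keyword detector stands in for the non-LLM baseline; the keyword lists and example utterance are invented for illustration.

```python
# Hedged sketch of the controls described in the rebuttal: majority voting
# over independent LLM judges, plus a keyword-based rule detector as a
# non-LLM baseline. Keyword lists and examples are illustrative only.
from collections import Counter

def majority_vote(judge_verdicts: list[bool]) -> bool:
    """Aggregate independent per-judge booleans; ties count as absent."""
    counts = Counter(judge_verdicts)
    return counts[True] > counts[False]

# A deliberately naive baseline: if LLM-judge detections merely track
# surface cues, they should not beat this detector by much.
KEYWORDS = {
    "clarification": ["do you mean", "which one", "can you clarify"],
    "perspective-taking": ["from your side", "you can see", "your view"],
}

def rule_detect(utterance: str, behavior: str) -> bool:
    text = utterance.lower()
    return any(kw in text for kw in KEYWORDS.get(behavior, []))

# Example: three judges disagree; majority voting resolves the label.
assert majority_vote([True, True, False]) is True
assert rule_detect("Do you mean the red tile?", "clarification") is True
```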
Referee: [Results and interpretation] Results and interpretation sections: the five behaviors are treated as direct indicators of 'underlying mental models,' yet the manuscript reports no additional validation such as direct probing of the agents, human mental-model elicitation, or comparison against human-human play traces. Frequency patterns alone in a single game do not establish that the behaviors reflect genuine collaborator modeling rather than surface-level response patterns.
Authors: We accept that frequency patterns alone do not constitute direct proof of internal mental models. In revision we will (1) change phrasing throughout results and discussion from 'indicators of underlying mental models' to 'behaviors consistent with emergent collaborator modeling,' (2) add an explicit limitations paragraph noting the lack of direct probing or human-human baselines, and (3) outline future work on agent-state probing and human-human comparisons. The user-study results on perceived effectiveness will be repositioned as complementary rather than confirmatory evidence. revision: partial
Circularity Check
No significant circularity; empirical frequencies of a priori behaviors
Full rationale
The paper defines five collaborative behaviors upfront as indicators of mental models, then applies a separate LLM-based judge system to detect their presence in agent interaction traces during the color-matching game. Detection reliability is assessed via agreement with human annotations (reported as fair to substantial), and results are reported as observed occurrence frequencies across collaboration stages and models. No equations, parameter fitting, or self-referential derivations are present; the central claim of emergent behaviors follows directly from the measured detection rates rather than reducing to the definitions or any fitted inputs by construction. The methodology is validated against external human annotations rather than against its own outputs.
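As a sketch of the frequency analysis this rationale describes, the tally below counts detections per collaboration stage. The stage names and records are invented placeholders, since the paper's actual stage taxonomy is not reproduced here.

```python
# Hedged sketch: occurrence frequencies of detected behaviors by stage.
# The (stage, behavior) records below are invented placeholder data.
from collections import defaultdict

records = [
    ("planning", "collaborator-aware planning"),
    ("planning", "clarification"),
    ("execution", "perspective-taking"),
    ("execution", "perspective-taking"),
    ("review", "introspection"),
]

freq: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for stage, behavior in records:
    freq[stage][behavior] += 1

for stage, behaviors in freq.items():
    total = sum(behaviors.values())
    for behavior, n in behaviors.items():
        print(f"{stage}: {behavior} = {n}/{total} detections")
```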
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The five behaviors (perspective-taking, collaborator-aware planning, introspection, theory of mind, clarification) serve as valid indicators of emergent mental model representation in collaborators.
- domain assumption: LLM-based judges can detect these behaviors with fair to substantial agreement with human annotations.