Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

Christopher G. Lucas; Elizabeth Mieczkowski; Max Taylor-Davies; Natalia Velez; Neil R. Bramley; Ruaridh Mon-Williams; Thomas L. Griffiths; Yanwei Wang

arxiv: 2505.17323 · v2 · submitted 2025-05-22 · 💻 cs.AI · cs.LG

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

Ruaridh Mon-Williams , Max Taylor-Davies , Elizabeth Mieczkowski , Natalia Velez , Neil R. Bramley , Yanwei Wang , Thomas L. Griffiths , Christopher G. Lucas This is my paper

Pith reviewed 2026-05-22 13:10 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords partner modelingrecurrent agentsemergent representationscollaborative taskstask allocationinternal representationsmodel-free agents

0 comments

The pith

Simple recurrent agents develop internal models of partners' abilities during collaboration, but only when they can control task allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether flexible collaboration in AI requires special mechanisms for understanding others or if it can arise naturally from cooperative pressures. Researchers trained basic model-free recurrent agents to work together in a shared task without any added features for modeling partners. These agents developed structured representations inside their networks that captured what their partners were good or bad at. This allowed them to adapt quickly and work well with new teammates. The modeling only appeared when the agents had the ability to decide how tasks were divided, which created the necessary social pressure.

Core claim

Despite a lack of additional architectural features, inductive biases, or auxiliary objectives, the agents nevertheless develop structured internal representations of their partners' task abilities, enabling rapid adaptation and generalisation to novel collaborators. Notably, we find that structured partner modelling emerges when agents can influence partner behaviour by controlling task allocation.

What carries the argument

The hidden states of the recurrent agents, which develop structured encodings of partners' task abilities specifically when agents can control task allocation.

If this is right

Agents adapt rapidly to novel collaborators using their internal models.
Generalization to new partners occurs without explicit training for social understanding.
Structured partner modeling arises spontaneously from the pressures of open-ended cooperative interaction.
Modeling does not emerge in conditions where agents cannot influence partner behavior through task allocation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Environmental designs that give agents control over shared tasks may be key to fostering implicit social understanding in AI systems.
This emergence suggests that basic recurrent processing combined with appropriate interaction pressures can produce theory-of-mind-like abilities without dedicated modules.
Similar partner modeling might appear in other cooperative environments where one agent can direct the actions of others.

Load-bearing premise

The structured patterns in the agents' hidden states specifically encode models of partners' task abilities rather than other correlated features, and this emergence is caused by the ability to control task allocation.

What would settle it

An experiment showing that agents without task allocation control develop the same structured representations in their hidden states, or that the representations do not specifically predict partner abilities when probed.

Figures

Figures reproduced from arXiv: 2505.17323 by Christopher G. Lucas, Elizabeth Mieczkowski, Max Taylor-Davies, Natalia Velez, Neil R. Bramley, Ruaridh Mon-Williams, Thomas L. Griffiths, Yanwei Wang.

**Figure 2.** Figure 2: Comparison between different ego agent policies in the overcooked environment. Each [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: A comparison of the hidden states of RNN ego agents trained under different conditions. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: A demonstration of online adaptation. (A) Average throughput (rate of soup delivery) during episodes where the partner is switched halfway through from being faster at task 1 to faster at task 2. (B) From the same episodes, the average proportion of time spent by the partner performing each subtask before and after the switch. (C) UMAP embeddings of the average pre-switch and post-switch RNN hidden states … view at source ↗

**Figure 5.** Figure 5: (A) An example CoinGame environment layout at episode start. The two coloured circles [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: (A) UMAP embeddings of RNN hidden states averaged over the final 50 timesteps of each [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Top: correlation coefficients between individual hidden unit values and partner skill profile [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Layouts used in the Experiments B.6 Baselines To isolate the conditions under which partner modelling emerges, we utilise three baselines that allow us to disentangle the effects of training diversity, memory, partner information and architecture on collaborative performance. First, to see if memory plays a role, we include a feedforward network (FFN) MLP baseline. This tests whether memory is necessary fo… view at source ↗

read the original abstract

Humans are remarkably adept at collaboration, able to infer the strengths and weaknesses of new partners in order to work successfully towards shared goals. To build AI systems with this capability, we must first understand its building blocks: does such flexibility require explicit, dedicated mechanisms for modelling others -- or can it emerge spontaneously from the pressures of open-ended cooperative interaction? To investigate this question, we train simple model-free RNN agents to collaborate with a population of diverse partners. Using the `Overcooked-AI' environment, we collect data from thousands of collaborative teams, and analyse agents' internal hidden states. Despite a lack of additional architectural features, inductive biases, or auxiliary objectives, the agents nevertheless develop structured internal representations of their partners' task abilities, enabling rapid adaptation and generalisation to novel collaborators. We investigated these internal models through probing techniques, and large-scale behavioural analysis. Notably, we find that structured partner modelling emerges when agents can influence partner behaviour by controlling task allocation. Our results show that partner modelling can arise spontaneously in model-free agents -- but only under environmental conditions that impose the right kind of social pressure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Partner modeling shows up in plain RNN hidden states on Overcooked only when agents control task allocation, but the causal link to that condition rests on unverified assumptions about what the probes actually capture.

read the letter

The main thing to know is that simple recurrent agents develop structured internal representations of partner task abilities in this setup, and the structure appears specifically when the agent can assign tasks rather than when it cannot. That conditional emergence is the central empirical observation. The work trains model-free RNNs on Overcooked-AI with a diverse partner population, then uses probing on hidden states plus large-scale behavioral tests to show faster adaptation and generalization to new partners. The scale of the team data and the direct comparison across the two regimes are the parts that hold up without extra assumptions. No auxiliary objectives or special modules are added, which keeps the demonstration clean on that dimension. The probing and behavioral results together give a coherent picture that something like partner modeling is happening inside the network under the right pressure. The softer spot is the attribution to task-allocation control. The two conditions need to be matched on partner action statistics, observability, and episode dynamics for the difference in hidden-state structure to be cleanly tied to the control variable. If the controllable condition incidentally produces more variable or predictable partner behavior, the RNN could be learning to forecast those patterns instead of inferring abilities per se. The abstract and summary do not spell out the exact matching checks or the alternative probes that would rule this out, so the causal claim stays provisional until the full methods are examined. This paper is aimed at researchers working on multi-agent RL and on how social inference can arise without dedicated machinery. A reader who wants concrete examples of internal state analysis in cooperative settings will find usable material here. The experiment is grounded enough and the question is clear enough that it should go to peer review rather than get desk-rejected.

Referee Report

2 major / 2 minor

Summary. The manuscript trains simple model-free RNN agents to collaborate with diverse partners in the Overcooked-AI environment. It analyzes agents' hidden states via probing techniques and large-scale behavioral comparisons, claiming that structured internal representations of partners' task abilities emerge spontaneously without dedicated architectures or auxiliary objectives, enabling rapid adaptation and generalization to novel collaborators. This structured partner modelling arises specifically when agents can influence partner behavior through control of task allocation.

Significance. If the central empirical claims hold after addressing potential confounds, the work provides evidence that partner modelling can emerge from general cooperative pressures in recurrent agents. This strengthens understanding of how social inference capabilities arise in multi-agent RL without explicit inductive biases, with potential implications for designing flexible collaborative AI systems. The combination of probing and behavioral analysis offers a concrete empirical approach to studying emergent representations.

major comments (2)

[§4.3] §4.3 (Experimental Conditions): The claim that structured hidden-state patterns specifically encode models of partners' task abilities and emerge because of task-allocation control requires explicit verification that the controllable and non-controllable regimes differ only on that dimension. Partner population statistics, action variability, observability, episode length, and reward structure must be shown to be matched; otherwise, probing accuracy may reflect forecasting of altered dynamics rather than partner-ability inference.
[§5.1] §5.1 (Probing Analysis): The probing classifiers should include controls or ablations for correlated task and environmental features (e.g., current state or partner action history) to establish that the high prediction accuracy for task abilities is not driven by these confounds. Without such controls, the interpretation that the representations are partner models remains under-supported.

minor comments (2)

[Abstract and Methods] The abstract and methods section would benefit from reporting the exact number of agents trained, total episodes collected, and statistical tests used for behavioral comparisons to improve reproducibility.
[Figures] Figure captions for hidden-state visualizations should include details on dimensionality reduction method, clustering algorithm, and color-coding scheme for partner types.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important requirements for verifying experimental controls and strengthening the probing analysis. We address each point below and have revised the manuscript accordingly to provide the requested verifications and ablations.

read point-by-point responses

Referee: [§4.3] §4.3 (Experimental Conditions): The claim that structured hidden-state patterns specifically encode models of partners' task abilities and emerge because of task-allocation control requires explicit verification that the controllable and non-controllable regimes differ only on that dimension. Partner population statistics, action variability, observability, episode length, and reward structure must be shown to be matched; otherwise, probing accuracy may reflect forecasting of altered dynamics rather than partner-ability inference.

Authors: We agree that explicit verification of matched conditions is necessary to support the causal role of task-allocation control. Both regimes used identical partner populations (the same fixed set of 10 diverse RNN partners), the same Overcooked layouts, identical episode lengths (400 steps), and the same reward structure. Observability is matched as both conditions provide full state visibility. To address action variability, we have added a new analysis in the revised §4.3 comparing partner action entropy and transition statistics across conditions, confirming they are statistically equivalent when averaged over shared tasks. A table summarizing these matches has been included to rule out confounds from altered dynamics. revision: yes
Referee: [§5.1] §5.1 (Probing Analysis): The probing classifiers should include controls or ablations for correlated task and environmental features (e.g., current state or partner action history) to establish that the high prediction accuracy for task abilities is not driven by these confounds. Without such controls, the interpretation that the representations are partner models remains under-supported.

Authors: We concur that controls are required to isolate partner-ability encoding from immediate task or environmental features. In the revised manuscript, we have added ablation probes trained on the current environment state vector and on a 10-step history of partner actions. These controls yield substantially lower accuracy for predicting partner task abilities (approximately 65% versus 88% for the RNN hidden states). These results are now reported in §5.1 with statistical comparisons, supporting that the hidden-state representations capture partner-specific information beyond correlated task and action features. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical emergence claims

full rationale

The paper reports results from training model-free RNN agents in Overcooked-AI, followed by probing of hidden states and large-scale behavioral comparisons across experimental conditions. No mathematical derivations, first-principles results, or equations are presented that reduce to their inputs by construction. The central finding—that structured partner modeling emerges specifically when agents control task allocation—is supported by direct empirical contrasts rather than self-definitional structures, fitted parameters renamed as predictions, or load-bearing self-citations. Any self-citations serve only as background and do not substitute for the current experimental evidence, which remains independently falsifiable via replication of the training and probing pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities beyond standard RNN and reinforcement-learning training assumptions.

pith-pipeline@v0.9.0 · 5753 in / 1043 out tokens · 62207 ms · 2026-05-22T13:10:16.978623+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
cs.AI 2026-04 unverdicted novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Two key steps in the evolution of human cooperation: The interdependence hypothesis.Current Anthropology, 53(6):673–692, 2012

Michael Tomasello, Alicia P Melis, Claudio Tennie, Emily Wyman, and Esther Herrmann. Two key steps in the evolution of human cooperation: The interdependence hypothesis.Current Anthropology, 53(6):673–692, 2012

work page 2012
[2]

Harvard University Press, Cambridge, MA, 2009

Sarah Blaffer Hrdy.Mothers and Others: The Evolutionary Origins of Mutual Understanding. Harvard University Press, Cambridge, MA, 2009

work page 2009
[3]

Princeton University Press, 2018

Joseph Henrich.The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press, 2018

work page 2018
[4]

Embodied large language models enable robots to complete complex tasks in unpredictable environments

Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G Lucas. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence, pages 1–10, 2025

work page 2025
[5]

Do humans have two systems to track beliefs and belief-like states?Psychological review, 116(4):953, 2009

Ian A Apperly and Stephen A Butterfill. Do humans have two systems to track beliefs and belief-like states?Psychological review, 116(4):953, 2009

work page 2009
[6]

Bayesian theory of mind: Modeling joint belief-desire attribution

Chris Baker, Rebecca Saxe, and Joshua Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. InProceedings of the annual meeting of the cognitive science society, volume 33, 2011

work page 2011
[7]

theory of mind

Rebecca Saxe and Nancy Kanwisher. People thinking about thinking people: the role of the temporo-parietal junction in “theory of mind”. InSocial neuroscience, pages 171–182. Psychology Press, 2013

work page 2013
[8]

Planning with theory of mind.Trends in Cognitive Sciences, 26(11):959–971, 2022

Mark K Ho, Rebecca Saxe, and Fiery Cushman. Planning with theory of mind.Trends in Cognitive Sciences, 26(11):959–971, 2022

work page 2022
[9]

Collaborative decision making is grounded in representations of other people’s competence and effort.Journal of Experimental Psychology: General, 152(6):1565, 2023

Yang Xiang, Natalia Vélez, and Samuel J Gershman. Collaborative decision making is grounded in representations of other people’s competence and effort.Journal of Experimental Psychology: General, 152(6):1565, 2023

work page 2023
[10]

Optimizing competence in the service of collaboration.Cognitive Psychology, 150:101653, 2024

Yang Xiang, Natalia Vélez, and Samuel J Gershman. Optimizing competence in the service of collaboration.Cognitive Psychology, 150:101653, 2024

work page 2024
[11]

Mini managers: Children strategically divide cognitive labor among collaborators, but with a self-serving bias.Child Development, 93(2):437–450, 2022

Carolyn Baer and Darko Odic. Mini managers: Children strategically divide cognitive labor among collaborators, but with a self-serving bias.Child Development, 93(2):437–450, 2022

work page 2022
[12]

Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 4218–4227. PMLR, 10–15 Jul 2018

work page 2018
[13]

A survey of ad hoc teamwork research

Reuth Mirsky, Ignacio Carlucho, Arrasy Rahman, Elliot Fosong, William Macke, Mohan Sridharan, Peter Stone, and Stefano V Albrecht. A survey of ad hoc teamwork research. In European conference on multi-agent systems, pages 275–293. Springer, 2022

work page 2022
[14]

Learning to cooperate with humans using generative agents.arXiv preprint arXiv:2411.13934, 2024

Yancheng Liang, Daphne Chen, Abhishek Gupta, Simon S Du, and Natasha Jaques. Learning to cooperate with humans using generative agents.arXiv preprint arXiv:2411.13934, 2024

work page arXiv 2024
[15]

Gmytrasiewicz and Prashant Doshi

Piotr J. Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings.J. Artif. Int. Res., 24(1):49–79, July 2005

work page 2005
[16]

Empirical evaluation of ad hoc teamwork in the pursuit domain

Samuel Barrett, Peter Stone, and Sarit Kraus. Empirical evaluation of ad hoc teamwork in the pursuit domain. InThe 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS ’11, page 567–574. International Foundation for Autonomous Agents and Multiagent Systems, 2011

work page 2011
[17]

Albrecht and Subramanian Ramamoorthy

Stefano V . Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. InProceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 1155–1156, 2013. 11

work page 2013
[18]

Albrecht, Jacob W

Stefano V . Albrecht, Jacob W. Crandall, and Subramanian Ramamoorthy. Belief and truth in hypothesised behaviours.Artificial Intelligence, 235:63–94, 2016

work page 2016
[19]

Variational autoencoders for opponent modeling in multi-agent systems.arXiv preprint arXiv:2001.10829, 2020

Georgios Papoudakis and Stefano V Albrecht. Variational autoencoders for opponent modeling in multi-agent systems.arXiv preprint arXiv:2001.10829, 2020

work page arXiv 2001
[20]

Albrecht

Arrasy Rahman, Ignacio Carlucho, Niklas HÃ¶pner, and Stefano V . Albrecht. A general learning framework for open ad hoc teamwork using graph-based policy learning.Journal of Machine Learning Research, 24(298):1–74, 2023

work page 2023
[21]

Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

work page 2018
[22]

A behavioural trans- former for effective collaboration between a robot and a non-stationary human

Ruaridh Mon-Williams, Theodoros Stouraitis, and Sethu Vijayakumar. A behavioural trans- former for effective collaboration between a robot and a non-stationary human. In2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 1150–1157. IEEE, 2023

work page 2023
[23]

Opponent modeling in deep reinforcement learning

He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé, III. Opponent modeling in deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 1804–1813. PMLR, 20–22 Jun 2016

work page 2016
[24]

Modeling others using oneself in multi-agent reinforcement learning

Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 4257–4266. PMLR, 10–15 Jul 2018

work page 2018
[25]

Learning policy representations in multiagent systems

Aditya Grover, Maruan Al-Shedivat, Jayesh Gupta, Yuri Burda, and Harrison Edwards. Learning policy representations in multiagent systems. In Jennifer Dy and Andreas Krause, editors,Pro- ceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1802–1811. PMLR, 10–15 Jul 2018

work page 2018
[26]

Deep interactive bayesian reinforcement learning via meta-learning

Luisa Zintgraf, Sam Devlin, Kamil Ciosek, Shimon Whiteson, and Katja Hofmann. Deep interactive bayesian reinforcement learning via meta-learning. InProceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, page 1712–1714, 2021

work page 2021
[27]

Learning latent representations to influence multi-agent interaction

Annie Xie, Dylan Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Jens Kober, Fabio Ramos, and Claire Tomlin, editors,Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 575–588. PMLR, 16–18 Nov 2021

work page 2020
[28]

Albrecht

Georgios Papoudakis, Filippos Christianos, and Stefano V . Albrecht. Agent modelling under partial observability for deep reinforcement learning. InProceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, 2021

work page 2021
[29]

Social influence as intrinsic motivation for multi-agent deep reinforcement learning

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. InInternational conference on machine learning, pages 3040–3049. PMLR, 2019

work page 2019
[30]

Improving policies via search in cooperative partially observable games

Adam Lerer, Hengyuan Hu, Jakob Foerster, and Noam Brown. Improving policies via search in cooperative partially observable games. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7187–7194, 2020

work page 2020
[31]

other-play

Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. InInternational Conference on Machine Learning, pages 4399–4410. PMLR, 2020

work page 2020
[32]

What do navigation agents learn about their environment?

Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, and Roozbeh Mottaghi. What do navigation agents learn about their environment? . In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10266–10275, 2022. 12

work page 2022
[33]

Morcos, and Dhruv Batra

Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, and Dhruv Batra. Emer- gence of maps in the memories of blind navigation agents. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[34]

Understanding and controlling a maze-solving policy network, 2023

Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexan- der Matt Turner. Understanding and controlling a maze-solving policy network, 2023

work page 2023
[35]

Acquisition of chess knowl- edge in alphazero.Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022

Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. Acquisition of chess knowl- edge in alphazero.Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022

work page 2022
[36]

An investigation of model-free planning

Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, and Timothy Lillicrap. An investigation of model-free planning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Le...

work page 2019
[37]

Learning to reinforcement learn

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[38]

Whatever next? predictive brains, situated agents, and the future of cognitive science.Behavioral and brain sciences, 36(3):181–204, 2013

Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive science.Behavioral and brain sciences, 36(3):181–204, 2013

work page 2013
[39]

bayesian

John C Harsanyi. Games with incomplete information played by “bayesian” players, i–iii part i. the basic model.Management science, 14(3):159–182, 1967

work page 1967
[40]

Collaborative plans for complex group action.Artificial Intelligence, 86(2):269–357, 1996

Barbara J Grosz and Sarit Kraus. Collaborative plans for complex group action.Artificial Intelligence, 86(2):269–357, 1996

work page 1996
[41]

Inverse attention agents for multi-agent systems.arXiv preprint arXiv:2410.21794, 2024

Qian Long, Ruoyan Li, Minglu Zhao, Tao Gao, and Demetri Terzopoulos. Inverse attention agents for multi-agent systems.arXiv preprint arXiv:2410.21794, 2024

work page arXiv 2024
[42]

Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

work page 2021
[43]

Jaxmarl: Multi-agent rl environments and algorithms in jax

Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Garðar Ingvarsson, Timon Willi, Ravi Hammond, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert Tjarko Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktäschel, Chris Lu, and Jakob Nicolaus F...

work page 2024
[44]

Learning phrase representations using RNN encoder– decoder for statistical machine translation

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder– decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors,Proceedings of the 2014 Conference on Empirical Methods in Natural Languag...

work page 2014
[45]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017
[46]

JAX: composable transformations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018

work page 2018
[47]

Umap: Uniform manifold approximation and projection.Journal of Open Source Software, 3(29):861, 2018

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection.Journal of Open Source Software, 3(29):861, 2018. 13

work page 2018
[48]

Emergent kin selection of altruistic feeding via non-episodic neuroevolution

Max Taylor-Davies, Gautier Hamon, Timothé Boulet, and Clément Moulin-Frier. Emergent kin selection of altruistic feeding via non-episodic neuroevolution. In Pablo García-Sánchez, Emma Hart, and Sarah L. Thomson, editors,Applications of Evolutionary Computation, pages 496–509, 2025. 14 Appendix A CoinGame results To ensure that our findings generalise beyo...

work page 2025
[49]

MLP (ego agent uses a simple MLP in place of an RNN)

work page
[50]

No-influence RNN (ego agent has no control over which coin type the teammate pursues)

work page
[51]

Single-partner RNN (ego agent only exposed to a single partner type during training)

work page
[52]

During evaluation, partners were always sampled uniformly from the set [x,1−x]∀x∈ {0.1,0.3,0.5,0.7,0.9}

Multi-partner RNN (ego agent paired with multiple partner types during training and has influence) During training, the single-agent RNN was always paired with a teammate with skill profile [0.2,0.8] ; in all other cases partners were sampled uniformly from the set [x,1−x]∀x∈ {0.2,0.4,0.6,0.8} . During evaluation, partners were always sampled uniformly fr...

work page

[1] [1]

Two key steps in the evolution of human cooperation: The interdependence hypothesis.Current Anthropology, 53(6):673–692, 2012

Michael Tomasello, Alicia P Melis, Claudio Tennie, Emily Wyman, and Esther Herrmann. Two key steps in the evolution of human cooperation: The interdependence hypothesis.Current Anthropology, 53(6):673–692, 2012

work page 2012

[2] [2]

Harvard University Press, Cambridge, MA, 2009

Sarah Blaffer Hrdy.Mothers and Others: The Evolutionary Origins of Mutual Understanding. Harvard University Press, Cambridge, MA, 2009

work page 2009

[3] [3]

Princeton University Press, 2018

Joseph Henrich.The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press, 2018

work page 2018

[4] [4]

Embodied large language models enable robots to complete complex tasks in unpredictable environments

Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, and Christopher G Lucas. Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence, pages 1–10, 2025

work page 2025

[5] [5]

Do humans have two systems to track beliefs and belief-like states?Psychological review, 116(4):953, 2009

Ian A Apperly and Stephen A Butterfill. Do humans have two systems to track beliefs and belief-like states?Psychological review, 116(4):953, 2009

work page 2009

[6] [6]

Bayesian theory of mind: Modeling joint belief-desire attribution

Chris Baker, Rebecca Saxe, and Joshua Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. InProceedings of the annual meeting of the cognitive science society, volume 33, 2011

work page 2011

[7] [7]

theory of mind

Rebecca Saxe and Nancy Kanwisher. People thinking about thinking people: the role of the temporo-parietal junction in “theory of mind”. InSocial neuroscience, pages 171–182. Psychology Press, 2013

work page 2013

[8] [8]

Planning with theory of mind.Trends in Cognitive Sciences, 26(11):959–971, 2022

Mark K Ho, Rebecca Saxe, and Fiery Cushman. Planning with theory of mind.Trends in Cognitive Sciences, 26(11):959–971, 2022

work page 2022

[9] [9]

Collaborative decision making is grounded in representations of other people’s competence and effort.Journal of Experimental Psychology: General, 152(6):1565, 2023

Yang Xiang, Natalia Vélez, and Samuel J Gershman. Collaborative decision making is grounded in representations of other people’s competence and effort.Journal of Experimental Psychology: General, 152(6):1565, 2023

work page 2023

[10] [10]

Optimizing competence in the service of collaboration.Cognitive Psychology, 150:101653, 2024

Yang Xiang, Natalia Vélez, and Samuel J Gershman. Optimizing competence in the service of collaboration.Cognitive Psychology, 150:101653, 2024

work page 2024

[11] [11]

Mini managers: Children strategically divide cognitive labor among collaborators, but with a self-serving bias.Child Development, 93(2):437–450, 2022

Carolyn Baer and Darko Odic. Mini managers: Children strategically divide cognitive labor among collaborators, but with a self-serving bias.Child Development, 93(2):437–450, 2022

work page 2022

[12] [12]

Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 4218–4227. PMLR, 10–15 Jul 2018

work page 2018

[13] [13]

A survey of ad hoc teamwork research

Reuth Mirsky, Ignacio Carlucho, Arrasy Rahman, Elliot Fosong, William Macke, Mohan Sridharan, Peter Stone, and Stefano V Albrecht. A survey of ad hoc teamwork research. In European conference on multi-agent systems, pages 275–293. Springer, 2022

work page 2022

[14] [14]

Learning to cooperate with humans using generative agents.arXiv preprint arXiv:2411.13934, 2024

Yancheng Liang, Daphne Chen, Abhishek Gupta, Simon S Du, and Natasha Jaques. Learning to cooperate with humans using generative agents.arXiv preprint arXiv:2411.13934, 2024

work page arXiv 2024

[15] [15]

Gmytrasiewicz and Prashant Doshi

Piotr J. Gmytrasiewicz and Prashant Doshi. A framework for sequential planning in multi-agent settings.J. Artif. Int. Res., 24(1):49–79, July 2005

work page 2005

[16] [16]

Empirical evaluation of ad hoc teamwork in the pursuit domain

Samuel Barrett, Peter Stone, and Sarit Kraus. Empirical evaluation of ad hoc teamwork in the pursuit domain. InThe 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS ’11, page 567–574. International Foundation for Autonomous Agents and Multiagent Systems, 2011

work page 2011

[17] [17]

Albrecht and Subramanian Ramamoorthy

Stefano V . Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. InProceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 1155–1156, 2013. 11

work page 2013

[18] [18]

Albrecht, Jacob W

Stefano V . Albrecht, Jacob W. Crandall, and Subramanian Ramamoorthy. Belief and truth in hypothesised behaviours.Artificial Intelligence, 235:63–94, 2016

work page 2016

[19] [19]

Variational autoencoders for opponent modeling in multi-agent systems.arXiv preprint arXiv:2001.10829, 2020

Georgios Papoudakis and Stefano V Albrecht. Variational autoencoders for opponent modeling in multi-agent systems.arXiv preprint arXiv:2001.10829, 2020

work page arXiv 2001

[20] [20]

Albrecht

Arrasy Rahman, Ignacio Carlucho, Niklas HÃ¶pner, and Stefano V . Albrecht. A general learning framework for open ad hoc teamwork using graph-based policy learning.Journal of Machine Learning Research, 24(298):1–74, 2023

work page 2023

[21] [21]

Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A compre- hensive survey and open problems.Artificial Intelligence, 258:66–95, 2018

work page 2018

[22] [22]

A behavioural trans- former for effective collaboration between a robot and a non-stationary human

Ruaridh Mon-Williams, Theodoros Stouraitis, and Sethu Vijayakumar. A behavioural trans- former for effective collaboration between a robot and a non-stationary human. In2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 1150–1157. IEEE, 2023

work page 2023

[23] [23]

Opponent modeling in deep reinforcement learning

He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé, III. Opponent modeling in deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 1804–1813. PMLR, 20–22 Jun 2016

work page 2016

[24] [24]

Modeling others using oneself in multi-agent reinforcement learning

Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 4257–4266. PMLR, 10–15 Jul 2018

work page 2018

[25] [25]

Learning policy representations in multiagent systems

Aditya Grover, Maruan Al-Shedivat, Jayesh Gupta, Yuri Burda, and Harrison Edwards. Learning policy representations in multiagent systems. In Jennifer Dy and Andreas Krause, editors,Pro- ceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1802–1811. PMLR, 10–15 Jul 2018

work page 2018

[26] [26]

Deep interactive bayesian reinforcement learning via meta-learning

Luisa Zintgraf, Sam Devlin, Kamil Ciosek, Shimon Whiteson, and Katja Hofmann. Deep interactive bayesian reinforcement learning via meta-learning. InProceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, page 1712–1714, 2021

work page 2021

[27] [27]

Learning latent representations to influence multi-agent interaction

Annie Xie, Dylan Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Jens Kober, Fabio Ramos, and Claire Tomlin, editors,Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pages 575–588. PMLR, 16–18 Nov 2021

work page 2020

[28] [28]

Albrecht

Georgios Papoudakis, Filippos Christianos, and Stefano V . Albrecht. Agent modelling under partial observability for deep reinforcement learning. InProceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, 2021

work page 2021

[29] [29]

Social influence as intrinsic motivation for multi-agent deep reinforcement learning

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. InInternational conference on machine learning, pages 3040–3049. PMLR, 2019

work page 2019

[30] [30]

Improving policies via search in cooperative partially observable games

Adam Lerer, Hengyuan Hu, Jakob Foerster, and Noam Brown. Improving policies via search in cooperative partially observable games. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7187–7194, 2020

work page 2020

[31] [31]

other-play

Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. InInternational Conference on Machine Learning, pages 4399–4410. PMLR, 2020

work page 2020

[32] [32]

What do navigation agents learn about their environment?

Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, and Roozbeh Mottaghi. What do navigation agents learn about their environment? . In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10266–10275, 2022. 12

work page 2022

[33] [33]

Morcos, and Dhruv Batra

Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, and Dhruv Batra. Emer- gence of maps in the memories of blind navigation agents. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[34] [34]

Understanding and controlling a maze-solving policy network, 2023

Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, and Alexan- der Matt Turner. Understanding and controlling a maze-solving policy network, 2023

work page 2023

[35] [35]

Acquisition of chess knowl- edge in alphazero.Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022

Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. Acquisition of chess knowl- edge in alphazero.Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022

work page 2022

[36] [36]

An investigation of model-free planning

Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, and Timothy Lillicrap. An investigation of model-free planning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Le...

work page 2019

[37] [37]

Learning to reinforcement learn

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[38] [38]

Whatever next? predictive brains, situated agents, and the future of cognitive science.Behavioral and brain sciences, 36(3):181–204, 2013

Andy Clark. Whatever next? predictive brains, situated agents, and the future of cognitive science.Behavioral and brain sciences, 36(3):181–204, 2013

work page 2013

[39] [39]

bayesian

John C Harsanyi. Games with incomplete information played by “bayesian” players, i–iii part i. the basic model.Management science, 14(3):159–182, 1967

work page 1967

[40] [40]

Collaborative plans for complex group action.Artificial Intelligence, 86(2):269–357, 1996

Barbara J Grosz and Sarit Kraus. Collaborative plans for complex group action.Artificial Intelligence, 86(2):269–357, 1996

work page 1996

[41] [41]

Inverse attention agents for multi-agent systems.arXiv preprint arXiv:2410.21794, 2024

Qian Long, Ruoyan Li, Minglu Zhao, Tao Gao, and Demetri Terzopoulos. Inverse attention agents for multi-agent systems.arXiv preprint arXiv:2410.21794, 2024

work page arXiv 2024

[42] [42]

Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

work page 2021

[43] [43]

Jaxmarl: Multi-agent rl environments and algorithms in jax

Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Garðar Ingvarsson, Timon Willi, Ravi Hammond, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert Tjarko Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktäschel, Chris Lu, and Jakob Nicolaus F...

work page 2024

[44] [44]

Learning phrase representations using RNN encoder– decoder for statistical machine translation

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder– decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors,Proceedings of the 2014 Conference on Empirical Methods in Natural Languag...

work page 2014

[45] [45]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017

[46] [46]

JAX: composable transformations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018

work page 2018

[47] [47]

Umap: Uniform manifold approximation and projection.Journal of Open Source Software, 3(29):861, 2018

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection.Journal of Open Source Software, 3(29):861, 2018. 13

work page 2018

[48] [48]

Emergent kin selection of altruistic feeding via non-episodic neuroevolution

Max Taylor-Davies, Gautier Hamon, Timothé Boulet, and Clément Moulin-Frier. Emergent kin selection of altruistic feeding via non-episodic neuroevolution. In Pablo García-Sánchez, Emma Hart, and Sarah L. Thomson, editors,Applications of Evolutionary Computation, pages 496–509, 2025. 14 Appendix A CoinGame results To ensure that our findings generalise beyo...

work page 2025

[49] [49]

MLP (ego agent uses a simple MLP in place of an RNN)

work page

[50] [50]

No-influence RNN (ego agent has no control over which coin type the teammate pursues)

work page

[51] [51]

Single-partner RNN (ego agent only exposed to a single partner type during training)

work page

[52] [52]

During evaluation, partners were always sampled uniformly from the set [x,1−x]∀x∈ {0.1,0.3,0.5,0.7,0.9}

Multi-partner RNN (ego agent paired with multiple partner types during training and has influence) During training, the single-agent RNN was always paired with a teammate with skill profile [0.2,0.8] ; in all other cases partners were sampled uniformly from the set [x,1−x]∀x∈ {0.2,0.4,0.6,0.8} . During evaluation, partners were always sampled uniformly fr...

work page