pith. sign in

arxiv: 2605.23652 · v1 · pith:VOAJBGYBnew · submitted 2026-05-22 · 💻 cs.AI

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

Pith reviewed 2026-05-25 04:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learningshared policiespersona conditioningNPC agentszero-shot identificationInfoNCE losslife simulationgame AI
0
0 comments X

The pith

A single shared RL policy can control thousands of NPCs with distinct, controllable personas in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents pcsp, a reinforcement learning policy that takes a single shared network and conditions it on frozen LLM embeddings of free-form persona descriptions to produce consistent NPC behavior. On a 300-persona life-simulation benchmark the approach reaches compositional zero-shot persona identification 17 times above chance, Spearman rho of approximately 0.73 on semantic-behavioral alignment, and 22 times faster inference than an LLM-as-policy baseline. Ablations establish that the InfoNCE trajectory-consistency term is required; its removal collapses identification performance to chance. The same policy also produces measurable behavioral divergence on Melting Pot substrates and runs at sub-frame rates inside Unreal Engine 5 with 64 agents.

Core claim

pcsp is a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. It combines once-per-NPC persona encoding, low-rank projection, neural conditioning, and a PPO training objective that includes an InfoNCE trajectory-consistency loss plus a KL diversity term. This construction yields compositional zero-shot persona identification up to 17 times above chance, 0.73 Spearman alignment, and 22 times faster inference than LLM baselines while remaining controllable through natural-language persona text.

What carries the argument

Persona Conditioned Shared Policy (pcsp) that conditions a shared PPO policy on frozen LLM persona embeddings via low-rank projection and neural layers, trained with an InfoNCE trajectory-consistency objective.

If this is right

  • One policy can support hundreds to thousands of simultaneously active NPCs without per-NPC model copies.
  • Persona conditioning works for both compositional zero-shot and vocabulary-expansion held-out settings.
  • The same objective produces measurable persona-conditioned divergence in multi-agent strategic environments.
  • Sub-frame inference survives deployment inside a commercial game engine at 64 agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Trajectory-consistency losses may generalize as a lightweight way to enforce identity across other shared-policy domains such as multi-robot coordination.
  • The separation of frozen LLM encoding from the RL policy suggests a route to updating personas without retraining the entire agent.
  • If the low-rank projection continues to scale, the method could support persona libraries orders of magnitude larger than the 300-persona test set.

Load-bearing premise

The InfoNCE trajectory-consistency objective is required for the policy to learn compositional zero-shot persona identification from natural-language descriptions.

What would settle it

Remove the InfoNCE loss, retrain, and measure whether zero-shot persona identification on the 300-persona benchmark falls to chance level.

Figures

Figures reproduced from arXiv: 2605.23652 by Yoosung Hong.

Figure 1
Figure 1. Figure 1: PCSP pipeline and training objectives. Persona text is encoded once per NPC with a frozen Qwen3 embedding model, adapted through a low-rank projection, and consumed by a shared persona-conditioned policy during PCSP-D rollout. The trajectory encoder provides the InfoNCE consistency signal, while the policy is optimized with PPO and KL diversity regularization. The rightmost PCSP-D label depicts the base 12… view at source ↗
Figure 2
Figure 2. Figure 2: Three-layer validation stack. Each layer is selected for the question it can isolate; the InfoNCE consistency term is ablated in all three. Together they cover mechanism, generalization, and deployment. and runtime constraints of a commercial game engine. A single environment cannot answer all three: the conditions that make causal isolation possible (full observability, small action space, short episodes)… view at source ↗
Figure 3
Figure 3. Figure 3: Designer-authored personas in embedding space: raw vs. learned projection. t-SNE of 240 Korean training persona embeddings (grey) plus 50 English designer-authored personas, color-coded by source. (a) In the raw 1024-dim Qwen3 embedding space, the English designer personas form a language-shifted cluster well outside the Korean training distribution: their mean nearest-neighbor distance to any training per… view at source ↗
Figure 4
Figure 4. Figure 4: Melting Pot single-substrate detail on commons_harvest__open. Per-seed bars (blue: full PCSP, 5 seeds; red: −InfoNCE ablation, 1 seed). Dotted blue line: 5-seed mean. (a) Mean pairwise action-KL across the 10 2  = 45 persona pairs. (b) In-distribution trajectory→persona top-3 retrieval on the 10-train vocabulary (dashed grey: chance 3/10). The ablation reaches the highest KL but the lowest top-3 (below ch… view at source ↗
Figure 5
Figure 5. Figure 5: Phase 4 runtime ablation, 64 agents. Zeroing the persona embedding (HybridNoPersona) collapses inter-persona action dispersion from ρ = 0.37 to ρ = 0.99. Replacing the ONNX policy with the BT-only needs heuristic (BTOnly) further halves throughput and reward as 64 agents synchronise on the single most-urgent need each tick. 2,792 interactions in 9.75 min, 0.04% failure rate (1 path￾follow event of 2,792), … view at source ↗
Figure 6
Figure 6. Figure 6: Long-horizon persona persistence (Layer 3). Per-minute dominant intent category for four maximally-separated personas over 30 in-game minutes in UE5 (8 agents, frozen Layer-1 checkpoint, 0.5 s decision throttle, 0.0% BT-abort). Each row is one persona; each column is a 1-minute bin coloured by the modal intent category. p009 stays in Rest for all 30 bins (single run); p058 holds Work as its modal axis in 1… view at source ↗
Figure 8
Figure 8. Figure 8: Expressed vs. preferred intent under engine contention. Policy￾preferred category distribution (mean softmax over the 20-d logits, folded to 11 categories) against the categories whose interactions actually complete, aggregated over 64 agents. The policy’s top preferences (Leisure, Study) collapse at execution while Rest/Social/Work absorb the displaced mass (symmetric KL = 9.07 nats) — agents are rerouted… view at source ↗
Figure 7
Figure 7. Figure 7: Zone-capacity utilisation over a 64-agent HybridPCSP episode. Per-zone occupants/capacity (rows) across 30 equal time bins over the ∼10.5- min episode (columns). Rest, Work, Hygiene, and Exercise zones carry the load (peaks ≈0.65); the unevenness is the engine-side contention behind the ρ-drop. Where the contention lands [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Co-interaction graph over the 64-agent HybridPCSP episode. Nodes are agents, coloured by behavioural archetype (modal expressed category); an edge joins two agents whenever their zone-occupancy intervals overlap and their interaction points lie within 250 world units (same/adjacent seat). Edge opacity and width scale with total shared-zone overlap; node size scales with weighted degree. Same-archetype agen… view at source ↗
Figure 10
Figure 10. Figure 10: Training reward curves at two scales. (a) v1 (6×6, 4 agents, 300 iters). (b) v2 (12×12, 16 agents, 200 iters). Removing the consistency loss (no consist) preserves or even slightly increases reward but collapses zero-shot persona traceability across all scales (Tables XI–III). Removing the diversity loss (no diverse) causes KL collapse at v1 and v2 (0.39 and 0.48) and a v1 reward drop, but its effect on v… view at source ↗
Figure 11
Figure 11. Figure 11: Zero-shot generalization on 60 unseen personas (v1). PCSP (full) achieves 19.3% trajectory-to-persona identification; the red dashed line marks random chance (1.7%). Removing the consistency loss collapses accuracy to 1.7%. The v2 experiment (100 unseen personas) replicates this qualitative pattern [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Empirical projected persona distance vs. behavioral KL divergence at two scales (PCSP full). Each point is a recomputed persona pair from the trained policy checkpoints rather than a synthetic reconstruction from summary statistics. Left: v1 sampled-pair ρ= 0.731 (100 pairs, 200 states). Right: v2 sampled-pair ρ = 0.695 (60 pairs, 100 states). The monotone alignment between persona space and behavior spac… view at source ↗
Figure 14
Figure 14. Figure 14: Map_PCSPDistrict_M in PIE. Top-down view with affordance￾zone labels (SocialHub, Library/Study, Office/Work, Bedroom/Rest, . . . ); each green sphere is an APCSPAgentCharacter executing the hybrid PCSP/BT stack. The map has 10 zones spanning the v3 affordance taxonomy (Kitchen, Bedroom, Office, Library, Gym, Bathroom, SocialHub, Park, Shop; Leisure folded into Observe) [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗
Figure 15
Figure 15. Figure 15: Gameplay Debugger overlay on a single agent. Yellow text (right) is the live Blackboard / BT snapshot for the selected APCSPAgentCharacter: current DesiredActionType, UrgencyScore, bAffordanceReserved, and the active BT node. This view confirms that the persona branch is routing decisions through BTTask_PCSPDecision → MoveToAffordance → PerformInteraction and that the emergency decorator fires only when U… view at source ↗
read the original abstract

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces pcsp, a single shared reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions for NPC control in life-simulation games. It reports that on a 300-persona benchmark the method achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho of approximately 0.73 for semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. The training objective combines PPO with an InfoNCE trajectory-consistency term and KL diversity; an ablation is stated to show that removing InfoNCE collapses performance to chance. External validation on Melting Pot 2.4.0 substrates and a UE5 deployment at 64 agents are presented to support behavioral divergence and real-time scalability. The work distinguishes compositional zero-shot from vocabulary-expansion held-out evaluation.

Significance. If the empirical claims hold after verification of benchmark construction and training details, the result would be significant for game AI and multi-agent reinforcement learning. It demonstrates that a single policy can deliver persona-consistent, controllable behavior at scale without per-NPC models or slow LLM inference, directly addressing practical constraints in commercial engines. The load-bearing role of the InfoNCE term and the external Melting Pot validation strengthen the case for shared-policy approaches over existing alternatives.

major comments (3)
  1. [Abstract] Abstract: The central claim that the InfoNCE trajectory-consistency objective is load-bearing rests on the statement that its removal collapses zero-shot identification to chance; however, the exact identification accuracy (or other metric) with and without the term is not reported, preventing assessment of effect size.
  2. [Benchmark and Evaluation] Benchmark description: The 300-persona life-simulation benchmark construction, including persona generation process, definition of compositional zero-shot held-out set, and precise measurement of behavioral divergence, is not detailed; these elements are load-bearing for interpreting the 17x-above-chance and rho=0.73 results as evidence of true compositional generalization.
  3. [Deployment] UE5 deployment: The reproduction of the persona-conditioning ablation at 64 agents is presented as evidence of sub-frame inference, but the definition of the reported low failure rate and the precise integration of the low-rank persona projection into the engine are not specified, which is necessary to evaluate the scalability claim.
minor comments (2)
  1. [Abstract] The exact value of Spearman rho (rather than 'approx 0.73') and any associated confidence interval or p-value should be stated for precision.
  2. [Title] The title's use of 'Infinite NPCs' exceeds the scale demonstrated (300 personas, 64 agents); a more precise phrasing would better reflect the reported experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate clarifications and additional details in a revised manuscript to strengthen the presentation of results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the InfoNCE trajectory-consistency objective is load-bearing rests on the statement that its removal collapses zero-shot identification to chance; however, the exact identification accuracy (or other metric) with and without the term is not reported, preventing assessment of effect size.

    Authors: We agree that explicit numerical values for the ablation would allow readers to better evaluate the effect size of the InfoNCE term. The revised manuscript will include a table reporting the precise zero-shot identification accuracy (and any other relevant metrics) both with and without the InfoNCE objective. revision: yes

  2. Referee: [Benchmark and Evaluation] Benchmark description: The 300-persona life-simulation benchmark construction, including persona generation process, definition of compositional zero-shot held-out set, and precise measurement of behavioral divergence, is not detailed; these elements are load-bearing for interpreting the 17x-above-chance and rho=0.73 results as evidence of true compositional generalization.

    Authors: We acknowledge that expanded details on benchmark construction are required for reproducibility and to support interpretation of the reported metrics. The revised manuscript will add a dedicated subsection describing the persona generation process, the exact definition and construction of the compositional zero-shot held-out set, and the methodology used to quantify behavioral divergence. revision: yes

  3. Referee: [Deployment] UE5 deployment: The reproduction of the persona-conditioning ablation at 64 agents is presented as evidence of sub-frame inference, but the definition of the reported low failure rate and the precise integration of the low-rank persona projection into the engine are not specified, which is necessary to evaluate the scalability claim.

    Authors: We agree that additional specification is needed to substantiate the deployment results. The revised manuscript will define the low failure rate metric explicitly and describe the integration of the low-rank persona projection within the UE5 engine, including any relevant implementation details for the 64-agent setting. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical results from held-out evaluation on an external benchmark (Melting Pot 2.4.0), a commercial UE5 deployment at 64 agents, and explicit ablations on the InfoNCE objective. The central claims rest on performance metrics obtained from these external validations rather than any derivation that reduces by construction to fitted parameters, self-citations, or renamed inputs. No load-bearing step equates a prediction to its own training objective or prior author work by definition. The distinction between compositional zero-shot and vocabulary-expansion held-out evaluation is stated explicitly without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that pre-trained LLM embeddings capture usable persona semantics and that the added InfoNCE term produces genuine cross-persona separation rather than benchmark-specific artifacts. No new physical entities are postulated.

axioms (2)
  • domain assumption Frozen LLM embeddings of free-form persona descriptions provide sufficient semantic signal for policy conditioning
    The policy is conditioned directly on these embeddings without any fine-tuning of the LLM.
  • domain assumption Standard PPO augmented with InfoNCE trajectory consistency and KL diversity terms can produce both reward-maximizing and persona-distinct behavior
    This is the core training objective whose removal is claimed to collapse zero-shot identification.

pith-pipeline@v0.9.0 · 5792 in / 1646 out tokens · 32147 ms · 2026-05-25T04:05:57.013789+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 6 internal anchors

  1. [1]

    Scalable evaluation of multi-agent reinforcement learning with melting pot,

    J. Z. Leibo, E. A. Due ˜nez-Guzman, A. Vezhnevets, J. P. Agapiou, P. Sunehag, R. Koster, J. Matyas, C. Beattie, I. Mordatch, and T. Graepel, “Scalable evaluation of multi-agent reinforcement learning with melting pot,” inInternational Conference on Machine Learning (ICML), 2021

  2. [2]

    G. N. Yannakakis and J. Togelius,Artificial Intelligence and Games. Springer, 2018

  3. [3]

    Colledanchise and P

    M. Colledanchise and P. ¨Ogren,Behavior Trees in Robotics and AI: An Introduction. CRC Press, 2018

  4. [4]

    Generative agents: Interactive simulacra of human behavior,

    J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” inACM Symposium on User Interface Software and Technology (UIST), 2023

  5. [5]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar, “V oyager: An open-ended embodied agent with large language models,”arXiv preprint arXiv:2305.16291, 2023

  6. [6]

    ReAct: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  7. [7]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

  8. [8]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

  9. [9]

    Diversity is all you need: Learning skills without a reward function,

    B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” inInternational Conference on Learning Representations (ICLR), 2019

  10. [10]

    CIC: Contrastive intrinsic control for unsupervised skill discovery,

    M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel, “CIC: Contrastive intrinsic control for unsupervised skill discovery,”arXiv preprint arXiv:2202.00161, 2022

  11. [11]

    Curiosity-driven exploration by self-supervised prediction,

    D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” inInternational Conference on Machine Learning (ICML), 2017

  12. [12]

    Universal value function approximators,

    T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” inInternational Conference on Machine Learning (ICML), 2015

  13. [13]

    Hindsight experience replay,

    M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” inAdvances in Neural Information Processing Systems (NeurIPS), 2017

  14. [14]

    BabyAI: A platform to study the sample efficiency of grounded language learning,

    M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y . Bengio, “BabyAI: A platform to study the sample efficiency of grounded language learning,” inInternational Conference on Learning Representations (ICLR), 2019

  15. [15]

    Decision transformer: Reinforcement learning via sequence modeling,

    L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” inAdvances in Neural Information Processing Systems (NeurIPS), 2021

  16. [16]

    A Generalist Agent

    S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth- Maron, M. Gimenez, Y . Sulsky, J. Kay, J. T. Springenberget al., “A generalist agent,”arXiv preprint arXiv:2205.06175, 2022

  17. [17]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning,

    O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgievet al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,”Nature, vol. 575, no. 7782, pp. 350–354, 2019

  18. [18]

    Dota 2 with Large Scale Deep Reinforcement Learning

    C. Berner, G. Brockman, B. Chan, V . Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesseet al., “Dota 2 with large scale deep reinforcement learning,”arXiv preprint arXiv:1912.06680, 2019

  19. [19]

    A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,

    D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,”Science, vol. 362, no. 6419, pp. 1140–1144, 2018

  20. [20]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), 2017

  21. [21]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Y . Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou, “Qwen3 embedding: Advancing text embedding and reranking through foundation models,” arXiv preprint arXiv:2506.05176, 2025

  22. [22]

    Sentence-BERT: Sentence embeddings using Siamese BERT-networks,

    N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” inConference on Empirical Methods in Natural Language Processing (EMNLP), 2019

  23. [23]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations (ICLR), 2022

  24. [24]

    FiLM: Visual reasoning with a general conditioning layer,

    E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” inAAAI Conference on Artificial Intelligence, 2018

  25. [25]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  26. [26]

    The surprising effectiveness of PPO in cooperative multi-agent games,

    C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . Wu, “The surprising effectiveness of PPO in cooperative multi-agent games,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

  27. [27]

    S., Gupta, T., Makoviichuk, D., Makoviychuk, V ., Torr, P

    C. S. de Witt, T. Gupta, D. Makoviichuk, V . Makoviychuk, P. H. S. Torr, M. Sun, and S. Whiteson, “Is independent learning all you need in the StarCraft multi-agent challenge?”arXiv preprint arXiv:2011.09533, 2020

  28. [28]

    Representation Learning with Contrastive Predictive Coding

    A. van den Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

  29. [29]

    PettingZoo: Gym for multi-agent reinforcement learning,

    J. K. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. S. Santos, C. Dieffendahl, C. Horsch, R. Perez-Vicente, N. L. Williams, Y . Lokesh, and P. Ravi, “PettingZoo: Gym for multi-agent reinforcement learning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2021

  30. [30]

    An introduction to the five-factor model and its applications,

    R. R. McCrae and P. T. Costa Jr, “An introduction to the five-factor model and its applications,”Journal of Personality, vol. 60, no. 2, pp. 175–215, 1992

  31. [31]

    An alternative “description of personality

    L. R. Goldberg, “An alternative “description of personality”: The Big- Five factor structure,”Journal of Personality and Social Psychology, vol. 59, no. 6, pp. 1216–1229, 1990

  32. [32]

    Trait (The Sims 3),

    The Sims Wiki, “Trait (The Sims 3),” https://sims.fandom.com/wiki/ Trait (The Sims 3), 2026, accessed May 11, 2026

  33. [33]

    Villager,

    Nookipedia, “Villager,” https://nookipedia.com/wiki/Villager, 2026, ac- cessed May 11, 2026

  34. [34]

    The Hanabi challenge: A new frontier for AI research,

    N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V . Dumoulin, S. Moitra, E. Hugheset al., “The Hanabi challenge: A new frontier for AI research,”Artificial Intelligence, vol. 280, 2020

  35. [35]

    Counterfactual multi-agent policy gradients,

    J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” inAAAI Conference on Artificial Intelligence, 2018

  36. [36]

    Multi- agent actor-critic for mixed cooperative-competitive environments,

    R. Lowe, Y . Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi- agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems (NeurIPS), 2017

  37. [37]

    Phase 5 report: persona-conditioned shared policies on melting pot,

    Y . Hong, “Phase 5 report: persona-conditioned shared policies on melting pot,” Project report, research/meltingpot/PHASE5_REPORT.md, 2026

  38. [38]

    Personalizing dialogue agents: I have a dog, do you have pets too?

    S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston, “Personalizing dialogue agents: I have a dog, do you have pets too?” in Annual Meeting of the Association for Computational Linguistics (ACL), 2018

  39. [39]

    Leveraging procedural generation to benchmark reinforcement learning,

    K. Cobbe, C. Hesse, J. Hilton, and J. Schulman, “Leveraging procedural generation to benchmark reinforcement learning,” inInternational Conference on Machine Learning (ICML), 2020