Towards Learning Representations of Policies in Two-Player Zero-Sum Imperfect-Information Games

Amy Greenwald; Arjun Prakash; Kevin Wang; Kevin Yang

arxiv: 2607.01498 · v1 · pith:X7FWSLFMnew · submitted 2026-07-01 · 💻 cs.LG · cs.GT

Pith reviewed 2026-07-03 20:52 UTC · model grok-4.3

classification 💻 cs.LG cs.GT

keywords policy representations · self-supervised learning · imperfect information games · zero-sum games · poker · embeddings · behavioral representations

The pith

Policy embeddings from self-supervised learning on poker variants capture useful behavioral representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to learn embeddings that represent policies in two-player zero-sum games with hidden information. It develops basic ways to build collections of policies for a game, applies self-supervised techniques to turn those policies into vector representations, and designs tasks that test whether the vectors reflect actual policy behavior. Tests on Kuhn poker and Leduc poker show that the resulting embeddings contain measurable behavioral information. A reader would care because such representations could support policy analysis or transfer without relying on hand-designed features.

Core claim

By generating datasets of policies for Kuhn and Leduc Poker, training embeddings with self-supervised methods, and measuring those embeddings on downstream tasks, the work shows that useful behavioral representations are present in the learned embeddings.

What carries the argument

Policy datasets created for a given game, paired with self-supervised embedding methods and downstream evaluation tasks that probe behavioral properties.
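As a rough illustration of what a policy dataset for a small game could look like, the sketch below samples random tabular policies for Kuhn poker and attaches a simple behavioral label. It is not the authors' method: the paper's appendix title suggests they sample randomly initialized policy networks, whereas this example assumes tabular policies. The names `sample_random_policies` and `aggression` are hypothetical, and the aggression score is an unweighted mean of the bet probability over information sets (it ignores how often each information set is reached).

```python
import numpy as np

# Kuhn poker has 12 information sets (3 cards x 2 decision points x 2 players),
# each with two actions: index 0 = pass, index 1 = bet.
NUM_INFO_SETS = 12
NUM_ACTIONS = 2


def sample_random_policies(num_policies, concentration=1.0, seed=0):
    """Sample tabular policies; hypothetical stand-in for a dataset method.

    Returns an array of shape (num_policies, NUM_INFO_SETS, NUM_ACTIONS)
    whose last axis is a probability distribution over actions.
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(NUM_ACTIONS, concentration)
    return rng.dirichlet(alpha, size=(num_policies, NUM_INFO_SETS))


def aggression(policies):
    """Assumed behavioral label: mean bet probability across information sets."""
    return policies[:, :, 1].mean(axis=1)


policies = sample_random_policies(1000)
agg = aggression(policies)
print(policies.shape, float(agg.min()), float(agg.max()))
```

A label like `agg` is the kind of quantity a downstream task could try to recover from an embedding of each policy.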

If this is right

Embeddings trained this way can support tasks such as policy clustering or similarity search inside the same game.
Simple self-supervised objectives suffice to extract behavioral signals from policy collections in small poker games.
Systematic comparisons of different self-supervised techniques become feasible once policy datasets and evaluation tasks are standardized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pipeline might scale to produce representations useful for opponent modeling in larger imperfect-information games.
If the embeddings generalize across game variants, they could reduce the need to retrain policies from scratch when rules change slightly.
Downstream tasks that test strategic properties could be added to check whether the embeddings encode equilibrium concepts rather than surface statistics.

Load-bearing premise

The chosen downstream tasks accurately measure whether the embeddings capture policy behavior relevant to the games.

What would settle it

Running the downstream tasks on the learned embeddings and finding performance indistinguishable from random vectors would falsify the presence of useful behavioral representations.
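The check described above can be sketched as a linear probe run twice: once on the embeddings and once on random vectors of the same shape. Everything in this example is an assumption made for illustration. The "embeddings" are synthetic stand-ins built to carry a behavioral signal, the label is a hypothetical bet frequency, and `ridge_probe_r2` is not a function from the paper's code.

```python
import numpy as np


def ridge_probe_r2(features, targets, train_frac=0.8, ridge=1e-3, seed=0):
    """Fit a ridge-regularized linear probe and return held-out R^2."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    order = rng.permutation(n)
    split = int(train_frac * n)
    train, test = order[:split], order[split:]
    x = np.hstack([features, np.ones((n, 1))])  # append a bias column
    gram = x[train].T @ x[train] + ridge * np.eye(x.shape[1])
    weights = np.linalg.solve(gram, x[train].T @ targets[train])
    pred = x[test] @ weights
    ss_res = np.sum((targets[test] - pred) ** 2)
    ss_tot = np.sum((targets[test] - targets[test].mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n_policies, dim = 500, 16

# Hypothetical behavioral label for each policy (e.g. bet frequency).
behavior = rng.uniform(size=n_policies)

# Synthetic stand-in for learned embeddings: one direction encodes the label.
direction = rng.normal(size=dim)
embeddings = np.outer(behavior, direction) + 0.1 * rng.normal(size=(n_policies, dim))

# Baseline: random vectors with no relation to the label.
random_vectors = rng.normal(size=(n_policies, dim))

r2_embed = ridge_probe_r2(embeddings, behavior)
r2_random = ridge_probe_r2(random_vectors, behavior)
print(f"embeddings R^2: {r2_embed:.3f}, random R^2: {r2_random:.3f}")
```

If real embeddings scored no better than the random baseline on such a probe, that would count against the claim that they hold behavioral information.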

Figures

Figures reproduced from arXiv: 2607.01498 by Amy Greenwald, Arjun Prakash, Kevin Wang, Kevin Yang.

**Figure 3.** NeuPL-style conditional policy network. From the accompanying text, "NeuPL-style Conditional Embeddings": Training NeuPL induces an embedding space in which any vector can be decoded into a policy via the conditional network …

**Figure 4.** t-SNE visualization of embeddings in Kuhn Poker for 1000 random agents, colored by aggression level (bet/raise frequency). With the trajectory encoder, aggressive strategies (red, top) are clearly separated from passive strategies (blue, bottom). With the weight autoencoder, no such pattern emerges. From the accompanying text, Section 3.3 "Downstream Tasks": In this section, we introduce a suite of downstream tasks to evaluate the usefulness …

**Figure 5.** Functional encoder. From the accompanying text: Finally, we are interested in downstream tasks that directly enable the creation of good policies in games, either via payoff prediction or as part of a decision-time planning process. From Appendix D, "Learned Similarity Metric": To test whether a better similarity metric could close the opponent adaptation gap, we trained a learned Mahalanobis projection on Liar’s Dice: a linear map W ∈ R^{64×1…

**Figure 6.** Cosine similarity matrices for 100 evaluation agents under the Trajectory (contrastive) and Grover (hybrid) encoders on Liar’s Dice. Agents are sorted by mean similarity. The Trajectory encoder produces structured similarity patterns, while the Grover encoder’s similarities collapse to near-uniform values, explaining why its kernel regression degenerates to the Best-Avg baseline.
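The degeneration described in the Figure 6 caption can be reproduced in a toy setting. The sketch below assumes a softmax-weighted (Nadaraya-Watson style) kernel regression over cosine similarities and assumes that "Best-Avg" means choosing the response with the highest average payoff across training opponents. Neither detail is confirmed on this page, and the function names, temperature, and synthetic payoffs are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def kernel_regression(query_emb, train_emb, train_payoffs, temperature=0.1):
    """Predict payoffs for query agents as a similarity-weighted average."""
    logits = cosine_similarity_matrix(query_emb, train_emb) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ train_payoffs


rng = np.random.default_rng(0)
n_train, n_query, n_responses, dim = 50, 5, 4, 8

# Synthetic payoff of each candidate response against each training opponent.
train_payoffs = rng.normal(size=(n_train, n_responses))

# Collapsed encoder: every agent maps to almost the same vector.
base = rng.normal(size=dim)
train_emb = base + 1e-6 * rng.normal(size=(n_train, dim))
query_emb = base + 1e-6 * rng.normal(size=(n_query, dim))

predicted = kernel_regression(query_emb, train_emb, train_payoffs)
best_avg_payoffs = train_payoffs.mean(axis=0)

# With near-uniform similarities the weights are near-uniform, so every
# query receives the average payoff vector and picks the same response.
print(predicted.argmax(axis=1), best_avg_payoffs.argmax())
```

With a structured similarity matrix the weights would differ per query agent, which is what allows the prediction to depart from the average.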

Original abstract

We investigate the problem of learning useful policy representations (embeddings) in two-player zero-sum imperfect-information games. We make three contributions: First, we introduce methods of creating datasets of policies for a given game. Second, we propose methods to learn policy representations. Third, we introduce downstream tasks to evaluate the effectiveness of such representations. We evaluate each dataset method, embedding method, and downstream task on Kuhn and Leduc Poker. Although our methods are very basic, we demonstrate that useful behavioral representations are present in the learned embeddings. To our knowledge, this work is among the first to systematically compare self-supervised learning techniques for learning policy representations in games. Our code is available at https://github.com/VitamintK/ssl-project for others to extend.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They apply standard self-supervised embeddings to policy datasets from Kuhn and Leduc, report that some behavioral signal appears in the embeddings, but give no numbers and leave the downstream tasks' link to real game behavior unargued.

This paper takes standard self-supervised embedding methods and applies them to datasets of policies from Kuhn and Leduc poker. They show some evidence that the embeddings pick up behavioral patterns, but the results are not quantified in the abstract and the downstream tasks' connection to real game play is not demonstrated.

They do three things: generate policy datasets, learn embeddings, and define downstream tasks. The systematic comparison on two small games is new for this specific setting, and making the code public is a plus. It gives others a concrete place to start when exploring policy representations in imperfect-info games.

The main weakness is the lack of detail on what the downstream tasks actually are and how well the embeddings perform on them. Without numbers or an argument that success on those tasks means the embeddings capture equilibrium strategies or exploitable behaviors, the central claim stays vague. The scope is also narrow—only two toy games and basic methods—so it is more of an initial exploration than a strong result.

This is for people in multi-agent RL who want to see how self-supervised learning might transfer to game policies. A reader interested in representation learning for games could get some ideas from it, but it is not ready to change how anyone builds agents yet.

I would recommend sending it for peer review. The idea is reasonable and the experiments are a fair first cut; referees can push on the task design and ask for more games or stronger baselines.

Referee Report

2 major / 0 minor

Summary. The manuscript investigates learning useful policy representations (embeddings) in two-player zero-sum imperfect-information games. It introduces methods for creating datasets of policies, proposes self-supervised techniques to learn embeddings, and defines downstream tasks to evaluate the representations. These are evaluated on Kuhn and Leduc Poker, with the claim that useful behavioral representations are present in the learned embeddings despite the basic nature of the methods. The work positions itself as among the first systematic comparisons of such techniques and provides public code.

Significance. If the central claim holds, the work would provide an early systematic comparison of self-supervised learning methods for policy representations in imperfect-information games, with the public code release serving as a concrete strength for reproducibility and community extension.

major comments (2)

[Abstract and Evaluation section] The claim that 'useful behavioral representations are present in the learned embeddings' is asserted without any quantitative results, performance metrics, error bars, or specific outcomes from the downstream tasks on Kuhn and Leduc Poker. This absence makes it impossible to assess whether the embeddings capture policy behavior.
[Evaluation section] No description is given of how the downstream tasks are constructed, nor is there an argument that success on these tasks implies the embeddings capture equilibrium or exploitative policy behavior relevant to the games. Without this link, the demonstration reduces to showing that embeddings contain some signal passing the chosen proxies, which does not support the headline claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments highlight important gaps in the presentation of results and task definitions, which we will address through revisions to strengthen the manuscript.

Referee: [Abstract and Evaluation section] The claim that 'useful behavioral representations are present in the learned embeddings' is asserted without any quantitative results, performance metrics, error bars, or specific outcomes from the downstream tasks on Kuhn and Leduc Poker. This absence makes it impossible to assess whether the embeddings capture policy behavior.

Authors: We agree that the current version of the manuscript does not include specific quantitative results, performance metrics, or error bars to substantiate the claim in the abstract and evaluation sections. In the revision, we will add these details from the experiments on Kuhn and Leduc Poker to enable assessment of whether the embeddings capture policy behavior. revision: yes
Referee: [Evaluation section] No description is given of how the downstream tasks are constructed, nor is there an argument that success on these tasks implies the embeddings capture equilibrium or exploitative policy behavior relevant to the games. Without this link, the demonstration reduces to showing that embeddings contain some signal passing the chosen proxies, which does not support the headline claim.

Authors: We acknowledge that the manuscript lacks a description of downstream task construction and an explicit argument connecting task performance to equilibrium or exploitative behaviors. We will revise the Evaluation section to include these elements, explaining the task construction and why success on the proxies supports the claim regarding policy representations. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical methods evaluated on external benchmarks

The paper introduces dataset creation methods, embedding learning techniques, and downstream evaluation tasks, then reports results on Kuhn and Leduc Poker. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present. The central claim rests on observed performance of the introduced methods rather than any reduction to inputs by construction. The evaluation uses standard game benchmarks independent of the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5661 in / 905 out tokens · 15167 ms · 2026-07-03T20:52:56.663026+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 15 canonical work pages · 4 internal anchors

[1] Stefano V. Albrecht and Peter Stone. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. Artificial Intelligence, 258:66–95, May 2018. ISSN 00043702. doi: 10.1016/j.artint.2018.01.002. URL http://arxiv.org/abs/1709.08071. arXiv:1709.08071 [cs]

[2] Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining Deep Reinforcement Learning and Search for Imperfect-Information Games, November 2020. URL http://arxiv.org/abs/2007.13544. arXiv:2007.13544 [cs]

[3] Vinod Kumar Chauhan, Jiandong Zhou, Ping Lu, Soheila Molaei, and David A. Clifton. A Brief Review of Hypernetworks in Deep Learning. Artificial Intelligence Review, 57(9):250, August 2024. ISSN 1573-7462. doi: 10.1007/s10462-024-10862-8. URL http://arxiv.org/abs/2306.06955. arXiv:2306.06955 [cs]

[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020

[5] Aditya Grover, Maruan Al-Shedivat, Jayesh K Gupta, Yuri Burda, and Harrison Edwards. Learning policy representations in multiagent systems. In International Conference on Machine Learning, pages 1802–1811. PMLR, 2018

[6] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016

[7] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813. PMLR, 2016

[8] Harold W Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 2016

[9] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017

[10] Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai... OpenSpiel: A Framework for Reinforcement Learning in Games. CoRR, abs/1908.09453, 2019

[11] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020

[12] Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, and Thore Graepel. NeuPL: Neural population learning. arXiv preprint arXiv:2202.07415, 2022

[13] Siqi Liu, Luke Marris, Marc Lanctot, Georgios Piliouras, Joel Z. Leibo, and Nicolas Heess. Neural Population Learning beyond Symmetric Zero-sum Games, January 2024. URL https://arxiv.org/abs/2401.05133v1

[14] Andrei Lupu, Brandon Cui, Hengyuan Hu, and Jakob Foerster. Trajectory diversity for zero-shot coordination. In International Conference on Machine Learning, pages 7204–7213. PMLR, 2021

[15] Wenhao Ma, Yu-Cheng Chang, Jie Yang, Yu-Kai Wang, and Chin-Teng Lin. Contrastive learning-based agent modeling for deep reinforcement learning. arXiv preprint arXiv:2401.00132, 2024

[16, 17] Reid McIlroy-Young, Russell Wang, Siddhartha Sen, Jon Kleinberg, and Ashton Anderson. Learning Models of Individual Behavior in Chess. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1253–1263, August … doi: 10.1145/3534678.3539367. URL http://arxiv.org/abs/2008.10086. arXiv:2008.10086 [cs]

[18, 19] Samer Nashed and Shlomo Zilberstein. A Survey of Opponent Modeling in Adversarial Domains. Journal of Artificial Intelligence Research, 73:277–327, January 2022. ISSN 1076-… doi: 10.1613/jair.1.12889. URL https://www.jair.org/index.php/jair/article/view/12889

[20] Georgios Papoudakis, Filippos Christianos, and Stefano Albrecht. Agent modelling under partial observability for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 34, pages 19210–19222, 2021

[21] Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, and Samuel Sokota. Reevaluating policy gradient methods for imperfect-information games. arXiv preprint arXiv:2502.08938, 2025

[22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]

[23] Samuel Sokota, Edward Lockhart, Finbarr Timbers, Elnaz Davoodi, Ryan D’Orazio, Neil Burch, Martin Schmid, Michael Bowling, and Marc Lanctot. Solving Common-Payoff Games with Approximate Policy Iteration, January 2021. URL http://arxiv.org/abs/2101.04237. arXiv:2101.04237 [cs]

[24] Thomas Unterthiner, Daniel Keysers, Sylvain Gelly, Olivier Bousquet, and Ilya Tolstikhin. Predicting neural network accuracy from weights. arXiv preprint arXiv:2002.11448, 2020

[25] Conor Wallace, Umer Siddique, and Yongcan Cao. TransAM: Transformer-based agent modeling for multi-agent systems via local trajectory encoding. In Reinforcement Learning Conference, 2025

[26] Annie Xie, Dylan P Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Conference on Robot Learning, pages 575–588. PMLR, 2020
