pith. machine review for the scientific record.

arxiv: 2604.05943 · v1 · submitted 2026-04-07 · 💻 cs.AI

Recognition: no theorem link

MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Aleksandr Panov, Alexey Kovalev, Alexey Skrynnik, Anton Andreychuk, Egor Cherepanov, Konstantin Yakovlev, Maria Nesterova, Mikhail Kolosov, Oleg Bulichev

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent reinforcement learning · foundation models · transformers · offline reinforcement learning · StarCraft · Google Research Football · POGEMA

The pith

A single transformer model trained offline on expert trajectories reaches competitive performance across three unrelated multi-agent environments without any task-specific tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one GPT-based model can handle StarCraft Multi-Agent Challenge, Google Research Football, and POGEMA by training on hundreds of millions to a billion expert trajectories per environment and using a shared observation encoder. This setup avoids the usual practice of building a separate model for each new multi-agent problem. If the approach holds, it supports the idea of a general-purpose foundation model for multi-agent reinforcement learning that works across varied observation and action spaces. The results come from direct comparisons to specialized baselines in each environment.

Core claim

MARL-GPT uses offline reinforcement learning on large expert datasets together with a single transformer observation encoder to achieve performance comparable to environment-specific agents in SMACv2, GRF, and POGEMA.

What carries the argument

The single shared transformer-based observation encoder that processes inputs from environments with different observation and action spaces without task-specific adjustments.
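
One plausible shape for such an encoder is sketched below. This is a reading aid, not the paper's published architecture: the class name, token size, padding scheme, and layer counts are all assumptions. What it illustrates is the property the argument rests on: any observation, whether SMACv2 unit-feature vectors, a GRF continuous state, or a POGEMA egocentric grid, is chopped into fixed-size tokens, padded to a common length, and passed through one transformer whose weights never branch by environment.

```python
import torch
import torch.nn as nn

# Hypothetical sketch; all names and dimensions are assumptions, not the paper's.
D_MODEL, TOKEN_DIM, MAX_TOKENS = 256, 32, 64

class SharedObservationEncoder(nn.Module):
    """One set of weights for every environment: observations arrive as
    padded sequences of fixed-size tokens, and nothing downstream depends
    on which environment produced them."""

    def __init__(self):
        super().__init__()
        self.project = nn.Linear(TOKEN_DIM, D_MODEL)  # token -> model width
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens, pad_mask):
        # tokens: (batch, MAX_TOKENS, TOKEN_DIM); pad_mask: True at padding
        h = self.encoder(self.project(tokens), src_key_padding_mask=pad_mask)
        return h[:, 0]  # first (always real) token summarizes the observation

def tokenize(obs: torch.Tensor):
    """Flatten any observation into TOKEN_DIM-sized chunks, then pad."""
    flat = obs.reshape(-1).float()
    n_tok = -(-flat.numel() // TOKEN_DIM)     # ceiling division
    padded = torch.zeros(MAX_TOKENS * TOKEN_DIM)
    padded[: flat.numel()] = flat
    mask = torch.arange(MAX_TOKENS) >= n_tok  # True marks padding tokens
    return padded.reshape(MAX_TOKENS, TOKEN_DIM), mask
```

Under a sketch like this, a flat feature vector and a 2-D grid take the same code path; only the number of non-padding tokens differs, which is precisely the property the referee report below asks the authors to document.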

If this is right

  • Large-scale offline training on expert data can substitute for custom model design per task.
  • A shared encoder captures transferable multi-agent coordination patterns across domains.
  • Scaling the approach to additional environments could reduce the need for repeated architecture search in MARL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The model might serve as a starting point for quick adaptation to new multi-agent problems with limited extra data.
  • Combining the offline pre-training with limited online interaction could close remaining gaps to optimal performance.

Load-bearing premise

A single encoder without per-environment changes can still extract useful features from the very different observation formats in these three tasks.

What would settle it

Training the same architecture on a fourth multi-agent environment with substantially different observation structure and measuring whether it still matches specialized baselines.
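
"Matches specialized baselines" also needs a statistical reading. One hedged way to score that test, with synthetic per-episode outcomes standing in for real rollouts (the paper prescribes no such protocol), is a paired bootstrap on the success-rate gap:

```python
import numpy as np

# Sketch only: the outcome arrays below are synthetic stand-ins for
# per-episode success flags from the shared model and a specialized
# baseline evaluated on the same held-out episodes.
rng = np.random.default_rng(0)
shared = rng.random(500) < 0.78    # hypothetical shared-model outcomes
baseline = rng.random(500) < 0.80  # hypothetical specialist outcomes

def bootstrap_gap_ci(a, b, n_boot=10_000, alpha=0.05):
    """CI on mean(b) - mean(a), resampling paired episodes with replacement."""
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    gaps = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    return np.quantile(gaps, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_gap_ci(shared, baseline)
print(f"specialist minus shared, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# An interval tight around zero supports "competitive"; a clearly
# positive interval means the specialist still wins.
```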

Figures

Figures reproduced from arXiv: 2604.05943 by Aleksandr Panov, Alexey Kovalev, Alexey Skrynnik, Anton Andreychuk, Egor Cherepanov, Konstantin Yakovlev, Maria Nesterova, Mikhail Kolosov, Oleg Bulichev.

Figure 1: Spider plot demonstrating the relative performance …
Figure 2: The general pipeline begins with training expert policies across diverse MARL environments using domain-appropriate …
Figure 3: Illustration of the proposed encoding scheme for …
Figure 4: Online fine-tuning results: (a) Battle won on the …
Figure 5: Modular and reconfigurable maze environment
Figure 6: Waveshare JetBot robotic agent used for maze navigation
Figure 7: Real-world execution of a scenario based on maze …
Original abstract

Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT-based model to learn and perform well across diverse MARL environments and tasks, including StarCraft Multi-Agent Challenge, Google Research Football and POGEMA. Our method, MARL-GPT, applies offline reinforcement learning to train at scale on the expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer-based observation encoder that requires no task-specific tuning. Experiments show that MARL-GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi-task transformer-based model for a wide variety of (significantly different) multi-agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MARL-GPT, a single GPT-based model for multi-agent reinforcement learning trained via offline RL on large expert trajectory datasets (400M for SMACv2, 100M for GRF, 1B for POGEMA). It employs a shared transformer observation encoder requiring no task-specific tuning and claims competitive performance against specialized baselines across these heterogeneous environments, arguing this demonstrates the feasibility of a foundational multi-task MARL model analogous to NLP foundation models.

Significance. If the empirical results and architectural unification hold under scrutiny, the work would be significant for advancing generalist approaches in MARL. The scale of multi-environment offline training on diverse tasks (unit-based, continuous, and grid observations) represents a concrete step toward unified models, with potential to reduce the need for per-task specialization if the shared encoder truly operates without tuning.

major comments (2)
  1. [Abstract] The assertion that MARL-GPT 'achieves competitive performance compared to specialized baselines in all tested environments' provides no quantitative metrics, baseline names, win rates, or statistical tests. This absence makes the central empirical claim impossible to evaluate and is load-bearing for the paper's contribution.
  2. [Methods] Observation encoder: The claim of a 'single transformer-based observation encoder that requires no task-specific tuning' is central to the foundation-model analogy, yet the manuscript supplies no description of the tokenization, embedding, padding, or projection steps that map heterogeneous inputs (SMACv2 unit vectors, GRF continuous states, POGEMA grids) into a common representation. Without this, it is unclear whether the encoder is truly shared and untuned or relies on implicit per-environment components.
minor comments (1)
  1. [Abstract] The parenthetical analogy to 'ChatGPT, Llama, Mistral etc.' in the abstract would benefit from a brief citation or clarification to avoid informal tone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment point by point below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that MARL-GPT 'achieves competitive performance compared to specialized baselines in all tested environments' provides no quantitative metrics, baseline names, win rates, or statistical tests. This absence makes the central empirical claim impossible to evaluate and is load-bearing for the paper's contribution.

    Authors: We agree that the abstract would be improved by including concrete quantitative details to support the central claim. In the revised manuscript, we have updated the abstract to reference specific performance metrics (win rates and success rates), the names of the specialized baselines, and pointers to the full results with statistical tests in the Experiments section and tables. revision: yes

  2. Referee: [Methods] Observation encoder: The claim of a 'single transformer-based observation encoder that requires no task-specific tuning' is central to the foundation-model analogy, yet the manuscript supplies no description of the tokenization, embedding, padding, or projection steps that map heterogeneous inputs (SMACv2 unit vectors, GRF continuous states, POGEMA grids) into a common representation. Without this, it is unclear whether the encoder is truly shared and untuned or relies on implicit per-environment components.

    Authors: We acknowledge that the original manuscript lacked sufficient detail on the observation encoder. We have revised the Methods section to include a complete description of the tokenization, embedding, padding, and projection steps that map the heterogeneous observations into a shared representation space. This addition makes explicit that the encoder is a single shared module with no task-specific tuning or per-environment parameters. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical multi-task training on heterogeneous trajectories is self-contained

Full rationale

The paper's chain is: collect large expert datasets (400M SMACv2, 100M GRF, 1B POGEMA), train a single transformer observation encoder plus GPT-style policy head via offline RL, then report competitive returns versus specialized baselines. No equation defines a quantity in terms of itself, no fitted hyperparameter is relabeled as an out-of-sample prediction, and no uniqueness theorem or ansatz is imported from the authors' prior work to force the architecture. The shared-encoder claim is supported by the training procedure and test results rather than by construction or by load-bearing self-citation.
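
For concreteness, the simplest non-circular version of that chain is plain behavior cloning over pooled trajectories. The sketch below is illustrative only (the paper's actual loss, optimizer, and batching are not reproduced here): `encoder` would be a shared module like the one sketched earlier, `policy_head` a GPT-style action predictor, and the point is that the supervision target is external expert data, never a quantity the model defines for itself.

```python
import torch.nn.functional as F

def offline_update(encoder, policy_head, batch, optimizer):
    """One behavior-cloning step on pooled expert data (illustrative only).

    batch: dict with
      tokens   (B, T, TOKEN_DIM)  tokenized observations, any environment
      pad_mask (B, T)             True where a token is padding
      actions  (B,)               expert action indices
    """
    optimizer.zero_grad()
    z = encoder(batch["tokens"], batch["pad_mask"])    # shared representation
    logits = policy_head(z)                            # action logits
    loss = F.cross_entropy(logits, batch["actions"])   # imitate the expert
    loss.backward()
    optimizer.step()
    return loss.item()

# Mixing SMACv2, GRF, and POGEMA batches through this one function, with no
# per-environment branch, is what the circularity check above verifies.
```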

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the transformer architecture itself contains many standard hyperparameters whose values are not reported.

pith-pipeline@v0.9.0 · 5524 in / 1074 out tokens · 39262 ms · 2026-05-10T19:41:48.048350+00:00 · methodology

