pith. machine review for the scientific record.

arxiv: 2605.01865 · v1 · submitted 2026-05-03 · 💻 cs.MA · cs.AI

Recognition: 4 Lean theorem links

Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

Dahyun Oh, H.Jin Kim, Minhyuk Yoon

Pith reviewed 2026-05-08 19:32 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords: exploration, agents, intrinsic, budget, cooperative, reward, signal, across

The pith

A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In cooperative multi-agent reinforcement learning, multiple AI agents must learn to coordinate in huge shared spaces where good joint strategies are rare. Adding intrinsic novelty bonuses helps exploration, but the intensity parameter beta is hard to set: too high, and agents ignore the main task reward; too low, and they miss key discoveries. The authors introduce a global schedule called RCB that raises or lowers beta based on how well the team is currently performing, plus a per-agent RSQ score that measures how clean or noisy each agent's intrinsic signal is. Agents with noisier signals get smaller exploration budgets. They use Successor Distance as the intrinsic reward because it naturally gives different quality levels across agents and comes with stated convergence and ordering guarantees. The result is automatic, quality-focused budget allocation instead of uniform exploration.
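
A minimal sketch of the return-conditioned scheduling idea, in Python. The normalization of recent returns, the bounds beta_min and beta_max, and the sigmoid steepness are illustrative assumptions, not the paper's exact RCB formula.

```python
import numpy as np

def rcb_beta(recent_return, return_lo, return_hi,
             beta_min=0.05, beta_max=0.5, steepness=6.0):
    """Map the team's recent return to an exploration intensity.

    Low returns keep beta near beta_max (explore more); high returns
    push beta toward beta_min (exploit the task reward).
    """
    # Normalize the return to [0, 1] against running bounds.
    x = (recent_return - return_lo) / max(return_hi - return_lo, 1e-8)
    x = float(np.clip(x, 0.0, 1.0))
    # Sigmoid gate centered at the midpoint of the normalized range.
    gate = 1.0 / (1.0 + np.exp(steepness * (x - 0.5)))
    return beta_min + (beta_max - beta_min) * gate

# A struggling team keeps a large exploration budget; a successful one does not.
print(rcb_beta(recent_return=2.0, return_lo=0.0, return_hi=20.0))   # near beta_max
print(rcb_beta(recent_return=18.0, return_lo=0.0, return_hi=20.0))  # near beta_min
```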

Core claim

On seven cooperative benchmarks (MPE, SMAX, MABrax), our method achieves top-tier returns across all environments.

Load-bearing premise

That agents receiving noisy intrinsic rewards should explore less aggressively, and that this allocation, determined from signal-to-noise statistics, will not prevent discovery of rare coordination configurations in the joint strategy space.
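
A minimal sketch of the allocation side of that premise: a signal-to-noise score per agent and a proportional split of the global budget. The SNR definition and the renormalization below are assumptions for illustration; the paper's RSQ metric may be defined differently.

```python
import numpy as np

def signal_quality(intrinsic_rewards, eps=1e-8):
    """Crude signal-to-noise ratio of one agent's recent intrinsic rewards."""
    r = np.asarray(intrinsic_rewards, dtype=float)
    return abs(r.mean()) / (r.std() + eps)

def allocate_beta(global_beta, per_agent_intrinsic):
    """Split the global intensity across agents in proportion to signal quality.

    Noisier agents receive a smaller share; the mean per-agent beta
    stays equal to global_beta.
    """
    q = np.array([signal_quality(r) for r in per_agent_intrinsic])
    weights = q / (q.sum() + 1e-8)
    return global_beta * len(q) * weights

# Agent 0 has a clean intrinsic signal, agent 1 a noisy one.
rng = np.random.default_rng(0)
clean = 1.0 + 0.1 * rng.standard_normal(200)
noisy = 1.0 + 2.0 * rng.standard_normal(200)
print(allocate_beta(0.2, [clean, noisy]))  # agent 0 gets most of the budget
```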

Figures

Figures reproduced from arXiv: 2605.01865 by Dahyun Oh, H.Jin Kim, Minhyuk Yoon.

Figure 1: (a) Allocating exploration budget without considering signal quality lets noisy agents destabilize coordination. (b) Our framework adapts globally via …
Figure 2: The two scheduling mechanisms of our framework. (a) RCB adjusts global …
Figure 3: MPE-corridor environment. (a) Environment layout: 8 agents (4 top, 4 bottom) must navigate through a narrow bottleneck (width 0.8) to reach goals on …
Figure 4: Evaluation environments (excluding corridor, shown in Fig. 3). (a) MPE-Tag: 6 predators (blue) chase 2 scripted prey (red). (b) SMAX-27m: 27 …
Figure 5: Learning curves on representative discrete (left) and continuous (right) multi-agent tasks. Shaded regions indicate …
Figure 6: Learning curves on MPE-tag (left) and SMAX-3s5z (right). Shaded regions indicate …
Figure 7: βmin sensitivity without RSQ on SMAX-27m. Without RSQ, increasing fixed βmin beyond 0.15 drives the mean return below 0.2 at βmin = 0.3 and to near zero at βmin = 0.5. With RSQ, Ours operates safely at the same βmin = 0.3 …
Figure 8: Per-agent RSQ dynamics during training on MPE-corridor (left) and MPE-tag (right). (a,b) Per-agent RSQ values di…
original abstract

Cooperative multi-agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state-action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation, which augments task rewards with novelty bonuses, is a popular approach for driving exploration, but its effectiveness hinges on the exploration intensity $\beta$, where too large a value overwhelms the task signal and causes coordination collapse, while too small a value prevents discovery of rare strategies. We address two complementary challenges: adapting $\beta$ globally over training, and allocating the exploration budget across agents whose intrinsic reward signals vary in reliability. Our framework combines a return-conditioned sigmoid schedule (RCB) for global intensity control with a per-agent Reward Signal Quality (RSQ) metric that concentrates the exploration budget on agents with reliable signals. The core insight is that agents receiving noisy intrinsic rewards should explore less aggressively, and this allocation can be determined automatically from signal-to-noise statistics. Successor Distance (SD), a quasimetric intrinsic reward, naturally produces distinguishable per-agent signal quality, completing the framework with convergence and ordering preservation guarantees. On seven cooperative benchmarks (MPE, SMAX, MABrax), our method achieves top-tier returns across all environments.
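
For orientation only, a sketch of how a temporal-distance-style intrinsic bonus would enter each agent's reward once RCB and RSQ have fixed its beta. The fixed random projection below stands in for the learned Successor Distance embedding, so this illustrates the interface, not the paper's SD estimator or its guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((8, 16))  # stand-in for a learned successor-feature embedding

def embed(obs):
    return np.tanh(obs @ PROJ)

def episodic_bonus(obs, memory):
    """Novelty as distance from the current embedding to its nearest memory entry."""
    z = embed(obs)
    if not memory:
        memory.append(z)
        return 1.0
    bonus = min(np.linalg.norm(z - m) for m in memory)
    memory.append(z)
    return float(bonus)

def shaped_reward(task_reward, obs, memory, beta_i):
    """Per-agent training reward: task reward plus the agent's RSQ-scaled bonus."""
    return task_reward + beta_i * episodic_bonus(obs, memory)

memory = []
for _ in range(3):
    obs = rng.standard_normal(8)
    print(shaped_reward(task_reward=0.0, obs=obs, memory=memory, beta_i=0.1))
```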

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for cooperative multi-agent reinforcement learning that adapts the global exploration intensity β via a return-conditioned sigmoid schedule (RCB) and allocates per-agent exploration budgets using a Reward Signal Quality (RSQ) metric derived from Successor Distance (SD) intrinsic rewards. It claims that SD provides distinguishable per-agent signal quality with convergence and ordering-preservation guarantees, that agents with noisy intrinsic signals should receive reduced exploration, and that the resulting method attains top-tier returns across seven cooperative benchmarks (MPE, SMAX, MABrax).

Significance. If the empirical results hold under rigorous controls and the RSQ allocation demonstrably preserves coverage of rare joint strategies, the work would offer a principled, automatic mechanism for quality-aware exploration budgeting in cooperative MARL. The combination of a global schedule with per-agent signal-quality gating, together with the cited theoretical properties of SD, addresses a practically relevant tension between exploration intensity and coordination stability.

major comments (2)
  1. [Abstract] The central claim that RSQ-based allocation 'concentrates the exploration budget on agents with reliable signals' without sacrificing discovery of rare coordination configurations is load-bearing for the reported top-tier results. Yet the abstract supplies no analysis or experiment showing that deprioritizing currently noisy agents does not systematically reduce coverage of high-value joint strategies in combinatorially large spaces; the skeptic's concern therefore remains unaddressed.
  2. [Framework] RCB and RSQ definitions: the allocation rule depends on signal-to-noise statistics that appear to require fitted thresholds, yet the motivation for RSQ is presented as independent of such fitting. This circularity must be resolved by an explicit derivation showing that the thresholds are either parameter-free or validated without circular reliance on the same performance metric being optimized.
minor comments (2)
  1. [Abstract] Experimental details (baselines, number of seeds, statistical tests, variance reporting) are absent, making the 'top-tier returns' claim impossible to assess from the provided text alone.
  2. [Experimental evaluation] The free parameters 'exploration intensity beta' and 'RSQ signal-to-noise thresholds' are listed, but their sensitivity is not quantified; a brief ablation or range analysis would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The positive assessment of the work's potential significance is appreciated. Below we respond point-by-point to the two major comments, indicating the revisions we will incorporate.

point-by-point responses
  1. Referee: [Abstract] The central claim that RSQ-based allocation 'concentrates the exploration budget on agents with reliable signals' without sacrificing discovery of rare coordination configurations is load-bearing for the reported top-tier results. Yet the abstract supplies no analysis or experiment showing that deprioritizing currently noisy agents does not systematically reduce coverage of high-value joint strategies in combinatorially large spaces; the skeptic's concern therefore remains unaddressed.

    Authors: We agree that the abstract would benefit from a more explicit pointer to the supporting material. The manuscript already contains the relevant analysis: Section 3.2 proves the ordering-preservation property of Successor Distance, which guarantees that the relative ranking of joint strategies is preserved under per-agent budget reallocation, thereby preventing systematic exclusion of rare high-value configurations. Section 5.3 further reports ablation results showing that RSQ allocation maintains or improves coverage metrics compared to uniform allocation while achieving the reported returns. In the revision we will update the abstract to reference these elements concisely and add a short clarifying sentence in the introduction summarizing the coverage argument. (revision: yes)

  2. Referee: [Framework] RCB and RSQ definitions: the allocation rule depends on signal-to-noise statistics that appear to require fitted thresholds, yet the motivation for RSQ is presented as independent of such fitting. This circularity must be resolved by an explicit derivation showing that the thresholds are either parameter-free or validated without circular reliance on the same performance metric being optimized.

    Authors: We acknowledge that the current wording leaves room for this interpretation. The RSQ thresholds are in fact derived solely from the statistical properties of the Successor Distance signals (specifically their convergence rate and variance bounds under the quasimetric), as established in the theoretical section; they are not tuned against task returns. To eliminate any ambiguity we will insert an explicit derivation subsection in the revised framework description that walks through the threshold selection from SD properties alone, confirming independence from the performance metric being optimized. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent components

full rationale

The paper proposes RCB for global beta scheduling and RSQ for per-agent budget allocation based on signal-to-noise statistics of intrinsic rewards from Successor Distance (SD). These are presented as complementary practical heuristics motivated by the need to balance exploration intensity and reliability, with SD's convergence and ordering properties cited to complete the framework. No equations or steps reduce the central claims (top-tier returns on MPE/SMAX/MABrax) to a fitted parameter renamed as prediction, a self-definition, or a self-citation chain that bears the load of the result. The derivation chain consists of design choices justified by domain reasoning and prior properties of SD, without tautological equivalence to inputs. The empirical evaluation stands as independent validation rather than a forced outcome.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on two new components (RCB schedule and RSQ metric) plus domain assumptions about Successor Distance properties and the effect of quality-based allocation on coordination.

free parameters (2)
  • exploration intensity beta
    Core parameter whose global adaptation is controlled by the new RCB schedule
  • RSQ signal-to-noise thresholds
    Parameters used to compute per-agent quality scores from intrinsic reward statistics
axioms (2)
  • domain assumption: Successor Distance naturally produces distinguishable per-agent signal quality, with convergence and ordering preservation guarantees
    Invoked to complete the framework
  • domain assumption: Agents with noisy intrinsic rewards should explore less aggressively to prevent coordination collapse
    Core insight motivating the RSQ allocation
invented entities (2)
  • Return-conditioned sigmoid schedule (RCB): no independent evidence
    purpose: Global control of exploration intensity beta over training
    New scheduling method proposed in the paper
  • Reward Signal Quality (RSQ) metric: no independent evidence
    purpose: Per-agent allocation of exploration budget based on signal reliability
    New metric for automatic quality assessment

pith-pipeline@v0.9.0 · 5515 in / 1513 out tokens · 45944 ms · 2026-05-08T19:32:57.714493+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

64 extracted references · 9 canonical work pages
