pith. machine review for the scientific record.

arXiv:2604.18190 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI

Recognition: unknown

Scalable Neighborhood-Based Multi-Agent Actor-Critic

Rasmus Jensen, Tim Goppelsroeder

Pith reviewed 2026-05-10 05:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: multi-agent reinforcement learning · MADDPG · scalable critics · actor-critic methods · neighborhood-based learning · centralized training · particle environments

The pith

MADDPG-K restricts each agent's critic to the k closest agents by Euclidean distance to keep critic input size constant as the number of agents grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multi-agent actor-critic methods like MADDPG use centralized critics that take inputs from all agents, so the critic input size grows linearly with the total number of agents and training cost grows with it. MADDPG-K addresses this by having each critic consider only the k nearest agents under a distance metric, here Euclidean distance in observation space, which keeps the critic input dimension fixed regardless of agent count. The paper shows that the remaining quadratic cost comes from cheap scalar distance calculations rather than heavy neural-network operations. Experiments in particle environments show MADDPG-K matches or beats MADDPG while scaling better in runtime and sometimes converging faster in cooperative tasks.
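As a concrete illustration of the mechanism, the sketch below builds a fixed-size critic input from the k nearest agents. It assumes NumPy, per-agent observation and action vectors, and distances computed over agent positions; the function and variable names are illustrative, not the authors' implementation (their code is in the linked repository).

    # Minimal sketch of the MADDPG-K neighborhood restriction (illustrative, not the authors' code).
    import numpy as np

    def k_nearest_indices(positions: np.ndarray, agent_idx: int, k: int) -> np.ndarray:
        """Indices of the k agents closest to `agent_idx` by Euclidean distance (self excluded)."""
        dists = np.linalg.norm(positions - positions[agent_idx], axis=1)  # N cheap scalar distances
        dists[agent_idx] = np.inf  # never select the agent itself
        return np.argsort(dists)[:k]

    def critic_input(observations, actions, positions, agent_idx, k):
        """Concatenate the agent's own obs/action with those of its k nearest neighbors.

        The result has a fixed length of k + 1 obs/action blocks no matter how many
        agents N are present, which is what keeps the critic size constant.
        """
        neighbors = k_nearest_indices(positions, agent_idx, k)
        parts = [observations[agent_idx], actions[agent_idx]]
        for j in neighbors:
            parts.extend([observations[j], actions[j]])
        return np.concatenate(parts)

Whether distances are taken over positions or full observation vectors, and whether the agent counts itself toward k, are implementation details (the paper's appendix notes the K definition differs slightly between paper and code); the fixed-size input is the part that matters for scaling.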

Core claim

MADDPG-K is a modification of MADDPG in which each agent's critic is conditioned only on the observations and actions of the k closest agents according to Euclidean distance. This produces a constant-size critic input: the per-agent critic cost no longer grows with the number of agents, while the benefits of centralized training during learning are retained.

What carries the argument

The k-nearest-neighbor restriction on the critic input using Euclidean distance, which selects a fixed number of agents to include in the value estimation regardless of total population size.

Load-bearing premise

The assumption that the k closest agents under Euclidean distance provide enough information for the critic to accurately estimate the joint value, with distant agents contributing little.

What would settle it

An experiment at large agent counts, or in environments where non-local interactions dominate, testing whether MADDPG-K's performance drops significantly below MADDPG's.

Figures

Figures reproduced from arXiv:2604.18190 by Rasmus Jensen and Tim Goppelsroeder.

Figure 1: Simple Tag
Figure 4: Simple Tag
Figure 5: MADDPG-K and MADDPG on Simple Spread
Figure 6: MADDPG and MADDPG-K on Simple Adversary
Figure 7: Algorithms from the MADDPG paper's GitHub
Figure 8: Our versions of DDPG and MADDPG
Figure 9: MADDPG-K, 1 agent, 3 adversaries, K = 3
Figure 10: MADDPG and MADDPG-K, Simple Adversary, N = 3, K = 2
Figure 11: MADDPG and MADDPG-K, Simple Adversary, N = 5, K = 2
Figure 12: MADDPG and MADDPG-K, Simple Adversary, N = 7, K = 2
Figure 13: MADDPG and MADDPG-K, Simple Adversary, N = 9, K = 2
Figure 14: MADDPG and MADDPG-K, Simple Spread, N = 3, K = 2
Figure 15: MADDPG and MADDPG-K, Simple Spread, N = 5, K = 2
Figure 16: MADDPG and MADDPG-K, Simple Spread, N = 7, K = 2
Figure 17: MADDPG and MADDPG-K, Simple Spread, N = 9, K = 2
Figure 18: Scalability of MADDPG-K and MADDPG on the Simple Spread environment
Figure 19: MADDPG with central-critic algorithm on Simple Tag
Figure 20: Original MADDPG algorithm
Original abstract

We propose MADDPG-K, a scalable extension to Multi-Agent Deep Deterministic Policy Gradient (MADDPG) that addresses the computational limitations of centralized critic approaches. Centralized critics, which condition on the observations and actions of all agents, have demonstrated significant performance gains in cooperative and competitive multi-agent settings. However, their critic networks grow linearly in input size with the number of agents, making them increasingly expensive to train at scale. MADDPG-K mitigates this by restricting each agent's critic to the $k$ closest agents under a chosen metric which in our case is Euclidean distance. This ensures a constant-size critic input regardless of the total agent count. We analyze the complexity of this approach, showing that the quadratic cost it retains arises from cheap scalar distance computations rather than the expensive neural network matrix multiplications that bottleneck standard MADDPG. We validate our method empirically across cooperative and adversarial environments from the Multi-Particle Environment suite, demonstrating competitive or superior performance compared to MADDPG, faster convergence in cooperative settings, and better runtime scaling as the number of agents grows. Our code is available at https://github.com/TimGop/MADDPG-K .
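A back-of-envelope sketch of the complexity argument above: the all-pairs distance term scales quadratically in N but consists of cheap scalar operations, while each critic forward pass, the expensive part, has a fixed cost under MADDPG-K because its input no longer grows with N. The dimensions and layer sizes below are illustrative assumptions, not values from the paper.

    # Rough operation counts contrasting MADDPG and MADDPG-K critic costs (illustrative only).
    def distance_ops(n_agents: int, dim: int = 2) -> int:
        # All-pairs Euclidean distances: O(N^2) cheap scalar operations.
        return n_agents * n_agents * dim

    def critic_forward_ops(input_dim: int, hidden: int = 64, layers: int = 2) -> int:
        # Rough multiply-accumulate count for one MLP critic forward pass.
        ops, d = 0, input_dim
        for _ in range(layers):
            ops += d * hidden
            d = hidden
        return ops + d  # final scalar Q-value head

    obs_dim, act_dim, k = 16, 5, 2
    for n in (3, 9, 27, 81):
        maddpg_in = n * (obs_dim + act_dim)          # critic input grows linearly with N
        maddpg_k_in = (k + 1) * (obs_dim + act_dim)  # critic input fixed regardless of N
        maddpg_cost = n * critic_forward_ops(maddpg_in)                      # N critics, growing inputs
        maddpg_k_cost = n * critic_forward_ops(maddpg_k_in) + distance_ops(n)
        print(f"N={n}: MADDPG ~{maddpg_cost} ops, MADDPG-K ~{maddpg_k_cost} ops")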

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes MADDPG-K, a scalable extension of MADDPG in which each agent's critic is restricted to the observations and actions of only the k nearest agents (selected by Euclidean distance). This keeps critic input size constant independent of total agent count N. The authors provide a complexity argument that the retained quadratic term consists of inexpensive scalar distance computations rather than costly neural-network matrix multiplies, and they report empirical results on cooperative and competitive tasks from the Multi-Particle Environment showing competitive performance, faster convergence in some settings, and improved runtime scaling with growing N.

Significance. If the locality heuristic proves robust, the method offers a practical route to scaling centralized-critic MARL to agent populations that would otherwise be computationally prohibitive. The explicit separation of cheap distance calculations from expensive network operations is a useful engineering insight, and the public code release supports reproducibility.

major comments (1)
  1. [Empirical validation] Empirical validation section: the central claim that k-nearest Euclidean neighbors preserve sufficient information for accurate value estimation is load-bearing yet supported only by aggregate performance numbers. No ablation on the choice of k, no comparison against random or learned neighbor selection, and no statistical tests (multiple seeds, error bars, significance) are described, leaving the weakest assumption untested and the scaling claims difficult to evaluate.
minor comments (1)
  1. [Abstract] The abstract states that runtime scaling improves 'as the number of agents grows' but does not indicate the concrete range of N tested or how wall-clock time was measured; adding these details would strengthen the complexity narrative.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on empirical validation below and will incorporate the suggested improvements in the revised version.

Point-by-point responses
  1. Referee: [Empirical validation] Empirical validation section: the central claim that k-nearest Euclidean neighbors preserve sufficient information for accurate value estimation is load-bearing yet supported only by aggregate performance numbers. No ablation on the choice of k, no comparison against random or learned neighbor selection, and no statistical tests (multiple seeds, error bars, significance) are described, leaving the weakest assumption untested and the scaling claims difficult to evaluate.

    Authors: We agree that the sufficiency of the k-nearest Euclidean neighbors for value estimation is a central assumption and that the current empirical section relies primarily on aggregate performance metrics. While the reported results demonstrate competitive performance against MADDPG and improved runtime scaling, we acknowledge the absence of targeted ablations and statistical rigor. In the revised manuscript we will add: (1) an ablation study varying k across the evaluated environments, (2) a comparison against random neighbor selection (and, space permitting, a simple learned selection baseline), and (3) results averaged over multiple random seeds with error bars and appropriate statistical significance tests. These additions will directly test the locality heuristic and strengthen the scaling claims. revision: yes
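A sketch of the ablation protocol promised above, as we read it: sweep the neighborhood size k, train with several random seeds, and aggregate so error bars can be plotted. The training entry point run_fn is a hypothetical stand-in, not part of the released code.

    # Illustrative ablation harness; `run_fn(env_name, k, seed)` is a hypothetical
    # training function returning a final average return.
    import numpy as np

    def ablate_k(run_fn, env_name, k_values, n_seeds=10):
        """Return {k: (mean_return, std_return)} aggregated over seeds."""
        results = {}
        for k in k_values:
            returns = np.array([run_fn(env_name, k=k, seed=s) for s in range(n_seeds)])
            results[k] = (returns.mean(), returns.std(ddof=1))
        return results

    # Usage (hypothetical): ablate_k(run_maddpg_k, "simple_spread", k_values=[1, 2, 3, 4])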

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces MADDPG-K as a practical algorithmic modification to MADDPG, restricting each critic to the k Euclidean-nearest agents to achieve constant input size. This is justified by explicit complexity analysis (quadratic scalar distances vs. neural-network matrix multiplies) and empirical results on MPE tasks. No load-bearing derivation, fitted-parameter prediction, self-referential definition, or self-citation chain reduces the central claim to its own inputs by construction. The locality heuristic is presented as a domain-appropriate engineering choice rather than a mathematically derived necessity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that local neighborhoods suffice for value estimation in the tested environments; k is treated as a tunable hyperparameter.

free parameters (1)
  • k
    Number of nearest agents included in each critic; selected as a hyperparameter to balance information and compute.
axioms (1)
  • domain assumption: Local neighborhood information under Euclidean distance is sufficient to approximate the centralized critic's value function.
    Invoked by the design choice to restrict critic input to k closest agents.

pith-pipeline@v0.9.0 · 5500 in / 1143 out tokens · 28351 ms · 2026-05-10T05:15:55.385718+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540

  2. [2]

    Counterfactual multi-agent policy gradients,

    Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. CoRR, abs/1705.08926, 2017. URL http://arxiv.org/abs/1705.08926

  3. [3]

    Addressing Function Approximation Error in Actor-Critic Methods

    Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. CoRR, abs/1802.09477, 2018. URL http://arxiv.org/abs/1802.09477

  4. [5]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290

  5. [6]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  6. [7]

    Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

    Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. CoRR, abs/1706.02275, 2017. URL http://arxiv.org/abs/1706.02275

  7. [8]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602

  8. [9]

    Emergence of grounded compositional language in multi-agent populations

    Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  9. [10]

    Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,

    Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V Albrecht. Comparative evaluation of cooperative multi-agent deep reinforcement learning algorithms. arXiv preprint arXiv:2006.07869, 2020

  10. [11]

    FACMAC: Factored multi-agent centralised policy gradients

    Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. FACMAC: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021

  11. [12]

    Monotonic value function factorisation for deep multi-agent reinforcement learning

    Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020

  12. [13]

    Actor-attention-critic for multi-agent reinforcement learning

    Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. International Conference on Machine Learning, 2018

  13. [14]

    Multi-agent reinforcement learning: Independent versus cooperative agents

    Ming Tan. Multi-agent reinforcement learning: Independent versus cooperative agents. In International Conference on Machine Learning, 1997. URL https://api.semanticscholar.org/CorpusID:268857333

  14. [15]

    PettingZoo: Gym for multi-agent reinforcement learning

    Justin K. Terry, Benjamin Black, Ananth Hari, Luis S. Santos, Clemens Dieffendahl, Niall L. Williams, Yashas Lokesh, Caroline Horsch, and Praveen Ravi. PettingZoo: Gym for multi-agent reinforcement learning. CoRR, abs/2009.14471, 2020. URL https://arxiv.org/abs/2009.14471

  15. [16]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

  16. [17]

    Mean field multi-agent reinforcement learning

    Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580. PMLR, 2018