pith. machine review for the scientific record.

arXiv:2604.18190 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI

Recognition: unknown

Scalable Neighborhood-Based Multi-Agent Actor-Critic

Rasmus Jensen, Tim Goppelsroeder

Pith reviewed 2026-05-10 05:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: multi-agent reinforcement learning · MADDPG · scalable critics · actor-critic methods · neighborhood-based learning · centralized training · particle environments

The pith

MADDPG-K restricts each agent's critic to the k closest agents by Euclidean distance to keep critic input size constant as the number of agents grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multi-agent actor-critic methods like MADDPG use centralized critics that take inputs from all agents, so the critic input size grows linearly with the total number of agents and training cost grows with it. MADDPG-K addresses this by having each critic consider only the k nearest agents under a distance metric, here Euclidean distance in observation space, which keeps the critic input dimension fixed regardless of agent count. The paper shows that the remaining quadratic cost comes from cheap scalar distance calculations rather than heavy neural-network operations. Experiments in particle environments show MADDPG-K matches or beats MADDPG while scaling better in runtime and sometimes converging faster in cooperative tasks.
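As a concrete illustration of the mechanism, the sketch below builds a fixed-size critic input from the k nearest agents. It assumes NumPy, per-agent observation and action vectors, and distances computed over agent positions; the function and variable names are illustrative, not the authors' implementation (their code is in the linked repository).

    # Minimal sketch of the MADDPG-K neighborhood restriction (illustrative, not the authors' code).
    import numpy as np

    def k_nearest_indices(positions: np.ndarray, agent_idx: int, k: int) -> np.ndarray:
        """Indices of the k agents closest to `agent_idx` by Euclidean distance (self excluded)."""
        dists = np.linalg.norm(positions - positions[agent_idx], axis=1)  # N cheap scalar distances
        dists[agent_idx] = np.inf  # never select the agent itself
        return np.argsort(dists)[:k]

    def critic_input(observations, actions, positions, agent_idx, k):
        """Concatenate the agent's own obs/action with those of its k nearest neighbors.

        The result has a fixed length of k + 1 obs/action blocks no matter how many
        agents N are present, which is what keeps the critic size constant.
        """
        neighbors = k_nearest_indices(positions, agent_idx, k)
        parts = [observations[agent_idx], actions[agent_idx]]
        for j in neighbors:
            parts.extend([observations[j], actions[j]])
        return np.concatenate(parts)

Whether distances are taken over positions or full observation vectors, and whether the agent counts itself toward k, are implementation details (the paper's appendix notes the K definition differs slightly between paper and code); the fixed-size input is the part that matters for scaling.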

Core claim

MADDPG-K is a modification of MADDPG in which each agent's critic is conditioned only on the observations and actions of the k closest agents according to Euclidean distance. This produces a constant-size critic input: the per-agent critic cost no longer grows with the number of agents, while the benefits of centralized training during learning are retained.

What carries the argument

The k-nearest-neighbor restriction on the critic input using Euclidean distance, which selects a fixed number of agents to include in the value estimation regardless of total population size.

Load-bearing premise

The assumption that the k closest agents under Euclidean distance provide enough information for the critic to accurately estimate the joint value, with distant agents contributing little.

What would settle it

An experiment at large agent counts, or in environments where non-local interactions dominate, testing whether MADDPG-K's performance drops significantly below MADDPG's.

Figures

Figures reproduced from arXiv:2604.18190 by Rasmus Jensen and Tim Goppelsroeder.

Figure 1: Simple Tag
Figure 4: Simple Tag
Figure 5: MADDPG-K and MADDPG on Simple Spread
Figure 6: MADDPG and MADDPG-K on Simple Adversary
Figure 7: Algorithms from the MADDPG paper's GitHub
Figure 8: Our versions of DDPG and MADDPG
Figure 9: MADDPG-K, 1 agent, 3 adversaries, K = 3
Figure 10: MADDPG and MADDPG-K, Simple Adversary, N = 3, K = 2
Figure 11: MADDPG and MADDPG-K, Simple Adversary, N = 5, K = 2
Figure 12: MADDPG and MADDPG-K, Simple Adversary, N = 7, K = 2
Figure 13: MADDPG and MADDPG-K, Simple Adversary, N = 9, K = 2
Figure 14: MADDPG and MADDPG-K, Simple Spread, N = 3, K = 2
Figure 15: MADDPG and MADDPG-K, Simple Spread, N = 5, K = 2
Figure 16: MADDPG and MADDPG-K, Simple Spread, N = 7, K = 2
Figure 17: MADDPG and MADDPG-K, Simple Spread, N = 9, K = 2
Figure 18: Scalability of MADDPG-K and MADDPG on the Simple Spread environment
Figure 19: MADDPG with central-critic algorithm on Simple Tag
Figure 20: Original MADDPG algorithm
Original abstract

We propose MADDPG-K, a scalable extension to Multi-Agent Deep Deterministic Policy Gradient (MADDPG) that addresses the computational limitations of centralized critic approaches. Centralized critics, which condition on the observations and actions of all agents, have demonstrated significant performance gains in cooperative and competitive multi-agent settings. However, their critic networks grow linearly in input size with the number of agents, making them increasingly expensive to train at scale. MADDPG-K mitigates this by restricting each agent's critic to the $k$ closest agents under a chosen metric which in our case is Euclidean distance. This ensures a constant-size critic input regardless of the total agent count. We analyze the complexity of this approach, showing that the quadratic cost it retains arises from cheap scalar distance computations rather than the expensive neural network matrix multiplications that bottleneck standard MADDPG. We validate our method empirically across cooperative and adversarial environments from the Multi-Particle Environment suite, demonstrating competitive or superior performance compared to MADDPG, faster convergence in cooperative settings, and better runtime scaling as the number of agents grows. Our code is available at https://github.com/TimGop/MADDPG-K .
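A back-of-envelope sketch of the complexity argument above: the all-pairs distance term scales quadratically in N but consists of cheap scalar operations, while each critic forward pass, the expensive part, has a fixed cost under MADDPG-K because its input no longer grows with N. The dimensions and layer sizes below are illustrative assumptions, not values from the paper.

    # Rough operation counts contrasting MADDPG and MADDPG-K critic costs (illustrative only).
    def distance_ops(n_agents: int, dim: int = 2) -> int:
        # All-pairs Euclidean distances: O(N^2) cheap scalar operations.
        return n_agents * n_agents * dim

    def critic_forward_ops(input_dim: int, hidden: int = 64, layers: int = 2) -> int:
        # Rough multiply-accumulate count for one MLP critic forward pass.
        ops, d = 0, input_dim
        for _ in range(layers):
            ops += d * hidden
            d = hidden
        return ops + d  # final scalar Q-value head

    obs_dim, act_dim, k = 16, 5, 2
    for n in (3, 9, 27, 81):
        maddpg_in = n * (obs_dim + act_dim)          # critic input grows linearly with N
        maddpg_k_in = (k + 1) * (obs_dim + act_dim)  # critic input fixed regardless of N
        maddpg_cost = n * critic_forward_ops(maddpg_in)                      # N critics, growing inputs
        maddpg_k_cost = n * critic_forward_ops(maddpg_k_in) + distance_ops(n)
        print(f"N={n}: MADDPG ~{maddpg_cost} ops, MADDPG-K ~{maddpg_k_cost} ops")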

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes MADDPG-K, a scalable extension of MADDPG in which each agent's critic is restricted to the observations and actions of only the k nearest agents (selected by Euclidean distance). This keeps critic input size constant independent of total agent count N. The authors provide a complexity argument that the retained quadratic term consists of inexpensive scalar distance computations rather than costly neural-network matrix multiplies, and they report empirical results on cooperative and competitive tasks from the Multi-Particle Environment showing competitive performance, faster convergence in some settings, and improved runtime scaling with growing N.

Significance. If the locality heuristic proves robust, the method offers a practical route to scaling centralized-critic MARL to agent populations that would otherwise be computationally prohibitive. The explicit separation of cheap distance calculations from expensive network operations is a useful engineering insight, and the public code release supports reproducibility.

major comments (1)
  1. [Empirical validation] Empirical validation section: the central claim that k-nearest Euclidean neighbors preserve sufficient information for accurate value estimation is load-bearing yet supported only by aggregate performance numbers. No ablation on the choice of k, no comparison against random or learned neighbor selection, and no statistical tests (multiple seeds, error bars, significance) are described, leaving the weakest assumption untested and the scaling claims difficult to evaluate.
minor comments (1)
  1. [Abstract] The abstract states that runtime scaling improves 'as the number of agents grows' but does not indicate the concrete range of N tested or how wall-clock time was measured; adding these details would strengthen the complexity narrative.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on empirical validation below and will incorporate the suggested improvements in the revised version.

Point-by-point responses
  1. Referee: [Empirical validation] Empirical validation section: the central claim that k-nearest Euclidean neighbors preserve sufficient information for accurate value estimation is load-bearing yet supported only by aggregate performance numbers. No ablation on the choice of k, no comparison against random or learned neighbor selection, and no statistical tests (multiple seeds, error bars, significance) are described, leaving the weakest assumption untested and the scaling claims difficult to evaluate.

    Authors: We agree that the sufficiency of the k-nearest Euclidean neighbors for value estimation is a central assumption and that the current empirical section relies primarily on aggregate performance metrics. While the reported results demonstrate competitive performance against MADDPG and improved runtime scaling, we acknowledge the absence of targeted ablations and statistical rigor. In the revised manuscript we will add: (1) an ablation study varying k across the evaluated environments, (2) a comparison against random neighbor selection (and, space permitting, a simple learned selection baseline), and (3) results averaged over multiple random seeds with error bars and appropriate statistical significance tests. These additions will directly test the locality heuristic and strengthen the scaling claims. revision: yes
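A sketch of the ablation protocol promised above, as we read it: sweep the neighborhood size k, train with several random seeds, and aggregate so error bars can be plotted. The training entry point run_fn is a hypothetical stand-in, not part of the released code.

    # Illustrative ablation harness; `run_fn(env_name, k, seed)` is a hypothetical
    # training function returning a final average return.
    import numpy as np

    def ablate_k(run_fn, env_name, k_values, n_seeds=10):
        """Return {k: (mean_return, std_return)} aggregated over seeds."""
        results = {}
        for k in k_values:
            returns = np.array([run_fn(env_name, k=k, seed=s) for s in range(n_seeds)])
            results[k] = (returns.mean(), returns.std(ddof=1))
        return results

    # Usage (hypothetical): ablate_k(run_maddpg_k, "simple_spread", k_values=[1, 2, 3, 4])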

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces MADDPG-K as a practical algorithmic modification to MADDPG, restricting each critic to the k Euclidean-nearest agents to achieve constant input size. This is justified by explicit complexity analysis (quadratic scalar distances vs. neural-network matrix multiplies) and empirical results on MPE tasks. No load-bearing derivation, fitted-parameter prediction, self-referential definition, or self-citation chain reduces the central claim to its own inputs by construction. The locality heuristic is presented as a domain-appropriate engineering choice rather than a mathematically derived necessity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that local neighborhoods suffice for value estimation in the tested environments; k is treated as a tunable hyperparameter.

free parameters (1)
  • k
    Number of nearest agents included in each critic; selected as a hyperparameter to balance information and compute.
axioms (1)
  • domain assumption: Local neighborhood information under Euclidean distance is sufficient to approximate the centralized critic's value function.
    Invoked by the design choice to restrict critic input to k closest agents.

pith-pipeline@v0.9.0 · 5500 in / 1143 out tokens · 28351 ms · 2026-05-10T05:15:55.385718+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540

  2. [2]

    Counterfactual multi-agent policy gradients,

    Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. CoRR, abs/1705.08926, 2017. URL http://arxiv.org/abs/1705.08926

  3. [3]

    Addressing Function Approximation Error in Actor-Critic Methods

    Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. CoRR, abs/1802.09477, 2018. URL http://arxiv.org/abs/1802.09477

  4. [5]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290

  5. [6]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  6. [7]

    Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

    Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. CoRR, abs/1706.02275, 2017. URL http://arxiv.org/abs/1706.02275

  7. [8]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602

  8. [9]

    Emergence of grounded compositional language in multi-agent populations

    Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  9. [10]

    Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks,

    Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V Albrecht. Comparative evaluation of cooperative multi-agent deep reinforcement learning algorithms. arXiv preprint arXiv:2006.07869, 2020

  10. [11]

    FACMAC: Factored multi-agent centralised policy gradients

    Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. FACMAC: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021

  11. [12]

    Monotonic value function factorisation for deep multi-agent reinforcement learning

    Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020

  12. [13]

    Actor-attention-critic for multi-agent reinforcement learning

    Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. International Conference on Machine Learning, 2018

  13. [14]

    Multi-agent reinforcement learning: Independent versus cooperative agents

    Ming Tan. Multi-agent reinforcement learning: Independent versus cooperative agents. In International Conference on Machine Learning, 1997. URL https://api.semanticscholar.org/CorpusID:268857333

  14. [15]

    PettingZoo: Gym for multi-agent reinforcement learning

    Justin K. Terry, Benjamin Black, Ananth Hari, Luis S. Santos, Clemens Dieffendahl, Niall L. Williams, Yashas Lokesh, Caroline Horsch, and Praveen Ravi. PettingZoo: Gym for multi-agent reinforcement learning. CoRR, abs/2009.14471, 2020. URL https://arxiv.org/abs/2009.14471

  15. [16]

    Grandmaster level in StarCraft II using multi-agent reinforcement learning

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

  16. [17]

    Mean field multi-agent reinforcement learning

    Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580. PMLR, 2018