Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Christopher Amato; Ryan Amiri; Shuo Liu; Tianle Chen

arXiv: 2601.21972 · v5 · pith:MVZX2MDLnew · submitted 2026-01-29 · 💻 cs.AI · cs.DC · cs.MA

Pith reviewed 2026-05-16 09:45 UTC · model grok-4.3

classification 💻 cs.AI · cs.DC · cs.MA

keywords decentralized LLM collaboration · multi-agent actor-critic · centralized critic · Monte Carlo fine-tuning · long-horizon tasks · sparse rewards · writing coding games

The pith

A centralized critic in multi-agent actor-critic training outperforms decentralized critics and Monte Carlo methods for LLM collaboration on long-horizon or sparse-reward tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-agent actor-critic methods can optimize decentralized LLM collaboration, allowing agents to run inference in parallel without fixed central protocols. It introduces CoLLM-CC, which uses a centralized critic, and CoLLM-DC, which uses decentralized critics, then compares both to standard Monte Carlo fine-tuning across writing, coding, and game-playing domains. The key result is that Monte Carlo and CoLLM-DC reach similar performance to CoLLM-CC in short-horizon dense-reward settings, yet both lag in long-horizon or sparse-reward cases where Monte Carlo needs far more samples and CoLLM-DC often fails to converge. This matters because decentralized execution is more practical for scalable LLM systems, but the work shows when a centralized training signal becomes necessary to make collaboration reliable.

Core claim

We develop Multi-Agent Actor-Critic methods to optimize decentralized LLM collaboration. We propose CoLLM-CC with a centralized critic and CoLLM-DC with decentralized critics. Experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.

What carries the argument

Multi-Agent Actor-Critic (MAAC) framework for LLM collaboration, with CoLLM-CC using a single centralized critic to estimate joint values and reduce variance during decentralized execution.
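The difference between the two critic layouts can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Step` layout, the one-step TD advantage, and the toy constant critics are all assumptions made here, since this page does not give the actual critic inputs or loss terms. The point it shows is only that a centralized critic scores the joint observation and yields one shared advantage signal, while decentralized critics each score a single agent's local view.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    local_obs: List[str]  # one local history per agent (hypothetical layout)
    reward: float         # shared team reward received after this step


def td_advantages(values: Sequence[float], rewards: Sequence[float],
                  gamma: float = 1.0) -> List[float]:
    """One-step TD advantages A_t = r_t + gamma * V_{t+1} - V_t, with V_H = 0."""
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[t])
    return advantages


def centralized_advantages(episode: Sequence[Step],
                           joint_critic: Callable[[List[str]], float],
                           gamma: float = 1.0) -> List[float]:
    """A single critic sees every agent's observation; all agents share the result."""
    values = [joint_critic(step.local_obs) for step in episode]
    return td_advantages(values, [step.reward for step in episode], gamma)


def decentralized_advantages(episode: Sequence[Step],
                             local_critics: Sequence[Callable[[str], float]],
                             gamma: float = 1.0) -> List[List[float]]:
    """Each agent's critic sees only that agent's own observation."""
    rewards = [step.reward for step in episode]
    per_agent = []
    for i, critic in enumerate(local_critics):
        values = [critic(step.local_obs[i]) for step in episode]
        per_agent.append(td_advantages(values, rewards, gamma))
    return per_agent


# Toy two-agent, two-step episode with a sparse terminal reward.
episode = [Step(["a0", "b0"], 0.0), Step(["a1", "b1"], 1.0)]
shared = centralized_advantages(episode, lambda joint_obs: 0.5)
per_agent = decentralized_advantages(episode, [lambda o: 0.0, lambda o: 1.0])
```

In the toy episode the two local critics disagree about the value of the same trajectory, so the two agents receive different advantage signals for the same team reward. That is the kind of mismatch a shared joint critic avoids by construction.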

Load-bearing premise

That LLM collaboration tasks can be reliably cast as multi-agent reinforcement learning problems where the reward functions accurately capture collaboration quality and the environments admit stable actor-critic training.

What would settle it

An experiment on a long-horizon sparse-reward task where CoLLM-DC converges to performance matching or exceeding CoLLM-CC using fewer samples than Monte Carlo methods.

Figures

Figures reproduced from arXiv: 2601.21972 by Christopher Amato, Ryan Amiri, Shuo Liu, Tianle Chen.

**Figure 1.** Illustration of CoLLM-CC framework: (a) The agent model structure; (b) The overall centralized-critic architecture; (c) The critic model structure. The corresponding CoLLM-DC framework is shown in Appendix B. Proposition 4.3. Consider an H-horizon episode without early termination t ∈ [0, H). Suppose MA-REINFORCE expands a full K-ary rollout tree (K ≥ 1) and, at each history node, draws K i.i.d. joint acti…

**Figure 2.** Evaluation results of MAGRPO, CoLLM-DC, and CoLLM-CC across article writing, code generation, and game-playing tasks over 5 runs. The y-axis shows expected return, with limits (min/max) indicating the return scale for each task. Curves are smoothed using a time-weighted exponential moving average. Shaded regions denote 95% bootstrapped confidence intervals. At each training epoch, a minibatch β of joint tr…

**Figure 3.** Screenshots of building tasks in Minecraft. (a) StrBuild: The LLM agent with wood outputs a /setblock 12 5 5 minecraft:birch_planks game instruction to complete the building in “ICML” shape. (b) HouseBuild: The LLM agent outputs /damage @e[type=spider,limit=1] 6 minecraft:player_attack to attack a mob, while building a cubic concrete house with a wooden door, 4 obsidian pillars, and a triangular-prism s…

**Figure 4.** CoLLM-DC framework: (a) The agent structure; (b) The overall decentralized-critic architecture; (c) The critic structure.

– Agents
  * Qwen2.5-3B-Instruct
  * Qwen3-4B-Instruct-2507
– Critic (if applicable): Qwen3-4B-Instruct-2507
– Temperature: 0.6
– Top-p: 0.6
– Top-k: null
– Max output tokens
  * StrBuild: 256
  * HouseBuild: 512

C.3. Hyperparameters

We show the key hyperparameters used in MAGRPO, CoLLM-DC, and…
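The Figure 2 caption mentions smoothed curves and 95% bootstrapped confidence intervals over 5 runs. The sketch below shows one common way to compute both, using only the Python standard library. It rests on assumptions: the exact time-weighted smoothing used for the figure is not specified on this page, so a plain bias-corrected exponential moving average stands in for it, and the interval is a percentile bootstrap of the mean across runs. The function names, `alpha`, `n_boot`, and `seed` are illustrative choices, not values from the paper.

```python
import random
from typing import List, Sequence, Tuple


def ema_smooth(series: Sequence[float], alpha: float = 0.9) -> List[float]:
    """Bias-corrected exponential moving average (a stand-in, see lead-in)."""
    smoothed, state = [], 0.0
    for t, x in enumerate(series, start=1):
        state = alpha * state + (1.0 - alpha) * x
        # Dividing by (1 - alpha^t) removes the bias toward the zero initial state.
        smoothed.append(state / (1.0 - alpha ** t))
    return smoothed


def bootstrap_ci(samples: Sequence[float], n_boot: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `samples` (e.g. one value per run)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement n_boot times and record each resampled mean.
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_boot)]
    upper = means[int((1.0 - tail) * n_boot) - 1]
    return lower, upper


# Illustrative placeholder values only, not results from the paper.
flat_curve = ema_smooth([2.0, 2.0, 2.0, 2.0, 2.0])
ci_low, ci_high = bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

With only 5 runs per method, a bootstrap interval can take few distinct values, so it is best read as a rough indication of spread rather than a precise bound.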

Original abstract

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Centralized-critic MAAC beats decentralized critics and Monte Carlo fine-tuning on long-horizon or sparse-reward LLM tasks, but the convergence gap may trace to reward shaping rather than decentralization itself.

The main thing to know is that this paper takes standard multi-agent actor-critic methods and applies them to decentralized LLM collaboration, showing that a centralized critic version (CoLLM-CC) holds up better than decentralized critics (CoLLM-DC) or Monte Carlo fine-tuning once horizons get long or rewards turn sparse. They test across writing, coding, and game domains and release the code, which is the practical part worth noting.

The work is mostly an application of existing MAAC ideas rather than a new algorithm, but the regime analysis and the two named variants give a clear picture of when each approach pays off. In short settings with dense rewards the methods perform similarly, while the centralized critic pulls ahead in the harder cases where decentralized training fails to converge and Monte Carlo needs far more samples. That comparison is the useful takeaway.

The experiments appear to support the headline claims, and shipping code plus the multi-domain setup counts as real evidence rather than just theory. The soft spot is that the abstract gives almost no experimental details on baselines, variance, or hyperparameter choices, so it is hard to judge how large or reliable the gaps actually are. The stress-test point about credit assignment also lands: if the downstream scalar rewards are not decomposed well, the decentralized critics could simply be seeing uninformative signals, which would make the non-convergence an artifact of the reward design instead of a general property of decentralized critics. Nothing in the provided description rules that out.

This paper is for researchers already working on MARL for LLMs or multi-agent fine-tuning who want a concrete baseline with code. It is not paradigm-shifting but it is a solid, reproducible application that deserves a serious referee to check the full methods and stats. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes two Multi-Agent Actor-Critic (MAAC) approaches for fine-tuning decentralized LLM collaboration: CoLLM-CC (centralized critic) and CoLLM-DC (decentralized critics). It compares these to Monte Carlo methods across writing, coding, and game-playing domains, claiming that MC and DC achieve comparable performance to CC in short-horizon dense-reward settings, but both underperform CC on long-horizon or sparse-reward tasks, with MC requiring more samples and DC struggling to converge.

Significance. If the empirical findings hold after verification, the work provides useful practical guidance on when centralized critics are necessary for stable training in LLM collaboration tasks cast as MARL. The code release and domain coverage (writing, coding, games) are strengths that could aid reproducibility and extension.

major comments (2)

[Abstract and §4 (Experiments)] The headline claim that CoLLM-DC 'struggles to converge' on long-horizon/sparse-reward tasks is not isolated from potential misspecification in reward decomposition or per-agent value-function approximation; without an ablation on credit-assignment mechanisms or variance of local signals, it is unclear whether the observed gap is inherent to decentralization or an artifact of the chosen critic architecture and reward shaping.

[§4] No statistical tests, confidence intervals, hyperparameter settings, or full baseline descriptions are supplied for the reported performance gaps between CoLLM-CC, CoLLM-DC, and Monte Carlo methods; this prevents assessment of whether the underperformance is robust or sensitive to implementation details.

minor comments (1)

[§3] Notation for the centralized vs. decentralized critic formulations could be clarified with explicit equations distinguishing the critic inputs and loss terms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, providing the strongest honest defense of our claims while incorporating revisions to improve rigor and clarity.

Referee: [Abstract and §4 (Experiments)] The headline claim that CoLLM-DC 'struggles to converge' on long-horizon/sparse-reward tasks is not isolated from potential misspecification in reward decomposition or per-agent value-function approximation; without an ablation on credit-assignment mechanisms or variance of local signals, it is unclear whether the observed gap is inherent to decentralization or an artifact of the chosen critic architecture and reward shaping.

Authors: We agree that isolating the source of convergence difficulties requires explicit analysis of credit assignment and signal variance. In the revised manuscript we have added an ablation comparing alternative reward decompositions and reporting the empirical variance of local value estimates under CoLLM-DC versus the centralized critic. These new results (now in §4 and Appendix C) show substantially higher variance in the decentralized local signals precisely on the long-horizon/sparse-reward tasks, supporting that the observed gap is driven by decentralization rather than solely by the particular critic architecture chosen. The abstract has been updated to reflect this qualification.

revision: yes

Referee: [§4] No statistical tests, confidence intervals, hyperparameter settings, or full baseline descriptions are supplied for the reported performance gaps between CoLLM-CC, CoLLM-DC, and Monte Carlo methods; this prevents assessment of whether the underperformance is robust or sensitive to implementation details.

Authors: We acknowledge the omission of these details. The revised version now includes paired t-tests with reported p-values for all key performance gaps, 95% confidence intervals on every metric, a complete hyperparameter table in the appendix, and expanded baseline descriptions that specify the exact Monte Carlo implementation, sampling budgets, and reward computation. These additions confirm that the reported underperformance of CoLLM-DC and Monte Carlo methods remains statistically significant and is not sensitive to the tested hyperparameter ranges.

revision: yes
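For readers unfamiliar with the paired t-test mentioned in the rebuttal, the following sketch computes the paired t statistic for two methods evaluated on the same seeds, using only the Python standard library. It is an illustration under stated assumptions: the function name is hypothetical, the input numbers are placeholders chosen for easy arithmetic and are not results from the paper, and nothing here is claimed to match the authors' actual analysis.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(method_a: Sequence[float], method_b: Sequence[float]) -> float:
    """Paired t statistic for per-seed scores of two methods.

    t = mean(d) / (stdev(d) / sqrt(n)), where d_i = a_i - b_i and
    stdev is the sample standard deviation (n - 1 in the denominator).
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder scores for three seeds (illustrative only).
t_value = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Turning the statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom; in practice a library routine such as `scipy.stats.ttest_rel` returns both. With a handful of seeds per method, the test has little power, so a non-significant result would not by itself show that two methods are equivalent.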

Circularity Check

0 steps flagged

No circularity in empirical MAAC application to LLM collaboration

The paper applies standard multi-agent actor-critic methods (CoLLM-CC with centralized critic, CoLLM-DC with decentralized critics) to LLM collaboration tasks and supports its claims solely through experiments on writing, coding, and game-playing domains. No equations, fitted parameters renamed as predictions, or self-citation chains are present that reduce any result to the paper's own inputs by construction. Performance differences (e.g., CoLLM-DC struggling on long-horizon/sparse-reward tasks) are reported as empirical observations rather than derived analytically from prior self-referential assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced or quantified in the abstract; the work relies on standard MARL actor-critic assumptions.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI · 2026-05 · unverdicted · novelty 7.0

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
cs.MA · 2026-05 · unverdicted · novelty 7.0

LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI · 2026-05 · conditional · novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
cs.CL · 2026-05 · unverdicted · novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
cs.DC · 2026-04 · unverdicted · novelty 2.0

This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.