pith. machine review for the scientific record.

arxiv: 2604.23308 · v1 · submitted 2026-04-25 · 💻 cs.LG · stat.ML

Recognition: unknown

CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

Anya Sims, Elliot Fosong, John Torr, Juan Claude Formanek, Kale-ab Abebe Tessera, Marcel Hedman, Riccardo Zamboni, Trevor McInroe

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:26 UTC · model grok-4.3

classification 💻 cs.LG · stat.ML
keywords multi-agent reinforcement learning · offline reinforcement learning · diffusion models · data augmentation · coordination · trajectory generation · continuous control

The pith

A diffusion model generates synthetic trajectories conditioned on the current joint policy to enable co-adaptation in offline multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline multi-agent reinforcement learning from fixed datasets often fails because agents cannot adjust their behaviors together as their policies evolve during training. CODA addresses this by training a diffusion model to produce new joint trajectories that are conditioned on the agents' latest policies. The resulting synthetic data changes alongside the learning process, supplying experience that supports ongoing coordination without requiring fresh real-world interactions. The method layers onto existing offline algorithms and shows improved coordination on both simple polynomial games and complex continuous-control benchmarks.
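To make that loop concrete, here is a minimal editorial sketch of the training dynamic described above, with hypothetical stand-ins for the policy-conditioned diffusion sampler and the underlying offline learner; it illustrates the idea, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins; illustrative only, not the paper's components.
# offline_data: a fixed batch of joint transitions (state, joint action, reward, next state).
offline_data = [(rng.normal(size=4), rng.uniform(-1, 1, size=2), 0.0, rng.normal(size=4))
                for _ in range(1024)]

def sample_policy_conditioned_trajectories(joint_policy, n):
    """Stand-in for the policy-conditioned diffusion sampler: here it simply
    relabels offline states with actions from the *current* joint policy,
    which is the behaviour the learned generator is meant to approximate."""
    synthetic = []
    for _ in range(n):
        s, _, _, s_next = offline_data[rng.integers(len(offline_data))]
        a = joint_policy(s)                    # actions reflect the current policy
        r = float(a[0] * a[1])                 # toy multiplication-game reward R = a_x * a_y
        synthetic.append((s, a, r, s_next))
    return synthetic

def offline_rl_update(joint_policy, batch):
    """Stand-in for the underlying offline learner (e.g. MADDPG + BC);
    this sketch leaves the policy unchanged."""
    return joint_policy

joint_policy = lambda s: np.tanh(s[:2])        # placeholder deterministic joint policy

for epoch in range(10):
    # 1. Regenerate synthetic data conditioned on the *current* joint policy,
    #    so the augmented dataset evolves with training; this is the key
    #    difference from static, one-shot augmentation.
    synthetic = sample_policy_conditioned_trajectories(joint_policy, n=256)

    # 2. Update on the union of the real offline data and the fresh synthetic data.
    joint_policy = offline_rl_update(joint_policy, offline_data + synthetic)
```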

Core claim

CODA resolves coordination failures in offline MARL by using a diffusion-based generator that samples multi-agent trajectories conditioned on the current joint policy, thereby creating evolving synthetic experience that reflects agents' changing behaviors and permits co-adaptation during training.

What carries the argument

CODA, a diffusion model that generates multi-agent trajectories conditioned on the evolving joint policy to augment the static offline dataset dynamically during training.

If this is right

  • Agents can continue to coordinate and improve joint performance using only the initial fixed dataset.
  • Static data-augmentation approaches that do not condition on the current joint policy remain insufficient for multi-agent settings.
  • CODA can be added as a module to both model-free and model-based offline reinforcement learning methods.
  • Coordination pathologies are avoided on continuous polynomial games while performance improves on MaMuJoCo tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same conditioning idea could reduce reliance on online data collection in other multi-agent domains where coordination matters.
  • Similar diffusion-based augmentation might address distribution shift in single-agent offline RL when policies change rapidly.
  • Integrating CODA with specific multi-agent value-decomposition techniques could further stabilize learning on larger state spaces.

Load-bearing premise

That trajectories sampled from a diffusion model conditioned on the current joint policy can accurately represent on-policy experience and support genuine co-adaptation without introducing distribution shifts that harm learning.
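One cheap way to probe this premise (an editorial suggestion, not something the paper reports) is to compare generated trajectories against a small set of true on-policy rollouts with a per-dimension Wasserstein distance; rising distances as the policy drifts outside the offline support would flag exactly the failure mode named above.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def per_dim_wasserstein(generated, reference):
    """Per-dimension 1-D Wasserstein distance between generated samples and
    reference on-policy samples (both shaped [N, d]). A crude fidelity proxy:
    it ignores cross-dimension and temporal structure."""
    generated, reference = np.asarray(generated), np.asarray(reference)
    return np.array([wasserstein_distance(generated[:, i], reference[:, i])
                     for i in range(generated.shape[1])])

# Toy usage: diffusion-sampled joint actions vs. a reference on-policy batch.
rng = np.random.default_rng(0)
sampled = rng.normal(0.0, 1.0, size=(500, 2))   # stand-in for diffusion samples
rollout = rng.normal(0.2, 1.0, size=(500, 2))   # stand-in for true on-policy actions
print(per_dim_wasserstein(sampled, rollout))    # larger values mean lower fidelity
```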

What would settle it

An experiment on a continuous polynomial game showing that agents using CODA still converge to the same suboptimal joint behaviors observed with standard offline methods.

Figures

Figures reproduced from arXiv: 2604.23308 by Anya Sims, Elliot Fosong, John Torr, Juan Claude Formanek, Kale-ab Abebe Tessera, Marcel Hedman, Riccardo Zamboni, Trevor McInroe.

Figure 1
Figure 1. Multiplication game (R = a_x · a_y). Reward takes values in [−1, 1]. Left: offline dataset used to train diffusion models and the baseline MADDPG policy. Middle: policy evolution during training. Right: returns during training (the final epoch equals the final test return for deterministic policies). Q-conditioned augmentation appears optimal only due to action-boundary effects (a ∈ [−1, 1]²): the y-agent … view at source ↗
Figure 2
Figure 2. Twin Peaks game. Left: offline dataset used to train diffusion models and the baseline MADDPG policy. Middle: policy evolution during training. Right: returns during training (the final epoch equals the final test return for deterministic policies). CODA's on-policy conditioning again mitigates offline miscoordination. Sample efficiency plot contains no error bars as in this environment deterministic pol… view at source ↗
Figure 4
Figure 4. 2HalfCheetah mean normalized performance across datasets. Normalized within each dataset. Standard error across 16 seeds; each seed averaged over 10 episodes. view at source ↗
Figure 5
Figure 5. 2HalfCheetah, Poor Dataset, BC 2.5, CODA guidance scale 0.6. Panels show Median, IQM, Mean, and Optimality Gap of normalized return for CODA, Unconditional Diffusion, and MADDPG + BC. view at source ↗
Figure 6
Figure 6. 2HalfCheetah, Replay Dataset, BC 2.5, CODA guidance scale 0.6. Panels show Median, IQM, Mean, and Optimality Gap of normalized return for CODA, Unconditional Diffusion, and MADDPG + BC. view at source ↗
Figure 7
Figure 7. 2HalfCheetah, Medium Dataset, BC 2.5, CODA guidance scale 0.6. Panels show Median, IQM, Mean, and Optimality Gap of normalized return for CODA, Unconditional Diffusion, and MADDPG + BC. view at source ↗
Figure 8
Figure 8. 2HalfCheetah, Good Dataset, BC 2.5, CODA guidance scale 0.6. view at source ↗
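The Median / IQM / Mean / Optimality Gap panels in Figures 5-8 are standard aggregate statistics over normalized returns. The paper's exact aggregation protocol is not stated in the material quoted here, so the sketch below only gives the conventional definitions as a reading aid.

```python
import numpy as np
from scipy.stats import trim_mean

def aggregate_metrics(scores, gamma=1.0):
    """Conventional aggregates over per-run normalized returns: IQM is the
    interquartile (25%-trimmed) mean, and the optimality gap measures how far
    runs fall short of the target score gamma (lower is better)."""
    scores = np.asarray(scores, dtype=float)
    return {
        "median": float(np.median(scores)),
        "iqm": float(trim_mean(scores, proportiontocut=0.25)),
        "mean": float(scores.mean()),
        "optimality_gap": float(np.mean(np.maximum(0.0, gamma - scores))),
    }

# Example: 16 seeds, each already averaged over its 10 evaluation episodes.
print(aggregate_metrics(np.random.default_rng(0).uniform(0.7, 1.0, size=16)))
```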
read the original abstract

Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CODA, a diffusion-based trajectory generator for offline multi-agent RL that conditions synthetic data generation on the current joint policy to enable co-adaptation during training. It argues that prior diffusion augmentation methods fail because they produce static datasets, whereas CODA's on-policy conditioning resolves coordination pathologies in continuous polynomial games and yields strong performance on MaMuJoCo benchmarks. The method is presented as algorithm-agnostic and suitable for layering onto existing offline MARL pipelines.

Significance. If the core assumption holds—that conditioned diffusion samples remain distributionally close to true on-policy rollouts under an evolving joint policy—CODA could offer a practical augmentation strategy for coordination in offline MARL without online interaction. This would address a known limitation in the field where static offline data prevents policy co-adaptation. However, the absence of quantitative results, baselines, ablations, or error analysis in the provided description makes it difficult to assess whether the gains stem from genuine on-policy fidelity or simply from increased data volume.

major comments (3)
  1. [Abstract / Method description] The central mechanism (conditioning a diffusion model trained on fixed D_off to produce trajectories under evolving π_t) rests on an unverified assumption of distributional fidelity. No bounds, total-variation or Wasserstein distances, or empirical diagnostics are supplied to show that generated samples remain close to true on-policy rollouts as π_t drifts outside the offline support; this directly undermines the claim that CODA enables genuine co-adaptation rather than re-introducing coordination artifacts.
  2. [Abstract / Empirical evaluation] The empirical claims (resolution of coordination pathologies in polynomial games and strong results on MaMuJoCo) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error bars. This prevents assessment of whether reported improvements are statistically meaningful or attributable to the on-policy conditioning versus generic data augmentation.
  3. [Method] No derivation or pseudocode is given for how the diffusion model is conditioned on the joint policy π_t at each training step, nor for how the augmented trajectories are integrated into the underlying offline RL objective. Without these details it is impossible to verify that the procedure is free of self-referential or circular definitions.
minor comments (2)
  1. [Abstract] The abstract asserts that CODA is 'algorithm-agnostic' but provides no concrete examples of integration with specific model-free or model-based offline MARL algorithms.
  2. [Method] Notation for the diffusion model p_θ(τ | π_t) and the offline dataset D_off is introduced without a dedicated notation table or explicit definitions of all symbols.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We appreciate the feedback and will revise the manuscript to provide additional details on distributional fidelity, quantitative empirical results, and methodological clarifications.

read point-by-point responses
  1. Referee: [Abstract / Method description] The central mechanism (conditioning a diffusion model trained on fixed D_off to produce trajectories under evolving π_t) rests on an unverified assumption of distributional fidelity. No bounds, total-variation or Wasserstein distances, or empirical diagnostics are supplied to show that generated samples remain close to true on-policy rollouts as π_t drifts outside the offline support; this directly undermines the claim that CODA enables genuine co-adaptation rather than re-introducing coordination artifacts.

    Authors: We agree that the manuscript does not include theoretical bounds or explicit distributional metrics such as total-variation or Wasserstein distances. The effectiveness of the on-policy conditioning is demonstrated through empirical results where CODA resolves coordination issues that static methods cannot. In the revision, we will add empirical diagnostics, including Wasserstein distance measurements between the generated samples and true on-policy trajectories at various stages of policy evolution. revision: yes

  2. Referee: [Abstract / Empirical evaluation] The empirical claims (resolution of coordination pathologies in polynomial games and strong results on MaMuJoCo) are stated without any quantitative metrics, baseline comparisons, ablation studies, or error bars. This prevents assessment of whether reported improvements are statistically meaningful or attributable to the on-policy conditioning versus generic data augmentation.

    Authors: We acknowledge that the current presentation lacks detailed quantitative metrics, baselines, ablations, and error bars. The revised manuscript will include comprehensive experimental results with performance tables, comparisons to baselines, ablation studies to isolate the effect of on-policy conditioning, and error bars from multiple random seeds to substantiate the claims and allow for statistical assessment. revision: yes

  3. Referee: [Method] No derivation or pseudocode is given for how the diffusion model is conditioned on the joint policy π_t at each training step, nor for how the augmented trajectories are integrated into the underlying offline RL objective. Without these details it is impossible to verify that the procedure is free of self-referential or circular definitions.

    Authors: The revised version will include a detailed derivation of the conditioning process and pseudocode outlining the steps: at each training iteration, the current joint policy π_t is used to condition the diffusion model for trajectory generation, and the resulting synthetic data augments the offline dataset for the subsequent RL update. This iterative process is not circular, as the policy for conditioning is fixed during generation and updated afterward. We will make these details explicit to facilitate verification. revision: yes
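Since the abstract alone does not pin down the conditioning mechanism, the following is only one plausible reading of the step described in this response: a guidance-style reverse-diffusion sampler nudged toward the frozen current joint policy. Every name in it is a hypothetical stand-in, and the paper's own derivation and pseudocode, once added, should be taken as authoritative.

```python
import numpy as np

def guided_trajectory_sample(denoise, policy_score, sigmas, dim,
                             guidance_scale=0.6, rng=None):
    """Hypothetical policy-guided reverse-diffusion sampler (an illustrative
    reading, not the paper's procedure). denoise(x, sigma) is a denoiser
    trained on the offline data; policy_score(x) approximates the gradient of
    the log-probability of the trajectory's actions under the *frozen* current
    joint policy. Guidance nudges each step toward policy-consistent
    trajectories while the denoiser keeps samples near the offline support."""
    rng = rng if rng is not None else np.random.default_rng()
    x = rng.normal(size=dim) * sigmas[0]                 # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = denoise(x, sigma)                            # data-faithful denoising step
        x = x + guidance_scale * policy_score(x)         # pull toward the current policy
        if sigma_next > 0:
            x = x + rng.normal(size=dim) * sigma_next    # re-noise to the next level
    return x                                             # one synthetic joint trajectory

# Toy usage with stand-in components.
traj = guided_trajectory_sample(
    denoise=lambda x, s: 0.9 * x,        # stand-in denoiser
    policy_score=lambda x: -x,           # stand-in policy log-prob gradient
    sigmas=np.linspace(1.0, 0.0, 20),
    dim=8)
```

In this reading, the outer training loop regenerates such samples each time the joint policy is updated, which is what keeps the augmentation on-policy rather than static.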

Circularity Check

0 steps flagged

No circularity: method description contains no derivations or equations that reduce to inputs by construction.

full rationale

The paper presents CODA as an algorithmic augmentation module that conditions a diffusion model on the evolving joint policy to generate synthetic trajectories for co-adaptation in offline MARL. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or description. The central claim rests on an empirical assumption about distributional fidelity rather than any mathematical reduction to the offline dataset or prior outputs. Previous diffusion approaches are critiqued at a conceptual level without load-bearing self-citations or uniqueness theorems. The work is therefore self-contained as a proposed technique whose validity is left to empirical validation on polynomial games and MaMuJoCo, with no circular steps identified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the central claim rests on the domain assumption that policy-conditioned diffusion can generate trajectories faithful enough to support co-adaptation.

axioms (1)
  • domain assumption: Diffusion models conditioned on the current joint policy can produce synthetic multi-agent trajectories that reflect co-adapting behaviors without harmful distribution shift.
    This assumption underpins the claim that CODA enables coordination where static augmentation fails.

pith-pipeline@v0.9.0 · 5521 in / 1216 out tokens · 27211 ms · 2026-05-08T08:26:26.531488+00:00 · methodology

discussion (0)

