Under review
11 Pith papers cite this work.
abstract
We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
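A minimal sketch of the decomposition described above, assuming PyTorch and the additive form (the team value as a sum of agent-wise values computed from local observations). Because the decomposition is a sum, TD learning on the single joint reward trains every agent network end to end. Module and variable names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent utility network: maps a local observation to Q-values."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, obs):
        return self.net(obs)

# Value decomposition: the team value is the sum of agent-wise values,
# so TD learning on the joint reward implicitly credits each agent and
# avoids one agent free-riding on the team signal ("lazy agent").
agents = [AgentQNet(obs_dim=16, n_actions=5) for _ in range(2)]

def q_tot(observations, actions):
    # observations: list of (batch, obs_dim); actions: list of (batch,) long
    per_agent = [net(o).gather(1, a.unsqueeze(1)).squeeze(1)
                 for net, o, a in zip(agents, observations, actions)]
    return torch.stack(per_agent, dim=0).sum(dim=0)  # (batch,)

# One TD step would regress q_tot(obs, act) toward r + gamma * max q_tot',
# using the single team reward r; no per-agent reward is ever needed.
```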
citing papers explorer
- Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
  CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks. (sketch after the list)
- Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning
  ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and GFootball. (sketch after the list)
- Randomness is sometimes necessary for coordination
  Structured per-agent randomness via ranked masking in attention allows symmetric agents to break ties and coordinate, achieving perfect success on symmetric tasks where deterministic policies fail and enabling zero-shot transfer across team sizes. (sketch after the list)
- Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning
  A new controlled testbed and coordination diagnostics show that multi-agent RL methods achieving similar returns can differ substantially in redundant assignments, diversity, and efficiency. (sketch after the list)
- NonZero: Interaction-Guided Exploration for Multi-Agent Monte Carlo Tree Search
  NonZero introduces an interaction score and a bandit-based proposal rule for local agent deviations in multi-agent MCTS, delivering a sublinear local-regret guarantee and improved sample efficiency on game benchmarks without full joint-action enumeration. (sketch after the list)
- Quantum Advantage in Multi Agent Reinforcement Learning
  Entangled QMARL agents approach the Tsirelson bound of 0.854 in CHSH while unentangled versions match classical baselines, and hybrid quantum-classical setups outperform both in CoopNav. (sketch after the list)
- SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning
  SACHI uses graph transformer convolutions on inter-agent coordination graphs to enrich partially observing agents with content-dependent teammate information, yielding statistically significant gains over baselines in five cooperative tasks. (sketch after the list)
- Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
  Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
- Do LLM-derived graph priors improve multi-agent coordination?
  LLM-generated coordination graph priors improve multi-agent reinforcement learning performance on MPE benchmarks, with models as small as 1.5B parameters proving effective. (sketch after the list)
- Reflective Context Learning: Studying the Optimization Primitives of Context Space
  Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBench. (sketch after the list)
- LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing
  LeDRL integrates a lightweight LLM using structured prompts and a reflective evaluator with self-attention DRL to achieve over 17% higher task success rates and better convergence in edge computing task offloading.
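The one-line summaries above are terse, so the sketches below unpack several of them in code. Each is a hedged reading of the corresponding summary, with invented names and shapes, not the cited paper's implementation.

For CPPO: one way contrastive Q-values could supply PPO advantages is to treat a contrastive critic's per-action scores as Q estimates, subtract the policy-weighted average as a baseline, and feed the result to the usual clipped surrogate. The critic itself is assumed here, not specified by the summary.

```python
import torch

def contrastive_advantage(critic_scores, policy_probs, actions):
    """critic_scores: (batch, n_actions) contrastive Q estimates, e.g.
    logits of a classifier trained to tell real successors from negatives.
    policy_probs: (batch, n_actions). actions: (batch,) long."""
    # Baseline: expected score under the current policy (a value estimate).
    v = (policy_probs * critic_scores).sum(dim=1)
    q = critic_scores.gather(1, actions.unsqueeze(1)).squeeze(1)
    return q - v  # plug into the standard PPO clipped surrogate

def ppo_loss(new_logp, old_logp, adv, clip=0.2):
    ratio = (new_logp - old_logp).exp()
    return -torch.min(ratio * adv,
                      ratio.clamp(1 - clip, 1 + clip) * adv).mean()
```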
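For ATD(λ): the summary suggests a classifier-based density ratio between the current and past replay distributions, mapped to a per-sample λ. A standard trick is that a logistic classifier's odds estimate p_current/p_past; squashing its logit into [0,1] gives a λ per state-action pair. The feature dimension and the logit-to-λ mapping below are assumptions.

```python
import torch
import torch.nn as nn

# Assumed: 20-dim concatenated state-action features.
ratio_net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

def train_ratio_step(opt, current_sa, past_sa):
    """Logistic regression: label 1 = drawn from the current buffer,
    0 = from the past buffer. At the optimum, the classifier's odds
    equal the density ratio p_current(s,a) / p_past(s,a)."""
    logits = ratio_net(torch.cat([current_sa, past_sa]))
    labels = torch.cat([torch.ones(len(current_sa), 1),
                        torch.zeros(len(past_sa), 1)])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

def lam(sa):
    # High ratio -> on-policy-like sample -> trust longer traces (λ -> 1);
    # low ratio -> stale sample -> fall back to bootstrapping (λ -> 0).
    return torch.sigmoid(ratio_net(sa)).squeeze(1)
```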
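For the randomness paper: one guess at what "structured per-agent randomness via ranked masking" means is that each agent draws a random key, the sort order of keys induces a rank, and ranks gate which options each agent may attend to, so otherwise-identical agents claim distinct targets. The resolution rule below is illustrative.

```python
import torch

def ranked_mask(n_agents, n_slots, scores, generator=None):
    """scores: (n_agents, n_slots), identical across symmetric agents.
    Random ranks break the tie: the agent with rank r is masked away
    from slots already claimed by ranks < r. Assumes n_agents <= n_slots."""
    keys = torch.rand(n_agents, generator=generator)
    ranks = keys.argsort().argsort()          # rank of each agent
    choice = torch.full((n_agents,), -1, dtype=torch.long)
    taken = torch.zeros(n_slots, dtype=torch.bool)
    for r in range(n_agents):                 # resolve in rank order
        i = int((ranks == r).nonzero())
        masked = scores[i].masked_fill(taken, float("-inf"))
        choice[i] = masked.argmax()
        taken[choice[i]] = True
    return choice  # distinct slots even under identical scores
```

A deterministic symmetric policy cannot produce this: identical agents with identical observations take identical actions, so they collide on the same slot forever.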
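For the evaluation paper: two of the named diagnostics are easy to make concrete. Assuming each episode yields an agent-to-target assignment, redundancy can count duplicated picks and diversity can be the normalized entropy of target usage; these definitions are illustrative, not the paper's.

```python
from collections import Counter
import math

def redundancy(assignment):
    """Fraction of agents whose target was already claimed by another."""
    counts = Counter(assignment)
    return sum(c - 1 for c in counts.values()) / len(assignment)

def diversity(assignment, n_targets):
    """Normalized entropy of target usage: 1 = uniform, 0 = all-on-one."""
    counts = Counter(assignment)
    probs = [c / len(assignment) for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(n_targets) if n_targets > 1 else 0.0

# Two teams can earn the same return while coordinating very differently:
print(redundancy([0, 0, 1, 2]), diversity([0, 0, 1, 2], 4))  # 0.25 0.75
```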
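For NonZero: the summary says single-agent local deviations are proposed by a bandit rule instead of enumerating the joint action space. A UCB-style sketch over per-agent deviation arms, with the "interaction score" left as a pluggable statistic, since the summary does not define it.

```python
import math

class DeviationBandit:
    """One arm per (agent, action) local deviation from the current joint
    action: n_agents * n_actions arms instead of n_actions ** n_agents
    joint actions. UCB1 picks which deviation to expand next."""
    def __init__(self, n_agents, n_actions):
        self.arms = [(i, a) for i in range(n_agents) for a in range(n_actions)]
        self.n = {arm: 0 for arm in self.arms}
        self.mean = {arm: 0.0 for arm in self.arms}
        self.t = 0

    def propose(self, interaction_score):
        self.t += 1
        def ucb(arm):
            if self.n[arm] == 0:
                return float("inf")        # try every deviation once
            bonus = math.sqrt(2 * math.log(self.t) / self.n[arm])
            return self.mean[arm] + interaction_score(arm) + bonus
        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.n[arm] += 1
        self.mean[arm] += (reward - self.mean[arm]) / self.n[arm]
```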
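For the quantum entry: the 0.854 figure is the known quantum optimum for the CHSH game, cos²(π/8) ≈ 0.8536, versus 0.75 for any classical strategy. The check below reproduces it from the standard optimal measurement angles; it verifies the number, not the paper's training setup.

```python
import numpy as np

# Optimal CHSH angles (radians) for a maximally entangled pair:
# Alice measures at 0 or pi/4; Bob at pi/8 or -pi/8.
alice = [0.0, np.pi / 4]
bob = [np.pi / 8, -np.pi / 8]

def win_prob():
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            # For the maximally entangled state, P(a == b) = cos^2(angle gap).
            p_same = np.cos(alice[x] - bob[y]) ** 2
            # Win condition: a XOR b == x AND y, i.e. agree unless x = y = 1.
            total += p_same if x * y == 0 else 1 - p_same
    return total / 4  # inputs drawn uniformly

print(win_prob())  # ~0.8536, the Tsirelson bound for the CHSH game
```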
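For SACHI: "graph transformer convolutions on inter-agent coordination graphs" most plausibly means attention-weighted message passing restricted to linked agents, so the message content depends on what each teammate currently observes. A generic single-head layer of that shape, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CoordAttentionLayer(nn.Module):
    """One attention head over a coordination graph: each agent attends
    only to teammates it is linked to in adj."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: (n_agents, dim) embeddings of local observations
        # adj: (n_agents, n_agents) 0/1 graph, self-loops included so
        # every row has at least one unmasked entry
        scores = self.q(h) @ self.k(h).T / h.shape[1] ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ self.v(h)
```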
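For the LLM-priors paper: the natural pipeline is to ask a language model which agent pairs should coordinate, parse the answer into an adjacency matrix, and use it as the graph in a layer like the one above. The prompt, the JSON reply format, and the ask_llm callable are all hypothetical.

```python
import json
import numpy as np

def graph_prior_from_llm(task_description, n_agents, ask_llm):
    """ask_llm: any callable that sends a prompt to a language model and
    returns its text reply (per the paper, even a 1.5B model can work)."""
    prompt = (
        f"Task: {task_description}\n"
        f"There are {n_agents} agents. Reply with a JSON list of "
        f"[i, j] pairs of agents that should coordinate directly."
    )
    pairs = json.loads(ask_llm(prompt))
    adj = np.eye(n_agents, dtype=np.int64)    # keep self-loops
    for i, j in pairs:
        adj[i, j] = adj[j, i] = 1             # undirected prior
    return adj
```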
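For Reflective Context Learning: of the named primitives, failure replay is the most self-contained. A sketch of one optimization step that batches rollouts, replays stored failures, and rewrites the context from the mistakes; every function signature here is an assumed shape, not the paper's API.

```python
def context_step(context, tasks, run_agent, reflect, failure_buffer, k=4):
    """One context-optimization step: grouped rollouts + failure replay.
    run_agent(context, task) -> (success: bool, trace: str)
    reflect(context, traces) -> revised context string."""
    traces, failures = [], []
    for task in tasks:                  # batching: a group of rollouts
        ok, trace = run_agent(context, task)
        traces.append(trace)
        if not ok:
            failures.append(trace)
    failure_buffer.extend(failures)
    # Failure replay: revise the context against old *and* new mistakes,
    # so fixes for past failures are not forgotten on the next rewrite.
    replay = failure_buffer[-k:]
    return reflect(context, traces + replay)
```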