Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator
Three Pith papers cite this work. Polarity classification is still being indexed.
Citing papers
-
Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning
A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.
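A minimal sketch of what a return-conditioned sigmoid budget allocator could look like. The function name, the gap-to-mean weighting, and the temperature `tau` are illustrative assumptions; the paper's actual rule additionally conditions on per-agent RSQ metrics, which are not reproduced here.

```python
import math

def exploration_budget(returns, total_budget=1.0, tau=0.5):
    """Hypothetical sketch: give more of the exploration budget to agents
    whose recent return lags the team average, via a sigmoid of the gap."""
    mean_r = sum(returns) / len(returns)
    # Sigmoid of the return gap: agents below the mean get weight near 1,
    # agents above it get weight near 0.
    weights = [1.0 / (1.0 + math.exp((r - mean_r) / tau)) for r in returns]
    # Normalize so the shares sum to the total budget.
    z = sum(weights)
    return [total_budget * w / z for w in weights]

# Agent with the lowest recent return (1.0) receives the largest share.
shares = exploration_budget([2.0, 1.0, 3.0])
```

The sigmoid makes the allocation smooth and bounded, so a single collapsed agent cannot absorb the entire budget the way a hard argmin rule would.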
-
On Training in Imagination
The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards and accurate but expensive ones in imagination-based policy optimization.
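The tradeoff can be illustrated numerically with a generic error bound of the form A/sqrt(n_d) + B/sqrt(n_r) under a linear sampling cost; both the bound shape and the cost model are stand-in assumptions, not the paper's actual bound.

```python
def best_ratio(A, B, c_d, c_r, budget, grid=10_000):
    """Hypothetical sketch: minimize err = A/sqrt(n_d) + B/sqrt(n_r)
    subject to c_d*n_d + c_r*n_r = budget, by scanning the fraction
    of the budget spent on dynamics samples."""
    best = None
    for i in range(1, grid):
        f = i / grid                      # fraction of budget on dynamics
        n_d = f * budget / c_d            # dynamics samples bought
        n_r = (1 - f) * budget / c_r      # reward samples bought
        err = A / n_d ** 0.5 + B / n_r ** 0.5
        if best is None or err < best[0]:
            best = (err, n_d / n_r)
    return best[1]
```

For this bound a Lagrangian argument gives the closed form n_d/n_r = (A*c_r / (B*c_d))**(2/3), which the grid search recovers: with equal costs and A = 8, B = 1 the optimal ratio is 8**(2/3) = 4, i.e. cheap-but-noisy terms are bought in bulk.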
-
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.
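A toy sketch of tempered SMC over scalar controller parameters, annealing toward a Boltzmann-tilted target p_beta(theta) proportional to exp(beta * J(theta)). The reweight/resample/jitter loop and all names are illustrative assumptions; the paper works with full controllers and differentiable dynamics rather than scalars.

```python
import math
import random

def tempered_smc(objective, init_particles, betas, jitter=0.1, rng=None):
    """Hypothetical sketch: sequential Monte Carlo with a rising inverse
    temperature, targeting p_beta(theta) ~ exp(beta * J(theta))."""
    rng = rng or random.Random(0)
    particles = list(init_particles)
    prev_beta = 0.0
    for beta in betas:
        # Incremental tilt: reweight by exp((beta - prev_beta) * J(theta)),
        # with a max-subtraction for numerical stability.
        logw = [(beta - prev_beta) * objective(p) for p in particles]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        probs = [x / total for x in w]
        # Multinomial resampling toward high-objective particles.
        particles = rng.choices(particles, weights=probs, k=len(particles))
        # Simple Gaussian move kernel to restore particle diversity.
        particles = [p + rng.gauss(0.0, jitter) for p in particles]
        prev_beta = beta
    return particles
```

With J(theta) = -(theta - 2)**2, the particle cloud concentrates near the maximizer theta = 2 as beta grows; the gradual schedule is what lets particles escape early, broad modes before the tilt sharpens.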