pith. machine review for the scientific record.

arxiv: 1802.10031 · v3 · submitted 2018-02-27 · 💻 cs.LG · stat.ML

Recognition: unknown

The Mirage of Action-Dependent Baselines in Reinforcement Learning

Authors on Pith: no claims yet
classification: 💻 cs.LG · stat.ML
keywords: variance · gradient · baseline · baselines · estimator · learning · methods · policy
original abstract

Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.
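For context, the estimator the abstract discusses is the standard score-function (likelihood-ratio) policy gradient with a baseline. The identity below is textbook material rather than anything specific to this paper; it shows why a baseline that depends only on the state leaves the gradient estimate unbiased:

```latex
% Score-function policy gradient with a baseline b(s).
\[
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta(\cdot \mid s)}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\,
           \bigl( \hat{Q}(s, a) - b(s) \bigr) \right],
\]
% which is unbiased because, for any b that is constant in a,
\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \bigl[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \bigr]
  = b(s)\, \nabla_\theta \!\int \pi_\theta(a \mid s)\, \mathrm{d}a
  = b(s)\, \nabla_\theta 1 = 0.
\]
```

A baseline b(s, a) that also depends on the action does not integrate out this way, which is why the prior papers the abstract refers to must add an analytic correction term to keep the estimator unbiased.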

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.

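The blurb describes the mechanism only at a high level. As a loose illustration of the "lightweight probe" idea, and emphatically not POISE's actual algorithm or interface (the array names, shapes, and the ridge-regression readout below are all hypothetical), a value probe over an actor's internal states can be as simple as a linear map from hidden activations to observed returns, used as a baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one hidden state per rollout step, and the scalar
# return observed at that step. Shapes are illustrative only.
n_steps, d_hidden = 512, 64
H = rng.normal(size=(n_steps, d_hidden))   # actor's internal states
returns = H @ rng.normal(size=d_hidden) * 0.1 + rng.normal(size=n_steps)

# "Lightweight probe": ridge regression from hidden state to return,
# standing in for whatever parameterization the paper actually uses.
lam = 1e-2
w = np.linalg.solve(H.T @ H + lam * np.eye(d_hidden), H.T @ returns)

baseline = H @ w                  # probe's value estimate b(h_t)
advantage = returns - baseline    # centered signal for the policy gradient

# A toy stand-in for the gradient-variance comparison: the baseline
# removes the component of the return predictable from the hidden state.
print("variance without baseline:", returns.var())
print("variance with probe baseline:", advantage.var())
```

The appeal of such a probe, as the blurb notes, is that it reuses computation the actor already performs, so no separate critic network or extra rollouts are needed to form the baseline.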