Recognition: unknown
A short variational proof of equivalence between policy gradients and soft Q learning
Abstract
Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven equivalent when a softmax relaxation is applied to one and an entropic regularization to the other. We relate this result to the well-known convex duality between Shannon entropy and the softmax function, also known as the Donsker-Varadhan formula, which yields a short proof of the equivalence. We then interpret this duality further and use ideas from convex analysis to prove a new policy inequality relative to soft Q-learning.
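A minimal sketch of the duality the abstract invokes (the temperature $\tau$ and notation here are assumed, not quoted from the paper): over the simplex $\Delta(\mathcal{A})$ of policies, the log-sum-exp (softmax) function and negative Shannon entropy are convex conjugates,

\[
\tau \log \sum_{a \in \mathcal{A}} \exp\!\big(Q(s,a)/\tau\big)
\;=\; \max_{\pi \in \Delta(\mathcal{A})} \Big( \mathbb{E}_{a \sim \pi}\big[Q(s,a)\big] + \tau\, H(\pi) \Big),
\qquad
H(\pi) = -\sum_{a} \pi(a) \log \pi(a),
\]

with the maximum attained by the Boltzmann policy $\pi^\star(a) \propto \exp\big(Q(s,a)/\tau\big)$. The Donsker-Varadhan formula is the same statement with the entropy term replaced by a KL divergence to a reference measure $\rho$: $\log \mathbb{E}_{a \sim \rho}\big[e^{f(a)}\big] = \sup_{\pi} \big( \mathbb{E}_{\pi}[f] - \mathrm{KL}(\pi \,\|\, \rho) \big)$, which reduces to the entropy form when $\rho$ is uniform.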
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
- Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
Autoregressive language models are equivalent to energy-based models through a bijection that corresponds to the soft Bellman equation, explaining their lookahead capabilities despite next-token training.
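A sketch of the correspondence that summary describes, in assumed notation (token prefix $s_t$ as state, next token $a$ as action; the cited paper's exact construction may differ): if per-token logits are read as a soft action-value $Q$, the autoregressive normalizer plays the role of a soft value function and the model's log-probabilities become soft advantages,

\[
V(s_t) = \log \sum_{a} \exp Q(s_t, a),
\qquad
\log p(a \mid s_t) = Q(s_t, a) - V(s_t),
\]

which is the softmax relation at the heart of the soft Bellman equation at temperature 1. Chaining it across positions is one way the lookahead interpretation can arise, since each next-token distribution then implicitly weighs the values of continuations.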
discussion (0)