Recognition: unknown
A short variational proof of equivalence between policy gradients and soft Q learning
Abstract
Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven equivalent when a softmax relaxation is applied to one and an entropic regularization to the other. We relate this result to the well-known convex duality between Shannon entropy and the softmax function, also known as the Donsker-Varadhan formula, which yields a short proof of the equivalence. We then interpret this duality further and use ideas from convex analysis to prove a new policy inequality relative to soft Q-learning.
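A minimal sketch of the duality the abstract invokes (the temperature $\tau$ and notation here are assumed, not quoted from the paper): over the simplex $\Delta(\mathcal{A})$ of policies, the log-sum-exp (softmax) function and negative Shannon entropy are convex conjugates,

\[
\tau \log \sum_{a \in \mathcal{A}} \exp\!\big(Q(s,a)/\tau\big)
\;=\; \max_{\pi \in \Delta(\mathcal{A})} \Big( \mathbb{E}_{a \sim \pi}\big[Q(s,a)\big] + \tau\, H(\pi) \Big),
\qquad
H(\pi) = -\sum_{a} \pi(a) \log \pi(a),
\]

with the maximum attained by the Boltzmann policy $\pi^\star(a) \propto \exp\big(Q(s,a)/\tau\big)$. The Donsker-Varadhan formula is the same statement with the entropy term replaced by a KL divergence to a reference measure $\rho$: $\log \mathbb{E}_{a \sim \rho}\big[e^{f(a)}\big] = \sup_{\pi} \big( \mathbb{E}_{\pi}[f] - \mathrm{KL}(\pi \,\|\, \rho) \big)$, which reduces to the entropy form when $\rho$ is uniform.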
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
- Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
Autoregressive language models are equivalent to energy-based models through a bijection that corresponds to the soft Bellman equation, explaining their lookahead capabilities despite next-token training.
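A sketch of the correspondence that summary describes, in assumed notation (token prefix $s_t$ as state, next token $a$ as action; the cited paper's exact construction may differ): if per-token logits are read as a soft action-value $Q$, the autoregressive normalizer plays the role of a soft value function and the model's log-probabilities become soft advantages,

\[
V(s_t) = \log \sum_{a} \exp Q(s_t, a),
\qquad
\log p(a \mid s_t) = Q(s_t, a) - V(s_t),
\]

which is the softmax relation at the heart of the soft Bellman equation at temperature 1. Chaining it across positions is one way the lookahead interpretation can arise, since each next-token distribution then implicitly weighs the values of continuations.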
discussion (0)