pith. machine review for the scientific record

arxiv: 1712.08650 · v1 · submitted 2017-12-22 · 💻 cs.LG

Recognition: unknown

A short variational proof of equivalence between policy gradients and soft Q learning

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords policy, convex, duality, equivalence, gradients, learning, proof, q-learning
Original abstract

Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven to be equivalent when using a softmax relaxation on one part, and an entropic regularization on the other. We relate this result to the well-known convex duality of Shannon entropy and the softmax function. Such a result is also known as the Donsker-Varadhan formula. This provides a short proof of the equivalence. We then interpret this duality further, and use ideas of convex analysis to prove a new policy inequality relative to soft Q-learning.
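For reference, the duality the abstract invokes is the Fenchel conjugacy between negative Shannon entropy and the log-sum-exp (softmax) function, the finite-alphabet case of the Donsker-Varadhan formula. A minimal statement in generic RL notation, which may differ from the paper's own:

\[
\log \sum_{a \in \mathcal{A}} \exp Q(s,a)
\;=\;
\max_{\pi \in \Delta(\mathcal{A})}
\Big[ \mathbb{E}_{a \sim \pi}\, Q(s,a) \;+\; \mathcal{H}(\pi) \Big],
\qquad
\pi^\star(a \mid s) \;=\; \frac{\exp Q(s,a)}{\sum_{a'} \exp Q(s,a')},
\]

where \(\mathcal{H}(\pi) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)\). The left-hand side is the soft value function that soft Q-learning bootstraps, while the maximizer \(\pi^\star\) is the Boltzmann policy that entropy-regularized policy gradients converge to; this is the sense in which the two algorithm families coincide.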

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

    cs.LG · 2025-12 · unverdicted · novelty 6.0

    Autoregressive language models are equivalent to energy-based models through a bijection that corresponds to the soft Bellman equation, explaining their lookahead capabilities despite next-token training.
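For context, the soft Bellman equation mentioned in this summary has the standard form below (an assumption; the cited paper's exact notation may differ):

\[
Q(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \big[ V(s') \big],
\qquad
V(s) \;=\; \log \sum_{a \in \mathcal{A}} \exp Q(s,a),
\]

the same log-sum-exp backup that appears in the duality above. Under the summary's reading, an autoregressive model's next-token log-probabilities would play a role analogous to the soft Q-values, with the soft value function acting as the log-partition function of the induced energy-based model.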