arXiv: 1907.08823 · v1 · pith:EJSE4ERAnew · submitted 2019-07-20 · 💻 cs.LG · cs.AI · cs.SY · eess.SY · stat.ML

Potential-Based Advice for Stochastic Policy Learning

Pith reviewed 2026-05-24 18:49 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.SY · eess.SY · stat.ML
keywords: reward shaping, stochastic policies, soft Q-learning, policy gradients, actor-critic, reinforcement learning, potential functions

The pith

Potential-based reward shaping preserves optimality of stochastic policies in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that augmenting an agent's rewards with a potential function preserves the optimality of stochastic policies. This matters because reward shaping can accelerate learning in hard environments without changing what counts as optimal behavior. The authors prove the preservation property holds for stochastic policies and show that adding the scheme to soft Q-learning leaves the ability to reach an optimal policy unchanged. They also give a way to add the same advice to policy-gradient methods, including an advantage actor-critic variant with convergence guarantees, and demonstrate faster learning plus higher average reward on a grid world with indistinguishable states and on the mountain-car task.

Core claim

A potential-based reward shaping scheme preserves the optimality of stochastic policies, and augmenting soft Q-learning with this scheme does not affect the agent's ability to learn an optimal policy. The paper also proposes a method for imparting potential-based advice to policy gradient algorithms, along with an advantage actor-critic architecture augmented with this scheme that has convergence guarantees.

What carries the argument

Potential-based reward shaping, in which a state-dependent potential function adds a shaping term to the reward whose contributions cancel exactly under the Bellman operator even when the policy is stochastic.
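
For orientation, the cancellation can be written out in the standard notation of Ng, Harada, and Russell (1999). The symbols F, Φ, Q^π, and Q'^π below follow that convention and are assumptions here, since this page does not reproduce the paper's own notation. The sketch also assumes an infinite-horizon discounted MDP with γ < 1 and a bounded, state-only Φ, so that the shaping terms telescope along any trajectory:

```latex
\begin{align*}
F(s, a, s') &= \gamma \Phi(s') - \Phi(s),
  \qquad r'(s, a, s') = r(s, a, s') + F(s, a, s'), \\
Q'^{\pi}(s, a)
  &= \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}
     \bigl( r_t + \gamma \Phi(s_{t+1}) - \Phi(s_t) \bigr)
     \,\middle|\, s_0 = s,\ a_0 = a \right] \\
  &= Q^{\pi}(s, a) - \Phi(s).
\end{align*}
```

The offset −Φ(s) is the same for every action in state s and for every policy π, deterministic or stochastic, so comparisons between actions in a given state are unchanged.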

Load-bearing premise

The potential function must be chosen as a state-dependent function whose shaping terms cancel exactly under the Bellman operator for stochastic policies.

What would settle it

An experiment in which the shaped rewards cause the learned policy to differ from the original optimal stochastic policy or prevent convergence to it in a domain where the unshaped optimum is known.
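
A small numerical version of such a check can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's experiment: the MDP is a hypothetical random tabular one (4 states, 3 actions), the solver is plain soft value iteration with temperature `ALPHA`, and the names `soft_value_iteration`, `shape`, and `phi` are introduced here for illustration only.

```python
import math
import random


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(P, R, gamma, alpha, iters=500):
    """Soft value iteration on a tabular MDP.

    P[s][a][t] is the transition probability and R[s][a][t] the reward
    for moving from state s to state t under action a.
    Returns the soft-optimal policy and the soft state values.
    """
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s
    for _ in range(iters):
        Q = [[sum(P[s][a][t] * (R[s][a][t] + gamma * V[t]) for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [alpha * log_sum_exp([q / alpha for q in Q[s]]) for s in range(n_s)]
    policy = [[math.exp((Q[s][a] - V[s]) / alpha) for a in range(n_a)]
              for s in range(n_s)]
    return policy, V


def shape(R, phi, gamma):
    """Add the potential-based term gamma * phi[t] - phi[s] to every reward."""
    n_s = len(R)
    return [[[R[s][a][t] + gamma * phi[t] - phi[s] for t in range(n_s)]
             for a in range(len(R[s]))] for s in range(n_s)]


# Hypothetical random MDP (not from the paper).
rng = random.Random(0)
N_S, N_A, GAMMA, ALPHA = 4, 3, 0.9, 0.5
P = []
for s in range(N_S):
    rows = []
    for a in range(N_A):
        w = [rng.random() + 0.01 for _ in range(N_S)]
        total = sum(w)
        rows.append([x / total for x in w])
    P.append(rows)
R = [[[rng.uniform(-1, 1) for _ in range(N_S)] for _ in range(N_A)]
     for _ in range(N_S)]
phi = [rng.uniform(-5, 5) for _ in range(N_S)]  # arbitrary state-only potential

pi_plain, V_plain = soft_value_iteration(P, R, GAMMA, ALPHA)
pi_shaped, V_shaped = soft_value_iteration(P, shape(R, phi, GAMMA), GAMMA, ALPHA)

# If shaping preserves the soft-optimal policy, this gap is at round-off level.
max_policy_gap = max(abs(pi_plain[s][a] - pi_shaped[s][a])
                     for s in range(N_S) for a in range(N_A))
print(f"largest policy difference: {max_policy_gap:.2e}")
```

A gap well above floating-point round-off in a setting like this would count as evidence against the preservation claim; a gap at round-off level only confirms it for this one tabular instance and says nothing about learning speed or function approximation.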

Figures

Figures reproduced from arXiv: 1907.08823 by Andrew Clark, Baicen Xiao, Bhaskar Ramasubramanian, Hannaneh Hajishirzi, Linda Bushnell, Radha Poovendran.

Figure 1: Schematic of the puddle-jump gridworld. The state …
Figure 3: Average reward for the first 100 episodes with …
Figure 4: Schematic of the mountain-car environment. The …
Figure 5: Average rewards for continuous mountain car …

Original abstract

This paper augments the reward received by a reinforcement learning agent with potential functions in order to help the agent learn (possibly stochastic) optimal policies. We show that a potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and demonstrate that the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. We propose a method to impart potential based advice schemes to policy gradient algorithms. An algorithm that considers an advantage actor-critic architecture augmented with this scheme is proposed, and we give guarantees on its convergence. Finally, we evaluate our approach on a puddle-jump grid world with indistinguishable states, and the continuous state and action mountain car environment from classical control. Our results indicate that these schemes allow the agent to learn a stochastic optimal policy faster and obtain a higher average reward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that potential-based reward shaping preserves optimality of stochastic policies (including when augmented to soft Q-learning), proposes a method to incorporate such advice into policy-gradient algorithms (specifically an advantage actor-critic variant with convergence guarantees), and reports faster learning and higher average reward on a puddle-jump gridworld with indistinguishable states and the continuous mountain-car task.

Significance. If the optimality-preservation result for stochastic policies holds with a rigorous derivation, the work would supply a practical mechanism for imparting state-dependent advice to entropy-regularized and policy-gradient methods without altering the set of optimal policies. The convergence guarantee for the shaped A2C variant and the empirical demonstration on a gridworld with indistinguishable states would be incremental but useful contributions to the reward-shaping literature.

major comments (2)
  1. [Abstract] The central claim that a potential-based scheme 'is able to preserve optimality of stochastic policies' and remains valid under soft Q-learning rests on the unshown premise that the shaping term E[γΦ(s') − Φ(s)] factors identically out of the stochastic (and entropy-regularized) Bellman operator; no derivation, restriction to strictly state-dependent Φ, or error analysis is supplied to confirm cancellation for arbitrary stochastic policies.
  2. [Convergence section] (Implied by the abstract.) The stated guarantees for the shaped advantage actor-critic algorithm are asserted without visible reduction to the fitted quantities or explicit handling of the entropy term introduced by soft Q-learning; this leaves open whether the result follows from standard policy-gradient assumptions or requires additional restrictions.
minor comments (2)
  1. [Empirical evaluation] No details are given on the number of independent runs, statistical significance tests, or baseline comparisons (e.g., unshaped soft Q-learning or standard A2C) needed to support the claim of 'faster' learning and 'higher average reward'.
  2. [Notation] The precise functional form of the potential Φ (state-only vs. state-action) is not stated explicitly when the shaping is introduced, which is required to evaluate the cancellation argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. Below we respond point-by-point to the major comments. We are prepared to revise the manuscript for greater clarity on the derivations while preserving the original technical claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that a potential-based scheme 'is able to preserve optimality of stochastic policies' and remains valid under soft Q-learning rests on the unshown premise that the shaping term E[γΦ(s') − Φ(s)] factors identically out of the stochastic (and entropy-regularized) Bellman operator; no derivation, restriction to strictly state-dependent Φ, or error analysis is supplied to confirm cancellation for arbitrary stochastic policies.

    Authors: Section 3 contains the derivation showing that the expected shaping term E[γΦ(s') − Φ(s)] cancels for any policy, including stochastic ones, when Φ is strictly state-dependent, thereby preserving the set of optimal policies. The same cancellation holds inside the soft Bellman operator because the entropy term depends only on the policy and is unaffected by additive state-dependent shaping. In the revised manuscript we will extend the proof with an explicit step-by-step expansion of the operators and a short error-bound paragraph. Revision: yes.

  2. Referee: [Convergence section] (Implied by the abstract.) The stated guarantees for the shaped advantage actor-critic algorithm are asserted without visible reduction to the fitted quantities or explicit handling of the entropy term introduced by soft Q-learning; this leaves open whether the result follows from standard policy-gradient assumptions or requires additional restrictions.

    Authors: The convergence argument in Section 4 reduces the shaped A2C update to the standard policy-gradient theorem by observing that the potential-based advantage differs from the unshaped advantage by a term whose expectation is zero under the stationary distribution; the entropy regularizer is left unchanged because shaping is state-dependent. We will insert an explicit reduction to the fitted critic and actor quantities, together with a precise statement of the assumptions carried over from the underlying policy-gradient result. Revision: yes.
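
The identities these responses appeal to can be sketched as follows. This is an illustration in the usual maximum-entropy RL notation, not the paper's derivation: the temperature α, the soft values Q_soft and V_soft, and the advantage A^π are assumed symbols, and Φ is taken to be state-only.

```latex
\begin{align*}
V_{\mathrm{soft}}(s) &= \alpha \log \sum_{a} \exp\!\bigl( Q_{\mathrm{soft}}(s, a) / \alpha \bigr), \\
Q'_{\mathrm{soft}}(s, a) &= Q_{\mathrm{soft}}(s, a) - \Phi(s)
  \;\Longrightarrow\;
  V'_{\mathrm{soft}}(s) = V_{\mathrm{soft}}(s) - \Phi(s), \\
\pi'(a \mid s) &= \exp\!\bigl( (Q'_{\mathrm{soft}}(s, a) - V'_{\mathrm{soft}}(s)) / \alpha \bigr)
  = \pi(a \mid s), \\
A'^{\pi}(s, a) &= Q'^{\pi}(s, a) - V'^{\pi}(s) = A^{\pi}(s, a).
\end{align*}
```

The second line can be checked by substitution: with shaped reward r + γΦ(s') − Φ(s), the candidate Q_soft − Φ satisfies the shaped soft Bellman equation because the γΦ(s') term cancels against the −γΦ(s') carried by V'_soft(s'). Under these identities the exact advantage is unchanged by state-only shaping; whether the paper's convergence argument takes this route is not visible from this page.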

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard Bellman cancellation for state-only potentials.

Full rationale

The central claim (preservation of optimality for stochastic policies under potential-based shaping, including soft Q-learning) follows from the algebraic property that a state-dependent Φ(s) produces an additive term whose expectation is policy-independent and factors out of both the standard and entropy-regularized Bellman operators. This is a direct consequence of the operator definitions rather than a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations reduce the result to the paper's own inputs by construction; the argument is self-contained against external RL theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

An abstract-only review yields no explicit free parameters, invented entities, or ad-hoc axioms beyond standard MDP assumptions; the ledger is therefore minimal.

axioms (1)
  • Domain assumption: The environment is a Markov decision process with well-defined transition and reward functions.
    Standard background for all RL claims in the abstract.


