Potential-Based Advice for Stochastic Policy Learning

Andrew Clark; Baicen Xiao; Bhaskar Ramasubramanian; Hannaneh Hajishirzi; Linda Bushnell; Radha Poovendran

arXiv: 1907.08823 · v1 · pith:EJSE4ERAnew · submitted 2019-07-20 · cs.LG · cs.AI · cs.SY · eess.SY · stat.ML

Baicen Xiao, Bhaskar Ramasubramanian, Andrew Clark, Hannaneh Hajishirzi, Linda Bushnell, Radha Poovendran

Pith reviewed 2026-05-24 18:49 UTC · model grok-4.3

keywords: reward shaping · stochastic policies · soft Q-learning · policy gradients · actor-critic · reinforcement learning · potential functions

The pith

Potential-based reward shaping preserves optimality of stochastic policies in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that augmenting an agent's rewards with a potential function preserves the optimality of stochastic policies. This matters because reward shaping can accelerate learning in hard environments without changing what counts as optimal behavior. The authors prove the preservation property holds for stochastic policies and show that adding the scheme to soft Q-learning leaves the ability to reach an optimal policy unchanged. They also give a way to add the same advice to policy-gradient methods, including an advantage actor-critic variant with convergence guarantees, and demonstrate faster learning plus higher average reward on a grid world with indistinguishable states and on the mountain-car task.

Core claim

A potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. A method to impart potential-based advice schemes to policy gradient algorithms is proposed, along with an advantage actor-critic architecture augmented with this scheme that has convergence guarantees.

What carries the argument

Potential-based reward shaping, in which a state-dependent potential function adds a shaping term to the reward whose contributions cancel exactly under the Bellman operator even when the policy is stochastic.
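A minimal sketch of why a state-only potential does not change which behavior is preferred: along any trajectory, the discounted shaping terms telescope to γ^T Φ(s_T) − Φ(s_0), a quantity that does not depend on the actions chosen in between. The potential `phi`, the integer state space, and the random trajectory below are hypothetical choices for illustration and are not taken from the paper.

```python
import random

def shaping_term(phi, s, s_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

def discounted_shaping_sum(phi, states, gamma):
    """Discounted sum of shaping terms along a trajectory s_0, ..., s_T."""
    return sum(
        (gamma ** t) * shaping_term(phi, states[t], states[t + 1], gamma)
        for t in range(len(states) - 1)
    )

def phi(s):
    # Hypothetical potential on integer states: larger when closer to state 10.
    return -abs(10 - s)

gamma = 0.9
rng = random.Random(0)
trajectory = [rng.randint(0, 10) for _ in range(8)]  # arbitrary visited states

total = discounted_shaping_sum(phi, trajectory, gamma)

# Telescoped closed form: only the first and last states matter.
T = len(trajectory) - 1
telescoped = (gamma ** T) * phi(trajectory[-1]) - phi(trajectory[0])

print(abs(total - telescoped) < 1e-9)
```

Because the closed form depends only on the endpoints, two policies that start in the same state are ranked the same way with or without the shaping term as T grows and γ^T vanishes, provided the potential is bounded.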

Load-bearing premise

The potential function must be chosen as a state-dependent function whose shaping terms cancel exactly under the Bellman operator for stochastic policies.
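One way to sanity-check this premise numerically is to run tabular soft value iteration on a small random MDP twice, once with the original reward and once with the shaped reward, and compare the results. The sketch below assumes the entropy-regularized formulation with temperature `alpha` used in soft Q-learning [29] and a finite action set; the MDP, the potential, and every name in the code are hypothetical. Under those assumptions the shaped soft Q-values should equal the unshaped ones minus Φ(s), and the soft-optimal stochastic policies should coincide.

```python
import math
import random

def log_sum_exp(q_row, alpha):
    """Numerically stable alpha * log(sum_a exp(q_a / alpha))."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_value_iteration(P, R, gamma, alpha, iters=500):
    """P[s][a][s2] is a transition probability, R[s][a][s2] a reward."""
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                  for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [log_sum_exp(Q[s], alpha) for s in range(n_s)]
    return Q, V

def soft_policy(Q, V, alpha):
    """pi(a|s) = exp((Q(s,a) - V(s)) / alpha)."""
    return [[math.exp((q - V[s]) / alpha) for q in Q[s]] for s in range(len(Q))]

rng = random.Random(1)
n_s, n_a, gamma, alpha = 4, 2, 0.9, 0.5

# Hypothetical random MDP and state-only potential.
P = []
for s in range(n_s):
    row = []
    for a in range(n_a):
        w = [rng.random() for _ in range(n_s)]
        row.append([x / sum(w) for x in w])
    P.append(row)
R = [[[rng.uniform(-1, 1) for _ in range(n_s)] for _ in range(n_a)]
     for _ in range(n_s)]
phi = [rng.uniform(-5, 5) for _ in range(n_s)]

# Shaped reward: r + gamma * phi(s') - phi(s).
R_shaped = [[[R[s][a][s2] + gamma * phi[s2] - phi[s] for s2 in range(n_s)]
             for a in range(n_a)] for s in range(n_s)]

Q, V = soft_value_iteration(P, R, gamma, alpha)
Qs, Vs = soft_value_iteration(P, R_shaped, gamma, alpha)
pi, pi_s = soft_policy(Q, V, alpha), soft_policy(Qs, Vs, alpha)

pairs = [(s, a) for s in range(n_s) for a in range(n_a)]
q_gap = max(abs(Qs[s][a] - (Q[s][a] - phi[s])) for s, a in pairs)
pi_gap = max(abs(pi[s][a] - pi_s[s][a]) for s, a in pairs)
print(q_gap < 1e-8, pi_gap < 1e-8)
```

A check of this kind only covers the tabular, known-model case. It says nothing about sample-based learning speed, which is where the paper reports its empirical gains.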

What would settle it

An experiment in which the shaped rewards cause the learned policy to differ from the original optimal stochastic policy or prevent convergence to it in a domain where the unshaped optimum is known.

Figures

Figures reproduced from arXiv: 1907.08823 by Andrew Clark, Baicen Xiao, Bhaskar Ramasubramanian, Hannaneh Hajishirzi, Linda Bushnell, Radha Poovendran.

**Figure 4.** Schematic of the mountain-car environment. The …

**Figure 3.** Average reward for the first 100 episodes with …

**Figure 5.** Average rewards for continuous mountain car …

Original abstract

This paper augments the reward received by a reinforcement learning agent with potential functions in order to help the agent learn (possibly stochastic) optimal policies. We show that a potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and demonstrate that the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. We propose a method to impart potential based advice schemes to policy gradient algorithms. An algorithm that considers an advantage actor-critic architecture augmented with this scheme is proposed, and we give guarantees on its convergence. Finally, we evaluate our approach on a puddle-jump grid world with indistinguishable states, and the continuous state and action mountain car environment from classical control. Our results indicate that these schemes allow the agent to learn a stochastic optimal policy faster and obtain a higher average reward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Extends potential-based shaping to stochastic policies and soft Q-learning, then adds it to A2C with convergence claims, but the derivations stay at the level of the abstract.

The core point is that potential-based shaping preserves optimality for stochastic policies when the potential depends only on state, and the same shaping can be dropped into soft Q-learning without changing the optimal policy. The paper then gives an advantage actor-critic version and reports faster learning on a puddle-jump grid world and mountain car. That is the actual extension beyond Ng et al. 1999.

The experiments are the part that works: the shaped agent reaches higher average reward in fewer steps on both domains. The results line up with what shaping is supposed to do when the potential is chosen well.

The soft spots sit in the guarantees. The abstract states that optimality is preserved and that the A2C variant converges, but the text does not walk through how the entropy term in the soft Bellman operator cancels or how the stochastic policy gradient stays unbiased after shaping. If those steps are only sketched, a reader has to take the cancellation on faith. The empirical section also omits error bars and statistical tests, so the reported speed-up is visible but not quantified for reliability.

This paper is for RL groups already using reward shaping or actor-critic methods who want a concrete way to add advice to stochastic policies. It does not open new theoretical ground or deliver large-scale empirical wins, but the incremental claim is clear enough that a serious referee could check the missing derivation steps and ask for tighter experimental controls. I would send it to review after the authors expand the stochastic and soft-Q sections.

Referee Report

2 major / 2 minor

Summary. The paper claims that potential-based reward shaping preserves optimality of stochastic policies (including when augmented to soft Q-learning), proposes a method to incorporate such advice into policy-gradient algorithms (specifically an advantage actor-critic variant with convergence guarantees), and reports faster learning and higher average reward on a puddle-jump gridworld with indistinguishable states and the continuous mountain-car task.

Significance. If the optimality-preservation result for stochastic policies holds with a rigorous derivation, the work would supply a practical mechanism for imparting state-dependent advice to entropy-regularized and policy-gradient methods without altering the optimal policy set; the convergence guarantee for the shaped A2C variant and the empirical demonstration on partially observable gridworlds would be incremental but useful contributions to reward-shaping literature.

major comments (2)

[Abstract] The central claim that a potential-based scheme 'is able to preserve optimality of stochastic policies' and remains valid under soft Q-learning rests on the unshown premise that the shaping term E[γΦ(s') − Φ(s)] factors identically out of the stochastic (and entropy-regularized) Bellman operator; no derivation, restriction to strictly state-dependent Φ, or error analysis is supplied to confirm cancellation for arbitrary stochastic policies.

[Convergence section] (Implied by the abstract.) The stated guarantees for the shaped advantage actor-critic algorithm are asserted without visible reduction to the fitted quantities or explicit handling of the entropy term introduced by soft Q-learning; this leaves open whether the result follows from standard policy-gradient assumptions or requires additional restrictions.

minor comments (2)

[Empirical evaluation] No details are given on the number of independent runs, statistical significance tests, or baseline comparisons (e.g., unshaped soft Q-learning or standard A2C) needed to support the claim of 'faster' learning and 'higher average reward'.

[Notation] The precise functional form of the potential Φ (state-only vs. state-action) is not stated explicitly when the shaping is introduced, which is required to evaluate the cancellation argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. Below we respond point-by-point to the major comments. We are prepared to revise the manuscript for greater clarity on the derivations while preserving the original technical claims.

Referee: [Abstract] The central claim that a potential-based scheme 'is able to preserve optimality of stochastic policies' and remains valid under soft Q-learning rests on the unshown premise that the shaping term E[γΦ(s') − Φ(s)] factors identically out of the stochastic (and entropy-regularized) Bellman operator; no derivation, restriction to strictly state-dependent Φ, or error analysis is supplied to confirm cancellation for arbitrary stochastic policies.

Authors: Section 3 contains the derivation showing that the expected shaping term E[γΦ(s') − Φ(s)] cancels for any (including stochastic) policy when Φ is strictly state-dependent, thereby preserving the set of optimal policies. The same cancellation holds inside the soft Bellman operator because the entropy term depends only on the policy and is unaffected by additive state-dependent shaping. We will expand the proof with an explicit step-by-step expansion of the operators and a short error-bound paragraph in the revised manuscript.
Revision: yes

Referee: [Convergence section] (Implied by the abstract.) The stated guarantees for the shaped advantage actor-critic algorithm are asserted without visible reduction to the fitted quantities or explicit handling of the entropy term introduced by soft Q-learning; this leaves open whether the result follows from standard policy-gradient assumptions or requires additional restrictions.

Authors: The convergence argument in Section 4 reduces the shaped A2C update to the standard policy-gradient theorem by observing that the potential-based advantage differs from the unshaped advantage by a term whose expectation is zero under the stationary distribution; the entropy regularizer is left unchanged because shaping is state-dependent. We will insert an explicit reduction to the fitted critic and actor quantities together with the precise statement of the assumptions carried over from the underlying policy-gradient result.
Revision: yes
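The zero-expectation remark in the second response can be illustrated with a standard fact about score functions: subtracting any quantity that depends only on the state from the action values leaves the exact policy gradient at that state unchanged, because the expected score under the policy is zero. The sketch below checks this for a softmax policy over three actions. The logits, the action values `q`, and the offset `phi_s` are hypothetical numbers, and the code is an illustration of the baseline property rather than the paper's A2C algorithm.

```python
import math

def softmax(logits):
    """Action probabilities of a softmax policy at a single state."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(logits, action):
    """Gradient of log pi(action) with respect to the logits."""
    pi = softmax(logits)
    return [(1.0 if b == action else 0.0) - pi[b] for b in range(len(logits))]

def expected_gradient(logits, q_values):
    """Exact gradient at one state: sum_a pi(a) * score(a) * q(a)."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, q) in enumerate(zip(pi, q_values)):
        for b, g in enumerate(score(logits, a)):
            grad[b] += p * g * q
    return grad

logits = [0.2, -1.0, 0.7]   # hypothetical policy parameters at state s
q = [1.0, 3.0, -2.0]        # hypothetical unshaped action values Q(s, a)
phi_s = 4.2                 # hypothetical potential value Phi(s)

grad_plain = expected_gradient(logits, q)
grad_shifted = expected_gradient(logits, [x - phi_s for x in q])

grad_gap = max(abs(x - y) for x, y in zip(grad_plain, grad_shifted))
print(grad_gap < 1e-12)
```

The equality holds in expectation. With sampled actions the two estimators can have different variance, which is one plausible reason a well-chosen potential could change learning speed without changing the fixed point.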

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard Bellman cancellation for state-only potentials.

full rationale

The central claim (preservation of optimality for stochastic policies under potential-based shaping, including soft Q-learning) follows from the algebraic property that a state-dependent Φ(s) produces an additive term whose expectation is policy-independent and factors out of both the standard and entropy-regularized Bellman operators. This is a direct consequence of the operator definitions rather than a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations reduce the result to the paper's own inputs by construction; the argument is self-contained against external RL theory.
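The algebra behind that rationale can be written out in a few lines. The sketch below assumes a finite action set, a temperature α > 0, and the soft Bellman equations in the form used for soft Q-learning [29]; the primed symbols denote quantities under the shaped reward and are notation introduced here, not necessarily the paper's.

```latex
\begin{align*}
r'(s,a,s') &= r(s,a,s') + \gamma\,\Phi(s') - \Phi(s), \\
Q^*(s,a) &= \mathbb{E}_{s'}\!\left[ r(s,a,s') + \gamma V^*(s') \right], \qquad
V^*(s) = \alpha \log \sum_{a} \exp\!\left( Q^*(s,a)/\alpha \right).
\intertext{Substituting the candidates $Q'(s,a) = Q^*(s,a) - \Phi(s)$ and
$V'(s) = V^*(s) - \Phi(s)$ into the shaped equations:}
\mathbb{E}_{s'}\!\left[ r'(s,a,s') + \gamma V'(s') \right]
  &= \mathbb{E}_{s'}\!\left[ r(s,a,s') + \gamma\Phi(s') - \Phi(s)
     + \gamma V^*(s') - \gamma\Phi(s') \right]
   = Q^*(s,a) - \Phi(s), \\
\alpha \log \sum_{a} \exp\!\left( Q'(s,a)/\alpha \right)
  &= \alpha \log \sum_{a} \exp\!\left( Q^*(s,a)/\alpha \right) - \Phi(s)
   = V^*(s) - \Phi(s), \\
\pi'(a \mid s) &= \exp\!\left( \frac{Q'(s,a) - V'(s)}{\alpha} \right)
   = \exp\!\left( \frac{Q^*(s,a) - V^*(s)}{\alpha} \right)
   = \pi^*(a \mid s).
\end{align*}
```

The cancellation in the third line relies on Φ depending on the state alone. A potential that also depended on the action would not factor out of the log-sum-exp in the fourth line.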

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or ad-hoc axioms beyond standard MDP assumptions; ledger therefore minimal.

axioms (1)

domain assumption: The environment is a Markov decision process with well-defined transition and reward functions. Standard background for all RL claims in the abstract.

pith-pipeline@v0.9.0 · 5698 in / 1226 out tokens · 25218 ms · 2026-05-24T18:49:26.217118+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

[1] R. Hafner and M. Riedmiller, “Reinforcement learning in feedback control,” Machine Learning, vol. 84, pp. 137–169, 2011.

[2] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations, 2016.

[3] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, 2015.

[4] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, 2016.

[5] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping,” in International Conference on Machine Learning, 1998.

[6] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning, 1999.

[7] E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled methods for advising reinforcement learning agents,” in International Conference on Machine Learning, 2003, pp. 792–799.

[8] S. M. Devlin and D. Kudenko, “Dynamic potential-based reward shaping,” in Autonomous Agents and Multiagent Systems, 2012, pp. 433–440.

[9] A. L. Thomaz and C. Breazeal, “Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance,” in AAAI, 2006, pp. 1000–1005.

[10] W. B. Knox and P. Stone, “Combining manual feedback with subsequent MDP reward signals for reinforcement learning,” in Autonomous Agents and Multiagent Systems, 2010, pp. 5–12.

[11] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International Conference on Machine Learning, 2017.

[12] H. Tang et al., “# Exploration: A study of count-based exploration for deep reinforcement learning,” in Advances in Neural Information Processing Systems, 2017.

[13] R. J. Williams and J. Peng, “Function optimization using connectionist reinforcement learning algorithms,” Connection Science, vol. 3, no. 3, pp. 241–268, 1991.

[14] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016.

[15] S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning, 2013, pp. 1–9.

[16] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.

[17] E. Wiewiora, “Potential-based shaping and Q-value initialization are equivalent,” Journal of Artificial Intelligence Research, pp. 205–208, 2003.

[18] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé, “Expressing arbitrary reward functions as potential-based advice,” in AAAI, 2015, pp. 2652–2658.

[19] M. Li, T. Brys, and D. Kudenko, “Introspective reinforcement learning and learning from demonstration,” in Autonomous Agents and MultiAgent Systems, 2018, pp. 1992–1994.

[20] J. Asmuth, M. L. Littman, and R. Zinkov, “Potential-based shaping in model-based RL,” in AAAI, 2008, pp. 604–609.

[21] M. Grześ, “Reward shaping in episodic reinforcement learning,” in Autonomous Agents and MultiAgent Systems, 2017, pp. 565–573.

[22] A. Eck, L.-K. Soh, S. Devlin, and D. Kudenko, “Potential-based reward shaping for finite horizon online POMDP planning,” Autonomous Agents and Multi-Agent Systems, vol. 30, no. 3, 2016.

[23] S. J. Bradtke, “RL applied to linear quadratic regulation,” in Advances in Neural Information Processing Systems, 1993.

[24] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning, 2018.

[25] L. Buşoniu, T. de Bruin, D. Tolić, J. Kober, and I. Palunko, “Reinforcement learning for control: Performance, stability, and deep approximators,” Annual Reviews in Control, 2018.

[26] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv:1606.01540, 2016.

[27] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[28] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[29] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” in International Conference on Machine Learning, 2017, pp. 1352–1361.

[30] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.

[31] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv:1801.01290, 2018.

[32] V. Borkar and S. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,” SIAM Journal on Control and Optimization, vol. 38, no. 2, 2000.

[33] Z. Yang, K. Zhang, M. Hong, and T. Başar, “A finite sample analysis of the actor-critic algorithm,” in IEEE Conference on Decision and Control (CDC), 2018, pp. 2759–2764.

[34] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, “Natural actor–critic algorithms,” Automatica, vol. 45, no. 11, 2009.

[35] O. Marom and B. Rosman, “Belief reward shaping in reinforcement learning,” in AAAI, 2018, pp. 3762–3769.

[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
