Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 01:43 UTC · model grok-4.3
The pith
Soft actor-critic combines off-policy updates with a maximum-entropy objective to produce stable, high-performing policies for continuous control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Soft actor-critic is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework the actor aims to maximize expected reward while also maximizing entropy, succeeding at the task while acting as randomly as possible. By combining off-policy updates with a stable stochastic actor-critic formulation, the method achieves state-of-the-art performance on continuous control benchmark tasks and demonstrates very similar performance across different random seeds, in contrast to other off-policy algorithms.
What carries the argument
The maximum-entropy objective inside the actor-critic loop, which augments the reward signal with a policy entropy term to produce stochastic yet high-reward actions.
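For reference, the entropy-augmented objective the argument rests on can be written as it appears in the paper (ρ_π denotes the policy's state-action marginal, following the paper's notation):

```latex
% Maximum-entropy objective: expected return plus an entropy bonus weighted by the temperature \alpha.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right]
```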
If this is right
- The method reaches state-of-the-art performance on a range of continuous control benchmark tasks.
- It outperforms both prior on-policy and off-policy deep RL algorithms.
- Performance remains very similar across different random seeds, indicating high stability.
- The approach reduces the need for meticulous hyperparameter tuning that previously limited real-world applicability.
Where Pith is reading between the lines
- The same entropy-augmented objective could be tested on discrete-action or partially observable tasks where exploration remains costly.
- Stability across seeds may translate to easier deployment in robotics settings where retraining from different initial conditions is common.
- If the temperature schedule generalizes, the method could serve as a drop-in replacement for other off-policy actor-critic algorithms without extra tuning.
Load-bearing premise
The maximum-entropy objective and its temperature schedule keep the entropy term beneficial throughout training instead of collapsing to deterministic behavior or introducing instability on the chosen tasks.
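To make the trade-off concrete, here is a minimal sketch of how a temperature of this kind enters an entropy-regularized actor update in SAC-style implementations. The interfaces `policy.rsample` and `q_net` are illustrative assumptions, not the paper's verbatim pseudocode.

```python
import torch


def actor_loss(policy, q_net, states, alpha):
    """Entropy-regularized actor objective (minimal sketch, assumed interfaces).

    `policy.rsample(states)` is assumed to return a reparameterized action batch
    together with its log-probability; `q_net(states, actions)` returns the
    critic's soft Q-estimate. Minimizing alpha * log_prob - Q pushes the policy
    toward high-value actions while the alpha-weighted term keeps it stochastic;
    as alpha shrinks, the update approaches a purely return-seeking
    (near-deterministic) objective.
    """
    actions, log_prob = policy.rsample(states)   # reparameterization: gradients flow through actions
    q_value = q_net(states, actions)             # soft Q-estimate of the sampled actions
    return (alpha * log_prob - q_value).mean()   # entropy bonus traded off against return
```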
What would settle it
Running the algorithm on the same continuous control benchmarks and finding either large performance differences across random seeds or results no better than prior on-policy and off-policy methods would falsify the stability and performance claims.
Read the original abstract
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Soft Actor-Critic (SAC), an off-policy actor-critic deep RL algorithm derived from the maximum-entropy framework. The actor maximizes both expected return and policy entropy; the method uses off-policy updates with a stochastic actor and is evaluated on continuous-control MuJoCo benchmarks, claiming state-of-the-art performance together with markedly lower variance across random seeds than prior on-policy and off-policy baselines.
Significance. If the empirical results hold, the work supplies a practical algorithm that simultaneously improves sample efficiency and training stability for continuous control. The explicit demonstration of low seed-to-seed variance is a concrete strength for real-world use. The derivation inserts the entropy term into standard policy-gradient and Q-learning steps in a manner that remains internally consistent; the stability claim is supported by reporting performance across multiple seeds rather than single runs.
Minor comments (3)
- [§4] The temperature parameter α is introduced as a hyperparameter whose schedule affects the entropy-regularization term throughout training; a short paragraph or table summarizing the values used per environment would improve reproducibility.
- [§5] Figure captions for the learning curves should explicitly state the number of random seeds and whether shaded regions represent standard deviation or standard error.
- [§3.2] The soft Q-function update in Equation (8) reuses the same target network for both the Q and the entropy terms; a brief remark on why this choice does not introduce additional bias would clarify the implementation (a minimal sketch of the relevant target computation follows this list).
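The sketch below shows one common way such a soft target is computed in SAC-style implementations. The helper names (`policy.sample`, `q_target`) and the exact placement of the entropy term are assumptions for illustration, not the paper's Equation (8) verbatim.

```python
import torch


@torch.no_grad()
def soft_q_target(reward, next_state, done, policy, q_target, alpha, gamma=0.99):
    """Soft Bellman target (minimal sketch of the kind of update discussed in §3.2).

    The target network supplies the bootstrap Q-value, and the log-probability of
    the resampled next action adds the entropy bonus inside the same backup.
    """
    next_action, next_log_prob = policy.sample(next_state)
    next_q = q_target(next_state, next_action)
    next_value = next_q - alpha * next_log_prob          # entropy-augmented soft value
    return reward + gamma * (1.0 - done) * next_value    # bootstrapped soft target
```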
Simulated Author's Rebuttal
We thank the referee for their thorough and positive review of our manuscript. We are pleased that the referee recognizes the practical benefits of Soft Actor-Critic for improving both sample efficiency and training stability in continuous control, as well as the value of the maximum-entropy derivation and the multi-seed evaluation.
Circularity Check
No significant circularity identified
Full rationale
The paper derives the soft actor-critic updates from the standard maximum-entropy RL objective using Bellman consistency and stochastic policy gradients. These steps are internally consistent algebraic manipulations that do not reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The headline performance and stability claims are obtained by direct execution on external MuJoCo benchmarks rather than by any internal fitting or renaming operation. No step in the provided derivation chain collapses to its own inputs by construction.
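For concreteness, the Bellman consistency referred to above is the soft Bellman backup used in soft policy evaluation; up to the temperature convention (the paper folds α into the reward scale in its theory sections), it reads:

```latex
% Soft Bellman backup operator and soft state value used in soft policy evaluation.
\mathcal{T}^{\pi} Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\!\left[ V(s_{t+1}) \right],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
```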
Axiom & Free-Parameter Ledger
Free parameters (1)
- Temperature α
Axioms (2)
- Domain assumption: The environment is a Markov decision process with continuous state and action spaces.
- Domain assumption: Off-policy samples from a replay buffer can be used to update both actor and critic without introducing unacceptable bias.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear): "We propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework... J(π) = Σ_t E[r(s_t, a_t) + α H(π(·|s_t))]"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear): "Soft policy iteration alternates soft policy evaluation (Lemma 1) and improvement via KL projection (Lemma 2, Eq. 4)"
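The KL projection named in the second excerpt is the paper's policy improvement step (its Eq. 4): each iteration projects the exponentiated soft Q-function back onto the tractable policy family Π.

```latex
% Policy improvement as an information projection onto the policy set \Pi.
\pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi} \; \mathrm{D}_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\middle\|\, \frac{\exp\!\left( Q^{\pi_{\mathrm{old}}}(s_t, \cdot) \right)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right)
```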
Forward citations
Cited by 25 Pith papers
- D4RL: Datasets for Deep Data-Driven Reinforcement Learning
  D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
- Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
  A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
- CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning
  CODA augments offline multi-agent RL with on-policy diffusion trajectories that evolve with the joint policy to enable coordination.
- Planning in entropy-regularized Markov decision processes and games
  SmoothCruiser achieves Õ(1/ε^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
- To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control
  A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.
- Mastering Diverse Domains through World Models
  DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
- Dream to Control: Learning Behaviors by Latent Imagination
  Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
- A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
  MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
- Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions
  A robust semi-Markov RL agent with MILP feasibility projection and Wasserstein ambiguity set achieves $1.22M net profit on an NYC EV simulator with zero feeder violations, outperforming heuristic and other RL baselines.
- Scalable Neighborhood-Based Multi-Agent Actor-Critic
  MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
- When Forecast Accuracy Fails: Rank Correlation and Decision Quality in Multi-Market Battery Storage Optimization
  Rank correlation (Kendall tau) of price forecasts, not mean absolute error, determines intraday dispatch value for multi-market battery storage, with tau above 0.85-0.95 capturing 97-100% of perfect-foresight revenue.
- Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning
  A multi-agent RL system using Independent Soft Actor-Critic and a local-inflow surrogate for damage-equivalent loads learns policies that raise wind-farm power while respecting explicit load-increase limits.
- Behavior Regularized Offline Reinforcement Learning
  Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
- Nautilus: From One Prompt to Plug-and-Play Robot Learning
  NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
- Coordination Architecture Shapes Continuous Demand Response Outcomes in Building Districts
  In a 25-building district simulation, the hybrid MPC-SAC architecture delivered the strongest balance of load tracking accuracy (4.8% NMBE), thermal comfort (16.8% exceedance), and lowest spatial variability compared ...
- Hierarchical RL-MPC Control for Dynamic Wake Steering in Wind Farms
  A hierarchical RL-MPC framework for dynamic wake steering in wind farms delivers 23% power gain over baseline on a three-turbine case while outperforming idealized MPC with perfect state knowledge and offering safer t...
- Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations
  Pretraining Soft Actor-Critic agents via behavior cloning on PyWake-generated expert trajectories in WindGym simulations eliminates the initial learning phase for 2x2 wind farm control and yields final performance exc...
- Gymnasium: A Standard Interface for Reinforcement Learning Environments
  Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.
- An Aircraft Upset Recovery System with Reinforcement Learning
  A SAC-based reinforcement learning controller for aircraft upset recovery is judged by domain experts to produce more desirable behavior than conventional control methods.
- An Automatic Ground Collision Avoidance System with Reinforcement Learning
  The paper designs a reinforcement learning-based automatic ground collision avoidance system for jet trainers that uses limited observations and line-of-sight terrain queries to prevent collisions.
- Information-Theoretic Measures in AI: A Practical Decision Guide
  A practical guide that organizes seven IT measures around three questions each (what it answers in AI, suitable estimators, and dangerous misuses), complete with flowchart, table, and worked examples.
- Perfecting Aircraft Maneuvers with Reinforcement Learning
  Reinforcement learning agents simulate multiple aircraft aerobatic maneuvers to support development of an AI-assisted pilot training module.
Reference graph
Works this paper leans on
- [1] Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pp. 834–846, 1983.
- [2] Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., and Szepesvári, C. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems (NIPS), pp. 1204–1212, 2009.
- [3] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [4] Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.
- [5] Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence (UAI), 2016.
- [6] Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
- [7] Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
- [8] Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
- [9] Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), pp. 1352–1361, 2017.
- [10] Hasselt, H. V. Double Q-learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2613–2621, 2010.
- [11] Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), pp. 2944–2952, 2015.
- [12] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.
- [13]
- [14] Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1–9, 2013.
- [15] Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39): 1–40, 2016.
- [16] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- [17] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [18] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533, 2015.
- [19] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.
- [20] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2772–2782, 2017a.
- [21] Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017b.
- [22] O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.
- [23] Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4): 682–697, 2008.
- [24] Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Robotics: Science and Systems (RSS), 2012.
- [25] Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897, 2015.
- [26] Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a.
- [27] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
- [28] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014.
- [29] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search, 2016.
- [30] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
- [31] Thomas, P. Bias in natural actor-critic algorithms. In International Conference on Machine Learning (ICML), pp. 441–448, 2014.
- [32] Todorov, E. General duality between optimal control and estimation. In IEEE Conference on Decision and Control (CDC), pp. 4286–4292. IEEE, 2008.
- [33] Toussaint, M. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), pp. 1049–1056. ACM, 2009.
- [34] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4): 229–256, 1992.
- [35] Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.
- [36] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1433–1438, 2008.