pith. machine review for the scientific record.

arxiv: 1801.01290 · v2 · submitted 2018-01-04 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 2 Lean theorem links

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine

Pith reviewed 2026-05-13 01:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords soft actor-critic · maximum entropy reinforcement learning · off-policy actor-critic · continuous control · deep reinforcement learning · stochastic policies

The pith

Soft actor-critic combines off-policy updates with a maximum-entropy objective to produce stable, high-performing policies for continuous control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes soft actor-critic as an off-policy actor-critic algorithm grounded in the maximum entropy reinforcement learning framework. In this setup the policy is trained to maximize expected reward while also maximizing its own entropy, which encourages random yet effective behavior. Earlier deep RL methods faced high sample complexity and brittle convergence that demanded heavy hyperparameter tuning. The authors show that pairing off-policy updates with this stochastic actor-critic formulation yields state-of-the-art results on continuous control benchmarks and produces nearly identical performance across random seeds, unlike other off-policy algorithms.

Core claim

Soft actor-critic is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework the actor aims to maximize expected reward while also maximizing entropy, succeeding at the task while acting as randomly as possible. By combining off-policy updates with a stable stochastic actor-critic formulation, the method achieves state-of-the-art performance on continuous control benchmark tasks and demonstrates very similar performance across different random seeds, in contrast to other off-policy algorithms.

What carries the argument

The maximum-entropy objective inside the actor-critic loop, which augments the reward signal with a policy entropy term to produce stochastic yet high-reward actions.
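
Written out, the objective this machinery optimizes is the entropy-augmented return of maximum-entropy RL (here ρ_π is the state-action marginal induced by the policy and α is the temperature, which the paper folds into the reward scale):

    J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

Setting α = 0 recovers the conventional expected-return objective, so the entropy term is strictly an additive bonus layered on the standard RL problem.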

If this is right

  • The method reaches state-of-the-art performance on a range of continuous control benchmark tasks.
  • It outperforms both prior on-policy and off-policy deep RL algorithms.
  • Performance remains very similar across different random seeds, indicating high stability.
  • The approach reduces the need for meticulous hyperparameter tuning that previously limited real-world applicability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same entropy-augmented objective could be tested on discrete-action or partially observable tasks where exploration remains costly.
  • Stability across seeds may translate to easier deployment in robotics settings where retraining from different initial conditions is common.
  • If the temperature schedule generalizes, the method could serve as a drop-in replacement for other off-policy actor-critic algorithms without extra tuning.

Load-bearing premise

The maximum-entropy objective and its temperature schedule keep the entropy term beneficial throughout training instead of collapsing to deterministic behavior or introducing instability on the chosen tasks.
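
One way to see where this premise bites is the soft state value function, where the entropy bonus enters the value estimate (paper notation with the temperature written explicitly; the paper itself sets α = 1 and controls the trade-off through reward scaling):

    V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]

If the entropy weight effectively collapses to zero, the -α log π bonus vanishes and the update degenerates toward a standard actor-critic with its usual brittleness; if it dominates, value targets track randomness rather than reward. The premise is that neither failure occurs on the benchmark tasks.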

What would settle it

Running the algorithm on the same continuous control benchmarks and finding either large performance differences across random seeds or results no better than prior on-policy and off-policy methods would falsify the stability and performance claims.

read the original abstract

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy. That is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Soft Actor-Critic (SAC), an off-policy actor-critic deep RL algorithm derived from the maximum-entropy framework. The actor maximizes both expected return and policy entropy; the method uses off-policy updates with a stochastic actor and is evaluated on continuous-control MuJoCo benchmarks, claiming state-of-the-art performance together with markedly lower variance across random seeds than prior on-policy and off-policy baselines.

Significance. If the empirical results hold, the work supplies a practical algorithm that simultaneously improves sample efficiency and training stability for continuous control. The explicit demonstration of low seed-to-seed variance is a concrete strength for real-world use. The derivation inserts the entropy term into standard policy-gradient and Q-learning steps in a manner that remains internally consistent; the stability claim is supported by reporting performance across multiple seeds rather than single runs.

minor comments (3)
  1. [§4] The temperature parameter α is introduced as a hyperparameter whose schedule affects the entropy-regularization term throughout training; a short paragraph or table summarizing the values used per environment would improve reproducibility.
  2. [§5] Figure captions for the learning curves should explicitly state the number of random seeds and whether shaded regions represent standard deviation or standard error.
  3. [§3.2] The soft Q-function update in Equation (8) re-uses the same target network for both the Q and the entropy terms; a brief remark on why this choice does not introduce additional bias would clarify the implementation.
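
For orientation, a standard way to write the soft Q-target that comment 3 points at, with the entropy contribution reaching the target only through the target value network (a slowly updated copy of the value network regressed toward the expected soft Q minus log-probability); the exact equation number and form should be checked against the paper:

    \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]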

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thorough and positive review of our manuscript. We are pleased that the referee recognizes the practical benefits of Soft Actor-Critic for improving both sample efficiency and training stability in continuous control, as well as the value of the maximum-entropy derivation and the multi-seed evaluation.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives the soft actor-critic updates from the standard maximum-entropy RL objective using Bellman consistency and stochastic policy gradients. These steps are internally consistent algebraic manipulations that do not reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The headline performance and stability claims are obtained by direct execution on external MuJoCo benchmarks rather than by any internal fitting or renaming operation. No step in the provided derivation chain collapses to its own inputs by construction.
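
The policy-improvement step mentioned here is, in the paper's formulation, a KL projection of the exponentiated soft Q-values back onto the policy class Π (the partition function Z normalizes the target distribution and does not affect the minimizer):

    \pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi} \; D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\middle\|\, \frac{\exp\big(Q^{\pi_{\mathrm{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right)

Because the projection target is built from the previous policy's Q-function and the result is constrained to Π, the improvement argument is a fixed-point claim over evaluated returns rather than a self-defining step, consistent with the audit's conclusion.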

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the standard MDP formulation, the existence of a replay buffer that provides i.i.d. samples, and the assumption that the soft Q-function can be approximated by a neural network without introducing bias that invalidates the policy improvement step.

free parameters (1)
  • temperature alpha
    Scalar that trades off reward versus entropy; either fixed by hand or learned via a separate objective.
axioms (2)
  • domain assumption The environment is a Markov decision process with continuous state and action spaces.
    Invoked throughout the derivation of the soft policy improvement and soft Bellman equations.
  • domain assumption Off-policy samples from a replay buffer can be used to update both actor and critic without introducing unacceptable bias.
    Core premise of the off-policy actor-critic formulation.
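
A minimal sketch of where the ledger's items enter a single SAC-style update, written as PyTorch-like code and not drawn from the authors' implementation. The network objects (actor, q_net, v_net, v_target), their signatures, and the use of an unsquashed Gaussian policy are simplifying assumptions for illustration:

    # Sketch only; hypothetical modules with assumed signatures:
    #   actor(s)              -> torch.distributions.Normal over actions (no tanh squashing)
    #   q_net(s, a)           -> soft Q-value, shape (batch, 1)
    #   v_net(s), v_target(s) -> soft state value, shape (batch, 1)
    import torch
    import torch.nn.functional as F

    def sac_losses(batch, actor, q_net, v_net, v_target, alpha=0.2, gamma=0.99):
        s, a, r, s_next, done = batch            # sampled from the replay buffer (axiom 2)

        # Soft Q-target: one-step bootstrap through the target value network (axiom 1: MDP).
        with torch.no_grad():
            q_hat = r + gamma * (1.0 - done) * v_target(s_next)
        q_loss = F.mse_loss(q_net(s, a), q_hat)

        # Soft value target: expected Q minus the alpha-scaled log-probability (entropy bonus).
        dist = actor(s)
        a_new = dist.rsample()                   # reparameterized sample, low-variance gradients
        log_pi = dist.log_prob(a_new).sum(-1, keepdim=True)
        with torch.no_grad():
            v_hat = q_net(s, a_new) - alpha * log_pi
        v_loss = F.mse_loss(v_net(s), v_hat)

        # Actor loss: minimize alpha * log_pi - Q, i.e. maximize the soft Q under the policy.
        actor_loss = (alpha * log_pi - q_net(s, a_new)).mean()
        return q_loss, v_loss, actor_loss

The replay-buffer axiom appears only in where the batch comes from, the MDP axiom is implicit in the one-step bootstrapped target, and alpha is the single free parameter, scaling the log-probability penalty in both the value target and the actor loss.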

pith-pipeline@v0.9.0 · 5501 in / 1409 out tokens · 27768 ms · 2026-05-13T01:43:00.013454+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    cs.LG 2020-04 accept novelty 8.0

    D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

  2. Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

    cs.LG 2026-05 unverdicted novelty 7.0

    A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...

  3. CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    CODA augments offline multi-agent RL with on-policy diffusion trajectories that evolve with the joint policy to enable coordination.

  4. Planning in entropy-regularized Markov decision processes and games

    cs.LG 2026-04 unverdicted novelty 7.0

    SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.

  5. To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control

    eess.SY 2026-04 unverdicted novelty 7.0

    A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.

  6. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  7. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  8. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  9. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  10. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  11. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  12. Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

    cs.AI 2026-04 unverdicted novelty 6.0

    A robust semi-Markov RL agent with MILP feasibility projection and Wasserstein ambiguity set achieves $1.22M net profit on an NYC EV simulator with zero feeder violations, outperforming heuristic and other RL baselines.

  13. Scalable Neighborhood-Based Multi-Agent Actor-Critic

    cs.LG 2026-04 unverdicted novelty 6.0

    MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.

  14. When Forecast Accuracy Fails: Rank Correlation and Decision Quality in Multi-Market Battery Storage Optimization

    q-fin.TR 2026-04 unverdicted novelty 6.0

    Rank correlation (Kendall tau) of price forecasts, not mean absolute error, determines intraday dispatch value for multi-market battery storage, with tau above 0.85-0.95 capturing 97-100% of perfect-foresight revenue.

  15. Load constrained wind farm flow control through multi-objective multi-agent reinforcement learning

    eess.SY 2026-04 conditional novelty 6.0

    A multi-agent RL system using Independent Soft Actor-Critic and a local-inflow surrogate for damage-equivalent loads learns policies that raise wind-farm power while respecting explicit load-increase limits.

  16. Behavior Regularized Offline Reinforcement Learning

    cs.LG 2019-11 unverdicted novelty 6.0

    Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

  17. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  18. Coordination Architecture Shapes Continuous Demand Response Outcomes in Building Districts

    eess.SY 2026-05 unverdicted novelty 5.0

    In a 25-building district simulation, the hybrid MPC-SAC architecture delivered the strongest balance of load tracking accuracy (4.8% NMBE), thermal comfort (16.8% exceedance), and lowest spatial variability compared ...

  19. Hierarchical RL-MPC Control for Dynamic Wake Steering in Wind Farms

    eess.SY 2026-04 unverdicted novelty 5.0

    A hierarchical RL-MPC framework for dynamic wake steering in wind farms delivers 23% power gain over baseline on a three-turbine case while outperforming idealized MPC with perfect state knowledge and offering safer t...

  20. Accelerating Reinforcement Learning for Wind Farm Control via Expert Demonstrations

    eess.SY 2026-04 unverdicted novelty 5.0

    Pretraining Soft Actor-Critic agents via behavior cloning on PyWake-generated expert trajectories in WindGym simulations eliminates the initial learning phase for 2x2 wind farm control and yields final performance exc...

  21. Gymnasium: A Standard Interface for Reinforcement Learning Environments

    cs.LG 2024-07 accept novelty 5.0

    Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.

  22. An Aircraft Upset Recovery System with Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 4.0

    A SAC-based reinforcement learning controller for aircraft upset recovery is judged by domain experts to produce more desirable behavior than conventional control methods.

  23. An Automatic Ground Collision Avoidance System with Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 3.0

    The paper designs a reinforcement learning-based automatic ground collision avoidance system for jet trainers that uses limited observations and line-of-sight terrain queries to prevent collisions.

  24. Information-Theoretic Measures in AI: A Practical Decision Guide

    cs.AI 2026-04 unverdicted novelty 3.0

    A practical guide that organizes seven IT measures around three questions each—what it answers in AI, suitable estimators, and dangerous misuses—complete with flowchart, table, and worked examples.

  25. Perfecting Aircraft Maneuvers with Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 2.0

    Reinforcement learning agents simulate multiple aircraft aerobatic maneuvers to support development of an AI-assisted pilot training module.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 24 Pith papers · 4 internal anchors

  1. [1]

    Neuronlike adaptive elements that can solve difficult learning control problems

    Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pp. 834--846, 1983

  2. [2]

    Convergent temporal-difference learning with arbitrary smooth function approximation

    Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., and Szepesvári, C. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems (NIPS), pp. 1204--1212, 2009

  3. [3]

    OpenAI Gym

    Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

  4. [4]

    Benchmarking deep reinforcement learning for continuous control

    Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016

  5. [5]

    Taming the noise in reinforcement learning via soft updates

    Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence (UAI), 2016

  6. [6]

    Addressing Function Approximation Error in Actor-Critic Methods

    Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018

  7. [7]

    The Reactor: A sample-efficient actor-critic architecture

    Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017

  8. [8]

    Q-Prop: Sample-efficient policy gradient with an off-policy critic

    Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016

  9. [9]

    Reinforcement learning with deep energy-based policies

    Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), pp. 1352--1361, 2017

  10. [10]

    Hasselt, H. V. Double Q-learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2613--2621, 2010

  11. [11]

    Learning continuous control policies by stochastic value gradients

    Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), pp. 2944--2952, 2015

  12. [12]

    Deep Reinforcement Learning that Matters

    Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017

  13. [13]

    Adam: A method for stochastic optimization

    Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015

  14. [14]

    Guided policy search

    Levine, S. and Koltun, V. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1--9, 2013

  15. [15]

    End-to-end training of deep visuomotor policies

    Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1--40, 2016

  16. [16]

    Continuous control with deep reinforcement learning

    Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  17. [17]

    Playing Atari with Deep Reinforcement Learning

    Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

  18. [18]

    Human-level control through deep reinforcement learning

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529--533, 2015

  19. [19]

    Asynchronous methods for deep reinforcement learning

    Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016

  20. [20]

    Bridging the gap between value and policy based reinforcement learning

    Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2772--2782, 2017a

  21. [21]

    Trust-PCL: An off-policy trust region method for continuous control

    Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891, 2017b

  22. [22]

    PGQ: Combining policy gradient and Q-learning

    O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016

  23. [23]

    Reinforcement learning of motor skills with policy gradients

    Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682--697, 2008

  24. [24]

    On stochastic optimal control and reinforcement learning by approximate inference

    Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. Robotics: Science and Systems (RSS), 2012

  25. [25]

    Trust region policy optimization

    Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889--1897, 2015

  26. [26]

    Equivalence between policy gradients and soft Q-learning

    Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a

  27. [27]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b

  28. [28]

    Deterministic policy gradient algorithms

    Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), 2014

  29. [29]

    Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree sear...

  30. [30]

    Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  31. [31]

    Bias in natural actor-critic algorithms

    Thomas, P. Bias in natural actor-critic algorithms. In International Conference on Machine Learning (ICML), pp. 441--448, 2014

  32. [32]

    General duality between optimal control and estimation

    Todorov, E. General duality between optimal control and estimation. In IEEE Conference on Decision and Control (CDC), pp. 4286--4292. IEEE, 2008

  33. [33]

    Robot trajectory optimization using approximate inference

    Toussaint, M. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), pp. 1049--1056. ACM, 2009

  34. [34]

    Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229--256, 1992

  35. [35]

    Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010

  36. [36]

    Maximum entropy inverse reinforcement learning

    Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1433--1438, 2008