arxiv: 1812.05905 · v2 · submitted 2018-12-13 · cs.LG · cs.AI · cs.RO · stat.ML

Soft Actor-Critic Algorithms and Applications

Abhishek Gupta, Aurick Zhou, George Tucker, Henry Zhu, Jie Tan, Kristian Hartikainen, Pieter Abbeel, Sehoon Ha, Sergey Levine, Tuomas Haarnoja, Vikash Kumar

Pith reviewed 2026-05-13 14:28 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.RO · stat.ML

keywords soft actor-critic · reinforcement learning · maximum entropy · off-policy actor-critic · sample efficiency · stability · robotics control

The pith

Soft Actor-Critic achieves state-of-the-art sample efficiency and stability in reinforcement learning by maximizing both task reward and policy entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Soft Actor-Critic, an off-policy reinforcement learning method that trains agents to succeed at tasks while acting as randomly as possible. The approach adds a constrained temperature parameter that tunes itself during training, removing the need for manual adjustment. Evaluations on benchmark tasks and real robots show it learns faster and reaches higher performance than previous on-policy and off-policy methods. It also produces consistent results across different random starting points, unlike other off-policy algorithms. These properties suggest it can handle the high sample needs and sensitivity issues that limit reinforcement learning in real-world settings.

Core claim

Soft Actor-Critic extends the maximum entropy reinforcement learning framework to off-policy actor-critic methods. The actor learns a policy that maximizes both the expected cumulative reward and the entropy of its actions. A new constrained optimization formulation automatically adjusts the temperature parameter that balances these two objectives. This combination yields state-of-the-art sample efficiency and final performance on locomotion and manipulation tasks while remaining stable across random seeds.
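For reference, the maximum entropy objective described above is commonly written as below. The notation is an assumption of this note rather than something quoted on this page: \rho_\pi is the state-action distribution induced by the policy, \alpha is the temperature, and \mathcal{H} is the entropy of the policy at a given state.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting \alpha = 0 recovers the usual expected-return objective, which is why the temperature is the scalar that trades reward against randomness.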

What carries the argument

The maximum entropy objective in an off-policy actor-critic setup, with a constrained formulation for automatic temperature tuning, which encourages stochastic policies that explore more effectively during training.
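A minimal sketch of how such a temperature update might look, assuming the commonly used loss J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)] and a log-parameterisation of alpha to keep it positive. The function name `temperature_step`, the learning rate, and the plain gradient-descent update are illustrative choices, not the authors' implementation.

```python
import numpy as np

def temperature_step(log_alpha, log_probs, target_entropy, lr=1e-2):
    """One gradient-descent step on J(alpha) = E[-alpha * (log_pi + target_entropy)].

    alpha is parameterised as exp(log_alpha) so it stays positive
    (an assumption of this sketch).
    """
    alpha = np.exp(log_alpha)
    # Chain rule: dJ/dlog_alpha = -alpha * E[log_pi + target_entropy]
    grad = -alpha * np.mean(log_probs + target_entropy)
    return log_alpha - lr * grad

# Hypothetical batch of action log-probabilities sampled from the current policy.
batch_log_probs = np.array([3.0, 2.5, 3.5])
new_log_alpha = temperature_step(0.0, batch_log_probs, target_entropy=-2.0)
```

When the sampled entropy (the negative mean log-probability) falls below the target, the step raises alpha and so weights entropy more heavily; when it exceeds the target, alpha shrinks.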

If this is right

SAC outperforms prior on-policy and off-policy methods in both sample efficiency and asymptotic performance.
The approach exhibits high stability, with similar performance across different random seeds, unlike other off-policy algorithms.
The constrained temperature formulation reduces sensitivity to hyperparameter choices.
These properties make the method suitable for challenging real-world robotics tasks such as quadrupedal locomotion and dexterous hand manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the stability generalizes, SAC could reduce the engineering effort needed to apply learned policies on physical hardware.
The entropy-driven exploration might help in tasks with sparse rewards where standard methods struggle.
Extending the same maximum entropy principle to discrete action spaces could broaden its applicability.

Load-bearing premise

The maximum entropy objective combined with the constrained temperature formulation will reliably produce the claimed stability and sample-efficiency gains without introducing new sensitivities or requiring task-specific adjustments beyond those described.

What would settle it

A direct comparison of SAC against other leading off-policy algorithms on multiple continuous control benchmarks, measuring both average performance and variance across at least five random seeds per task.
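The comparison described above could be summarised with a small helper like the following sketch. The function name `summarize_across_seeds`, the method labels, and all numbers are placeholders for illustration; they are not results from the paper.

```python
import statistics

def summarize_across_seeds(returns_by_method):
    """Map each method name to (mean, sample standard deviation) of its final returns."""
    summary = {}
    for method, returns in returns_by_method.items():
        if len(returns) < 5:
            raise ValueError(f"{method}: need at least five seeds, got {len(returns)}")
        summary[method] = (statistics.mean(returns), statistics.stdev(returns))
    return summary

# Placeholder final returns, one value per random seed. NOT data from the paper.
example = {
    "method_a": [100.0, 102.0, 98.0, 101.0, 99.0],
    "method_b": [120.0, 40.0, 95.0, 130.0, 15.0],
}
summary = summarize_across_seeds(example)
```

A method that is stable across seeds shows a small standard deviation relative to its mean, so reporting both numbers per task addresses the average-performance and variance parts of the test at once.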

Original abstract

Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. However, these methods typically suffer from two major challenges: high sample complexity and brittleness to hyperparameters. Both of these challenges limit the applicability of such methods to real-world domains. In this paper, we describe Soft Actor-Critic (SAC), our recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework. In this framework, the actor aims to simultaneously maximize expected return and entropy. That is, to succeed at the task while acting as randomly as possible. We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter. We systematically evaluate SAC on a range of benchmark tasks, as well as real-world challenging tasks such as locomotion for a quadrupedal robot and robotic manipulation with a dexterous hand. With these improvements, SAC achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample-efficiency and asymptotic performance. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving similar performance across different random seeds. These results suggest that SAC is a promising candidate for learning in real-world robotics tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Soft Actor-Critic (SAC), an off-policy actor-critic algorithm grounded in the maximum-entropy RL framework in which the policy maximizes both expected return and entropy. It extends the base method with a constrained optimization formulation that automatically tunes the temperature parameter, along with other modifications for faster training and improved hyperparameter stability. Systematic evaluations are reported on standard benchmark control tasks as well as real-world robotics domains (quadrupedal locomotion and dexterous-hand manipulation), with the central claims being state-of-the-art sample efficiency and asymptotic performance together with markedly higher stability across random seeds than prior off-policy methods.

Significance. If the empirical claims hold, the work is significant for deep RL because it directly targets the twin obstacles of sample complexity and hyperparameter brittleness that have limited real-world deployment. The constrained temperature formulation supplies a principled, largely automatic mechanism for balancing exploration and exploitation within the maximum-entropy objective, and the reported success on hardware robotics tasks supplies concrete evidence of practical utility. The stability result across seeds is particularly valuable for reproducibility.

major comments (1)

[Temperature tuning / constrained formulation] The target entropy is fixed to the heuristic value -dim(A). This choice is load-bearing for the claim that the method requires no task-specific adjustments beyond the described auto-tuning of temperature; on tasks whose action spaces or dynamics deviate from the benchmark suite, the heuristic may re-introduce sensitivity that the constrained formulation was intended to remove. A concrete test would be to report performance when the target entropy is instead treated as a tunable hyperparameter or learned jointly.
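To give a rough sense of what the -dim(A) heuristic implies, the sketch below computes the standard deviation at which an isotropic Gaussian policy would have exactly that differential entropy. This is intuition only: it assumes an unsquashed Gaussian and ignores any bounding of actions, and the helper names are hypothetical.

```python
import math

def heuristic_target_entropy(action_dim):
    """Target entropy heuristic discussed above: minus the action dimensionality."""
    return -float(action_dim)

def diag_gaussian_entropy(sigma, action_dim):
    """Differential entropy (nats) of an isotropic Gaussian with std `sigma`."""
    return 0.5 * action_dim * math.log(2.0 * math.pi * math.e * sigma ** 2)

def sigma_matching_target(action_dim):
    """Std at which the isotropic Gaussian's entropy equals the heuristic target."""
    per_dim = heuristic_target_entropy(action_dim) / action_dim  # always -1 nat
    return math.exp(per_dim) / math.sqrt(2.0 * math.pi * math.e)

# Sweep a few hypothetical action dimensionalities.
matching_sigmas = {d: sigma_matching_target(d) for d in (1, 6, 17)}
```

Because the heuristic scales linearly with the action dimension, the matching per-dimension standard deviation is the same for every dimensionality. That property makes the heuristic task-independent, and it is also why the referee asks whether a single value suits every task.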

minor comments (1)

[Abstract] The abstract asserts state-of-the-art results without any quantitative anchors (e.g., percentage improvement or specific baseline names); adding one or two such numbers would strengthen the summary for readers who do not reach the experimental section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the stability and robotics results, and recommendation for minor revision. We address the single major comment below.

Point-by-point responses

Referee: [Temperature tuning / constrained formulation] The target entropy is fixed to the heuristic value -dim(A). This choice is load-bearing for the claim that the method requires no task-specific adjustments beyond the described auto-tuning of temperature; on tasks whose action spaces or dynamics deviate from the benchmark suite, the heuristic may re-introduce sensitivity that the constrained formulation was intended to remove. A concrete test would be to report performance when the target entropy is instead treated as a tunable hyperparameter or learned jointly.

Authors: We appreciate the referee's observation. The target entropy is set to the fixed heuristic -dim(A), which is a standard, environment-determined choice in the maximum-entropy framework that scales with action dimensionality and requires no per-task manual selection. The central contribution of the constrained formulation is the automatic adaptation of the temperature parameter α itself, which removes the primary source of hyperparameter sensitivity present in prior maximum-entropy methods. Our results across locomotion, manipulation, and real-robot tasks show that this combination yields stable performance without further adjustments, supporting the claim of reduced task-specific tuning. We do not believe the heuristic re-introduces brittleness in practice, as dim(A) is known a priori from the action space. Nevertheless, we will add a short sensitivity discussion and additional plots treating the target entropy as a tunable hyperparameter in the appendix of the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper derives the SAC algorithm from the external maximum-entropy RL objective (expected return plus entropy) using standard soft Bellman backups and a constrained optimization step for the temperature parameter via Lagrange multipliers. These steps follow directly from the framework without reducing to self-definition or renaming fitted quantities as predictions. The target entropy heuristic (-dim(A)) is an explicit design choice, not a derived result. Empirical performance claims are evaluated separately on benchmarks and do not enter the derivation. Self-citations reference prior independent work on max-entropy RL and do not form a load-bearing closed loop. The derivation remains self-contained against external benchmarks.
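The derivation steps named above can be sketched in the following commonly used form. The symbols are assumptions of this note rather than quotations from the page: \gamma is the discount factor, p the transition distribution, \alpha the temperature, and \bar{\mathcal{H}} the target entropy.

```latex
\begin{aligned}
Q(s_t, a_t) &\leftarrow r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[ V(s_{t+1}) \big], \\
V(s_t) &= \mathbb{E}_{a_t \sim \pi}\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big], \\
J(\alpha) &= \mathbb{E}_{a_t \sim \pi}\big[ -\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \big].
\end{aligned}
```

The first two lines are the soft Bellman backup. The third is the dual objective in which \alpha plays the role of the Lagrange multiplier for the entropy constraint, with \bar{\mathcal{H}} entering only as a fixed constant.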

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the appropriateness of the maximum-entropy objective for control tasks and on the constrained optimization successfully replacing manual temperature tuning without side effects.

free parameters (1)

temperature
Automatically tuned via the constrained formulation but remains the key scalar balancing reward and entropy in the objective.

axioms (1)

domain assumption: The maximum entropy RL framework is suitable for the target continuous-control tasks.
The actor is defined to maximize both expected return and entropy simultaneously.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DAlembert.Inevitability bilinear_family_forced unclear
We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
cs.LG 2026-05 unverdicted novelty 7.0

CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
quant-ph 2026-05 unverdicted novelty 7.0

TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
cs.LG 2026-05 unverdicted novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...

Generative Actor-Critic with Soft Bridge Policies
cs.LG 2026-05 unverdicted novelty 7.0

SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
cs.RO 2026-04 unverdicted novelty 7.0

A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
cs.RO 2026-04 unverdicted novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

Autonomous Diffractometry Enabled by Visual Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.

Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing
cs.RO 2026-04 unverdicted novelty 7.0

A DRL policy learns racing controls from depth spectral distributions using a non-geometric physics-informed reward, achieving 12% better performance than humans on out-of-distribution tracks with under 1% of baseline...

SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion
cs.RO 2026-04 unverdicted novelty 7.0

SafeMind is a differentiable framework that combines probabilistic control barrier functions, semantic context encoding, and meta-adaptive risk calibration to deliver safer, lower-energy quadruped locomotion under unc...

Debiased Model-based Representations for Sample-efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or ...

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
cs.LG 2026-05 unverdicted novelty 6.0

A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks
cs.LG 2026-05 unverdicted novelty 6.0

Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.

Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
cs.LG 2026-05 unverdicted novelty 6.0

Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
cs.RO 2026-04 unverdicted novelty 6.0

Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

Self-Predictive Representation for Autonomous UAV Object-Goal Navigation
cs.RO 2026-04 unverdicted novelty 6.0

AmelPredSto, a stochastic self-predictive representation model, outperforms other state representation learning approaches when combined with actor-critic RL for object-goal navigation in UAVs.

Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation
cs.RO 2026-04 conditional novelty 6.0

An end-to-end RL policy trained via high-fidelity differentiable simulation maps depth images straight to bodyrate commands, achieving top success rates, low jerk, and zero-shot real-world generalization up to 7.5 m/s...

TD-MPC2: Scalable, Robust World Models for Continuous Control
cs.LG 2023-10 conditional novelty 6.0

TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO 2026-05 unverdicted novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer
cs.RO 2026-05 unverdicted novelty 5.0

REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.

Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation
cs.RO 2026-04 unverdicted novelty 5.0

IRRL lets robots learn social navigation in the real world by incrementally updating only the differences from a base policy, matching replay-buffer methods in simulation and adapting to new settings on physical robots.

Delayed homomorphic reinforcement learning for environments with delayed feedback
cs.LG 2026-04 unverdicted novelty 5.0

DHRL defines belief-equivalence over augmented states to abstract away control-redundant states, preserving optimality in finite domains and yielding a deep actor-critic method that outperforms baselines on MuJoCo tasks.

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
cs.RO 2020-09 unverdicted novelty 5.0

The paper presents robosuite v1.5, a MuJoCo-based modular simulation framework with benchmark environments for reproducible robot learning research.

Flexible Empowerment at Reasoning with Extended Best-of-N Sampling
cs.LG 2026-04 unverdicted novelty 4.0

Extended best-of-N sampling with Tsallis statistics allows flexible empowerment in RL reasoning, balancing exploration-exploitation and improving locomotion task performance.
