Soft Actor-Critic Algorithms and Applications
Pith reviewed 2026-05-13 14:28 UTC · model grok-4.3
The pith
Soft Actor-Critic achieves state-of-the-art sample efficiency and stability in reinforcement learning by maximizing both task reward and policy entropy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Soft Actor-Critic extends the maximum entropy reinforcement learning framework to off-policy actor-critic methods. The actor learns a policy that maximizes both the expected cumulative reward and the entropy of its actions. A new constrained optimization formulation automatically adjusts the temperature parameter that balances these two objectives. This combination yields state-of-the-art sample efficiency and final performance on locomotion and manipulation tasks while remaining stable across random seeds.
What carries the argument
The maximum entropy objective in an off-policy actor-critic setup, with a constrained formulation for automatic temperature tuning, which encourages stochastic policies that explore more effectively during training.
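For reference, the maximum entropy objective named above is commonly written as below. The notation is an assumption of this sketch rather than a quotation from the page: \rho_\pi denotes the state-action distribution induced by the policy \pi, \alpha is the temperature, and \mathcal{H} is the entropy.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\left[ \log \pi(a \mid s_t) \right]
```

With \alpha = 0 this reduces to the standard expected-return objective; larger \alpha places more weight on keeping the policy stochastic.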
If this is right
- SAC outperforms prior on-policy and off-policy methods in both sample efficiency and asymptotic performance.
- The approach is highly stable: unlike other off-policy algorithms, it performs similarly across different random seeds.
- The constrained temperature formulation reduces sensitivity to hyperparameter choices.
- These properties make the method suitable for challenging real-world robotics tasks such as quadrupedal locomotion and dexterous hand manipulation.
Where Pith is reading between the lines
- If the stability generalizes, SAC could reduce the engineering effort needed to apply learned policies on physical hardware.
- The entropy-driven exploration might help in tasks with sparse rewards where standard methods struggle.
- Extending the same maximum entropy principle to discrete action spaces could broaden its applicability.
Load-bearing premise
The maximum entropy objective combined with the constrained temperature formulation will reliably produce the claimed stability and sample-efficiency gains without introducing new sensitivities or requiring task-specific adjustments beyond those described.
What would settle it
A direct comparison of SAC against other leading off-policy algorithms on multiple continuous control benchmarks, measuring both average performance and variance across at least five random seeds per task.
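The per-task bookkeeping for such a comparison could look like the sketch below. The function names, the input layout (a mapping from algorithm name to a list of final returns, one per seed), and the use of the sample standard deviation as the spread measure are all assumptions made for illustration; no results are implied.

```python
import statistics


def summarize_seeds(final_returns, min_seeds=5):
    """Return (mean, sample standard deviation) of per-seed final returns."""
    if len(final_returns) < min_seeds:
        raise ValueError(
            f"need at least {min_seeds} seeds, got {len(final_returns)}"
        )
    return statistics.mean(final_returns), statistics.stdev(final_returns)


def compare_algorithms(results, min_seeds=5):
    """Map {algorithm: [final return per seed]} to {algorithm: (mean, std)}."""
    return {
        name: summarize_seeds(returns, min_seeds)
        for name, returns in results.items()
    }
```

A low standard deviation relative to the mean, repeated across tasks, would be the kind of evidence the stability claim needs.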
Original abstract
Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. However, these methods typically suffer from two major challenges: high sample complexity and brittleness to hyperparameters. Both of these challenges limit the applicability of such methods to real-world domains. In this paper, we describe Soft Actor-Critic (SAC), our recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework. In this framework, the actor aims to simultaneously maximize expected return and entropy. That is, to succeed at the task while acting as randomly as possible. We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter. We systematically evaluate SAC on a range of benchmark tasks, as well as real-world challenging tasks such as locomotion for a quadrupedal robot and robotic manipulation with a dexterous hand. With these improvements, SAC achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample-efficiency and asymptotic performance. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving similar performance across different random seeds. These results suggest that SAC is a promising candidate for learning in real-world robotics tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Soft Actor-Critic (SAC), an off-policy actor-critic algorithm grounded in the maximum-entropy RL framework in which the policy maximizes both expected return and entropy. It extends the base method with a constrained optimization formulation that automatically tunes the temperature parameter, along with other modifications for faster training and improved hyperparameter stability. Systematic evaluations are reported on standard benchmark control tasks as well as real-world robotics domains (quadrupedal locomotion and dexterous-hand manipulation), with the central claims being state-of-the-art sample efficiency and asymptotic performance together with markedly higher stability across random seeds than prior off-policy methods.
Significance. If the empirical claims hold, the work is significant for deep RL because it directly targets the twin obstacles of sample complexity and hyperparameter brittleness that have limited real-world deployment. The constrained temperature formulation supplies a principled, largely automatic mechanism for balancing exploration and exploitation within the maximum-entropy objective, and the reported success on hardware robotics tasks supplies concrete evidence of practical utility. The stability result across seeds is particularly valuable for reproducibility.
major comments (1)
- [Temperature tuning / constrained formulation] The target entropy is fixed to the heuristic value -dim(A). This choice is load-bearing for the claim that the method requires no task-specific adjustments beyond the described auto-tuning of temperature; on tasks whose action spaces or dynamics deviate from the benchmark suite, the heuristic may re-introduce sensitivity that the constrained formulation was intended to remove. A concrete test would be to report performance when the target entropy is instead treated as a tunable hyperparameter or learned jointly.
minor comments (1)
- [Abstract] The abstract asserts state-of-the-art results without any quantitative anchors (e.g., percentage improvement or specific baseline names); adding one or two such numbers would strengthen the summary for readers who do not reach the experimental section.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the stability and robotics results, and recommendation for minor revision. We address the single major comment below.
Point-by-point responses
-
Referee: [Temperature tuning / constrained formulation] The target entropy is fixed to the heuristic value -dim(A). This choice is load-bearing for the claim that the method requires no task-specific adjustments beyond the described auto-tuning of temperature; on tasks whose action spaces or dynamics deviate from the benchmark suite, the heuristic may re-introduce sensitivity that the constrained formulation was intended to remove. A concrete test would be to report performance when the target entropy is instead treated as a tunable hyperparameter or learned jointly.
Authors: We appreciate the referee's observation. The target entropy is set to the fixed heuristic -dim(A), which is a standard, environment-determined choice in the maximum-entropy framework that scales with action dimensionality and requires no per-task manual selection. The central contribution of the constrained formulation is the automatic adaptation of the temperature parameter α itself, which removes the primary source of hyperparameter sensitivity present in prior maximum-entropy methods. Our results across locomotion, manipulation, and real-robot tasks show that this combination yields stable performance without further adjustments, supporting the claim of reduced task-specific tuning. We do not believe the heuristic re-introduces brittleness in practice, as dim(A) is known a priori from the action space. Nevertheless, we will add a short sensitivity discussion and additional plots treating the target entropy as a tunable hyperparameter in the appendix of the revised manuscript.
Revision: partial
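Both the referee's comment and the reply turn on how the temperature is updated against the fixed target entropy -dim(A). The following is a minimal sketch of one such update, assuming the commonly used loss J(α) = mean(-α (log π(a|s) + H_target)). The function names, the log-parametrization of α, and the learning rate are assumptions of this sketch, not details taken from the page.

```python
import math


def target_entropy(action_dim):
    """Heuristic target entropy discussed above: -dim(A)."""
    return -float(action_dim)


def temperature_step(log_alpha, log_probs, action_dim, lr=3e-4):
    """One gradient-descent step on J(alpha) = mean(-alpha * (log_pi + H_target)).

    alpha is parametrized as exp(log_alpha) to keep it positive
    (an assumption of this sketch).
    """
    alpha = math.exp(log_alpha)
    h_target = target_entropy(action_dim)
    # Batch mean of (log pi(a|s) + H_target). It is positive when the entropy
    # estimate (-mean log pi) falls below the target entropy.
    gap = sum(lp + h_target for lp in log_probs) / len(log_probs)
    # Chain rule: dJ/d(log_alpha) = -alpha * gap.
    grad = -alpha * gap
    return log_alpha - lr * grad
```

When the policy is less random than the target, the step raises α and so strengthens the entropy bonus; when it is more random, the step lowers α. Treating the target entropy as tunable, as the referee suggests, would amount to replacing `target_entropy` with a swept value.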
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper derives the SAC algorithm from the external maximum-entropy RL objective (expected return plus entropy) using standard soft Bellman backups and a constrained optimization step for the temperature parameter via Lagrange multipliers. These steps follow directly from the framework without reducing to self-definition or renaming fitted quantities as predictions. The target entropy heuristic (-dim(A)) is an explicit design choice, not a derived result. Empirical performance claims are evaluated separately on benchmarks and do not enter the derivation. Self-citations reference prior independent work on max-entropy RL and do not form a load-bearing closed loop. The derivation remains self-contained against external benchmarks.
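The soft Bellman backup mentioned in the rationale can be illustrated with a single-transition target computation. This is a sketch under stated assumptions: the function name and arguments are hypothetical, the next action is assumed to be sampled from the current policy, and details such as target networks or multiple critics are left out.

```python
def soft_q_target(reward, done, next_q, next_log_prob, alpha, gamma=0.99):
    """Soft Bellman target for one transition.

    target = r + gamma * (1 - done) * (Q(s', a') - alpha * log pi(a'|s')),
    where a' is assumed to be sampled from the current policy at s'.
    """
    # Single-sample estimate of the soft state value V(s').
    soft_value = next_q - alpha * next_log_prob
    # Terminal transitions (done=True) do not bootstrap from the next state.
    return reward + gamma * (1.0 - float(done)) * soft_value
```

The only difference from a standard Q-learning style target is the `- alpha * next_log_prob` term, which adds the entropy bonus to the bootstrapped value.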
Axiom & Free-Parameter Ledger
free parameters (1)
- temperature
axioms (1)
- domain assumption: The maximum entropy RL framework is suitable for the target continuous-control tasks.
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
Partial fusion interpolates between neural network ensembles and weight aggregation by only fusing the most similar neurons identified via partial optimal transport, enabling flexible cost-performance tradeoffs.
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
-
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
-
Generative Actor-Critic with Soft Bridge Policies
SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.
-
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
-
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
-
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
-
Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing
A DRL policy learns racing controls from depth spectral distributions using a non-geometric physics-informed reward, achieving 12% better performance than humans on out-of-distribution tracks with under 1% of baseline...
-
SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion
SafeMind is a differentiable framework that combines probabilistic control barrier functions, semantic context encoding, and meta-adaptive risk calibration to deliver safer, lower-energy quadruped locomotion under unc...
-
Maximin Robust Bayesian Experimental Design
The paper derives a robust objective for Bayesian experimental design governed by Sibson's α-mutual information and provides PAC-Bayes lower bounds on the robust expected information gain.
-
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
RL-AWB uses reinforcement learning to optimize parameters of a statistical white-balance estimator for nighttime scenes and reports better generalization on a new multi-sensor dataset.
-
R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability
R2PS combines a proof that dynamic programming remains optimal under asynchronous evader moves, a belief preservation mechanism for partial observability, and integration into equilibrium policy generalization to prod...
-
Real-time reinforcement learning for turbulent state-dependent control in a bluff-body wake
REACT reinforcement learning agent learns a state-dependent policy from experimental measurements that suppresses coherent wake structures to reduce drag with net energy savings, outperforming baselines by 2-4x and ge...
-
Adaptive Ensemble Aggregation for Actor-Critics
AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.
-
EXPO: Stable Reinforcement Learning with Expressive Policies
EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.
-
DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
-
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
-
FP-IRL: Fokker--Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes
FP-IRL recovers MDP reward, transition, and policy from trajectories alone by using variational system identification on a Fokker-Planck potential that corresponds to reward maximization.
-
Solving Rubik's Cube with a Robot Hand
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
-
Understanding Goal Generalisation in Sequential Reinforcement Learning
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable...
-
Goal-Conditioned Agents that Learn Everything All at Once
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
-
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
Reflex formalizes axial and bilateral reflection symmetries and adds symmetry regularization to PPO and SAC, reporting better performance and sample efficiency on Gym and DMC benchmarks.
-
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
-
Debiased Model-based Representations for Sample-efficient Continuous Control
DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or ...
-
A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.
-
Offline Reinforcement Learning for Rotation Profile Control in Tokamaks
Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.
-
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.
-
Self-Predictive Representation for Autonomous UAV Object-Goal Navigation
AmelPredSto, a stochastic self-predictive representation model, outperforms other state representation learning approaches when combined with actor-critic RL for object-goal navigation in UAVs.
-
Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation
An end-to-end RL policy trained via high-fidelity differentiable simulation maps depth images straight to bodyrate commands, achieving top success rates, low jerk, and zero-shot real-world generalization up to 7.5 m/s...
-
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
-
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle
FireScope is a VLM framework that generates wildfire risk rasters together with reasoning traces, showing improved cross-continental generalization when trained on US expert maps and tested on European fire events.
-
Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
MSDP pretrains a transformer encoder via masked multisensory reconstruction and feeds the embeddings into an asymmetric actor-critic RL setup, yielding faster learning and high real-robot success rates with only 6,000...
-
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
-
High-Precision and High-Efficiency Trajectory Tracking for Excavators Based on Closed-Loop Dynamics
EfficientTrack integrates model-based learning and closed-loop dynamics to minimize tracking errors in excavator trajectory control with high efficiency and precision, outperforming prior learning-based methods in sim...
-
Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
-
Joint Scheduling of Deferrable and Nondeferrable Demand with Colocated Stochastic Supply
Optimal scheduling of deferrable demands with colocated stochastic supply and piecewise-linear pricing reduces to a finite set of three procrastination thresholds per demand class; a reinforcement learning algorithm l...
-
Koopman-Assisted Reinforcement Learning
Koopman-assisted RL reformulates max-entropy algorithms using controlled Koopman tensors and reports SOTA performance versus neural SAC on Lorenz, fluid flow, and other systems.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
Unleashing the Power of Tree-of-Thoughts for Edge-Enabled AIGC Service Provisioning
Models ToT prompting as a DAG and introduces DSAC to optimize thought assignment in edge-enabled AIGC, achieving up to 8.32% delay reduction over PPO in simulations while cutting latency over 80% versus local execution.
-
COOPO: Cyclic Offline-Online Policy Optimization Algorithm
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under covera...
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer
REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.
-
Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation
IRRL lets robots learn social navigation in the real world by incrementally updating only the differences from a base policy, matching replay-buffer methods in simulation and adapting to new settings on physical robots.
-
Delayed homomorphic reinforcement learning for environments with delayed feedback
DHRL defines belief-equivalence over augmented states to abstract away control-redundant states, preserving optimality in finite domains and yielding a deep actor-critic method that outperforms baselines on MuJoCo tasks.
-
Robust SAC-Enabled UAV-RIS Assisted Secure MISO Systems With Untrusted EH Receivers
SAC framework for robust optimization of UAV-RIS secure MISO systems with UEHRs to maximize WCSEE under CSI uncertainty.
-
Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems
PRISM-WM uses a context-aware MoE with latent orthogonalization to model hybrid dynamics and reduce rollout drift for model-based planning.
-
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle
FireScope trains a VLM on US data to output wildfire risk rasters with reasoning traces and shows improved cross-continental performance on European events compared with prior approaches.
-
Optimal control of the future via prospective learning with control
Prospective Learning with Control proves ERM asymptotically achieves the Bayes optimal policy in non-stationary reset-free settings and outperforms time-aware RL on a 1D foraging benchmark.
-
Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies
CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
-
Relative Entropy Pathwise Policy Optimization
REPPO is an on-policy RL method that combines pathwise policy gradients with relative entropy constraints to achieve stable training and high sample efficiency without replay buffers.
-
Deep Double Q-learning
Deep Double Q-learning explicitly trains two Q-functions in deep RL, outperforming Double DQN on 47 of 57 Atari games while further reducing overestimation.
-
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
KRPO uses a Kalman filter to estimate latent prompt-level reward baselines from per-group rewards in GRPO, yielding better reward curves and accuracy on math reasoning benchmarks.
-
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
FinTSB introduces a benchmark addressing diversity, standardization, and real-world applicability gaps in financial time series forecasting evaluations.
-
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
The paper presents robosuite v1.5, a MuJoCo-based modular simulation framework with benchmark environments for reproducible robot learning research.