Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine

arXiv: 1812.05905 · v2 · submitted 2018-12-13 · cs.LG · cs.AI · cs.RO · stat.ML

Pith reviewed 2026-05-13 14:28 UTC · model grok-4.3

keywords: soft actor-critic, reinforcement learning, maximum entropy, off-policy actor-critic, sample efficiency, stability, robotics control

The pith

Soft Actor-Critic achieves state-of-the-art sample efficiency and stability in reinforcement learning by maximizing both task reward and policy entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Soft Actor-Critic, an off-policy reinforcement learning method that trains agents to succeed at tasks while acting as randomly as possible. The extended version adds a constrained formulation in which the temperature parameter tunes itself during training, removing the need for manual adjustment. Evaluations on benchmark tasks and real robots show it learns faster and reaches higher performance than previous on-policy and off-policy methods. It also produces consistent results across random seeds, unlike other off-policy algorithms. These properties suggest it can address the high sample complexity and hyperparameter sensitivity that limit reinforcement learning in real-world settings.

Core claim

Soft Actor-Critic extends the maximum entropy reinforcement learning framework to off-policy actor-critic methods. The actor learns a policy that maximizes both the expected cumulative reward and the entropy of its actions. A new constrained optimization formulation automatically adjusts the temperature parameter that balances these two objectives. This combination yields state-of-the-art sample efficiency and final performance on locomotion and manipulation tasks while remaining stable across random seeds.
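
As a reading aid, the objective and the constrained temperature formulation described above can be written out as follows. This is a sketch in notation commonly used for maximum entropy RL; the symbols (\rho_\pi for the state-action distribution induced by the policy, \alpha for the temperature, \bar{\mathcal{H}} for the target entropy) are assumptions of this note and are not quoted from the page.

```latex
% Maximum entropy objective: expected reward plus temperature-weighted policy entropy
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Constrained form: maximize return subject to a minimum expected entropy
\max_{\pi} \; \mathbb{E}_{\rho_\pi}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ -\log \pi(a_t \mid s_t) \right] \ge \bar{\mathcal{H}}
\quad \forall t

% Temperature loss obtained from the dual problem, minimized over \alpha
J(\alpha) = \mathbb{E}_{a_t \sim \pi}
  \left[ -\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
```

In this form the temperature acts as a Lagrange multiplier: it grows when the policy entropy falls below the target and shrinks when the entropy exceeds it.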

What carries the argument

The maximum entropy objective in an off-policy actor-critic setup, with a constrained formulation for automatic temperature tuning, which encourages stochastic policies that explore more effectively during training.

If this is right

SAC outperforms prior on-policy and off-policy methods in both sample-efficiency and asymptotic performance.
The approach exhibits high stability, with similar performance across different random seeds, unlike other off-policy algorithms.
The constrained temperature formulation reduces sensitivity to hyperparameter choices.
These properties make the method suitable for challenging real-world robotics tasks such as quadrupedal locomotion and dexterous hand manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the stability generalizes, SAC could reduce the engineering effort needed to apply learned policies on physical hardware.
The entropy-driven exploration might help in tasks with sparse rewards where standard methods struggle.
Extending the same maximum entropy principle to discrete action spaces could broaden its applicability.

Load-bearing premise

The maximum entropy objective combined with the constrained temperature formulation will reliably produce the claimed stability and sample-efficiency gains without introducing new sensitivities or requiring task-specific adjustments beyond those described.

What would settle it

A direct comparison of SAC against other leading off-policy algorithms on multiple continuous control benchmarks, measuring both average performance and variance across at least five random seeds per task.
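
The seed-variance measurement proposed above could be summarized per task with something like the following minimal sketch. The function name `summarize_across_seeds` and the input layout (one final return per seed) are hypothetical choices for illustration; the minimum of five seeds mirrors the threshold suggested in the text.

```python
import numpy as np


def summarize_across_seeds(final_returns, min_seeds=5):
    """Summarize final returns from independent runs (one value per random seed).

    `final_returns` is a hypothetical input: a flat sequence of evaluation
    returns, one per seed, for a single algorithm on a single task.
    """
    returns = np.asarray(final_returns, dtype=float)
    if returns.ndim != 1:
        raise ValueError("expected one return per seed")
    if returns.size < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds, got {returns.size}")
    return {
        "n_seeds": int(returns.size),
        "mean": float(returns.mean()),
        # Sample standard deviation (ddof=1), since seeds are a sample of runs.
        "std": float(returns.std(ddof=1)),
        "min": float(returns.min()),
        "max": float(returns.max()),
    }
```

Comparing the `std` (and the min-to-max spread) of two algorithms on the same task gives a simple view of which one is more stable across seeds, alongside the comparison of means.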

Original abstract

Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. However, these methods typically suffer from two major challenges: high sample complexity and brittleness to hyperparameters. Both of these challenges limit the applicability of such methods to real-world domains. In this paper, we describe Soft Actor-Critic (SAC), our recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework. In this framework, the actor aims to simultaneously maximize expected return and entropy. That is, to succeed at the task while acting as randomly as possible. We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter. We systematically evaluate SAC on a range of benchmark tasks, as well as real-world challenging tasks such as locomotion for a quadrupedal robot and robotic manipulation with a dexterous hand. With these improvements, SAC achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample-efficiency and asymptotic performance. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving similar performance across different random seeds. These results suggest that SAC is a promising candidate for learning in real-world robotics tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Soft Actor-Critic (SAC), an off-policy actor-critic algorithm grounded in the maximum-entropy RL framework in which the policy maximizes both expected return and entropy. It extends the base method with a constrained optimization formulation that automatically tunes the temperature parameter, along with other modifications for faster training and improved hyperparameter stability. Systematic evaluations are reported on standard benchmark control tasks as well as real-world robotics domains (quadrupedal locomotion and dexterous-hand manipulation), with the central claims being state-of-the-art sample efficiency and asymptotic performance together with markedly higher stability across random seeds than prior off-policy methods.

Significance. If the empirical claims hold, the work is significant for deep RL because it directly targets the twin obstacles of sample complexity and hyperparameter brittleness that have limited real-world deployment. The constrained temperature formulation supplies a principled, largely automatic mechanism for balancing exploration and exploitation within the maximum-entropy objective, and the reported success on hardware robotics tasks supplies concrete evidence of practical utility. The stability result across seeds is particularly valuable for reproducibility.

major comments (1)

[Temperature tuning / constrained formulation] The target entropy is fixed to the heuristic value -dim(A). This choice is load-bearing for the claim that the method requires no task-specific adjustments beyond the described auto-tuning of temperature; on tasks whose action spaces or dynamics deviate from the benchmark suite, the heuristic may re-introduce sensitivity that the constrained formulation was intended to remove. A concrete test would be to report performance when the target entropy is instead treated as a tunable hyperparameter or learned jointly.
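
To make the referee's point concrete, here is a minimal sketch of how a temperature update with the -dim(A) target entropy is often written. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the temperature is parameterized through its logarithm to keep it positive, and a plain gradient step stands in for whatever optimizer a real implementation uses.

```python
import numpy as np


def heuristic_target_entropy(action_dim):
    """Target entropy heuristic discussed above: -dim(A)."""
    return -float(action_dim)


def temperature_step(log_alpha, log_probs, action_dim, lr=3e-4):
    """One gradient-descent step on J(alpha) = E[-alpha * (log pi(a|s) + H_bar)].

    `log_probs` holds log pi(a|s) for a batch of actions sampled from the
    current policy. Optimizing log(alpha) rather than alpha is an assumption
    of this sketch.
    """
    h_bar = heuristic_target_entropy(action_dim)
    alpha = np.exp(log_alpha)
    # dJ/d(log_alpha) = -alpha * E[log pi + H_bar], by the chain rule.
    grad = -alpha * np.mean(np.asarray(log_probs, dtype=float) + h_bar)
    return log_alpha - lr * grad
```

When the policy's entropy (the negative mean of `log_probs`) is below the target, the step raises the temperature; when it is above, the step lowers it. The referee's concern is that `heuristic_target_entropy` itself remains a fixed choice, so replacing it with a tunable value is the experiment the comment asks for.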

minor comments (1)

[Abstract] The abstract asserts state-of-the-art results without any quantitative anchors (e.g., percentage improvement or specific baseline names); adding one or two such numbers would strengthen the summary for readers who do not reach the experimental section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the stability and robotics results, and recommendation for minor revision. We address the single major comment below.

Referee: [Temperature tuning / constrained formulation] The target entropy is fixed to the heuristic value -dim(A). This choice is load-bearing for the claim that the method requires no task-specific adjustments beyond the described auto-tuning of temperature; on tasks whose action spaces or dynamics deviate from the benchmark suite, the heuristic may re-introduce sensitivity that the constrained formulation was intended to remove. A concrete test would be to report performance when the target entropy is instead treated as a tunable hyperparameter or learned jointly.

Authors: We appreciate the referee's observation. The target entropy is set to the fixed heuristic -dim(A), which is a standard, environment-determined choice in the maximum-entropy framework that scales with action dimensionality and requires no per-task manual selection. The central contribution of the constrained formulation is the automatic adaptation of the temperature parameter α itself, which removes the primary source of hyperparameter sensitivity present in prior maximum-entropy methods. Our results across locomotion, manipulation, and real-robot tasks show that this combination yields stable performance without further adjustments, supporting the claim of reduced task-specific tuning. We do not believe the heuristic re-introduces brittleness in practice, as dim(A) is known a priori from the action space. Nevertheless, we will add a short sensitivity discussion and additional plots treating the target entropy as a tunable hyperparameter in the appendix of the revised manuscript.

revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper derives the SAC algorithm from the external maximum-entropy RL objective (expected return plus entropy) using standard soft Bellman backups and a constrained optimization step for the temperature parameter via Lagrange multipliers. These steps follow directly from the framework without reducing to self-definition or renaming fitted quantities as predictions. The target entropy heuristic (-dim(A)) is an explicit design choice, not a derived result. Empirical performance claims are evaluated separately on benchmarks and do not enter the derivation. Self-citations reference prior independent work on max-entropy RL and do not form a load-bearing closed loop. The derivation remains self-contained against external benchmarks.
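
For readers unfamiliar with the soft Bellman backup mentioned in this rationale, the regression target for the critic can be sketched as below. This is a hedged illustration: the function name and argument layout are hypothetical, and taking the minimum over two critic estimates follows common SAC implementations rather than anything stated on this page.

```python
import numpy as np


def soft_q_target(reward, done, next_q1, next_q2, next_log_prob, alpha, gamma=0.99):
    """Soft Bellman target: r + gamma * (1 - done) * (min Q(s', a') - alpha * log pi(a'|s')).

    All array arguments are batches evaluated at the next state s' with an
    action a' sampled from the current policy. The two-critic minimum is an
    assumption of this sketch.
    """
    next_q = np.minimum(next_q1, next_q2)
    # Soft state value: critic estimate plus the entropy bonus -alpha * log pi.
    soft_value = next_q - alpha * np.asarray(next_log_prob, dtype=float)
    not_done = 1.0 - np.asarray(done, dtype=float)
    return np.asarray(reward, dtype=float) + gamma * not_done * soft_value
```

The only difference from a standard Bellman target is the `- alpha * next_log_prob` term, which is where the entropy part of the objective enters the value estimate.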

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the appropriateness of the maximum-entropy objective for control tasks and on the constrained optimization successfully replacing manual temperature tuning without side effects.

free parameters (1)

temperature
Automatically tuned via the constrained formulation but remains the key scalar balancing reward and entropy in the objective.

axioms (1)

domain assumption: Maximum entropy RL framework is suitable for the target continuous-control tasks
The actor is defined to maximize both expected return and entropy simultaneously.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation
cs.LG 2026-05 unverdicted novelty 7.0

Partial fusion interpolates between neural network ensembles and weight aggregation by only fusing the most similar neurons identified via partial optimal transport, enabling flexible cost-performance tradeoffs.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
cs.LG 2026-05 unverdicted novelty 7.0

HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
cs.LG 2026-05 unverdicted novelty 7.0

CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
quant-ph 2026-05 unverdicted novelty 7.0

TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
cs.LG 2026-05 unverdicted novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
Generative Actor-Critic with Soft Bridge Policies
cs.LG 2026-05 unverdicted novelty 7.0

SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
cs.RO 2026-04 unverdicted novelty 7.0

A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
cs.RO 2026-04 unverdicted novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
Physics-Informed Reinforcement Learning of Spatial Density Velocity Potentials for Map-Free Racing
cs.RO 2026-04 unverdicted novelty 7.0

A DRL policy learns racing controls from depth spectral distributions using a non-geometric physics-informed reward, achieving 12% better performance than humans on out-of-distribution tracks with under 1% of baseline...
SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion
cs.RO 2026-04 unverdicted novelty 7.0

SafeMind is a differentiable framework that combines probabilistic control barrier functions, semantic context encoding, and meta-adaptive risk calibration to deliver safer, lower-energy quadruped locomotion under unc...
Maximin Robust Bayesian Experimental Design
stat.ML 2026-03 unverdicted novelty 7.0

The paper derives a robust objective for Bayesian experimental design governed by Sibson's α-mutual information and provides PAC-Bayes lower bounds on the robust expected information gain.
RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
cs.CV 2026-01 unverdicted novelty 7.0

RL-AWB uses reinforcement learning to optimize parameters of a statistical white-balance estimator for nighttime scenes and reports better generalization on a new multi-sensor dataset.
R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability
cs.LG 2025-11 unverdicted novelty 7.0

R2PS combines a proof that dynamic programming remains optimal under asynchronous evader moves, a belief preservation mechanism for partial observability, and integration into equilibrium policy generalization to prod...
Real-time reinforcement learning for turbulent state-dependent control in a bluff-body wake
physics.flu-dyn 2025-09 unverdicted novelty 7.0

REACT reinforcement learning agent learns a state-dependent policy from experimental measurements that suppresses coherent wake structures to reduce drag with net energy savings, outperforming baselines by 2-4x and ge...
Adaptive Ensemble Aggregation for Actor-Critics
cs.LG 2025-07 unverdicted novelty 7.0

AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.
EXPO: Stable Reinforcement Learning with Expressive Policies
cs.LG 2025-07 conditional novelty 7.0

EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.
DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
cs.LG 2025-06 unverdicted novelty 7.0

DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
cs.LG 2025-06 unverdicted novelty 7.0

Differentiable relaxation of LTL automata via soft labeling enables gradient-based RL from formal specifications, with theoretical bounds on discrete-differentiable discrepancy and up to 2x returns on nonlinear tasks.
FP-IRL: Fokker--Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes
cs.LG 2023-06 unverdicted novelty 7.0

FP-IRL recovers MDP reward, transition, and policy from trajectories alone by using variational system identification on a Fokker-Planck potential that corresponds to reward maximization.
Solving Rubik's Cube with a Robot Hand
cs.LG 2019-10 accept novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Understanding Goal Generalisation in Sequential Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable...
Goal-Conditioned Agents that Learn Everything All at Once
cs.LG 2026-05 unverdicted novelty 6.0

LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

Reflex formalizes axial and bilateral reflection symmetries and adds symmetry regularization to PPO and SAC, reporting better performance and sample efficiency on Gym and DMC benchmarks.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
cs.LG 2026-05 unverdicted novelty 6.0

R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
Debiased Model-based Representations for Sample-efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or ...
A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
cs.LG 2026-05 unverdicted novelty 6.0

A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.
Offline Reinforcement Learning for Rotation Profile Control in Tokamaks
cs.LG 2026-05 unverdicted novelty 6.0

Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.
Counter-Dyna: Data-Efficient RL-Based HVAC Control using Counterfactual Building Models
cs.LG 2026-05 unverdicted novelty 6.0

Counter-Dyna reduces RL training data for HVAC control to five weeks by using counterfactual surrogate models that ignore uncontrollable variables like weather and prices.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
cs.RO 2026-04 unverdicted novelty 6.0

Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.
Self-Predictive Representation for Autonomous UAV Object-Goal Navigation
cs.RO 2026-04 unverdicted novelty 6.0

AmelPredSto, a stochastic self-predictive representation model, outperforms other state representation learning approaches when combined with actor-critic RL for object-goal navigation in UAVs.
Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation
cs.RO 2026-04 conditional novelty 6.0

An end-to-end RL policy trained via high-fidelity differentiable simulation maps depth images straight to bodyrate commands, achieving top success rates, low jerk, and zero-shot real-world generalization up to 7.5 m/s...
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
cs.RO 2026-03 unverdicted novelty 6.0

SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle
cs.CV 2025-11 unverdicted novelty 6.0

FireScope is a VLM framework that generates wildfire risk rasters together with reasoning traces, showing improved cross-continental generalization when trained on US expert maps and tested on European fire events.
Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
cs.RO 2025-11 unverdicted novelty 6.0

MSDP pretrains a transformer encoder via masked multisensory reconstruction and feeds the embeddings into an asymmetric actor-critic RL setup, yielding faster learning and high real-robot success rates with only 6,000...
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
cs.LG 2025-10 unverdicted novelty 6.0

MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
High-Precision and High-Efficiency Trajectory Tracking for Excavators Based on Closed-Loop Dynamics
cs.RO 2025-09 unverdicted novelty 6.0

EfficientTrack integrates model-based learning and closed-loop dynamics to minimize tracking errors in excavator trajectory control with high efficiency and precision, outperforming prior learning-based methods in sim...
Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
cs.LG 2025-09 conditional novelty 6.0

Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
Joint Scheduling of Deferrable and Nondeferrable Demand with Colocated Stochastic Supply
eess.SY 2025-07 unverdicted novelty 6.0

Optimal scheduling of deferrable demands with colocated stochastic supply and piecewise-linear pricing reduces to a finite set of three procrastination thresholds per demand class; a reinforcement learning algorithm l...
Koopman-Assisted Reinforcement Learning
cs.AI 2024-03 unverdicted novelty 6.0

Koopman-assisted RL reformulates max-entropy algorithms using controlled Koopman tensors and reports SOTA performance versus neural SAC on Lorenz, fluid flow, and other systems.
TD-MPC2: Scalable, Robust World Models for Continuous Control
cs.LG 2023-10 conditional novelty 6.0

TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
Unleashing the Power of Tree-of-Thoughts for Edge-Enabled AIGC Service Provisioning
cs.DC 2026-05 unverdicted novelty 5.0

Models ToT prompting as a DAG and introduces DSAC to optimize thought assignment in edge-enabled AIGC, achieving up to 8.32% delay reduction over PPO in simulations while cutting latency over 80% versus local execution.
COOPO: Cyclic Offline-Online Policy Optimization Algorithm
cs.LG 2026-05 unverdicted novelty 5.0

COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under covera...
Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO 2026-05 unverdicted novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer
cs.RO 2026-05 unverdicted novelty 5.0

REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.
Incremental Residual Reinforcement Learning Toward Real-World Learning for Social Navigation
cs.RO 2026-04 unverdicted novelty 5.0

IRRL lets robots learn social navigation in the real world by incrementally updating only the differences from a base policy, matching replay-buffer methods in simulation and adapting to new settings on physical robots.
Delayed homomorphic reinforcement learning for environments with delayed feedback
cs.LG 2026-04 unverdicted novelty 5.0

DHRL defines belief-equivalence over augmented states to abstract away control-redundant states, preserving optimality in finite domains and yielding a deep actor-critic method that outperforms baselines on MuJoCo tasks.
Robust SAC-Enabled UAV-RIS Assisted Secure MISO Systems With Untrusted EH Receivers
eess.SP 2026-02 unverdicted novelty 5.0

SAC framework for robust optimization of UAV-RIS secure MISO systems with UEHRs to maximize WCSEE under CSI uncertainty.
Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems
cs.AI 2025-12 unverdicted novelty 5.0

PRISM-WM uses a context-aware MoE with latent orthogonalization to model hybrid dynamics and reduce rollout drift for model-based planning.
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle
cs.CV 2025-11 unverdicted novelty 5.0

FireScope trains a VLM on US data to output wildfire risk rasters with reasoning traces and shows improved cross-continental performance on European events compared with prior approaches.
Optimal control of the future via prospective learning with control
stat.ML 2025-11 unverdicted novelty 5.0

Prospective Learning with Control proves ERM asymptotically achieves the Bayes optimal policy in non-stationary reset-free settings and outperforms time-aware RL on a 1D foraging benchmark.
Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies
cs.LG 2025-08 unverdicted novelty 5.0

CoSER adaptively samples joint actions in CTDE MARL to reduce sampling error relative to the joint on-policy distribution, empirically improving reliability of independent policy gradient convergence.
Relative Entropy Pathwise Policy Optimization
cs.LG 2025-07 unverdicted novelty 5.0

REPPO is an on-policy RL method that combines pathwise policy gradients with relative entropy constraints to achieve stable training and high sample efficiency without replay buffers.
Deep Double Q-learning
cs.LG 2025-06 unverdicted novelty 5.0

Deep Double Q-learning explicitly trains two Q-functions in deep RL, outperforming Double DQN on 47 of 57 Atari games while further reducing overestimation.
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
cs.LG 2025-05 conditional novelty 5.0

KRPO uses a Kalman filter to estimate latent prompt-level reward baselines from per-group rewards in GRPO, yielding better reward curves and accuracy on math reasoning benchmarks.
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
cs.CE 2025-02 unverdicted novelty 5.0

FinTSB introduces a benchmark addressing diversity, standardization, and real-world applicability gaps in financial time series forecasting evaluations.
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
cs.RO 2020-09 unverdicted novelty 5.0

The paper presents robosuite v1.5, a MuJoCo-based modular simulation framework with benchmark environments for reproducible robot learning research.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 63 Pith papers · 3 internal anchors

[1]

Maximum a Posteriori Policy Optimisation

Abdolmaleki, A., Springenberg, J. T., Tassa, Y ., Munos, R., Heess, N., and Riedmiller, M. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920,

work page Pith review arXiv
[2]

OpenAI Gym

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym. arXiv preprint arXiv:1606.01540,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Addressing Function Approximation Error in Actor-Critic Methods

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477,

work page Pith review arXiv
[4]

The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651,

work page Pith review arXiv
[5]

Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic

Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247,

work page Pith review arXiv
[6]

Deep Reinforcement Learning that Matters

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560,

work page Pith review arXiv
[7]

Kappen, H. J. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011,

work page 2005
[8]

Continuous control with deep reinforcement learning

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Playing Atari with Deep Reinforcement Learning

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.01891,

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2772–2782, 2017a. Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. arXiv preprint arXiv:1707.018...

work page arXiv
[11]

Equivalence Between Policy Gradients and Soft Q-Learning

Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmil...

work page Pith review arXiv
[12]

Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost

Zhu, H., Gupta, A., Rajeswaran, A., Levine, S., and Kumar, V. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. arXiv preprint arXiv:1810.06045,

work page Pith review arXiv
[13]

In that sense, discounted policy gradients typically do not optimize the true discounted objective

Appendix A Infinite Horizon Discounted Maximum Entropy Objective The exact definition of the discounted maximum entropy objective is complicated by the fact that, when using a discount factor for policy gradient methods, we typically do not discount the state distribution, only the rewards. In that sense, discounted policy gradients typically do not optimiz...

work page 2014
