
arxiv: 2506.12622 · v2 · submitted 2025-06-14 · 💻 cs.LG · cs.AI · math.OC

DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty

Pith reviewed 2026-05-19 09:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords distributionally robust reinforcement learning · soft actor-critic · offline RL · continuous action spaces · KL divergence · uncertainty set · generative modeling

The pith

DR-SAC extends actor-critic methods to distributionally robust offline RL in continuous spaces by optimizing against worst-case transitions in a KL ball.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DR-SAC to train policies that maximize entropy-regularized rewards while guarding against the worst transition models inside a KL-divergence ball around an estimated nominal model. This setup addresses the practical problem that standard RL agents like SAC lose performance when real transition dynamics differ from those seen during training. A reader would care because the method supplies the first actor-critic formulation for continuous actions and offline data, together with a convergence guarantee for its robust policy iteration. Experiments on five continuous tasks show large reward gains under perturbation while also improving computational efficiency over earlier DR-RL approaches.

Core claim

DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence constrained uncertainty set. The algorithm derives the distributionally robust version of soft policy iteration, proves its convergence, and uses generative modeling to estimate the unknown nominal transition model. This construction enables the first actor-critic DR-RL method that operates in continuous action spaces for offline learning.
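
In symbols, this objective can be sketched as a robust soft Bellman backup. The notation is assumed for illustration and may differ from the paper's: $\hat{P}$ is the estimated nominal kernel, $\varepsilon$ the KL radius, $\alpha$ the entropy temperature, $\gamma$ the discount factor, and the uncertainty set is assumed to be built separately for each state-action pair.

```latex
\begin{aligned}
\mathcal{B}(\hat{P}, \varepsilon)(s,a)
  &= \bigl\{ P : D_{\mathrm{KL}}\bigl(P \,\|\, \hat{P}(\cdot \mid s,a)\bigr) \le \varepsilon \bigr\}, \\
V(s)
  &= \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[ Q(s,a) - \alpha \log \pi(a \mid s) \bigr], \\
(\mathcal{T}^{\pi} Q)(s,a)
  &= r(s,a) + \gamma \inf_{P \in \mathcal{B}(\hat{P}, \varepsilon)(s,a)}
     \mathbb{E}_{s' \sim P}\bigl[ V(s') \bigr].
\end{aligned}
```

Under this reading, setting $\varepsilon = 0$ collapses the infimum to the nominal expectation and recovers the usual soft Bellman backup of SAC.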

What carries the argument

Distributionally robust soft policy iteration, which replaces the standard expectation over next-state transitions with a worst-case expectation taken over the KL ball around the nominal model.
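
The worst-case expectation over a KL ball can be illustrated numerically through a standard duality result for KL-constrained distributionally robust optimization, which turns the search over distributions into a one-dimensional search over a scalar β > 0. The sketch below is not the paper's implementation: it assumes a small discrete next-state distribution, uses a plain grid search, and the names `kl_robust_expectation` and `n_grid` are hypothetical.

```python
import math


def kl_robust_expectation(values, probs, radius, n_grid=2000):
    """Approximate inf over {P : KL(P || P_hat) <= radius} of E_P[values].

    Uses the dual form sup_{beta > 0} -beta * log E_hat[exp(-V / beta)] - beta * radius,
    evaluated on a log-spaced grid of beta (an illustrative choice, not the paper's).
    """
    support = [(v, p) for v, p in zip(values, probs) if p > 0]
    nominal = sum(p * v for v, p in support)
    if radius <= 0:
        return nominal  # a zero-radius ball contains only the nominal model
    v_min = min(v for v, _ in support)
    best = -math.inf
    for i in range(n_grid):
        beta = 10.0 ** (-3.0 + 6.0 * i / (n_grid - 1))  # beta from 1e-3 to 1e3
        # Shift by v_min so every exponent is <= 0 and cannot overflow.
        log_mgf = math.log(sum(p * math.exp(-(v - v_min) / beta) for v, p in support))
        best = max(best, v_min - beta * log_mgf - beta * radius)
    return best


# Toy next-state values and nominal probabilities (made up for illustration).
values = [1.0, 0.0, -2.0]
probs = [0.5, 0.3, 0.2]
for radius in (0.0, 0.1, 0.5):
    print(radius, kl_robust_expectation(values, probs, radius))
```

Each grid point gives a lower bound on the true worst-case value, so the result is slightly conservative. A larger radius can only lower the returned value, which is the sense in which the radius controls the degree of robustness.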

If this is right

  • DR-SAC achieves up to 9.8 times higher average reward than the SAC baseline under common perturbations.
  • DR-SAC improves computing efficiency and applicability to large-scale problems relative to prior DR-RL algorithms.
  • The method supplies a convergence guarantee for its distributionally robust soft policy iteration.
  • It supports offline learning directly in continuous action spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the nominal model is estimated well from offline data, the approach could narrow the gap between simulated training and real-world execution in robotics.
  • The KL-ball construction might be combined with other base RL algorithms to test whether the robustness gain is specific to the soft actor-critic objective.
  • Structured uncertainties such as sensor bias or actuator wear could be used to check whether the current uncertainty set remains sufficiently expressive.

Load-bearing premise

The generative modeling step must produce a nominal transition model accurate enough that the KL ball around it covers the uncertainty actually met at deployment.
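
One hypothetical way to probe this premise is to measure how far a deployment-time transition distribution sits from the nominal one and compare that distance with the radius. The sketch below assumes, purely for illustration, that both next-state distributions are diagonal Gaussians, so the KL divergence has a closed form. The paper's generative model need not have this form, and the function names are made up.

```python
import math


def kl_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for diagonal Gaussians P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2)."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def inside_kl_ball(deploy, nominal, radius):
    """True if the deployment model lies in {P : KL(P || nominal) <= radius}.

    `deploy` and `nominal` are (mean, std) pairs of per-dimension lists.
    """
    return kl_diag_gaussian(deploy[0], deploy[1], nominal[0], nominal[1]) <= radius


# Assumed toy models: the nominal next state is N(0, 1) in one dimension.
nominal = ([0.0], [1.0])
small_shift = ([0.1], [1.0])  # mild perturbation of the mean
large_shift = ([1.0], [1.0])  # strong perturbation of the mean
radius = 0.05
print(inside_kl_ball(small_shift, nominal, radius))
print(inside_kl_ball(large_shift, nominal, radius))
```

In practice the deployment distribution is unknown, so a check of this kind could only be run on held-out or deliberately perturbed trajectories.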

What would settle it

If DR-SAC is applied to a continuous control task whose true perturbations lie outside the estimated KL ball and it then fails to outperform plain SAC, the robustness benefit would be falsified.

Figures

Figures reproduced from arXiv: 2506.12622 by Duo Zhou, Grani A. Hanasusanto, Huan Zhang, Mingxuan Cui, Qiong Wang, Yuxuan Han, Zhengyuan Zhou.

Figure 1: Robustness performance in different environments under perturbations. The curves show the …
Figure 2: Pendulum results on TD3-Dataset. The curves show the average reward of 50 episodes, shaded by ±0.5 standard deviation.
  Cartpole: For the Cartpole environment, we compare the DRSAC algorithm with non-robust algorithms SAC, DDPG, FQI, and robust algorithm RFQI. All algorithms are trained on the SAC-Dataset. In Cartpole environment, the force applied to the cart is continuous and determined by the actuator’s ac…
Figure 3: (a) Observation Perturbation: Gaussian noise added to nominal states. (b) “Force_mag” Perturbation: model parameter force_mag change.
  Figure 3(a): DRSAC has performance improvement over 75% compared to non-robust algorithms SAC and DDPG when the standard deviation of noise is 0.2 and 0.3.
Figure 4: LunarLander results on TD3-Dataset. The curves show the average reward of 50 episodes, shaded by ±0.5 standard deviation.
Figure 5: HalfCheetah results on TD3-Dataset. The curves show the average reward of 50 episodes, shaded by ±0.5 standard deviation.
  C.3 Ablation Study Details. C.3.1 Training Efficiency of DR-SAC. In this section, we want to show that DR-SAC with functional optimization finds a good balance between efficiency and accuracy. We compare training time and robustness of Algorithm 1, DR-SAC without functional optimization, …
Figure 6: Pendulum results on TD3-Dataset. Curves show average reward of 50 episodes, shaded by ±0.5 standard deviation. Algorithms are SAC, DR-SAC with and without functional approximation.
  Efficiency Comparison with RFQI: In Section 4.2, we test and compare the robustness of DR-SAC with other algorithms and robust algorithm RFQI also shows comparable performance under perturbation in most environments. We want to s…
Figure 7: Average Reward of 20 Episodes over Training Time in …
Figure 8: Average Reward of 20 Episodes over Training Step in …
Original abstract

Deep reinforcement learning (RL) has achieved remarkable success, yet its deployment in real-world scenarios is often limited by vulnerability to environmental uncertainties. Distributionally robust RL (DR-RL) algorithms have been proposed to resolve this challenge, but existing approaches are largely restricted to value-based methods in tabular settings. In this work, we introduce Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic based DR-RL algorithm for offline learning in continuous action spaces. DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence constrained uncertainty set. We derive the distributionally robust version of the soft policy iteration with a convergence guarantee and incorporate a generative modeling approach to estimate the unknown nominal transition models. Experiment results on five continuous RL tasks demonstrate our algorithm achieves up to 9.8 times higher average reward than the SAC baseline under common perturbations. Additionally, DR-SAC significantly improves computing efficiency and applicability to large-scale problems compared with existing DR-RL algorithms. Code is publicly available at github.com/Lemutisme/DR-SAC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic DR-RL method for offline learning in continuous action spaces. It extends soft actor-critic by maximizing entropy-regularized rewards against worst-case transition models inside a KL-divergence ball around a nominal transition kernel estimated via generative modeling. The authors derive a distributionally robust version of soft policy iteration with a claimed convergence guarantee, and report up to 9.8 times higher average reward than SAC on five continuous control tasks under common perturbations, along with improved computational efficiency over prior DR-RL algorithms.

Significance. If the convergence guarantee holds and the generative modeling step reliably places the true deployment dynamics inside the KL-ball, the work would be significant for extending distributionally robust RL to practical continuous-control settings where actor-critic methods dominate. The public code release and empirical evaluation on five tasks under perturbations are positive features that support reproducibility and applicability claims. However, the significance is tempered by the lack of explicit verification that the estimated nominal model satisfies the coverage condition required for the robustness guarantee to translate to real uncertainty.

major comments (2)
  1. The central robustness claim rests on the generative modeling step producing a nominal transition estimate P̂ such that the true (unknown) dynamics P* lie inside the KL-ball B(P̂, ε) for the chosen radius. The manuscript reports gains under 'common perturbations' but provides no quantitative diagnostic (e.g., estimated KL distance between P̂ and held-out perturbed trajectories, or sensitivity analysis over ε) confirming that condition (i) in the skeptic note holds. Without this check, the observed improvement could be an artifact of the particular test perturbations rather than a consequence of the distributionally robust objective.
  2. The derivation of the distributionally robust soft policy iteration is presented as a direct extension that yields a convergence guarantee. However, the manuscript does not show that the resulting robust Bellman operator is a contraction (or satisfies the conditions for the guarantee) independently of the specific generative model; the circularity concern that the construction reduces to a fitted parameter by the paper's own equations remains unaddressed in the provided theoretical section.
minor comments (2)
  1. The abstract states 'up to 9.8 times higher average reward' but the experimental section should clarify whether this is the maximum over tasks or an average, and include standard errors or statistical significance tests across the five environments.
  2. Notation for the uncertainty set radius ε and the generative model training procedure should be introduced earlier and used consistently when describing how the nominal kernel is obtained from offline data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our results and theory.

Point-by-point responses
  1. Referee: The central robustness claim rests on the generative modeling step producing a nominal transition estimate P̂ such that the true (unknown) dynamics P* lie inside the KL-ball B(P̂, ε) for the chosen radius. The manuscript reports gains under 'common perturbations' but provides no quantitative diagnostic (e.g., estimated KL distance between P̂ and held-out perturbed trajectories, or sensitivity analysis over ε) confirming that condition (i) in the skeptic note holds. Without this check, the observed improvement could be an artifact of the particular test perturbations rather than a consequence of the distributionally robust objective.

    Authors: We agree that explicit verification of the coverage condition would strengthen the robustness claims. In the revised manuscript we will add a new subsection with quantitative diagnostics: estimated KL distances between the learned nominal model and held-out trajectories collected under the perturbed dynamics, together with a sensitivity analysis of performance across a range of ε values. These additions will directly address whether the true dynamics lie inside the uncertainty set for the radii used in our experiments.
    revision: yes

  2. Referee: The derivation of the distributionally robust soft policy iteration is presented as a direct extension that yields a convergence guarantee. However, the manuscript does not show that the resulting robust Bellman operator is a contraction (or satisfies the conditions for the guarantee) independently of the specific generative model; the circularity concern that the construction reduces to a fitted parameter by the paper's own equations remains unaddressed in the provided theoretical section.

    Authors: We thank the referee for this observation. The convergence proof for the robust soft policy iteration appears in the appendix and establishes that the robust Bellman operator is a contraction under the standard assumptions on the entropy-regularized objective and the KL-ball uncertainty set. The contraction property holds for any nominal kernel inside the ball and does not rely on the particular generative model used to estimate that kernel. To eliminate any appearance of circularity, we will add a short clarifying paragraph in Section 4 that explicitly states the independence from the generative-model details and references the relevant steps in the appendix proof.
    revision: yes

Circularity Check

0 steps flagged

Derivation of robust soft policy iteration presented as extension; no reduction to fitted inputs or self-citation chains

full rationale

The paper derives a distributionally robust variant of soft policy iteration from the standard SAC framework and incorporates a separate generative modeling step to estimate the nominal transition kernel before applying the KL-ball. No equation in the provided abstract or skeptic analysis shows a claimed prediction or first-principles result that algebraically equals a fitted parameter or prior self-citation by construction. The generative modeling step is an input to the robust objective rather than a tautological output. This yields a self-contained derivation chain against external benchmarks such as standard SAC, with only minor self-citation risk at the level of the base algorithm.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard RL convergence results for soft policy iteration and on the modeling assumption that a generative model trained on offline data yields a usable nominal transition kernel around which a KL-ball can be constructed.

free parameters (1)
  • KL-ball radius
    The size of the uncertainty set is a tunable hyperparameter that controls the degree of robustness.
axioms (2)
  • standard math: Soft policy iteration converges to an optimal policy under standard entropy-regularized MDP assumptions.
    The robust extension builds directly on this background result.
  • domain assumption: The nominal transition model can be estimated from offline data via generative modeling.
    This step is required to define the center of the KL uncertainty set.



Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 5 internal anchors

  1. [1]

    Uncertainty-based offline reinforcement learning with diversified q-ensemble.Advances in neural information processing systems, 34:7436–7447, 2021

    Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble.Advances in neural information processing systems, 34:7436–7447, 2021

  2. [2]

    Deep reinforcement learning: A brief survey.IEEE Signal Processing Magazine, 34(6):26–38, 2017

    Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey.IEEE Signal Processing Magazine, 34(6):26–38, 2017

  3. [3]

    Deep generative models for offline policy learning: Tutorial, survey, and perspectives on future directions.arXiv preprint arXiv:2402.13777, 2024

    Jiayu Chen, Bhargav Ganguly, Yang Xu, Yongsheng Mei, Tian Lan, and Vaneet Aggarwal. Deep generative models for offline policy learning: Tutorial, survey, and perspectives on future directions.arXiv preprint arXiv:2402.13777, 2024

  4. [4]

    Corrected soft actor critic for continuous control.arXiv preprint arXiv:2410.16739, 2024

    Yanjun Chen, Xinming Zhang, Xianghui Wang, Zhiqiang Xu, Xiaoyu Shen, and Wei Zhang. Corrected soft actor critic for continuous control.arXiv preprint arXiv:2410.16739, 2024

  5. [5]

    Adversarially trained actor critic for offline reinforcement learning

    Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. InInternational Conference on Machine Learning, pages 3852–3878. PMLR, 2022

  6. [6]

    Towards minimax optimality of model- based robust reinforcement learning.arXiv preprint arXiv:2302.05372, 2023

    Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax optimality of model- based robust reinforcement learning.arXiv preprint arXiv:2302.05372, 2023

  7. [7]

    Soft-Robust Actor-Critic Policy-Gradient

    Esther Derman, Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. Soft-robust actor- critic policy-gradient.arXiv preprint arXiv:1803.04848, 2018

  8. [8]

    Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

    Esther Derman and Shie Mannor. Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

  9. [9]

    Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts.arXiv preprint arXiv:2402.09992, 2024

    Tobias Enders, James Harrison, and Maximilian Schiffer. Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts.arXiv preprint arXiv:2402.09992, 2024

  10. [10]

    An introduction to deep reinforcement learning.Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018

    Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al. An introduction to deep reinforcement learning.Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018

  11. [11]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning, pages 1582–1591, 2018

  12. [12]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

  13. [13]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

  14. [14]

    Soft Actor-Critic Algorithms and Applications

    Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications.arXiv preprint arXiv:1812.05905, 2018

  15. [15]

    Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

    Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

  16. [16]

    Kullback-leibler divergence constrained distributionally robust optimization.Available at Optimization Online, 1(2):9, 2013

    Zhaolin Hu and L Jeff Hong. Kullback-leibler divergence constrained distributionally robust optimization.Available at Optimization Online, 1(2):9, 2013

  17. [17]

    Garud N. Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

  18. [18]

    David Jacobson. Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games.IEEE Transactions on Automatic control, 18(2):124–131, 1973. 10

  19. [19]

    Auto-encoding variational bayes, 2013

    Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013

  20. [20]

    Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint.IEEE Access, 10:130058– 130067, 2022

    Arash Bahari Kordabad, Rafael Wisniewski, and Sebastien Gros. Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint.IEEE Access, 10:130058– 130067, 2022

  21. [21]

    Stabilizing off- policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019

    Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off- policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019

  22. [22]

    Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179– 1191, 2020

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179– 1191, 2020

  23. [23]

    Single- trajectory distributionally robust reinforcement learning, 2024

    Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jiheng Zhang, and Zhengyuan Zhou. Single- trajectory distributionally robust reinforcement learning, 2024

  24. [24]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

  25. [25]

    Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning.arXiv preprint arXiv:2403.09621, 2024

    Zhishuai Liu and Pan Xu. Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning.arXiv preprint arXiv:2403.09621, 2024

  26. [26]

    Distributionally robust q-learning

    Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust q-learning. InInternational Conference on Machine Learning, pages 13623–13643. PMLR, 2022

  27. [27]

    Soft-robust algorithms for batch reinforcement learning.arXiv preprint arXiv:2011.14495, 2020

    Elita A Lobo, Mohammad Ghavamzadeh, and Marek Petrik. Soft-robust algorithms for batch reinforcement learning.arXiv preprint arXiv:2011.14495, 2020

  28. [28]

    Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm

    Miao Lu, Han Zhong, Tong Zhang, and Jose Blanchet. Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578, 2024

  29. [29]

    Mildly conservative q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 35:1711–1724, 2022

    Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 35:1711–1724, 2022

  30. [30]

    Distributionally robust offline reinforcement learning with linear function approximation, 2023

    Xiaoteng Ma, Zhipeng Liang, Jose Blanchet, Mingwen Liu, Li Xia, Jiheng Zhang, Qianchuan Zhao, and Zhengyuan Zhou. Distributionally robust offline reinforcement learning with linear function approximation, 2023

  31. [31]

    Robust re- inforcement learning for continuous control with model misspecification.arXiv preprint arXiv:1906.07516, 2019

    Daniel J Mankowitz, Nir Levine, Rae Jeong, Yuanyuan Shi, Jackie Kay, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy Mann, Todd Hester, and Martin Riedmiller. Robust re- inforcement learning for continuous control with model misspecification.arXiv preprint arXiv:1906.07516, 2019

  32. [32]

    An experimental design perspective on model-based reinforcement learning.arXiv preprint arXiv:2112.05244, 2021

    Viraj Mehta, Biswajit Paria, Jeff Schneider, Stefano Ermon, and Willie Neiswanger. An experimental design perspective on model-based reinforcement learning.arXiv preprint arXiv:2112.05244, 2021

  33. [33]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  34. [34]

    Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

    Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

  35. [35]

    Risk averse robust adversarial reinforce- ment learning

    Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforce- ment learning. In2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019. 11

  36. [36]

    Sample complexity of robust reinforcement learning with a generative model

    Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR, 2022

  37. [37]

    Robust rein- forcement learning using offline data

    Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust rein- forcement learning using offline data. InAdvances in Neural Information Processing Systems, volume 35, pages 32211–32224. Curran Associates, Inc., 2022

  38. [38]

    Robust adversarial reinforcement learning

    Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. InInternational conference on machine learning, pages 2817–2826. PMLR, 2017

  39. [39]

    Risk-averse model uncertainty for distributionally robust safe reinforcement learning.Advances in Neural Information Processing Systems, 36:1659–1680, 2023

    James Queeney and Mouhacine Benosman. Risk-averse model uncertainty for distributionally robust safe reinforcement learning.Advances in Neural Information Processing Systems, 36:1659–1680, 2023

  40. [40]

    Distributionally robust model-based reinforcement learning with large state spaces

    Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distributionally robust model-based reinforcement learning with large state spaces. InInterna- tional Conference on Artificial Intelligence and Statistics, pages 100–108. PMLR, 2024

  41. [41]

    On stochastic optimal control and reinforcement learning by approximate inference.Proceedings of Robotics: Science and Systems VIII, 2012

    Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference.Proceedings of Robotics: Science and Systems VIII, 2012

  42. [42]

    Springer Science & Business Media, 2009

    R Tyrrell Rockafellar and Roger J-B Wets.Variational analysis, volume 317. Springer Science & Business Media, 2009

  43. [43]

    d3rlpy: An offline deep reinforcement learning library.Journal of Machine Learning Research, 23(315):1–20, 2022

    Takuma Seno and Michita Imai. d3rlpy: An offline deep reinforcement learning library.Journal of Machine Learning Research, 23(315):1–20, 2022

  44. [44]

    Distributionally robust stochastic programming.SIAM Journal on Opti- mization, 27(4):2258–2275, 2017

    Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Opti- mization, 27(4):2258–2275, 2017

  45. [45]

    Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity.Journal of Machine Learning Research, 25(200):1–91, 2024

    Laixi Shi and Yuejie Chi. Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity.Journal of Machine Learning Research, 25(200):1–91, 2024

  46. [46]

    Improving robustness via risk averse distributional reinforcement learning

    Rahul Singh, Qinsheng Zhang, and Yongxin Chen. Improving robustness via risk averse distributional reinforcement learning. InLearning for Dynamics and Control, pages 958–968. PMLR, 2020

  47. [47]

    Distributionally Robust Reinforcement Learning

    Elena Smirnova, Elvis Dohmatob, and Jérémie Mary. Distributionally robust reinforcement learning.arXiv preprint arXiv:1902.08708, 2019

  48. [48]

    Policy gradient for coherent risk measures.Advances in neural information processing systems, 28, 2015

    Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures.Advances in neural information processing systems, 28, 2015

  49. [49]

    General duality between optimal control and estimation

    Emanuel Todorov. General duality between optimal control and estimation. In2008 47th IEEE conference on decision and control, pages 4286–4292. IEEE, 2008

  50. [50]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

  51. [51]

    Gymnasium: A Standard Interface for Reinforcement Learning Environments

    Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024

  52. [52]

    Stable reinforcement learning with autoencoders for tactile and visual data

    Herke Van Hoof, Nutan Chen, Maximilian Karl, Patrick Van Der Smagt, and Jan Peters. Stable reinforcement learning with autoencoders for tactile and visual data. In2016 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 3928–3934. IEEE, 2016. 12

  53. [53]

    A finite sample complexity bound for distributionally robust q-learning

    Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust q-learning. InProceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 ofProceedings of Machine Learning Research, pages 3370–3398. PMLR, 2023

  54. [54]

    Sample complexity of variance-reduced distributionally robust q-learning

    Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. Sample complexity of variance-reduced distributionally robust q-learning. Journal of Machine Learning Research, 25(341):1–77, 2024

  55. [55]

    Online robust reinforcement learning with model uncertainty

    Yue Wang and Shaofeng Zou. Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193–7206, 2021

  56. [56]

    Boosting offline reinforcement learning with residual generative modeling

    Hua Wei, Deheng Ye, Zhao Liu, Hao Wu, Bo Yuan, Qiang Fu, Wei Yang, and Zhenhui Li. Boosting offline reinforcement learning with residual generative modeling. arXiv preprint arXiv:2106.10411, 2021

  57. [57]

    Risk-sensitive linear/quadratic/gaussian control

    Peter Whittle. Risk-sensitive linear/quadratic/gaussian control. Advances in Applied Probability, 13(4):764–777, 1981

  58. [58]

    Robust markov decision processes

    Wolfram Wiesemann, Daniel Kuhn, and Berc Rustem. Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013

  59. [59]

    Constraints penalized q-learning for safe offline reinforcement learning

    Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022

  60. [60]

    Distributionally robust markov decision processes

    Huan Xu and Shie Mannor. Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23, 2010

  61. [61]

    Improved sample complexity bounds for distributionally robust reinforcement learning

    Zaiyan Xu, Kishan Panaganti, and Dileep Kalathil. Improved sample complexity bounds for distributionally robust reinforcement learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 9728–9754. PMLR, 2023

  62. [62]

    Distributionally robust counterpart in markov decision processes

    Pengqian Yu and Huan Xu. Distributionally robust counterpart in markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543, 2015

  63. [63]

    Robust deep reinforcement learning against adversarial perturbations on state observations

    Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037, 2020

  64. [64]

    Natural actor-critic for robust reinforcement learning with function approximation

    Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, PR Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 36:97–133, 2023

  65. [65]

    Finite-sample regret bound for distributionally robust offline tabular reinforcement learning

    Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, and Peter Glynn. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR, 2021

  66. [66]

    Infinite time horizon maximum causal entropy inverse reinforcement learning

    Zhengyuan Zhou, Michael Bloem, and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Transactions on Automatic Control, 63(9):2787–2802, 2018

  67. [67]

    Modeling purposeful adaptive behavior with the principle of maximum causal entropy

    Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010

  68. [68]

    Maximum entropy inverse reinforcement learning

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008

    Appendix Contents: A Discussion; A.1 Necessity of VAE; A.2 Algorithm Details; ...

  69. [69]

    Force_mag

    or directly estimating the expected value under nominal distributions [23]. None of these methods is applicable to offline RL tasks in continuous spaces.

    A.2 Algorithm Details

    In this section, we present a detailed description of the DR-SAC algorithm. In our algorithm, we use neural networks Vψ(s), Qθ(s, a) and πϕ(a|s) to approximate the value function, th...