DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty

arxiv: 2506.12622 · v2 · submitted 2025-06-14 · cs.LG · cs.AI · math.OC


Mingxuan Cui, Duo Zhou, Yuxuan Han, Grani A. Hanasusanto, Qiong Wang, Huan Zhang, Zhengyuan Zhou

Pith reviewed 2026-05-19 09:06 UTC · model grok-4.3


keywords distributionally robust reinforcement learning · soft actor-critic · offline RL · continuous action spaces · KL divergence · uncertainty set · generative modeling


The pith

DR-SAC extends actor-critic methods to distributionally robust offline RL in continuous action spaces by optimizing against worst-case transitions in a KL ball around an estimated nominal model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DR-SAC to train policies that maximize entropy-regularized rewards while guarding against the worst transition models inside a KL-divergence ball around an estimated nominal model. This setup addresses the practical problem that standard RL agents like SAC lose performance when real transition dynamics differ from those seen during training. A reader would care because the method supplies the first actor-critic formulation for continuous actions and offline data, together with a convergence guarantee for its robust policy iteration. Experiments on five continuous tasks show large reward gains under perturbation while also improving computational efficiency over earlier DR-RL approaches.

Core claim

DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence constrained uncertainty set. The algorithm derives the distributionally robust version of soft policy iteration, proves its convergence, and uses generative modeling to estimate the unknown nominal transition model. This construction enables the first actor-critic DR-RL method that operates in continuous action spaces for offline learning.

What carries the argument

Distributionally robust soft policy iteration, which replaces standard expectation over transitions with a worst-case optimization inside the KL ball around the nominal model.
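
To make the worst-case step concrete, here is a minimal sketch for a discrete next-state distribution. It relies on the standard dual form of a KL-constrained worst-case expectation, inf over {q : KL(q || p) ≤ ε} of E_q[V] = sup over β > 0 of −β log E_p[exp(−V/β)] − βε. The direction of the KL divergence, the function names `kl_dual_objective` and `kl_robust_expectation`, the ternary search, and the toy numbers are all assumptions made for illustration; none of them is taken from the paper's implementation.

```python
import math


def kl_dual_objective(beta, values, probs, radius):
    """Dual objective: -beta * log E_p[exp(-v / beta)] - beta * radius."""
    # Log-sum-exp over the support, shifted by the largest exponent for stability.
    exponents = [math.log(p) - v / beta for v, p in zip(values, probs) if p > 0.0]
    shift = max(exponents)
    log_mgf = shift + math.log(sum(math.exp(e - shift) for e in exponents))
    return -beta * log_mgf - beta * radius


def kl_robust_expectation(values, probs, radius, iters=200):
    """Approximate inf of E_q[v] over {q : KL(q || p) <= radius}."""
    # The dual is concave in beta, hence unimodal in log(beta). Ternary search
    # over a fixed bracket is an illustrative choice, not the paper's solver.
    lo, hi = math.log(1e-6), math.log(1e6)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        f1 = kl_dual_objective(math.exp(m1), values, probs, radius)
        f2 = kl_dual_objective(math.exp(m2), values, probs, radius)
        if f1 < f2:
            lo = m1
        else:
            hi = m2
    return kl_dual_objective(math.exp(0.5 * (lo + hi)), values, probs, radius)


# Hypothetical numbers: V(s') on three next states and a nominal transition row.
next_state_values = [1.0, 2.0, 3.0]
nominal_probs = [0.2, 0.5, 0.3]
nominal_mean = sum(v * p for v, p in zip(next_state_values, nominal_probs))
robust_small = kl_robust_expectation(next_state_values, nominal_probs, 0.05)
robust_large = kl_robust_expectation(next_state_values, nominal_probs, 0.5)
print(nominal_mean, robust_small, robust_large)
```

The robust value never exceeds the nominal mean and decreases as the radius grows, approaching the smallest next-state value once the ball is large enough to put nearly all mass on the worst state.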

If this is right

DR-SAC achieves up to 9.8 times higher average reward than the SAC baseline under common perturbations.
DR-SAC improves computing efficiency and applicability to large-scale problems relative to prior DR-RL algorithms.
The method supplies a convergence guarantee for its distributionally robust soft policy iteration.
It supports offline learning directly in continuous action spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the nominal model is estimated well from offline data, the approach could narrow the gap between simulated training and real-world execution in robotics.
The KL-ball construction might be combined with other base RL algorithms to test whether the robustness gain is specific to the soft actor-critic objective.
Structured uncertainties such as sensor bias or actuator wear could be used to check whether the current uncertainty set remains sufficiently expressive.

Load-bearing premise

The generative modeling step must produce a nominal transition model accurate enough that the KL ball around it covers the uncertainty actually met at deployment.

What would settle it

If DR-SAC is applied to a continuous control task whose true perturbations lie outside the estimated KL ball and it then fails to outperform plain SAC, the robustness benefit would be falsified.
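
A coverage check of this kind can be sketched for the simplest possible case, a scalar Gaussian transition whose mean or variance is perturbed at deployment. The closed-form KL divergence between two univariate Gaussians is standard. The Gaussian form, the direction KL(perturbed || nominal), the helper names `kl_gaussian` and `inside_kl_ball`, and the radius of 0.1 are assumptions for illustration, not quantities reported in the paper.

```python
import math


def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def inside_kl_ball(mu_q, sigma_q, mu_p, sigma_p, radius):
    """True if the perturbed model q lies in the KL ball around the nominal p."""
    return kl_gaussian(mu_q, sigma_q, mu_p, sigma_p) <= radius


radius = 0.1  # hypothetical uncertainty-set radius
# Nominal next-state model N(0, 1); two hypothetical mean shifts at deployment.
mild_shift = kl_gaussian(0.3, 1.0, 0.0, 1.0)    # 0.3**2 / 2 = 0.045
severe_shift = kl_gaussian(0.5, 1.0, 0.0, 1.0)  # 0.5**2 / 2 = 0.125
print(mild_shift <= radius, severe_shift <= radius)
```

Under these assumptions the milder shift falls inside the ball and the larger one falls outside it, which is the regime in which the test described above would be informative.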

Figures

Figures reproduced from arXiv: 2506.12622 by Duo Zhou, Grani A. Hanasusanto, Huan Zhang, Mingxuan Cui, Qiong Wang, Yuxuan Han, Zhengyuan Zhou.

**Figure 2.** Pendulum results on TD3-Dataset. The curves show the average reward of 50 episodes, shaded by ±0.5 standard deviation. Cartpole. For the Cartpole environment, we compare the DR-SAC algorithm with the non-robust algorithms SAC, DDPG, and FQI, and the robust algorithm RFQI. All algorithms are trained on the SAC-Dataset. In the Cartpole environment, the force applied to the cart is continuous and determined by the actuator’s ac…

**Figure 3.** (a) Observation perturbation: Gaussian noise added to nominal states. (b) “Force_mag” perturbation: change in the model parameter force_mag. In (a), DR-SAC shows a performance improvement of over 75% compared to the non-robust algorithms SAC and DDPG when the standard deviation of the noise is 0.2 and 0.3. [PITH_FULL_IMAGE:figures/full_fig_p021_3.png]

**Figure 4.** LunarLander results on TD3-Dataset. The curves show the average reward of 50 episodes, shaded by ±0.5 standard deviation. [PITH_FULL_IMAGE:figures/full_fig_p021_4.png]

**Figure 5.** HalfCheetah results on TD3-Dataset. The curves show the average reward of 50 episodes, shaded by ±0.5 standard deviation. C.3 Ablation Study Details. C.3.1 Training Efficiency of DR-SAC. In this section, we want to show that DR-SAC with functional optimization finds a good balance between efficiency and accuracy. We compare training time and robustness of Algorithm 1, DR-SAC without functional optimization, …

**Figure 6.** Pendulum results on TD3-Dataset. Curves show average reward of 50 episodes, shaded by ±0.5 standard deviation. Algorithms are SAC and DR-SAC with and without functional approximation. Efficiency Comparison with RFQI. In Section 4.2, we test and compare the robustness of DR-SAC with other algorithms, and the robust algorithm RFQI also shows comparable performance under perturbation in most environments. We want to s…

**Figure 7.** Average Reward of 20 Episodes over Training Time in [PITH_FULL_IMAGE:figures/full_fig_p024_7.png]

**Figure 8.** Average Reward of 20 Episodes over Training Step in [PITH_FULL_IMAGE:figures/full_fig_p024_8.png]

Original abstract

Deep reinforcement learning (RL) has achieved remarkable success, yet its deployment in real-world scenarios is often limited by vulnerability to environmental uncertainties. Distributionally robust RL (DR-RL) algorithms have been proposed to resolve this challenge, but existing approaches are largely restricted to value-based methods in tabular settings. In this work, we introduce Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic based DR-RL algorithm for offline learning in continuous action spaces. DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence constrained uncertainty set. We derive the distributionally robust version of the soft policy iteration with a convergence guarantee and incorporate a generative modeling approach to estimate the unknown nominal transition models. Experimental results on five continuous RL tasks demonstrate that our algorithm achieves up to 9.8 times higher average reward than the SAC baseline under common perturbations. Additionally, DR-SAC significantly improves computing efficiency and applicability to large-scale problems compared with existing DR-RL algorithms. Code is publicly available at github.com/Lemutisme/DR-SAC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Distributionally Robust Soft Actor-Critic (DR-SAC), the first actor-critic DR-RL method for offline learning in continuous action spaces. It extends soft actor-critic by maximizing entropy-regularized rewards against worst-case transition models inside a KL-divergence ball around a nominal transition kernel estimated via generative modeling. The authors derive a distributionally robust version of soft policy iteration with a claimed convergence guarantee, and report up to 9.8 times higher average reward than SAC on five continuous control tasks under common perturbations, along with improved computational efficiency over prior DR-RL algorithms.

Significance. If the convergence guarantee holds and the generative modeling step reliably places the true deployment dynamics inside the KL-ball, the work would be significant for extending distributionally robust RL to practical continuous-control settings where actor-critic methods dominate. The public code release and empirical evaluation on five tasks under perturbations are positive features that support reproducibility and applicability claims. However, the significance is tempered by the lack of explicit verification that the estimated nominal model satisfies the coverage condition required for the robustness guarantee to translate to real uncertainty.

major comments (2)

The central robustness claim rests on the generative modeling step producing a nominal transition estimate P̂ such that the true (unknown) dynamics P* lie inside the KL-ball B(P̂, ε) for the chosen radius. The manuscript reports gains under 'common perturbations' but provides no quantitative diagnostic (e.g., estimated KL distance between P̂ and held-out perturbed trajectories, or sensitivity analysis over ε) confirming that condition (i) in the skeptic note holds. Without this check, the observed improvement could be an artifact of the particular test perturbations rather than a consequence of the distributionally robust objective.
The derivation of the distributionally robust soft policy iteration is presented as a direct extension that yields a convergence guarantee. However, the manuscript does not show that the resulting robust Bellman operator is a contraction (or satisfies the conditions for the guarantee) independently of the specific generative model; the circularity concern that the construction reduces to a fitted parameter by the paper's own equations remains unaddressed in the provided theoretical section.

minor comments (2)

The abstract states 'up to 9.8 times higher average reward' but the experimental section should clarify whether this is the maximum over tasks or an average, and include standard errors or statistical significance tests across the five environments.
Notation for the uncertainty set radius ε and the generative model training procedure should be introduced earlier and used consistently when describing how the nominal kernel is obtained from offline data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our results and theory.


Referee: The central robustness claim rests on the generative modeling step producing a nominal transition estimate P̂ such that the true (unknown) dynamics P* lie inside the KL-ball B(P̂, ε) for the chosen radius. The manuscript reports gains under 'common perturbations' but provides no quantitative diagnostic (e.g., estimated KL distance between P̂ and held-out perturbed trajectories, or sensitivity analysis over ε) confirming that condition (i) in the skeptic note holds. Without this check, the observed improvement could be an artifact of the particular test perturbations rather than a consequence of the distributionally robust objective.

Authors: We agree that explicit verification of the coverage condition would strengthen the robustness claims. In the revised manuscript we will add a new subsection with quantitative diagnostics: estimated KL distances between the learned nominal model and held-out trajectories collected under the perturbed dynamics, together with a sensitivity analysis of performance across a range of ε values. These additions will directly address whether the true dynamics lie inside the uncertainty set for the radii used in our experiments. revision: yes
Referee: The derivation of the distributionally robust soft policy iteration is presented as a direct extension that yields a convergence guarantee. However, the manuscript does not show that the resulting robust Bellman operator is a contraction (or satisfies the conditions for the guarantee) independently of the specific generative model; the circularity concern that the construction reduces to a fitted parameter by the paper's own equations remains unaddressed in the provided theoretical section.

Authors: We thank the referee for this observation. The convergence proof for the robust soft policy iteration appears in the appendix and establishes that the robust Bellman operator is a contraction under the standard assumptions on the entropy-regularized objective and the KL-ball uncertainty set. The contraction property holds for any nominal kernel inside the ball and does not rely on the particular generative model used to estimate that kernel. To eliminate any appearance of circularity we will add a short clarifying paragraph in Section 4 that explicitly states the independence from the generative-model details and references the relevant steps in the appendix proof. revision: yes
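
The appendix proof the authors refer to is not reproduced on this page. As a hedged illustration of why such a contraction can hold regardless of how the nominal kernel was estimated, the generic argument for a fixed policy is sketched below. The notation is assumed rather than taken from the paper: discount factor \gamma, temperature \alpha, estimated nominal kernel \hat P, and uncertainty set \mathcal{B}(\hat P_{s,a}, \epsilon).

```latex
\begin{aligned}
(\mathcal{T}^{\pi} Q)(s,a)
  &= r(s,a) + \gamma \inf_{P \in \mathcal{B}(\hat P_{s,a},\,\epsilon)}
     \mathbb{E}_{s' \sim P}\Big[\mathbb{E}_{a' \sim \pi(\cdot \mid s')}
     \big[Q(s',a') - \alpha \log \pi(a' \mid s')\big]\Big], \\[4pt]
\big|(\mathcal{T}^{\pi} Q_1)(s,a) - (\mathcal{T}^{\pi} Q_2)(s,a)\big|
  &\le \gamma \sup_{P \in \mathcal{B}(\hat P_{s,a},\,\epsilon)}
     \Big|\mathbb{E}_{s' \sim P}\,\mathbb{E}_{a' \sim \pi(\cdot \mid s')}
     \big[Q_1(s',a') - Q_2(s',a')\big]\Big| \\
  &\le \gamma\,\|Q_1 - Q_2\|_{\infty}.
\end{aligned}
```

The first inequality uses |inf f − inf g| ≤ sup |f − g| and the cancellation of the entropy term; the second holds because every P in the set is a probability distribution. The nominal kernel enters only through the set \mathcal{B}, so in this sketch the bound does not depend on which generative model produced \hat P.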

Circularity Check

0 steps flagged

Derivation of robust soft policy iteration presented as extension; no reduction to fitted inputs or self-citation chains


The paper derives a distributionally robust variant of soft policy iteration from the standard SAC framework and incorporates a separate generative modeling step to estimate the nominal transition kernel before applying the KL-ball. No equation in the provided abstract or skeptic analysis shows a claimed prediction or first-principles result that algebraically equals a fitted parameter or prior self-citation by construction. The generative modeling step is an input to the robust objective rather than a tautological output. This yields a self-contained derivation chain against external benchmarks such as standard SAC, with only minor self-citation risk at the level of the base algorithm.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard RL convergence results for soft policy iteration and on the modeling assumption that a generative model trained on offline data yields a usable nominal transition kernel around which a KL-ball can be constructed.

free parameters (1)

KL-ball radius
The size of the uncertainty set is a tunable hyperparameter that controls the degree of robustness.
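
As a rough illustration of how the radius trades nominal value for robustness, consider the special case where the next-state value is Gaussian with mean μ and standard deviation σ under the nominal model. The worst-case mean over a KL ball of radius ε then has the closed form μ − σ√(2ε), attained by shifting the mean while keeping the variance. The Gaussian assumption, the function name `gaussian_robust_value`, and the numbers in the sweep are hypothetical and are not results from the paper.

```python
import math


def gaussian_robust_value(mean, std, radius):
    """Worst-case mean of a Gaussian value over a KL ball of the given radius."""
    # Closed form mean - std * sqrt(2 * radius); valid only under the Gaussian
    # assumption stated in the lead-in, not a general formula.
    return mean - std * math.sqrt(2.0 * radius)


# Hypothetical sweep: nominal value 10 with standard deviation 2.
sweep = {r: gaussian_robust_value(10.0, 2.0, r) for r in (0.0, 0.02, 0.08, 0.5)}
for radius, value in sweep.items():
    print(f"radius={radius:<5} robust value={value:.2f}")
```

In this toy case a radius of zero recovers the nominal value, and the penalty grows with the square root of the radius, so quadrupling the radius doubles the pessimism.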

axioms (2)

standard math: Soft policy iteration converges to an optimal policy under standard entropy-regularized MDP assumptions.
The robust extension builds directly on this background result.
domain assumption: The nominal transition model can be estimated from offline data via generative modeling.
This step is required to define the center of the KL uncertainty set.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

DR-SAC maximizes the entropy-regularized rewards against the worst possible transition models within a KL-divergence constrained uncertainty set... derive the distributionally robust version of the soft policy iteration... incorporate a generative modeling approach to estimate the unknown nominal transition models.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

Relation between the paper passage and the cited Recognition theorem.

We utilize the interchange property to reformulate the optimization problem within the KL-constrained uncertainty set into functional optimization.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 5 internal anchors

[1]

Uncertainty-based offline reinforcement learning with diversified q-ensemble.Advances in neural information processing systems, 34:7436–7447, 2021

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble.Advances in neural information processing systems, 34:7436–7447, 2021

work page 2021
[2]

Deep reinforcement learning: A brief survey.IEEE Signal Processing Magazine, 34(6):26–38, 2017

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey.IEEE Signal Processing Magazine, 34(6):26–38, 2017

work page 2017
[3]

Deep generative models for offline policy learning: Tutorial, survey, and perspectives on future directions.arXiv preprint arXiv:2402.13777, 2024

Jiayu Chen, Bhargav Ganguly, Yang Xu, Yongsheng Mei, Tian Lan, and Vaneet Aggarwal. Deep generative models for offline policy learning: Tutorial, survey, and perspectives on future directions.arXiv preprint arXiv:2402.13777, 2024

work page arXiv 2024
[4]

Corrected soft actor critic for continuous control.arXiv preprint arXiv:2410.16739, 2024

Yanjun Chen, Xinming Zhang, Xianghui Wang, Zhiqiang Xu, Xiaoyu Shen, and Wei Zhang. Corrected soft actor critic for continuous control.arXiv preprint arXiv:2410.16739, 2024

work page arXiv 2024
[5]

Adversarially trained actor critic for offline reinforcement learning

Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. InInternational Conference on Machine Learning, pages 3852–3878. PMLR, 2022

work page 2022
[6]

Towards minimax optimality of model- based robust reinforcement learning.arXiv preprint arXiv:2302.05372, 2023

Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax optimality of model- based robust reinforcement learning.arXiv preprint arXiv:2302.05372, 2023

work page arXiv 2023
[7]

Soft-Robust Actor-Critic Policy-Gradient

Esther Derman, Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. Soft-robust actor- critic policy-gradient.arXiv preprint arXiv:1803.04848, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

Esther Derman and Shie Mannor. Distributional robustness and regularization in reinforcement learning.arXiv preprint arXiv:2003.02894, 2020

work page arXiv 2003
[9]

Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts.arXiv preprint arXiv:2402.09992, 2024

Tobias Enders, James Harrison, and Maximilian Schiffer. Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts.arXiv preprint arXiv:2402.09992, 2024

work page arXiv 2024
[10]

An introduction to deep reinforcement learning.Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018

Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al. An introduction to deep reinforcement learning.Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018

work page 2018
[11]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning, pages 1582–1591, 2018

work page 2018
[12]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

work page 2052
[13]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

work page 2018
[14]

Soft Actor-Critic Algorithms and Applications

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications.arXiv preprint arXiv:1812.05905, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes.Manage- ment science, 18(7):356–369, 1972

work page 1972
[16]

Kullback-leibler divergence constrained distributionally robust optimization.Available at Optimization Online, 1(2):9, 2013

Zhaolin Hu and L Jeff Hong. Kullback-leibler divergence constrained distributionally robust optimization.Available at Optimization Online, 1(2):9, 2013

work page 2013
[17]

Garud N. Iyengar. Robust dynamic programming.Mathematics of Operations Research, 30(2):257–280, 2005

work page 2005
[18]

David Jacobson. Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games.IEEE Transactions on Automatic control, 18(2):124–131, 1973. 10

work page 1973
[19]

Auto-encoding variational bayes, 2013

Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013

work page 2013
[20]

Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint.IEEE Access, 10:130058– 130067, 2022

Arash Bahari Kordabad, Rafael Wisniewski, and Sebastien Gros. Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint.IEEE Access, 10:130058– 130067, 2022

work page 2022
[21]

Stabilizing off- policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off- policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019

work page 2019
[22]

Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179– 1191, 2020

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179– 1191, 2020

work page 2020
[23]

Single- trajectory distributionally robust reinforcement learning, 2024

Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jiheng Zhang, and Zhengyuan Zhou. Single- trajectory distributionally robust reinforcement learning, 2024

work page 2024
[24]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[25]

Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning.arXiv preprint arXiv:2403.09621, 2024

Zhishuai Liu and Pan Xu. Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning.arXiv preprint arXiv:2403.09621, 2024

work page arXiv 2024
[26]

Distributionally robust q-learning

Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust q-learning. InInternational Conference on Machine Learning, pages 13623–13643. PMLR, 2022

work page 2022
[27]

Soft-robust algorithms for batch reinforcement learning.arXiv preprint arXiv:2011.14495, 2020

Elita A Lobo, Mohammad Ghavamzadeh, and Marek Petrik. Soft-robust algorithms for batch reinforcement learning.arXiv preprint arXiv:2011.14495, 2020

work page arXiv 2011
[28]

Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm

Miao Lu, Han Zhong, Tong Zhang, and Jose Blanchet. Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578, 2024

work page arXiv 2024
[29]

Mildly conservative q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 35:1711–1724, 2022

Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 35:1711–1724, 2022

work page 2022
[30]

Distributionally robust offline reinforcement learning with linear function approximation, 2023

Xiaoteng Ma, Zhipeng Liang, Jose Blanchet, Mingwen Liu, Li Xia, Jiheng Zhang, Qianchuan Zhao, and Zhengyuan Zhou. Distributionally robust offline reinforcement learning with linear function approximation, 2023

work page 2023
[31]

Robust re- inforcement learning for continuous control with model misspecification.arXiv preprint arXiv:1906.07516, 2019

Daniel J Mankowitz, Nir Levine, Rae Jeong, Yuanyuan Shi, Jackie Kay, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy Mann, Todd Hester, and Martin Riedmiller. Robust re- inforcement learning for continuous control with model misspecification.arXiv preprint arXiv:1906.07516, 2019

work page arXiv 1906
[32]

An experimental design perspective on model-based reinforcement learning.arXiv preprint arXiv:2112.05244, 2021

Viraj Mehta, Biswajit Paria, Jeff Schneider, Stefano Ermon, and Willie Neiswanger. An experimental design perspective on model-based reinforcement learning.arXiv preprint arXiv:2112.05244, 2021

work page arXiv 2021
[33]

Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

work page 2015
[34]

Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices.Operations Research, 53(5):780–798, 2005

work page 2005
[35]

Risk averse robust adversarial reinforce- ment learning

Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforce- ment learning. In2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019. 11

work page 2019
[36]

Sample complexity of robust reinforcement learning with a generative model

Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. InInternational Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR, 2022

work page 2022
[37]

Robust rein- forcement learning using offline data

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust rein- forcement learning using offline data. InAdvances in Neural Information Processing Systems, volume 35, pages 32211–32224. Curran Associates, Inc., 2022

work page 2022
[38]

Robust adversarial reinforcement learning

Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. InInternational conference on machine learning, pages 2817–2826. PMLR, 2017

work page 2017
[39]

Risk-averse model uncertainty for distributionally robust safe reinforcement learning.Advances in Neural Information Processing Systems, 36:1659–1680, 2023

James Queeney and Mouhacine Benosman. Risk-averse model uncertainty for distributionally robust safe reinforcement learning.Advances in Neural Information Processing Systems, 36:1659–1680, 2023

work page 2023
[40]

Distributionally robust model-based reinforcement learning with large state spaces

Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distributionally robust model-based reinforcement learning with large state spaces. InInterna- tional Conference on Artificial Intelligence and Statistics, pages 100–108. PMLR, 2024

work page 2024
[41]

On stochastic optimal control and reinforcement learning by approximate inference.Proceedings of Robotics: Science and Systems VIII, 2012

Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference.Proceedings of Robotics: Science and Systems VIII, 2012

work page 2012
[42]

Springer Science & Business Media, 2009

R Tyrrell Rockafellar and Roger J-B Wets.Variational analysis, volume 317. Springer Science & Business Media, 2009

work page 2009
[43]

d3rlpy: An offline deep reinforcement learning library.Journal of Machine Learning Research, 23(315):1–20, 2022

Takuma Seno and Michita Imai. d3rlpy: An offline deep reinforcement learning library.Journal of Machine Learning Research, 23(315):1–20, 2022

work page 2022
[44]

Distributionally robust stochastic programming.SIAM Journal on Opti- mization, 27(4):2258–2275, 2017

Alexander Shapiro. Distributionally robust stochastic programming.SIAM Journal on Opti- mization, 27(4):2258–2275, 2017

work page 2017
[45]

Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity.Journal of Machine Learning Research, 25(200):1–91, 2024

Laixi Shi and Yuejie Chi. Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity.Journal of Machine Learning Research, 25(200):1–91, 2024

work page 2024
[46]

Improving robustness via risk averse distributional reinforcement learning

Rahul Singh, Qinsheng Zhang, and Yongxin Chen. Improving robustness via risk averse distributional reinforcement learning. InLearning for Dynamics and Control, pages 958–968. PMLR, 2020

work page 2020
[47]

Distributionally Robust Reinforcement Learning

Elena Smirnova, Elvis Dohmatob, and Jérémie Mary. Distributionally robust reinforcement learning.arXiv preprint arXiv:1902.08708, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[48]

Policy gradient for coherent risk measures.Advances in neural information processing systems, 28, 2015

Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures.Advances in neural information processing systems, 28, 2015

work page 2015
[49]

General duality between optimal control and estimation

Emanuel Todorov. General duality between optimal control and estimation. In2008 47th IEEE conference on decision and control, pages 4286–4292. IEEE, 2008

work page 2008
[50] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[51] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.

[52] Herke Van Hoof, Nutan Chen, Maximilian Karl, Patrick Van Der Smagt, and Jan Peters. Stable reinforcement learning with autoencoders for tactile and visual data. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3928–3934. IEEE, 2016.

[53] Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust q-learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 3370–3398. PMLR, 2023.

[54] Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. Sample complexity of variance-reduced distributionally robust q-learning. Journal of Machine Learning Research, 25(341):1–77, 2024.

[55] Yue Wang and Shaofeng Zou. Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34:7193–7206, 2021.

[56] Hua Wei, Deheng Ye, Zhao Liu, Hao Wu, Bo Yuan, Qiang Fu, Wei Yang, and Zhenhui Li. Boosting offline reinforcement learning with residual generative modeling. arXiv preprint arXiv:2106.10411, 2021.

[57] Peter Whittle. Risk-sensitive linear/quadratic/gaussian control. Advances in Applied Probability, 13(4):764–777, 1981.

[58] Wolfram Wiesemann, Daniel Kuhn, and Berc Rustem. Robust markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

[59] Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022.

[60] Huan Xu and Shie Mannor. Distributionally robust markov decision processes. Advances in Neural Information Processing Systems, 23, 2010.

[61] Zaiyan Xu, Kishan Panaganti, and Dileep Kalathil. Improved sample complexity bounds for distributionally robust reinforcement learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 9728–9754. PMLR, 2023.

[62] Pengqian Yu and Huan Xu. Distributionally robust counterpart in markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538–2543, 2015.

[63] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037, 2020.

[64] Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, PR Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 36:97–133, 2023.

[65] Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, and Peter Glynn. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331–3339. PMLR, 2021.

[66] Zhengyuan Zhou, Michael Bloem, and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Transactions on Automatic Control, 63(9):2787–2802, 2018.

[67] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.

[68] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

Appendix Contents: A Discussion; A.1 Necessity of VAE; A.2 Algorithm Details ...

$Q(s, a) - \mathbb{E}[r] - \gamma \cdot \sup_{\beta \ge 0} \left\{ -\beta \log \mathbb{E}_{p^0_{s,a}}\left[ \exp\left( -\frac{V(s')}{\beta} \right) \right] - \beta\delta \right\} \Big]$. If using the ERM method, the empirical Bellman residual is $\widehat{L}_Q := \frac{1}{N} \sum_{i=1}^{N} \ldots$

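The supremum over β in the Bellman residual fragment above is the standard dual form of a KL-constrained worst-case expectation: for a nominal distribution p0 and radius δ, the infimum of E_p[V(s')] over all p with KL(p ‖ p0) ≤ δ equals sup over β ≥ 0 of −β log E_{p0}[exp(−V(s')/β)] − βδ. The following is a minimal Python sketch of evaluating that term from sampled next-state values. The function names, the search interval for β, and the use of golden-section search are illustrative assumptions, not the paper's implementation.

```python
import math


def kl_dual_objective(values, beta, delta):
    """g(beta) = -beta * log(mean(exp(-v / beta))) - beta * delta."""
    scaled = [-v / beta for v in values]
    m = max(scaled)  # subtract the max so the exponentials cannot overflow
    log_mean_exp = m + math.log(sum(math.exp(x - m) for x in scaled) / len(scaled))
    return -beta * log_mean_exp - beta * delta


def robust_value(values, delta, beta_lo=1e-3, beta_hi=1e3, iters=200):
    """Approximate sup_{beta >= 0} g(beta) by golden-section search over log(beta).

    g is concave in beta, hence unimodal in log(beta). The interval
    [beta_lo, beta_hi] is an assumed search range, not a value from the paper.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = math.log(beta_lo), math.log(beta_hi)
    for _ in range(iters):
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if kl_dual_objective(values, math.exp(a), delta) < kl_dual_objective(values, math.exp(b), delta):
            lo = a  # the maximizer lies to the right of a
        else:
            hi = b  # the maximizer lies to the left of b
    return kl_dual_objective(values, math.exp((lo + hi) / 2), delta)


# Sampled next-state values V(s') under the nominal model (made-up numbers).
samples = [1.0, 2.0, 3.0, 4.0]
nominal = robust_value(samples, delta=0.0)
robust = robust_value(samples, delta=0.5)
```

With δ = 0 the estimate approaches the sample mean, and a larger δ pushes it toward the smallest sampled value, which is the pessimism the robust Bellman target encodes.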
[69] Force_mag

or directly estimating the expected value under nominal distributions [23]. None of these methods is applicable in continuous space offline RL tasks.

A.2 Algorithm Details

In this section, we present a detailed description of the DR-SAC algorithm. In our algorithm, we use neural networks Vψ(s), Qθ(s, a) and πϕ(a|s) to approximate the value function, th...

[1] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in Neural Information Processing Systems, 34:7436–7447, 2021.

[2] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

[3] Jiayu Chen, Bhargav Ganguly, Yang Xu, Yongsheng Mei, Tian Lan, and Vaneet Aggarwal. Deep generative models for offline policy learning: Tutorial, survey, and perspectives on future directions. arXiv preprint arXiv:2402.13777, 2024.

[4] Yanjun Chen, Xinming Zhang, Xianghui Wang, Zhiqiang Xu, Xiaoyu Shen, and Wei Zhang. Corrected soft actor critic for continuous control. arXiv preprint arXiv:2410.16739, 2024.

[5] Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. In International Conference on Machine Learning, pages 3852–3878. PMLR, 2022.

[6] Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax optimality of model-based robust reinforcement learning. arXiv preprint arXiv:2302.05372, 2023.

[7] Esther Derman, Daniel J Mankowitz, Timothy A Mann, and Shie Mannor. Soft-robust actor-critic policy-gradient. arXiv preprint arXiv:1803.04848, 2018.

[8] Esther Derman and Shie Mannor. Distributional robustness and regularization in reinforcement learning. arXiv preprint arXiv:2003.02894, 2020.

[9] Tobias Enders, James Harrison, and Maximilian Schiffer. Risk-sensitive soft actor-critic for robust deep reinforcement learning under distribution shifts. arXiv preprint arXiv:2402.09992, 2024.

[10] Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018.

[11] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1582–1591, 2018.

[12] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.

[13] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

[14] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.

[15] Ronald A Howard and James E Matheson. Risk-sensitive markov decision processes. Management Science, 18(7):356–369, 1972.

[16] Zhaolin Hu and L Jeff Hong. Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 1(2):9, 2013.

[17] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[18] David Jacobson. Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control, 18(2):124–131, 1973.

[19] Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013.

[20] Arash Bahari Kordabad, Rafael Wisniewski, and Sebastien Gros. Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint. IEEE Access, 10:130058–130067, 2022.

[21] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.

[22] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.

[23] Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jiheng Zhang, and Zhengyuan Zhou. Single-trajectory distributionally robust reinforcement learning, 2024.

[24] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[25] Zhishuai Liu and Pan Xu. Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621, 2024.

[26] Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust q-learning. In International Conference on Machine Learning, pages 13623–13643. PMLR, 2022.

[27] Elita A Lobo, Mohammad Ghavamzadeh, and Marek Petrik. Soft-robust algorithms for batch reinforcement learning. arXiv preprint arXiv:2011.14495, 2020.

[28] Miao Lu, Han Zhong, Tong Zhang, and Jose Blanchet. Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578, 2024.

[29] Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:1711–1724, 2022.

[30] Xiaoteng Ma, Zhipeng Liang, Jose Blanchet, Mingwen Liu, Li Xia, Jiheng Zhang, Qianchuan Zhao, and Zhengyuan Zhou. Distributionally robust offline reinforcement learning with linear function approximation, 2023.

[31] Daniel J Mankowitz, Nir Levine, Rae Jeong, Yuanyuan Shi, Jackie Kay, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy Mann, Todd Hester, and Martin Riedmiller. Robust reinforcement learning for continuous control with model misspecification. arXiv preprint arXiv:1906.07516, 2019.

[32] Viraj Mehta, Biswajit Paria, Jeff Schneider, Stefano Ermon, and Willie Neiswanger. An experimental design perspective on model-based reinforcement learning. arXiv preprint arXiv:2112.05244, 2021.

[33] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[34] Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

[35] Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.

[36] Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR, 2022.

[37] Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust reinforcement learning using offline data. In Advances in Neural Information Processing Systems, volume 35, pages 32211–32224. Curran Associates, Inc., 2022.

[38] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826. PMLR, 2017.

[39] James Queeney and Mouhacine Benosman. Risk-averse model uncertainty for distributionally robust safe reinforcement learning. Advances in Neural Information Processing Systems, 36:1659–1680, 2023.

[40] Shyam Sundhar Ramesh, Pier Giuseppe Sessa, Yifan Hu, Andreas Krause, and Ilija Bogunovic. Distributionally robust model-based reinforcement learning with large state spaces. In International Conference on Artificial Intelligence and Statistics, pages 100–108. PMLR, 2024.

[41] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. Proceedings of Robotics: Science and Systems VIII, 2012.

[42] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
