Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

Andrzej Ruszczynski; Tiangang Zhang

arxiv: 2605.00654 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

Andrzej Ruszczynski , Tiangang Zhang This is my paper

Pith reviewed 2026-05-09 19:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML

keywords reinforcement learningrisk-averse MDPsQ-learningregret boundscoherent risk measuresmultipattern approximationMarkov decision processesrisk-averse learning

0 comments

The pith

Feature-based Q-learning with multipattern Q-factor approximation yields a high-probability regret bound of O(H² N^H √K) for risk-averse finite-horizon Markov decision problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces mini-batch coherent risk measures tailored to risk-averse finite-horizon Markov decision problems and defines multipattern risk-averse problems as a generalization of linear systems. It develops a feature-based Q-learning method that applies multipattern approximation to the Q-factors and proves this yields a high-probability regret bound scaling as H squared times N to the power H times the square root of K. An economical variant of the method is also given to simplify the backward policy evaluation step, with demonstrations on a stochastic assignment problem and a short-horizon multi-armed bandit. A sympathetic reader would care because the results supply concrete performance guarantees for learning policies that incorporate risk aversion in sequential decisions under uncertainty.

Core claim

We introduce mini-batch measures as a special class of Markov coherent risk measures and the class of multipattern risk-averse problems that generalizes linear systems. Using these in a feature-based Q-learning method with multipattern Q-factor approximation, we prove a high-probability regret bound of O(H² N^H √K), where H is the horizon, N is the mini-batch size, and K is the number of episodes. We also propose an economical version of the Q-learning method that streamlines the policy evaluation step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

What carries the argument

The multipattern Q-factor approximation within the feature-based Q-learning method, which generalizes linear systems and enables the application of mini-batch coherent risk measures to derive the regret bound.

If this is right

The Q-learning algorithm achieves sublinear regret growth in the number of episodes under the given conditions.
The economical variant reduces the computational cost of the backward induction step without altering the regret guarantee.
The regret bound applies directly to risk-averse versions of the stochastic assignment problem and short-horizon bandit problems.
Risk-averse policies can be learned with explicit high-probability performance certificates in finite-horizon settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If efficient ways to construct multipattern approximations are found for other risk measures or problem classes, the regret analysis could extend to additional sequential decision settings.
The mini-batch structure may allow practitioners to tune the degree of risk aversion by adjusting batch size N while preserving the overall scaling of the bound.
Similar approximation ideas could be tested in related algorithms such as policy gradient methods to obtain comparable guarantees for risk-averse reinforcement learning.

Load-bearing premise

The underlying process is a finite-horizon Markov decision problem, the risk measures belong to the mini-batch coherent class, and the multipattern approximation holds for the Q-factors.

What would settle it

Running the proposed Q-learning method on the stochastic assignment problem and observing that the empirical regret grows faster than order H² N^H √K with increasing episodes K would contradict the stated bound.

Figures

Figures reproduced from arXiv: 2605.00654 by Andrzej Ruszczynski, Tiangang Zhang.

**Figure 1.** Figure 1: The learning error 𝑑(𝑤 𝑘 , 𝑤ˆ) as a function of the number of episodes during full 𝑄-learning and “lazy” learning with the batch size 𝑁 = 1 on an 8-stage stochastic assignment problem with Bernoulli rewards. is comparable to the full 𝑄-learning algorithm, at about half of the training time. The performance of the two policies learned is statistically indistinguishable. Still, their bias to the optimal perf… view at source ↗

**Figure 2.** Figure 2: The learning error 𝑑(𝑤 𝑘 , 𝑤ˆ) as a function of the number of episodes during full 𝑄-learning and “lazy” learning with the batch size 𝑁 = 2 on an 8-stage stochastic assignment problem with Bernoulli rewards. for calculating the regression targets. We still use for every policy 𝜋 the approximation 𝑄 𝜋 ℎ (𝑥, 𝑎) ≈ 𝜑(𝑥, 𝑎) ⊤𝑤 𝜋 ℎ , (32) with some weight vector 𝑤 𝜋 ℎ , and we want to learn the weights correspon… view at source ↗

**Figure 3.** Figure 3: The distribution functions of the optimal policy performance an 8-stage stochastic assignment [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: The distribution functions of the optimal policy performance in a multi-armed bandit problem with [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

read the original abstract

For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in a feature-based $Q$-learning method with multipattern $Q$-factor approximation and we prove a high-probability regret bound of $\mathcal{O}\big(H^2 N^H \sqrt{ K}\big)$, where $H$ is the horizon, $N$ is the mini-batch size, and $K$ is the number of episodes. We also propose an economical version of the $Q$-learning method that streamlines the policy evaluation (backward) step. The theoretical results are illustrated on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines new mini-batch risk measures and multipattern problems for risk-averse RL, then gives a Q-learning variant with a regret bound that carries an exponential-in-horizon penalty.

read the letter

This paper defines mini-batch Markov coherent risk measures and multipattern risk-averse problems, then folds both into a feature-based Q-learning algorithm with multipattern Q-factor approximation. It also supplies an economical variant that simplifies the backward policy evaluation step and states a high-probability regret bound of order H squared times N to the H times square root of K. The illustrations cover a stochastic assignment problem and a short-horizon bandit instance. Those two new classes are the clearest addition relative to earlier work on coherent risk measures and linear risk-averse systems. The combination with approximation and the economical backward pass is a straightforward engineering move that fits the finite-horizon setting. The bound itself is the part that needs the most attention. The N to the H factor grows exponentially with horizon length, and the numerical examples stay at short horizons, so there is no direct check on whether the term is an unavoidable consequence of the recursive risk-measure application or simply an artifact of the concentration argument. Without the full derivation it is difficult to see how tight the induction or union bound actually is. The paper engages the existing literature on risk measures and RL regret analysis without obvious internal contradictions, so the thinking is clear on its own terms. Readers working on risk-averse sequential decisions or coherent risk measures in optimization will find the new classes worth examining. The bound and the limited experimental range are the main limitations. I would bring the paper to a reading group on RL theory or risk-sensitive control, but I would not cite it in the next year unless the exponential term is tightened or shown to be necessary. It deserves peer review because the constructions are fresh enough that referees can usefully check the proofs and ask for longer-horizon experiments.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a class of mini-batch Markov coherent risk measures and the class of multipattern risk-averse problems that generalizes linear systems, for risk-averse finite-horizon MDPs. It develops a feature-based Q-learning method with multipattern Q-factor approximation, proves a high-probability regret bound of O(H² N^H √K), proposes an economical variant that streamlines the backward policy-evaluation step, and illustrates the approach on a stochastic assignment problem and a short-horizon multi-armed bandit problem.

Significance. If the stated regret bound holds under the paper's assumptions on the MDP and the newly defined mini-batch coherent risk measures, the work would supply a concrete high-probability guarantee for risk-averse RL that incorporates recursive risk evaluation. The multipattern approximation and the economical variant are practical contributions. The exponential N^H factor, however, limits immediate applicability to long horizons, and the numerical examples do not probe this regime.

major comments (1)

[regret analysis (the proof of the main regret theorem)] The central high-probability regret bound O(H² N^H √K) (stated in the abstract and proved via the recursive application of the mini-batch risk measures) contains an exponential-in-horizon factor N^H. The manuscript provides no separate argument or lower-bound construction showing that this multiplicative accumulation across the H stages is necessary rather than an artifact of the induction or union-bound technique used to control per-step approximation and concentration errors.

minor comments (2)

[Abstract] The abstract states the regret bound and the two new classes but supplies neither an explicit list of assumptions (finite-horizon MDP, mini-batch coherent class, multipattern approximation validity) nor a high-level sketch of the derivation steps.
[Numerical experiments] Numerical illustrations are confined to short horizons; the practical severity of the N^H term is therefore not examined.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and insightful comments on our work. We provide a point-by-point response to the major comment below.

read point-by-point responses

Referee: The central high-probability regret bound O(H² N^H √K) (stated in the abstract and proved via the recursive application of the mini-batch risk measures) contains an exponential-in-horizon factor N^H. The manuscript provides no separate argument or lower-bound construction showing that this multiplicative accumulation across the H stages is necessary rather than an artifact of the induction or union-bound technique used to control per-step approximation and concentration errors.

Authors: We appreciate the referee pointing out this important aspect of our regret analysis. The bound is obtained by recursively applying the definition of the mini-batch Markov coherent risk measures across the H stages of the finite-horizon MDP and using induction to propagate the per-step approximation and concentration errors. The N^H factor arises from this recursive structure combined with union bounds over the horizon. We do not claim in the manuscript that this dependence is tight, nor do we provide a lower-bound construction to demonstrate necessity. It is possible that the exponential factor is an artifact of the current proof technique and that a tighter bound with milder dependence on H could be derived using more advanced concentration inequalities or a different analysis approach. In the revised manuscript, we will include an additional remark in the discussion of the regret bound acknowledging this and highlighting it as a direction for future improvement. revision: yes

Circularity Check

0 steps flagged

No circularity: regret bound derived independently from new definitions

full rationale

The paper defines mini-batch coherent risk measures and multipattern risk-averse problems as new concepts generalizing linear systems, then applies them to feature-based Q-learning with multipattern Q-factor approximation. The high-probability regret bound O(H² N^H √K) is obtained via standard concentration and error-propagation analysis over the finite-horizon backward pass under the stated assumptions. No step reduces the bound to a fitted parameter, self-referential definition, or self-citation chain; the N^H factor arises explicitly from per-stage multiplicative accumulation of mini-batch size N across H steps, which is a direct consequence of the induction rather than an input. The derivation remains self-contained against external benchmarks such as standard RL regret techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review prevents exhaustive enumeration; the central claim rests on standard MDP assumptions plus the newly introduced risk-measure classes.

axioms (1)

domain assumption The problem is a finite-horizon Markov decision process.
Explicitly stated as the setting for the risk-averse problem and Q-learning method.

invented entities (2)

mini-batch Markov coherent risk measures no independent evidence
purpose: Special class of risk measures for the finite-horizon MDP setting.
Newly defined in the paper to enable the Q-learning approach.
multipattern risk-averse problems no independent evidence
purpose: Generalization of linear systems for Q-factor approximation.
Newly defined to support the feature-based approximation in the algorithm.

pith-pipeline@v0.9.0 · 5446 in / 1245 out tokens · 37942 ms · 2026-05-09T19:41:24.317213+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

[1]

Abbasi-Yadkori, D

Y. Abbasi-Yadkori, D. P \'a l, and C. Szepesv \'a ri. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011

work page 2011
[2]

Artzner, F

P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999

work page 1999
[3]

Artzner, F

P. Artzner, F. Delbaen, J.-M. Eber, D. Heath, and H. Ku. Coherent multiperiod risk adjusted values and B ellman's principle. Annals of Operations Research, 152: 0 5--22, 2007

work page 2007
[4]

P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008

work page 2008
[5]

Ayoub, Z

A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463--474. PMLR, 2020

work page 2020
[6]

M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263--272. PMLR, 2017

work page 2017
[7]

A. Basu, T. Bhattacharyya, and V. S. Borkar. A learning algorithm for risk-sensitive cost. Mathematics of Operations Research, 33 0 (4): 0 880--898, 2008

work page 2008
[8]

B \"a uerle and A

N. B \"a uerle and A. Glauner. Markov decision processes with recursive risk measures. European Journal of Operational Research, 296 0 (3): 0 953--966, 2022

work page 2022
[9]

V. Borkar. A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems & Control Letters, 44 0 (5): 0 339 -- 346, 2001

work page 2001
[10]

V. S. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27 0 (2): 0 294--311, 2002

work page 2002
[11]

S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22 0 (1-3): 0 33--57, 1996

work page 1996
[12]

Bubeck, N

S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

work page 2012
[13]

Cheridito and M

P. Cheridito and M. Kupper. Composition of time-consistent dynamic monetary risk measures in discrete time. International Journal of Theoretical and Applied Finance, 14 0 (01): 0 137--162, 2011

work page 2011
[14]

Cheridito, F

P. Cheridito, F. Delbaen, and M. Kupper. Dynamic monetary risk measures for bounded discrete-time processes. Electronic Journal of Probability, 11: 0 57--106, 2006

work page 2006
[15]

Chow and M

Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDP s. In Advances in Neural Information Processing Systems, pages 3509--3517, 2014

work page 2014
[16]

Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18 0 (1): 0 6070--6120, 2017

work page 2017
[17]

C. Dann, T. Lattimore, and E. Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017

work page 2017
[18]

Dentcheva and A

D. Dentcheva and A. Ruszczy \'n ski. Mini-batch risk forms. SIAM Journal on Optimization, 33 0 (2): 0 615--637, 2023

work page 2023
[19]

Dentcheva and A

D. Dentcheva and A. Ruszczy \'n ski. Risk-Averse Optimization and Control: Theory and Methods. Springer International Publishing, Cham, Switzerland, 2024

work page 2024
[20]

Derman, G

C. Derman, G. J. Lieberman, and S. M. Ross. A sequential stochastic assignment problem. Management Science, 18 0 (7): 0 349--355, 1972

work page 1972
[21]

S. Dong, B. Van Roy, and Z. Zhou. Provably efficient reinforcement learning with aggregated states. arXiv preprint arXiv:1912.06366, 2019

work page arXiv 1912
[22]

S. Du, S. Kakade, J. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL . In International Conference on Machine Learning, pages 2826--2836. PMLR, 2021

work page 2021
[23]

Fan and A

J. Fan and A. Ruszczy \'n ski. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, 191: 0 113--140, 2022

work page 2022
[24]

Fei and R

Y. Fei and R. Xu. Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning. In International Conference on Machine Learning, pages 6392--6417. PMLR, 2022

work page 2022
[25]

Y. Fei, Z. Yang, Y. Chen, Z. Wang, and Q. Xie. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33: 0 22384--22395, 2020

work page 2020
[26]

Y. Fei, Z. Yang, and Z. Wang. Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In International Conference on Machine Learning, pages 3198--3207. PMLR, 2021

work page 2021
[27]

Gelman, J

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. CRC Press, Boca Raton, FL, 3rd edition, 2013

work page 2013
[28]

Huang and W

W. Huang and W. B. Haskell. Risk-aware Q -learning for M arkov decision processes. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 4928--4933. IEEE, 2017

work page 2017
[29]

C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q -learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[30]

C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137--2143. PMLR, 2020

work page 2020
[31]

K\" o se and A

U. K\" o se and A. Ruszczy\'nski. Risk-averse learning by temporal difference methods with M arkov risk measures. Journal of Machine Learning Research, 22, 2021

work page 2021
[32]

T. Lam, A. Verma, B. K. H. Low, and P. Jaillet. Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[33]

Lattimore and M

T. Lattimore and M. Hutter. PAC bounds for discounted MDP s. In Algorithmic Learning Theory: 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings 23, pages 320--334. Springer, 2012

work page 2012
[34]

Lattimore and C

T. Lattimore and C. Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020

work page 2020
[35]

Lin and S

K. Lin and S. I. Marcus. Dynamic programming with non-convex risk-sensitive measures. In American Control Conference (ACC), 2013, pages 6778--6783. IEEE, 2013

work page 2013
[36]

W.-J. Ma, C. Oh, Y. Liu, D. Dentcheva, and M. M. Zavlanos. Risk-averse access point selection in wireless communication networks. IEEE Transactions on Control of Network Systems, 6 0 (1): 0 24--36, 2018

work page 2018
[37]

F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308--322. Springer, 2007

work page 2007
[38]

A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010--2020. PMLR, 2020

work page 2010
[39]

L. A. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive reinforcement learning. CoRR, abs/1403.6530, 2014. URL http://arxiv.org/abs/1403.6530

work page arXiv 2014
[40]

Ruszczy \'n ski

A. Ruszczy \'n ski. Risk-averse dynamic programming for M arkov decision processes. Mathematical Programming, 125 0 (2, Ser. B): 0 235--261, 2010

work page 2010
[41]

Ruszczy\'nski and A

A. Ruszczy\'nski and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31 0 (3): 0 433--452, 2006

work page 2006
[42]

Scandolo

G. Scandolo. Risk Measures in a Dynamic Setting. PhD thesis, Universit\` a degli Studi di Milano, Milan, Italy, 2003

work page 2003
[43]

Shapiro, D

A. Shapiro, D. Dentcheva, and A. Ruszczy\'nski. Lectures on S tochastic P rogramming: M odeling and T heory . SIAM, 2021

work page 2021
[44]

Y. Shen, W. Stannat, and K. Obermayer. Risk-sensitive M arkov control processes. SIAM Journal on Control and Optimization, 51 0 (5): 0 3652--3672, 2013

work page 2013
[45]

Sidford, M

A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[46]

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10 0 (11), 2009

work page 2009
[47]

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3 0 (1): 0 9--44, 1988

work page 1988
[48]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998

work page 1998
[49]

Tamar, D

A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance-related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387--396, 2012

work page 2012
[50]

Tamar, S

A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181--189, 2014

work page 2014
[51]

Tamar, Y

A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62 0 (7): 0 3323--3338, 2017

work page 2017
[52]

Tran-Thanh, A

L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1134--1140, 2012

work page 2012
[53]

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Fransactions on Automatic Control, 42 0 (5): 0 674--690, 1997

work page 1997
[54]

Yang and M

Z. Yang and M. Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746--10756. PMLR, 2020

work page 2020
[55]

M. Yin, Y. Duan, M. Wang, and Y.-X. Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. In International Conference on Learning Representation, 2022

work page 2022
[56]

P. Yu, W. B. Haskell, and H. Xu. Approximate value iteration for risk-aware M arkov decision processes. IEEE Transactions on Automatic Control, 63 0 (9): 0 3135--3142, 2018

work page 2018

[1] [1]

Abbasi-Yadkori, D

Y. Abbasi-Yadkori, D. P \'a l, and C. Szepesv \'a ri. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011

work page 2011

[2] [2]

Artzner, F

P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999

work page 1999

[3] [3]

Artzner, F

P. Artzner, F. Delbaen, J.-M. Eber, D. Heath, and H. Ku. Coherent multiperiod risk adjusted values and B ellman's principle. Annals of Operations Research, 152: 0 5--22, 2007

work page 2007

[4] [4]

P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008

work page 2008

[5] [5]

Ayoub, Z

A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463--474. PMLR, 2020

work page 2020

[6] [6]

M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263--272. PMLR, 2017

work page 2017

[7] [7]

A. Basu, T. Bhattacharyya, and V. S. Borkar. A learning algorithm for risk-sensitive cost. Mathematics of Operations Research, 33 0 (4): 0 880--898, 2008

work page 2008

[8] [8]

B \"a uerle and A

N. B \"a uerle and A. Glauner. Markov decision processes with recursive risk measures. European Journal of Operational Research, 296 0 (3): 0 953--966, 2022

work page 2022

[9] [9]

V. Borkar. A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems & Control Letters, 44 0 (5): 0 339 -- 346, 2001

work page 2001

[10] [10]

V. S. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27 0 (2): 0 294--311, 2002

work page 2002

[11] [11]

S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22 0 (1-3): 0 33--57, 1996

work page 1996

[12] [12]

Bubeck, N

S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning , 5 0 (1): 0 1--122, 2012

work page 2012

[13] [13]

Cheridito and M

P. Cheridito and M. Kupper. Composition of time-consistent dynamic monetary risk measures in discrete time. International Journal of Theoretical and Applied Finance, 14 0 (01): 0 137--162, 2011

work page 2011

[14] [14]

Cheridito, F

P. Cheridito, F. Delbaen, and M. Kupper. Dynamic monetary risk measures for bounded discrete-time processes. Electronic Journal of Probability, 11: 0 57--106, 2006

work page 2006

[15] [15]

Chow and M

Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDP s. In Advances in Neural Information Processing Systems, pages 3509--3517, 2014

work page 2014

[16] [16]

Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18 0 (1): 0 6070--6120, 2017

work page 2017

[17] [17]

C. Dann, T. Lattimore, and E. Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017

work page 2017

[18] [18]

Dentcheva and A

D. Dentcheva and A. Ruszczy \'n ski. Mini-batch risk forms. SIAM Journal on Optimization, 33 0 (2): 0 615--637, 2023

work page 2023

[19] [19]

Dentcheva and A

D. Dentcheva and A. Ruszczy \'n ski. Risk-Averse Optimization and Control: Theory and Methods. Springer International Publishing, Cham, Switzerland, 2024

work page 2024

[20] [20]

Derman, G

C. Derman, G. J. Lieberman, and S. M. Ross. A sequential stochastic assignment problem. Management Science, 18 0 (7): 0 349--355, 1972

work page 1972

[21] [21]

S. Dong, B. Van Roy, and Z. Zhou. Provably efficient reinforcement learning with aggregated states. arXiv preprint arXiv:1912.06366, 2019

work page arXiv 1912

[22] [22]

S. Du, S. Kakade, J. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL . In International Conference on Machine Learning, pages 2826--2836. PMLR, 2021

work page 2021

[23] [23]

Fan and A

J. Fan and A. Ruszczy \'n ski. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, 191: 0 113--140, 2022

work page 2022

[24] [24]

Fei and R

Y. Fei and R. Xu. Cascaded gaps: Towards logarithmic regret for risk-sensitive reinforcement learning. In International Conference on Machine Learning, pages 6392--6417. PMLR, 2022

work page 2022

[25] [25]

Y. Fei, Z. Yang, Y. Chen, Z. Wang, and Q. Xie. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33: 0 22384--22395, 2020

work page 2020

[26] [26]

Y. Fei, Z. Yang, and Z. Wang. Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In International Conference on Machine Learning, pages 3198--3207. PMLR, 2021

work page 2021

[27] [27]

Gelman, J

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. CRC Press, Boca Raton, FL, 3rd edition, 2013

work page 2013

[28] [28]

Huang and W

W. Huang and W. B. Haskell. Risk-aware Q -learning for M arkov decision processes. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 4928--4933. IEEE, 2017

work page 2017

[29] [29]

C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q -learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[30] [30]

C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137--2143. PMLR, 2020

work page 2020

[31] [31]

K\" o se and A

U. K\" o se and A. Ruszczy\'nski. Risk-averse learning by temporal difference methods with M arkov risk measures. Journal of Machine Learning Research, 22, 2021

work page 2021

[32] [32]

T. Lam, A. Verma, B. K. H. Low, and P. Jaillet. Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[33] [33]

Lattimore and M

T. Lattimore and M. Hutter. PAC bounds for discounted MDP s. In Algorithmic Learning Theory: 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings 23, pages 320--334. Springer, 2012

work page 2012

[34] [34]

Lattimore and C

T. Lattimore and C. Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020

work page 2020

[35] [35]

Lin and S

K. Lin and S. I. Marcus. Dynamic programming with non-convex risk-sensitive measures. In American Control Conference (ACC), 2013, pages 6778--6783. IEEE, 2013

work page 2013

[36] [36]

W.-J. Ma, C. Oh, Y. Liu, D. Dentcheva, and M. M. Zavlanos. Risk-averse access point selection in wireless communication networks. IEEE Transactions on Control of Network Systems, 6 0 (1): 0 24--36, 2018

work page 2018

[37] [37]

F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308--322. Springer, 2007

work page 2007

[38] [38]

A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010--2020. PMLR, 2020

work page 2010

[39] [39]

L. A. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive reinforcement learning. CoRR, abs/1403.6530, 2014. URL http://arxiv.org/abs/1403.6530

work page arXiv 2014

[40] [40]

Ruszczy \'n ski

A. Ruszczy \'n ski. Risk-averse dynamic programming for M arkov decision processes. Mathematical Programming, 125 0 (2, Ser. B): 0 235--261, 2010

work page 2010

[41] [41]

Ruszczy\'nski and A

A. Ruszczy\'nski and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31 0 (3): 0 433--452, 2006

work page 2006

[42] [42]

Scandolo

G. Scandolo. Risk Measures in a Dynamic Setting. PhD thesis, Universit\` a degli Studi di Milano, Milan, Italy, 2003

work page 2003

[43] [43]

Shapiro, D

A. Shapiro, D. Dentcheva, and A. Ruszczy\'nski. Lectures on S tochastic P rogramming: M odeling and T heory . SIAM, 2021

work page 2021

[44] [44]

Y. Shen, W. Stannat, and K. Obermayer. Risk-sensitive M arkov control processes. SIAM Journal on Control and Optimization, 51 0 (5): 0 3652--3672, 2013

work page 2013

[45] [45]

Sidford, M

A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[46] [46]

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10 0 (11), 2009

work page 2009

[47] [47]

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3 0 (1): 0 9--44, 1988

work page 1988

[48] [48]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998

work page 1998

[49] [49]

Tamar, D

A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance-related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387--396, 2012

work page 2012

[50] [50]

Tamar, S

A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181--189, 2014

work page 2014

[51] [51]

Tamar, Y

A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62 0 (7): 0 3323--3338, 2017

work page 2017

[52] [52]

Tran-Thanh, A

L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1134--1140, 2012

work page 2012

[53] [53]

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Fransactions on Automatic Control, 42 0 (5): 0 674--690, 1997

work page 1997

[54] [54]

Yang and M

Z. Yang and M. Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746--10756. PMLR, 2020

work page 2020

[55] [55]

M. Yin, Y. Duan, M. Wang, and Y.-X. Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. In International Conference on Learning Representation, 2022

work page 2022

[56] [56]

P. Yu, W. B. Haskell, and H. Xu. Approximate value iteration for risk-aware M arkov decision processes. IEEE Transactions on Automatic Control, 63 0 (9): 0 3135--3142, 2018

work page 2018