pith. machine review for the scientific record.

arxiv: 2605.08053 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · exponential utility · risk-sensitive RL · Q-learning · Markov decision processes · contraction mapping · convergence analysis

The pith

Two Q-value operators for exponential utility in discounted MDPs are contractions and induce optimal greedy policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the lack of principled value-based methods for reinforcement learning under exponential utility in discounted Markov decision processes. It extends the known Bellman-type equation for this risk-sensitive objective into two Q-function versions and proves both operators are contractions, one under the standard L_∞ metric and the other under a sup-log/Thompson metric. The fixed points of these operators yield greedy stationary policies that are optimal for the exponential-utility criterion among all stationary policies. These results directly produce two model-free algorithms, one using two-timescale updates with almost-sure convergence and finite-time rates, and another single-timescale method whose convergence is shown via local Lipschitz and monotonicity arguments rather than global contraction.
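To fix ideas, here is one plausible rendering of the two operators, reconstructed from the abstract's wording (a log-scale operator for the L_∞ contraction, a multiplicative power-law operator for the Thompson contraction); the paper's exact definitions may differ. With β the fixed risk-aversion parameter and γ ∈ (0,1) the discount factor:

\[
(T_1 Q)(s,a) \;=\; r(s,a) \;+\; \frac{\gamma}{\beta}\,\log \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\left[\exp\!\Big(\beta \max_{a'} Q(s',a')\Big)\right],
\]
\[
(T_2 Q)(s,a) \;=\; e^{\beta r(s,a)}\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\left[\Big(\max_{a'} Q(s',a')\Big)^{\gamma}\right].
\]

The exponent γ < 1 in T_2 is what would make it the "sublinear power-law operator" of the abstract: not globally Lipschitz near zero in L_∞, yet a natural candidate for a Thompson-metric contraction.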

Core claim

Building on the Bellman-type equation for exponential utility, we derive two Q-value-style extensions whose operators are contractions in the L_∞ and sup-log/Thompson metrics. Their fixed points characterize the optimal exponential-utility values, and the policies obtained by greedily maximizing these fixed-point Q-functions are optimal among stationary policies. The structural results yield a two-timescale Q-learning algorithm with almost-sure convergence and finite-time rates via timescale separation, plus a one-timescale algorithm whose convergence follows from local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, with a scalar finite-time bound highlighting the obstacles to vector-case rates.

What carries the argument

Two Q-value extensions of the exponential-utility Bellman equation, proven to be contraction mappings in the L_∞ and sup-log/Thompson metrics respectively.
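Stated for the reconstructed operators above (same caveat: an assumption, not the paper's verbatim setup), the two contraction mechanisms are elementary. With the Thompson (sup-log) metric d_T(Q, Q') = \|\log Q - \log Q'\|_\infty on positive Q-functions:

\[
\|T_1 Q - T_1 Q'\|_\infty \;\le\; \gamma\,\|Q - Q'\|_\infty,
\qquad
d_T(T_2 Q,\, T_2 Q') \;\le\; \gamma\, d_T(Q,\, Q').
\]

The first holds because log-sum-exp is 1-Lipschitz in the sup norm; the second because x ↦ x^γ scales d_T by exactly γ, while positive linear averaging, pointwise maximization, and multiplication by the fixed positive factor e^{βr} are all Thompson-nonexpansive.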

If this is right

  • The greedy policy extracted from either operator's fixed point is optimal for exponential utility among stationary policies.
  • A two-timescale stochastic approximation algorithm converges almost surely to the optimal Q-function with explicit finite-time rates obtained by timescale separation (a minimal update sketch follows this list).
  • A one-timescale algorithm governed by a sublinear power-law operator converges via arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives.
  • Finite-time convergence rates are available in the scalar case of the one-timescale algorithm.
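As a concrete rendering of the two-timescale pattern in the second bullet, the sketch below is hypothetical: it assumes the T_1-type operator from the pith above, and all function names and stepsizes are illustrative, not the paper's. A fast iterate w tracks the exponentiated next-state value; a slow iterate Q consumes its logarithm.

  import numpy as np

  # Hypothetical two-timescale update for exponential-utility Q-learning,
  # assuming the T_1-type operator sketched earlier (beta: risk aversion,
  # gamma: discount). Not the paper's algorithm.
  def two_timescale_step(Q, w, s, a, r, s_next, t, beta=0.5, gamma=0.9):
      a_fast = 1.0 / (t + 1.0) ** 0.6   # fast stepsize
      a_slow = 1.0 / (t + 1.0)          # slow stepsize: a_slow / a_fast -> 0
      # Fast timescale: track E[exp(beta * max_a' Q(s', a'))] at (s, a).
      target = np.exp(beta * Q[s_next].max())
      w[s, a] += a_fast * (target - w[s, a])
      # Slow timescale: Bellman-style update against the quasi-static
      # estimate w; w must stay positive for the log to be defined.
      Q[s, a] += a_slow * (r + (gamma / beta) * np.log(w[s, a]) - Q[s, a])
      return Q, w

Initializing w to ones keeps the logarithm defined from the first step; the ratio a_slow / a_fast → 0 is what lets the slow update treat w as an equilibrated estimate, which is the separation the finite-time rates exploit.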

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contraction results may allow similar value-based methods to be derived for other risk-sensitive criteria that admit multiplicative or log-based Bellman equations.
  • The one-timescale algorithm could be preferable in online settings where maintaining separate fast and slow timescales is impractical.
  • The metrics used suggest that exponential utility may interact productively with multiplicative update rules common in certain risk or robustness problems.
  • The finite-time analysis for the scalar case points to a possible direction for obtaining vector convergence rates by refining the local-Lipschitz argument.

Load-bearing premise

The two new operators remain contractions in their respective metrics for any fixed risk-aversion parameter and for the class of MDPs under consideration.

What would settle it

A concrete counter-example MDP and risk-aversion value for which either operator fails to be a contraction, or for which the induced greedy policy is suboptimal for the exponential-utility objective.
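Short of a hand-built counterexample, a cheap numerical probe of the premise is possible. The sketch below samples random discounted MDPs and empirically bounds the Thompson-metric contraction factor of the reconstructed power-law operator; the operator form is an assumption carried over from the pith, not the paper's verbatim definition.

  import numpy as np

  rng = np.random.default_rng(0)

  def thompson(x, y):
      # sup-log / Thompson distance between positive arrays
      return np.abs(np.log(x) - np.log(y)).max()

  def T(Q, P, R, beta, gamma):
      # Assumed operator: (TQ)(s,a) = exp(beta*r(s,a)) * E_s'[(max_a' Q(s',a'))^gamma]
      V = Q.max(axis=1) ** gamma
      return np.exp(beta * R) * np.einsum("sat,t->sa", P, V)

  S, A, beta, gamma = 5, 3, 0.5, 0.9
  P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel P[s, a, s']
  R = rng.uniform(0.0, 1.0, size=(S, A))      # bounded rewards

  worst = 0.0
  for _ in range(1000):
      Q1 = rng.uniform(0.1, 10.0, size=(S, A))
      Q2 = rng.uniform(0.1, 10.0, size=(S, A))
      worst = max(worst, thompson(T(Q1, P, R, beta, gamma),
                                  T(Q2, P, R, beta, gamma)) / thompson(Q1, Q2))
  print(f"worst empirical contraction factor: {worst:.3f} (theory: <= gamma = {gamma})")

A ratio exceeding γ on any sampled MDP would be exactly the counter-example described above; a sweep over β and γ values would stress the parameter uniformity the referee asks about.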

read the original abstract

Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives two Q-value-style operators extending the Bellman equation for exponential utility (Porteus 1975) in discounted MDPs under fixed risk aversion. It proves these operators are contractions in the L_∞ and sup-log/Thompson metrics, characterizes their fixed points, shows the induced greedy stationary policy is optimal among stationary policies, and presents two model-free algorithms: a two-timescale Q-learning variant with almost-sure convergence and finite-time rates via timescale separation, plus a one-timescale algorithm whose convergence is shown via local Lipschitzness, monotonicity, homogeneity, and Dini derivatives (with scalar finite-time analysis).

Significance. If the contraction properties and convergence arguments hold, the work fills a clear gap by providing principled value-based RL methods for exponential-utility objectives. The structural results on optimality and the careful treatment of the non-contracting one-timescale case contribute technically to risk-sensitive RL theory, enabling future algorithm development in this setting.

major comments (2)
  1. [§4] §4 (one-timescale algorithm): The convergence claim for the vector case rests on delicate local arguments (Lipschitzness, monotonicity, homogeneity, Dini derivatives) rather than a global contraction, and only a scalar finite-time analysis is supplied. Because these rates are load-bearing for the practical utility of the algorithms, the vector extension requires explicit verification that it introduces no counterexamples.
  2. [§3.2] §3.2 (sup-log/Thompson contraction): The proof that the second operator is a contraction in the Thompson metric relies on homogeneity; this step is central to the fixed-point characterization and optimality result, yet the dependence on the fixed risk-aversion parameter should be shown explicitly to rule out parameter-specific failures.
minor comments (2)
  1. [Introduction] The introduction would benefit from a short paragraph recalling the definition of the sup-log/Thompson metric before its use in the operator analysis.
  2. Notation for the risk-aversion parameter should be introduced once and used consistently across operator definitions and convergence statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The positive assessment of the work's significance is appreciated. We address each major comment below and will make the indicated revisions in the next version of the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (one-timescale algorithm): The convergence claim for the vector case rests on delicate local arguments (Lipschitzness, monotonicity, homogeneity, Dini derivatives) rather than a global contraction, and only a scalar finite-time analysis is supplied. Because these rates are load-bearing for the practical utility of the algorithms, the vector extension requires explicit verification that it introduces no counterexamples.

    Authors: The almost-sure convergence of the one-timescale algorithm is established for the general (vector) case using the local properties of the operator, which are proven to hold in the vector setting via the definitions of local Lipschitzness, monotonicity, homogeneity, and Dini derivatives in the sup norm. These properties are dimension-independent and apply directly to the Q-function vectors. The scalar finite-time analysis is supplied explicitly to provide concrete rates while underscoring the technical obstacles to vector rates, consistent with the manuscript's discussion of these challenges. We will add a clarifying remark in §4 stating that the local arguments suffice for almost-sure convergence in the vector case and that the operator properties preclude counterexamples in the vector extension, as they are verified without scalar restrictions (a sketch of this Lyapunov/Dini mechanism follows the responses). revision: partial

  2. Referee: [§3.2] §3.2 (sup-log/Thompson contraction): The proof that the second operator is a contraction in the Thompson metric relies on homogeneity; this step is central to the fixed-point characterization and optimality result, yet the dependence on the fixed risk-aversion parameter should be shown explicitly to rule out parameter-specific failures.

    Authors: We agree that explicit tracking of the risk-aversion parameter strengthens the argument. The homogeneity property and the resulting contraction in the Thompson metric hold for any fixed risk-aversion parameter β > 0, with the contraction modulus depending on β through the exponential utility scaling. We will revise the proof in §3.2 to explicitly insert the dependence on β in the homogeneity step and the contraction factor, confirming that the result holds uniformly for the fixed-β setting of the paper without parameter-specific failures. revision: yes
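As a reconstruction of the mechanism invoked in response 1 (a standard nonlinear Perron-Frobenius pattern, not necessarily the paper's exact proof): if T is monotone and γ-homogeneous on positive vectors, hence a Thompson-metric contraction with modulus γ, then along the ODE limit of the one-timescale iteration, \dot{Q} = T(Q) - Q, the distance to the fixed point Q^* serves as a Lyapunov function with

\[
D^{+}\, d_T\big(Q(t),\, Q^{*}\big) \;\le\; -(1-\gamma)\, d_T\big(Q(t),\, Q^{*}\big),
\]

where D^{+} is the upper Dini derivative, needed because the sup-log metric is not smooth. This gives global asymptotic convergence with no global L_∞ contraction anywhere in the argument.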

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The derivation begins from the external Bellman-type equation in the cited Porteus 1975 work (by different authors), then independently derives two new Q-operators, proves they are contractions in the L_∞ and sup-log/Thompson metrics, characterizes their fixed points, establishes optimality of the induced greedy policy among stationary policies, and proves almost-sure convergence (with rates via timescale separation for the two-timescale algorithm and local Lipschitz/monotonicity arguments for the one-timescale case). None of these steps reduce by construction to the inputs or to self-citations; the central claims rest on explicit operator analysis and convergence proofs that are not tautological with the starting equation or any fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard discounted-MDP assumptions and the Bellman-type equation of Porteus (1975); no new free parameters are fitted and no invented entities are introduced.

axioms (2)
  • domain assumption The MDP is discounted with discount factor strictly less than one
    Required for the contraction property and almost-sure convergence in infinite-horizon settings.
  • domain assumption Risk aversion parameter is fixed and known
    The paper explicitly restricts to the fixed risk-aversion setting.

pith-pipeline@v0.9.0 · 5524 in / 1382 out tokens · 30976 ms · 2026-05-11T02:06:43.795122+00:00 · methodology

