Recognition: no theorem link
Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Pith reviewed 2026-05-11 02:06 UTC · model grok-4.3
The pith
Two Q-value operators for exponential utility in discounted MDPs are contractions and induce optimal greedy policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Building on the Bellman-type equation for exponential utility, we derive two Q-value-style extensions whose operators are contractions in the L_infinity and sup-log/Thompson metrics. Their fixed points characterize the optimal exponential-utility values, and the policies obtained by greedily maximizing these fixed-point Q-functions are optimal among stationary policies. The structural results yield a two-timescale Q-learning algorithm with almost-sure convergence and finite-time rates via timescale separation, plus a one-timescale algorithm whose convergence follows from local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, with a scalar finite-time bound highlighting the challenges of obtaining rates in the vector case.
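For orientation, the objective in question is the exponential-utility (certainty-equivalent) value of the discounted return. A minimal Monte Carlo sketch, assuming one common sign convention (the paper's exact convention may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

def exp_utility_return(rewards, beta, gamma):
    """Certainty equivalent of the discounted return under exponential
    utility, in one common convention (signs vary across the literature):
    J = (1 / beta) * log E[exp(beta * sum_t gamma^t r_t)]."""
    discounts = gamma ** np.arange(rewards.shape[1])
    returns = rewards @ discounts                 # per-rollout discounted return
    return np.log(np.mean(np.exp(beta * returns))) / beta

# Monte Carlo rollouts with i.i.d. Gaussian rewards: beta < 0 penalizes
# return variance (risk-averse), beta > 0 rewards it, and beta -> 0
# recovers the ordinary expected discounted return.
rewards = rng.normal(1.0, 0.5, size=(100_000, 50))
for beta in (-1.0, -0.1, 0.1, 1.0):
    print(f"beta={beta:+.1f}  J={exp_utility_return(rewards, beta, 0.9):.3f}")
```

With deterministic rewards the certainty equivalent collapses to the plain discounted sum for every beta, which is a quick sanity check on the convention.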
What carries the argument
Two Q-value extensions of the exponential-utility Bellman equation, proven to be contraction mappings in the L_infinity and sup-log/Thompson metrics respectively.
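As a toy illustration of why the sup-log/Thompson metric pairs naturally with multiplicative, power-law operators (the map below is a stand-in, not the paper's operator): for T(x) = c·x^p with 0 < p < 1 on the positive orthant, d_T(Tx, Ty) = p·d_T(x, y), where d_T(x, y) = ||log x − log y||_∞.

```python
import numpy as np

def thompson(x, y):
    """Thompson (sup-log) metric on the positive orthant."""
    return np.max(np.abs(np.log(x) - np.log(y)))

def T(x, c=2.0, p=0.7):
    """Sublinear power-law map; p < 1 makes it a Thompson contraction."""
    return c * x ** p

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.uniform(0.1, 10.0, size=5)
    y = rng.uniform(0.1, 10.0, size=5)
    # d_T(Tx, Ty) = p * d_T(x, y) exactly for this map (up to rounding)
    assert thompson(T(x), T(y)) <= 0.7 * thompson(x, y) + 1e-12
print("power-law map contracts the Thompson metric with modulus p = 0.7")
```

Note that this map is not a contraction in the plain sup norm for large c, which is one reason a change of metric can be load-bearing.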
If this is right
- The greedy policy extracted from either operator's fixed point is optimal for exponential utility among stationary policies.
- A two-timescale stochastic approximation algorithm converges almost surely to the optimal Q-function with explicit finite-time rates obtained by timescale separation.
- A one-timescale algorithm governed by a sublinear power-law operator converges via arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives.
- Finite-time convergence rates are available in the scalar case of the one-timescale algorithm.
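The two-timescale idea behind the first algorithm can be sketched generically (a hypothetical skeleton under standard step-size assumptions, not the paper's actual updates): the fast iterate runs on a larger step size and equilibrates for the current slow iterate, which in turn moves slowly enough to see an effectively converged fast variable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedged, generic two-timescale stochastic-approximation skeleton:
# fast iterate y tracks its equilibrium y*(x) = x on step size ~ t^-0.6;
# slow iterate x then solves 1 - y*(x) = 0 on step size ~ t^-0.9,
# so beta_t / alpha_t = t^-0.3 -> 0 (timescale separation).
x, y = 0.0, 0.0
for t in range(1, 50_001):
    fast = t ** -0.6          # larger steps: y tracks x quickly
    slow = t ** -0.9          # smaller steps: x sees y as equilibrated
    y += fast * ((x - y) + 0.1 * rng.standard_normal())
    x += slow * ((1.0 - y) + 0.1 * rng.standard_normal())

print(f"x={x:.2f}  y={y:.2f}")  # both approach 1.0
```

In the Q-learning instantiation the fast iterate would play the role of an auxiliary quantity (e.g. a normalization or log-value term) and the slow iterate the Q-table itself, but that mapping is an assumption here, not taken from the paper.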
Where Pith is reading between the lines
- The contraction results may allow similar value-based methods to be derived for other risk-sensitive criteria that admit multiplicative or log-based Bellman equations.
- The one-timescale algorithm could be preferable in online settings where maintaining separate fast and slow timescales is impractical.
- The metrics used suggest that exponential utility may interact productively with multiplicative update rules common in certain risk or robustness problems.
- The finite-time analysis for the scalar case points to a possible direction for obtaining vector convergence rates by refining the local-Lipschitz argument.
Load-bearing premise
The two new operators remain contractions in their respective metrics for any fixed risk-aversion parameter and for the class of MDPs under consideration.
What would settle it
A concrete counter-example MDP and risk-aversion value for which either operator fails to be a contraction, or for which the induced greedy policy is suboptimal for the exponential-utility objective.
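Such a counter-example could also be hunted numerically. A hedged sketch (the affine map below is a stand-in; one would substitute the paper's operators and metrics): estimate the contraction modulus as the worst observed distance ratio over sampled pairs, flagging any ratio at or above one as a candidate failure.

```python
import numpy as np

def contraction_modulus(T, metric, dim, trials=5000, seed=0):
    """Empirically estimate sup d(Tx, Ty) / d(x, y) over random pairs.
    A value >= 1 flags a candidate counter-example pair."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x = rng.uniform(0.1, 10.0, size=dim)
        y = rng.uniform(0.1, 10.0, size=dim)
        d = metric(x, y)
        if d > 1e-9:
            worst = max(worst, metric(T(x), T(y)) / d)
    return worst

sup_metric = lambda x, y: np.max(np.abs(x - y))
# Stand-in operator: a 0.9-discounted affine map, whose true modulus is 0.9.
T = lambda x: 1.0 + 0.9 * x
print(contraction_modulus(T, sup_metric, dim=4))  # ~0.9
```

Random sampling can only ever refute a contraction claim, never confirm it, but a ratio exceeding one on any sampled pair would settle the question in the negative.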
Original abstract
Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in \cite{porteus1975optimality}, we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning--style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives two Q-value-style operators extending the Bellman equation for exponential utility (Porteus 1975) in discounted MDPs under fixed risk aversion. It proves these operators are contractions in the L_∞ and sup-log/Thompson metrics, characterizes their fixed points, shows the induced greedy stationary policy is optimal among stationary policies, and presents two model-free algorithms: a two-timescale Q-learning variant with almost-sure convergence and finite-time rates via timescale separation, plus a one-timescale algorithm whose convergence is shown via local Lipschitzness, monotonicity, homogeneity, and Dini derivatives (with scalar finite-time analysis).
Significance. If the contraction properties and convergence arguments hold, the work fills a clear gap by providing principled value-based RL methods for exponential-utility objectives. The structural results on optimality and the careful treatment of the non-contracting one-timescale case contribute technically to risk-sensitive RL theory, enabling future algorithm development in this setting.
Major comments (2)
- [§4] §4 (one-timescale algorithm): The convergence claim for the vector case rests on delicate local arguments (Lipschitzness, monotonicity, homogeneity, Dini derivatives) rather than a global contraction; only scalar finite-time analysis is supplied, which is load-bearing for the practical utility of the rates and requires explicit verification that the vector extension does not introduce counterexamples.
- [§3.2] §3.2 (sup-log/Thompson contraction): The proof that the second operator is a contraction in the Thompson metric relies on homogeneity; this step is central to the fixed-point characterization and optimality result, yet the dependence on the fixed risk-aversion parameter should be shown explicitly to rule out parameter-specific failures.
Minor comments (2)
- [Introduction] The introduction would benefit from a short paragraph recalling the definition of the sup-log and Thompson metrics before their use in the operator analysis.
- Notation for the risk-aversion parameter should be introduced once and used consistently across operator definitions and convergence statements.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. The positive assessment of the work's significance is appreciated. We address each major comment below and will make the indicated revisions in the next version of the manuscript.
Point-by-point responses
- Referee: [§4] §4 (one-timescale algorithm): The convergence claim for the vector case rests on delicate local arguments (Lipschitzness, monotonicity, homogeneity, Dini derivatives) rather than a global contraction; only scalar finite-time analysis is supplied, which is load-bearing for the practical utility of the rates and requires explicit verification that the vector extension does not introduce counterexamples.
Authors: The almost-sure convergence of the one-timescale algorithm is established for the general (vector) case using the local properties of the operator, which are proven to hold in the vector setting via the definitions of local Lipschitzness, monotonicity, homogeneity, and Dini derivatives in the sup norm. These properties are dimension-independent and apply directly to the Q-function vectors. The scalar finite-time analysis is supplied explicitly to provide concrete rates while underscoring the technical obstacles to vector rates, consistent with the manuscript's discussion of these challenges. We will add a clarifying remark in §4 stating that the local arguments suffice for almost-sure convergence in the vector case and that the operator properties preclude counterexamples in the vector extension, as they are verified without scalar restrictions.
Revision: partial
- Referee: [§3.2] §3.2 (sup-log/Thompson contraction): The proof that the second operator is a contraction in the Thompson metric relies on homogeneity; this step is central to the fixed-point characterization and optimality result, yet the dependence on the fixed risk-aversion parameter should be shown explicitly to rule out parameter-specific failures.
Authors: We agree that explicit tracking of the risk-aversion parameter strengthens the argument. The homogeneity property and the resulting contraction in the Thompson metric hold for any fixed risk-aversion parameter γ > 0, with the contraction modulus depending on γ through the exponential utility scaling. We will revise the proof in §3.2 to explicitly insert the dependence on γ in the homogeneity step and the contraction factor, confirming that the result holds uniformly for the fixed-γ setting of the paper without parameter-specific failures.
Revision: yes
Circularity Check
No significant circularity identified
Full rationale
The derivation begins from the external Bellman-type equation in the cited Porteus 1975 work (different authors), then independently derives two new Q-operators, proves they are contractions in L_infty and sup-log/Thompson metrics, characterizes their fixed points, establishes optimality of the induced greedy policy among stationary policies, and proves almost-sure convergence (with rates via timescale separation for the two-timescale algorithm and local Lipschitz/monotonicity arguments for the one-timescale case). None of these steps reduce by construction to the inputs or to self-citations; the central claims rest on explicit operator analysis and convergence proofs that are not tautological with the starting equation or any fitted parameters.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the MDP is discounted, with discount factor strictly less than one.
- Domain assumption: the risk-aversion parameter is fixed and known.