pith. machine review for the scientific record.

arxiv: 2605.08053 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · exponential utility · risk-sensitive RL · Q-learning · Markov decision processes · contraction mapping · convergence analysis

The pith

Two Q-value operators for exponential utility in discounted MDPs are contractions and induce optimal greedy policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the lack of principled value-based methods for reinforcement learning under exponential utility in discounted Markov decision processes. It extends the known Bellman-type equation for this risk-sensitive objective into two Q-function versions and proves both operators are contractions, one under the standard L_∞ metric and the other under a sup-log/Thompson metric. The fixed points of these operators yield greedy stationary policies that are optimal for the exponential-utility criterion among all stationary policies. These results directly produce two model-free algorithms, one using two-timescale updates with almost-sure convergence and finite-time rates, and another single-timescale method whose convergence is shown via local Lipschitz and monotonicity arguments rather than global contraction.
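To fix ideas, here is one plausible rendering of the two operators, reconstructed from the abstract's wording (a log-scale operator for the L_∞ contraction, a multiplicative power-law operator for the Thompson contraction); the paper's exact definitions may differ. With β the fixed risk-aversion parameter and γ ∈ (0,1) the discount factor:

\[
(T_1 Q)(s,a) \;=\; r(s,a) \;+\; \frac{\gamma}{\beta}\,\log \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\left[\exp\!\Big(\beta \max_{a'} Q(s',a')\Big)\right],
\]
\[
(T_2 Q)(s,a) \;=\; e^{\beta r(s,a)}\,\mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\left[\Big(\max_{a'} Q(s',a')\Big)^{\gamma}\right].
\]

The exponent γ < 1 in T_2 is what would make it the "sublinear power-law operator" of the abstract: not globally Lipschitz near zero in L_∞, yet a natural candidate for a Thompson-metric contraction.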

Core claim

Building on the Bellman-type equation for exponential utility, we derive two Q-value-style extensions whose operators are contractions in the L_∞ and sup-log/Thompson metrics. Their fixed points characterize the optimal exponential-utility values, and the policies obtained by greedily maximizing these fixed-point Q-functions are optimal among stationary policies. The structural results yield a two-timescale Q-learning algorithm with almost-sure convergence and finite-time rates via timescale separation, plus a one-timescale algorithm whose convergence follows from local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, with a scalar finite-time bound highlighting the obstacles to vector-case rates.

What carries the argument

Two Q-value extensions of the exponential-utility Bellman equation, proven to be contraction mappings in the L_∞ and sup-log/Thompson metrics respectively.
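Stated for the reconstructed operators above (same caveat: an assumption, not the paper's verbatim setup), the two contraction mechanisms are elementary. With the Thompson (sup-log) metric d_T(Q, Q') = \|\log Q - \log Q'\|_\infty on positive Q-functions:

\[
\|T_1 Q - T_1 Q'\|_\infty \;\le\; \gamma\,\|Q - Q'\|_\infty,
\qquad
d_T(T_2 Q,\, T_2 Q') \;\le\; \gamma\, d_T(Q,\, Q').
\]

The first holds because log-sum-exp is 1-Lipschitz in the sup norm; the second because x ↦ x^γ scales d_T by exactly γ, while positive linear averaging, pointwise maximization, and multiplication by the fixed positive factor e^{βr} are all Thompson-nonexpansive.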

If this is right

  • The greedy policy extracted from either operator's fixed point is optimal for exponential utility among stationary policies.
  • A two-timescale stochastic approximation algorithm converges almost surely to the optimal Q-function with explicit finite-time rates obtained by timescale separation (a minimal update sketch follows this list).
  • A one-timescale algorithm governed by a sublinear power-law operator converges via arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives.
  • Finite-time convergence rates are available in the scalar case of the one-timescale algorithm.
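As a concrete rendering of the two-timescale pattern in the second bullet, the sketch below is hypothetical: it assumes the T_1-type operator from the pith above, and all function names and stepsizes are illustrative, not the paper's. A fast iterate w tracks the exponentiated next-state value; a slow iterate Q consumes its logarithm.

  import numpy as np

  # Hypothetical two-timescale update for exponential-utility Q-learning,
  # assuming the T_1-type operator sketched earlier (beta: risk aversion,
  # gamma: discount). Not the paper's algorithm.
  def two_timescale_step(Q, w, s, a, r, s_next, t, beta=0.5, gamma=0.9):
      a_fast = 1.0 / (t + 1.0) ** 0.6   # fast stepsize
      a_slow = 1.0 / (t + 1.0)          # slow stepsize: a_slow / a_fast -> 0
      # Fast timescale: track E[exp(beta * max_a' Q(s', a'))] at (s, a).
      target = np.exp(beta * Q[s_next].max())
      w[s, a] += a_fast * (target - w[s, a])
      # Slow timescale: Bellman-style update against the quasi-static
      # estimate w; w must stay positive for the log to be defined.
      Q[s, a] += a_slow * (r + (gamma / beta) * np.log(w[s, a]) - Q[s, a])
      return Q, w

Initializing w to ones keeps the logarithm defined from the first step; the ratio a_slow / a_fast → 0 is what lets the slow update treat w as an equilibrated estimate, which is the separation the finite-time rates exploit.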

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contraction results may allow similar value-based methods to be derived for other risk-sensitive criteria that admit multiplicative or log-based Bellman equations.
  • The one-timescale algorithm could be preferable in online settings where maintaining separate fast and slow timescales is impractical.
  • The metrics used suggest that exponential utility may interact productively with multiplicative update rules common in certain risk or robustness problems.
  • The finite-time analysis for the scalar case points to a possible direction for obtaining vector convergence rates by refining the local-Lipschitz argument.

Load-bearing premise

The two new operators remain contractions in their respective metrics for any fixed risk-aversion parameter and for the class of MDPs under consideration.

What would settle it

A concrete counter-example MDP and risk-aversion value for which either operator fails to be a contraction, or for which the induced greedy policy is suboptimal for the exponential-utility objective.
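Short of a hand-built counterexample, a cheap numerical probe of the premise is possible. The sketch below samples random discounted MDPs and empirically bounds the Thompson-metric contraction factor of the reconstructed power-law operator; the operator form is an assumption carried over from the pith, not the paper's verbatim definition.

  import numpy as np

  rng = np.random.default_rng(0)

  def thompson(x, y):
      # sup-log / Thompson distance between positive arrays
      return np.abs(np.log(x) - np.log(y)).max()

  def T(Q, P, R, beta, gamma):
      # Assumed operator: (TQ)(s,a) = exp(beta*r(s,a)) * E_s'[(max_a' Q(s',a'))^gamma]
      V = Q.max(axis=1) ** gamma
      return np.exp(beta * R) * np.einsum("sat,t->sa", P, V)

  S, A, beta, gamma = 5, 3, 0.5, 0.9
  P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel P[s, a, s']
  R = rng.uniform(0.0, 1.0, size=(S, A))      # bounded rewards

  worst = 0.0
  for _ in range(1000):
      Q1 = rng.uniform(0.1, 10.0, size=(S, A))
      Q2 = rng.uniform(0.1, 10.0, size=(S, A))
      worst = max(worst, thompson(T(Q1, P, R, beta, gamma),
                                  T(Q2, P, R, beta, gamma)) / thompson(Q1, Q2))
  print(f"worst empirical contraction factor: {worst:.3f} (theory: <= gamma = {gamma})")

A ratio exceeding γ on any sampled MDP would be exactly the counter-example described above; a sweep over β and γ values would stress the parameter uniformity the referee asks about.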

read the original abstract

Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives two Q-value-style operators extending the Bellman equation for exponential utility (Porteus 1975) in discounted MDPs under fixed risk aversion. It proves these operators are contractions in the L_∞ and sup-log/Thompson metrics, characterizes their fixed points, shows the induced greedy stationary policy is optimal among stationary policies, and presents two model-free algorithms: a two-timescale Q-learning variant with almost-sure convergence and finite-time rates via timescale separation, plus a one-timescale algorithm whose convergence is shown via local Lipschitzness, monotonicity, homogeneity, and Dini derivatives (with scalar finite-time analysis).

Significance. If the contraction properties and convergence arguments hold, the work fills a clear gap by providing principled value-based RL methods for exponential-utility objectives. The structural results on optimality and the careful treatment of the non-contracting one-timescale case contribute technically to risk-sensitive RL theory, enabling future algorithm development in this setting.

major comments (2)
  1. [§4] §4 (one-timescale algorithm): The convergence claim for the vector case rests on delicate local arguments (Lipschitzness, monotonicity, homogeneity, Dini derivatives) rather than a global contraction, and only a scalar finite-time analysis is supplied. Because these rates are load-bearing for the practical utility of the algorithms, the vector extension requires explicit verification that it introduces no counterexamples.
  2. [§3.2] §3.2 (sup-log/Thompson contraction): The proof that the second operator is a contraction in the Thompson metric relies on homogeneity; this step is central to the fixed-point characterization and optimality result, yet the dependence on the fixed risk-aversion parameter should be shown explicitly to rule out parameter-specific failures.
minor comments (2)
  1. [Introduction] The introduction would benefit from a short paragraph recalling the definition of the sup-log/Thompson metric before its use in the operator analysis.
  2. Notation for the risk-aversion parameter should be introduced once and used consistently across operator definitions and convergence statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The positive assessment of the work's significance is appreciated. We address each major comment below and will make the indicated revisions in the next version of the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (one-timescale algorithm): The convergence claim for the vector case rests on delicate local arguments (Lipschitzness, monotonicity, homogeneity, Dini derivatives) rather than a global contraction, and only a scalar finite-time analysis is supplied. Because these rates are load-bearing for the practical utility of the algorithms, the vector extension requires explicit verification that it introduces no counterexamples.

    Authors: The almost-sure convergence of the one-timescale algorithm is established for the general (vector) case using the local properties of the operator, which are proven to hold in the vector setting via the definitions of local Lipschitzness, monotonicity, homogeneity, and Dini derivatives in the sup norm. These properties are dimension-independent and apply directly to the Q-function vectors. The scalar finite-time analysis is supplied explicitly to provide concrete rates while underscoring the technical obstacles to vector rates, consistent with the manuscript's discussion of these challenges. We will add a clarifying remark in §4 stating that the local arguments suffice for almost-sure convergence in the vector case and that the operator properties preclude counterexamples in the vector extension, as they are verified without scalar restrictions (a sketch of this Lyapunov/Dini mechanism follows the responses). revision: partial

  2. Referee: [§3.2] §3.2 (sup-log/Thompson contraction): The proof that the second operator is a contraction in the Thompson metric relies on homogeneity; this step is central to the fixed-point characterization and optimality result, yet the dependence on the fixed risk-aversion parameter should be shown explicitly to rule out parameter-specific failures.

    Authors: We agree that explicit tracking of the risk-aversion parameter strengthens the argument. The homogeneity property and the resulting contraction in the Thompson metric hold for any fixed risk-aversion parameter β > 0, with the contraction modulus depending on β through the exponential utility scaling. We will revise the proof in §3.2 to explicitly insert the dependence on β in the homogeneity step and the contraction factor, confirming that the result holds uniformly for the fixed-β setting of the paper without parameter-specific failures. revision: yes
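As a reconstruction of the mechanism invoked in response 1 (a standard nonlinear Perron-Frobenius pattern, not necessarily the paper's exact proof): if T is monotone and γ-homogeneous on positive vectors, hence a Thompson-metric contraction with modulus γ, then along the ODE limit of the one-timescale iteration, \dot{Q} = T(Q) - Q, the distance to the fixed point Q^* serves as a Lyapunov function with

\[
D^{+}\, d_T\big(Q(t),\, Q^{*}\big) \;\le\; -(1-\gamma)\, d_T\big(Q(t),\, Q^{*}\big),
\]

where D^{+} is the upper Dini derivative, needed because the sup-log metric is not smooth. This gives global asymptotic convergence with no global L_∞ contraction anywhere in the argument.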

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The derivation begins from the external Bellman-type equation in the cited Porteus 1975 work (by different authors), then independently derives two new Q-operators, proves they are contractions in the L_∞ and sup-log/Thompson metrics, characterizes their fixed points, establishes optimality of the induced greedy policy among stationary policies, and proves almost-sure convergence (with rates via timescale separation for the two-timescale algorithm and local Lipschitz/monotonicity arguments for the one-timescale case). None of these steps reduce by construction to the inputs or to self-citations; the central claims rest on explicit operator analysis and convergence proofs that are not tautological with the starting equation or any fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard discounted-MDP assumptions and the Bellman-type equation of Porteus (1975); no new free parameters are fitted and no invented entities are introduced.

axioms (2)
  • domain assumption The MDP is discounted with discount factor strictly less than one
    Required for the contraction property and almost-sure convergence in infinite-horizon settings.
  • domain assumption Risk aversion parameter is fixed and known
    The paper explicitly restricts to the fixed risk-aversion setting.

pith-pipeline@v0.9.0 · 5524 in / 1382 out tokens · 30976 ms · 2026-05-11T02:06:43.795122+00:00 · methodology

