pith. machine review for the scientific record.

arxiv: 2605.13717 · v1 · submitted 2026-05-13 · 💻 cs.LG · stat.ML

Recognition: unknown

Tight Sample Complexity Bounds for Entropic Best Policy Identification

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:10 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords best policy identification · entropic risk measure · sample complexity · risk-sensitive reinforcement learning · concentration bounds · stopping rules · exploration bonuses

The pith

A new stopping rule closes the exponential gap and matches the lower bound for entropic best-policy identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies best-policy identification in finite-horizon risk-sensitive RL under the entropic risk measure. Earlier upper bounds carried an extra exponential factor in the horizon compared with known lower bounds of order $e^{|\beta| H}$. The authors trace the gap to loose concentration control on exponential utilities, then close it with a forward-model algorithm that uses KL-based exploration bonuses, sharper concentration bounds derived from the smoothness of the exponential utility, and a new stopping rule that achieves matching sample complexity.
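
For readers who want the objective in symbols: the entropic risk measure is the standard definition below. This is textbook material, not notation quoted from the paper.

```latex
% Entropic risk measure of a random return X at risk level \beta \neq 0
% (standard definition; recovers the expectation as \beta -> 0):
\rho_\beta(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta X}\right],
\qquad
\lim_{\beta \to 0} \rho_\beta(X) \;=\; \mathbb{E}[X].
% The gap at issue: the lower bound scales as \Omega(e^{|\beta| H}), prior
% upper bounds as O(e^{2|\beta| H}); the paper claims \Theta(e^{|\beta| H}).
```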

Core claim

The sample complexity of identifying an approximately optimal policy under the entropic risk measure is $\Theta(e^{|\beta| H})$, achieved by a forward-model algorithm whose analysis replaces loose concentration inequalities with tighter ones that exploit the smoothness of the exponential utility, and whose stopping rule is calibrated to this tightness.

What carries the argument

KL-based exploration bonuses adapted to the entropic criterion, paired with a stopping rule that uses tighter concentration bounds coming from the smoothness of the exponential utility function.
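
To fix ideas, here is a minimal, heavily simplified sketch of what a forward-model identification loop with bonus-driven exploration and a gap-based stopping rule can look like. The structure (generative-model sampling, optimistic and pessimistic planning on exponential utilities, stop when the certified gap falls below ε) is generic; the bonus shape, the constants, and the helper `plan_entropic` are illustrative placeholders, not the paper's algorithm.

```python
import numpy as np

# Illustrative sketch only: a generic forward-model best-policy-identification
# loop with confidence-width exploration bonuses and a gap-based stopping rule.
# Bonus shape, constants, and planner are placeholders, not the paper's method.

rng = np.random.default_rng(0)
S, A, H = 4, 2, 3            # states, actions, horizon (toy sizes)
beta = 0.5                   # risk parameter of the entropic objective
eps, delta = 0.5, 0.05       # accuracy / confidence targets (loose, for speed)

# Ground-truth MDP, unknown to the learner and queried via a generative model.
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # P_true[s, a] over next states
R = rng.uniform(0.0, 1.0, size=(S, A))            # deterministic rewards in [0, 1]
counts = np.zeros((S, A, S))

def plan_entropic(P_hat, b):
    """Backward induction on exponential utilities E[exp(beta * return)],
    inflated (b > 0) or deflated (b < 0) multiplicatively at each stage."""
    U = np.ones(S)                                # exp(beta * 0) at stage H
    for _ in range(H):
        Q = np.exp(beta * R) * (P_hat @ U) * (1.0 + b)   # (S, A) backup
        U = Q.max(axis=1)
    return np.log(U) / beta                       # entropic value per state

for rnd in range(1, 20_000):
    for s in range(S):                            # generative-model queries:
        for a in range(A):                        # a batch per (s, a) each round
            nxt = rng.choice(S, size=8, p=P_true[s, a])
            np.add.at(counts[s, a], nxt, 1)
    n = counts.sum(axis=2)
    P_hat = counts / n[:, :, None]
    # Placeholder confidence width shrinking like sqrt(log(.)/n); the paper's
    # bonuses are KL-based and tuned to the smoothness of exp(beta * x).
    b = np.sqrt(np.log(2 * S * A * rnd / delta) / n.min())
    if b >= 1.0:
        continue                                  # widths still vacuous
    gap = (plan_entropic(P_hat, b) - plan_entropic(P_hat, -b)).max()
    if gap <= eps:                                # stopping rule: certified gap
        print(f"stopped: {int(n.sum())} samples, certified gap {gap:.3f}")
        break
```

The point of the sketch is the division of labor the review describes: exploration bonuses drive where samples go, while the stopping rule converts confidence widths into a certified optimality gap; tightening the widths directly tightens when the rule fires.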

If this is right

  • The extra exponential factor present in prior upper bounds is unnecessary.
  • Matching lower-bound sample complexity is achievable for this identification problem using a generative model.
  • The improvement is specific to the exponential utility and the new stopping rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same smoothness-driven tightening may apply to other risk measures that admit similar concentration control.
  • Practical risk-sensitive agents with long horizons could become feasible once the exponential cost is removed.
  • The result raises the question of whether analogous gaps exist in best-policy identification for other non-standard objectives.

Load-bearing premise

The smoothness properties of the exponential utility suffice to derive sharper concentration bounds that are tight enough for the new stopping rule to match the lower bound.
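
One way to see why this premise is plausible (a standard Lipschitz argument, not the paper's sharper machinery): returns in a finite-horizon MDP with rewards in [0, 1] lie in [0, H], and on that interval the exponential utility is Lipschitz with constant $|\beta| e^{|\beta| H}$.

```latex
% For x, y in [0, H], the mean value theorem applied to x -> e^{\beta x} gives
\left| e^{\beta x} - e^{\beta y} \right|
  \;\le\; |\beta|\, e^{|\beta| H}\, |x - y| .
% So an eps-accurate estimate of the return transfers to an
% (|\beta| e^{|\beta| H} eps)-accurate estimate of the exponential utility:
% a single factor of e^{|\beta| H}, whereas cruder sup-norm control of the
% utilities can pay the square, e^{2|\beta| H}.
```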

What would settle it

An MDP in which the proposed algorithm requires more than a constant multiple of $e^{|\beta| H}$ samples to return an approximately optimal policy would falsify the matching claim.
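
Short of exhibiting such an MDP analytically, an empirical scaling check is the natural first probe: run the algorithm over a grid of (β, H), regress log(sample count) on |β|·H, and look for a slope meaningfully above 1. A minimal sketch of that regression, assuming a hypothetical hook `run_bpi(beta, H)` that returns the algorithm's stopping time (stubbed below with the conjectured scaling):

```python
import numpy as np

def run_bpi(beta: float, H: int) -> int:
    """Hypothetical hook: run the identification algorithm on a fixed hard
    MDP family and return the number of samples at its stopping time.
    Stubbed here with the conjectured Theta(e^{|beta| H}) scaling plus noise."""
    rng = np.random.default_rng(hash((round(beta * 10), H)) % 2**32)
    return int(50 * np.exp(abs(beta) * H) * rng.uniform(0.8, 1.25))

# Regress log(samples) on |beta| * H; a slope well above 1 across the grid
# would contradict the matching-bound claim.
xs, ys = [], []
for beta in (0.5, 1.0, 1.5):
    for H in (2, 4, 6, 8):
        xs.append(abs(beta) * H)
        ys.append(np.log(run_bpi(beta, H)))
slope, intercept = np.polyfit(xs, ys, 1)
print(f"log(samples) ~ {slope:.2f} * |beta|*H + {intercept:.2f} "
      "(claim predicts slope ~ 1)")
```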

Figures

Figures reproduced from arXiv: 2605.13717 by Amer Essakine, Claire Vernade.

Figure 1: An example of the MDP construction. Here the state …
Figure 2: Toy MDP. Safe action follows the straight chain to …
Figure 3: Sample complexity as a function of β (log-scale y-axis).
Figure 6: Regret for different algorithms for β = 2.0. Figures 5, 6, 7, and 8 show that our algorithm consistently achieves the lowest cumulative-regret trajectory over all values of β. Since cumulative regret is computed by evaluating the current policy from the initial state each episode and summing the resulting value gap to optimality, the consistently flatter Entropic BPI curve indicates stronger learning efficiency.
Figure 7: Regret for different algorithms for β = 3.0.
Original abstract

We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale in $\Omega(e^{|\beta| H})$ where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(e^{2|\beta| H})$ (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement we get is due to two main novel technical innovations. We leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that exploits further this tightness to obtain a sample complexity that matches the lower bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the smoothness-derived concentration bounds and the correctness of the new stopping rule; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Smoothness properties of the exponential utility allow derivation of sharper concentration bounds than previously used.
    Invoked to obtain the tighter bounds that enable matching the lower bound.

pith-pipeline@v0.9.0 · 5487 in / 1220 out tokens · 42160 ms · 2026-05-14T20:10:14.826127+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. Mohammad Gheshlaghi Azar, Rémi Munos, and Bert Kappen. On the Sample Complexity of Reinforcement Learning with a Generative Model. Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.
  2. Oliver Mortensen and Mohammad Sadegh Talebi. arXiv preprint arXiv:2506.00286.
  3. Reinforcement Learning in Economics and Finance. Computational Economics.
  4. Survey of Model-Based Reinforcement Learning: Applications on Robotics. Journal of Intelligent & Robotic Systems, 2017.
  5. Alexandre Marthe, Aurélien Garivier, et al. Beyond Average Return in Markov Decision Processes. Advances in Neural Information Processing Systems.
  6. Ronald A. Howard and James E. Matheson. Risk-Sensitive Markov Decision Processes. Management Science, 1972.
  7. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  8. Fast Active Learning for Pure Exploration in Reinforcement Learning. Proceedings of the 38th International Conference on Machine Learning, 2021.
  9. Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning, 2002.
  10. Christoph Dann and Emma Brunskill. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning. Advances in Neural Information Processing Systems, 2015.
  11. Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research, 2009.
  12. V. S. Borkar and S. P. Meyn. SIAM Journal on Control and Optimization, 2000.
  13. Claude-Nicolas Fiechter. Efficient Reinforcement Learning. Proceedings of the Seventh Annual Conference on Computational Learning Theory, 1994.
  14. Reward-Free Exploration for Reinforcement Learning. Proceedings of the 37th International Conference on Machine Learning, 2020.
  15. Emilie Kaufmann, Pierre Ménard, et al. Adaptive Reward-Free Exploration. Algorithmic Learning Theory, 2021.
  16. Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret. Advances in Neural Information Processing Systems, 2020.
  17. Yingjie Fei, Zhuoran Yang, Yudong Chen, and Zhaoran Wang. Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning. Advances in Neural Information Processing Systems, 2021.
  18. A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning. Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 2023.
  19. Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition. Advances in Neural Information Processing Systems, 2020.
  20. Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited. Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021.
  21. Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 2016.
  22. Minimax Regret Bounds for Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning, 2017.
  23. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. Proceedings of the 36th International Conference on Machine Learning, 2019.
  24. Statistics and Samples in Distributional Reinforcement Learning. Proceedings of the 36th International Conference on Machine Learning, 2019.
  25. Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Advances in Neural Information Processing Systems, 2017.
  26. Alexandre Marthe, Samuel Bounan, Aurélien Garivier, et al. Efficient Risk-sensitive Planning via Entropic Risk Measures. arXiv:2502.20423, doi:10.48550/arXiv.2502.20423, 2025.
  27. Hao Liang and Zhi-Quan Luo. Journal of Machine Learning Research.
  28. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2006.
  29. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs. Proceedings of Algorithmic Learning Theory, 2018.