Tight Sample Complexity Bounds for Entropic Best Policy Identification
Pith reviewed 2026-05-14 20:10 UTC · model grok-4.3
The pith
A new stopping rule closes the exponential gap and matches the lower bound for entropic best-policy identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The sample complexity of identifying an approximately optimal policy under the entropic risk measure is $\Theta(e^{|\beta| H})$. The rate is achieved by a forward-model algorithm whose analysis replaces loose concentration inequalities with tighter ones that exploit the smoothness of the exponential utility, and whose stopping rule is calibrated to that tightness.
What carries the argument
KL-based exploration bonuses adapted to the entropic criterion, paired with a stopping rule that uses tighter concentration bounds coming from the smoothness of the exponential utility function.
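For reference, the standard definition of the entropic risk measure behind this claim (not quoted from the paper), together with its small-$\beta$ expansion showing why it encodes risk sensitivity:

```latex
\rho_\beta(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta X}\right],
\qquad
\rho_\beta(X) \;\approx\; \mathbb{E}[X] \;+\; \frac{\beta}{2}\,\mathrm{Var}(X)
\quad \text{as } \beta \to 0.
```

Negative $\beta$ penalizes variance (risk-averse), positive $\beta$ rewards it (risk-seeking); the $|\beta|$ in the bounds reflects that both regimes face the same exponential scaling.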
If this is right
- The extra exponential factor present in prior upper bounds is unnecessary.
- Matching lower-bound sample complexity is achievable for this identification problem using a generative model.
- The improvement is specific to the exponential utility and the new stopping rule.
Where Pith is reading between the lines
- The same smoothness-driven tightening may apply to other risk measures that admit similar concentration control.
- Practical risk-sensitive agents with long horizons could become feasible once the exponential cost is removed.
- The result raises the question of whether analogous gaps exist in best-policy identification for other non-standard objectives.
Load-bearing premise
The smoothness properties of the exponential utility suffice to derive sharper concentration bounds that are tight enough for the new stopping rule to match the lower bound.
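A back-of-the-envelope reconstruction (ours, not the paper's) of where the extra exponential factor can enter: with returns $X \in [0, H]$ and $\beta > 0$, a range-based (Hoeffding) bound on the empirical exponential moment $\widehat{m}_n$ pays the full range of the utility,

```latex
e^{\beta X} \in \left[1,\, e^{\beta H}\right]
\;\Longrightarrow\;
\left|\widehat{m}_n - \mathbb{E}\!\left[e^{\beta X}\right]\right|
\;\le\; e^{\beta H} \sqrt{\frac{\log(2/\delta)}{2n}}
\quad \text{w.p. } 1 - \delta,
```

so when $\mathbb{E}[e^{\beta X}] = \Theta(1)$, an $\varepsilon$-accurate estimate of $\frac{1}{\beta}\log\mathbb{E}[e^{\beta X}]$ already forces $n = \Omega(e^{2\beta H}/\varepsilon^2)$. Concentration that tracks the local smoothness or variance of the utility, rather than its raw range, is what a matching $e^{|\beta| H}$ rate requires.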
What would settle it
An MDP in which the proposed algorithm requires more than a constant multiple of $e^{|\beta| H}$ samples to return an approximately optimal policy would falsify the matching claim.
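The falsification target can at least be probed numerically. A minimal sketch (toy return distribution, all names ours, not from the paper): estimate the entropic value of a bounded return from i.i.d. samples via the plug-in estimator, using log-sum-exp so that $e^{\beta H}$ never overflows, and compare against the closed form.

```python
import math
import random

def entropic_risk(samples, beta):
    """Plug-in estimate of (1/beta) * log E[exp(beta * X)] from i.i.d. samples."""
    n = len(samples)
    # log-sum-exp trick: subtract the max exponent for numerical stability
    m = max(beta * x for x in samples)
    s = sum(math.exp(beta * x - m) for x in samples)
    return (m + math.log(s / n)) / beta

random.seed(0)
beta, H = 1.0, 5  # risk parameter and horizon; returns lie in [0, H]
X = [random.uniform(0, H) for _ in range(200_000)]  # toy return distribution

est = entropic_risk(X, beta)
# Closed form for X ~ Uniform(0, H): (1/beta) * log((e^{beta*H} - 1) / (beta*H))
exact = math.log((math.exp(beta * H) - 1) / (beta * H)) / beta
print(round(est, 3), round(exact, 3))
```

With $\beta H = 5$ the plug-in estimate lands near the closed-form value (about 3.38); pushing $\beta H$ higher makes the estimator's relative error, and hence the required sample count, grow exponentially, which is exactly the regime the stopping rule must control.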
Original abstract
We study best-policy identification for finite-horizon risk-sensitive reinforcement learning under the entropic risk measure. Recent work established a constant gap in the exponential horizon dependence between lower and upper bounds on the number of samples required to identify an approximately optimal policy. Precisely, known lower bounds scale as $\Omega(e^{|\beta| H})$, where $H$ is the horizon of the MDP, while the state-of-the-art upper bound achieves at best $O(e^{2|\beta| H})$ (arXiv:2506.00286v2) using a generative model. We show that this extra exponential factor can be traced to overly loose concentration control for exponential utilities. To close this open gap, we revisit the analysis of this problem through a forward-model-based algorithm building on KL-based exploration bonuses that we adapt to the entropic criterion. The improvement rests on two main technical innovations: we leverage the smoothness properties of the exponential utility to derive sharper concentration bounds, and we propose a new stopping rule that further exploits this tightness to obtain a sample complexity matching the lower bound.
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Smoothness properties of the exponential utility allow derivation of sharper concentration bounds than previously used.
Reference graph
Works this paper leans on
- [1] Mohammad Gheshlaghi Azar, Rémi Munos, and Bert Kappen. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.
- [2] Oliver Mortensen and Mohammad Sadegh Talebi. arXiv preprint arXiv:2506.00286.
- [3] Reinforcement Learning in Economics and Finance. Computational Economics.
- [4] Survey of Model-Based Reinforcement Learning: Applications on Robotics. Journal of Intelligent …, 2017.
- [5] Alexandre Marthe, Aurélien Garivier, et al. Beyond Average Return in Markov Decision Processes. In Advances in Neural Information Processing Systems.
- [6] Ronald A. Howard and James E. Matheson. Management Science.
- [7] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [8] Fast Active Learning for Pure Exploration in Reinforcement Learning. In Proceedings of the 38th International Conference on Machine Learning, 2021.
- [9] A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes. Machine Learning, 2002.
- [10] Christoph Dann and Emma Brunskill. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning.
- [11] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Journal of Machine Learning Research.
- [12] V. S. Borkar and S. P. Meyn. SIAM Journal on Control and Optimization, 2000.
- [13] Claude-Nicolas Fiechter. In Proceedings of the Seventh Annual Conference on Computational Learning Theory, 1994.
- [14] Reward-Free Exploration for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- [15] Emilie Kaufmann et al. Adaptive Reward-Free Exploration. 2021.
- [16] Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret.
- [17] Yingjie Fei, Zhuoran Yang, Yudong Chen, and Zhaoran Wang. Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning.
- [18] A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023.
- [19] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition.
- [20] Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021.
- [21] Emilie Kaufmann, Olivier Cappé, et al. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models. 2016.
- [22] Minimax Regret Bounds for Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- [23] Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- [24] Statistics and Samples in Distributional Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- [25] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning.
- [26] Alexandre Marthe, Samuel Bounan, et al. Efficient Risk-sensitive Planning via Entropic Risk Measures. 2025. doi:10.48550/arXiv.2502.20423.
- [27] Hao Liang and Zhi-Quan Luo. Journal of Machine Learning Research.
- [28]
- [29] Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs. In Proceedings of Algorithmic Learning Theory, 2018.