Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
Pith reviewed 2026-05-07 16:17 UTC · model grok-4.3
The pith
Posterior sampling combined with online learning achieves asymptotically optimal sample complexity for best-policy identification in tabular MDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating posterior sampling for model estimation with an online learning subroutine that guides which episodes to sample, the algorithm attains the optimal sample complexity for (ε, δ)-PAC policy identification and matches the posterior contraction rates of ideal estimators, all while executing in O(S² A H) time per episode.
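The claimed per-episode routine can be pictured with a hedged sketch: draw one exact posterior sample of the transition model (here from an assumed Dirichlet posterior, which admits exact draws via Gamma variables) and plan on the sampled model by backward induction, whose triple loop gives the O(S²AH) cost quoted above. This illustrates the general recipe, not the paper's actual algorithm; all names and the prior are assumptions.

```python
import random

# Hedged sketch: one "episode" of a posterior-sampling planner for a tabular
# finite-horizon MDP. The Dirichlet prior and all names are illustrative
# assumptions, not the paper's algorithm.

def sample_transition_model(counts):
    """Exact Dirichlet posterior draw of P(. | s, a) for each state-action
    pair, using the standard Gamma-variable construction."""
    model = {}
    for (s, a), c in counts.items():
        g = [random.gammavariate(alpha, 1.0) for alpha in c]
        total = sum(g)
        model[(s, a)] = [x / total for x in g]
    return model

def backward_induction(model, reward, S, A, H):
    """Optimal policy for the sampled model via dynamic programming; the loop
    over (h, s, a) with an S-term inner product costs O(S^2 A H) per episode."""
    V = [0.0] * S                       # value function at stage h + 1
    policy = [[0] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):
        V_new = [0.0] * S
        for s in range(S):
            best_q, best_a = float("-inf"), 0
            for a in range(A):
                q = reward[s][a] + sum(p * v for p, v in zip(model[(s, a)], V))
                if q > best_q:
                    best_q, best_a = q, a
            V_new[s], policy[h][s] = best_q, best_a
        V = V_new
    return policy, V
```

With uniform prior pseudo-counts, `sample_transition_model` returns a valid stochastic model, and `backward_induction` recovers the myopically best action at the final stage, as expected.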
What carries the argument
Posterior sampling for model estimation, with an online learning algorithm guiding exploration in the MDP; this combination is what makes identification both computationally efficient and sample-optimal.
Load-bearing premise
The analysis requires that posterior samples can be drawn exactly and that the online learning component delivers regret bounds free of additional factors depending on the failure probability.
What would settle it
Running the algorithm on a simple MDP and measuring how the total number of samples grows as δ decreases would test the claim: the sample count should scale linearly in log(1/δ), and any higher-order polynomial dependence on log(1/δ) would falsify the predicted asymptotic rate.
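A minimal version of that falsification experiment might look as follows. The paper's algorithm is not reproduced here; a standard Hoeffding-style elimination rule stands in for it on a two-"policy" Bernoulli instance, so the growth of the stopping time in log(1/δ) can be inspected directly.

```python
import math, random

# Hedged sketch of the falsification test: identify the better of two
# Bernoulli options with a Hoeffding-style stopping rule (a stand-in for the
# paper's algorithm) and track how samples grow as delta shrinks.

def identify_best(p, delta, rng):
    """Sample both options until their confidence intervals separate."""
    n, sums = 0, [0.0, 0.0]
    while True:
        n += 1
        for a in (0, 1):
            sums[a] += 1.0 if rng.random() < p[a] else 0.0
        means = [s / n for s in sums]
        # union-bounded Hoeffding radius at confidence level delta
        radius = math.sqrt(math.log(4 * n * n / delta) / (2 * n))
        if abs(means[0] - means[1]) > 2 * radius:
            best = 0 if means[0] > means[1] else 1
            return best, 2 * n  # identified option, total samples used

rng = random.Random(1)
p = [0.7, 0.3]
results = {d: identify_best(p, d, rng) for d in (0.1, 0.01, 0.001)}
# If the dependence on log(1/delta) is linear (the claimed optimal rate),
# samples / log(1/delta) should stay roughly constant as delta shrinks.
ratios = {d: total / math.log(1 / d) for d, (_, total) in results.items()}
```

A flattening ratio as δ decreases is consistent with the claimed rate; a ratio that itself grows like a power of log(1/δ) would indicate the suboptimal dependence the paper claims to avoid.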
Original abstract
We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to implement, and also suffer from suboptimal dependence on $\log(1/\delta)$. We propose a randomized and computationally efficient algorithm for best policy identification that combines posterior sampling with an online learning algorithm to guide exploration in the MDP. Our method achieves asymptotic optimality in sample complexity, also in terms of posterior contraction rate, and runs in $O(S^2AH)$ per episode, matching standard model-based approaches. Unlike prior algorithms such as MOCA and PEDEL, our guarantees remain meaningful in the asymptotic regime and avoid sub-optimal polynomial dependence on $\log(1/\delta)$. Our results provide both theoretical insights and practical tools for efficient policy identification in tabular MDPs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a randomized algorithm for (ε, δ)-PAC policy identification in finite-horizon episodic tabular MDPs. It combines posterior sampling with an online learning subroutine to guide exploration, claiming asymptotic optimality in sample complexity (including posterior contraction rate), per-episode runtime O(S²AH), and avoidance of the suboptimal polynomial dependence on log(1/δ) that affects prior methods such as MOCA and PEDEL.
Significance. If the guarantees hold, the work would advance the field by delivering both asymptotically optimal sample complexity and practical computational efficiency for policy identification, while providing theoretical insights into the integration of posterior sampling with online learning in MDPs.
major comments (2)
- [Main theorem and regret invocation] The central claim of asymptotic optimality and avoidance of polynomial log(1/δ) dependence rests on the online learning subroutine delivering regret bounds free of extra logarithmic factors in 1/δ. The manuscript must explicitly state the regret bound used (e.g., in the main theorem) and show how it integrates with the stopping-time analysis without reintroducing δ-dependent terms, as standard UCB/EXP3-style bounds typically contain such factors.
- [Posterior sampling analysis] The optimality claim with respect to posterior contraction rate relies on standard contraction properties of the MDP posterior without a self-contained derivation or explicit handling of approximation errors in the sampling process. This needs to be derived or verified in the analysis section to confirm the rate remains meaningful in the asymptotic regime.
minor comments (1)
- [Abstract] The abstract states the claims clearly but could briefly note the key technical step (the δ-independent regret integration) that distinguishes the approach from MOCA/PEDEL.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript. We address each major comment point by point below, outlining the revisions we will implement.
Point-by-point responses
-
Referee: [Main theorem and regret invocation] The central claim of asymptotic optimality and avoidance of polynomial log(1/δ) dependence rests on the online learning subroutine delivering regret bounds free of extra logarithmic factors in 1/δ. The manuscript must explicitly state the regret bound used (e.g., in the main theorem) and show how it integrates with the stopping-time analysis without reintroducing δ-dependent terms, as standard UCB/EXP3-style bounds typically contain such factors.
Authors: We agree that greater explicitness is warranted. In the revised manuscript we will state the precise regret bound of the online learning subroutine directly in the statement of the main theorem. We will also augment the stopping-time analysis to show, step by step, that the chosen bound combines with the posterior-sampling stopping rule in a manner that keeps all δ-dependent factors from becoming polynomial; the argument relies on the fact that the online-learning regret is invoked only on an asymptotically vanishing fraction of episodes, which is already implicit in our current proof but will now be spelled out. revision: yes
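The kind of δ-free regret guarantee at issue can be seen in the simplest case: the Hedge (exponential weights) bound √(T ln K / 2) holds deterministically for every bounded loss sequence, with no failure probability anywhere. A minimal self-contained check, using an arbitrary illustrative loss matrix rather than anything from the paper:

```python
import math

# Hedged illustration: the Hedge regret bound sqrt(T ln K / 2) is
# deterministic and delta-free. The loss sequence below is an arbitrary
# stand-in, not the paper's actual online-learning subroutine.

def hedge(losses, eta):
    """Run exponential weights on a T x K loss matrix in [0, 1];
    return (algorithm's cumulative expected loss, best fixed arm's loss)."""
    K = len(losses[0])
    weights = [1.0] * K
    alg_loss = 0.0
    for round_losses in losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        alg_loss += sum(p * l for p, l in zip(probs, round_losses))
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, round_losses)]
    best_loss = min(sum(losses[t][k] for t in range(len(losses)))
                    for k in range(K))
    return alg_loss, best_loss

T, K = 200, 4
losses = [[((t * (k + 1)) % 7) / 6.0 for k in range(K)] for t in range(T)]
eta = math.sqrt(8 * math.log(K) / T)   # tuned learning rate
alg, best = hedge(losses, eta)
assert alg - best <= math.sqrt(T * math.log(K) / 2)  # no delta anywhere
```

The point for the stopping-time analysis is exactly this: nothing in the bound degrades as the target failure probability shrinks, so invoking it cannot reintroduce polynomial log(1/δ) terms.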
-
Referee: [Posterior sampling analysis] The optimality claim with respect to posterior contraction rate relies on standard contraction properties of the MDP posterior without a self-contained derivation or explicit handling of approximation errors in the sampling process. This needs to be derived or verified in the analysis section to confirm the rate remains meaningful in the asymptotic regime.
Authors: We will add a self-contained derivation of the posterior contraction rate (including the handling of sampling approximation errors) to the analysis section. The derivation will follow the standard martingale arguments for Dirichlet posteriors but will be written out in full so that the asymptotic validity of the contraction rate is verified directly within the paper. revision: yes
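The standard contraction fact the authors promise to derive can also be demonstrated numerically: under a Dirichlet prior (an assumption for this sketch, not the paper's derivation), exact posterior draws concentrate around the posterior mean as observations accumulate, at the familiar O(1/√n) rate.

```python
import random

# Hedged numerical check of Dirichlet posterior contraction: the average L1
# spread of exact posterior draws shrinks as the observation count n grows.

def posterior_spread(true_p, n, draws, rng):
    """Observe n categorical samples from true_p, then measure the average
    L1 distance of exact Dirichlet posterior draws from the posterior mean."""
    counts = [1.0] * len(true_p)          # uniform Dirichlet(1, ..., 1) prior
    for _ in range(n):
        u, s = rng.random(), 0.0
        for i, p in enumerate(true_p):
            s += p
            if u < s:
                counts[i] += 1
                break
        else:                             # guard against float round-off
            counts[-1] += 1
    total = sum(counts)
    mean = [c / total for c in counts]
    spread = 0.0
    for _ in range(draws):
        g = [rng.gammavariate(c, 1.0) for c in counts]
        gs = sum(g)
        sample = [x / gs for x in g]
        spread += sum(abs(a - b) for a, b in zip(sample, mean))
    return spread / draws

rng = random.Random(42)
true_p = [0.5, 0.3, 0.2]
s_small = posterior_spread(true_p, 10, 200, rng)
s_large = posterior_spread(true_p, 1000, 200, rng)
assert s_large < s_small  # posterior contracts as data accumulates
```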
Circularity Check
No significant circularity; the derivation invokes external results as independent black boxes rather than relying on self-referential inputs.
full rationale
The paper's central claim is that combining posterior sampling with an online learning subroutine yields asymptotic optimality in sample complexity and posterior contraction rate for (ε,δ)-PAC policy identification, with O(S²AH) per-episode runtime and no polynomial log(1/δ) blowup. This rests on two external standard results: (1) known posterior contraction rates for tabular MDPs under exact sampling, and (2) regret bounds from a generic online learning subroutine that are assumed to be free of extra δ-dependent logarithmic factors. Neither is derived inside the paper via self-definition, parameter fitting, or a load-bearing self-citation chain; both are invoked as black-box inputs whose validity is independent of the present work. No equations reduce a claimed prediction to a fitted input by construction, no uniqueness theorem is imported from the authors' prior papers, and no ansatz is smuggled via citation. The derivation chain therefore remains non-circular even though it depends on external subroutine properties.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The MDP is finite-horizon episodic with known horizon H and finite state-action space.
- domain assumption Posterior sampling over transition and reward models can be performed exactly.