arxiv: 2604.18312 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Scale-free adaptive planning for deterministic dynamics & discounted rewards

Peter L. Bartlett , Victor Gabillon , Jennifer Healey , Michal Valko

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:35 UTC · model grok-4.3

classification 💻 cs.LG

keywords planningreinforcement learningdiscounted rewardsdeterministic dynamicssample complexityadaptive planningscale-free algorithms

0 comments

The pith

Platypoos is a scale-free planning algorithm that adapts to unknown reward scales and smoothness while achieving sample complexity bounds that hold simultaneously across discount factors without prior knowledge of them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses planning in environments with deterministic dynamics and stochastic rewards under discounting, where the optimal value function is unknown and rewards may be unbounded. It introduces Platypoos as a simple algorithm that automatically adapts to the unknown scale and smoothness of the reward function. The analysis provides sample complexity guarantees that improve on prior work and apply at once to a wide range of discount factors and reward scales, even though the algorithm does not know these values. A matching lower bound is also given to show that the upper bound is tight up to constant factors. A sympathetic reader would care because real planning problems rarely provide the exact discount or reward range in advance, so methods that require them or degrade sharply outside narrow ranges are less practical.

Core claim

We propose Platypoos, a simple scale-free planning algorithm that adapts to the unknown scale and smoothness of the reward function. We provide a sample complexity analysis for Platypoos that improves upon prior work and holds simultaneously over a broad range of discount factors and reward scales, without the algorithm knowing them. We also establish a matching lower bound showing our analysis is optimal up to constants.

What carries the argument

Platypoos, the scale-free planning algorithm that adapts to the unknown scale and smoothness of the reward function by adjusting its behavior without explicit knowledge of those parameters.

Load-bearing premise

The environment has deterministic dynamics with stochastic rewards under discounting, and the optimal value function is unknown with possibly unbounded rewards.

What would settle it

Construct a simple deterministic MDP with stochastic rewards, run Platypoos for varying discount factors and reward scales chosen independently of the algorithm, and measure whether the observed sample count required to reach a given accuracy matches the claimed upper bound and is strictly lower than that of prior non-adaptive planners on the same instance.

Figures

Figures reproduced from arXiv: 2604.18312 by Jennifer Healey, Michal Valko, Peter L. Bartlett, Victor Gabillon.

**Figure 1.** Figure 1: Algorithm for free planning with no reset condition until a reasonably shallow depth Hu but sampling all KHu nodes and sharing cross sequence information. In the case of OLOP, HOO, or, particularly StroquOOL, only a subset S of size |S| ≪ KHu of the most promising nodes are explored but at a deeper depth Hs ≫ Hu. Therefore, obtaining a lower bound on the number of sequence of actions at depth Hs that conta… view at source ↗

**Figure 2.** Figure 2: Algorithm for constraint planning (restart) Theorem 1. For any planning problem with associated (ν, ρ), and branching factor κ ≜ κ u (ν, ρ), the simple regret of SequOOL is after n rounds bounded as follows. • If κ = 1, then rn ≤ νρ 1 C j n log n k . • If κ > 1, then rn = O [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The PlaTγPOOS algorithm cells multiple times, not just one time as in the deterministic case. The amount of times a cell should be evaluated to differentiate its value from the optimal value of the function depends on the gap between these two values as well as the range of noise. As we do not want to make any assumptions on knowing these quantities, our algorithm tries to be robust to any potential values… view at source ↗

**Figure 4.** Figure 4: Top and bottom left: Average cumulative discounted return collected by OLOP and PlaTγPOOS with different range of noise, b = 1 (top left), b = 10 (top center), b = 20, (top right), and b = 50 (bottom left). Bottom center: the sensitivity of OLOP to different Remax parameters. Bottom right: the sensitivity of OLOP to different range of the input noise eb as parameters while the true b is set to 10. 6. Numer… view at source ↗

**Figure 5.** Figure 5: MDP used for our experiments The initial state is (0, 0). Therefore, the agent has a choice. It can, for instance, remain in the same binary state bin, starting with a null reward but sees its instant reward growing with time if it keeps taking the same action in the future. Alternatively, it could greedily switch to the other binary state bin and obtain a reward of 2 but delaying the hope of obtaining gr… view at source ↗

read the original abstract

We address the problem of planning in an environment with deterministic dynamics and stochastic rewards with discounted returns. The optimal value function is not known, nor are the rewards bounded. We propose Platypoos, a simple scale-free planning algorithm that adapts to the unknown scale and smoothness of the reward function. We provide a sample complexity analysis for Platypoos that improves upon prior work and holds simultaneously over a broad range of discount factors and reward scales, without the algorithm knowing them. We also establish a matching lower bound showing our analysis is optimal up to constants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Platypoos, a scale-free planning algorithm for MDPs with deterministic dynamics, stochastic (possibly unbounded) rewards, and discounted returns where the optimal value function is unknown. The algorithm adapts to the unknown scale and smoothness of the reward function without prior knowledge of these quantities or the discount factor. The central claims are an improved sample-complexity analysis that holds simultaneously over broad ranges of discount factors and reward scales, plus a matching lower bound establishing optimality up to constants.

Significance. If the analysis and lower bound hold, the result would be significant for adaptive planning under uncertainty: it removes the common requirement to know or tune for reward magnitude and discount factor while still delivering uniform guarantees. The deterministic-dynamics assumption enables exact forward simulation, which underpins the scale-free adaptation. The matching lower bound is a strength that would make the tightness claim falsifiable and useful for the literature.

minor comments (3)

[Abstract] The abstract states that the analysis 'improves upon prior work' but does not name the specific prior results or quantify the improvement (e.g., dependence on smoothness or range of γ). Adding one sentence of comparison would help readers place the contribution.
[Problem statement / Section 2] Definitions of 'scale' and 'smoothness' of the reward function appear only after the algorithm is introduced; moving a concise definition to the problem statement section would improve readability.
[Lower bound section] The lower-bound construction should explicitly state the class of reward distributions and the range of γ over which the Ω(·) bound holds, to confirm it matches the upper bound's uniformity claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on Platypoos and for recommending minor revision. We appreciate the recognition that the simultaneous sample-complexity bounds across discount factors and reward scales, together with the matching lower bound, would be a useful contribution if the analysis holds. Since the report lists no specific major comments or requests for clarification, we have no points to address point-by-point at this stage. We are happy to incorporate any minor suggestions that may appear in the full review or during the revision process.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims rest on a proposed algorithm (Platypoos) whose sample-complexity bounds are derived from standard MDP assumptions under deterministic dynamics, stochastic rewards, and discounted returns. The analysis is stated to hold simultaneously over unknown discount factors and reward scales without the algorithm knowing them, and is supported by a matching lower bound. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the derivation remains independent of quantities defined from the same data or prior author-specific uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.0 · 5384 in / 1054 out tokens · 35993 ms · 2026-05-10T04:35:16.186747+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 8 canonical work pages

[1]

L., Gabillon, V., and Valko, M

Bartlett, P. L., Gabillon, V., and Valko, M. http://researchers.lille.inria.fr/ valko/hp/serve.php?what=publications/bartlett2019simple.pdf A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption . In Algorithmic Learning Theory (ALT), 2019

2019
[2]

and Tsitsiklis, J

Bertsekas, D. and Tsitsiklis, J. https://books.google.co.uk/books/about/Neuro_dynamic_Programming.html?id=WxCCQgAACAAJ&source=kp_book_description&redir_esc=y Neuro-dynamic programming . Athena Scientific, Belmont, MA, 1996

1996
[3]

and Munos, R

Bubeck, S. and Munos, R. http://sbubeck.com/COLT10_BM.pdf Open-loop optimistic planning . In Conference on Learning Theory (COLT), 2010

2010
[4]

http://www.jmlr.org/papers/volume12/bubeck11a/bubeck11a.pdf X -armed bandits

Bubeck, S., Munos, R., Stoltz, G., and Szepesv \' a ri, C. http://www.jmlr.org/papers/volume12/bubeck11a/bubeck11a.pdf X -armed bandits . Journal of Machine Learning Research, 12: 0 1587--1627, 2011

2011
[5]

and Munos, R

Bu s oniu, L. and Munos, R. http://proceedings.mlr.press/v22/busoniu12/busoniu12.pdf Optimistic planning for Markov decision processes . In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012

2012
[6]

and Locatelli, A

Carpentier, A. and Locatelli, A. Tight (lower) bounds for the fixed budget best-arm identification bandit problem http://proceedings.mlr.press/v49/carpentier16.pdf. In Conference on Learning Theory (COLT), 2016

2016
[7]

and Munos, R

Coquelin, P.-A. and Munos, R. https://arxiv.org/pdf/1408.2028.pdf Bandit algorithms for tree search . In Conference on Uncertainty in Artificial Intelligence (UAI), 2007

work page arXiv 2028
[8]

https://hal.inria.fr/inria-00116992/document Efficient selectivity and backup operators in Monte-Carlo tree search

Coulom, R. https://hal.inria.fr/inria-00116992/document Efficient selectivity and backup operators in Monte-Carlo tree search . Computers and games, 4630: 0 72–83, 2007

2007
[9]

and Domshlak, C

Feldman, Z. and Domshlak, C. http://dx.doi.org/10.1613/jair.4432 Simple regret optimization in online planning for Markov decision processes . Journal of Artificial Intelligence Research, 2014

work page doi:10.1613/jair.4432 2014
[10]

https://hal.inria.fr/hal-00747005v1/document Best-arm identification: A unified approach to fixed budget and fixed confidence

Gabillon, V., Ghavamzadeh, M., and Lazaric, A. https://hal.inria.fr/hal-00747005v1/document Best-arm identification: A unified approach to fixed budget and fixed confidence . In Neural Information Processing Systems (NeurIPS), 2012

2012
[11]

https://hal.inria.fr/inria-00117266 Modification of UCT with patterns in Monte-Carlo Go

Gelly, S., Yizao, W., Munos, R., and Teytaud, O. https://hal.inria.fr/inria-00117266 Modification of UCT with patterns in Monte-Carlo Go . Technical report, Inria, 2006

2006
[12]

Grill, J.-B., Valko, M., and Munos, R. https://papers.nips.cc/paper/6253-blazing-the-trails-before-beating-the-path-sample-efficient-monte-carlo-planning Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning . In Neural Information Processing Systems (NeurIPS), 2016

2016
[13]

and Hassani, M

Hoorfar, A. and Hassani, M. Inequalities on the lambert w function and hyperpower function https://www.emis.de/journals/JIPAM/images/107_07_JIPAM/107_07.pdf. Journal of Inequalities in Pure and Applied Mathematics, 9 0 (2): 0 5--9, 2008

2008
[14]

and Munos, R

Hren, J.-F. and Munos, R. https://hal.archives-ouvertes.fr/hal-00830182/document Optimistic planning of deterministic systems . In European Workshop on Reinforcement Learning, 2008

2008
[15]

and Koolen, W

Kaufmann, E. and Koolen, W. M. https://arxiv.org/pdf/1706.02986.pdf Monte-carlo tree search by best-arm identification . In Neural Information Processing Systems (NeurIPS), 2017

work page arXiv 2017
[16]

and Szepesv \' a ri, C

Kocsis, L. and Szepesv \' a ri, C. http://ggp.stanford.edu/readings/uct.pdf Bandit-based Monte-Carlo planning . In European Conference on Machine Learning (ECML), 2006

2006
[17]

and Maillard, O.-A

Leurent, E. and Maillard, O.-A. https://arxiv.org/pdf/1904.04700.pdf Practical Open-Loop Optimistic Planning . arXiv preprint arXiv:1904.04700, 2019

work page arXiv 1904
[18]

Munos, R. https://papers.nips.cc/paper/4304-optimistic-optimization-of-a-deterministic-function-without-the-knowledge-of-its-smoothness.pdf Optimistic optimization of deterministic functions without the knowledge of its smoothness . In Neural Information Processing Systems (NeurIPS), 2011

2011
[19]

https://hal.archives-ouvertes.fr/hal-00747575v5/document From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning

Munos, R. https://hal.archives-ouvertes.fr/hal-00747575v5/document From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning . Foundations and Trends in Machine Learning, 7(1): 0 1--130, 2014

2014
[20]

and P \' a l, D

Orabona, F. and P \' a l, D. http://dx.doi.org/10.1016/j.tcs.2017.11.021 Scale-free online learning . Theoretical Computer Science, 2018

work page doi:10.1016/j.tcs.2017.11.021 2017
[21]

and Tommasi, T

Orabona, F. and Tommasi, T. http://papers.neurips.cc/paper/6811-training-deep-networks-without-learning-rates-through-coin-betting.pdf Training deep networks without learning rates through coin betting . In Neural Information Processing Systems (NeurIPS), 2017

2017
[22]

Puterman, M. L. http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/0471619779 Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley & Sons, New York, NY, 1994

work page arXiv 1994
[23]

http://arxiv.org/abs/1305.6646 Normalized online learning

Ross, S., Mineiro, P., and Langford, J. http://arxiv.org/abs/1305.6646 Normalized online learning . In Conference on Uncertainty in Artificial Intelligence (UAI), 2013

work page arXiv 2013
[24]

https://arxiv.org/pdf/1902.05213.pdf On reinforcement learning using Monte-Carlo tree search with supervised learning: Non-asymptotic analysis

Shah, D., Xie, Q., and Xu, Z. https://arxiv.org/pdf/1902.05213.pdf On reinforcement learning using Monte-Carlo tree search with supervised learning: Non-asymptotic analysis . arXiv preprint arXiv:1902.05213, 2019

work page arXiv 1902
[25]

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/u...

2016
[26]

Sz \" o r \' e nyi, B., Kedenburg, G., and Munos, R. https://papers.nips.cc/paper/5368-optimistic-planning-in-markov-decision-processes-using-a-generative-model.pdf Optimistic planning in Markov decision processes using a generative model . In Neural Information Processing Systems (NeurIPS), 2014

2014
[27]

http://proceedings.mlr.press/v28/valko13.pdf Stochastic simultaneous optimistic optimization

Valko, M., Carpentier, A., and Munos, R. http://proceedings.mlr.press/v28/valko13.pdf Stochastic simultaneous optimistic optimization . In International Conference on Machine Learning (ICML), 2013

2013