Performance Bounds for Rollout Policies in Stochastic Shortest Path Problems

Anders Hansson; Bo Wahlberg

arxiv: 2605.22965 · v1 · pith:KKLFMOEMnew · submitted 2026-05-21 · 🧮 math.OC · cs.SY· eess.SY

Performance Bounds for Rollout Policies in Stochastic Shortest Path Problems

Anders Hansson , Bo Wahlberg This is my paper

Pith reviewed 2026-05-25 05:34 UTC · model grok-4.3

classification 🧮 math.OC cs.SYeess.SY

keywords rollout policiesstochastic shortest pathperformance boundsapproximate dynamic programminghitting timecertainty equivalenttransient Markov decision processes

0 comments

The pith

The suboptimality of a fixed rollout policy in stochastic shortest path problems is bounded by value approximation error times expected hitting time to absorption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a non-asymptotic performance certificate showing that rollout policy loss relative to optimality equals at most the uniform error in the value approximation multiplied by the expected steps until the closed-loop system reaches a terminal state. In the undiscounted transient setting this hitting time replaces the discount factor or finite horizon that appears in standard approximate dynamic programming bounds. A performance-difference identity decomposes the exact suboptimality through the transient occupation measure, and a deterministic example establishes that the hitting-time factor is tight. The argument extends to certainty-equivalent rollout by adding a local model-mismatch term and yields corollaries under uniform hitting-time or Foster-Lyapunov conditions.

Core claim

For stochastic shortest path problems with absorbing terminal states, the loss of any fixed rollout policy relative to the optimal value is at most the supremum norm of the value approximation error times the expected hitting time to absorption under the rollout closed loop. Suboptimality accumulates exactly via the transient occupation measure, and the hitting-time factor is shown to be unavoidable by a deterministic sharpness construction.

What carries the argument

The performance bound obtained by multiplying uniform value-function approximation error by the expected hitting time to the terminal state under the rollout policy.

If this is right

Under a uniform upper bound on expected hitting times the performance loss is controlled solely by the accuracy of the value approximation.
A performance-difference identity gives the exact decomposition of suboptimality through the transient occupation measure.
The same bound extends to certainty-equivalent rollout after adding a separate local model-mismatch penalty.
Foster-Lyapunov drift conditions imply finite hitting-time factors and therefore finite performance guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hitting-time factor suggests that rollout performance guarantees can be obtained in any undiscounted transient Markov decision process once a drift condition is verified.
The performance-difference identity may be useful for diagnosing which states contribute most to suboptimality in simulation.
The sharpness example indicates that any attempt to remove the hitting-time multiplier must fail on some deterministic shortest-path graph.

Load-bearing premise

The rollout policy is fixed in advance and the Markov chain under that policy reaches an absorbing terminal state in finite expected time.

What would settle it

A concrete stochastic shortest path instance and rollout policy where the observed suboptimality exceeds the product of the value approximation error and the measured expected hitting time.

read the original abstract

This paper concerns rollout and certainty-equivalent rollout policies for stochastic shortest path problems with absorbing terminal states. The main result provides a direct non-asymptotic performance certificate for a fixed rollout policy: the loss relative to the optimal value is controlled by the uniform accuracy of the value approximation and by the expected time for which the rollout closed loop remains away from the terminal state. Thus, in the undiscounted transient setting, the expected hitting time plays the role of a discount or finite-horizon parameter in more standard approximate dynamic programming bounds. This paper also gives a performance-difference identity showing that suboptimality is exactly accumulated through the transient occupation measure, and a deterministic sharpness example showing that the hitting-time factor is unavoidable. Finally, consequences under uniform hitting-time and Foster-Lyapunov drift conditions are given, and extend the argument to certainty-equivalent rollout by adding a separate local model-mismatch term.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper supplies a direct non-asymptotic bound for fixed rollout policies in absorbing SSP problems, with suboptimality controlled by value approximation error and expected hitting time to absorption.

read the letter

The central contribution is a performance certificate that replaces the usual discount factor with the expected hitting time in the undiscounted transient setting. Suboptimality is expressed exactly through the transient occupation measure, and a deterministic sharpness example shows the hitting-time multiplier is necessary rather than loose. The extension to certainty-equivalent rollout adds only a local model-mismatch term under the same structure. These pieces sit on standard MDP occupation-measure identities and do not appear to recycle prior discounted ADP results in disguised form. The transient absorbing setup and fixed-policy assumption are stated up front, which keeps the claims scoped appropriately. The derivation looks clean on the outline given, with no visible circularity or unsecured steps. Minor soft spot is that the full paper would need to confirm the exact conditions under which the uniform hitting-time or Foster-Lyapunov consequences hold, but nothing in the abstract or stress-test flags a load-bearing gap. This is useful reading for anyone working on rollout or approximate dynamic programming in undiscounted stochastic control. A reader already familiar with performance-difference lemmas will see the extension clearly and can judge whether the hitting-time factor helps in their setting. It is worth sending to peer review because the claim is concrete, the sharpness example is reproducible in principle, and the topic sits in an active area where non-asymptotic certificates are still sparse.

Referee Report

0 major / 3 minor

Summary. The paper derives non-asymptotic performance bounds for fixed rollout policies (and their certainty-equivalent extensions) in stochastic shortest path problems with absorbing terminal states. The central result bounds the suboptimality gap of the rollout policy by the product of the uniform (sup-norm) approximation error of the value function and the expected hitting time to absorption under the closed-loop rollout dynamics. An exact performance-difference identity is given that expresses suboptimality via the transient occupation measure; a deterministic sharpness example shows the hitting-time factor cannot be removed in general. Consequences are derived under uniform hitting-time bounds and Foster-Lyapunov drift conditions.

Significance. If the derivations hold, the work is significant for supplying explicit, non-asymptotic guarantees in the undiscounted transient SSP setting, where the expected hitting time plays the role of a discount factor. The exact occupation-measure identity and the sharpness example (showing necessity of the hitting-time term) are particularly useful. The extension to certainty-equivalent rollout with an additive local mismatch term broadens applicability. These results strengthen the theoretical basis for rollout methods in approximate dynamic programming for SSP problems.

minor comments (3)

[Abstract] The abstract states the main result clearly but does not reference the theorem or section number containing the central bound; adding such a pointer would improve navigation.
Notation for the value approximation error (e.g., ||V̂ - V*||_∞) and the expected hitting time should be introduced with explicit definitions in the introduction or preliminaries section before the main theorem.
The deterministic sharpness example is mentioned but its construction (state space, costs, and policy) could be summarized in one additional sentence in the abstract or introduction to highlight why the hitting-time factor is unavoidable.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on performance bounds for rollout policies in stochastic shortest path problems and for recommending minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained on standard MDP identities

full rationale

The central claims rest on a performance-difference identity via transient occupation measures, a non-asymptotic bound using sup-norm value error and expected hitting time to absorption, and a sharpness example. These follow directly from the problem setup (undiscounted transient SSP with absorbing terminals and fixed rollout policy) using standard MDP definitions and occupation-measure identities, without reduction to fitted quantities, self-citation chains, or ansatzes imported from the authors' prior work. No load-bearing step equates a prediction to its input by construction, and the hitting-time factor is shown necessary via explicit example. This matches the default expectation for papers whose derivations remain independent of their own fitted outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on standard stochastic shortest path assumptions (absorbing terminals, transient dynamics) and the definition of rollout policies; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Existence of absorbing terminal states and transient dynamics under the rollout policy
Explicitly stated as the setting for the problems considered.
domain assumption Uniform accuracy of the value approximation over the state space
Used to control the loss term in the main bound.

pith-pipeline@v0.9.0 · 5682 in / 1220 out tokens · 20655 ms · 2026-05-25T05:34:28.663420+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

T.Trevizan,F.Teichteil-Königsbuch,S.Thiébaux,Efficientsolutions for stochastic shortest path problems with dead ends, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2017

work page 2017
[2]

Kolobov, Mausam, D

A. Kolobov, Mausam, D. S. Weld, A theory of goal-oriented mdps with dead ends, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, 2012

work page 2012
[3]

A. J. Briggs, B. D. Anderson, T. D. Barfoot, Stochastic shortest path planning for landmark-based robot navigation, in: Proceedings of the IEEEInternationalConferenceonRoboticsandAutomation(ICRA), New Orleans, LA, 2004

work page 2004
[4]

S. Lim, H. Balakrishnan, D. Gifford, S. Madden, D. Rus, Stochastic motion planning and applications to traffic, in: G. S. Chirikjian, H. Choset, M. Morales, T. Murphey (Eds.), Algorithmic Foundation of Robotics VIII: Selected Contributions of the Eight International Workshop on the Algorithmic Foundations of Robotics, Springer Berlin Heidelberg, Berlin, H...

work page 2010
[5]

D. P. Bertsekas, S. E. Shreve, Stochastic Optimal Control: The Discrete-Time Case, Academic Press, New York, 1978

work page 1978
[6]

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994

work page 1994
[7]

D.P.Bertsekas,DynamicProgrammingandOptimalControl,Vol.II: Approximate Dynamic Programming, 4th Edition, Athena Scientific, Belmont, MA, 2012

work page 2012
[8]

D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996

work page 1996
[9]

W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, Hoboken, NJ, 2007

work page 2007
[10]

R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, Cambridge, MA, 2018

work page 2018
[11]

J.N.Tsitsiklis,B.V.Roy,Ananalysisoftemporal-differencelearning with function approximation, IEEE Transactions on Automatic Con- trol 42 (5) (1997) 674–690

work page 1997
[12]

R.Munos,C.Szepesvári,Finite-timeboundsforfittedvalueiteration, Journal of Machine Learning Research 9 (2008) 815–857

work page 2008
[13]

E. A. Hansen, Error bounds for stochastic shortest path problems, Mathematical Methods of Operations Research 86 (1) (2017) 1–27. doi:10.1007/s00186-017-0581-5

work page doi:10.1007/s00186-017-0581-5 2017
[14]

S. R. Jafari, A. Hansson, B. Wahlberg, Path planning with moving obstaclesusingstochasticoptimalcontrol,tobepresentedatthe15th Asian Control Conference (2025).arXiv:2504.02057. A. Hansson and B. Wahlberg:Preprint submitted to ElsevierPage 8 of 8

work page arXiv 2025

[1] [1]

T.Trevizan,F.Teichteil-Königsbuch,S.Thiébaux,Efficientsolutions for stochastic shortest path problems with dead ends, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2017

work page 2017

[2] [2]

Kolobov, Mausam, D

A. Kolobov, Mausam, D. S. Weld, A theory of goal-oriented mdps with dead ends, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, CA, 2012

work page 2012

[3] [3]

A. J. Briggs, B. D. Anderson, T. D. Barfoot, Stochastic shortest path planning for landmark-based robot navigation, in: Proceedings of the IEEEInternationalConferenceonRoboticsandAutomation(ICRA), New Orleans, LA, 2004

work page 2004

[4] [4]

S. Lim, H. Balakrishnan, D. Gifford, S. Madden, D. Rus, Stochastic motion planning and applications to traffic, in: G. S. Chirikjian, H. Choset, M. Morales, T. Murphey (Eds.), Algorithmic Foundation of Robotics VIII: Selected Contributions of the Eight International Workshop on the Algorithmic Foundations of Robotics, Springer Berlin Heidelberg, Berlin, H...

work page 2010

[5] [5]

D. P. Bertsekas, S. E. Shreve, Stochastic Optimal Control: The Discrete-Time Case, Academic Press, New York, 1978

work page 1978

[6] [6]

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994

work page 1994

[7] [7]

D.P.Bertsekas,DynamicProgrammingandOptimalControl,Vol.II: Approximate Dynamic Programming, 4th Edition, Athena Scientific, Belmont, MA, 2012

work page 2012

[8] [8]

D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996

work page 1996

[9] [9]

W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, Hoboken, NJ, 2007

work page 2007

[10] [10]

R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, Cambridge, MA, 2018

work page 2018

[11] [11]

J.N.Tsitsiklis,B.V.Roy,Ananalysisoftemporal-differencelearning with function approximation, IEEE Transactions on Automatic Con- trol 42 (5) (1997) 674–690

work page 1997

[12] [12]

R.Munos,C.Szepesvári,Finite-timeboundsforfittedvalueiteration, Journal of Machine Learning Research 9 (2008) 815–857

work page 2008

[13] [13]

E. A. Hansen, Error bounds for stochastic shortest path problems, Mathematical Methods of Operations Research 86 (1) (2017) 1–27. doi:10.1007/s00186-017-0581-5

work page doi:10.1007/s00186-017-0581-5 2017

[14] [14]

S. R. Jafari, A. Hansson, B. Wahlberg, Path planning with moving obstaclesusingstochasticoptimalcontrol,tobepresentedatthe15th Asian Control Conference (2025).arXiv:2504.02057. A. Hansson and B. Wahlberg:Preprint submitted to ElsevierPage 8 of 8

work page arXiv 2025