On the Complexity of Discounted Robust MDPs with L_p Uncertainty Sets
Recognition: 2 theorem links
Pith reviewed 2026-05-11 02:03 UTC · model grok-4.3
The pith
Policy iteration for robust MDPs with any compact uncertainty set runs in strongly polynomial time given an oracle for robust Markov chains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For any compact uncertainty set, the policy iteration algorithm for RMDPs is strongly polynomial with oracle access to solutions of Robust Markov chains (RMCs). Strongly polynomial-time bounds are presented on the policy iteration algorithm for RMCs with L1 and L∞ uncertainty sets. Hardness results are established for RMCs with Lp uncertainty sets for integer p satisfying 1 < p < ∞.
What carries the argument
Policy iteration on robust MDPs, whose improvement steps reduce to solving robust Markov chains under the chosen Lp uncertainty set.
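To make the oracle-based structure concrete, here is a minimal sketch of that loop, assuming a cost-minimizing agent, a worst-case adversary, and a black-box RMC oracle; the names `rmc_oracle` and `worst_case_q`, and the tabular interface, are illustrative assumptions, not the paper's API.

```python
import numpy as np

def robust_policy_iteration(costs, rmc_oracle, worst_case_q, gamma,
                            n_states, n_actions):
    """Sketch of oracle-based robust policy iteration (illustrative).

    costs[s, a]           immediate cost of action a in state s
    rmc_oracle(policy)    robust value vector of the fixed policy, i.e.
                          the Robust Markov Chain subproblem's solution
    worst_case_q(s, a, v) adversary's value max_{p in U(s,a)} p @ v
    """
    policy = np.zeros(n_states, dtype=int)        # arbitrary start
    while True:
        v = rmc_oracle(policy)                    # robust evaluation
        # greedy improvement against the worst-case transition kernel
        q = np.array([[costs[s, a] + gamma * worst_case_q(s, a, v)
                       for a in range(n_actions)]
                      for s in range(n_states)])
        new_policy = q.argmin(axis=1)             # cost criterion: minimize
        if np.array_equal(new_policy, policy):    # no strict improvement
            return policy, v
        policy = new_policy
```

The point of part (a) is that the number of trips around this loop is strongly polynomial for any compact uncertainty set; all norm-specific difficulty is pushed into `rmc_oracle`.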
If this is right
- For L1 and L∞ uncertainty sets, the number of policy-iteration steps for robust Markov chains is bounded by a polynomial in the number of states (a sketch of the L1 inner step follows this list).
- Robust Markov chain problems with Lp uncertainty for integer p with 1 < p < ∞ are hard, blocking strongly polynomial algorithms via the same route.
- The overall RMDP algorithm is strongly polynomial whenever the RMC oracle itself is: the outer loop contributes only polynomially many oracle calls.
- Experiments in the paper show that policy iteration converges quickly in practice for the L1 and L∞ uncertainty sets, the cases covered by the theoretical bounds.
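The L1 tractability claim above rests on the inner adversarial step having a greedy closed form. A minimal sketch of that step, assuming the adversary maximizes expected next-step value over an L1 ball around a nominal distribution `p_hat` (the function name and interface are illustrative, not the paper's):

```python
import numpy as np

def worst_case_l1(p_hat, v, kappa):
    """Adversary's inner step for an L1 ball (illustrative, not from the paper).

    Maximizes p @ v over {p on the simplex : ||p - p_hat||_1 <= kappa}
    by shifting up to kappa/2 probability mass from the lowest-value
    states onto the single highest-value state.
    """
    p = np.asarray(p_hat, dtype=float).copy()
    best = int(np.argmax(v))
    delta = min(kappa / 2.0, 1.0 - p[best])   # mass we are allowed to move
    p[best] += delta
    for s in np.argsort(v):                   # drain the worst states first
        if delta <= 0:
            break
        if s == best:
            continue
        take = min(p[s], delta)
        p[s] -= take
        delta -= take
    return p
```

The single sort makes each inner step $O(n \log n)$, one ingredient consistent with the L1 case admitting strongly polynomial bounds while intermediate $L_p$ norms do not.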
Where Pith is reading between the lines
- If efficient oracles for intermediate Lp sets can be constructed, the general oracle result would immediately yield tractable algorithms for those models as well.
- The separation between L1/L∞ and other p values supplies a concrete criterion for choosing uncertainty sets in applications where runtime guarantees are required.
Load-bearing premise
Uncertainty sets are compact, the discount factor is a fixed constant, and an exact oracle exists for solving the robust Markov chain subproblems.
What would settle it
An explicit family of compact uncertainty sets together with an RMDP instance on which policy iteration requires a number of iterations exponential in the number of states and actions would disprove the strong polynomiality claim.
Original abstract
A basic model in sequential decision making is the Markov decision process (MDP), which is extended to Robust MDPs (RMDPs) by allowing uncertainty in transition probabilities and optimizing against the worst-case transition probabilities from the uncertainty sets. The class of $(s, a)$-rectangular RMDPs with $L_p$ uncertainty sets provides a flexible and expressive model for such problems. We study this class of RMDPs with a discounted-sum cost criterion and a constant discount factor. The existence of an efficient algorithm for this class is a fundamental theoretical question in optimization and sequential decision making. Previous results only establish a strongly polynomial-time algorithm for $L_\infty$ uncertainty sets. In this work, our main results are as follows: (a)~we show that for any compact uncertainty set, the policy iteration algorithm for RMDPs is strongly polynomial with oracle access to solutions of Robust Markov chains (RMCs); (b)~we present strongly polynomial-time bounds on the policy iteration algorithm for RMCs with $L_1$ and $L_\infty$ uncertainty sets; and (c)~we establish hardness results for RMCs with $L_p$ uncertainty sets for integer $p$ satisfying $1<p<\infty$. Finally, motivated by our theoretical bounds, we present experimental results showing how fast policy iteration converges for RMDPs with $L_1$ and $L_\infty$ uncertainty sets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies the complexity of policy iteration for discounted (s,a)-rectangular robust MDPs with Lp uncertainty sets under a constant discount factor. It claims that for any compact uncertainty set, policy iteration on RMDPs is strongly polynomial given an oracle for solving robust Markov chains (RMCs); it gives explicit strongly polynomial iteration bounds for the RMC subproblem under L1 and L∞ uncertainty sets; and it proves hardness for RMC instances with Lp uncertainty sets when 1 < p < ∞ is integer. The work also includes motivating experiments on convergence speed for the L1 and L∞ cases.
Significance. If the proofs are correct, the results provide a clean modular separation between the combinatorial policy-improvement structure and the inner robust evaluation, extending the known strongly polynomial result for L∞ to L1 while establishing that intermediate Lp norms are hard. The oracle-based argument for general compact sets and the explicit bounds for the tractable cases are technically useful for robust optimization and sequential decision making. The hardness results clarify why L1/L∞ are special. Experimental validation, while secondary, supports the practical relevance of the polynomial bounds.
minor comments (3)
- The statement of strong polynomiality in part (a) should explicitly note the dependence on the number of states, the number of actions, and the fixed discount factor; the current wording in the abstract leaves open whether the iteration bound is independent of the description size of the uncertainty set (see the illustrative bound sketched after this list).
- Section 5 (experiments): the convergence plots would be clearer if the y-axis were shown on a logarithmic scale and if the number of random instances per parameter setting were reported.
- The hardness reduction for 1 < p < ∞ (Theorem 4.3 or equivalent) should include a short remark on whether the constructed RMC instances remain (s,a)-rectangular when lifted back to the RMDP setting.
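For calibration, recall the shape such bounds take in the non-robust setting. For standard discounted MDPs with a fixed discount factor $\gamma$, the line of work of Ye (2011) and Hansen, Miltersen, and Zwick (2013) bounds Howard's policy iteration by roughly
$$O\!\left(\frac{|S|\,|A|}{1-\gamma}\,\log\frac{|S|}{1-\gamma}\right)$$
iterations, independent of the numerical values of costs and transition probabilities; the clarified statement of part (a) should say whether the uncertainty-set description enters the analogous bound or only the per-call cost of the RMC oracle.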
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work on the complexity of policy iteration for discounted robust MDPs with Lp uncertainty sets, for recognizing the modular separation between policy improvement and robust evaluation, and for raising only minor comments, which we will address in revision. No major concerns were raised in the report.
Circularity Check
No significant circularity detected
full rationale
The paper establishes complexity-theoretic results on policy iteration for robust MDPs and robust Markov chains under compact uncertainty sets. Part (a) reduces the RMDP problem to polynomially many calls to an RMC oracle via standard contraction arguments that hold for any compact set and constant discount factor; parts (b) and (c) then supply explicit polynomial bounds for L1/L∞ oracles and hardness proofs for intermediate Lp cases. These steps rely on independent algorithmic analysis, oracle separation, and standard hardness reductions rather than any self-definition, fitted-parameter renaming, or load-bearing self-citation. The derivation chain is self-contained against external complexity benchmarks and does not reduce any claimed result to its own inputs by construction.
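A minimal statement of the contraction step this rationale leans on, written here for a cost criterion with $(s,a)$-rectangular sets (the notation is assumed for illustration, not taken from the paper): the robust Bellman operator
$$(\mathcal{T}v)(s) \;=\; \min_{a}\ \max_{p \,\in\, \mathcal{U}(s,a)} \Big[\, c(s,a) + \gamma \textstyle\sum_{s'} p(s')\, v(s') \,\Big]$$
satisfies $\|\mathcal{T}u - \mathcal{T}v\|_\infty \le \gamma\, \|u - v\|_\infty$ for every compact $\mathcal{U}(s,a)$, since $\min$ and $\max$ are 1-Lipschitz in their arguments and each $p$ is a probability vector; compactness ensures the inner maximum is attained, and $\gamma < 1$ yields a unique fixed point by Banach's theorem, which is what makes the oracle reduction independent of the particular uncertainty set.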
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: uncertainty sets are compact and the discount factor is a fixed constant.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We present strongly polynomial-time bounds on the policy iteration algorithm for RMCs with L1 and L∞ uncertainty sets; ... hardness results for RMCs with Lp uncertainty sets for integer p satisfying 1<p<∞."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt — unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "the cumulative transition-probability potential ... halving of the cumulative gap ... combinatorial bound on the number of halvings"
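The second passage gestures at a potential-halving argument. The counting step behind any such argument is elementary; as a generic illustration (the symbols $\Phi_0$ and $\epsilon$ are assumptions of this sketch, not the paper's): a potential that starts at $\Phi_0$, never drops below $\epsilon > 0$, and at least halves in each phase can pass through at most $\lceil \log_2(\Phi_0/\epsilon) \rceil$ phases, since after $k$ halvings it is at most $2^{-k}\Phi_0$.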
What do these tags mean?
- matches — the paper's claim is directly supported by a theorem in the formal canon.
- supports — the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — the paper appears to rely on the theorem as machinery.
- contradicts — the paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.