pith. machine review for the scientific record.

arxiv: 2604.26748 · v1 · submitted 2026-04-29 · 💻 cs.LO


On the Complexity of Robust Markov Decision Processes and Bisimulation Metrics

Guillermo A. Pérez, Marnix Suilen


Pith reviewed 2026-05-07 12:33 UTC · model grok-4.3

classification 💻 cs.LO
keywords robust Markov decision processes · complexity · parity games · bisimulation metrics · polytopic uncertainty · threshold problem · rectangular RMDPs · policy iteration

The pith

Parity games reduce in polynomial time to the RMDP threshold problem, so a polynomial-time algorithm for the threshold problem would place parity games in P.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper classifies the complexity of deciding whether the worst-case discounted value of a robust MDP exceeds a threshold, for polytopic uncertainty sets given in halfspace form. For (s,a)-rectangular RMDPs it shows robust policy evaluation lies in P via robust linear programming and the threshold decision lies in NP; for s-rectangular RMDPs the threshold decision lies in PSPACE via the first-order theory of the reals. The authors establish matching hardness by polynomial reductions from parity games and from the computation of bisimulation metrics between MDP states. A reader would care because the parity-game reduction directly ties an open complexity question to a practical robust-planning problem.

Core claim

For (s,a)-rectangular RMDPs the threshold problem is in NP and robust policy evaluation is in P; for s-rectangular RMDPs the threshold problem is in PSPACE. Both parity games and bisimulation metrics reduce in polynomial time to the RMDP threshold problem, so any polynomial-time algorithm for the latter would place parity games in P. The bisimulation reduction also yields a practical method: robust policy iteration computes the metrics faster than classical fixed-point iteration on the instances tested.

What carries the argument

Polynomial-time reductions from parity games (and from bisimulation metrics) to the RMDP threshold problem, together with robust linear programming for policy evaluation under rectangular polytopic uncertainty.
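The workhorse on the membership side is the adversary's linear program over the polytopic uncertainty set. Below is a minimal sketch of robust policy evaluation for a fixed policy, assuming interval uncertainty sets, a simple special case of the halfspace representation in which the inner LP admits a greedy closed form; all function names are ours, not the paper's:

```python
# Sketch (our construction, not the paper's): robust policy evaluation on an
# (s,a)-rectangular RMDP whose polytopic uncertainty sets are intervals
# lo[t] <= P(t) <= hi[t] with sum_t P(t) = 1 (a simple halfspace form).
# For intervals the adversary's inner LP has a greedy solution: push as much
# probability mass as the bounds allow onto low-value successors.

def worst_case_dist(lo, hi, values):
    """Adversary's inner LP: minimize sum_t p[t] * values[t] over the box."""
    p = list(lo)                      # start from the lower bounds
    slack = 1.0 - sum(lo)             # mass still to be distributed
    # Greedily assign remaining mass to successors in increasing value order.
    for t in sorted(range(len(values)), key=lambda t: values[t]):
        give = min(hi[t] - lo[t], slack)
        p[t] += give
        slack -= give
    return p

def robust_policy_evaluation(reward, lo, hi, gamma=0.9, iters=200):
    """Iterate the robust Bellman operator for a fixed policy to a fixed point."""
    n = len(reward)
    V = [0.0] * n
    for _ in range(iters):
        V = [reward[s] + gamma * sum(p * v for p, v in
             zip(worst_case_dist(lo[s], hi[s], V), V))
             for s in range(n)]
    return V
```

For example, with rewards (1, 0), state 1 absorbing, and interval bounds [0.3, 0.7] on both successors of state 0, the adversary shifts 0.7 of the mass to the zero-value state, and the iteration converges to V(0) = 1/0.73 ≈ 1.37 at γ = 0.9. General polytopes in halfspace form need a real LP solver in the inner step, which is exactly where the paper's robust-LP machinery enters.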

If this is right

  • Robust policy evaluation for (s,a)-rectangular RMDPs can be performed in polynomial time by solving a robust linear program.
  • Robust policy iteration yields a polynomial-time algorithm for the optimization problem when the discount factor is fixed.
  • The threshold problem for s-rectangular RMDPs lies in PSPACE.
  • Bisimulation metrics between states of an ordinary MDP can be computed by reducing to an RMDP and applying robust policy iteration.
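On the last point, the baseline that robust policy iteration is measured against is the classical fixed-point iteration for bisimulation metrics. A toy sketch under strong simplifying assumptions (two states, so the Kantorovich term in the update collapses to a closed form; the function name and encoding are ours):

```python
# Sketch (our toy, not the paper's algorithm): classical fixed-point iteration
# for the bisimulation metric on a two-state MDP with states 0 and 1.
# Over two points the Kantorovich distance has a closed form,
#   W_d(p, q) = d(0, 1) * |p[0] - q[0]|,
# which sidesteps the transport LP needed in the general case.

def bisim_metric_2state(R, P, gamma=0.9, iters=500):
    """R[a][s]: reward of action a in state s; P[a][s]: prob. that state s
    moves to state 0 under action a. Returns the estimate of d(0, 1)."""
    d = 0.0  # d(s, s) = 0 always, so only d(0, 1) needs iterating
    for _ in range(iters):
        d = max(abs(R[a][0] - R[a][1]) + gamma * d * abs(P[a][0] - P[a][1])
                for a in range(len(R)))
    return d
```

With one action, rewards (1, 0), and identical transitions, the metric is exactly the reward gap 1; if the states instead move deterministically to themselves, the gap compounds to 1/(1 − γ) = 10 at γ = 0.9. The paper's contribution is to replace this Kantorovich fixed-point loop with robust policy iteration on a derived RMDP, which it reports as faster on the instances tested.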

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Techniques developed for parity games could be imported to obtain faster robust-planning algorithms under uncertainty.
  • The structural preservation in the reductions may let similar hardness proofs apply to other uncertainty representations such as ellipsoids.
  • Empirical speed-ups observed for bisimulation metrics suggest that game-solving heuristics might accelerate robust value iteration in practice.

Load-bearing premise

The reductions from parity games and bisimulation metrics to the RMDP threshold problem preserve rectangular structure and the polytopic halfspace representation of the uncertainty sets.

What would settle it

A concrete parity-game instance whose corresponding (s,a)-rectangular RMDP has a threshold answer that differs from whether the first player wins the game.

read the original abstract

Robust Markov decision processes (RMDPs) extend standard Markov decision processes (MDPs) to account for uncertainty in the transition probabilities. RMDPs have an uncertainty set that defines a set of possible transition functions, each of which induces a standard MDP. The natural objective in an RMDP is to optimize the discounted cumulative reward under the worst-case transition function in the uncertainty set. We study the complexity of the associated threshold problem for RMDPs with polytopic uncertainty sets in halfspace representation. Previous results focused on approximating the optimum or restricted attention to specific subclasses of RMDPs, such as interval MDPs or $L_\infty$-RMDPs. Our contributions are threefold: (1) For (s,a)-rectangular RMDPs, we prove that robust policy evaluation is in P via robust linear programming, and that the threshold problem is in NP. As a corollary, robust policy iteration is a polynomial-time algorithm for these RMDPs when the discount factor is fixed. (2) For $s$-rectangular RMDPs, we show that the threshold problem is in PSPACE via the first-order theory of the reals. (3) We establish lower bounds by reducing both parity games and bisimulation metrics between MDP states to the RMDP threshold problem. A polynomial-time algorithm for the threshold problem would resolve the long-standing open question of whether parity games can be solved in polynomial time. The reduction from bisimulation metrics also yields a practical benefit: it allows us to apply robust policy iteration as a more efficient alternative to the standard fixed-point iteration, as our empirical evaluation demonstrates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript studies the complexity of the threshold problem for robust Markov decision processes (RMDPs) with polytopic uncertainty sets in halfspace representation. For (s,a)-rectangular RMDPs it shows robust policy evaluation is in P via robust linear programming and the threshold problem is in NP, yielding a polynomial-time robust policy iteration algorithm for fixed discount factors. For s-rectangular RMDPs the threshold problem is placed in PSPACE via the first-order theory of the reals. Hardness is established by reductions from parity games and from bisimulation metric computation on MDPs; the former implies that a polynomial-time algorithm for the RMDP threshold problem would place parity games in P. The bisimulation reduction is additionally used to motivate robust policy iteration as a practical alternative to fixed-point iteration, with supporting empirical results.

Significance. If the reductions preserve rectangularity and polytopic halfspace structure, the work supplies new complexity classifications for structured RMDPs and forges a direct link between RMDP decision problems and the long-standing open question of parity-game complexity. The polynomial-time policy-evaluation result and the PSPACE upper bound are concrete and useful; the practical algorithmic suggestion for bisimulation metrics is a secondary but welcome contribution with empirical support.

major comments (1)
  1. Abstract (and the lower-bound section): the claim that a polynomial-time algorithm for the RMDP threshold problem would resolve whether parity games lie in P requires an explicit reduction that maps parity-game instances to (s,a)-rectangular or s-rectangular polytopic RMDPs whose uncertainty sets remain in halfspace representation. The manuscript must verify that the constructed uncertainty sets satisfy these structural restrictions; otherwise the hardness result applies only to a strictly larger class of RMDPs and the corollary for the classes whose membership is proved (NP via robust LP, PSPACE via FO reals) does not follow.
minor comments (1)
  1. The distinction between (s,a)-rectangular and s-rectangular uncertainty sets should be recalled with a short formal definition in the complexity sections so that readers can immediately map the upper-bound proofs to the hardness instances.
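For reference, the short formal definitions the referee asks for, as they are standardly stated in the robust-MDP literature (our notation, not necessarily the paper's):

```latex
% (s,a)-rectangular: the uncertainty set factorizes over state-action pairs;
% the adversary chooses each transition distribution independently.
\mathcal{U} \;=\; \prod_{(s,a)\,\in\, S\times A} \mathcal{U}_{s,a},
\qquad \mathcal{U}_{s,a} \subseteq \Delta(S).

% s-rectangular: factorization only across states; within a state the
% adversary's choices for different actions may be coupled.
\mathcal{U} \;=\; \prod_{s\,\in\, S} \mathcal{U}_{s},
\qquad \mathcal{U}_{s} \subseteq \Delta(S)^{A}.
```

Every (s,a)-rectangular set is s-rectangular, which is why the hardness instances must be checked against the finer class for the NP corollary to go through.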

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to explicitly verify structural preservation in the hardness reduction. We address this point below and will strengthen the presentation accordingly.

read point-by-point responses
  1. Referee: [—] Abstract (and the lower-bound section): the claim that a polynomial-time algorithm for the RMDP threshold problem would resolve whether parity games lie in P requires an explicit reduction that maps parity-game instances to (s,a)-rectangular or s-rectangular polytopic RMDPs whose uncertainty sets remain in halfspace representation. The manuscript must verify that the constructed uncertainty sets satisfy these structural restrictions; otherwise the hardness result applies only to a strictly larger class of RMDPs and the corollary for the classes whose membership is proved (NP via robust LP, PSPACE via FO reals) does not follow.

    Authors: We agree that an explicit verification is required for the claim to apply to the rectangular polytopic classes studied in the paper. Our reduction from parity games constructs an (s,a)-rectangular RMDP in which each state-action pair has an independent polytopic uncertainty set defined by a constant number of halfspace inequalities (specifically, the transition probabilities are constrained to lie in a simplex-like polytope whose facets are given by linear inequalities that do not couple distinct (s,a) pairs). The same construction yields an s-rectangular instance when the per-action uncertainties are aggregated per state. We will add a dedicated lemma (with a short proof) in the lower-bound section that formally confirms both rectangularity and the halfspace representation, thereby ensuring the NP-hardness result transfers directly to the (s,a)-rectangular case for which we already have membership in NP. The PSPACE upper bound for the s-rectangular case is unaffected. This clarification will also be reflected in a revised abstract sentence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rely on independent reductions and standard decision procedures.

full rationale

The paper establishes membership results directly: robust policy evaluation for (s,a)-rectangular RMDPs is shown to be in P by reduction to robust linear programming, and the threshold problem is placed in NP; for s-rectangular RMDPs the threshold problem is placed in PSPACE by encoding into the first-order theory of the reals. The lower-bound claims are obtained by exhibiting reductions from parity games and from bisimulation-metric computation; these are external, well-studied problems whose complexity status is independent of the present paper. No equation or definition is shown to be equivalent to its own input by construction, no parameter is fitted on a subset and then renamed a prediction, and no load-bearing step rests on a self-citation whose justification is internal to the authors' prior work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Results rely on standard complexity assumptions and decidability of the first-order theory of the reals; no free parameters or invented entities appear.

axioms (1)
  • standard math: The first-order theory of the reals is decidable.
    Used to obtain the PSPACE upper bound for the s-rectangular case.
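A sketch of how that axiom is typically deployed: the threshold question can be phrased as a sentence over the ordered field of the reals, with the polytopes' halfspace constraints appearing as linear inequalities. This is our rendering of the standard encoding, shown in the (s,a)-rectangular Bellman form for concreteness, not the paper's exact formula:

```latex
% Does the optimal robust value at s_0 meet the threshold \vartheta?
\exists V \in \mathbb{R}^{S}\;\Bigg(
  \bigwedge_{s\in S} V_s \;=\; \max_{a\in A}\;\min_{P\in\mathcal{U}_{s,a}}
    \Big( R(s,a) \;+\; \gamma \sum_{t\in S} P(t)\, V_t \Big)
  \;\;\wedge\;\; V_{s_0} \,\ge\, \vartheta \Bigg)
```

Each max/min unfolds into a bounded block of existential and universal quantifiers over the halfspace constraints defining the uncertainty sets (the s-rectangular case quantifies over the per-state sets instead). Because the number of quantifier alternations stays fixed, standard decision procedures for the first-order theory of the reals run in polynomial space on the resulting sentence.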

pith-pipeline@v0.9.0 · 8329 in / 1025 out tokens · 55311 ms · 2026-05-07T12:33:25.409819+00:00 · methodology

