pith. sign in

arxiv: 2606.03048 · v1 · pith:4PVQQLGCnew · submitted 2026-06-02 · 🧮 math.OC · math.AG· stat.ML

The Value Function Semi-Algebraic Set in Partially Observable Markov Decision Processes

Pith reviewed 2026-06-28 09:14 UTC · model grok-4.3

classification 🧮 math.OC math.AGstat.ML
keywords POMDPvalue functionsemi-algebraic setmemoryless policypolicy optimizationpartially observable Markov decision processinfinite horizon
0
0 comments X

The pith

The feasible set of value functions in infinite-horizon POMDPs under memoryless stochastic policies is a semi-algebraic set defined by explicit polynomial inequalities from the transition dynamics, observation kernel, and rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a geometric characterization of all achievable long-term value functions for memoryless stochastic policies in POMDPs. It proves that this set is semi-algebraic and can be written explicitly using polynomial inequalities that depend only on the POMDP parameters. This description replaces the simpler polytope structure known for fully observable MDPs with a richer nonlinear geometry induced by partial observability. The result explains why policy optimization landscapes in POMDPs contain isolated local maximizers whose locations depend on the initial distribution. A reader would care because the explicit inequalities make the feasible region algebraically describable rather than abstract.

Core claim

The feasible set of value functions in infinite-horizon POMDPs under memoryless stochastic policies is a semi-algebraic set defined by explicit polynomial inequalities determined by the transition dynamics, observation kernel, and reward structure of the POMDP. In contrast to the polyhedral structure in MDPs, partial observability induces fundamentally nonlinear constraints, leading to a richer geometric structure that reveals isolated local maximizers of long-term reward.

What carries the argument

The semi-algebraic set of feasible value functions, defined by polynomial inequalities derived from the POMDP transition probabilities, observation kernel, and reward function.

If this is right

  • The feasible set in POMDPs has a richer and more complex structure than the polytope arising in MDPs because of nonlinear constraints.
  • Policy optimization in POMDPs can encounter isolated local maximizers of long-term reward that are absent in the fully observable case.
  • The locations of these local maximizers depend on the initial state distribution.
  • The geometric characterization supplies explicit algebraic insight into the optimization landscape for both MDPs and POMDPs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit polynomial description may allow algebraic or semidefinite programming techniques to certify global optimality or to enumerate local maxima in small POMDPs.
  • The same style of derivation could be applied to finite-horizon or discounted settings to obtain similar semi-algebraic descriptions.
  • The dependence on the initial distribution suggests that value-function geometry is not translation-invariant across belief states in the way MDP polytopes are.

Load-bearing premise

The characterization holds only for memoryless stochastic policies and does not automatically apply to history-dependent or deterministic policies.

What would settle it

A concrete value function vector that satisfies all the stated polynomial inequalities yet cannot be realized by any memoryless stochastic policy, or vice versa.

Figures

Figures reproduced from arXiv: 2606.03048 by Guido Montufar, Ryan A. Anderson.

Figure 1
Figure 1. Figure 1: The effective policy polytope for fully observable 2- state MDPs is the blue square, while the red parallelogram is the effective policy polytope for a partially observable 2-state MDPs, for some β not equal to the identity matrix. We regard the Bellman equation (3) as a parametric linear system in the indeterminate V π ∈ R S whose coefficients (I − γP π ) ∈ R S×S and r π ∈ R S are parametrized by the poli… view at source ↗
Figure 2
Figure 2. Figure 2: With Theorem 3.9, we are able to identify the feasible space of value functions for any POMDP as an intersection of infinitely many piecewise linear inequalities. Here we sampled 50 such inequalities for a POMDP as light blue lines and show they enclose the feasible region, shown in green. Remark 3.10. In the special case of fully observable MDPs, Theorem 3.9 reduces to finitely many piecewise linear in￾eq… view at source ↗
Figure 3
Figure 3. Figure 3: We are able to identify the feasible space of value functions for the POMDP with only four curves using Theorem 4.2, namely the solutions to |q (1)|, |q (2)| = 1, whose solutions are plotted as the red and green dashed curves. The faint blue lines are samples from the infinitely many piecewise linear inequalities needed to specify the feasible space under Theorem 3.9. equation Dq = C. Since the objective o… view at source ↗
Figure 4
Figure 4. Figure 4: Above we sample eight feasible policies in the same POMDP with different choices of state distribution ρ in each plot. Optimal policies, which differ between choices of ρ, are denoted with a star, while the red arrow corresponds to the steepest ascent direction for the objective. The grey dashed lines are level sets of the objective J = c. Here, ρ represents an initial state distribution. governed by funda… view at source ↗
Figure 5
Figure 5. Figure 5: We construct the feasible set of value functions for a 2 state, 2 action, 2 observation MDP in a fully observable and partially observable setting. On the left panel, the calculation of SB generates a system of polynomials which all factor into linear terms. These linear functions are plotted in green dashes. On the right panel, the calculation of SB generates a system of polynomials which do not all facto… view at source ↗
read the original abstract

We study the geometry of feasible value functions in infinite-horizon partially observable Markov decision processes (POMDPs) under memoryless stochastic policies. Our main contribution is a characterization of the feasible set of value functions as a semi-algebraic set, defined by explicit polynomial inequalities determined by the transition dynamics, observation kernel, and reward structure of the POMDP. This result extends prior work for fully observable Markov decision processes, where the feasible set is known to be a polytope, to the substantially more intricate partially observable setting. In contrast to the polyhedral structure arising in MDPs, partial observability induces fundamentally nonlinear constraints, leading to a richer and more complex geometric structure. Our geometric characterization provides new insight into the landscape of policy optimization in both MDPs and POMDPs, and reveals qualitative phenomena unique to partial observability, including the emergence of isolated local maximizers of the long-term reward and their dependence on the initial state distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that the feasible set of value functions achievable by memoryless stochastic policies in infinite-horizon POMDPs is a semi-algebraic set whose defining polynomial inequalities are explicitly determined by the transition matrix, observation kernel, and reward function. This extends the known polytope characterization for fully observable MDPs and is derived by viewing the value map as the projection of a semi-algebraic graph (linear Bellman equations plus simplex constraints) and invoking the Tarski–Seidenberg theorem.

Significance. If correct, the result supplies a precise algebraic-geometric description of the non-convex feasible set in POMDPs, directly explaining the emergence of isolated local maximizers and their dependence on the initial-state distribution. Such a characterization is useful for theoretical analysis of policy optimization landscapes and for designing algorithms that exploit the semi-algebraic structure.

minor comments (2)
  1. [Abstract / §2] The abstract states that the inequalities are 'explicit' but does not exhibit even a low-dimensional example; adding a two-state, two-observation illustration in §2 or §3 would make the construction more concrete without lengthening the paper.
  2. [Introduction] Notation for the value map V_π and the affine dependence of P_π, r_π on the policy parameters is introduced only in the body; a short notational table or paragraph at the end of the introduction would improve readability for readers coming from the MDP literature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, accurate summary of the main result, and recommendation to accept. We are pleased that the referee recognizes the significance of the semi-algebraic characterization for understanding non-convexity and local maxima in POMDP policy optimization.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The central result characterizes the image of the memoryless stochastic policy simplex under the rational map to value functions V_π as a semi-algebraic set. This follows from recognizing that the Bellman fixed-point equations (I − γ P_π)V = r_π together with the simplex constraints on π are polynomial (because P_π and r_π are affine in π), so the graph is semi-algebraic and its projection is semi-algebraic by the Tarski–Seidenberg theorem. No step reduces to a fitted parameter renamed as a prediction, a self-definitional loop, or a load-bearing self-citation whose content is itself unverified; the derivation is self-contained in the algebraic geometry of the POMDP parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on standard facts from real algebraic geometry (semi-algebraic sets) and the definition of value functions under memoryless policies. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The value function under a memoryless stochastic policy is a rational function of the policy parameters and the POMDP matrices.
    Invoked to obtain polynomial inequalities after clearing denominators.
  • domain assumption The feasible set is the image of the policy simplex under the value map.
    Standard definition used to translate policy constraints into value constraints.

pith-pipeline@v0.9.1-grok · 5696 in / 1386 out tokens · 17694 ms · 2026-06-28T09:14:05.263793+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 6 canonical work pages

  1. [1]

    URL https://doi.org/10

    Springer Vienna. URL https://doi.org/10. 1007/978-3-7091-6280-4_3. Axler, S. J.Linear Algebra Done Right. Undergraduate Texts in Mathematics. Springer, New York, 1997. URL http://linear.axler.net/. Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. InInter- national Conference on Machine Learning, 2017. URL https://ap...

  2. [2]

    Chad`es, I., Pascal, L

    URL http://www.pomdp.org/papers/ applications.pdf. Chad`es, I., Pascal, L. V ., Nicol, S., Fletcher, C. S., and Ferrer-Mestres, J. A primer on partially observ- able Markov decision processes (POMDPs).Methods in Ecology and Evolution, 12(11):2058–2072, 2021. URL https://besjournals.onlinelibrary. wiley.com/doi/10.1111/2041-210X.13692. Chen, F., Bai, Y ., ...

  3. [3]

    Cohen, V

    URL https://openreview.net/forum? id=n05upKp02kQ. Cohen, V . and Parmentier, A. Future memories are not needed for large classes of POMDPs.Op- erations Research Letters, 51(3):270–277, 2023. URL https://linkinghub.elsevier.com/ retrieve/pii/S016763772300041X. Crites, R. H. and Barto, A. G. Elevator Group Control Us- ing Multiple Reinforcement Learning Age...

  4. [4]

    URL https://proceedings

    PMLR, 2019. URL https://proceedings. mlr.press/v97/dadashi19a.html. Dressler, M., Garrote-L ´opez, M., Mont ´ufar, G., M¨uller, J., and Rose, K. Algebraic optimiza- tion of sequential decision problems.Journal of Symbolic Computation, 121:102241, 2024. URL https://www.sciencedirect.com/ science/article/pii/S074771712300055X. Drton, M. and Sullivant, S. Al...

  5. [5]

    10 The Value Function Semi-Algebraic Set in POMDPs cc/paper_files/paper/2016/file/ 2387337ba1e0b0249ba90f55b2ba2521-Paper

    URL https://proceedings.neurips. 10 The Value Function Semi-Algebraic Set in POMDPs cc/paper_files/paper/2016/file/ 2387337ba1e0b0249ba90f55b2ba2521-Paper. pdf. Lauri, M., Hsu, D., and Pajarinen, J. Partially Observ- able Markov Decision Processes in Robotics: A Sur- vey.IEEE Transactions on Robotics, 39(1):21–40,

  6. [6]

    Li, Y ., Yin, B., and Xi, H

    URL https://ieeexplore.ieee.org/ document/9899480/. Li, Y ., Yin, B., and Xi, H. Finding optimal mem- oryless policies of POMDPs under the expected average reward criterion.European Journal of Operational Research, 211(3):556–567, 2011. URL https://linkinghub.elsevier.com/ retrieve/pii/S0377221710008805. Liu, Q., Szepesvari, C., and Jin, C. Sample-efficie...

  7. [7]

    Nesterov, Y

    URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ 743459dae9b2c5d2904e5432d5298128-Paper-Conference. pdf. Madani, O., Hanks, S., and Condon, A. On the undecid- ability of probabilistic planning and related stochastic optimization problems.Artificial Intelligence, 147(1-2):5– 34, 2003. URL https://linkinghub.elsevier. com/retrieve/pii/S00043...

  8. [8]

    com/retrieve/pii/0024379589900049

    URL https://linkinghub.elsevier. com/retrieve/pii/0024379589900049. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y ., Lillicrap, T., Hui, F., Sifre, L., Van Den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550(76...

  9. [9]

    org/doi/10.1287/opre.21.5.1071

    URL https://pubsonline.informs. org/doi/10.1287/opre.21.5.1071. Sondik, E. J. The Optimal Control of Partially Ob- servable Markov Processes over the Infinite Horizon: Discounted Costs.Operations Research, 26(2):282– 304, 1978. URL https://pubsonline.informs. org/doi/10.1287/opre.26.2.282. Subramanian, J., Sinha, A., Seraj, R., and Mahajan, A. Ap- proxima...

  10. [10]

    net/book/the-book-2nd.html

    URL http://www.incompleteideas. net/book/the-book-2nd.html. Toro Icarte, R., Klassen, T. Q., Valenzano, R., Castro, M. P., Waldie, E., and McIlraith, S. A. Learning re- ward machines: A study in partially observable rein- forcement learning.Artificial Intelligence, 323:103989,

  11. [11]

    com/retrieve/pii/S0004370223001352

    URL https://linkinghub.elsevier. com/retrieve/pii/S0004370223001352. Wang, K., Kumar, N., Zhou, K., Hooi, B., Feng, J., and Mannor, S. The geometry of robust value functions. 11 The Value Function Semi-Algebraic Set in POMDPs InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pp. ...

  12. [12]

    White, D

    URL https://proceedings.mlr.press/ v162/wang22k.html. White, D. J. A Survey of Applications of Markov Decision Processes.Journal of the Operational Research Society, 44(11):1073–1096, 1993. URL https://www.tandfonline.com/doi/full/ 10.1057/jors.1993.181. Wolsey, L.Integer Programming. Wiley, 1 edition,

  13. [13]

    com/doi/book/10.1002/9781119606475

    URL https://onlinelibrary.wiley. com/doi/book/10.1002/9781119606475. Wu, Y . and De Loera, J. A. Geometric Policy Iteration for Markov Decision Processes. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2070–2078, 2022. URL http:// arxiv.org/abs/2206.05809. arXiv:2206.05809 [cs]. Zhang, J., Zhang, H., Zhou, J., ...

  14. [14]

    X s′∈S −γ α(s, a;s ′)−α(s, a i;s ′) 1{o=s} xs′ − r(s, a)−r(s, a i) 1{o=s} # =xs − r(s, ai) +γ X s′∈S α(s, ai;s ′)xs′ ! − 1 2 X a̸=ai

    URL https://linkinghub.elsevier. com/retrieve/pii/0022247X6590154X. 12 The Value Function Semi-Algebraic Set in POMDPs Appendix Contents This appendix is organized as follows: • Appendix A: Proofs of Results in Section 3 • Appendix B: Reduction to MDPs • Appendix C: Proofs of Results in Section 4 • Appendix D: Semi-algebraic Description of the Solution Se...

  15. [15]

    Draw an initial state distributionρ∼Dirichlet(1 S)

  16. [16]

    standard-normal logit initializations, each with η= 0.005 and T= 3000steps

    Run 50 independent optimization trajectories from i.i.d. standard-normal logit initializations, each with η= 0.005 and T= 3000steps

  17. [17]

    Record the terminal triple(J i, Vi, πi)fori= 1, . . . ,50

  18. [18]

    For a given instance we then compute three aggregate statistics over the 50 restarts: •Value spread— the rangemax i Ji −min i Ji

    Build the fully observable baseline and repeat steps 1–3,with the sameρ, on the baseline instance. For a given instance we then compute three aggregate statistics over the 50 restarts: •Value spread— the rangemax i Ji −min i Ji. •Suboptimal fraction— the share of restartsifor whichmax j Jj −J i >0.01. •Policy spread— the average pairwiseℓ 1 distance 150 2...

  19. [19]

    Samplen inst = 20independent POMDP instances

  20. [20]

    For each instance, samplen ρ = 10independent initial state distributionsρ∼Dirichlet(1 S)

  21. [21]

    standard-normal logit initializations

    For each (instance, ρ) pair, run nstarts = 50 independent optimization trajectories of length T= 3000 at η= 0.005 from i.i.d. standard-normal logit initializations

  22. [22]

    ∥V∥ distance

    Cluster the 50 terminal values {Ji} with tolerance δJ = 0.1. The number of clusters is reported as the estimated number of distinct local optima for that (instance,ρ) pair. Pooling overA,O, instances, andρ, for each state-space sizeSwe then report: • Mean number of optima, averaged over all (instance,ρ) pairs with thatS. • Share with more than one optimum...

  23. [23]

    Increasing k tends to reduce the value spread, at least for the smallest configurations where memory is enough to substantially disambiguate the state

  24. [24]

    Partial observability still induces significant spread even at k= 2 : the spread for k= 2 is consistently larger than the spread in the fully observable setting, which is essentially zero

  25. [25]

    The suboptimal fraction does not cleanly shrink with k: on (8,2,2) and (8,3,2) it is non-monotone in k. This is consistent with the geometric reading: enlarging the policy class adds new feasible value functions and new critical points, which can shift the basin structure in ways that are not simply monotone in optimization-hardness indices. 38 The Value ...