pith. sign in

arxiv: 2606.21726 · v1 · pith:4SX4N5ZHnew · submitted 2026-06-19 · 💻 cs.AI

Entropy Objectives in Markov Decision Processes

Pith reviewed 2026-06-26 13:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords entropy objectivesMarkov decision processesstrategy synthesisconvex dualityinvariant synthesisconcentration propertiesverificationstochastic control
0
0 comments X

The pith

Entropy objectives in MDPs reduce to convex programs for strategy synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the task of synthesizing strategies in Markov decision processes to enforce concentration properties on state distributions via entropy-based objectives. It establishes that the problem remains computationally hard even in relaxed versions. The core contribution is a method that reduces the non-linear objectives to convex programs by combining convex duality with invariant synthesis, yielding a sound and conditionally relatively complete procedure for both verification and synthesis. The work further analyzes when memory or randomization is required and evaluates the approach on benchmarks.

Core claim

We present a sound and conditionally relatively complete method to verify and synthesize strategies for entropy objectives in MDPs. The main technical step exploits convex duality and invariant synthesis to reduce the non-linear objectives to convex programs while preserving the required guarantees for the class of MDPs considered.

What carries the argument

Reduction of non-linear entropy objectives to convex programs via convex duality combined with invariant synthesis

Load-bearing premise

The non-linear entropy objectives admit a reduction to convex programs and invariant synthesis that preserves soundness and conditional completeness for the class of MDPs considered.

What would settle it

An MDP instance in which a strategy satisfying the entropy objective exists but the synthesis procedure either fails to produce a strategy or produces one that violates the objective.

Figures

Figures reproduced from arXiv: 2606.21726 by Aditya Neeraje, Piyush Srivastava, Raghav Goyal, S. Akshay.

Figure 1
Figure 1. Figure 1: MDP 𝑀1 taken from [ACMŽ23] In the above example, it was easy to identify a strategy that minimizes entropy (increases concen￾tration) even if the value of the strategy required some work to compute or bound; but this is not always the case. For instance, in the following, it is a priori unclear what strategy minimizes the entropy or ensures that it is below a certain threshold. Example 4. Consider the 4-st… view at source ↗
Figure 2
Figure 2. Figure 2: MDP 𝑀2 In this example, starting from 𝜇0, with 𝐾 = 0, suppose 𝛾 = 1.092, what would be the strategy that ensures that the Low-Entropy-Decision problem answers YES? With some computation, one can check that a (randomized) strategy that selects action 𝑎 with probability 7/10 indeed ensures this, even though none of the two deterministic strategies do. In the rest of the paper, our goal is to address the abov… view at source ↗
Figure 3
Figure 3. Figure 3: MDP 𝑀3 We then plot the entropy at times 𝑡 = 1, 2, 3 as a function of 𝑥. The corresponding entropies are given below. After 1 step: − (0.9 − 0.6𝑥) ln ( (0.9 − 0.6𝑥)) − (0.1 (1 − 𝑥)) ln ( (0.05 (1 − 𝑥))) − 0.7𝑥 ln (0.7𝑥) . After 2 steps: − 2 (0.9 − 0.6𝑥) 2 ln (0.9 − 0.6𝑥) − [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Entropy functions in the proof of Prop. 15 the parity of the current time using just one bit of memory!) of alternating between choosing 𝑎 and 𝑏 obtains a maximum entropy of at most 0.73. Thus this example illustrates that memoryless strategies are not enough to reach an optimal value. □ This leads us to the question of what memory is needed for ensuring entropy objectives. An ap￾proach towards this is pro… view at source ↗
Figure 5
Figure 5. Figure 5: Example exhibiting the need for randomized strategies [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
read the original abstract

We consider the problem of synthesizing control policies that enforce a concentration property on the state distributions of a stochastic system. We present a formalization of this problem in terms of synthesizing strategies for maintaining an entropy-based objective in Markov Decision Processes (MDPs). We first show that even relaxed versions of this problem are complexity-theoretically hard. We then present a sound and (conditionally) relatively complete method to verify and synthesize strategies for such entropy objectives. The main challenge is the non-linear nature of such objectives, and our approach addresses this by exploiting and combining ideas from convex duality and invariant synthesis. We also investigate the role of memory and randomization in ensuring entropy objectives. Finally, we implement our ideas to evaluate our approach empirically on a few illustrative benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper formalizes synthesizing strategies for entropy-based objectives in MDPs to enforce concentration properties on state distributions in stochastic systems. It establishes complexity-theoretic hardness for relaxed versions of the problem, then proposes a sound and conditionally relatively complete verification/synthesis method that reduces the non-linear objectives to convex programs via convex duality combined with invariant synthesis. The work also examines the roles of memory and randomization and reports empirical results on illustrative benchmarks.

Significance. If the reduction to convex programs is correct and preserves the claimed properties, the result would be a useful contribution to verification of MDPs with non-linear objectives, extending convex-optimization techniques in a targeted way. The hardness result and the explicit treatment of memory/randomization are clear strengths; the empirical evaluation on benchmarks provides initial evidence of practicality.

major comments (1)
  1. [§4] §4 (main theorem on soundness/conditional completeness): the reduction from non-linear entropy objectives to convex programs is central to the claim, yet the manuscript does not supply an explicit counter-example or boundary case showing where the invariant-synthesis step fails to preserve conditional completeness; a concrete example MDP where the method returns 'unknown' would strengthen the conditional-completeness statement.
minor comments (2)
  1. [Preliminaries] Preliminaries section: the definition of the entropy objective (Eq. 3) uses a non-standard normalization; adding a short remark comparing it to Shannon entropy would improve readability for readers outside the immediate sub-area.
  2. [Empirical evaluation] Empirical evaluation: Table 1 reports runtimes but omits the number of states/actions in each benchmark MDP; including these sizes would allow readers to assess scalability claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the constructive comment. We address the major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (main theorem on soundness/conditional completeness): the reduction from non-linear entropy objectives to convex programs is central to the claim, yet the manuscript does not supply an explicit counter-example or boundary case showing where the invariant-synthesis step fails to preserve conditional completeness; a concrete example MDP where the method returns 'unknown' would strengthen the conditional-completeness statement.

    Authors: We thank the referee for this observation. The conditional completeness of our method (Theorem 4) is established under the assumption that the invariant synthesis procedure succeeds in discovering a suitable invariant when one exists. While the formal statement already delineates this boundary, we agree that an explicit example MDP illustrating a case in which the method returns 'unknown' would improve reader intuition regarding the practical limits of the approach. In the revised manuscript we will add a small, self-contained MDP example in Section 4 demonstrating such a boundary case, without altering the main theorems or proofs. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim is a sound and conditionally relatively complete verification/synthesis method for entropy objectives in MDPs, obtained by reducing the non-linear problem to convex programs combined with invariant synthesis via convex duality. This reduction relies on standard external techniques (convex duality, invariant synthesis) rather than self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract and description contain no equations or steps that reduce by construction to their own inputs; the hardness result and empirical evaluation are presented as separate from the derivation. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities can be extracted from the text.

pith-pipeline@v0.9.1-grok · 5652 in / 956 out tokens · 22666 ms · 2026-06-26T13:57:20.180798+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 3 canonical work pages

  1. [1]

    DOI: 10.1145/3209108.3209185

    ACM, 2018. DOI: 10.1145/3209108.3209185. [AMTZ22] M. Anand, V. Murali, A. Trivedi, and M. Zamani.𝑘-inductive barrier certificates for stochas- tic systems. InProc. 25th ACM International Conference on Hybrid Systems: Computation and Control (HSCC), pages 12:1–12:11. ACM, 2022. DOI: 10.1145/3501710.3519532. [BCJ+23] K. Batz, M. Chen, S. Junges, B. L. Kamin...

  2. [2]

    [GLS93] M

    DOI: 10.1109/TAC.2023.3341282. [GLS93] M. Grötschel, L. Lovász, and A. Schrijver.Geometric Algorithms and Combinatorial Opti- mization, volume 2 ofAlgorithms and Combinatorics. Springer Berlin Heidelberg, Berlin, Heidelberg, second corrected edition, 1993. DOI: 10.1007/978-3-642-78240-4. [GM15] M. Gario and A. Micheli. PySMT: a solver-agnostic library for...

  3. [3]

    [Vaz03] V

    DOI: 10.1145/380752.380839. [Vaz03] V. V. Vazirani.Approximation Algorithms. Springer, 2003. DOI: 10.1007/978-3-662-04565-7. [Vis21] N. K. Vishnoi.Algorithms for Convex Optimization. Cambridge University Press, 2021. DOI: 10.1017/9781108699211. [Wil97] A. J. Wilkie. Schanuel’s Conjecture and the Decidability of the Real Exponential Field. In B. T. Hart, A...

  4. [4]

    best/optimal

    In Russian. English translation:Matekon13 (3) (1977) 25–45. 24 A A strong duality result for entropy In this section, we provide the proof of Proposition 10. As mentioned in Section 5, we need a strong duality result which is usually stated under the additional hypothesis that the feasible point𝑥 0 has all its entriesstrictlypositive (this is the version ...

  5. [5]

    In this example, it is optimal to choose action𝑎 1 always

    MDP𝑀 1 with 3 states, taken from running example of [ACMŽ23]) 𝐴 𝐵 𝐶 𝑎2,1 𝑎1,1 1 1/2 1/2 26 At the start, in𝜇 0, state A has probability 1 2, state B has probability 1 3, state C has probability 1 6. In this example, it is optimal to choose action𝑎 1 always. As the initial probability mass at𝐴 exceeds 1 𝑒 , any additional probability mass will only decreas...

  6. [6]

    MDP𝑀 2 with 4 states: 3 1 2 0 𝑎,1 𝑏,1 4/5 1/5 1/2 1/2 1/3 2/3 The initial distribution𝜇 0 is (0.2, 0.4, 0, 0.4). •𝐾=0Starting from𝜇 0: The answer given by our Gurobi based solver (in a specific run) with an invariant set of size two is𝑒 𝐻(𝜇) =3.7568, with action𝑎being chosen with probability about 0.4. With a brute force computation, we can check that the...

  7. [7]

    This is the same example as the one used in Proposition 15 above

    A five-state MDP𝑀 3 with two bottom strongly connected component and a transient state. This is the same example as the one used in Proposition 15 above. 𝑎,0.71 1 0.5 0.5 0.5 0.5 𝑏,0.05 𝑏,0.05 𝑏,0.9 𝑎,0.3 At the start, i.e., in𝜇 0, the entire probability mass is in the top state. We can parameterize all memoryless strategies in terms of a single parameter...

  8. [8]

    •𝐾=2Starting from𝑀(𝜇 0,2): Our Gurobi based solver achieves a value of𝑒 𝐻(𝜇) of roughly 2.0, even with an invariant of size one

    MDP𝑀 4 with 5 states and 2 actions : A B C D E 𝑎,1 𝑏,1 28 At the start, with𝜇 0 as state A has probability 9 10 and state E has probability 1 10. •𝐾=2Starting from𝑀(𝜇 0,2): Our Gurobi based solver achieves a value of𝑒 𝐻(𝜇) of roughly 2.0, even with an invariant of size one.. The optimal memoryless policy, in comparison, yields an answer of𝑒𝐻(𝜇) ≊1.35506, ...

  9. [9]

    With an invariant of size one, our Gurobi based tool reports an answer of𝑒 𝐻(𝜇) =2.9543, choosing action𝑎with probability one

    MDP𝑀 5 with 3 states, with warm-up parameter𝐾=0: 𝐴 𝐵 𝐶 𝑎,1/2 𝑎,1/4 𝑎,1/4 𝑏,1/4 𝑏,1/4 𝑏,1/2 1/31/3 1/3 2/5 2/5 1/5 Initially, all the probability mass is in A, i.e.,𝜇 0 =(1,0,0). With an invariant of size one, our Gurobi based tool reports an answer of𝑒 𝐻(𝜇) =2.9543, choosing action𝑎with probability one. By a brute force computation we check that this is i...

  10. [10]

    Split (MDP with 4 states and 2 disconnected components taken from [ACMŽ23]): 𝐴 𝐵 𝐶 𝐷 𝑏,1 𝑎,0.9 𝑎,0.1 11 0.5 0.5 29 At the start, i.e., at𝜇0 =0, state A has probability 1 3 and state C has probability 2

  11. [11]

    Our Gurobi based solved reports an answer of about𝑒 𝐻(𝜇) =3.77976, with an invariant set of size one (𝐴+𝐵≤ 1 3), and the strategy of always choosing action𝑏

    We consider the case where the warm-up parameter satisfies𝐾=0. Our Gurobi based solved reports an answer of about𝑒 𝐻(𝜇) =3.77976, with an invariant set of size one (𝐴+𝐵≤ 1 3), and the strategy of always choosing action𝑏. On simulating the MDP with this strategy, we find that the maximum entropy attained is at mostlog 3. However, under the restriction to c...

  12. [12]

    Note that all transition probabilities that are not labelled on edges are 1

    Gridworld 1 (MDP with 8 states): This is a grid of size (2,4) with the point (0,1) being unreach- able, as shown in the figure with (0,0) being top-left and (1,3) being bottom-right. Note that all transition probabilities that are not labelled on edges are 1. Also there is a non-deterministic action at state (0,2), where one can go down or right, and a se...

  13. [13]

    At any other state, one of up to two actions - Down or Right - can be chosen, provided the action does not force one off the board or into an unreachable state

    Gridworld 2 (MDP with 12 states): We have a grid of size (3,4) with the unreachable states as (1, 1), (2, 1) and (1, 3) (with(0,0)being top left and(2,3)the bottom right). At any other state, one of up to two actions - Down or Right - can be chosen, provided the action does not force one off the board or into an unreachable state. All transitions that are...

  14. [14]

    The initial distribution𝜇 0 is top-left state (0,0) having probability 1

    From any reachable state except (2, 3) which does not have transitions marked in the figure, in the actual model there is a transition from that state to all other reachable states with uniform probability. The initial distribution𝜇 0 is top-left state (0,0) having probability 1. 30 Note again that while there is some randomness in this example, it is sim...

  15. [15]

    •𝐾=0Starting from𝜇 0: Our Gurobi based solver reports an answer of𝑒 𝐻(𝜇) =2.000, with an invariant set of size one

    Markov chain𝑀𝐶1with two states BA 10.50.5 At the start, i.e., in𝜇 0, state A has probability 2 3 and state B has probability 1 3. •𝐾=0Starting from𝜇 0: Our Gurobi based solver reports an answer of𝑒 𝐻(𝜇) =2.000, with an invariant set of size one. By simulating a run of the MC, the maximum value of 𝑒𝐻 (𝜇)over the run is 1.88988. However, since the distribut...

  16. [16]

    •𝐾=1,2, starting from𝑀(𝜇 0,1)or𝑀(𝜇 0,2): Our Gurobi based solver outputs the correct optimal value of𝑒 𝐻(𝜇) =1.93782, with an invariant set of size two

    Markov chain𝑀𝐶2with two states: A B 1 0.5 0.5 At the start, i.e., in𝜇 0, both A and B have 1 2 of the probability each. •𝐾=1,2, starting from𝑀(𝜇 0,1)or𝑀(𝜇 0,2): Our Gurobi based solver outputs the correct optimal value of𝑒 𝐻(𝜇) =1.93782, with an invariant set of size two. •𝐾=3, starting from𝑀(𝜇 0,3): Our Gurobi based solver outputs the correct optimal val...

  17. [17]

    Starting from𝜇 0 (𝐾=0), the Gurobi based solver reports the optimal answer of𝑒 𝐻(𝜇) =3

    Markov chain MC3, an ergodic Markov chain with three states: 31 𝐴 𝐵 𝐶 0.5 0.25 0.25 0.5 0.25 0.25 0.25 0.25 0.5 Initially, the probability distribution𝜇 0 is (1, 0, 0). Starting from𝜇 0 (𝐾=0), the Gurobi based solver reports the optimal answer of𝑒 𝐻(𝜇) =3

  18. [18]

    body compartments

    Insulin: Pharmokinetics benchmark taken originally from [CKV+11], but as modified in [AAGT15, Example 2], which we describe briefly. This Markov chain has five non-dummy nodes where three of the nodes correspond to the “body compartments” Plasma (Pl), Intersticial Fluid (IF), and “Utilization and degradation” (Ut). The node (Dr) models the drug being inje...

  19. [19]

    The optimal answer, obtained by simulating the Markov chain, is𝑒 𝐻(𝜇) =2.3717679

    For this problem, starting from𝜇 0, the answer reported by our Gurobi based solver seems to depend quite significantly on the seed, with one particular run reporting the value𝑒 𝐻(𝜇) =2.72, obtained with an invariant of size five (for this example, it also helps to run the solver without the pre-imposed bounds on the magnitudes of the template variables). ...

  20. [20]

    Pagerank (A 5-state Markov chain taken from [AAGT15][Example 1, Fig 3.]). This chains has five states, with the transition matrix given by  1 80 19 60 3 40 19 60 67 2401 80 1 20 41 120 19 60 67 2401 16 1 4 3 8 1 4 1 161 80 1 20 7 8 1 20 1 8033 80 9 20 3 40 1 20 1 80  , with the rows and column corresponding to states A, B, C, D, and E ...