MDPs with a State Sensing Cost

Jayakrishnan Nair; Vansh Kapoor

arxiv: 2505.03280 · v3 · submitted 2025-05-06 · 💻 cs.LG

MDPs with a State Sensing Cost

Vansh Kapoor , Jayakrishnan Nair This is my paper

Pith reviewed 2026-05-22 15:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords MDPsensing coststate observation costsuboptimality boundspolicy improvementdiscounted costblind states

0 comments

The pith

In MDPs with sensing costs, lower bounds on the optimal value function bound the suboptimality gap of any policy and support an efficient improvement algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to handle sequential decisions where checking the current state of the world carries an extra cost. The key move is to rewrite the problem as a standard discounted MDP whose states include both moments when the agent knows the state and moments when it is acting blind. From this larger model the authors obtain lower bounds on the best possible cost, which then give an upper limit on how much worse any chosen policy can be. They also describe a policy-improvement procedure that is fast to run and comes close to the ideal policy on example problems. If these bounds hold, designers of systems that must pay for information can evaluate and improve their decision rules without solving an intractable problem.

Core claim

The central claim is that the sensing-cost problem can be posed as a classical discounted-cost MDP on an expanded countably infinite state space that includes both sensed and blind states. This formulation allows derivation of lower bounds on the optimal value function, which in turn bound the suboptimality gap of any policy. In addition, a computationally efficient policy improvement algorithm called SPI is proposed that performs close to the optimal policy in practice, as demonstrated in a numerical case study.

What carries the argument

The expanded countably infinite state space consisting of sensed states and blind states, which reduces the sensing-cost problem to a standard discounted MDP and enables the lower bounds and policy improvement.

If this is right

Lower bounds on the optimal value function provide a concrete way to quantify the suboptimality gap for any given policy.
The SPI algorithm offers a practical, computationally efficient method to improve policies toward optimality.
Performance close to optimal can be achieved without computing the full optimal policy in the infinite-state MDP.
The approach allows benchmarking against other methods through numerical studies on specific problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar expansions of state space might be useful for other costs associated with information gathering beyond just sensing.
The bounds could potentially guide the development of approximation schemes when the state space is continuous rather than discrete.
Applying the method to real-world systems with communication costs might reveal practical trade-offs between sensing frequency and decision quality.

Load-bearing premise

The sensing-cost dynamics can be exactly captured by modeling the problem as a discounted MDP on an expanded state space that tracks both sensed and blind states.

What would settle it

A counterexample MDP with sensing costs where the derived lower bound is exceeded by the value of some policy, or where the SPI algorithm fails to approach optimal performance in the case study.

read the original abstract

In many practical sequential decision-making problems, tracking the state of the environment incurs a sensing/communication/computation cost. In these settings, the agent's interaction with its environment includes the additional component of deciding when to sense the state, in a manner that balances the value associated with optimal (state-specific) actions and the cost of sensing. We formulate this as an expected discounted cost Markov Decision Process (MDP), wherein the agent incurs an additional cost for sensing its next state, but has the option to take actions while remaining `blind' to the system state. We pose this problem as a classical discounted cost MDP with an expanded (countably infinite) state space. While computing the optimal policy for this MDP is intractable in general, we derive lower bounds on the optimal value function, which allow us to bound the suboptimality gap of any policy. We also propose a computationally efficient algorithm SPI, based on policy improvement, which in practice performs close to the optimal policy. Finally, we benchmark against the state-of-the-art via a numerical case study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper expands the MDP state space with blind states to model sensing costs, derives lower bounds for suboptimality gaps, and gives a simple policy-improvement algorithm that works in their example.

read the letter

The main takeaway is that they recast the sensing decision as a standard discounted MDP on a countably infinite augmented state space that tracks whether the agent is blind or has sensed the state. From there they produce lower bounds on the optimal value function that certify the gap for any policy, plus an SPI algorithm that iteratively improves policies and stays cheap to run. In their numerical case study it lands close to the optimum without solving the full infinite-state problem.

Referee Report

1 major / 1 minor

Summary. The paper formulates MDPs with state sensing costs as an expected discounted-cost MDP on an expanded countably infinite state space that augments the original states with a collection of 'blind' states. It derives lower bounds on the optimal value function to bound the suboptimality gap of any policy, proposes a computationally efficient policy-improvement algorithm (SPI) that performs close to optimal in practice, and provides a numerical case study benchmarking against prior work.

Significance. If the expanded-state-space reformulation is valid and the lower bounds are correctly derived, the results supply both theoretical guarantees on policy suboptimality and a practical algorithm for sensing-cost MDPs. This could be useful in domains such as robotics or networked control where sensing incurs explicit costs. The use of standard MDP theory on the augmented space and the empirical performance of SPI are strengths if the modeling assumptions hold.

major comments (1)

Abstract, paragraph 2: the claim that the problem can be posed as a classical discounted-cost MDP on an expanded countably infinite state space (original states union blind states) is load-bearing for all subsequent results. For general transition dynamics, sequences of blind actions accumulate uncertainty that cannot be represented by a countable collection of blind states while preserving the Markov property without an explicit belief or additional structure (e.g., static state during blind periods). If this construction does not yield a true MDP, the derived lower bounds on the value function and the suboptimality-gap guarantees for arbitrary policies (and for SPI) do not apply to the original sensing-cost dynamics.

minor comments (1)

The abstract states that lower bounds and an algorithm are derived but provides no proof sketch or verification outline; adding a one-sentence indication of the key technical step (e.g., how the lower bound is obtained) would improve readability without lengthening the abstract.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this foundational modeling question. We address the concern directly below.

read point-by-point responses

Referee: Abstract, paragraph 2: the claim that the problem can be posed as a classical discounted-cost MDP on an expanded countably infinite state space (original states union blind states) is load-bearing for all subsequent results. For general transition dynamics, sequences of blind actions accumulate uncertainty that cannot be represented by a countable collection of blind states while preserving the Markov property without an explicit belief or additional structure (e.g., static state during blind periods). If this construction does not yield a true MDP, the derived lower bounds on the value function and the suboptimality-gap guarantees for arbitrary policies (and for SPI) do not apply to the original sensing-cost dynamics.

Authors: We thank the referee for this observation. In the formulation, the augmented state is defined as the pair (last sensed state s, number of consecutive blind steps k), yielding the countable set S ∪ {(s,k) | s∈S, k=1,2,…}. Transitions from a blind state (s,k) remain at the same s with probability 1 when the agent elects to continue blind; a sensing action moves to the true next state drawn from the original dynamics. This construction is Markovian precisely because the underlying state is assumed static while the agent is blind. The assumption is implicit in the problem statement but will be stated explicitly in the revised Section 2, together with a short discussion of the modeling scope (e.g., verification or hold-mode problems). Under this modeling choice the lower bounds, suboptimality guarantees, and SPI algorithm apply directly to the original sensing-cost process. For dynamics in which the state continues to evolve while blind, a belief-state formulation would indeed be required; our results are not claimed for that broader class. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is a direct reformulation followed by independent bounds

full rationale

The paper's central step is an explicit modeling choice that expands the state space to include blind states, after which lower bounds on the value function and the SPI algorithm are derived mathematically from the resulting MDP. No equation or claim reduces a derived quantity to a fitted parameter or to a self-citation that itself depends on the target result. The modeling is presented as a standard reformulation rather than a self-defining construction, and the bounds are obtained by standard MDP analysis techniques applied to the augmented process. This leaves the derivation self-contained against external verification of the Markov property and the bound derivations.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central modeling step treats the sensing-cost problem as an ordinary discounted MDP on an expanded state space; this relies on standard MDP assumptions (known transition probabilities, known costs, fixed discount factor) that are not re-derived in the abstract.

free parameters (1)

sensing cost
Introduced as an exogenous parameter of the problem; its value is chosen by the modeler and not derived from first principles.

axioms (1)

domain assumption The underlying system is a known finite-state MDP whose transitions and costs are given.
Required for the expanded-state construction to be well-defined.

pith-pipeline@v0.9.0 · 5703 in / 1290 out tokens · 37675 ms · 2026-05-22T15:50:09.252214+00:00 · methodology

MDPs with a State Sensing Cost

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)