Recognition: no theorem link
Model-based Bootstrap of Controlled Markov Chains
Pith reviewed 2026-05-13 03:12 UTC · model grok-4.3
The pith
A model-based bootstrap for transition kernels yields asymptotically valid confidence intervals for value and Q-functions in offline RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a model-based bootstrap for transition kernels in finite controlled Markov chains with possibly nonstationary or history-dependent control policies. Distributional consistency of the bootstrap transition estimator is established in both the single long-chain regime and the episodic offline RL regime through a novel bootstrap law of large numbers for visitation counts together with a martingale central limit theorem for bootstrap transition increments. The consistency is extended to offline policy evaluation and optimal policy recovery by the delta method after verifying Hadamard differentiability of the Bellman operators, thereby producing asymptotically valid confidence intervals for value and $Q$-functions.
What carries the argument
The bootstrap transition estimator obtained by resampling visitation counts according to the empirical measure of a controlled Markov chain, together with the martingale central limit theorem applied to its increments.
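To make the construction concrete, here is a minimal sketch in Python of the single long-chain case, assuming the plug-in kernel and an empirical behavior policy are both fit from visitation counts; all names are illustrative stand-ins, not the paper's notation.

```python
import numpy as np

def estimate_model(states, actions, n_states, n_actions):
    """Plug-in estimates from one trajectory: empirical transition kernel
    P_hat[s, a, s'] and empirical behavior policy pi_hat[s, a], both built
    from visitation counts (unvisited pairs fall back to uniform)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in zip(states[:-1], actions, states[1:]):
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2)                       # N_n(s, a)
    P_hat = np.where(visits[..., None] > 0,
                     counts / np.maximum(visits, 1)[..., None],
                     1.0 / n_states)
    state_visits = visits.sum(axis=1, keepdims=True)  # N_n(s)
    pi_hat = np.where(state_visits > 0,
                      visits / np.maximum(state_visits, 1),
                      1.0 / n_actions)
    return P_hat, pi_hat

def bootstrap_kernel(P_hat, pi_hat, n, s0, rng):
    """One model-based bootstrap replicate: simulate a fresh chain of
    length n from the fitted model, then re-estimate the kernel from the
    bootstrap visitation counts."""
    n_states, n_actions, _ = P_hat.shape
    states, actions = [s0], []
    for _ in range(n):
        a = rng.choice(n_actions, p=pi_hat[states[-1]])
        actions.append(a)
        states.append(rng.choice(n_states, p=P_hat[states[-1], a]))
    P_star, _ = estimate_model(states, actions, n_states, n_actions)
    return P_star
```

Repeating `bootstrap_kernel` many times and reading off quantiles of any functional of `P_star` is what produces the percentile intervals discussed below.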
If this is right
- The bootstrap transition estimator is distributionally consistent in both long single trajectories and episodic offline data.
- Asymptotically valid confidence intervals for value functions and Q-functions follow from the delta method once the Bellman operators are shown to be Hadamard differentiable (see the sketch after this list).
- Percentile intervals produced by the procedure achieve closer-to-nominal coverage than plug-in CLT or episodic bootstrap alternatives at small sample sizes.
- The method applies directly to offline policy evaluation and optimal policy recovery when the behavior policy is unknown.
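For the evaluation target, the delta-method step has a standard explicit form; the following is a sketch of a well-known computation, not a quotation from the paper. For a fixed policy $\pi$ with induced kernel $P^\pi$ and reward $r^\pi$,

$$v^\pi = \phi(P) = (I - \gamma P^\pi)^{-1} r^\pi, \qquad \phi'_P(H) = \gamma\,(I - \gamma P^\pi)^{-1} H^\pi v^\pi,$$

so if $\sqrt{n}(\hat{P}^* - \hat{P})$ converges conditionally on the data to a limit $\mathcal{L}$, the bootstrap delta method gives $\sqrt{n}\,(\phi(\hat{P}^*) - \phi(\hat{P})) \rightsquigarrow \phi'_P(\mathcal{L})$, which is the distribution the percentile intervals approximate. The optimal-policy target additionally needs the non-degeneracy caveat raised in the referee report below.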
Where Pith is reading between the lines
- The same resampling construction may be usable for uncertainty quantification in model-based planning algorithms beyond pure evaluation.
- Extensions to approximate dynamic programming or function approximation would require analogous differentiability arguments for the approximate Bellman operators.
- The approach offers a concrete way to import classical bootstrap theory into dependent-data settings that arise in sequential decision problems.
- Calibration checks on additional finite-state benchmarks could reveal the practical sample-size threshold at which the asymptotic guarantees become reliable.
Load-bearing premise
The novel bootstrap law of large numbers for visitation counts and the martingale central limit theorem for bootstrap increments continue to hold for finite controlled Markov chains under possibly nonstationary or history-dependent policies.
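In symbols, the premise amounts to statements of roughly the following shape; this is a paraphrase with notation chosen for illustration, not quoted from the paper. Conditionally on the data, the bootstrap visitation counts satisfy

$$\frac{N_n^*(s,a)}{n} \longrightarrow p_s^{(a)} \quad \text{a.s. for every } (s,a) \in \mathcal{S} \times \mathcal{A},$$

where $p_s^{(a)}$ is the limiting occupation measure of the pair $(s,a)$, and the centered bootstrap transition increments form a martingale difference array to which the CLT applies after normalization by the counts. If the occupation measure degenerates, with some pairs visited at vanishing frequency, both steps are at risk.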
What would settle it
An experiment on RiverSwim or an analogous finite CMC in which the empirical coverage of the bootstrap percentile intervals deviates substantially from the nominal 95 percent level at sample sizes where the asymptotic regime should already be active.
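A calibration check of this kind is mechanically simple; below is a minimal sketch, assuming a simulator and an interval routine are available. `generate_data` and `percentile_ci` are hypothetical stand-ins for a RiverSwim data generator and the paper's bootstrap procedure.

```python
import numpy as np

def empirical_coverage(true_value, generate_data, percentile_ci,
                       n_runs=1000, level=0.95, seed=0):
    """Monte Carlo calibration check: the fraction of independent runs
    whose interval covers a known ground-truth value, together with a
    binomial standard error on that fraction."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_runs):
        data = generate_data(rng)            # one offline dataset
        lo, hi = percentile_ci(data, level)  # e.g. a percentile bootstrap CI
        hits += int(lo <= true_value <= hi)
    coverage = hits / n_runs
    std_err = np.sqrt(coverage * (1 - coverage) / n_runs)
    return coverage, std_err
```

Coverage several standard errors below nominal at sample sizes where the asymptotics should already bite would be the refutation described above; reporting the standard error also answers the minor referee comment below.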
Original abstract
We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) under possibly nonstationary or history-dependent policies. It establishes distributional consistency of the bootstrap estimator in single long-chain and episodic offline RL regimes via a novel bootstrap LLN for visitation counts and martingale CLT for increments. Consistency is extended to offline policy evaluation (OPE) and optimal policy recovery (OPR) using the delta method after verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid CIs for value and Q-functions. Experiments on RiverSwim show the percentile bootstrap CIs achieve near-nominal coverage and outperform episodic bootstrap and plug-in CLT baselines at small samples.
Significance. If the central claims hold, the work offers a practical advance for uncertainty quantification in offline RL with unknown behavior policies by delivering asymptotically valid CIs without requiring knowledge of the data-generating policy. The novel bootstrap LLN and martingale CLT application are clear technical strengths, as is the empirical demonstration of improved calibration over baselines. The delta-method extension, if justified, would broaden applicability to policy optimization targets.
Major comments (2)
- [delta-method extension to OPR (following the Bellman operator differentiability verification)] The extension to OPR via the delta method relies on Hadamard differentiability of the map from the transition kernel P to the optimal value function v*. In finite CMCs, this map fails to be Hadamard differentiable at any P where, for some state, two or more actions attain the same maximal Q-value, because a directional perturbation can switch the argmax and the directional derivative limit does not exist independently of the path. The manuscript asserts verification of the required differentiability but supplies no explicit non-degeneracy condition (unique optimal action or strict Q-value inequality) that would restore it. This is load-bearing for the OPR consistency claim in the stated generality of nonstationary/history-dependent policies. A worked example follows this list.
- [Section 3 (bootstrap consistency theorems)] The novel bootstrap LLN for visitation counts and the martingale CLT for bootstrap increments are central to both the long-chain and episodic consistency results, yet the manuscript provides only high-level statements of the conditions under which they hold for history-dependent policies. Without an explicit assumption list or proof sketch that covers the episodic offline RL regime (where episode lengths are short and policies may be nonstationary), the support for the distributional consistency claims cannot be fully assessed.
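A two-action example makes the first objection concrete; this is our illustration, not the referee's. Take $\phi(\theta) = \max(\theta_1, \theta_2)$ as a stand-in for the state-wise Bellman optimality operator. At a tie $\theta_1 = \theta_2$,

$$\lim_{t \downarrow 0} \frac{\phi(\theta + t h) - \phi(\theta)}{t} = \max(h_1, h_2),$$

which is positively homogeneous but not linear in $h$, so $\phi$ is not Hadamard differentiable there and the standard delta method fails. A strict gap $Q^*(s, a^*) > \max_{a \neq a^*} Q^*(s, a)$ at every state restores a linear derivative and is exactly the kind of non-degeneracy condition the referee asks for.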
Minor comments (2)
- [Experiments section] The RiverSwim experiments report coverage but omit the number of independent runs and standard errors on the empirical coverage probabilities; adding these would strengthen the comparison to baselines.
- [Preliminaries / bootstrap construction] Notation for the resampling scheme in the model-based bootstrap could be clarified with a short pseudocode box or explicit definition of the bootstrap weights.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and outline targeted revisions to strengthen the presentation and rigor.
Point-by-point responses
- Referee: [delta-method extension to OPR (following the Bellman operator differentiability verification)] The extension to OPR via the delta method relies on Hadamard differentiability of the map from the transition kernel P to the optimal value function v*. In finite CMCs, this map fails to be Hadamard differentiable at any P where, for some state, two or more actions attain the same maximal Q-value, because a directional perturbation can switch the argmax and the directional derivative limit does not exist independently of the path. The manuscript asserts verification of the required differentiability but supplies no explicit non-degeneracy condition (unique optimal action or strict Q-value inequality) that would restore it. This is load-bearing for the OPR consistency claim in the stated generality of nonstationary/history-dependent policies.
Authors: We agree that Hadamard differentiability of the map from transition kernel to optimal value function requires a non-degeneracy condition; without it, the directional derivative may fail to exist when argmax switches occur. The manuscript's verification of differentiability for the Bellman operators was performed under the implicit assumption that optimal actions are unique or Q-values are strictly separated at the relevant states. To make this explicit and restore the claimed consistency for OPR, we will revise the relevant theorem statements and discussion to include this standard non-degeneracy condition, which is common in the RL literature for such results. Revision: yes.
- Referee: [Section 3 (bootstrap consistency theorems)] The novel bootstrap LLN for visitation counts and the martingale CLT for bootstrap increments are central to both the long-chain and episodic consistency results, yet the manuscript provides only high-level statements of the conditions under which they hold for history-dependent policies. Without an explicit assumption list or proof sketch that covers the episodic offline RL regime (where episode lengths are short and policies may be nonstationary), the support for the distributional consistency claims cannot be fully assessed.
Authors: The bootstrap LLN for visitation counts and the martingale CLT for increments are formulated to hold for history-dependent policies in both the single long-chain and episodic regimes. We acknowledge that the high-level presentation in Section 3 may limit full assessment of the episodic case with short episodes. We will add an explicit enumerated list of assumptions in Section 3 and include a concise proof sketch that directly addresses the episodic offline RL setting, covering nonstationary policies and finite episode lengths. Revision: yes.
Circularity Check
No circularity; standard probabilistic tools applied to a new setting.
Full rationale
The derivation chain begins with a model-based bootstrap for transition kernels in finite CMCs, establishes distributional consistency via a bootstrap LLN for visitation counts and martingale CLT for increments (both presented as novel applications rather than reductions), and extends to OPE/OPR targets using the delta method after claiming verification of Hadamard differentiability of Bellman operators. No quoted step reduces by construction to a fitted input, self-definition, or self-citation chain; the differentiability verification is asserted as an independent technical check rather than an ansatz or imported uniqueness result. The work remains self-contained against external benchmarks (bootstrap theory, martingale CLT, delta method) without renaming known results or smuggling assumptions via prior self-citations.
Axiom & Free-Parameter Ledger
Axioms (3)
- Domain assumption: The controlled Markov chain has finite state and action spaces.
- Ad hoc to paper: The bootstrap law of large numbers holds for visitation counts under the resampling scheme.
- Domain assumption: Hadamard differentiability of the Bellman operators holds.