arxiv: 2604.05185 · v1 · submitted 2026-04-06 · cs.LG · cs.SY · eess.SY

Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

Nishanth Venkatesh, Andreas A. Malikopoulos

Pith reviewed 2026-05-10 19:22 UTC · model grok-4.3

classification cs.LG · cs.SY · eess.SY

keywords cross-fitting · bridge functions · model-based reinforcement learning · POMDPs · offline RL · conditional moment restrictions · proximal learning · hidden confounding

The pith

K-fold cross-fitting of the two-stage bridge estimator enables more efficient use of data while preserving identification of bridge functions in confounded POMDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a K-fold cross-fitted version of the existing two-stage estimator for learning bridge functions that satisfy conditional moment restrictions. These functions identify reward-emission and observation-transition models in offline model-based reinforcement learning under hidden confounding and partial observability. A sympathetic reader would care because direct estimation from observational data is biased in such settings, and the cross-fitting step reduces data waste compared with a single sample split while still supporting planning via simulated rollouts. The work supplies an oracle-comparator error bound that separates the contribution of nuisance estimation from the final averaging step.

Core claim

We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a K-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.

What carries the argument

The K-fold cross-fitted two-stage bridge estimator, which alternates nuisance estimation on K-1 folds with empirical solution of the conditional moment restrictions on the held-out fold to identify the reward and transition bridge functions.
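
The alternation described above can be sketched in code. This is a minimal illustration, not the authors' implementation. It assumes a simplified restriction E[Y - b(W) | C] = 0 with scalar variables, uses Gaussian kernels, and handles only the conditional mean embedding nuisance (the conditional density nuisance is omitted). All names (`rbf`, `cross_fitted_bridge`, `lam1`, `lam2`, `gamma`) are hypothetical, and the Stage II solve is a standard kernel two-stage closed form rather than anything confirmed by the source.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian kernel matrix between 1-D sample arrays a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def cross_fitted_bridge(C, W, Y, K=5, lam1=1e-2, lam2=1e-2, gamma=0.5, seed=0):
    """Toy K-fold cross-fitted two-stage kernel estimator (illustrative only).

    Assumed restriction: E[Y - b(W) | C] = 0.
    Stage I  (K-1 folds): kernel ridge weights for the conditional mean
                          embedding of W given C.
    Stage II (held-out):  ridge solve for b using the embedded features.
    Returns a function that averages the K fold-wise estimates of b.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)
    fits = []
    for k in range(K):
        held = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        n1, n2 = len(train), len(held)
        # Stage I: column i of Gamma holds embedding weights for held-out point i.
        K_cc = rbf(C[train], C[train], gamma)
        Gamma = np.linalg.solve(K_cc + n1 * lam1 * np.eye(n1),
                                rbf(C[train], C[held], gamma))
        # Stage II: column i of M estimates E[k(W_train, W) | C = C_held[i]].
        K_ww = rbf(W[train], W[train], gamma)
        M = K_ww @ Gamma
        # Small jitter keeps the linear system numerically well conditioned.
        A = M @ M.T + n2 * lam2 * K_ww + 1e-6 * np.eye(n1)
        alpha = np.linalg.solve(A, M @ Y[held])
        fits.append((W[train], alpha))

    def b_hat(w):
        w = np.atleast_1d(np.asarray(w, dtype=float))
        return np.mean([rbf(w, Wk, gamma) @ ak for Wk, ak in fits], axis=0)

    return b_hat
```

Calling `cross_fitted_bridge(C, W, Y)` on three equal-length 1-D arrays returns a callable that evaluates the averaged estimate on new points. Every sample serves once as Stage II data and K-1 times as Stage I data, which is the sense in which cross-fitting avoids discarding half the sample.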

If this is right

The cross-fitted estimator uses data more efficiently than a single sample split while keeping the same bridge-based identification.
The total error decomposes into a term driven by how well the nuisances are estimated and a term driven by the empirical averaging over the held-out folds.
Oracle-comparator bounds continue to hold for the K-fold procedure under the same rate conditions that apply to the original two-stage estimator.
Policy evaluation and planning remain valid because the identification strategy via conditional moment restrictions is unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could select the number of folds K to trade off bias from smaller Stage I training sets (small K) against variance from fewer held-out samples per fold (large K) for a given dataset size.
The same cross-fitting template may apply to other conditional-moment problems that arise in offline reinforcement learning with latent confounders.
Empirical tests on standard POMDP benchmarks with injected confounding could quantify how much the efficiency gain improves downstream planning performance.
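
The trade-off in the choice of K comes down to simple arithmetic on fold sizes. The sketch below assumes an even partition of n samples into K folds; the helper name `fold_sizes` is hypothetical and the numbers are illustrative, not taken from the paper.

```python
def fold_sizes(n, K):
    """Per-fold (Stage I training size, Stage II held-out size) for n samples.

    Folds differ in size by at most one, as in an even random partition.
    """
    if not 2 <= K <= n:
        raise ValueError("need 2 <= K <= n")
    base, extra = divmod(n, K)
    held = [base + (1 if i < extra else 0) for i in range(K)]
    return [(n - h, h) for h in held]

for K in (2, 5, 10):
    train, held = fold_sizes(1000, K)[0]
    print(f"K={K}: Stage I uses {train} samples, Stage II uses {held} per fold")
```

With n = 1000, moving from K = 2 to K = 10 raises the Stage I training size from 500 to 900 while shrinking each held-out fold from 500 to 100. Across all folds, every sample is still used exactly once in Stage II, whereas a single half-and-half split uses only 500 samples there.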

Load-bearing premise

The bridge functions satisfy the conditional moment restrictions and the nuisance estimators for the conditional mean embedding and conditional density converge at rates sufficient for the oracle error bound to hold.
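
To make the premise concrete, a schematic version of such a restriction can be written as follows. The notation is partly assumed: the source mentions only a bridge function b ∈ B and the conditional mean embedding μ_{W|C}. The outcome symbol Y, the feature map φ, and the RKHS H_W are placeholders, and the paper's actual restrictions, which also involve a conditional density, may differ.

```latex
\mathbb{E}\left[\, Y - b(W) \mid C \,\right] = 0,
\qquad
\mathbb{E}\left[\, b(W) \mid C = c \,\right]
  = \left\langle b,\ \mu_{W \mid C}(c) \right\rangle_{\mathcal{H}_W},
\qquad
\mu_{W \mid C}(c) = \mathbb{E}\left[\, \phi(W) \mid C = c \,\right].
```

The second identity is the reproducing property applied to the embedding. For b in H_W, it is what lets a Stage I estimate of μ_{W|C} stand in for the conditional expectation when the restriction is solved in Stage II.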

What would settle it

In a controlled simulation of a confounded POMDP, the cross-fitted estimator fails to achieve lower error than a single-split version once nuisance estimation error is held fixed, or the observed error does not decompose as predicted into separate Stage I and Stage II terms.
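
A simulation of this kind can be prototyped on a much simpler problem before moving to a confounded POMDP. The sketch below is a toy under stated assumptions: a linear model with one hidden confounder U, a linear Stage I fit of W on C, and a Stage II regression of Y on the fitted values. It does not hold nuisance error fixed and does not come from the paper; the function names and parameter values are hypothetical.

```python
import numpy as np

def two_stage_slope(C1, W1, C2, Y2):
    """Stage I on (C1, W1): linear fit of W on C.
    Stage II on (C2, Y2): regress Y on the fitted values of W."""
    slope, intercept = np.polyfit(C1, W1, 1)
    W_fit = intercept + slope * C2
    return np.polyfit(W_fit, Y2, 1)[0]

def simulate(n=200, K=5, reps=300, theta=1.0, seed=0):
    """Monte Carlo MSE of single-split vs K-fold cross-fitted slope estimates."""
    rng = np.random.default_rng(seed)
    err_single, err_cf = [], []
    for _ in range(reps):
        C = rng.normal(size=n)
        U = rng.normal(size=n)  # hidden confounder, affects both W and Y
        W = C + U + 0.1 * rng.normal(size=n)
        Y = theta * W + U + 0.5 * rng.normal(size=n)
        idx = rng.permutation(n)
        # Single split: first half for Stage I, second half for Stage II.
        a, b = idx[: n // 2], idx[n // 2:]
        err_single.append(two_stage_slope(C[a], W[a], C[b], Y[b]) - theta)
        # Cross-fitting: each fold is held out once; fold-wise slopes are averaged.
        folds = np.array_split(idx, K)
        est = []
        for k in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            est.append(two_stage_slope(C[tr], W[tr], C[folds[k]], Y[folds[k]]))
        err_cf.append(np.mean(est) - theta)
    mse_single = float(np.mean(np.square(err_single)))
    mse_cf = float(np.mean(np.square(err_cf)))
    return mse_single, mse_cf
```

Running `simulate()` returns the two mean squared errors for comparison. A test closer to the paper's claim would replace the linear stages with the kernel nuisance estimators and report the Stage I and Stage II contributions separately.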

Original abstract

Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a $K$-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

K-fold cross-fitting of the two-stage bridge estimator gives a practical efficiency gain for confounded POMDPs, but the oracle bound still hinges on unspecified nuisance rates.

The main takeaway is that the authors take an existing two-stage estimator for bridge functions under conditional moment restrictions and add K-fold cross-fitting. This lets them use the full sample more efficiently than a single split while preserving the original identification for reward-emission and observation-transition bridges in confounded POMDPs. They also supply an oracle-comparator bound that splits the error into a Stage I nuisance term and a Stage II averaging term.

Referee Report

2 major / 1 minor

Summary. The paper proposes a K-fold cross-fitted extension of the two-stage bridge estimator for learning reward-emission and observation-transition bridge functions that satisfy conditional moment restrictions in confounded POMDPs. It claims this procedure preserves the original bridge-based identification while using data more efficiently than a single split, and derives an oracle-comparator bound that decomposes estimation error into a Stage I term from nuisance estimation (conditional mean embedding and conditional density) and a Stage II term from empirical averaging.

Significance. If the oracle-comparator bound holds under suitable conditions, the work would provide a data-efficient method for policy evaluation in offline model-based RL with hidden confounding and partial observability. The error decomposition into nuisance and averaging components is a useful structural contribution for analyzing two-stage estimators in this setting.

major comments (2)

[oracle-comparator bound derivation] The oracle-comparator bound and its decomposition into Stage I (nuisance) and Stage II (averaging) terms, as described in the abstract, are load-bearing for the central claim. However, the manuscript supplies no explicit convergence rate requirements (e.g., faster than n^{-1/4}), function-class assumptions, eigenvalue conditions, or regularity conditions on the POMDP data-generating process under which the nuisance estimators make the Stage I term negligible. Without these, the bound's validity and non-vacuousness cannot be verified.
[cross-fitting procedure] The claim that the K-fold cross-fitted procedure 'preserves the original bridge-based identification strategy' while extending the two-stage estimator is central, yet the interaction between cross-fitting and the conditional moment restrictions is not shown to be free of additional bias terms. A concrete lemma or verification establishing equivalence (or controlled difference) to the single-split case is needed.

minor comments (1)

[Abstract] The abstract states that the procedure 'uses the available data more efficiently than a single sample split' but provides no quantitative comparison (e.g., variance reduction factor or asymptotic efficiency gain) to support this.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We are pleased that the referee recognizes the potential significance of our cross-fitted proximal bridge estimator. We address each major comment below and have revised the manuscript accordingly.

Referee: [oracle-comparator bound derivation] The oracle-comparator bound and its decomposition into Stage I (nuisance) and Stage II (averaging) terms, as described in the abstract, are load-bearing for the central claim. However, the manuscript supplies no explicit convergence rate requirements (e.g., faster than n^{-1/4}), function-class assumptions, eigenvalue conditions, or regularity conditions on the POMDP data-generating process under which the nuisance estimators make the Stage I term negligible. Without these, the bound's validity and non-vacuousness cannot be verified.

Authors: We agree that explicit conditions are needed for the bound to be non-vacuous. In the revised manuscript we have added Section 4.3, which states the full set of assumptions: nuisance estimators converge at o_p(n^{-1/4}), the relevant Gram matrices satisfy a uniform eigenvalue lower bound, the POMDP reward and transition functions are Lipschitz and bounded, and the function classes admit polynomial covering numbers. Under these conditions we prove that the Stage I term is o_p(n^{-1/2}) and therefore negligible relative to the Stage II term. A new theorem collects all assumptions and states the resulting oracle-comparator bound. revision: yes
Referee: [cross-fitting procedure] The claim that the K-fold cross-fitted procedure 'preserves the original bridge-based identification strategy' while extending the two-stage estimator is central, yet the interaction between cross-fitting and the conditional moment restrictions is not shown to be free of additional bias terms. A concrete lemma or verification establishing equivalence (or controlled difference) to the single-split case is needed.

Authors: We accept that a formal verification is required. We have inserted Lemma 3.2, which shows that the K-fold cross-fitted estimator satisfies exactly the same conditional moment restrictions as the single-split estimator. The proof proceeds by applying the law of large numbers to the independent folds and bounding the difference between the cross-fitted and single-split estimating equations by an O_p(K^{-1/2}) term that vanishes with K. Consequently, the identification strategy is preserved asymptotically and the finite-sample bias introduced by cross-fitting is explicitly controlled. revision: yes
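
The rate arithmetic in the first response can be illustrated with a standard product-rate argument from the double machine learning literature. This is a hedged sketch: the symbols \hat\eta_1 and \hat\eta_2 for the two nuisance estimators, and the assumption that the Stage I term is bounded by a product of their errors, are illustrative and not taken from the manuscript.

```latex
\|\hat\eta_1 - \eta_1\| = o_p\!\left(n^{-1/4}\right),\quad
\|\hat\eta_2 - \eta_2\| = o_p\!\left(n^{-1/4}\right)
\;\Longrightarrow\;
\|\hat\eta_1 - \eta_1\| \cdot \|\hat\eta_2 - \eta_2\|
  = o_p\!\left(n^{-1/4}\right)\, o_p\!\left(n^{-1/4}\right)
  = o_p\!\left(n^{-1/2}\right).
```

If the Stage I term is controlled by such a product, it is of smaller order than a Stage II averaging term of order O_p(n^{-1/2}), which is the sense in which it would be negligible.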

Circularity Check

0 steps flagged

No significant circularity; oracle bound relies on external nuisance rates

The paper extends a prior two-stage bridge estimator with K-fold cross-fitting for estimating bridge functions satisfying CMRs in POMDPs, then derives an oracle-comparator error bound decomposed into Stage I (nuisance estimation error from conditional mean embedding and conditional density) and Stage II (empirical averaging error). This bound is stated relative to external convergence rates on the nuisance estimators rather than reducing to any fitted quantity or definition internal to the cross-fitting procedure itself. Identification via bridge functions is referenced from prior work but the new statistical procedure and bound are developed independently without self-definitional loops, fitted-input predictions, or load-bearing self-citations. The derivation chain is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of bridge functions satisfying the CMRs, the availability of consistent nuisance estimators for the conditional mean embedding and conditional density, and standard regularity conditions for the oracle bound to hold. No new entities are postulated.

axioms (2)

domain assumption: Bridge functions exist and satisfy the conditional moment restrictions with the given nuisance objects.
Invoked when formulating bridge learning as a CMR problem; this is the identification foundation carried over from prior work.
domain assumption: Nuisance estimators for the conditional mean embedding and conditional density achieve rates sufficient for the oracle bound.
Required for the Stage I term in the error decomposition to vanish at the claimed rate.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a K-fold cross-fitted extension of the existing two-stage bridge estimator... derive an oracle-comparator bound... decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

The bridge equation in (9) involves a conditional expectation operator applied to the bridge function b ∈ B. We use an RKHS representation... conditional mean embedding μ_{W|C}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
