Recognition: 2 theorem links · Lean Theorem
Cross-fitted Proximal Learning for Model-Based Reinforcement Learning
Pith reviewed 2026-05-10 19:22 UTC · model grok-4.3
The pith
K-fold cross-fitting of the two-stage bridge estimator enables more efficient use of data while preserving identification of bridge functions in confounded POMDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a K-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
What carries the argument
The K-fold cross-fitted two-stage bridge estimator, which alternates nuisance estimation on K-1 folds with empirical solution of the conditional moment restrictions on the held-out fold to identify the reward and transition bridge functions.
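A minimal sketch of this alternation, assuming scalar variables, a fixed RBF feature map in place of a full RKHS, ridge regression for both stages, and simple averaging of the fold-specific solutions. None of these choices (nor the names `rbf_features`, `ridge_fit`, `cross_fitted_bridge`) come from the paper, and the conditional-density nuisance is omitted; the block only illustrates the train-on-K-1, solve-on-held-out pattern.

```python
import numpy as np

def rbf_features(x, centers, gamma=1.0):
    """Map scalars x to RBF features (a finite stand-in for an RKHS; assumption)."""
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

def ridge_fit(X, Y, lam):
    """Ridge solution of X @ B ~ Y; works for vector or matrix targets."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def cross_fitted_bridge(c, w, y, n_folds=5, lam1=1e-3, lam2=1e-3, seed=0):
    """Hypothetical K-fold cross-fitted two-stage estimator (illustrative only)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    centers = np.linspace(-2.0, 2.0, 10)  # fixed feature centers (assumption)
    thetas = []
    for k in range(n_folds):
        held = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Stage I: estimate the nuisance E[phi(W) | C] on the other K-1 folds.
        A = ridge_fit(rbf_features(c[train], centers),
                      rbf_features(w[train], centers), lam1)
        # Stage II: solve the empirical moment equation on the held-out fold,
        # using the Stage I prediction in place of the unknown embedding.
        mu_hat = rbf_features(c[held], centers) @ A
        thetas.append(ridge_fit(mu_hat, y[held], lam2))
    # Aggregate by averaging fold-specific coefficients (one common choice).
    return np.mean(thetas, axis=0), centers
```

A call such as `theta, centers = cross_fitted_bridge(c, w, y, n_folds=4)` returns coefficients of the estimated bridge function in the chosen feature basis, so that `rbf_features(w_new, centers) @ theta` evaluates it at new points.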
If this is right
- The cross-fitted estimator uses data more efficiently than a single sample split while keeping the same bridge-based identification.
- The total error decomposes into a term driven by how well the nuisances are estimated and a term driven by the empirical averaging over the held-out folds.
- Oracle-comparator bounds continue to hold for the K-fold procedure under the same rate conditions that apply to the original two-stage estimator.
- Policy evaluation and planning remain valid because the identification strategy via conditional moment restrictions is unchanged.
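The decomposition in the second bullet can be written schematically as follows. The risk functional, the comparator over the bridge class, and the symbols for the two error terms are illustrative notation assumed here; the exact norms and constants are those of the paper's theorem and are not reproduced.

```latex
\mathcal{R}(\hat b)
\;\lesssim\;
\inf_{b \in \mathcal{B}} \mathcal{R}(b)
\;+\; \underbrace{\Delta_{\mathrm{I}}}_{\text{Stage I: nuisance estimation}}
\;+\; \underbrace{\Delta_{\mathrm{II}}}_{\text{Stage II: empirical averaging}}
```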
Where Pith is reading between the lines
- Practitioners could select the number of folds K to trade off bias from smaller training sets against variance from fewer held-out samples in a given dataset size.
- The same cross-fitting template may apply to other conditional-moment problems that arise in offline reinforcement learning with latent confounders.
- Empirical tests on standard POMDP benchmarks with injected confounding could quantify how much the efficiency gain improves downstream planning performance.
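The fold-count trade-off in the first bullet can be made concrete by tabulating how many samples each fold leaves for nuisance estimation and how many for the held-out moment equations. `fold_sizes` is a hypothetical helper written for this illustration, not something the paper provides.

```python
def fold_sizes(n, n_folds):
    """Return (training size, held-out size) for each of n_folds near-equal folds."""
    base, extra = divmod(n, n_folds)
    held = [base + (1 if k < extra else 0) for k in range(n_folds)]
    return [(n - h, h) for h in held]

if __name__ == "__main__":
    for k in (2, 5, 10):
        train, held = fold_sizes(1000, k)[0]
        print(f"K={k}: Stage I trains on {train}, Stage II solves on {held}")
```

Larger K gives each Stage I fit more data but shrinks each held-out fold, which is the tension the bullet describes.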
Load-bearing premise
The bridge functions satisfy the conditional moment restrictions and the nuisance estimators for the conditional mean embedding and conditional density converge at rates sufficient for the oracle error bound to hold.
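For orientation, a conditional moment restriction has the generic form on the left below, and a conditional mean embedding turns a conditional expectation into an inner product through the reproducing property, as on the right. The residual ρ, the data vector Z, and the space H_W are illustrative notation assumed here; only b, the conditioning variable C, and μ_{W|C} appear in the passages quoted on this page.

```latex
\mathbb{E}\!\left[\rho(Z; b_0) \mid C\right] = 0,
\qquad
\mathbb{E}\!\left[b(W) \mid C = c\right]
= \left\langle b,\ \mu_{W \mid C}(c) \right\rangle_{\mathcal{H}_W}
\quad \text{for } b \in \mathcal{H}_W .
```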
What would settle it
In a controlled simulation of a confounded POMDP, the cross-fitted estimator fails to achieve lower error than a single-split version once nuisance estimation error is held fixed, or the observed error does not decompose as predicted into separate Stage I and Stage II terms.
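The shape of such a comparison can be sketched on a toy problem. The block below is not a confounded POMDP: it assumes a linear model with one hidden confounder `u`, one observed variable `c` used for Stage I, and a scalar target `beta`, and it does not hold nuisance error fixed. It only shows how single-split and 2-fold cross-fitted errors would be tabulated side by side; all names are hypothetical.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope through the origin."""
    return float(x @ y / (x @ x))

def two_stage(c, w, y, train, held):
    a_hat = slope(c[train], w[train])        # Stage I: fit E[W | C] = a * C
    return slope(a_hat * c[held], y[held])   # Stage II: solve on held-out data

def simulate(n=400, beta=1.0, n_reps=300, seed=0):
    """Mean squared error of single-split vs. 2-fold cross-fitted estimates."""
    rng = np.random.default_rng(seed)
    err_single, err_cross = [], []
    for _ in range(n_reps):
        u = rng.normal(size=n)               # hidden confounder
        c = rng.normal(size=n)               # observed, independent of u
        w = c + u + rng.normal(size=n)
        y = beta * w + u + rng.normal(size=n)
        idx = rng.permutation(n)
        f0, f1 = idx[: n // 2], idx[n // 2:]
        b_single = two_stage(c, w, y, f0, f1)
        # Cross-fitting: swap the roles of the folds and average.
        b_cross = 0.5 * (b_single + two_stage(c, w, y, f1, f0))
        err_single.append((b_single - beta) ** 2)
        err_cross.append((b_cross - beta) ** 2)
    return float(np.mean(err_single)), float(np.mean(err_cross))

if __name__ == "__main__":
    mse_single, mse_cross = simulate()
    print(f"single split MSE: {mse_single:.4f}, cross-fitted MSE: {mse_cross:.4f}")
```

A faithful test of the paper's claim would replace the linear model with a confounded POMDP simulator and the slope fits with the bridge estimator, and would report the Stage I and Stage II contributions separately.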
Original abstract
Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a $K$-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a K-fold cross-fitted extension of the two-stage bridge estimator for learning reward-emission and observation-transition bridge functions that satisfy conditional moment restrictions in confounded POMDPs. It claims this procedure preserves the original bridge-based identification while using data more efficiently than a single split, and derives an oracle-comparator bound that decomposes estimation error into a Stage I term from nuisance estimation (conditional mean embedding and conditional density) and a Stage II term from empirical averaging.
Significance. If the oracle-comparator bound holds under suitable conditions, the work would provide a data-efficient method for policy evaluation in offline model-based RL with hidden confounding and partial observability. The error decomposition into nuisance and averaging components is a useful structural contribution for analyzing two-stage estimators in this setting.
major comments (2)
- [oracle-comparator bound derivation] The oracle-comparator bound and its decomposition into Stage I (nuisance) and Stage II (averaging) terms, as described in the abstract, are load-bearing for the central claim. However, the manuscript supplies no explicit convergence-rate requirements (e.g., faster than n^{-1/4}), function-class assumptions, eigenvalue conditions, or regularity conditions on the POMDP data-generating process under which the nuisance estimators make the Stage I term negligible. Without these, the bound's validity and non-vacuousness cannot be verified.
- [cross-fitting procedure] The claim that the K-fold cross-fitted procedure 'preserves the original bridge-based identification strategy' while extending the two-stage estimator is central, yet the interaction between cross-fitting and the conditional moment restrictions is not shown to be free of additional bias terms. A concrete lemma or verification establishing equivalence (or controlled difference) to the single-split case is needed.
minor comments (1)
- [Abstract] The abstract states that the procedure 'uses the available data more efficiently than a single sample split' but provides no quantitative comparison (e.g., variance reduction factor or asymptotic efficiency gain) to support this.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We are pleased that the referee recognizes the potential significance of our cross-fitted proximal bridge estimator. We address each major comment below and have revised the manuscript accordingly.
Point-by-point responses
- Referee: [oracle-comparator bound derivation] The oracle-comparator bound and its decomposition into Stage I (nuisance) and Stage II (averaging) terms, as described in the abstract, are load-bearing for the central claim. However, the manuscript supplies no explicit convergence-rate requirements (e.g., faster than n^{-1/4}), function-class assumptions, eigenvalue conditions, or regularity conditions on the POMDP data-generating process under which the nuisance estimators make the Stage I term negligible. Without these, the bound's validity and non-vacuousness cannot be verified.
  Authors: We agree that explicit conditions are needed for the bound to be non-vacuous. In the revised manuscript we have added Section 4.3, which states the full set of assumptions: nuisance estimators converge at o_p(n^{-1/4}), the relevant Gram matrices satisfy a uniform eigenvalue lower bound, the POMDP reward and transition functions are Lipschitz and bounded, and the function classes admit polynomial covering numbers. Under these conditions we prove that the Stage I term is o_p(n^{-1/2}) and therefore negligible relative to the Stage II term. A new theorem collects all assumptions and states the resulting oracle-comparator bound.
  Revision: yes
- Referee: [cross-fitting procedure] The claim that the K-fold cross-fitted procedure 'preserves the original bridge-based identification strategy' while extending the two-stage estimator is central, yet the interaction between cross-fitting and the conditional moment restrictions is not shown to be free of additional bias terms. A concrete lemma or verification establishing equivalence (or controlled difference) to the single-split case is needed.
  Authors: We accept that a formal verification is required. We have inserted Lemma 3.2, which shows that the K-fold cross-fitted estimator satisfies exactly the same conditional moment restrictions as the single-split estimator. The proof proceeds by applying the law of large numbers to the independent folds and bounding the difference between the cross-fitted and single-split estimating equations by an O_p(K^{-1/2}) term that vanishes with K. Consequently, the identification strategy is preserved asymptotically and the finite-sample bias introduced by cross-fitting is explicitly controlled.
  Revision: yes
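The rate condition quoted in the first response is the one commonly used in the cross-fitting literature, and the arithmetic behind it is short. The sketch below assumes the Stage I term is second-order, that is, bounded by a product of the two nuisance errors; the rebuttal does not spell out this mechanism, so it is an assumption made here for illustration. The symbols for the estimated embedding and density are likewise illustrative.

```latex
\|\hat\mu - \mu_0\| = o_p\!\left(n^{-1/4}\right),
\quad
\|\hat p - p_0\| = o_p\!\left(n^{-1/4}\right)
\;\Longrightarrow\;
\|\hat\mu - \mu_0\| \, \|\hat p - p_0\|
= o_p\!\left(n^{-1/4}\right) \cdot o_p\!\left(n^{-1/4}\right)
= o_p\!\left(n^{-1/2}\right).
```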
Circularity Check
No significant circularity; oracle bound relies on external nuisance rates
Full rationale
The paper extends a prior two-stage bridge estimator with K-fold cross-fitting for estimating bridge functions satisfying CMRs in POMDPs, then derives an oracle-comparator error bound decomposed into Stage I (nuisance estimation error from conditional mean embedding and conditional density) and Stage II (empirical averaging error). This bound is stated relative to external convergence rates on the nuisance estimators rather than reducing to any fitted quantity or definition internal to the cross-fitting procedure itself. Identification via bridge functions is referenced from prior work but the new statistical procedure and bound are developed independently without self-definitional loops, fitted-input predictions, or load-bearing self-citations. The derivation chain is self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Bridge functions exist and satisfy the conditional moment restrictions with the given nuisance objects.
- domain assumption: Nuisance estimators for conditional mean embedding and conditional density achieve rates sufficient for the oracle bound.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a K-fold cross-fitted extension of the existing two-stage bridge estimator... derive an oracle-comparator bound... decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The bridge equation in (9) involves a conditional expectation operator applied to the bridge function b ∈ B. We use an RKHS representation... conditional mean embedding μ_{W|C}
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.