pith. machine review for the scientific record.

arxiv: 2605.09514 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links


Doubly Robust Proxy Causal Learning with Neural Mean Embeddings

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords proxy causal learning · doubly robust estimation · neural mean embeddings · causal inference · continuous treatments · bridge functions · unobserved confounding · dose-response estimation

The pith

A neural doubly robust estimator recovers causal response curves for continuous treatments by combining outcome and treatment proxy bridges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a neural framework that uses proxy variables to identify causal effects when direct adjustment for confounders is impossible. It estimates a treatment bridge via neural mean embeddings, pairs it with a neural outcome bridge, and applies a final regression to produce a doubly robust correction. This yields consistent estimators for full dose-response functions under continuous or structured treatments. The proof shows the overall error is governed by the smaller of the two bridge approximation errors plus the final-stage errors. Experiments on synthetic and image data indicate gains over single-bridge neural methods and classical baselines.
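To make the pipeline concrete, here is a minimal runnable sketch of the three stages as summarized above, with simple ridge regressions standing in for the paper's neural bridges. Every name (feats, ridge, dose_response) and the placeholder target used for the treatment bridge are ours, not the paper's API; this is a shape-of-the-method sketch, not the authors' estimator.

```python
# Minimal sketch of the three-stage doubly robust pipeline described above.
# Ridge regressions stand in for the paper's two-stage neural bridge
# estimators; all names and the placeholder treatment-bridge target are ours.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
U = rng.normal(size=n)                  # unobserved confounder
A = U + rng.normal(size=n)              # continuous treatment
Z = U + rng.normal(size=n)              # treatment-inducing proxy
W = U + rng.normal(size=n)              # outcome-inducing proxy
Y = 2.0 * A + U + rng.normal(size=n)    # outcome

def ridge(X, y, lam=1e-2):
    """Closed-form ridge head: (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def feats(*cols):
    """Toy polynomial feature map standing in for learned neural features."""
    M = np.column_stack(cols)
    return np.column_stack([np.ones(len(M)), M, M**2])

# Stage 1: outcome bridge h(A, W) (in the paper, a neural outcome bridge).
beta_h = ridge(feats(A, W), Y)
h = lambda a, w: feats(a, w) @ beta_h

# Stage 2: treatment bridge phi(A, Z) (in the paper, a neural mean-embedding
# estimator trained to satisfy the moment equation quoted in Figure 1; the
# constant target below is a placeholder, not the paper's objective).
beta_phi = ridge(feats(A, Z), np.ones(n))
phi = lambda a, z: feats(a, z) @ beta_phi

# Stage 3: doubly robust correction via a final regression of the
# bridge-weighted residual on the treatment.
resid = phi(A, Z) * (Y - h(A, W))
beta_c = ridge(feats(A), resid)

def dose_response(a):
    """DR estimate of the dose-response curve at treatment level a."""
    avg_h = h(np.full(n, a), W).mean()            # averaging step
    corr = (feats(np.array([a])) @ beta_c)[0]     # final-regression step
    return avg_h + corr

print(dose_response(0.0), dose_response(1.0))
```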

Core claim

By estimating both the outcome-inducing bridge and a neural mean-embedding treatment bridge, then correcting via a final regression stage, the doubly robust estimator consistently recovers the causal response function; the error is bounded by the final averaging and regression errors together with the minimum of the two weak-norm bridge errors.
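Schematically, with notation that is ours rather than the paper's (ĥ and φ̂ for the fitted bridges, ε_avg and ε_reg for the final averaging and regression errors), the claimed bound has the shape:

```latex
% Schematic form of the stated error decomposition; symbols are ours.
\big\| \hat{\theta} - \theta_0 \big\|
\;\lesssim\;
\varepsilon_{\mathrm{avg}} + \varepsilon_{\mathrm{reg}}
\;+\;
\min\!\Big( \big\| \hat{h} - h_0 \big\|_{\mathrm{weak}},\;
            \big\| \hat{\varphi} - \varphi_0 \big\|_{\mathrm{weak}} \Big)
```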

What carries the argument

A neural mean-embedding estimator of the treatment bridge function, whose output is averaged and combined with the outcome bridge inside a doubly robust correction obtained by a final regression stage.
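One standard way to realize such a correction, consistent with this description though not necessarily the paper's exact estimator, averages the outcome bridge over the sample and regresses the bridge-weighted residual on the treatment:

```latex
% A standard doubly robust combination consistent with the description above;
% the paper's exact conditioning and weighting may differ.
\hat{\theta}(a)
\;=\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat{h}(a, X_i, W_i)}_{\text{averaging}}
\;+\;
\underbrace{\widehat{\mathbb{E}}\Big[\, \hat{\varphi}(A, X, Z)\,
  \big\{ Y - \hat{h}(A, X, W) \big\} \,\Big|\, A = a \Big]}_{\text{final regression}}
```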

If this is right

  • The method produces consistent estimators for population, heterogeneous, and conditional dose-response functions rather than binary effects.
  • Training stability is achieved through two-stage bridge estimation and history-aware linear-layer updates (see the sketch after this list).
  • Error control depends on the smaller of the outcome-side and treatment-side weak-norm bridge errors plus the final regression error.
  • The approach outperforms single-bridge neural estimators and kernel-based baselines on synthetic and image-valued benchmarks.
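The history-aware linear-layer update is described only in prose (see referee minor comment 3 and internal anchors 64–65 below). A minimal sketch of one plausible reading, assuming the update accumulates ridge sufficient statistics across minibatches with exponential decay so the closed-form head solve sees training history rather than a single batch; the class name and the decay parameter are ours, not the paper's:

```python
# Hypothetical sketch of a history-aware closed-form last-layer update:
# sufficient statistics (Phi'Phi, Phi'y) are accumulated with exponential
# decay across minibatches, and the linear head is re-solved in closed form.
# One plausible reading of the prose description, not the paper's code.
import numpy as np

class HistoryAwareHead:
    def __init__(self, dim, lam=1e-2, decay=0.9):
        self.A = lam * np.eye(dim)      # running Phi'Phi + lam I
        self.b = np.zeros(dim)          # running Phi'y
        self.decay = decay              # how quickly old batches are forgotten

    def update(self, Phi, y):
        """Blend in a new minibatch (features Phi, targets y) and re-solve."""
        self.A = self.decay * self.A + Phi.T @ Phi
        self.b = self.decay * self.b + Phi.T @ y
        return np.linalg.solve(self.A, self.b)   # closed-form ridge head

# Usage: after each gradient step on the featurizer, recompute features on
# the current minibatch and refresh the head from the accumulated history.
rng = np.random.default_rng(0)
head = HistoryAwareHead(dim=8)
for _ in range(50):
    Phi = rng.normal(size=(32, 8))      # stand-in for learned features
    y = Phi @ np.arange(8.0) + 0.1 * rng.normal(size=32)
    w = head.update(Phi, y)
print(w.round(2))
```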

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The construction may extend naturally to other high-dimensional structured treatments where kernel methods become impractical.
  • Integration with deeper or recurrent architectures could improve approximation rates for time-varying or sequential treatments.
  • The double-robustness property suggests the method remains useful even when one bridge is misspecified, provided the other is accurate enough.

Load-bearing premise

Valid treatment- and outcome-inducing proxies exist that satisfy the required bridge equations, and neural networks can approximate those bridge functions sufficiently well in the relevant norms.
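For reference, the bridge equations being assumed have the following form. The treatment-bridge moment equation is quoted from the Figure 1 caption below; the outcome-bridge equation is the standard one from the proxy causal learning literature and is our gloss rather than a quotation from the paper:

```latex
% Outcome bridge h_0 (standard PCL form; our gloss):
\mathbb{E}\big[\, Y - h_0(A, X, W) \,\big|\, A, X, Z \,\big] = 0
% Treatment bridge \varphi (moment equation as quoted in Figure 1):
\mathbb{E}\big[\, r_0(A, X, W) - \varphi(A, X, Z) \,\big|\, A, X, W \,\big] = 0,
\qquad
r_0(a, x, w) = \frac{p_A(a)}{p_{A \mid X, W}(a \mid x, w)}
```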

What would settle it

A dataset where the proxies satisfy the bridge equations yet the estimated response curve deviates from the true causal curve by more than the sum of the reported final-stage and minimum-bridge errors.

Figures

Figures reproduced from arXiv: 2605.09514 by Alexandre Galashov, Arthur Gretton, Bariscan Bozkurt, Dimitri Meunier, Houssam Zenati, Zikai Shen.

Figure 1. Causal graph for the PCL setting. Treatment bridges provide the complementary weighting route. For a given causal estimand, a treatment bridge φ(A, X, Z) is characterized by a moment equation of the form E{r₀(A, X, W) − φ(A, X, Z) | A, X, W} = 0, where r₀ denotes the estimand-specific density ratio; for the population dose-response curve, r₀(a, x, w) = p_A(a)/p_{A|X,W}(a | x, w). This makes treatment bridges anal… view at source ↗
Figure 2. Estimator comparison across the main benchmark settings. Each panel reports causal MSE on a … view at source ↗
Figure 3. Conditional dose-response estimation on the low-dimensional synthetic benchmark. Each panel reports … view at source ↗
Figure 4. Bridge misspecification on the synthetic low-dimensional dose-response benchmark. Each panel … view at source ↗
Figure 5. Heterogeneous response estimation under broken proxy–confounder links with … view at source ↗
read the original abstract

Unobserved confounding prevents standard covariate adjustment from identifying causal response functions in observational studies. Proxy causal learning addresses this problem through bridge equations involving treatment- and outcome-inducing proxies, avoiding direct recovery of the latent confounder. Existing doubly robust proxy estimators combine outcome and treatment bridges, but typically rely on fixed kernels, sieves, or low-dimensional semiparametric models; existing neural proxy methods are more flexible, but are largely single-bridge estimators. We develop a neural doubly robust framework for proxy causal learning with continuous and structured treatments. Our method introduces a neural mean-embedding estimator for the treatment bridge, combines it with a neural outcome bridge, and estimates the doubly robust correction through a final regression stage. The framework covers population, heterogeneous, and conditional dose-response functions, yielding full response-curve estimators rather than binary-treatment effects. The algorithms use two stages for each bridge and history-aware updates of the final linear layers to stabilize stochastic multi-stage training. We prove consistency of the algorithms showing that the doubly robust error is controlled by the final averaging and regression errors together with the smaller of the outcome- and treatment-side weak-norm bridge errors. Across synthetic and image-valued benchmarks, the proposed estimators outperform existing baselines and single-bridge neural estimators, showing the benefit of combining learned outcome and treatment bridges in a doubly robust construction. Our implementation is available at https://github.com/BariscanBozkurt/DRPCL-Neural-Mean-Embedding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper develops a neural doubly robust framework for proxy causal learning with continuous and structured treatments. It introduces a neural mean-embedding estimator for the treatment bridge, combines it with a neural outcome bridge, and estimates the doubly robust correction via a final regression stage. The method covers population, heterogeneous, and conditional dose-response functions. Training uses two stages per bridge with history-aware linear-layer updates for stability. A consistency result is proved showing that the doubly robust error is controlled by final averaging/regression errors together with the smaller of the outcome- and treatment-side weak-norm bridge errors. Empirical results on synthetic and image-valued benchmarks show outperformance over baselines and single-bridge neural estimators, with code released at a GitHub repository.

Significance. If the consistency theorem holds under the stated proxy and approximation conditions, the work meaningfully extends proxy causal learning by providing a flexible, doubly robust neural approach that improves robustness for non-binary treatments. The explicit error bound in terms of the minimum bridge error plus regression terms follows standard doubly robust logic while accommodating neural function classes. Open-source code supports reproducibility and is a clear strength.

minor comments (3)
  1. §3 (bridge function definitions): the weak-norm notation is introduced without an explicit comparison to standard RKHS or L2 norms; adding one sentence relating the weak norm to the bridge equation would clarify the approximation requirements for neural networks.
  2. §5 (experiments): benchmark tables report point estimates but omit standard errors or confidence intervals across runs; including these would strengthen the claim of consistent outperformance over single-bridge estimators.
  3. Algorithm 1/2 (training procedure): the history-aware linear-layer update is described in prose; a short pseudocode block would improve clarity for readers implementing the two-stage stabilization.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. The provided summary accurately captures the main contributions of the paper, including the neural doubly robust framework, the mean-embedding treatment bridge, consistency guarantees, and empirical results. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper's central consistency result bounds the doubly robust error explicitly in terms of final regression/averaging errors plus the smaller of the two weak-norm bridge approximation errors. This follows directly from standard doubly robust analysis once the neural mean-embedding estimators for the treatment and outcome bridges are assumed to achieve the stated rates in the relevant function spaces; the bound does not reduce to any fitted parameter by construction, nor does it rely on a self-citation chain for its validity. The two-stage training procedure and history-aware linear-layer updates are presented as practical stabilization heuristics without being invoked in the theoretical guarantee. Empirical comparisons on synthetic and image benchmarks are reported separately from the proof and do not serve as load-bearing evidence for the consistency claim. No self-definitional, fitted-input-renamed-as-prediction, or uniqueness-imported-from-authors steps appear in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the validity of proxy bridge equations and neural approximation power; no new entities are postulated.

free parameters (1)
  • neural network architecture and hyperparameters
    Choice of layers, widths, learning rates, and regularization for the mean-embedding and outcome networks, fitted during training.
axioms (2)
  • domain assumption Existence of treatment-inducing and outcome-inducing proxies satisfying the bridge equations
    Invoked to justify the doubly robust construction for identifying causal response functions.
  • domain assumption Neural networks can approximate the required bridge functions in the relevant weak norms
    Needed for the consistency proof to hold with the stated error bounds.

pith-pipeline@v0.9.0 · 5575 in / 1466 out tokens · 64779 ms · 2026-05-12T03:53:44.680403+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors

  1. Donald B. Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980. URL http://www.jstor.org/stable/2287653
  2. P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.
  3. Guido Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 2004.
  4. Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20:217–240, 2011. doi: 10.1198/jcgs.2010.08162
  5. Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, 2016.
  6. Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems, 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/a50abba8132a77191791390c3eb19fe7-Paper.pdf
  7. Olav Reiersøl. Confluence analysis by means of instrumental sets of variables. PhD thesis, Almqvist & Wiksell, 1945.
  8. James Stock and Francesco Trebbi. Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17:177–194, 2003. doi: 10.1257/089533003769204416
  9. Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003. URL http://www.jstor.org/stable/1555512
  10. Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
  11. Wang Miao, Zhi Geng, and Eric Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993, 2018.
  12. Eric J. Tchetgen Tchetgen, Andrew Ying, Yifan Cui, Xu Shi, and Wang Miao. An introduction to proximal causal inference. Statistical Science, 39(3):375–390, 2024. doi: 10.1214/23-STS911
  13. Yifan Cui, Hongming Pu, Xu Shi, Wang Miao, and Eric Tchetgen Tchetgen. Semiparametric proximal causal inference. Journal of the American Statistical Association, 119(546):1348–1359, 2024. doi: 10.1080/01621459.2023.2191817
  14. Ben Deaner. Proxy controls and panel data, 2023. URL https://arxiv.org/abs/1810.00283
  15. Afsaneh Mastouri, Yuchen Zhu, Limor Gultchin, Anna Korba, Ricardo Silva, Matt J. Kusner, Arthur Gretton, and Krikamol Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International Conference on Machine Learning, 2021.
  16. Rahul Singh. Kernel methods for unobserved confounding: Negative controls, proxies, and instruments,
  17. URL https://arxiv.org/abs/2012.10315
  18. Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. Deep proxy causal learning and its application to confounded bandit policy evaluation. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=0FDxsIEv9G
  19. Benjamin Kompa, David Bellamy, Tom Kolokotrones, Andrew Beam, et al. Deep learning methods for proximal inference via maximum moment restriction. Advances in Neural Information Processing Systems, 2022.
  20. James M. Robins, Miguel A. Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000. doi: 10.1097/00001648-200009000-00011. PMID: 10955408
  21. Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Causal inference under unmeasured confounding with negative controls: A minimax learning approach, 2021.
  22. Bariscan Bozkurt, Ben Deaner, Dimitri Meunier, Liyuan Xu, and Arthur Gretton. Density ratio-based proxy causal learning without density ratios. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.
  23. Bingxi Zhang, Tao Shen, and Yifan Cui. Neural estimation of treatment bridge functions for proximal causal inference. Statistical Analysis and Data Mining: An ASA Data Science Journal, 18(5):e70045, 2025. doi: 10.1002/sam.70045
  24. Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005. doi: 10.1111/j.1541-0420.2005.00377.x. PMID: 16401269
  25. Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. doi: 10.1111/ectj.12097
  26. Edward H. Kennedy. Semiparametric doubly robust targeted double machine learning: A review. In Eric Laber, Bibhas Chakraborty, Erica E. M. Moodie, Tianxi Cai, and Mark J. van der Laan, editors, Handbook of Statistical Methods for Precision Medicine, pages 207–236. Chapman and Hall/CRC, 2024. doi: 10.48550/arXiv.2203.06469
  27. Yong Wu, Yanwei Fu, Shouyan Wang, and Xinwei Sun. Doubly robust proximal causal learning for continuous treatments. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TjGJFkU3xL
  28. Bariscan Bozkurt, Houssam Zenati, Dimitri Meunier, Liyuan Xu, and Arthur Gretton. Density ratio-free doubly robust proxy causal learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=a9HOg4f9Gh
  29. Alexandre Galashov, Nathaël Da Costa, Liyuan Xu, Philipp Hennig, and Arthur Gretton. Closed-form last layer optimization, 2025. URL https://arxiv.org/abs/2510.04606
  30. Liyuan Xu and Arthur Gretton. A neural mean embedding approach for back-door and front-door adjustment. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=rLguqxYvYHB
  31. Liyuan Xu, Yutian Chen, Siddarth Srinivasan, Nando de Freitas, Arnaud Doucet, and Arthur Gretton. Learning deep features in instrumental variable regression. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=sy4Kg_ZQmS7
  32. Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  33. Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989. doi: 10.1007/BF01589116
  34. Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney K. Newey, Vasilis Syrgkanis, and Masatoshi Uehara. Inference on strongly identified functionals of weakly identified functions. Journal of the Royal Statistical Society Series B: Statistical Methodology, page qkaf075, 2025.
  35. Zonghao Chen, Atsushi Nitanda, Arthur Gretton, and Taiji Suzuki. Towards a unified analysis of neural networks in nonparametric instrumental variable regression: Optimization and generalization. arXiv preprint arXiv:2511.14710, 2025.
  36. Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. doi: 10.1214/009053605000000282
  37. Dylan J. Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.
  38. Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.
  39. Xiaohong Chen and Timothy M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.
  40. Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
  41. Jason Abrevaya, Yu-Chin Hsu, and Robert P. Lieli. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505, 2015. doi: 10.1080/07350015.2014.975555
  42. Edward H. Kennedy, Zongming Ma, Matthew D. McHugh, and Dylan S. Small. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(4):1229–1245, 2017.
  43. Kyle Colangelo and Ying-Ying Lee. Double debiased machine learning nonparametric inference with continuous treatments. Journal of Business & Economic Statistics, pages 1–26, 2025.
  44. Houssam Zenati, Judith Abécassis, Julie Josse, and Bertrand Thirion. Double debiased machine learning for mediation analysis with continuous treatments. In International Conference on Artificial Intelligence and Statistics, volume 258, pages 4150–4158, 2025.
  45. Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning, 2009.
  46. Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimilano Pontil. Conditional mean embeddings as regressors. In International Conference on Machine Learning, 2012.
  47. Junhyung Park and Krikamol Muandet. A measure-theoretic approach to kernel conditional mean embeddings. Advances in Neural Information Processing Systems, 2020.
  48. Liyuan Xu and Arthur Gretton. Kernel single proxy control for deterministic confounding, 2024. URL https://arxiv.org/abs/2308.04585
  49. Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In David Helmbold and Bob Williamson, editors, Computational Learning Theory, pages 416–426, Berlin, Heidelberg,
  50. Springer Berlin Heidelberg. ISBN 978-3-540-44581-4.
  51. Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael L... In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24).
  52. Xiaohong Chen and Markus Reiss. On rate optimality for ill-posed inverse problems in econometrics. Econometric Theory, 27(3):497–521, 2011.
  53. Dimitri Meunier, Zhu Li, Tim Christensen, and Arthur Gretton. Nonparametric instrumental regression via kernel methods is minimax optimal. arXiv preprint arXiv:2411.19653, 2024.
  54. Zikai Shen, Zonghao Chen, Dimitri Meunier, Ingo Steinwart, Arthur Gretton, and Zhu Li. Nonparametric instrumental variable regression with observed covariates, 2025. URL https://arxiv.org/abs/2511.19404
  55. Juno Kim, Dimitri Meunier, Arthur Gretton, Taiji Suzuki, and Zhu Li. Optimality and adaptivity of deep neural features for instrumental variable regression. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 94163–94206, 2025. URL https://proceedings.iclr.cc/paper_files/paper/20...
  56. Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  57. Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017. doi: 10.1080/24709360.2017.1396742
  58. Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul Bunau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008. doi: 10.1007/s10463-008-0197-x
  59. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl
  60. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7
  61. Outcome bridge network. We begin with the Deep Feature Proxy Causal Learning (DFPCL) architecture of Xu et al. [17], which provides a neural parameterization of the outcome bridge estimator. We refine this component by incorporating proximal closed-form updates for the final linear layer and by introducing a hybrid optimization scheme for the second-stage h...
  62. Treatment bridge network. We next introduce a neural estimator of the treatment bridge function in Section E.2. This construction follows the same principles as the outcome-side network: adaptive feature learning, proximal closed-form updates for the last linear layer, and hybrid optimization in the second stage.
  63. Neural doubly robust unification. Finally, in Section E.3, we combine the outcome- and treatment-bridge components into a fully neural doubly robust estimator of the dose-response curve. This final stage is designed to leverage the complementary strengths of both bridge functions. E.1 Dose-response curve estimation: outcome bridge-based approach. We first ...
  64. Featurizer update. We first update the second-stage neural parameters θ₂^(h) by a gradient step on Equation 42, using the current estimate of the head and the current auxiliary first-stage operator V̌_t^(h).
  65. Head refinement. Holding the feature extractors fixed, we then refine the second-stage linear head h by approximately minimizing Equation 42 with respect to h. In our implementation, this inner optimization is performed by K_h steps of L-BFGS [32], using the PyTorch implementation [49]. This numerical refinement plays the same role as the closed-form update in t...
    Head refinement.Holding the feature extractors fixed, we then refine the second-stage linear headh by approximately minimizing Equation 42 with respect toh. In our implementation, this inner optimization is performed byKh steps of L-BFGS [32], using thePyTorchimplementation [49]. This numerical refinement plays the same role as the closed-form update in t...