pith. machine review for the scientific record.

arxiv: 2605.09514 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links


Doubly Robust Proxy Causal Learning with Neural Mean Embeddings

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords proxy causal learning · doubly robust estimation · neural mean embeddings · causal inference · continuous treatments · bridge functions · unobserved confounding · dose-response estimation

The pith

A neural doubly robust estimator recovers causal response curves for continuous treatments by combining outcome and treatment proxy bridges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a neural framework that uses proxy variables to identify causal effects when direct adjustment for confounders is impossible. It estimates a treatment bridge via neural mean embeddings, pairs it with a neural outcome bridge, and applies a final regression to produce a doubly robust correction. This yields consistent estimators for full dose-response functions under continuous or structured treatments. The proof shows the overall error is governed by the smaller of the two bridge approximation errors plus the final-stage errors. Experiments on synthetic and image data indicate gains over single-bridge neural methods and classical baselines.
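To make the pipeline concrete, here is a minimal runnable sketch of the three stages as summarized above, with simple ridge regressions standing in for the paper's neural bridges. Every name (feats, ridge, dose_response) and the placeholder target used for the treatment bridge are ours, not the paper's API; this is a shape-of-the-method sketch, not the authors' estimator.

```python
# Minimal sketch of the three-stage doubly robust pipeline described above.
# Ridge regressions stand in for the paper's two-stage neural bridge
# estimators; all names and the placeholder treatment-bridge target are ours.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
U = rng.normal(size=n)                  # unobserved confounder
A = U + rng.normal(size=n)              # continuous treatment
Z = U + rng.normal(size=n)              # treatment-inducing proxy
W = U + rng.normal(size=n)              # outcome-inducing proxy
Y = 2.0 * A + U + rng.normal(size=n)    # outcome

def ridge(X, y, lam=1e-2):
    """Closed-form ridge head: (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def feats(*cols):
    """Toy polynomial feature map standing in for learned neural features."""
    M = np.column_stack(cols)
    return np.column_stack([np.ones(len(M)), M, M**2])

# Stage 1: outcome bridge h(A, W) (in the paper, a neural outcome bridge).
beta_h = ridge(feats(A, W), Y)
h = lambda a, w: feats(a, w) @ beta_h

# Stage 2: treatment bridge phi(A, Z) (in the paper, a neural mean-embedding
# estimator trained to satisfy the moment equation quoted in Figure 1; the
# constant target below is a placeholder, not the paper's objective).
beta_phi = ridge(feats(A, Z), np.ones(n))
phi = lambda a, z: feats(a, z) @ beta_phi

# Stage 3: doubly robust correction via a final regression of the
# bridge-weighted residual on the treatment.
resid = phi(A, Z) * (Y - h(A, W))
beta_c = ridge(feats(A), resid)

def dose_response(a):
    """DR estimate of the dose-response curve at treatment level a."""
    avg_h = h(np.full(n, a), W).mean()            # averaging step
    corr = (feats(np.array([a])) @ beta_c)[0]     # final-regression step
    return avg_h + corr

print(dose_response(0.0), dose_response(1.0))
```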

Core claim

By estimating both the outcome-inducing bridge and a neural mean-embedding treatment bridge, then correcting via a final regression stage, the doubly robust estimator consistently recovers the causal response function; the error is bounded by the final averaging and regression errors together with the minimum of the two weak-norm bridge errors.
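Schematically, with notation that is ours rather than the paper's (ĥ and φ̂ for the fitted bridges, ε_avg and ε_reg for the final averaging and regression errors), the claimed bound has the shape:

```latex
% Schematic form of the stated error decomposition; symbols are ours.
\big\| \hat{\theta} - \theta_0 \big\|
\;\lesssim\;
\varepsilon_{\mathrm{avg}} + \varepsilon_{\mathrm{reg}}
\;+\;
\min\!\Big( \big\| \hat{h} - h_0 \big\|_{\mathrm{weak}},\;
            \big\| \hat{\varphi} - \varphi_0 \big\|_{\mathrm{weak}} \Big)
```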

What carries the argument

A neural mean-embedding estimator of the treatment bridge function, whose output is averaged and combined with the outcome bridge inside a doubly robust correction obtained by a final regression stage.
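One standard way to realize such a correction, consistent with this description though not necessarily the paper's exact estimator, averages the outcome bridge over the sample and regresses the bridge-weighted residual on the treatment:

```latex
% A standard doubly robust combination consistent with the description above;
% the paper's exact conditioning and weighting may differ.
\hat{\theta}(a)
\;=\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \hat{h}(a, X_i, W_i)}_{\text{averaging}}
\;+\;
\underbrace{\widehat{\mathbb{E}}\Big[\, \hat{\varphi}(A, X, Z)\,
  \big\{ Y - \hat{h}(A, X, W) \big\} \,\Big|\, A = a \Big]}_{\text{final regression}}
```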

If this is right

  • The method produces consistent estimators for population, heterogeneous, and conditional dose-response functions rather than binary effects.
  • Training stability is achieved through two-stage bridge estimation and history-aware linear-layer updates (see the sketch after this list).
  • Error control depends on the smaller of the outcome-side and treatment-side weak-norm bridge errors plus the final regression error.
  • The approach outperforms single-bridge neural estimators and kernel-based baselines on synthetic and image-valued benchmarks.
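The history-aware linear-layer update is described only in prose (see referee minor comment 3 and internal anchors 64–65 below). A minimal sketch of one plausible reading, assuming the update accumulates ridge sufficient statistics across minibatches with exponential decay so the closed-form head solve sees training history rather than a single batch; the class name and the decay parameter are ours, not the paper's:

```python
# Hypothetical sketch of a history-aware closed-form last-layer update:
# sufficient statistics (Phi'Phi, Phi'y) are accumulated with exponential
# decay across minibatches, and the linear head is re-solved in closed form.
# One plausible reading of the prose description, not the paper's code.
import numpy as np

class HistoryAwareHead:
    def __init__(self, dim, lam=1e-2, decay=0.9):
        self.A = lam * np.eye(dim)      # running Phi'Phi + lam I
        self.b = np.zeros(dim)          # running Phi'y
        self.decay = decay              # how quickly old batches are forgotten

    def update(self, Phi, y):
        """Blend in a new minibatch (features Phi, targets y) and re-solve."""
        self.A = self.decay * self.A + Phi.T @ Phi
        self.b = self.decay * self.b + Phi.T @ y
        return np.linalg.solve(self.A, self.b)   # closed-form ridge head

# Usage: after each gradient step on the featurizer, recompute features on
# the current minibatch and refresh the head from the accumulated history.
rng = np.random.default_rng(0)
head = HistoryAwareHead(dim=8)
for _ in range(50):
    Phi = rng.normal(size=(32, 8))      # stand-in for learned features
    y = Phi @ np.arange(8.0) + 0.1 * rng.normal(size=32)
    w = head.update(Phi, y)
print(w.round(2))
```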

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The construction may extend naturally to other high-dimensional structured treatments where kernel methods become impractical.
  • Integration with deeper or recurrent architectures could improve approximation rates for time-varying or sequential treatments.
  • The double-robustness property suggests the method remains useful even when one bridge is misspecified, provided the other is accurate enough.

Load-bearing premise

Valid treatment- and outcome-inducing proxies exist that satisfy the required bridge equations, and neural networks can approximate those bridge functions sufficiently well in the relevant norms.
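For reference, the bridge equations being assumed have the following form. The treatment-bridge moment equation is quoted from the Figure 1 caption below; the outcome-bridge equation is the standard one from the proxy causal learning literature and is our gloss rather than a quotation from the paper:

```latex
% Outcome bridge h_0 (standard PCL form; our gloss):
\mathbb{E}\big[\, Y - h_0(A, X, W) \,\big|\, A, X, Z \,\big] = 0
% Treatment bridge \varphi (moment equation as quoted in Figure 1):
\mathbb{E}\big[\, r_0(A, X, W) - \varphi(A, X, Z) \,\big|\, A, X, W \,\big] = 0,
\qquad
r_0(a, x, w) = \frac{p_A(a)}{p_{A \mid X, W}(a \mid x, w)}
```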

What would settle it

A dataset where the proxies satisfy the bridge equations yet the estimated response curve deviates from the true causal curve by more than the sum of the reported final-stage and minimum-bridge errors.

Figures

Figures reproduced from arXiv: 2605.09514 by Alexandre Galashov, Arthur Gretton, Bariscan Bozkurt, Dimitri Meunier, Houssam Zenati, Zikai Shen.

Figure 1. Causal graph for the PCL setting. Treatment bridges provide the complementary weighting route. For a given causal estimand, a treatment bridge φ(A, X, Z) is characterized by a moment equation of the form E{r₀(A, X, W) − φ(A, X, Z) | A, X, W} = 0, where r₀ denotes the estimand-specific density ratio; for the population dose-response curve, r₀(a, x, w) = p_A(a)/p_{A|X,W}(a | x, w). This makes treatment bridges anal… view at source ↗
Figure 2. Estimator comparison across the main benchmark settings. Each panel reports causal MSE on a … view at source ↗
Figure 3. Conditional dose-response estimation on the low-dimensional synthetic benchmark. Each panel reports … view at source ↗
Figure 4. Bridge misspecification on the synthetic low-dimensional dose-response benchmark. Each panel … view at source ↗
Figure 5. Heterogeneous response estimation under broken proxy–confounder links with … view at source ↗
read the original abstract

Unobserved confounding prevents standard covariate adjustment from identifying causal response functions in observational studies. Proxy causal learning addresses this problem through bridge equations involving treatment- and outcome-inducing proxies, avoiding direct recovery of the latent confounder. Existing doubly robust proxy estimators combine outcome and treatment bridges, but typically rely on fixed kernels, sieves, or low-dimensional semiparametric models; existing neural proxy methods are more flexible, but are largely single-bridge estimators. We develop a neural doubly robust framework for proxy causal learning with continuous and structured treatments. Our method introduces a neural mean-embedding estimator for the treatment bridge, combines it with a neural outcome bridge, and estimates the doubly robust correction through a final regression stage. The framework covers population, heterogeneous, and conditional dose-response functions, yielding full response-curve estimators rather than binary-treatment effects. The algorithms use two stages for each bridge and history-aware updates of the final linear layers to stabilize stochastic multi-stage training. We prove consistency of the algorithms showing that the doubly robust error is controlled by the final averaging and regression errors together with the smaller of the outcome- and treatment-side weak-norm bridge errors. Across synthetic and image-valued benchmarks, the proposed estimators outperform existing baselines and single-bridge neural estimators, showing the benefit of combining learned outcome and treatment bridges in a doubly robust construction. Our implementation is available at https://github.com/BariscanBozkurt/DRPCL-Neural-Mean-Embedding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper develops a neural doubly robust framework for proxy causal learning with continuous and structured treatments. It introduces a neural mean-embedding estimator for the treatment bridge, combines it with a neural outcome bridge, and estimates the doubly robust correction via a final regression stage. The method covers population, heterogeneous, and conditional dose-response functions. Training uses two stages per bridge with history-aware linear-layer updates for stability. A consistency result is proved showing that the doubly robust error is controlled by final averaging/regression errors together with the smaller of the outcome- and treatment-side weak-norm bridge errors. Empirical results on synthetic and image-valued benchmarks show outperformance over baselines and single-bridge neural estimators, with code released at a GitHub repository.

Significance. If the consistency theorem holds under the stated proxy and approximation conditions, the work meaningfully extends proxy causal learning by providing a flexible, doubly robust neural approach that improves robustness for non-binary treatments. The explicit error bound in terms of the minimum bridge error plus regression terms follows standard doubly robust logic while accommodating neural function classes. Open-source code supports reproducibility and is a clear strength.

minor comments (3)
  1. §3 (bridge function definitions): the weak-norm notation is introduced without an explicit comparison to standard RKHS or L2 norms; adding one sentence relating the weak norm to the bridge equation would clarify the approximation requirements for neural networks.
  2. §5 (experiments): benchmark tables report point estimates but omit standard errors or confidence intervals across runs; including these would strengthen the claim of consistent outperformance over single-bridge estimators.
  3. Algorithm 1/2 (training procedure): the history-aware linear-layer update is described in prose; a short pseudocode block would improve clarity for readers implementing the two-stage stabilization.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. The provided summary accurately captures the main contributions of the paper, including the neural doubly robust framework, the mean-embedding treatment bridge, consistency guarantees, and empirical results. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper's central consistency result bounds the doubly robust error explicitly in terms of final regression/averaging errors plus the smaller of the two weak-norm bridge approximation errors. This follows directly from standard doubly robust analysis once the neural mean-embedding estimators for the treatment and outcome bridges are assumed to achieve the stated rates in the relevant function spaces; the bound does not reduce to any fitted parameter by construction, nor does it rely on a self-citation chain for its validity. The two-stage training procedure and history-aware linear-layer updates are presented as practical stabilization heuristics without being invoked in the theoretical guarantee. Empirical comparisons on synthetic and image benchmarks are reported separately from the proof and do not serve as load-bearing evidence for the consistency claim. No self-definitional, fitted-input-renamed-as-prediction, or uniqueness-imported-from-authors steps appear in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the validity of proxy bridge equations and neural approximation power; no new entities are postulated.

free parameters (1)
  • neural network architecture and hyperparameters
    Choice of layers, widths, learning rates, and regularization for the mean-embedding and outcome networks, fitted during training.
axioms (2)
  • domain assumption Existence of treatment-inducing and outcome-inducing proxies satisfying the bridge equations
    Invoked to justify the doubly robust construction for identifying causal response functions.
  • domain assumption Neural networks can approximate the required bridge functions in the relevant weak norms
    Needed for the consistency proof to hold with the stated error bounds.

pith-pipeline@v0.9.0 · 5575 in / 1466 out tokens · 64779 ms · 2026-05-12T03:53:44.680403+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 2 internal anchors

  1. Donald B. Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980. URL http://www.jstor.org/stable/2287653
  2. P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.
  3. Guido Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 2004.
  4. Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20:217–240, 2011. doi: 10.1198/jcgs.2010.08162
  5. Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, 2016.
  6. Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems, 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/a50abba8132a77191791390c3eb19fe7-Paper.pdf
  7. Olav Reiersøl. Confluence analysis by means of instrumental sets of variables. PhD thesis, Almqvist & Wiksell, 1945.
  8. James Stock and Francesco Trebbi. Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17:177–194, 2003. doi: 10.1257/089533003769204416
  9. Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003. URL http://www.jstor.org/stable/1555512
  10. Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
  11. Wang Miao, Zhi Geng, and Eric Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993, 2018.
  12. Eric J. Tchetgen Tchetgen, Andrew Ying, Yifan Cui, Xu Shi, and Wang Miao. An introduction to proximal causal inference. Statistical Science, 39(3):375–390, 2024. doi: 10.1214/23-STS911
  13. Yifan Cui, Hongming Pu, Xu Shi, Wang Miao, and Eric Tchetgen Tchetgen. Semiparametric proximal causal inference. Journal of the American Statistical Association, 119(546):1348–1359, 2024. doi: 10.1080/01621459.2023.2191817
  14. Ben Deaner. Proxy controls and panel data, 2023. URL https://arxiv.org/abs/1810.00283
  15. Afsaneh Mastouri, Yuchen Zhu, Limor Gultchin, Anna Korba, Ricardo Silva, Matt J. Kusner, Arthur Gretton, and Krikamol Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. In International Conference on Machine Learning, 2021.
  16. Rahul Singh. Kernel methods for unobserved confounding: Negative controls, proxies, and instruments,
  17. URL https://arxiv.org/abs/2012.10315
  18. Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. Deep proxy causal learning and its application to confounded bandit policy evaluation. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=0FDxsIEv9G
  19. Benjamin Kompa, David Bellamy, Tom Kolokotrones, Andrew Beam, et al. Deep learning methods for proximal inference via maximum moment restriction. Advances in Neural Information Processing Systems, 2022.
  20. James M. Robins, Miguel A. Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000. doi: 10.1097/00001648-200009000-00011. PMID: 10955408
  21. Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Causal inference under unmeasured confounding with negative controls: A minimax learning approach, 2021.
  22. Bariscan Bozkurt, Ben Deaner, Dimitri Meunier, Liyuan Xu, and Arthur Gretton. Density ratio-based proxy causal learning without density ratios. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.
  23. Bingxi Zhang, Tao Shen, and Yifan Cui. Neural estimation of treatment bridge functions for proximal causal inference. Statistical Analysis and Data Mining: An ASA Data Science Journal, 18(5):e70045, 2025. doi: 10.1002/sam.70045
  24. Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005. doi: 10.1111/j.1541-0420.2005.00377.x. PMID: 16401269
  25. Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. doi: 10.1111/ectj.12097
  26. Edward H. Kennedy. Semiparametric doubly robust targeted double machine learning: A review. In Eric Laber, Bibhas Chakraborty, Erica E. M. Moodie, Tianxi Cai, and Mark J. van der Laan, editors, Handbook of Statistical Methods for Precision Medicine, pages 207–236. Chapman and Hall/CRC, 2024. doi: 10.48550/arXiv.2203.06469
  27. Yong Wu, Yanwei Fu, Shouyan Wang, and Xinwei Sun. Doubly robust proximal causal learning for continuous treatments. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TjGJFkU3xL
  28. Bariscan Bozkurt, Houssam Zenati, Dimitri Meunier, Liyuan Xu, and Arthur Gretton. Density ratio-free doubly robust proxy causal learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=a9HOg4f9Gh
  29. Alexandre Galashov, Nathaël Da Costa, Liyuan Xu, Philipp Hennig, and Arthur Gretton. Closed-form last layer optimization, 2025. URL https://arxiv.org/abs/2510.04606
  30. Liyuan Xu and Arthur Gretton. A neural mean embedding approach for back-door and front-door adjustment. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=rLguqxYvYHB
  31. Liyuan Xu, Yutian Chen, Siddarth Srinivasan, Nando de Freitas, Arnaud Doucet, and Arthur Gretton. Learning deep features in instrumental variable regression. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=sy4Kg_ZQmS7
  32. Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  33. Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989. doi: 10.1007/BF01589116
  34. Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney K. Newey, Vasilis Syrgkanis, and Masatoshi Uehara. Inference on strongly identified functionals of weakly identified functions. Journal of the Royal Statistical Society Series B: Statistical Methodology, page qkaf075, 2025.
  35. Zonghao Chen, Atsushi Nitanda, Arthur Gretton, and Taiji Suzuki. Towards a unified analysis of neural networks in nonparametric instrumental variable regression: Optimization and generalization. arXiv preprint arXiv:2511.14710, 2025.
  36. Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. doi: 10.1214/009053605000000282
  37. Dylan J. Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.
  38. Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.
  39. Xiaohong Chen and Timothy M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.
  40. Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
  41. Jason Abrevaya, Yu-Chin Hsu, and Robert P. Lieli. Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505, 2015. doi: 10.1080/07350015.2014.975555
  42. Edward H. Kennedy, Zongming Ma, Matthew D. McHugh, and Dylan S. Small. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(4):1229–1245, 2017.
  43. Kyle Colangelo and Ying-Ying Lee. Double debiased machine learning nonparametric inference with continuous treatments. Journal of Business & Economic Statistics, pages 1–26, 2025.
  44. Houssam Zenati, Judith Abécassis, Julie Josse, and Bertrand Thirion. Double debiased machine learning for mediation analysis with continuous treatments. In International Conference on Artificial Intelligence and Statistics, volume 258, pages 4150–4158, 2025.
  45. Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning, 2009.
  46. Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimilano Pontil. Conditional mean embeddings as regressors. In International Conference on Machine Learning, 2012.
  47. Junhyung Park and Krikamol Muandet. A measure-theoretic approach to kernel conditional mean embeddings. Advances in Neural Information Processing Systems, 2020.
  48. Liyuan Xu and Arthur Gretton. Kernel single proxy control for deterministic confounding, 2024. URL https://arxiv.org/abs/2308.04585
  49. Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In David Helmbold and Bob Williamson, editors, Computational Learning Theory, pages 416–426, Berlin, Heidelberg,
  50. Springer Berlin Heidelberg. ISBN 978-3-540-44581-4.
  51. Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael L... In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24).
  52. Xiaohong Chen and Markus Reiss. On rate optimality for ill-posed inverse problems in econometrics. Econometric Theory, 27(3):497–521, 2011.
  53. Dimitri Meunier, Zhu Li, Tim Christensen, and Arthur Gretton. Nonparametric instrumental regression via kernel methods is minimax optimal. arXiv preprint arXiv:2411.19653, 2024.
  54. Zikai Shen, Zonghao Chen, Dimitri Meunier, Ingo Steinwart, Arthur Gretton, and Zhu Li. Nonparametric instrumental variable regression with observed covariates, 2025. URL https://arxiv.org/abs/2511.19404
  55. Juno Kim, Dimitri Meunier, Arthur Gretton, Taiji Suzuki, and Zhu Li. Optimality and adaptivity of deep neural features for instrumental variable regression. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 94163–94206, 2025. URL https://proceedings.iclr.cc/paper_files/paper/20...
  56. Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  57. Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017. doi: 10.1080/24709360.2017.1396742
  58. Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul Bunau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008. doi: 10.1007/s10463-008-0197-x
  59. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl
  60. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7
  61. Outcome bridge network. We begin with the Deep Feature Proxy Causal Learning (DFPCL) architecture of Xu et al. [17], which provides a neural parameterization of the outcome bridge estimator. We refine this component by incorporating proximal closed-form updates for the final linear layer and by introducing a hybrid optimization scheme for the second-stage h...
  62. Treatment bridge network. We next introduce a neural estimator of the treatment bridge function in Section E.2. This construction follows the same principles as the outcome-side network: adaptive feature learning, proximal closed-form updates for the last linear layer, and hybrid optimization in the second stage.
  63. Neural doubly robust unification. Finally, in Section E.3, we combine the outcome- and treatment-bridge components into a fully neural doubly robust estimator of the dose-response curve. This final stage is designed to leverage the complementary strengths of both bridge functions. E.1 Dose-response curve estimation: outcome bridge-based approach. We first ...
  64. Featurizer update. We first update the second-stage neural parameters θ₂^(h) by a gradient step on Equation 42, using the current estimate of the head and the current auxiliary first-stage operator V̌_t^(h).
  65. Head refinement. Holding the feature extractors fixed, we then refine the second-stage linear head h by approximately minimizing Equation 42 with respect to h. In our implementation, this inner optimization is performed by K_h steps of L-BFGS [32], using the PyTorch implementation [49]. This numerical refinement plays the same role as the closed-form update in t...
    Head refinement.Holding the feature extractors fixed, we then refine the second-stage linear headh by approximately minimizing Equation 42 with respect toh. In our implementation, this inner optimization is performed byKh steps of L-BFGS [32], using thePyTorchimplementation [49]. This numerical refinement plays the same role as the closed-form update in t...