arxiv: 2502.04274 · v4 · submitted 2025-02-06 · 💻 cs.LG

Orthogonal Representation Learning for Estimating Causal Quantities

Pith reviewed 2026-05-23 03:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords causal inference · representation learning · Neyman orthogonality · OR-learners · manifold hypothesis · high-dimensional data · causal estimation · balancing constraint

The pith

Under the low-dimensional manifold hypothesis, orthogonal representation learners can strictly reduce the estimation error of standard Neyman-orthogonal learners for causal quantities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

End-to-end representation learning often succeeds in practice for causal estimation from high-dimensional data yet lacks the quasi-oracle efficiency guaranteed by Neyman-orthogonal learners. The paper introduces OR-learners as a unifying framework that integrates representation learning directly into Neyman-orthogonal estimation. Under the low-dimensional manifold hypothesis, the analysis establishes that these OR-learners achieve strictly lower estimation error than standard Neyman-orthogonal methods. The work also shows that a balancing constraint cannot generally substitute for Neyman-orthogonality without an extra inductive bias. The resulting guidelines indicate how practitioners can combine the two approaches to retain both empirical performance and theoretical guarantees.

Core claim

We introduce OR-learners as a unifying framework that connects representation learning with Neyman-orthogonal learners. Under the low-dimensional manifold hypothesis the OR-learners strictly improve the estimation error of the standard Neyman-orthogonal learners. At the same time the balancing constraint requires an additional inductive bias and cannot generally compensate for the lack of Neyman-orthogonality of the end-to-end approaches.
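
As background for the term Neyman-orthogonality, the block below writes out the standard AIPW (doubly robust) score for the average treatment effect, a textbook example of a Neyman-orthogonal score. The notation is an assumption made here for illustration and need not match the paper's: e is the propensity score, \mu_t are the outcome regressions, and hats denote estimates that are treated as fixed (for example, fitted on a separate sample).

```latex
\psi(X, T, Y; \hat\eta)
  = \hat\mu_1(X) - \hat\mu_0(X)
  + \frac{T\,\bigl(Y - \hat\mu_1(X)\bigr)}{\hat e(X)}
  - \frac{(1 - T)\,\bigl(Y - \hat\mu_0(X)\bigr)}{1 - \hat e(X)}

\mathbb{E}\bigl[\psi(\hat\eta)\bigr] - \mathbb{E}\bigl[\mu_1(X) - \mu_0(X)\bigr]
  = \mathbb{E}\!\left[
      \frac{(\hat e - e)(\hat\mu_1 - \mu_1)}{\hat e}
    + \frac{(\hat e - e)(\hat\mu_0 - \mu_0)}{1 - \hat e}
    \right]
```

The bias involves only products of nuisance errors, so first-order mistakes in either nuisance cancel. A plain plug-in estimator lacks this structure, which is the kind of property the core claim says end-to-end approaches are missing.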

What carries the argument

OR-learners, the framework that augments Neyman-orthogonal learners with learned representations to exploit low-dimensional manifold structure.
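
The three stages named in the Figure 5 caption (representation, nuisance functions, target model) can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation: a principal direction found by power iteration stands in for the representation network, linear and logistic fits stand in for the nuisance networks, the second stage uses the DR-learner pseudo-outcome of Kennedy [43] as one example of a Neyman-orthogonal loss, the target input V is assumed to be the learned representation, and cross-fitting is omitted for brevity. All function names and the data-generating process are hypothetical.

```python
import math
import random


def simulate(n=2000, p=5, seed=0):
    """Toy data (assumption): covariates in R^p lying on a 1-D linear manifold."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(p)]        # fixed embedding direction
    xs, ts, ys, taus = [], [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                        # latent manifold coordinate
        e = 1 / (1 + math.exp(-0.5 * z))           # true propensity score
        t = 1 if rng.random() < e else 0
        tau = 1.0 + 0.5 * z                        # true treatment effect
        xs.append([z * uj for uj in u])
        ts.append(t)
        ys.append(z + t * tau + rng.gauss(0, 0.5))
        taus.append(tau)
    return xs, ts, ys, taus


def fit_representation(xs, iters=50):
    """Stage 0 stand-in: leading principal direction via power iteration."""
    n, p = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(p)]
    cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in xs) / n
            for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return lambda x: sum((xj - mj) * vj for xj, mj, vj in zip(x, mean, v))


def ols(f, target):
    """Least squares of target on [1, f]; returns a prediction function."""
    n = len(f)
    f_bar, t_bar = sum(f) / n, sum(target) / n
    slope = (sum((a - f_bar) * (b - t_bar) for a, b in zip(f, target))
             / sum((a - f_bar) ** 2 for a in f))
    return lambda a: t_bar + slope * (a - f_bar)


def fit_propensity(f, t, iters=25):
    """Logistic regression of t on [1, f] by Newton's method."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for fi, ti in zip(f, t):
            pi = 1 / (1 + math.exp(-(a + b * fi)))
            w = pi * (1 - pi)
            g0 += ti - pi
            g1 += (ti - pi) * fi
            h00 += w
            h01 += w * fi
            h11 += w * fi * fi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return lambda fi: 1 / (1 + math.exp(-(a + b * fi)))


def or_learner_sketch(xs, ts, ys):
    phi_fn = fit_representation(xs)                        # stage 0
    phi = [phi_fn(x) for x in xs]
    e_hat = fit_propensity(phi, ts)                        # stage 1: propensity
    mu = {arm: ols([f for f, t in zip(phi, ts) if t == arm],
                   [y for y, t in zip(ys, ts) if t == arm])
          for arm in (0, 1)}                               # stage 1: outcomes
    pseudo = []
    for f, t, y in zip(phi, ts, ys):
        e = min(max(e_hat(f), 0.05), 0.95)                 # clip for stability
        m0, m1 = mu[0](f), mu[1](f)
        pseudo.append(m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e))
    g = ols(phi, pseudo)                                   # stage 2: target model
    return [g(f) for f in phi]


xs, ts, ys, taus = simulate()
tau_hat = or_learner_sketch(xs, ts, ys)
mse = sum((a - b) ** 2 for a, b in zip(tau_hat, taus)) / len(taus)
print(f"in-sample MSE of the estimated effect: {mse:.4f}")
```

Because the toy covariates lie exactly on a one-dimensional subspace, every model after stage 0 works with a single input instead of p inputs, which is the mechanism the summary attributes to the framework.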

If this is right

  • Representation learning strengthens Neyman-orthogonal learners by reducing estimation error when the manifold hypothesis holds.
  • Balancing constraints alone cannot replace Neyman-orthogonality without additional inductive bias.
  • The framework supplies concrete guidelines for combining representation learning with classical Neyman-orthogonal learners.
  • Both practical performance and asymptotic optimality become attainable in the same estimator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applied domains with plausible low-dimensional structure, such as image or genomic data, may obtain more accurate causal estimates by adopting OR-learners.
  • Empirical checks for manifold structure become relevant before claiming the improved error rates.
  • The result raises the question of whether similar gains appear for other classes of orthogonal estimators beyond the ones analyzed.

Load-bearing premise

The data must satisfy the low-dimensional manifold hypothesis for the strict improvement in estimation error to hold.
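
A rough way to see why this premise carries so much weight is the classical nonparametric regression rate of Stone [65], under which the mean squared error scales as n^(-2s/(2s+dim)) for an s-smooth target. The sketch below is not the paper's bound; it only compares that classical rate at a hypothetical ambient dimension and a hypothetical manifold dimension, and the function names are made up for this illustration.

```python
def stone_rate(n, smoothness, dim):
    """Classical minimax MSE rate n^(-2s / (2s + dim)) for s-smooth regression."""
    return n ** (-2.0 * smoothness / (2.0 * smoothness + dim))


def rate_gap(n, smoothness, ambient_dim, manifold_dim):
    """Ratio of the ambient-dimension rate to the manifold-dimension rate."""
    return (stone_rate(n, smoothness, ambient_dim)
            / stone_rate(n, smoothness, manifold_dim))


# Hypothetical setting: 10,000 samples, smoothness 2, ambient dim 50, manifold dim 2.
ambient = stone_rate(10_000, 2, 50)
manifold = stone_rate(10_000, 2, 2)
print(f"ambient-dimension rate:  {ambient:.4f}")
print(f"manifold-dimension rate: {manifold:.4f}")
print(f"gap factor:              {rate_gap(10_000, 2, 50, 2):.1f}")
```

The gap factor equals one when the two dimensions coincide and grows as the manifold dimension falls below the ambient dimension, which matches the condition d < p discussed in the rebuttal.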

What would settle it

Synthetic or real data generated without low-dimensional manifold structure where the estimation error of OR-learners equals or exceeds that of standard Neyman-orthogonal learners.

Figures

Figures reproduced from arXiv: 2502.04274 by Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel, Valentyn Melnychuk.

Figure 1: Hidden layers of the representation network
Figure 2: Insights for RQ 2. For both figures, we highlight in …
Figure 3: Results for synthetic data in Setting 2. Reported: ratio between the performance of TARFlow (CFRFlow with α = 0) and invertible representation networks with varying α; mean ± SE over 15 runs. Lower is better. Here: n_train = 500, d_{\hat{\phi}} = 2. (Adjacent text from the paper: "Setup. We follow prior literature [15, 52] and use several (semi-)synthetic datasets where both counterfactual outcomes Y[0] and Y[1] and ground-truth cov…")
Figure 4: Flow chart of consistency and Neyman-orthogonality for representation learning methods. The …
Figure 5: An overview of the OR-learners. The OR-learners proceed in three stages: (0) fitting a representation network, (1) estimation of the nuisance functions, and (2) fitting a target network. For stage (0), we also show different options for the target network input V. Depending on the choice of the input V, the second-stage model g(V) obtains different interpretations: it either learns a new model from scratch…
Figure 6: Results for synthetic data in Setting 2. Reported: ratio between the performance of TARFlow (CFRFlow with α = 0) and invertible representation networks with varying α; mean ± SE over 15 runs. Lower is better. Here: n_train ∈ {250, 1000}, d_{\hat{\phi}} = 2.
Figure 7: Visualization of the invertible transformations defined by the learned normalizing flow representation
Figure 8: Results for IHDP experiments in Setting 2. Reported: ratio between the performance of TARFlow (CFRFlow with α = 0) and invertible representation networks with varying α; mean ± SE over 100 train/test splits. Lower is better. Here: d_{\hat{\phi}} = 12.
Original abstract

End-to-end representation learning has become a powerful tool for estimating causal quantities from high-dimensional observational data, but its efficiency remained unclear. Here, we face a central tension: End-to-end representation learning methods often work well in practice but lack asymptotic optimality in the form of the quasi-oracle efficiency. In contrast, two-stage Neyman-orthogonal learners provide such a theoretical optimality property but do not explicitly benefit from the strengths of representation learning. In this work, we step back and ask two research questions: (1) When do representations strengthen existing Neyman-orthogonal learners? and (2) Can a balancing constraint - a commonly proposed technique in the representation learning literature - provide improvements to Neyman-orthogonality? We address these two questions through our theoretical and empirical analysis, where we introduce a unifying framework that connects representation learning with Neyman-orthogonal learners (namely, OR-learners). In particular, we show that, under the low-dimensional manifold hypothesis, the OR-learners can strictly improve the estimation error of the standard Neyman-orthogonal learners. At the same time, we find that the balancing constraint requires an additional inductive bias and cannot generally compensate for the lack of Neyman-orthogonality of the end-to-end approaches. Building on these insights, we offer guidelines for how users can effectively combine representation learning with the classical Neyman-orthogonal learners to achieve both practical performance and theoretical guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OR-learners, a unifying framework that integrates representation learning into Neyman-orthogonal estimation for causal quantities from high-dimensional data. It addresses when representations strengthen Neyman-orthogonal learners and whether balancing constraints can substitute for Neyman-orthogonality. The central theoretical result is that, under the low-dimensional manifold hypothesis, OR-learners strictly improve estimation error over standard Neyman-orthogonal learners; a secondary result is that balancing constraints require extra inductive bias and cannot generally restore Neyman-orthogonality. The work concludes with practical guidelines for combining the approaches.

Significance. If the strict-improvement result holds, the paper bridges a key gap between the empirical success of representation learning and the quasi-oracle efficiency of two-stage Neyman-orthogonal methods, offering both a positive theoretical contribution and a clarifying negative result on balancing. The analysis is grounded in existing Neyman-orthogonality theory rather than redefining estimands, which strengthens its internal consistency.

major comments (2)
  1. [§4.2, Theorem 3] (or the equivalent statement of the strict improvement): the error decomposition must explicitly isolate how the manifold dimension produces a strictly smaller leading asymptotic term than the standard Neyman-orthogonal bound, without an offsetting bias or slower rate from the representation step. The current argument shows a reduction in nuisance estimation error but does not yet demonstrate that the resulting gap is strictly positive for all admissible manifold dimensions.
  2. [§3.1] Definition of the OR-learner score: it is not immediate that the representation map preserves the Neyman-orthogonality property of the original score when the manifold hypothesis is imposed only on the nuisance functions. An explicit verification that the cross-term remains zero (or o_p(n^{-1/2})) is required for the subsequent rate claims to hold.
minor comments (2)
  1. Notation for the representation function φ and the manifold dimension d_M should be introduced once in the preliminaries and used consistently; occasional redefinition in later sections reduces readability.
  2. Figure 2 (or equivalent empirical plot): axis labels and legend entries should explicitly state whether the plotted quantity is the finite-sample MSE or the estimated asymptotic variance to allow direct comparison with the theoretical bounds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important points for strengthening the theoretical claims. We have revised the paper to address both major comments explicitly, adding clarifications and a new lemma as described below.

Point-by-point responses
  1. Referee: [§4.2, Theorem 3] (or the equivalent statement of the strict improvement): the error decomposition must explicitly isolate how the manifold dimension produces a strictly smaller leading asymptotic term than the standard Neyman-orthogonal bound, without an offsetting bias or slower rate from the representation step. The current argument shows a reduction in nuisance estimation error but does not yet demonstrate that the resulting gap is strictly positive for all admissible manifold dimensions.

    Authors: We appreciate the referee's precise identification of the needed strengthening. The original decomposition already isolates the nuisance rate improvement under the manifold hypothesis (reducing the leading term from the ambient dimension p to the manifold dimension d), with no additional bias term introduced by the representation step under our maintained assumptions. In the revision we have expanded the proof of Theorem 3 with an explicit side-by-side comparison of the two asymptotic expansions, showing that the gap remains strictly positive whenever d < p (the admissible range under the low-dimensional manifold hypothesis). A new remark following the theorem states the condition under which the inequality is strict and confirms that the representation step does not slow the rate or add bias. Revision: yes.

  2. Referee: [§3.1] Definition of the OR-learner score: it is not immediate that the representation map preserves the Neyman-orthogonality property of the original score when the manifold hypothesis is imposed only on the nuisance functions. An explicit verification that the cross-term remains zero (or o_p(n^{-1/2})) is required for the subsequent rate claims to hold.

    Authors: We thank the referee for this observation. The representation map is applied only to the nuisance functions (which lie on the manifold by assumption), while the target parameter and the score structure remain unchanged. Because the original score satisfies Neyman orthogonality with respect to the full nuisance, and the representation is a deterministic function of the nuisance estimator, the cross-term vanishes by the law of iterated expectations. In the revision we have inserted a short lemma (new Lemma 3.1) that explicitly computes this cross-term and verifies it is o_p(n^{-1/2}) under the manifold rate, thereby justifying the subsequent rate claims without additional assumptions. Revision: yes.
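
The kind of cross-term check requested in the second comment can be illustrated numerically. The sketch below is not the paper's Lemma 3.1: it uses the average treatment effect and the standard AIPW score on a three-point covariate distribution, so that population expectations are exact sums. All numbers and perturbation directions are hypothetical. Nuisances are perturbed by a factor r away from the truth; a plug-in estimate inherits a bias linear in r, while the bias of the orthogonal score is a product of nuisance errors and scales like r squared.

```python
# Discrete toy population: X takes three values. All numbers are hypothetical.
P_X = [0.2, 0.5, 0.3]          # P(X = x)
E_TRUE = [0.3, 0.5, 0.7]       # true propensity e(x)
MU0_TRUE = [0.0, 1.0, 2.0]     # true E[Y | T=0, X=x]
MU1_TRUE = [1.0, 2.5, 2.0]     # true E[Y | T=1, X=x]
D_E = [0.5, -0.4, 0.3]         # perturbation direction for e
D_MU0 = [1.0, -0.5, 0.8]       # perturbation direction for mu_0
D_MU1 = [-0.7, 0.6, 1.2]       # perturbation direction for mu_1


def population_bias(r):
    """Exact bias of plug-in and AIPW estimands when nuisances are off by r."""
    plug_in = aipw = 0.0
    for px, e, mu0, mu1, de, d0, d1 in zip(
            P_X, E_TRUE, MU0_TRUE, MU1_TRUE, D_E, D_MU0, D_MU1):
        e_hat, m0, m1 = e + r * de, mu0 + r * d0, mu1 + r * d1
        truth = mu1 - mu0
        plug_in += px * ((m1 - m0) - truth)
        # E[score | X=x], using E[T | x] = e and E[Y | T=t, x] = mu_t
        score = (m1 - m0
                 + e * (mu1 - m1) / e_hat
                 - (1 - e) * (mu0 - m0) / (1 - e_hat))
        aipw += px * (score - truth)
    return plug_in, aipw


plug_big, aipw_big = population_bias(0.1)
plug_small, aipw_small = population_bias(0.01)
plug_ratio = plug_big / plug_small      # first-order: shrinks about 10-fold
aipw_ratio = aipw_big / aipw_small      # second-order: shrinks about 100-fold
print(f"plug-in bias ratio: {plug_ratio:.1f}")
print(f"AIPW bias ratio:    {aipw_ratio:.1f}")
```

Shrinking the nuisance error tenfold cuts the plug-in bias by about ten and the orthogonal-score bias by about a hundred. Whether the same second-order behaviour survives when the nuisances are functions of a learned representation is exactly what the referee asks the authors to verify.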

Circularity Check

0 steps flagged

No circularity; derivation extends Neyman-orthogonal learners via external manifold assumption

The paper defines OR-learners as a new unifying framework that augments existing Neyman-orthogonal methods with representation learning. The key claim of strict improvement is conditioned on the low-dimensional manifold hypothesis, an independent modeling assumption rather than a quantity derived from the learners themselves. No equations or results reduce by construction to fitted parameters renamed as predictions, self-citations that carry the central proof, or ansatzes smuggled from prior author work. The abstract and described analysis present an independent theoretical extension with external benchmarks (Neyman orthogonality and manifold structure) that do not collapse into the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The improvement claim rests on the low-dimensional manifold hypothesis as a domain assumption; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: low-dimensional manifold hypothesis.
    Invoked to establish that OR-learners strictly improve estimation error over standard Neyman-orthogonal learners.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records

    cs.LG · 2025-07 · unverdicted · novelty 6.0

    AACE is an annotation-assisted method for causal policy learning from multimodal EHRs that outperforms risk-based and representation-based baselines on synthetic, semi-synthetic, and real datasets.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Joseph Antonelli, Matthew Cefalu, Nathan Palmer, and Denis Agniel. Doubly robust matching estimators for high dimensional confounding adjustment. Biometrics, 74(4):1171–1179, 2018

  2. [2]

    Serge Assaad, Shuxi Zeng, Chenyang Tao, Shounak Datta, Nikhil Mehta, Ricardo Henao, Fan Li, and Lawrence Carin. Counterfactual representation learning with balancing weights. In International Conference on Artificial Intelligence and Statistics, 2021

  3. [3]

    Onur Atan, William R. Zame, and Mihaela van der Schaar. Counterfactual policy optimization using domain-adversarial neural networks. 2018

  4. [4]

    Sivaraman Balakrishnan, Edward H. Kennedy, and Larry Wasserman. The fundamental limits of structure-agnostic functional estimation. arXiv preprint arXiv:2305.04116, 2023

  5. [5]

    Anirban Basu, Daniel Polsky, and Willard G. Manning. Estimating treatment effects on healthcare costs under exogeneity: is there a ‘magic bullet’? Health Services and Outcomes Research Methodology, 11:1–26, 2011

  6. [6]

    Ioana Bica, Ahmed M. Alaa, James Jordon, and Mihaela van der Schaar. Estimating counterfactual treatment outcomes over time through adversarially balanced representations. In International Conference on Learning Representations, 2020

  7. [7]

    Vinod K. Chauhan, Soheila Molaei, Marzia Hoque Tania, Anshul Thakur, Tingting Zhu, and David A. Clifton. Adversarial de-confounding in individualised treatment effects estimation. In International Conference on Artificial Intelligence and Statistics, 2023

  8. [8]

    Ricky T.Q. Chen, Jens Behrmann, David K. Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, 2019

  9. [9]

    Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, et al. Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4):1–4, 2015

  10. [10]

    Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–265, 2017

  11. [11]

    Alexander Mangulad Christgau and Niels Richard Hansen. Efficient adjustment for complex covariates: Gaining efficiency with DOPE. arXiv preprint arXiv:2402.12980, 2024

  12. [12]

    Amanda Coston, Edward Kennedy, and Alexandra Chouldechova. Counterfactual predictions under runtime confounding. Advances in Neural Information Processing Systems, 2020

  13. [13]

    Daniel Csillag, Claudio Jose Struchiner, and Guilherme Tegoni Goedert. Generalization bounds for causal regression: Insights, guarantees and sensitivity analysis. In International Conference on Machine Learning, 2024

  14. [14]

    Alicia Curth and Mihaela van der Schaar. On inductive biases for heterogeneous treatment effect estimation. Advances in Neural Information Processing Systems, 2021

  15. [15]

    Alicia Curth and Mihaela van der Schaar. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In International Conference on Artificial Intelligence and Statistics, 2021

  16. [16]

    Alicia Curth and Mihaela van der Schaar. In search of insights, not magic bullets: Towards demystification of the model selection dilemma in heterogeneous treatment effect estimation. In International Conference on Machine Learning, 2023

  17. [17]

    Alicia Curth, Ahmed M. Alaa, and Mihaela van der Schaar. Estimating structural target functions using machine learning and influence functions. arXiv preprint arXiv:2008.06461, 2020

  18. [18]

    Alicia Curth, David Svensson, Jim Weatherall, and Mihaela van der Schaar. Really doing great at estimating CATE? A critical look at ML benchmarking practices in treatment effect estimation. In Advances in Neural Information Processing Systems, 2021

  19. [19]

    Alexander D’Amour and Alexander Franks. Deconfounding scores: Feature representations for causal effect estimation with weak overlap. arXiv preprint arXiv:2104.05762, 2021

  20. [20]

    Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68, 2019

  21. [21]

    Xin Du, Lei Sun, Wouter Duivesteijn, Alexander Nikolaev, and Mykola Pechenizkiy. Adversarial balancing-based representation learning for causal effect inference with observational data. Data Mining and Knowledge Discovery, 35(4):1713–1738, 2021

  22. [22]

    Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016

  23. [23]

    Stefan Feuerriegel, Dennis Frauen, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Alicia Curth, Stefan Bauer, Niki Kilbertus, Isaac S. Kohane, and Mihaela van der Schaar. Causal machine learning for predicting treatment outcomes. Nature Medicine, 2024

  24. [24]

    Aaron Fisher. Inverse-variance weighting for estimation of heterogeneous treatment effects. In International Conference on Machine Learning, 2024

  25. [25]

    Dylan J. Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023

  26. [26]

    Dennis Frauen, Valentyn Melnychuk, and Stefan Feuerriegel. Fair off-policy learning from observational data. In International Conference on Machine Learning, 2024

  27. [27]

    Dennis Frauen, Konstantin Hess, and Stefan Feuerriegel. Model-agnostic meta-learners for estimating heterogeneous treatment effects over time. In International Conference on Learning Representations, 2025

  28. [28]

    Xingzhuo Guo, Yuchen Zhang, Jianmin Wang, and Mingsheng Long. Estimating heterogeneous treatment effects: Mutual information bounds and learning algorithms. In International Conference on Machine Learning, 2023

  29. [29]

    Ben B. Hansen. The prognostic analogue of the propensity score. Biometrika, 95(2):481–488, 2008

  30. [30]

    Negar Hassanpour and Russell Greiner. CounterFactual regression with importance sampling weights. In International Joint Conference on Artificial Intelligence, 2019

  31. [31]

    Negar Hassanpour and Russell Greiner. Learning disentangled representations for counterfactual regression. In International Conference on Learning Representations, 2019

  32. [32]

    Konstantin Hess, Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Bayesian neural controlled differential equations for treatment effect estimation. In International Conference on Learning Representations, 2024

  33. [33]

    Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011

  34. [34]

    Ming-Yueh Huang and Kwun Chuen Gary Chan. Joint sufficient dimension reduction and estimation of conditional and average treatment effects. Biometrika, 104(3):583–596, 2017

  35. [35]

    Yiyan Huang, Cheuk Hang Leung, Siyi Wang, Yijun Li, and Qi Wu. Unveiling the potential of robustness in evaluating causal inference models. In Advances in Neural Information Processing Systems, 2024

  36. [36]

    Andrew Jesson, Sören Mindermann, Yarin Gal, and Uri Shalit. Quantifying ignorance in individual-level causal-effect estimates under hidden confounding. In International Conference on Machine Learning, 2021

  37. [37]

    Jikai Jin and Vasilis Syrgkanis. Structure-agnostic optimality of doubly robust learning for treatment effect estimation. arXiv preprint arXiv:2402.14264, 2024

  38. [38]

    Fredrik D. Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, 2016

  39. [39]

    Fredrik D. Johansson, Nathan Kallus, Uri Shalit, and David Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018

  40. [40]

    Fredrik D. Johansson, David Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. In International Conference on Artificial Intelligence and Statistics, 2019

  41. [41]

    Fredrik D. Johansson, Uri Shalit, Nathan Kallus, and David Sontag. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. Journal of Machine Learning Research, 23:7489–7538, 2022

  42. [42]

    Nathan Kallus, Xiaojie Mao, and Angela Zhou. Interval estimation of individual-level causal effects under unobserved confounding. In International Conference on Artificial Intelligence and Statistics, 2019

  43. [43]

    Edward H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023

  44. [44]

    Kwangho Kim and José R Zubizarreta. Fair and robust estimation of heterogeneous treatment effects for policy learning. In International Conference on Machine Learning, 2023

  45. [45]

    Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019

  46. [46]

    Milan Kuzmanovic, Dennis Frauen, Tobias Hatt, and Stefan Feuerriegel. Causal machine learning for cost-effective allocation of development aid. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2024

  47. [47]

    Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998

  48. [48]

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  49. [49]

    Wei Luo and Yeying Zhu. Matching using sufficient dimension reduction for causal inference. Journal of Business & Economic Statistics, 38(4):888–900, 2020

  50. [50]

    David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, 2018

  51. [51]

    Causal transformer for estimating counterfactual outcomes

    Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Causal transformer for estimating counterfactual outcomes. InInternational Confer- ence on Machine Learning, 2022

  52. [52]

    Bounds on representation-induced confounding bias for treatment effect estimation

    Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Bounds on representation-induced confounding bias for treatment effect estimation. InInternational Conference on Learning Repre- sentations, 2024

  53. [53]

    On a general class of orthogonal learners for the estimation of heterogeneous treat- ment effects.arXiv preprint arXiv:2303.12687, 2023

    Pawel Morzywolek, Johan Decruyenaere, and Stijn Vansteelandt. On a general class of orthogonal learners for the estimation of heterogeneous treat- ment effects.arXiv preprint arXiv:2303.12687, 2023

  54. [54]

    Quasi-oracle estimation of heterogeneous treatment effects

    Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108:299–319, 2021

  55. [55]

    Niswander

    Kenneth R. Niswander. The collaborative perina- tal study of the National Institute of Neurological Diseases and Stroke.The Woman and Their Preg- nancies, 1972

  56. [56]

    Polyak and Anatoli B

    Boris T. Polyak and Anatoli B. Juditsky. Accel- eration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30 (4):838–855, 1992

  57. [57]

    Variational inference with normalizing flows

    Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. InInternational Conference on Machine Learning, 2015

  58. [58]

    Semiparametric efficiency in multivariate regression models with missing data

    James M. Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995

  59. [59]

    The central role of the propensity score in observational studies for causal effects

    Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983

  60. [60]

    Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974

  61. [61]

    Adjustment for confounding using pre-trained representations

    Rickmer Schulte, David Rügamer, and Thomas Nagler. Adjustment for confounding using pre-trained representations. In International Conference on Machine Learning, 2025

  62. [62]

    Perfect Match: A Simple Method for Learning Representations For Counterfactual Inference With Neural Networks

    Patrick Schwab, Lorenz Linhardt, and Walter Karlen. Perfect match: A simple method for learning representations for counterfactual inference with neural networks. arXiv preprint arXiv:1810.00656, 2018

  63. [63]

    Estimating individual treatment effect: Generalization bounds and algorithms

    Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In International Conference on Machine Learning, 2017

  64. [64]

    Adapting neural networks for the estimation of treatment effects

    Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects. Advances in Neural Information Processing Systems, 2019

  65. [65]

    Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982

  66. [66]

    Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts

    Lars van der Laan, Marco Carone, and Alex Luedtke. Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts. arXiv preprint arXiv:2402.01972, 2024

  67. [67]

    Targeted learning: causal inference for observational and experimental data

    Mark J. van der Laan, Sherri Rose, et al. Targeted learning: causal inference for observational and experimental data, volume 4. Springer, 2011

  68. [68]

    Orthogonal prediction of counterfactual outcomes

    Stijn Vansteelandt and Paweł Morzywołek. Orthogonal prediction of counterfactual outcomes. arXiv preprint arXiv:2311.09423, 2023

  69. [69]

    Hal R. Varian. Causal inference in economics and marketing. Proceedings of the National Academy of Sciences, 113(27):7310–7315, 2016

  70. [70]

    Optimal transport for treatment effect estimation

    Hao Wang, Jiajun Fan, Zhichao Chen, Haoxuan Li, Weiming Liu, Tianqiao Liu, Quanyu Dai, Yichao Wang, Zhenhua Dong, and Ruiming Tang. Optimal transport for treatment effect estimation. Advances in Neural Information Processing Systems, 2024

  71. [71]

    Learning decomposed representations for treatment effect estimation

    Anpeng Wu, Junkun Yuan, Kun Kuang, Bo Li, Runze Wu, Qiang Zhu, Yueting Zhuang, and Fei Wu. Learning decomposed representations for treatment effect estimation. IEEE Transactions on Knowledge and Data Engineering, 35(5):4989–5001, 2022

  72. [72]

    Stable estimation of heterogeneous treatment effects

    Anpeng Wu, Kun Kuang, Ruoxuan Xiong, Bo Li, and Fei Wu. Stable estimation of heterogeneous treatment effects. In International Conference on Machine Learning, 2023

  73. [73]

    Reducing confounding bias without data splitting for causal inference via optimal transport

    Yuguang Yan, Zongyu Li, Haolin Yang, Zeqin Yang, Hao Zhou, Ruichu Cai, and Zhifeng Hao. Reducing confounding bias without data splitting for causal inference via optimal transport. In International Conference on Machine Learning, 2025

  74. [74]

    Revisiting counterfactual regression through the lens of Gromov-Wasserstein information bottleneck

    Hao Yang, Zexu Sun, Hongteng Xu, and Xu Chen. Revisiting counterfactual regression through the lens of Gromov-Wasserstein information bottleneck. arXiv preprint arXiv:2405.15505, 2024

  75. [75]

    Representation learning for treatment effect estimation from observational data

    Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. Representation learning for treatment effect estimation from observational data. Advances in Neural Information Processing Systems, 2018

  76. [76]

    Learning fair representations

    Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, 2013

  77. [77]

    Learning overlapping representations for the estimation of individualized treatment effects

    Yao Zhang, Alexis Bellot, and Mihaela van der Schaar. Learning overlapping representations for the estimation of individualized treatment effects. In International Conference on Artificial Intelligence and Statistics, 2020.

    Orthogonal Representation Learning for Estimating Causal Quantities: Appendix A Extended Related Work Our work aims to unify two str...

  78. [78]

    low overlap – low heterogeneity

    that have three hidden layers with a tunable synchronous number of units. All the networks for the OR-learners (see Stages 0 – 2 in Fig. 5) are trained with AdamW [48]. Each network is trained for n_epoch = 200 epochs on the synthetic dataset and for n_epoch = 50 epochs on the ACIC 2016 dataset collection. To further stabilize training of the target networks in st...