Flow-based Generative Modeling of Potential Outcomes and Counterfactuals
Pith reviewed 2026-05-22 13:25 UTC · model grok-4.3
The pith
PO-Flow uses continuous normalizing flows to jointly model potential outcome distributions and factual-conditioned counterfactual outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PO-Flow is a continuous normalizing flow framework for causal inference that jointly models potential outcome distributions and factual-conditioned counterfactual outcomes. Trained via flow matching, it supplies a unified approach to individualized potential outcome prediction, conditional average treatment effect estimation, and counterfactual prediction. The key mechanism encodes an observed factual outcome and decodes it under an alternative treatment to produce the counterfactual. A supporting recovery guarantee holds under certain assumptions on the data-generating process and model class, and the model enables likelihood-based evaluation of predictions.
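To make "trained via flow matching" concrete, here is a minimal sketch of a conditional flow matching (CFM) loss on a one-dimensional toy problem. It is not the paper's training code. The Gaussian target, the constants `M` and `S`, and the function names are assumptions made for illustration; the closed-form `marginal_velocity` stands in for a trained network.

```python
import random

M, S = 3.2, 0.75  # hypothetical target distribution: y ~ N(M, S^2)

def make_pairs(n, rng):
    """Draw independent (base, target) pairs: z ~ N(0, 1), y ~ N(M, S^2)."""
    return [(rng.gauss(0.0, 1.0), rng.gauss(M, S)) for _ in range(n)]

def marginal_velocity(y_t, t):
    """E[y - z | y_t] for this Gaussian toy problem, i.e. the CFM minimizer."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return M + (t * S * S - (1.0 - t)) / var_t * (y_t - t * M)

def cfm_loss(velocity, pairs, rng):
    """Monte Carlo CFM loss with the linear interpolant y_t = (1 - t) z + t y."""
    total = 0.0
    for z, y in pairs:
        t = rng.random()                 # t ~ Uniform[0, 1)
        y_t = (1.0 - t) * z + t * y      # point on the straight path
        target = y - z                   # conditional velocity along that path
        total += (velocity(y_t, t) - target) ** 2
    return total / len(pairs)

pairs = make_pairs(2000, random.Random(0))
loss_good = cfm_loss(marginal_velocity, pairs, random.Random(1))
loss_zero = cfm_loss(lambda y_t, t: 0.0, pairs, random.Random(1))
```

In this toy setting the closed-form field attains a much lower loss than the zero field, which is the behavior a trained velocity network would be expected to approach.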
What carries the argument
The encode-decode mechanism of PO-Flow, in which an observed factual outcome is encoded into the continuous normalizing flow and decoded under a different treatment to recover the counterfactual outcome distribution.
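The encode-decode step can be sketched as follows. This is an illustration under strong assumptions, not PO-Flow itself: outcomes are one-dimensional, Y(a) given x is Gaussian with the hypothetical parameters in `mean_std`, and the resulting closed-form velocity field replaces a trained network. All function names are invented for the example.

```python
def mean_std(x, a):
    """Hypothetical ground truth: Y(a) | x ~ N(mean, std^2)."""
    mean = 1.0 + 0.5 * x + (2.0 if a == 1 else 0.0)
    std = 0.5 + (0.25 if a == 1 else 0.0)
    return mean, std

def velocity(y, t, x, a):
    """Closed-form flow velocity from N(0, 1) at t=0 to N(mean, std^2) at t=1."""
    m, s = mean_std(x, a)
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return m + (t * s * s - (1.0 - t)) / var_t * (y - t * m)

def integrate(y, x, a, t0, t1, steps=200):
    """Fourth-order Runge-Kutta integration of dy/dt = velocity from t0 to t1."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = velocity(y, t, x, a)
        k2 = velocity(y + 0.5 * h * k1, t + 0.5 * h, x, a)
        k3 = velocity(y + 0.5 * h * k2, t + 0.5 * h, x, a)
        k4 = velocity(y + h * k3, t + h, x, a)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return y

def encode(y, x, a):
    """Run the flow backward (t=1 to t=0) under the factual treatment."""
    return integrate(y, x, a, 1.0, 0.0)

def decode(z, x, a):
    """Run the flow forward (t=0 to t=1) under the chosen treatment."""
    return integrate(z, x, a, 0.0, 1.0)

def counterfactual(y_factual, x, a_factual, a_alt):
    """Encode the factual outcome, then decode it under the other treatment."""
    return decode(encode(y_factual, x, a_factual), x, a_alt)

y_cf = counterfactual(1.7, x=0.4, a_factual=0, a_alt=1)
```

For this toy field the flow map is affine, so the numerical result can be checked against the closed form `mean_alt + std_alt * (y - mean_factual) / std_factual`.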
If this is right
- The model supplies individualized predictions of potential outcomes under each treatment.
- It produces estimates of the conditional average treatment effect at the level of individual covariate profiles.
- Counterfactual outcomes are obtained directly by encoding a factual observation and decoding under the alternative treatment.
- Likelihood-based evaluation gives uncertainty-aware assessment of all predictions.
- The recovery guarantee ensures the model can recover true conditional distributions when the stated assumptions hold.
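One way to read the CATE bullet above: once a generative model can sample both potential outcomes for a covariate profile, the conditional average treatment effect is the difference of the sample means. The sketch below assumes a hypothetical sampler, `sample_potential_outcome`, in place of a trained model; its Gaussian form and parameters are invented for illustration.

```python
import random
import statistics

def sample_potential_outcome(x, a, rng):
    """Hypothetical stand-in for sampling Y(a) | x from a trained model."""
    mean = 1.0 + 0.5 * x + (2.0 if a == 1 else 0.0)
    std = 0.5 + (0.25 if a == 1 else 0.0)
    return mean + std * rng.gauss(0.0, 1.0)

def estimate_cate(x, n_samples, rng):
    """Monte Carlo estimate of E[Y(1) - Y(0) | X = x]."""
    y1 = [sample_potential_outcome(x, 1, rng) for _ in range(n_samples)]
    y0 = [sample_potential_outcome(x, 0, rng) for _ in range(n_samples)]
    return statistics.mean(y1) - statistics.mean(y0)

cate_hat = estimate_cate(x=0.4, n_samples=20000, rng=random.Random(0))
```

In this toy setup the true effect is 2.0 by construction, so the estimate should land close to that value; with a real model the truth is unknown and only benchmark datasets with simulated outcomes allow such a check.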
Where Pith is reading between the lines
- Clinicians could use the uncertainty estimates to weigh predictions more carefully when deciding between treatments for individual patients.
- The encode-decode structure might be adapted to handle time-varying or multiple sequential treatments.
- Combining PO-Flow with methods that learn the treatment assignment mechanism could reduce reliance on strong ignorability assumptions.
- Testing on longitudinal clinical registries would reveal whether the generated counterfactuals align with observed follow-up data.
Load-bearing premise
The recovery guarantee holds only under certain unspecified assumptions on the data-generating process and the model class.
What would settle it
Generate synthetic data from a process that violates the paper's recovery assumptions and check whether the encode-decode mechanism still recovers the true conditional counterfactual distributions.
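A sketch of such a test, under assumptions that are ours rather than the paper's: the chosen violation is a noise term whose sign flips under treatment, so both outcome marginals stay Gaussian while the true counterfactual coupling is no longer monotone. The affine `encode_decode` map stands in for a trained flow that fits both marginals; whether this particular violation is one the paper's assumptions rule out is not stated in the abstract.

```python
import math
import random

MEAN = {0: 1.0, 1: 3.0}   # hypothetical outcome means per treatment
STD = {0: 0.5, 1: 0.75}   # hypothetical outcome standard deviations

def true_outcomes(u, flip_noise):
    """Structural model: shared noise u; optionally sign-flipped under treatment."""
    y0 = MEAN[0] + STD[0] * u
    y1 = MEAN[1] + STD[1] * (-u if flip_noise else u)
    return y0, y1

def encode_decode(y0):
    """Stand-in for a trained flow: monotone map that matches both marginals."""
    z = (y0 - MEAN[0]) / STD[0]
    return MEAN[1] + STD[1] * z

def counterfactual_rmse(flip_noise, n=5000, seed=0):
    """RMSE between decoded and true counterfactuals for control units."""
    rng = random.Random(seed)
    squared_error = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        y0, y1 = true_outcomes(u, flip_noise)
        squared_error += (encode_decode(y0) - y1) ** 2
    return math.sqrt(squared_error / n)

rmse_assumption_holds = counterfactual_rmse(flip_noise=False)
rmse_assumption_violated = counterfactual_rmse(flip_noise=True)
```

Both settings produce identical marginal distributions for Y(0) and Y(1), so a marginal goodness-of-fit check cannot tell them apart; only the counterfactual error separates them.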
Original abstract
Predicting potential and counterfactual outcomes from observational data is central to individualized decision-making, particularly in clinical settings where treatment choices must be tailored to each patient rather than guided solely by population averages. We propose PO-Flow, a continuous normalizing flow (CNF) framework for causal inference that jointly models potential outcome distributions and factual-conditioned counterfactual outcomes. Trained via flow matching, PO-Flow provides a unified approach to individualized potential outcome prediction, conditional average treatment effect estimation, and counterfactual prediction. By encoding an observed factual outcome and decoding under an alternative treatment, PO-Flow provides an encode-decode mechanism for factual-conditioned counterfactual prediction. In addition, PO-Flow supports likelihood-based evaluation of potential outcomes, enabling uncertainty-aware assessment of predictions. A supporting recovery guarantee is established under certain assumptions, and empirical results on benchmark datasets demonstrate strong performance across a range of causal inference tasks within the potential outcomes framework.
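The likelihood-based evaluation mentioned in the abstract rests on the change-of-variables property of continuous normalizing flows: the log-density at an outcome equals the base log-density at its encoded point minus the time integral of the velocity field's divergence. The sketch below illustrates this on a one-dimensional Gaussian toy field. The constants `M` and `S` and all function names are assumptions, and the closed-form field again replaces a trained network.

```python
import math

M, S = 3.2, 0.75  # hypothetical target: Y(a) | x ~ N(M, S^2)

def velocity(y, t):
    """Closed-form flow velocity from N(0, 1) at t=0 to N(M, S^2) at t=1."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return M + (t * S * S - (1.0 - t)) / var_t * (y - t * M)

def divergence(y, t):
    """d velocity / d y; in one dimension this is the full divergence."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return (t * S * S - (1.0 - t)) / var_t

def log_likelihood(y, steps=400):
    """Integrate the state and the divergence backward from t=1 to t=0 (RK4)."""
    h = -1.0 / steps
    t, acc = 1.0, 0.0
    for _ in range(steps):
        k1y, k1a = velocity(y, t), divergence(y, t)
        y2 = y + 0.5 * h * k1y
        k2y, k2a = velocity(y2, t + 0.5 * h), divergence(y2, t + 0.5 * h)
        y3 = y + 0.5 * h * k2y
        k3y, k3a = velocity(y3, t + 0.5 * h), divergence(y3, t + 0.5 * h)
        y4 = y + h * k3y
        k4y, k4a = velocity(y4, t + h), divergence(y4, t + h)
        y += h * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0
        acc += h * (k1a + 2.0 * k2a + 2.0 * k3a + k4a) / 6.0
        t += h
    # y is now the encoded point; acc equals minus the integral of the divergence.
    log_base = -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)
    return log_base + acc

def gaussian_log_density(y, mean, std):
    """Exact reference value for the toy problem."""
    z = (y - mean) / std
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(std)

ll_flow = log_likelihood(3.95)
ll_exact = gaussian_log_density(3.95, M, S)
```

In higher dimensions the divergence is a Jacobian trace and is usually estimated stochastically rather than computed exactly; the one-dimensional case avoids that step.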
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PO-Flow, a continuous normalizing flow (CNF) framework trained via flow matching that jointly models potential outcome distributions within the potential outcomes framework. It claims a unified approach to individualized potential outcome prediction, conditional average treatment effect (CATE) estimation, and factual-conditioned counterfactual prediction via an encode-decode mechanism that reuses the latent code from an observed factual outcome. A supporting recovery guarantee is stated under assumptions on the data-generating process and model class that the provided abstract does not specify, and empirical results are reported on benchmark datasets.
Significance. If the recovery guarantee is valid and the empirical results hold under realistic conditions, PO-Flow would offer a flexible generative approach to causal inference tasks with built-in likelihood-based uncertainty quantification. This could be valuable for individualized decision-making in domains like clinical settings, extending beyond standard regression-based methods by directly modeling conditional distributions and enabling counterfactual generation.
major comments (2)
- [Theoretical analysis / recovery guarantee] Recovery guarantee section: The guarantee is described as holding under 'certain assumptions' on the data-generating process (standard ignorability plus sufficient model capacity and correct flow specification), but these assumptions are not explicitly enumerated or stress-tested in the main text. This is load-bearing for the encode-decode counterfactual claim, as violation (e.g., latent treatment leakage or flow mismatch in low-density regions) would bias decoded counterfactuals even if factual marginals appear accurate.
- [Experiments / benchmark results] Empirical evaluation section (benchmark results): The reported strong performance across causal tasks lacks visible details on error bars, data handling, and targeted robustness checks against assumption violations. Without experiments that deliberately introduce latent leakage or poor convergence, the load-bearing step of factual-conditioned counterfactual recovery remains unverified despite good marginal fits.
minor comments (2)
- [Abstract] Abstract: The claim of 'strong performance' would benefit from inclusion of key quantitative metrics or baseline comparisons to allow readers to gauge the improvement.
- [Notation and definitions] Notation consistency: Ensure uniform use of potential outcome notation (e.g., Y(t) for treatment t) and clear distinction between factual and counterfactual conditioning throughout the methods and results.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help clarify the presentation of our theoretical results and empirical evaluation. We respond to each major comment below and outline the revisions we will make to the manuscript.
-
Referee: [Theoretical analysis / recovery guarantee] Recovery guarantee section: The guarantee is described as holding under 'certain assumptions' on the data-generating process (standard ignorability plus sufficient model capacity and correct flow specification), but these assumptions are not explicitly enumerated or stress-tested in the main text. This is load-bearing for the encode-decode counterfactual claim, as violation (e.g., latent treatment leakage or flow mismatch in low-density regions) would bias decoded counterfactuals even if factual marginals appear accurate.
Authors: We agree that explicitly enumerating the assumptions strengthens the manuscript. The recovery guarantee is established under the following assumptions, which we will list in a dedicated subsection of the revised theoretical analysis: (i) strong ignorability (no unmeasured confounding between treatment and potential outcomes), (ii) positivity (overlap in treatment assignment), (iii) sufficient model capacity of the continuous normalizing flows, and (iv) correct specification of the flow architecture and training objective. We will also add a short discussion of how violations such as latent treatment leakage or flow mismatch in low-density regions could affect the encode-decode counterfactual mechanism, even when marginal fits remain accurate. revision: yes
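Of the four listed assumptions, positivity is the one that can be partly inspected from data. The sketch below is a crude diagnostic, not a procedure from the paper: it bins a single scalar covariate and flags bins whose empirical treatment rate falls outside `[eps, 1 - eps]`. The function name, the binning scheme, and the threshold are all illustrative choices.

```python
def overlap_by_bin(covariates, treatments, n_bins=5, eps=0.1):
    """Return (bin index, count, treated fraction, passes overlap check) per bin."""
    lo, hi = min(covariates), max(covariates)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width if all values are equal
    counts = [[0, 0] for _ in range(n_bins)]
    for x, a in zip(covariates, treatments):
        b = min(int((x - lo) / width), n_bins - 1)
        counts[b][a] += 1
    report = []
    for b, (n_control, n_treated) in enumerate(counts):
        n = n_control + n_treated
        rate = n_treated / n if n else None
        passes = rate is not None and eps <= rate <= 1.0 - eps
        report.append((b, n, rate, passes))
    return report

# Toy data: low-x units are mixed, high-x units are all treated.
xs = [0.0, 0.1, 0.2, 0.3, 0.9, 1.0, 0.95, 0.85]
ts = [0, 1, 0, 1, 1, 1, 1, 1]
report = overlap_by_bin(xs, ts, n_bins=2)
```

A bin that fails the check marks a region where one potential outcome is never observed, so any counterfactual decoded there is extrapolation.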
-
Referee: [Experiments / benchmark results] Empirical evaluation section (benchmark results): The reported strong performance across causal tasks lacks visible details on error bars, data handling, and targeted robustness checks against assumption violations. Without experiments that deliberately introduce latent leakage or poor convergence, the load-bearing step of factual-conditioned counterfactual recovery remains unverified despite good marginal fits.
Authors: We thank the referee for this observation. While the original experiments used multiple random seeds, error bars and detailed data-handling descriptions were omitted from the main text for space reasons. In the revision we will report standard errors from five independent runs, expand the data-handling subsection to describe preprocessing, train/test splits, and hyperparameter selection, and add targeted robustness experiments that introduce controlled latent leakage and assess convergence behavior to directly verify factual-conditioned counterfactual recovery. revision: yes
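For reference, the standard error over independent runs that the authors promise to report can be computed as below. The metric values are placeholders, not results from the paper, and the function name is illustrative.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Sample mean and standard error (sample std / sqrt(n)) across runs."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)

# Placeholder metric values from five hypothetical seeds.
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, se = mean_and_standard_error(runs)
```

With only five runs the standard error is itself noisy, so differences between methods that are within one or two standard errors should be read cautiously.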
Circularity Check
No circularity in derivation; recovery guarantee presented as independent
The abstract and provided context describe PO-Flow as a CNF trained via flow matching for potential outcomes, with an encode-decode mechanism for counterfactuals and a supporting recovery guarantee under unspecified assumptions. No quoted equation or step reduces a claimed prediction or guarantee to a fitted parameter by construction, and the central claim does not rest on a load-bearing self-citation. The model is trained on observational data and evaluated on benchmarks, so the derivation is self-contained and shows none of the circular patterns this check looks for.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard causal assumptions (e.g., ignorability / no unmeasured confounding) hold so that the recovery guarantee applies.
Forward citations
Cited by 4 Pith papers
-
Extended Wasserstein-GAN Approach to Causal Distribution Learning: Density-Free Estimation and Minimax Optimality
GANICE uses an extended Wasserstein distance and cellwise critic in a GAN to estimate conditional interventional distributions with minimax optimality guarantees.
-
Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions
SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator ...
-
CoreFlow: Low-Rank Matrix Generative Models
CoreFlow is a low-rank matrix generative model that trains normalizing flows on shared subspaces to improve efficiency and quality for high-dimensional limited-sample data, including incomplete matrices.
-
RepFlow: Representation Enhanced Flow Matching for Causal Effect Estimation
RepFlow combines representation learning and conditional flow matching to estimate both point and distributional causal effects while mitigating selection bias via entropically regularized Wasserstein distance on norm...
Appendix A. Proofs
Proposition A.1. Assume that p(y|x, a) > 0 for all y and t ∈ [0,1]. Then, up to a constant independent of θ, the Conditional Flow Matching (CFM) loss and the origi...