Causal Diffusion Models for Counterfactual Outcome Distributions in Longitudinal Data
Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3
The pith
A new diffusion model generates complete probability distributions of counterfactual outcomes under sequential interventions in longitudinal data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the Causal Diffusion Model (CDM), the first denoising diffusion probabilistic approach explicitly designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions. CDM employs a novel residual denoising architecture with relational self-attention, capturing intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments for confounding.
What carries the argument
The Causal Diffusion Model, a residual denoising architecture with relational self-attention that directly produces counterfactual outcome distributions from observed longitudinal trajectories.
Load-bearing premise
A residual denoising architecture with relational self-attention can capture intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments for confounding.
What would settle it
The claim would be refuted if, on the pharmacokinetic-pharmacodynamic tumor-growth simulator, the generated counterfactual distributions showed 1-Wasserstein distances to the true distributions equal to or higher than those achieved by existing longitudinal causal-inference methods.
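The settling criterion hinges on the 1-Wasserstein distance between sampled and true counterfactual outcome distributions. For two equal-size one-dimensional samples the empirical metric has a simple closed form over sorted order statistics; a minimal sketch (the equal-size restriction is our simplification, not the paper's setup):

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between equal-size 1-D samples.

    With both samples sorted, W1 reduces to the mean absolute difference
    of order statistics: (1/n) * sum_i |x_(i) - y_(i)|.
    """
    assert len(xs) == len(ys), "equal sample sizes assumed for this closed form"
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

print(wasserstein_1d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
print(wasserstein_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # → 1.0
```

For unequal sample sizes or higher-dimensional outcomes, a library routine such as `scipy.stats.wasserstein_distance` (1-D) or an optimal-transport solver would be needed.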
Original abstract
Predicting counterfactual outcomes in longitudinal data, where sequential treatment decisions heavily depend on evolving patient states, is critical yet notoriously challenging due to complex time-dependent confounding and inadequate uncertainty quantification in existing methods. We introduce the Causal Diffusion Model (CDM), the first denoising diffusion probabilistic approach explicitly designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions. CDM employs a novel residual denoising architecture with relational self-attention, capturing intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments (e.g., inverse-probability weighting or adversarial balancing) for confounding. In rigorous evaluation on a pharmacokinetic-pharmacodynamic tumor-growth simulator widely adopted in prior work, CDM consistently outperforms state-of-the-art longitudinal causal inference methods, achieving a 15-30% relative improvement in distributional accuracy (1-Wasserstein distance) while maintaining competitive or superior point-estimate accuracy (RMSE) under high-confounding regimes. By unifying uncertainty quantification and robust counterfactual prediction in complex, sequentially confounded settings, without tailored deconfounding, CDM offers a flexible, high-impact tool for decision support in medicine, policy evaluation, and other longitudinal domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Causal Diffusion Model (CDM), a denoising diffusion probabilistic model with a residual denoising network and relational self-attention, designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions in longitudinal data. It claims this is achieved without explicit deconfounding adjustments such as inverse-probability weighting, and reports 15-30% relative gains in 1-Wasserstein distance (with competitive RMSE) over prior methods on a single PKPD tumor-growth simulator under high-confounding regimes.
Significance. If the central claim is valid, the work would be significant for providing a unified diffusion-based approach to uncertainty quantification and counterfactual prediction in sequentially confounded longitudinal settings. Strengths include the focus on full outcome distributions rather than point estimates and evaluation on a simulator commonly used in the literature. However, the absence of identification results or explicit mechanisms for interventional sampling limits the assessed impact.
Major comments (3)
- [Method] Method section (architecture and training): The model is trained by optimizing the standard diffusion ELBO on observational (factual) trajectories. No component is described that marginalizes over counterfactual treatment paths, enforces invariance to the observed policy, or performs g-computation-style adjustment; the relational self-attention is presented as sufficient to capture time-dependent confounding implicitly, but this lacks a supporting derivation or identification argument.
- [Experiments] Evaluation section: The reported 15-30% relative improvement in 1-Wasserstein distance is given without error bars, confidence intervals, or results across multiple random seeds; evaluation is restricted to a single simulator (PKPD), which does not test robustness across different confounding structures or data-generating processes.
- [Abstract/Introduction] Abstract and introduction: The claim that CDM 'captures intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments for confounding' is central but unsupported by any theorem, assumption list, or proof sketch showing that the learned conditional is interventional rather than observational.
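The first major comment turns on what the training objective actually contains. A minimal sketch of the standard DDPM ε-prediction loss on a factual outcome makes the point concrete; the names `eps_model` and `history` and the linear β schedule are illustrative stand-ins, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_loss(eps_model, y0, history, T=1000):
    """Simplified DDPM noise-prediction loss on one factual outcome y0.

    Noises y0 to y_t = sqrt(abar_t)*y0 + sqrt(1 - abar_t)*eps and scores
    the model's noise estimate conditioned only on the observed history.
    Note that no inverse-probability weight or treatment-marginalization
    term appears anywhere in this objective.
    """
    betas = np.linspace(1e-4, 0.02, T)   # linear beta schedule (illustrative)
    abar = np.cumprod(1.0 - betas)       # cumulative product alpha-bar
    t = int(rng.integers(T))             # random diffusion timestep
    eps = rng.standard_normal(y0.shape)  # forward-process noise
    y_t = np.sqrt(abar[t]) * y0 + np.sqrt(1.0 - abar[t]) * eps
    return float(np.mean((eps - eps_model(y_t, t, history)) ** 2))

# A model that always predicts zero noise scores roughly E[eps^2] ≈ 1.
loss = ddpm_loss(lambda y_t, t, h: np.zeros_like(y_t), np.zeros(8), history=None)
```

Any deconfounding mechanism would have to enter through extra terms or weights in this loss, or through the sampling procedure; the referee's point is that neither is described.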
Minor comments (2)
- [Method] Notation for the relational self-attention mechanism is introduced informally; a formal definition with input/output dimensions and how it interacts with the residual blocks would improve clarity.
- [Experiments] The abstract states 'rigorous evaluation' but the main text provides limited detail on hyperparameter selection and baseline implementations; adding a reproducibility statement or code link would strengthen the paper.
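The first minor comment asks for explicit input/output dimensions for the attention mechanism. The paper does not formally define "relational self-attention"; as a baseline reading, a single-head scaled dot-product self-attention over a trajectory, with shapes made explicit, would look like this (all names hypothetical):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one trajectory.

    X: (T, d) per-timestep features; Wq, Wk, Wv: (d, d_k) projections.
    Returns a (T, d_k) array, so a residual block can add the output
    back to X whenever d_k == d.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) pairwise relations
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
T_steps, d = 5, 4
X = rng.standard_normal((T_steps, d))
out = self_attention(X, np.eye(d), np.eye(d), np.eye(d))
```

Whatever the paper's "relational" variant adds, stating it at this level of shape detail would resolve the comment.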
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, with clear indications of revisions where the manuscript will be updated.
Point-by-point responses
Referee: [Method] Method section (architecture and training): The model is trained by optimizing the standard diffusion ELBO on observational (factual) trajectories. No component is described that marginalizes over counterfactual treatment paths, enforces invariance to the observed policy, or performs g-computation-style adjustment; the relational self-attention is presented as sufficient to capture time-dependent confounding implicitly, but this lacks a supporting derivation or identification argument.
Authors: We agree that the original submission would have been strengthened by an explicit discussion of how counterfactual sampling is achieved. The model is indeed trained solely on factual trajectories using the standard diffusion objective, with no explicit marginalization or g-computation step. The relational self-attention is intended to learn representations that implicitly handle time-dependent confounding through the observed history. We acknowledge the lack of a formal identification derivation. In the revised manuscript we have added a new paragraph in Section 3.2 that states the key assumptions (sequential ignorability conditional on observed covariates and positivity) under which the learned conditional distribution corresponds to the interventional one, and we clarify that the generative sampling procedure directly produces samples from this distribution without post-hoc reweighting. revision: partial
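Under the assumptions the authors invoke (sequential ignorability and positivity), sampling a counterfactual trajectory amounts to a g-computation-style rollout: draw each next outcome from the learned conditional given the history so far and the *assigned* treatment. A hedged sketch with toy dynamics standing in for the learned model (`sample_outcome` is a hypothetical stand-in, not the paper's sampler):

```python
import random

def rollout(sample_outcome, x0, treatment_plan, n_samples=100):
    """Monte Carlo, g-computation-style interventional rollout.

    sample_outcome(history, a) draws the next outcome given the history
    so far and an assigned (not observed) treatment a. Under sequential
    ignorability and positivity, repeated rollouts approximate the
    interventional outcome distribution for the given treatment plan.
    """
    trajectories = []
    for _ in range(n_samples):
        history = [x0]
        for a in treatment_plan:
            history.append(sample_outcome(history, a))
        trajectories.append(history[1:])
    return trajectories

# Toy dynamics: treatment a shrinks the next outcome by a, plus noise.
random.seed(0)
sim = lambda h, a: h[-1] - a + random.gauss(0.0, 0.1)
trajs = rollout(sim, x0=10.0, treatment_plan=[1.0, 1.0, 0.0])
```

The authors' claim is that CDM's generative sampling produces such draws directly, without post-hoc reweighting; the added Section 3.2 assumptions are what would license that interpretation.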
Referee: [Experiments] Evaluation section: The reported 15-30% relative improvement in 1-Wasserstein distance is given without error bars, confidence intervals, or results across multiple random seeds; evaluation is restricted to a single simulator (PKPD), which does not test robustness across different confounding structures or data-generating processes.
Authors: We accept that the experimental reporting lacked sufficient statistical detail. The original results were obtained from single runs for space reasons. In the revision we have re-executed all experiments across five independent random seeds and now report means together with standard deviations for both 1-Wasserstein distance and RMSE. We have also added a limitations paragraph noting that the PKPD simulator, while standard in the literature, represents only one data-generating process, and we outline plans for broader robustness checks in future work. revision: partial
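The promised revision reports means and standard deviations across five seeds. For reference, aggregating a per-seed metric this way is a one-liner with the standard library (the numbers below are illustrative, not the paper's results):

```python
import statistics

def summarize(metric_by_seed):
    """Mean and sample standard deviation of one metric across seeds."""
    return statistics.mean(metric_by_seed), statistics.stdev(metric_by_seed)

# Five illustrative per-seed 1-Wasserstein distances (not the paper's numbers):
mean, sd = summarize([0.41, 0.44, 0.39, 0.43, 0.40])
```

Reporting the per-seed values themselves, alongside mean ± sd, would further support the revised claims.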
Referee: [Abstract/Introduction] Abstract and introduction: The claim that CDM 'captures intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments for confounding' is central but unsupported by any theorem, assumption list, or proof sketch showing that the learned conditional is interventional rather than observational.
Authors: We thank the referee for identifying the overstatement. The original wording was meant to emphasize the absence of explicit IPW or adversarial balancing steps, but it did not sufficiently qualify the claim. We have revised both the abstract and the introduction to read that CDM 'empirically captures intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments for confounding in the evaluated settings.' We now explicitly cross-reference the assumptions listed in the new Method paragraph and the empirical nature of the supporting evidence. revision: yes
Circularity Check
No circularity: new architecture evaluated on external simulator data
Full rationale
The paper introduces CDM as a novel residual denoising architecture with relational self-attention trained via standard diffusion ELBO on observational trajectories, then reports empirical performance (Wasserstein distance, RMSE) against baselines on an independent PKPD tumor-growth simulator. No derivation step equates a claimed counterfactual distribution to a fitted parameter by construction, renames a known result, or relies on a self-citation chain for a uniqueness theorem. The central claim of generating interventional distributions without explicit deconfounding is presented as an architectural hypothesis whose validity is assessed externally rather than tautologically.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] Begoli, E., Bhattacharya, T., and Kusnezov, D. (2019). The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence 1(1), 20--23.
- [4] Bica, I., Alaa, A. M., Jordon, J., and van der Schaar, M. (2019). Estimating counterfactual treatment outcomes over time through adversarially balanced representations. In International Conference on Learning Representations.
- [5] Chen, Y., Zhang, C., Ma, M., Liu, Y., Ding, R., Li, B., He, S., Rajmohan, S., Lin, Q., and Zhang, D. (2023). ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. Proceedings of the VLDB Endowment 17(3), 359--372.
- [6] Gal, Y., and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142.
- [7] Geng, C., Paganetti, H., and Grassberger, C. (2017). Prediction of treatment response for combined chemo- and radiation therapy for non-small cell lung cancer patients using a bio-mathematical model. Scientific Reports 7(1), 13542.
- [8] Ghali, H., Yoon, S., and Won, D. (2023). Uncertainty Quantification for Healthcare Data. In IISE Annual Conference, May 2023.
- [9] Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
- [10] Kang, D. Y., DeYoung, P. N., Tantiongloc, J., Coleman, T. P., and Owens, R. L. (2021). Statistical Uncertainty Quantification to Augment Clinical Decision Support: A First Implementation in Sleep Medicine. npj Digital Medicine 4(1), 142.
- [11] Karimi Mamaghan, A. M., Dittadi, A., Bauer, S., Johansson, K. H., and Quinzan, F. (2024). Diffusion-Based Causal Representation Learning. Entropy 26(7), 556.
- [12]
- [13] Kollovieh, M., Ansari, A. F., Bohlke-Schneider, M., Zschiegner, J., Wang, H., and Wang, Y. (2023). Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting. Advances in Neural Information Processing Systems 36, 28341--28364.
- [14]
- [15] Li, Y., Chen, W., Hu, X., Chen, B., Sun, B., and Zhou, M. (2023). Transformer-modulated diffusion models for probabilistic multivariate time series forecasting. In The Twelfth International Conference on Learning Representations.
- [16] Li, Q., Zhang, Z., Yao, L., Li, Z., Zhong, T., and Zhang, Y. (2024). Diffusion-based Decoupled Deterministic and Uncertain Framework for Probabilistic Multivariate Time Series Forecasting. In Proceedings (venue not stated), October 2024.
- [17] Lim, B. (2018). Forecasting treatment responses over time using recurrent marginal structural networks. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, Inc.
- [18] Logan, B. R., McCulloch, R. E., Green, P. J., and Chipman, H. A. (2019). Decision Making and Uncertainty Quantification for Individualized Treatments Using Bayesian Additive Regression Trees. Statistical Methods in Medical Research 28(6), 1851--1867.
- [19] Melnychuk, V., Frauen, D., and Feuerriegel, S. (2022). Causal Transformer for estimating counterfactual outcomes. In Proceedings of the 39th International Conference on Machine Learning, 15293--15329. PMLR.
- [20] Mortimer, K. M., Neugebauer, R., van der Laan, M., and Tager, I. B. (2005). An application of model-fitting procedures for marginal structural models. American Journal of Epidemiology 162(4), 382--388.
- [21]
- [22] Rasul, K., Seward, C., Schuster, I., and Vollgraf, R. (2021). Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. In Proceedings of the 38th International Conference on Machine Learning, 8857--8868.
- [23] Riley, R. D., Collins, G. S., Kirton, L., Snell, K. I. E., Ensor, J., Whittle, R., Dhiman, P., van Smeden, M., Liu, X., Alderman, J., Nirantharakumar, K., Manson-Whitton, J., Westwood, A. J., Cazier, J.-B., Moons, K. G. M., Martin, G. P., Sperrin, M., Denniston, A. K., Harrell, F. E., and Archer, L. (2025). Uncertainty of Risk Estimates from Cli...
- [24] Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100(469), 322--331.
- [25]
- [26]
- [27]
- [28]
- [29]
- [30] Tsaneva-Atanasova, K., Pederzanil, G., and Laviola, M. (2025). Decoding Uncertainty for Clinical Decision-Making. Philosophical Transactions of the Royal Society A 383(2292), 20240207.
- [31] Wang, H., Li, H., Zou, H., Chi, H., Lan, L., Huang, W., and Yang, W. (2024). Effective and Efficient Time-Varying Counterfactual Prediction with State-Space Models. In Proceedings (venue not specified), October 2024.
- [32] Wu, S., Zhou, W., Chen, M., and Zhu, S. (2024). Counterfactual Generative Models for Time-Varying Treatments. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3402--3413.
- [33] Xia, Y., Xu, C., Liang, Y., Wen, Q., Zimmermann, R., and Bian, J. (2025). Causal Time Series Generation via Diffusion Models. arXiv preprint arXiv:2509.20846.
- [34] Xiong, H., Wu, F., Deng, L., Su, M., Shahn, Z., and Lehman, L. H. (2024). G-Transformer: Counterfactual Outcome Prediction under Dynamic and Time-varying Treatment Regimes. Proceedings of Machine Learning Research 252, August 2024.
- [35] Zhou, S., Gu, Z., Xiong, Y., Luo, Y., Wang, Q., and Gao, X. (2024). REDI: Recurrent Diffusion Model for Probabilistic Time Series Forecasting. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3505--3514.