A Causal Foundation Model for Structure and Outcome Prediction

Ching-Hao Wang; Martino Mansoldo; Max Zhu; Stefan Groha

arxiv: 2606.26467 · v1 · pith:HAVVLESJnew · submitted 2026-06-25 · 💻 cs.LG


Pith reviewed 2026-06-26 05:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords causal foundation model · causal structure prediction · outcome prediction · Pearl's causal hierarchy · synthetic data training · generalization to real data · observational data


The pith

TabPFN-CFM predicts both causal structures and outcomes from observational data and answers queries across Pearl's three levels of causation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TabPFN-CFM as a single pretrained model that infers causal graphs and predicts outcomes while also handling association, intervention, and counterfactual queries. It is trained only on synthetic data yet transfers to real datasets and outperforms separate baselines for structure learning and outcome prediction. When partial graph information is supplied, the same model further improves its accuracy. This design aims to replace multiple specialized causal tools with one foundation model that works directly from observational inputs.

Core claim

TabPFN-CFM predicts both causal structure and outcomes from observational data, supports queries on all three levels of Pearl's Causal Hierarchy and uses known graph structure when available to improve predictions. It is trained on synthetic datasets, and generalises to real datasets, demonstrating improved performance over both structural and outcome prediction baselines.

What carries the argument

TabPFN-CFM, a single model that ingests observational data to output causal graphs, outcome predictions, and answers to queries at the association, intervention, and counterfactual levels.
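
To make the three query levels concrete, here is a minimal sketch on a toy structural equation model. The SEM, its coefficients, and every function name are illustrative assumptions; none of them come from the paper, and this is not how TabPFN-CFM computes its answers. The sketch only shows what each level asks for: a conditional contrast (association), a contrast under forced treatment (intervention), and a unit-level what-if obtained by recovering that unit's noise term (counterfactual).

```python
import random
import statistics

# Hypothetical toy SEM (not from the paper):
#   X := N_x
#   T := 1 if X + N_t > 0 else 0
#   Y := A_EFFECT * T + B_CONF * X + N_y
A_EFFECT = 2.0  # assumed causal effect of T on Y
B_CONF = 3.0    # assumed effect of the confounder X on Y


def sample_unit(rng, do_t=None):
    """Draw one unit; if do_t is given, T is forced to that value."""
    x = rng.gauss(0.0, 1.0)
    n_t = rng.gauss(0.0, 1.0)
    n_y = rng.gauss(0.0, 1.0)
    t = (1 if x + n_t > 0 else 0) if do_t is None else do_t
    y = A_EFFECT * t + B_CONF * x + n_y
    return x, t, y


def association_gap(rng, n):
    """Level 1: E[Y | T=1] - E[Y | T=0] from observational draws."""
    data = [sample_unit(rng) for _ in range(n)]
    y1 = [y for _, t, y in data if t == 1]
    y0 = [y for _, t, y in data if t == 0]
    return statistics.fmean(y1) - statistics.fmean(y0)


def interventional_gap(rng, n):
    """Level 2: E[Y | do(T=1)] - E[Y | do(T=0)] by forcing T."""
    y1 = [sample_unit(rng, do_t=1)[2] for _ in range(n)]
    y0 = [sample_unit(rng, do_t=0)[2] for _ in range(n)]
    return statistics.fmean(y1) - statistics.fmean(y0)


def counterfactual_y(x, t, y, t_new):
    """Level 3: what Y would have been for this unit under T = t_new."""
    n_y = y - A_EFFECT * t - B_CONF * x          # abduction: recover noise
    return A_EFFECT * t_new + B_CONF * x + n_y   # action and prediction
```

In this toy model the associational gap overstates the effect because X raises both T and Y, while the interventional gap recovers the coefficient on T. The counterfactual step is only this simple because the toy SEM is known and additive.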

If this is right

The model answers queries at the association, intervention, and counterfactual levels from the same observational input.
Supplying known parts of the causal graph improves both structure and outcome predictions.
It outperforms separate baselines trained for structure learning alone or outcome prediction alone.
Performance on real datasets remains competitive after training only on synthetic examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could apply one pretrained model to multiple causal tasks instead of fitting new models for each problem.
The approach may lower barriers to causal analysis in settings where labeled interventional data are scarce.
Combining the model with domain-specific fine-tuning could extend its use to new data distributions without full retraining.

Load-bearing premise

Training exclusively on synthetic datasets produces a model that generalizes to real datasets without substantial performance loss due to distribution shift.

What would settle it

Evaluating TabPFN-CFM on a broad collection of real-world datasets and observing that its accuracy falls below task-specific baselines or degrades sharply relative to its synthetic performance.

Figures

Figures reproduced from arXiv: 2606.26467 by Ching-Hao Wang, Martino Mansoldo, Max Zhu, Stefan Groha.

**Figure 1.** Model architecture diagram. This is the standard prediction loss for PFNs (Hollmann et al., 2023; Balazadeh et al., 2025; Robertson et al., 2025). The model is always given D_fit. To let the model learn to make predictions with and without the true causal graph, G_est is set to zero half the time. The structural prediction losses between predictions Â, R̂, Ĉ and true matrices are the eleme…
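
The Figure 1 caption says the graph input is set to zero half the time, so that one model learns to predict both with and without a known graph. A minimal sketch of that masking step follows, assuming the graph is passed as a square 0/1 adjacency matrix. The function name, the list-of-lists representation, and the `p_hide` parameter are hypothetical; the paper's actual encoding may differ.

```python
import random


def maybe_hide_graph(adjacency, rng, p_hide=0.5):
    """Return a copy of the adjacency matrix, or an all-zero matrix
    of the same shape with probability p_hide (assumed to be 0.5)."""
    if rng.random() < p_hide:
        return [[0 for _ in row] for row in adjacency]
    return [list(row) for row in adjacency]


# Example: a hypothetical 3-node chain graph A -> B -> C.
chain = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
rng = random.Random(0)
graph_input = maybe_hide_graph(chain, rng)
```

The input matrix is never modified in place, so the true graph stays available as the target for the structural losses.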

**Figure 2.** Observational (left), interventional (center), and counterfactual (right) distributions for the IV example. Exact solutions are in blue; model predictions are in orange when U is observed and in green when U is unobserved. The model is not given the graph structure.
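
Why the distinction between observed and unobserved U matters can be sketched with a textbook linear instrumental-variable setup. The SEM below is a hypothetical stand-in, not the paper's IV example: Z is the instrument, U confounds T and Y, and all coefficients are made up. With U hidden, the naive regression slope of Y on T is biased, while the ratio cov(Z, Y) / cov(Z, T) recovers the effect in this linear case.

```python
import random

TRUE_EFFECT = 1.5  # assumed causal effect of T on Y (illustrative)


def simulate(n, seed=0):
    """Draw n rows (z, t, y) from a hypothetical linear IV model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                   # instrument
        u = rng.gauss(0.0, 1.0)                   # hidden confounder
        t = z + u + rng.gauss(0.0, 1.0)           # treatment
        y = TRUE_EFFECT * t + 2.0 * u + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))                    # U is deliberately dropped
    return rows


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def naive_slope(rows):
    """Regression slope of Y on T, ignoring confounding by U."""
    _, t, y = zip(*rows)
    return cov(t, y) / cov(t, t)


def iv_slope(rows):
    """Instrumental-variable (Wald-type) estimate: cov(Z, Y) / cov(Z, T)."""
    z, t, y = zip(*rows)
    return cov(z, y) / cov(z, t)
```

The ratio estimator relies on linearity and on Z affecting Y only through T; neither is guaranteed in general, which is part of why the unobserved-U case is the harder one.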

**Figure 3.** Ablation results with 30k training steps for prediction loss and adjacency matrix loss. Changes are applied sequentially from left to right. Next, we compare our model with longer training runs with and without the changes, but with the base version scaled up to the same 23.2M parameter count for fairness. Models are trained for a much longer 150k steps.

**Figure 4.** Loss curves comparing our architecture and the baseline architecture for 150k training steps: prediction loss (left), adjacency matrix loss (right). … where the binarize function returns 1 if the input is in the top 50 percent of the overall distribution and 0 otherwise. We draw a single sample from this SEM: z = −0.187, u = 0.421, t = 1, y = −0.568. Now, we identify the exact distribution assuming everything is observed, for ob…

**Figure 5.** Interventional (left), observational (center), and counterfactual (right) distributions. Exact solutions are in blue; model predictions are in orange when U is observed and in green when U is unobserved. The model is given the graph structure. … adjacency matrix generally matches the true adjacency matrix, though there is some uncertainty in the children of T, likely due to V, W, Y all being correlated.

**Figure 6.** Interventional (left), observational (center), and counterfactual (right) distributions. Predictions are compared with and without the true graph structure G. The observational and interventional predictions are compared to a single sample drawn from the true distribution, while the counterfactual predictions are compared to the true counterfactual value.

**Figure 7.** Counterfactual distribution predictions for the nonlinear SEM with different D_fit sample sizes.

Original abstract

We introduce TabPFN-CFM, a causal foundation model that can handle multiple causal problems. TabPFN-CFM predicts both causal structure and outcomes from observational data, supports queries on all three levels of Pearl's Causal Hierarchy and uses known graph structure when available to improve predictions. TabPFN-CFM is trained on synthetic datasets, and generalises to real datasets, demonstrating improved performance over both structural and outcome prediction baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TabPFN-CFM claims a single model trained on synthetic data can handle causal structure, outcomes, and all three Pearl levels on real data, but the abstract gives no metrics or details to back the generalization.


The one thing to know is that this paper presents TabPFN-CFM as a foundation model that jointly predicts causal graphs and outcomes from observational data, supports queries across Pearl's three levels, and improves when the graph is supplied.

What is new is the specific combination of a TabPFN-style architecture with joint structure-plus-outcome training plus explicit handling of the full causal hierarchy. Earlier work has separate tools for discovery and prediction, so this integration is the advance.

The paper does a clean job stating how known graph structure can be fed in to boost predictions, which is a practical feature.

The soft spot is the synthetic-to-real transfer. The abstract says the model generalizes and beats baselines on real datasets, yet it reports no numbers, gives no description of the graph sampler or noise models, and offers no analysis of the domain gap. If the synthetic distribution does not cover the structures and noise patterns in the real test sets, the performance claims do not hold. That assumption is doing the heavy lifting, and the stress-test note correctly flags it.

No equations or training details appear in the abstract, so circularity cannot be checked, but nothing obvious suggests it.

This paper is for causal ML researchers who track foundation models and want one system instead of several specialized ones. A reader interested in whether a single trained model can replace separate structure and outcome pipelines would find it relevant if the experiments are solid.

It deserves a serious referee because the core idea is coherent and the potential consolidation of workflows is worth checking, even though the visible text leaves the key evidence out.

Recommendation: send to peer review once the full experimental section is available, with the expectation that referees will press hard on the synthetic data construction and the size of any observed shift.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces TabPFN-CFM, a causal foundation model trained exclusively on synthetic datasets. It claims to predict both causal structure and outcomes from observational data, support queries across all three levels of Pearl's Causal Hierarchy, incorporate known graph structure when available, and generalize to real datasets while outperforming structural and outcome prediction baselines.

Significance. If the generalization from synthetic training data to real tabular causal problems holds with the reported improvements, the work would offer a unified foundation-model approach to multiple causal tasks. This could reduce reliance on separate structure-learning and outcome-prediction pipelines and make causal queries more accessible, provided the synthetic data distribution adequately covers real-world causal structures and noise regimes.

major comments (2)

[Abstract] The central claim that TabPFN-CFM 'generalises to real datasets, demonstrating improved performance over both structural and outcome prediction baselines' is presented without any reported metrics, baselines, error bars, or analysis of distribution shift. This absence directly undermines verification of the generalization result that the paper positions as its primary practical contribution.
[Abstract] The manuscript provides no description of the synthetic data generator (graph sampling procedure, noise models, intervention mechanisms) or quantitative comparison of its induced distribution against the real evaluation sets. Without such evidence, the assumption that synthetic training produces a model whose support matches real causal problems remains untested and load-bearing for all transfer claims.
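
One simple form the requested quantitative comparison could take is a per-feature two-sample Kolmogorov-Smirnov statistic between synthetic and real columns. The sketch below is an illustration of that idea only: it is not something the manuscript or the referee specifies, the function name is hypothetical, and a marginal check like this cannot detect differences in dependence structure between features.

```python
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical
    CDFs of samples a and b (0 means identical, 1 means disjoint)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Hypothetical stand-ins for one synthetic and one real feature column.
rng = random.Random(0)
synthetic_col = [rng.gauss(0.0, 1.0) for _ in range(2000)]
real_col = [rng.gauss(1.0, 1.0) for _ in range(2000)]
shift = ks_statistic(synthetic_col, real_col)
```

A large statistic on many features would indicate that the real data sits outside the bulk of the synthetic training distribution, which is the situation the referee's second comment is concerned with.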

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying areas where the abstract could better support the paper's central claims. We address each point below and will revise accordingly.


Referee: [Abstract] The central claim that TabPFN-CFM 'generalises to real datasets, demonstrating improved performance over both structural and outcome prediction baselines' is presented without any reported metrics, baselines, error bars, or analysis of distribution shift. This absence directly undermines verification of the generalization result that the paper positions as its primary practical contribution.

Authors: We agree that the abstract should be self-contained on this point. The full manuscript reports quantitative results on real datasets (including metrics, baselines, error bars, and distribution-shift considerations) in the experimental section. We will revise the abstract to summarize the key numerical improvements and evaluation details. Revision: yes.

Referee: [Abstract] The manuscript provides no description of the synthetic data generator (graph sampling procedure, noise models, intervention mechanisms) or quantitative comparison of its induced distribution against the real evaluation sets. Without such evidence, the assumption that synthetic training produces a model whose support matches real causal problems remains untested and load-bearing for all transfer claims.

Authors: Section 3 of the manuscript already describes the synthetic data generator, including the graph sampling procedure, noise models, and intervention mechanisms. A quantitative distributional comparison to the real evaluation sets is not currently included; we will add this analysis (or a concise summary) in the revised version to strengthen the transfer argument. Revision: partial.

Circularity Check

0 steps flagged

No circularity detected; claims are empirical assertions without self-referential derivations


The provided abstract and context describe TabPFN-CFM as a model trained exclusively on synthetic data that generalizes to real datasets for causal structure and outcome prediction across Pearl's hierarchy. No equations, parameter-fitting procedures, self-citations, or derivation steps are visible that would reduce any prediction to a fitted input by construction or import uniqueness via author overlap. The generalization claim is presented as an empirical result to be evaluated on external real data, not a mathematical identity or self-definition. The derivation chain is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, methods, or experimental sections are present from which free parameters, axioms, or invented entities can be extracted.



Reference graph

Works this paper leans on

26 extracted references

[1] Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence.
[2] Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR).
[3] On Layer Normalization in the Transformer Architecture.
[4] ReLU^2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. 2024.
[5] Henry, Alex; Dachapally, Prudhvi Raj; Pawar, Shubham Shantaram; Chen, Yuxuan. Query-Key Normalization for Transformers. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
[6] Keller Jordan; Jeremy Bernstein; Brendan Rappazzo; @fernbear.bsky.social; Boza Vlado; You Jiacheng; Franz Cesista; Braden Koszarsky; @Grad62304977.
[7] Keller Jordan; Yuchen Jin; Vlado Boza; You Jiacheng; Franz Cesista; Laker Newhouse; Jeremy Bernstein.
[8] Muon is Scalable for LLM Training. 2025.
[9] DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models. 2024.
[10] Wightman, Linda F.
[11] Causality: models, reasoning, and inference.
[12] Markov Properties for Acyclic Directed Mixed Graphs. Scandinavian Journal of Statistics.
[13] On statistical and causal models associated with acyclic directed mixed graphs.
[14] Amortized Inference for Causal Structure Learning. Advances in Neural Information Processing Systems.
[15] Noah Hollmann and Samuel M. Tab. International Conference on Artificial Intelligence.
[16] Jake Robertson and Arik Reuter and Siyuan Guo and Noah Hollmann and Frank Hutter and Bernhard Sch. Do-. Advances in Neural Information Processing Systems.
[17] CausalPFN: Amortized Causal Effect Estimation via In-Context Learning. Advances in Neural Information Processing Systems.
[18] Künzel, Sören R.; Sekhon, Jasjeet S.; Bickel, Peter J.; Yu, Bin. Metalearners for estimating heterogeneous treatment effects using machine learning.
[19] EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation.
[20] Causal inference in the presence of latent variables and selection bias. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence.
[21] Optimal structure identification with greedy search. Journal of Machine Learning Research.
[22] A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research.
[23] Causation, prediction, and search. MIT Press.
[24] Imbens, Guido W.; Rubin, Donald B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
[25] Learning to Induce Causal Structure. International Conference on Learning Representations.
[26] Bounds on Treatment Effects From Studies With Imperfect Compliance. Journal of the American Statistical Association.
