Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Amartya Roy; Sonali Parbhoo

arxiv: 2605.27567 · v1 · pith:7UFB5JHOnew · submitted 2026-05-26 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-29 17:25 UTC · model grok-4.3

keywords causal discovery, large language models, kernel obstruction, interventional agents, bayesian optimization, causal graphs, supervised fine-tuning, preference optimization

The pith

Under standard training, large language models cannot distinguish causal graphs from observational data alone; doing so would require unbounded internal representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that supervised fine-tuning, direct preference optimization, and in-context learning on LLMs produce predictors unable to separate causal graphs that generate similar observational data. Any attempt to make such distinctions forces the model's internal representations to expand without bound, which directly contradicts the finite conditions required for these training approaches to function. The authors formalize the barrier as a kernel obstruction theorem that applies independently of specific models or datasets. They then introduce Agentic Causal Bayesian Optimization, in which a frozen language model answers targeted intervention queries while an external Bayesian procedure updates beliefs over graphs. This separation allows convergence in logarithmically many rounds without retraining the underlying model.

Core claim

Supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. This limitation is formalized as a kernel obstruction theorem establishing that the failure is intrinsic to the learning paradigm, not any particular model or dataset. Agentic Causal Bayesian Optimization lets a frozen language model serve as an interventional oracle while an external Bayesian loop concentrates beliefs over candidate graphs, enabling prova

What carries the argument

The kernel obstruction theorem, which proves that standard training paradigms cannot separate observationally equivalent causal graphs without unbounded representations, together with Agentic Causal Bayesian Optimization that routes interventional queries outside the obstructed space.

If this is right

A-CBO matches the performance of fine-tuned baselines on the Corr2Cause benchmark without any model training.
On the Extended Corr2Cause benchmark with graphs up to 24 variables, A-CBO outperforms both fine-tuning and preference optimization, with the gap increasing as graph size grows.
The external Bayesian loop requires only logarithmically many intervention queries to concentrate posterior mass on the correct graph.
Because the language model remains frozen, the method avoids the representation-growth requirement that blocks direct training approaches.
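
The division of labour described above (a frozen oracle answering intervention queries, an external loop updating beliefs over graphs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the three candidate graphs, the `oracle` function standing in for the frozen language model, the noise rate `eta`, and the greedy query rule in `pick_query` are all hypothetical choices made for this example.

```python
import itertools
import random

def descendants(graph, node):
    """Nodes reachable from `node` in a DAG stored as {parent: [children]}."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def do_changes(graph, source, target):
    """Under `graph`, do(source) affects target iff target is a descendant of source."""
    return target in descendants(graph, source)

def oracle(true_graph, query, noise, rng):
    """Hypothetical stand-in for the frozen LLM: flips the true answer with prob. `noise`."""
    truth = do_changes(true_graph, *query)
    flip = rng.random() < noise
    return (not truth) if flip else truth

def bayes_update(beliefs, graphs, query, answer, eta):
    """Weight each graph by the likelihood of the observed answer, then normalize."""
    weights = [
        b * ((1 - eta) if do_changes(g, *query) == answer else eta)
        for b, g in zip(beliefs, graphs)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def pick_query(beliefs, graphs, queries):
    """Assumed greedy rule: ask the query that splits current belief most evenly."""
    def imbalance(q):
        mass_yes = sum(b for b, g in zip(beliefs, graphs) if do_changes(g, *q))
        return abs(mass_yes - 0.5)
    return min(queries, key=imbalance)

def run_loop(graphs, nodes, true_index, eta=0.1, oracle_noise=None, rounds=20, seed=0):
    """Run the external belief-update loop; the oracle itself is never modified."""
    noise = eta if oracle_noise is None else oracle_noise
    rng = random.Random(seed)
    queries = list(itertools.permutations(nodes, 2))  # (intervened, observed) pairs
    beliefs = [1.0 / len(graphs)] * len(graphs)
    for _ in range(rounds):
        query = pick_query(beliefs, graphs, queries)
        answer = oracle(graphs[true_index], query, noise, rng)
        beliefs = bayes_update(beliefs, graphs, query, answer, eta)
    return beliefs

NODES = ["X1", "X2", "X3"]
CHAIN = {"X1": ["X2"], "X2": ["X3"]}           # X1 -> X2 -> X3
FORK = {"X2": ["X1", "X3"]}                    # X1 <- X2 -> X3
REVERSED_CHAIN = {"X3": ["X2"], "X2": ["X1"]}  # X3 -> X2 -> X1
CANDIDATES = [CHAIN, FORK, REVERSED_CHAIN]

print(run_loop(CANDIDATES, NODES, true_index=0))
```

In this toy setting, each answer that separates the true graph from a rival multiplies the odds in favour of the true graph by (1 − eta) / eta when the answer is correct, which is one way to picture why few rounds can suffice when the oracle noise is low. The belief vector lives outside the model, so nothing here requires retraining.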

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of oracle queries from internal representations could be tested on other tasks that require distinguishing latent structures from surface correlations.
Hybrid agentic systems might systematically outperform pure end-to-end training on any problem where observational data alone leaves multiple explanations equally likely.
If the obstruction generalizes, it would predict similar plateaus for LLMs on other scientific reasoning benchmarks that rely on causal rather than correlational patterns.

Load-bearing premise

The kernel obstruction holds for the learning paradigm itself regardless of any particular model architecture or training dataset.

What would settle it

Training an LLM via supervised fine-tuning or preference optimization that correctly distinguishes causal graphs with identical observational distributions while keeping its internal representation dimension bounded would falsify the obstruction claim.

Figures

Figures reproduced from arXiv: 2605.27567 by Amartya Roy, Sonali Parbhoo.

**Figure 1.** Overview of A-CBO. (a) Two near-miss hypotheses (G+: chain, G−: fork) are observationally equivalent; kernel similarity ≥ 1−δ, so bounded-norm SFT/ICL cannot separate them (Thm. 1). (b) A single intervention do(X1 = v) discriminates: under G+ the perturbation propagates to X3; under G− the severed edge leaves X3 unaffected. (c) A-CBO performs Bayesian updates in Δ^{n−1} (outside H), concentrating belief on th…

**Figure 2.** A-CBO vs. baselines on six evaluation dimensions. A-CBO dominates on all six axes; SFT/DPO collapse at scale. The loop architecture, not raw model capability, drives the advantage. The advantage of A-CBO over fine-tuning grows monotonically with graph complexity. Fine-tuned models degrade catastrophically as d increases: SFT on 1.3M extended samples achieves only 52.2% average accuracy, collapsing to 35.…

**Figure 3.** Convergence over intervention rounds. Posterior concentration grows monotonically; most models converge within 8–12 rounds, well before the budget T = 20. High-tier models (blue) converge fastest, consistent with lower effective oracle noise η (Theorem 2).

**Figure 4.** Accuracy distribution by model tier on Extended Corr2Cause. Tiers are well-separated with increasing variance at lower tiers (IQR: 8.2 vs. 12.1 vs. 18.6).
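
The contrast in Figure 1(b) can be reproduced with a small simulation. The sketch below assumes linear Gaussian mechanisms with an edge weight of 0.8, which is an illustrative choice and not taken from the paper; the coefficients are not tuned to make the two observational distributions identical, only to show that both graphs make X1 and X3 covary while only the chain lets an intervention on X1 reach X3.

```python
import random
from statistics import fmean

WEIGHT = 0.8  # assumed edge coefficient, chosen only for illustration

def sample_chain(rng, do_x1=None):
    """G+: X1 -> X2 -> X3. Passing do_x1 replaces X1's mechanism with a constant."""
    x1 = rng.gauss(0, 1) if do_x1 is None else do_x1
    x2 = WEIGHT * x1 + rng.gauss(0, 1)
    x3 = WEIGHT * x2 + rng.gauss(0, 1)
    return x1, x2, x3

def sample_fork(rng, do_x1=None):
    """G-: X1 <- X2 -> X3. do(X1 = v) severs the X2 -> X1 edge, so X3 is untouched."""
    x2 = rng.gauss(0, 1)
    x1 = (WEIGHT * x2 + rng.gauss(0, 1)) if do_x1 is None else do_x1
    x3 = WEIGHT * x2 + rng.gauss(0, 1)
    return x1, x2, x3

def cov_x1_x3(sampler, n=20000, seed=0):
    """Observational covariance between X1 and X3 (no intervention)."""
    rng = random.Random(seed)
    draws = [sampler(rng) for _ in range(n)]
    m1 = fmean(d[0] for d in draws)
    m3 = fmean(d[2] for d in draws)
    return fmean((d[0] - m1) * (d[2] - m3) for d in draws)

def mean_x3_under_do(sampler, value, n=20000, seed=0):
    """Average of X3 after the intervention do(X1 = value)."""
    rng = random.Random(seed)
    return fmean(sampler(rng, do_x1=value)[2] for _ in range(n))

for name, sampler in [("chain", sample_chain), ("fork", sample_fork)]:
    print(name,
          "cov(X1, X3) =", round(cov_x1_x3(sampler), 2),
          "| E[X3 | do(X1=2)] =", round(mean_x3_under_do(sampler, 2.0), 2))
```

Under these assumptions the observational covariance is close to WEIGHT² in both graphs, while the interventional mean of X3 is close to WEIGHT² · v for the chain and close to zero for the fork.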

Abstract

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The kernel obstruction theorem probably doesn't apply to real LLMs since they adapt features during training, but A-CBO is a workable hybrid that keeps the model frozen.

The main thing to know is that the kernel obstruction theorem looks like it rests on treating LLMs as fixed-kernel predictors, which doesn't match how transformers actually update their weights. That undercuts the claim that the failure is fundamental to the learning paradigm itself.

The paper formalizes why supervised fine-tuning, DPO, and in-context learning cannot separate causal graphs that produce the same observational distributions, and it introduces A-CBO as a workaround. In A-CBO the LLM stays frozen and only answers targeted intervention queries while an external Bayesian loop updates beliefs over graphs. This separation is new and lets the method converge without changing the model. The Extended Corr2Cause benchmark with 24 variables and 18K samples is also a useful addition for testing scale.

The practical results on Corr2Cause, where A-CBO matches fine-tuned baselines with no training, are straightforward to appreciate. The hybrid setup is concrete and avoids the representation-growth issue by moving the decision outside the LLM.

The soft spot is the theorem. The abstract asserts a proof but gives no equations or sketch, and the stress-test concern holds: if the argument only bounds fixed-feature or RKHS predictors, it does not automatically extend to data-dependent feature maps learned by gradient descent on transformer parameters. Without seeing the derivation it is unclear whether the authors handled adaptive models. The claim that the limitation is intrinsic rather than model-specific therefore sits on thin evidence so far.
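
For orientation, the kind of bound that a fixed-kernel argument would rest on can be written out. The derivation below is the textbook RKHS inequality, offered as a sketch of what such an obstruction might look like and not as the paper's theorem. The assumptions are labeled in the comments: a predictor f in an RKHS with reproducing kernel k, a normalized kernel, a norm budget B, and the similarity condition k(x, x') ≥ 1 − δ quoted in the Figure 1 caption.

```latex
% Assumptions (not taken from the paper): f lies in an RKHS \mathcal{H} with
% reproducing kernel k, the kernel is normalized so k(x,x) = 1 for all x,
% and \|f\|_{\mathcal{H}} \le B.
\begin{align*}
|f(x) - f(x')|
  &= \left| \langle f,\; k(\cdot, x) - k(\cdot, x') \rangle_{\mathcal{H}} \right|
     && \text{(reproducing property)} \\
  &\le \|f\|_{\mathcal{H}} \, \| k(\cdot, x) - k(\cdot, x') \|_{\mathcal{H}}
     && \text{(Cauchy--Schwarz)} \\
  &= \|f\|_{\mathcal{H}} \sqrt{k(x,x) - 2k(x,x') + k(x',x')} \\
  &= \|f\|_{\mathcal{H}} \sqrt{2 - 2k(x,x')} \\
  &\le B \sqrt{2\delta}
     && \text{if } k(x,x') \ge 1 - \delta .
\end{align*}
% Separating x and x' by a margin \gamma therefore needs
% B \ge \gamma / \sqrt{2\delta}, which diverges as \delta \to 0.
```

The inequality holds for any fixed kernel; whether it carries over to feature maps that change during gradient descent is exactly the open question raised in this note.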

This paper is for researchers working on LLM-assisted causal discovery or hybrid systems. A reader focused on practical combinations of language models and Bayesian methods would find the A-CBO construction and the new benchmark worth their time. The work engages the literature honestly even if the central theorem needs tightening.

I would send it for peer review. The algorithmic contribution and benchmark stand on their own and the topic matters, though referees will need to see a full proof that addresses parametric adaptation.

Referee Report

3 major / 2 minor

Summary. The paper claims that supervised fine-tuning, DPO, and in-context learning cannot distinguish causal graphs that produce similar observational distributions, formalized via a kernel obstruction theorem that requires internal representations to grow unboundedly; this limitation is intrinsic to the learning paradigm. It introduces Agentic Causal Bayesian Optimization (A-CBO), which keeps the LLM frozen as an interventional oracle and uses an external Bayesian loop to converge on graphs in logarithmically many rounds. Empirical results are reported on Corr2Cause (matching fine-tuned baselines) and a new Extended Corr2Cause benchmark (24 variables, 18K samples) where A-CBO outperforms fine-tuning and preference optimization, with the gap increasing with complexity.

Significance. If the kernel obstruction theorem holds and the reduction from LLM training to kernel predictors is valid, the result would be significant: it supplies a theoretical account for observed plateaus in causal discovery benchmarks and demonstrates a practical way to escape the limitation without retraining. The agentic separation of the decision process from the model's internal representations is a clean architectural contribution. Reproducible code or machine-checked elements are not mentioned.

major comments (3)

The kernel obstruction theorem is the load-bearing claim, yet the provided manuscript text asserts a mathematical proof without any derivation, equations, proof sketch, or formal statement of the kernel or the reduction from SFT/DPO/ICL predictors to RKHS methods. This prevents verification of whether the argument applies to adaptive parametric models whose feature maps are updated by gradient descent rather than fixed kernels.
The central claim that the obstruction is independent of any particular model or dataset rests on the theorem; without the explicit reduction showing why gradient updates on transformer weights cannot induce separating representations within bounded dimension, the conclusion that all three paradigms are affected does not follow.
Empirical claims on Extended Corr2Cause (outperformance growing with complexity) are stated without methods, baselines, error bars, or statistical tests, so it is impossible to assess whether the results support the claim that A-CBO escapes the obstruction while fine-tuning does not.

minor comments (2)

The abstract and introduction should explicitly state the section containing the theorem statement and proof.
Notation for the interventional oracle and the Bayesian update loop in A-CBO should be introduced with a small example before the convergence argument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the kernel obstruction theorem and empirical presentation. We address each major point below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: The kernel obstruction theorem is the load-bearing claim, yet the provided manuscript text asserts a mathematical proof without any derivation, equations, proof sketch, or formal statement of the kernel or the reduction from SFT/DPO/ICL predictors to RKHS methods. This prevents verification of whether the argument applies to adaptive parametric models whose feature maps are updated by gradient descent rather than fixed kernels.

Authors: We agree that the current version presents the theorem at a high level. In revision we will insert the formal statement of the kernel obstruction theorem, the explicit RKHS reduction from SFT/DPO/ICL predictors, the definition of the relevant kernel, and a complete proof sketch. The expanded argument will directly address adaptive parametric models by showing that any finite-dimensional feature map inducible by gradient descent on transformer weights remains subject to the same separation obstruction. revision: yes
Referee: The central claim that the obstruction is independent of any particular model or dataset rests on the theorem; without the explicit reduction showing why gradient updates on transformer weights cannot induce separating representations within bounded dimension, the conclusion that all three paradigms are affected does not follow.

Authors: The independence claim is derived from the reduction itself: any predictor obtained by SFT, DPO, or ICL is shown to be equivalent to a kernel predictor whose feature dimension is bounded by the training regime, precluding the unbounded growth required to separate observationally equivalent graphs. The revised proof will spell out why gradient updates on transformer parameters cannot escape this bound without leaving the paradigm. We will also add a short corollary clarifying applicability to all three methods. revision: yes
Referee: Empirical claims on Extended Corr2Cause (outperformance growing with complexity) are stated without methods, baselines, error bars, or statistical tests, so it is impossible to assess whether the results support the claim that A-CBO escapes the obstruction while fine-tuning does not.

Authors: We will expand the experimental section to include the precise construction of Extended Corr2Cause, the full list of baselines with implementation details, error bars computed over multiple random seeds, and the results of statistical significance tests. These elements are present in the supplementary material; the main text will be updated to foreground them so that the performance gap with increasing graph size can be properly evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity: kernel obstruction theorem and A-CBO are presented as independent formal results

The paper introduces a new kernel obstruction theorem to formalize why SFT/DPO/ICL cannot distinguish causal graphs with identical observational distributions, requiring unbounded representations. It then proposes A-CBO as an external Bayesian procedure using the frozen LLM only as an interventional oracle. No load-bearing step reduces by construction to fitted parameters renamed as predictions, self-citations, or ansatzes imported from prior author work. The central claim is framed as a mathematical limitation intrinsic to the paradigm, with convergence of A-CBO shown separately; the derivation chain remains self-contained and does not rely on re-labeling known empirical patterns or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the asserted kernel obstruction theorem, which is introduced in the abstract without derivation or external grounding; no free parameters or invented entities beyond the method itself are specified.

axioms (1)

Kernel obstruction theorem applies to supervised fine-tuning, DPO, and in-context learning for causal discovery tasks (ad hoc to paper)
This is the load-bearing formalization stated in the abstract as the reason for LLM failure.

invented entities (1)

A-CBO (Agentic Causal Bayesian Optimization): no independent evidence
purpose: Use frozen LLM as interventional oracle inside external Bayesian loop for causal graph identification
New method proposed to escape the obstruction by operating outside the observational predictor space.

Reference graph

Works this paper leans on

26 extracted references · 18 canonical work pages · 1 internal anchor

[1] A. Abdulaal, A. Hadjivasiliou, N. Montana-Brown, T. He, A. Ijishakin, I. Drobnjak, D. C. Castro, and D. C. Alexander. Causal Modelling Agents: Causal Graph Discovery through Synergising Metadata- and Data-driven Reasoning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=pAoqRlTBtY
[2] R. Agrawal, C. Squires, K. Yang, K. Shanmugam, and C. Uhler. ABCD-Strategy: Budgeted experimental design for targeted causal structure discovery. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 3400–3409, 2019.
[3] H. Chi, H. Li, W. Yang, F. Liu, L. Lan, X. Ren, T. Liu, and B. Han. Unveiling causal reasoning in large language models: Reality or mirage? Advances in Neural Information Processing Systems, 37:96640–96670, 2024.
[4] B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. Limitations of lazy training of two-layers neural network. Advances in Neural Information Processing Systems, 32, 2019.
[5] T. Gupta, W. Gong, C. Ma, N. Pawlowski, A. Hilmkil, M. Scetbon, M. Rigter, A. Famoti, A. J. Llorens, J. Gao, and others. The essential role of causality in foundation world models for embodied AI. arXiv preprint arXiv:2402.06665, 2024.
[6] C. Han, Z. Wang, H. Zhao, and H. Ji. Explaining emergent in-context learning as kernel regression. arXiv preprint arXiv:2305.12766, 2023.
[7] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
[8] Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea, M. Diab, and B. Schölkopf. Can large language models infer causation from correlation? arXiv preprint arXiv:2306.05836, 2023.
[9] Z. Jin, Y. Chen, F. Leber, L. Gresele, O. Kamath, B. Xin, Z. Shi, B. Schölkopf, L. Bottou, and R. Mihalcea. Cladder: A benchmark to assess causal reasoning capabilities of language models. In Advances in Neural Information Processing Systems, 2024.
[10] K. Kadziolka and S. Salehkaleybar. Causal Reasoning in Pieces: Modular In-Context Learning for Causal Discovery. arXiv preprint arXiv:2507.23488, 2025.
[11] D. Karkada. The Lazy (NTK) and Rich (µP) Regimes: A Gentle Tutorial. arXiv preprint arXiv:2404.19719, 2024.
[12] H. D. Le, X. Xia, and Z. Chen. Multi-agent causal discovery using large language models. arXiv preprint arXiv:2407.15073, 2024.
[13] J. Li, Y. Chen, C. Liu, Q. Cai, T. Liu, B. Han, K. Zhang, and H. Xiong. Can Large Language Models Help Experimental Design for Causal Discovery? arXiv preprint arXiv:2503.01139, 2025.
[14] H. Li, L. Duan, and Y. Liang. Provable In-Context Learning of Nonlinear Regression with Transformers. arXiv preprint arXiv:2507.20443, 2025.
[15] Z. Li and F. Russo. Leveraging Large Language Models for Causal Discovery: a Constraint-based, Argumentation-driven Approach. arXiv preprint arXiv:2602.16481, 2026.
[16] N. Scherrer, O. Bilaniuk, Y. Annadani, A. Goyal, P. Schwab, B. Schölkopf, M. C. Mozer, Y. Bengio, S. Bauer, and N. R. Ke. Learning neural causal models with active interventions. arXiv preprint arXiv:2109.02429, 2021.
[17] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward Causal Representation Learning. Proceedings of the IEEE, 109(5):612–634, 2021. doi:10.1109/JPROC.2021.3058954
[18] E. Sgouritsa, V. Aglietti, Y. W. Teh, A. Doucet, A. Gretton, and S. Chiappa. Prompting strategies for enabling large language models to infer causation from correlation. arXiv preprint arXiv:2412.13952, 2024.
[19] I. Sheth, B. Fatemi, and M. Fritz. Causalgraph2llm: Evaluating llms for causal queries. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2076–2098, 2025.
[20] H. Sun, A. Jadbabaie, and N. Azizan. On the role of transformer feed-forward layers in nonlinear in-context learning. arXiv preprint arXiv:2501.18187, 2025.
[21] W. Sun, J. P. Nogueira, and A. Silva. Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks. arXiv preprint arXiv:2505.18034, 2025.
[22] A. Wu, K. Kuang, M. Zhu, Y. Wang, Y. Zheng, K. Han, B. Li, G. Chen, F. Wu, and K. Zhang. Causality for large language models. arXiv preprint arXiv:2410.15319, 2024.
[23] X. Wu, K. Yu, J. Wu, and K. C. Tan. LLM cannot discover causality, and should be restricted to non-decisional support in causal discovery. arXiv preprint arXiv:2506.00844, 2025.
[24] K. Yamin, S. Gupta, G. R. Ghosal, Z. C. Lipton, and B. Wilder. Failure modes of llms for causal reasoning on narratives. arXiv preprint arXiv:2410.23884, 2024.
[25] M. Zečević, M. Willig, D. S. Dhami, and K. Kersting. Causal Parrots: Large Language Models May Talk Causality But Are Not Causal. Transactions on Machine Learning Research, 2023.
[26] C. Zhang, S. Bauer, P. Bennett, J. Gao, W. Gong, A. Hilmkil, J. Jennings, C. Ma, T. Minka, N. Pawlowski, and others. Understanding causality with large language models: Feasibility and opportunities. arXiv preprint arXiv:2304.05524, 2023.
