Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

An Luo; Charles Doss; Fangqiao Tian; Ganghua Wang; Jie Ding; Jin Du; Li Chen; Xiaotong Shen; Xun Xian

arXiv: 2505.13770 · v3 · submitted 2025-05-19 · cs.AI · cs.CL · cs.LG · stat.ME · stat.ML

Pith reviewed 2026-05-22 13:36 UTC · model grok-4.3

classification: cs.AI · cs.CL · cs.LG · stat.ME · stat.ML

keywords: causal inference · large language models · benchmark · statistical pitfalls · Simpson's paradox · selection bias · code-assisted evaluation

The pith

Large language models show significant limitations in statistical causal inference even with code assistance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CausalPitfalls, a benchmark that presents LLMs with structured challenges involving common statistical pitfalls such as Simpson's paradox and selection bias. It evaluates models through direct prompting to test intrinsic reasoning and code-assisted prompting to allow explicit statistical analysis, with scoring rubrics that enable quantitative measurement of both accuracy and reliability. A sympathetic reader cares because trustworthy causal inference supports decisions in medicine, economics, and public policy, where overlooking these pitfalls can lead to incorrect conclusions. The work also compares automated scores against human expert judgments to support the benchmark's validity. Results indicate that current LLMs struggle substantially across these tasks.
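To make the first named pitfall concrete, here is a minimal sketch of Simpson's paradox. The counts are illustrative only (patterned on the widely cited kidney-stone textbook example) and are not taken from the paper or its benchmark; the arm labels `A` and `B` and the strata `small` and `large` are likewise assumptions made for the example.

```python
# Illustrative counts only, NOT data from the paper:
# (successes, trials) for each treatment arm within each stratum.
counts = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def stratum_rates(table):
    """Success rate of each arm inside each stratum."""
    return {
        stratum: {arm: s / n for arm, (s, n) in arms.items()}
        for stratum, arms in table.items()
    }


def pooled_rates(table):
    """Success rate of each arm after collapsing over strata."""
    totals = {}
    for arms in table.values():
        for arm, (s, n) in arms.items():
            s0, n0 = totals.get(arm, (0, 0))
            totals[arm] = (s0 + s, n0 + n)
    return {arm: s / n for arm, (s, n) in totals.items()}


by_stratum = stratum_rates(counts)
pooled = pooled_rates(counts)

for stratum, rates in by_stratum.items():
    print(stratum, {arm: round(r, 3) for arm, r in rates.items()})
print("pooled", {arm: round(r, 3) for arm, r in pooled.items()})
```

Arm A has the higher success rate inside every stratum, yet arm B looks better once the strata are pooled, because the arms are unevenly distributed across strata. An analysis that reports only the pooled comparison falls into the pitfall.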

Core claim

The paper establishes that current large language models exhibit significant limitations when performing statistical causal inference, as shown by their performance on the CausalPitfalls benchmark across direct prompting and code-assisted protocols. The benchmark supplies challenges at multiple difficulty levels, each with grading rubrics that measure causal reasoning capability and response reliability, and the authors validate the automated judge by alignment with human experts.

What carries the argument

The CausalPitfalls benchmark, a set of structured challenges across difficulty levels paired with grading rubrics that quantify causal reasoning and response reliability under direct and code-assisted prompting.
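The code-assisted protocol, as described in the paper excerpts further down this page (code generation, code execution, result interpretation, output collection), can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' harness: the prompt wording, the function names, and `stub_model` are all hypothetical, and the in-process `exec` is for illustration only, since a real evaluation would need a sandbox.

```python
import contextlib
import io
import re
import traceback

# Built programmatically so this example can itself be quoted inside a code fence.
FENCE = "`" * 3


def build_codegen_prompt(question, dataset_path, columns, sample_rows):
    """Step 1: question, dataset location, column names, up to 10 sample rows."""
    sample = "\n".join(",".join(str(v) for v in row) for row in sample_rows[:10])
    return (
        f"Question: {question}\n"
        f"Dataset location: {dataset_path}\n"
        f"Columns: {', '.join(columns)}\n"
        f"Sample rows:\n{sample}\n"
        "Write Python code that analyses the dataset and prints numerical results."
    )


def extract_code(reply):
    """Return the first fenced block in the reply, or the whole reply if none."""
    pattern = re.escape(FENCE) + r"(?:python)?\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    return match.group(1) if match else reply


def run_code(code):
    """Step 2: execute the code and capture stdout (NOT sandboxed)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__generated__"})
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def code_assisted_answer(model, question, dataset_path, columns, sample_rows):
    prompt = build_codegen_prompt(question, dataset_path, columns, sample_rows)
    code = extract_code(model(prompt))
    ok, output = run_code(code)
    interpretation = None
    if ok:
        # Step 3: show the model its own code and results, ask for interpretation.
        interpretation = model(
            f"Code:\n{code}\nResults:\n{output}\nInterpret these results in context."
        )
    # Step 4: collect everything needed for later grading.
    return {"code": code, "ok": ok, "output": output, "interpretation": interpretation}


def stub_model(prompt):
    """Hypothetical stand-in for an LLM call, used only to exercise the pipeline."""
    if prompt.startswith("Question:"):
        return FENCE + "python\nprint(2 + 2)\n" + FENCE
    return "The generated analysis printed 4."


result = code_assisted_answer(
    stub_model,
    question="Evaluate whether treatment causally affects outcome.",
    dataset_path="/path/to/dataset.csv",
    columns=["treatment", "outcome"],
    sample_rows=[(0, 1), (1, 0)],
)
print(result["ok"], result["output"].strip(), result["interpretation"])
```

Keeping the `ok` flag alongside the captured output lets a later stage count execution failures separately from wrong interpretations.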

If this is right

LLMs may generate unreliable causal conclusions in high-stakes domains unless statistical pitfalls are explicitly addressed.
Code-assisted prompting improves performance on some tasks but does not eliminate the identified limitations.
Automated judging aligned with human experts can serve as a scalable metric for tracking progress in causal reasoning.
The benchmark supplies concrete quantitative targets for developing more trustworthy causal reasoning systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training regimes that explicitly include counterexamples of common statistical pitfalls could reduce the observed errors.
Hybrid systems that combine LLMs with dedicated causal inference libraries might outperform either component alone on these tasks.
Extending the benchmark to time-series or high-dimensional observational data would test whether the limitations generalize beyond the current scenarios.

Load-bearing premise

The chosen causal pitfalls and associated grading rubrics accurately capture the statistical challenges that matter in real-world causal inference without introducing artifacts that favor or penalize LLMs unfairly.

What would settle it

A controlled test in which models that score highly on CausalPitfalls nevertheless produce systematically wrong causal conclusions when applied to an independent collection of real medical or economic datasets with known ground-truth causal structures.

Figures

Figures reproduced from arXiv: 2505.13770 by An Luo, Charles Doss, Fangqiao Tian, Ganghua Wang, Jie Ding, Jin Du, Li Chen, Xiaotong Shen, Xun Xian.

**Figure 1.** Overall message: Our results reveal a clear reliability gap in causal inference when LLMs rely only on direct prompting, with all models struggling most on mediation and external validity questions. Introducing code-assisted prompting leads to substantial gains across every task and brings all models closer together in performance. This shows that executable analysis is essential for large language models …

**Figure 2.** High-level overview of the CausalPitfalls benchmark. (a) An illustrative real-world pitfall …

**Figure 3.** Causal DAG illustrating how beverage consumption, health awareness, and lifestyle affect …

**Figure 4.** Code execution failure rates (%) in the code-assisted prompting protocol across causal inference challenges and question difficulty. Failure rate is defined as the percentage of code-generation attempts that either raise execution errors or produce invalid analytical outputs, computed only for the code-assisted prompting protocol. (a) Average failure rate for each of the six causal-inference pitfall categories …

**Figure 5.** Success rates across disciplines for men and women.

**Figure 6.** QQ-plot of standardized log odds ratios across disciplines.
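The failure-rate definition in the Figure 4 caption (attempts that raise execution errors or produce invalid analytical outputs, as a percentage of all code-generation attempts) can be written down directly. The sketch below assumes a hypothetical record layout with the keys `category`, `raised_error`, and `invalid_output`; the records themselves are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def failure_rates(attempts):
    """Percentage of failed code-generation attempts per category.

    An attempt counts as failed if it raised an execution error OR
    produced an invalid analytical output.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for attempt in attempts:
        category = attempt["category"]
        totals[category] += 1
        if attempt["raised_error"] or attempt["invalid_output"]:
            failures[category] += 1
    return {c: 100.0 * failures[c] / totals[c] for c in totals}


# Hypothetical attempt log, for illustration only.
attempts = [
    {"category": "mediation", "raised_error": True, "invalid_output": False},
    {"category": "mediation", "raised_error": False, "invalid_output": False},
    {"category": "mediation", "raised_error": False, "invalid_output": True},
    {"category": "mediation", "raised_error": False, "invalid_output": False},
    {"category": "external validity", "raised_error": False, "invalid_output": False},
    {"category": "external validity", "raised_error": True, "invalid_output": False},
    {"category": "external validity", "raised_error": False, "invalid_output": False},
    {"category": "external validity", "raised_error": False, "invalid_output": False},
]

rates = failure_rates(attempts)
print(rates)
```

Grouping by difficulty level instead of category would only require swapping the grouping key.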

Original abstract

Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces CausalPitfalls to test LLMs on statistical causal traps like Simpson's paradox, with direct and code-assisted protocols, but the results' weight depends on whether the rubrics and tasks isolate real reasoning gaps.

The main thing to know is that this paper builds a benchmark showing that current LLMs still miss key statistical pitfalls in causal inference, even when they can generate code. The authors cover graded difficulty levels on issues like confounding and selection bias, then score both plain-text answers and executable analysis code. Human expert checks on the automated judge add some credibility to the scoring process.

That dual-protocol setup is a practical step beyond the simple causal-relation tests used in other benchmarks. It puts numbers on where models fall short and on where code assistance does and does not help. The results point to real limits on using these models in medicine or policy without extra safeguards.

On the softer side, the task construction and exact rubrics get less space than the high-level design. If the grading favors particular phrasing or code patterns that models rarely produce even when the logic is sound, some of the reported gaps could trace to that rather than to statistical misunderstanding. The synthetic examples are also cleaner than the noisy, incomplete data common in real applications, which might limit how far the findings generalize.

This is mainly for groups working on LLM reliability for causal questions or building better evaluation suites. It gives them a concrete starting point for tracking progress. The work is coherent and grounded in a clear empirical setup, so it deserves a serious referee, though reviewers will likely press on the rubric details and task realism.

Referee Report

2 major / 1 minor

Summary. The paper introduces CausalPitfalls, a benchmark consisting of structured challenges at multiple difficulty levels that test LLMs on statistical causal inference pitfalls such as Simpson's paradox and selection bias. It evaluates models under direct prompting (for intrinsic reasoning) and code-assisted prompting (for explicit statistical analysis), validates the automated judge via human-expert comparison, and concludes that current LLMs exhibit significant limitations in performing trustworthy statistical causal inference.
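Selection bias, the second pitfall named in the summary, can be illustrated with a small simulation of Berkson's paradox. This is a sketch under stated assumptions and not one of the benchmark's tasks: the two traits are independent standard normals, and the selection rule `x + y > 1.0` is an arbitrary choice made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)

# Two traits that are independent in the full population.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

# Selection on a common effect: only units with a high combined score are observed.
selected = [(x, y) for x, y in population if x + y > 1.0]

r_all = pearson(*zip(*population))
r_selected = pearson(*zip(*selected))

print(f"full population: r = {r_all:+.3f}")
print(f"selected sample: r = {r_selected:+.3f}")
```

In the full population the correlation is close to zero, while in the selected sample it is clearly negative, even though neither trait influences the other. A model that analyses only the selected rows and reports a causal trade-off has fallen for the pitfall.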

Significance. If the benchmark tasks and rubrics are shown to be free of artifacts that unfairly penalize LLMs or diverge from real-world distributions, the work would provide useful quantitative metrics and guidance for advancing causal reasoning in LLMs for high-stakes applications in medicine, economics, and policy. The dual-protocol design and human validation are constructive elements that could strengthen future evaluations.

major comments (2)

[§3] Benchmark Design and Task Construction: The manuscript provides insufficient detail on how the synthetic tasks embed the target pitfalls, the precise grading rubrics, and the data-generation process. Without these specifics it is difficult to confirm that low scores reflect genuine deficits in causal statistics rather than mismatches with expected terminology, code style, or prompt patterns.

[§4] Human-Expert Validation: The comparison between the automated judge and human experts is described at a high level but lacks quantitative details such as the number of experts, inter-rater agreement statistics, or how disagreements were resolved. These metrics are load-bearing for trusting the reported LLM performance scores that support the central claim.

minor comments (1)

[Abstract] The abstract would benefit from including the number of models tested and a brief summary of the main quantitative performance gaps to give readers an immediate sense of effect size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and rigor.

Referee: [§3] Benchmark Design and Task Construction: The manuscript provides insufficient detail on how the synthetic tasks embed the target pitfalls, the precise grading rubrics, and the data-generation process. Without these specifics it is difficult to confirm that low scores reflect genuine deficits in causal statistics rather than mismatches with expected terminology, code style, or prompt patterns.

Authors: We agree that expanded details on task construction are needed for full transparency. In the revised manuscript we will add to §3: (i) explicit mathematical descriptions of the data-generating processes for each pitfall (e.g., the joint distributions that produce Simpson’s paradox or selection bias), (ii) the complete grading rubrics with point allocations and annotated examples of high-, medium-, and low-scoring responses, and (iii) the exact parameters and pseudocode used to synthesize the datasets. These additions will demonstrate that the tasks target statistical causal reasoning rather than surface-level prompt or terminology matching.
Revision: yes

Referee: [§4] Human-Expert Validation: The comparison between the automated judge and human experts is described at a high level but lacks quantitative details such as the number of experts, inter-rater agreement statistics, or how disagreements were resolved. These metrics are load-bearing for trusting the reported LLM performance scores that support the central claim.

Authors: We acknowledge that the human-validation section requires quantitative support. In the revision we will report in §4: the number of experts (three PhD-level statisticians), inter-rater agreement (Fleiss’ kappa and pairwise percentage agreement), and the disagreement-resolution procedure (independent scoring followed by a moderated consensus discussion). These statistics and the resolution protocol will be presented in the main text and accompanied by a supplementary table.
Revision: yes
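For readers unfamiliar with the two agreement statistics named in the simulated rebuttal, the following minimal sketch computes Fleiss' kappa and pairwise percentage agreement for a small rating table. The ratings and the `pass`/`fail` labels are hypothetical; nothing here reproduces the paper's validation data or its scoring scale.

```python
from collections import Counter
from itertools import combinations


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
        p_i = (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
        per_item_agreement.append(p_i)
    p_bar = sum(per_item_agreement) / n_items
    # Chance agreement from the overall category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1.0 - p_e)


def pairwise_agreement(ratings):
    """Fraction of rater pairs, pooled over items, that gave the same label."""
    agree = total = 0
    for item in ratings:
        for a, b in combinations(item, 2):
            agree += a == b
            total += 1
    return agree / total


# Hypothetical labels from three raters on four responses.
ratings = [
    ["pass", "pass", "pass"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
    ["fail", "fail", "pass"],
]

kappa = fleiss_kappa(ratings)
agreement = pairwise_agreement(ratings)
print(f"Fleiss' kappa = {kappa:.3f}, pairwise agreement = {agreement:.3f}")
```

On this toy table raw agreement is about 0.667 while kappa is about 0.333, which illustrates why chance-corrected statistics are usually reported alongside percentage agreement.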

Circularity Check

0 steps flagged

Empirical benchmark evaluation contains no derivation or self-referential reduction

full rationale

This paper introduces and applies the CausalPitfalls benchmark to measure LLM performance on statistical causal inference tasks. It contains no equations, fitted parameters, or first-principles derivations whose outputs are defined in terms of the same inputs or prior self-citations. The reported results follow directly from running the described protocols on the constructed tasks and comparing to human expert scores, with no reduction of the central claim to a tautology or load-bearing self-citation chain. The work is therefore self-contained as an empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the domain assumption that the selected pitfalls represent core statistical challenges in causal inference and that rubric-based scoring by an LLM judge aligns with human expert judgment.

axioms (1)

domain assumption: LLM responses to causal inference tasks can be reliably scored by a separate judge model whose outputs correlate with human experts.
Abstract states validation against human experts but provides no quantitative agreement metrics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose CausalPitfalls, a comprehensive benchmark … six major categories … 15 distinct challenges … 75 evaluation questions … two protocols: direct prompting and code-assisted prompting.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Simpson’s paradox … Berkson’s paradox … mediator–outcome confounding … domain shift and transportability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Request Python code for analysis
Code Generation: Provide the model with the causal question, dataset location, column names, and a small data sample (10 rows). Request Python code for analysis. 2. Code Execution: Extract and run the generated Python code to obtain numerical results

[2] HealthPlus · 2026
Result Interpretation: Show the model the code it generated and its numerical results, and ask it to interpret these results in context. 4. Output Collection: Record the model’s interpretation and analysis for evaluation. Example (Code Generation Prompt): "Question: Evaluate whether {TREATMENT} causally affects {OUTCOME}. Dataset location: /path/to/dataset.csv C...

[3] HealthPlus
Brand “HealthPlus”, truly beneficial effect

[4] HealthPlus
Brand “HealthPlus”, truly harmful effect

[5] UltraSugar
Brand “UltraSugar”, truly harmful effect

[6] UltraSugar · 2026
Brand “UltraSugar”, truly beneficial effect. Each dataset included 200 samples with variables: Consumption, Outcome (health impact), Health Awareness, and Lifestyle. LLM Performance and Observations. LLMs were asked to assess if each beverage (“HealthPlus” or “UltraSugar”) was beneficial or harmful based purely on the given data. Table 3 in the main paper ...

[7] First, we divided responses into the six causal inference categories, selecting an equal number (25) from each category

[8] Within each category, we evenly sampled across the five difficulty levels (very easy, easy, medium, hard, very hard), selecting exactly 5 responses per difficulty level

[9] · 2026
For each category-difficulty combination, we randomly selected responses from the evaluated LLMs, ensuring proportional representation of all models’ outputs. (Published as a conference paper at ICLR 2026.) This sampling approach ensured the validation set accurately represented the complexity, diversity, and balanced coverage of our entire evaluation dat...

[10] Request Python code for analysis
Code Generation: Same as Protocol 2: provide the causal question, dataset location, column names, and a data sample. Request Python code for analysis. 2. Code Execution: Extract and run the generated code

[11] Execute the corrected code
Debugging (if execution fails): Present the error message to the model and request corrected code. Execute the corrected code

[12] Symbolic only · 2026
Result Interpretation: Show the model its code and the numerical results, and ask it to interpret the results in context. 5. Output Collection: Record the model’s interpretation for evaluation. Table 9 compares causal reliability across all three protocols. Debugging primarily benefits models that frequently fail on the first code attempt. For example, Mistr...
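The validation-set sampling described in snippets [7] to [9] above (six categories, five difficulty levels, 5 responses per category-difficulty cell, hence 25 per category) can be sketched as a stratified draw. This is an illustration under assumptions: the record layout, the placeholder category names, and the seed are hypothetical, and the sketch does not implement the proportional representation of models mentioned in snippet [9].

```python
import random
from collections import Counter, defaultdict

DIFFICULTIES = ["very easy", "easy", "medium", "hard", "very hard"]


def stratified_sample(responses, per_cell=5, seed=0):
    """Draw `per_cell` responses at random from every (category, difficulty) cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for response in responses:
        cells[(response["category"], response["difficulty"])].append(response)
    sample = []
    for key in sorted(cells):
        pool = cells[key]
        if len(pool) < per_cell:
            raise ValueError(f"cell {key} has only {len(pool)} responses")
        sample.extend(rng.sample(pool, per_cell))
    return sample


# Hypothetical pool: placeholder category names, 8 responses per cell.
categories = [f"category_{i}" for i in range(1, 7)]
pool = [
    {"category": c, "difficulty": d, "response_id": f"{c}-{d}-{k}"}
    for c in categories
    for d in DIFFICULTIES
    for k in range(8)
]

validation_set = stratified_sample(pool)
per_category = Counter(r["category"] for r in validation_set)
print(len(validation_set), dict(per_category))
```

With six categories and five difficulty levels, 5 draws per cell give 25 responses per category and 150 in total.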
