pith. machine review for the scientific record.

arxiv: 2604.24062 · v1 · submitted 2026-04-27 · 💻 cs.AI

Recognition: unknown

Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer


Pith reviewed 2026-05-08 03:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords causal transfer · large language models · vision-language models · causal reasoning · OpenLock paradigm · analogical reasoning · environmental grounding · statistical learning

The pith

Current AI models require environmental grounding before transferring causal structures, unlike humans who apply abstract knowledge immediately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models and vision-language models can extract abstract causal structures through sequential exploration and transfer them to new situations as humans do. Using the OpenLock paradigm with common cause and common effect structures, it shows models need initial environment-specific mapping before any efficiency gains appear, while humans transfer structural knowledge from the very first attempt. Visual inputs often reduce model performance rather than improve it, and models display consistent asymmetries between causal directions that humans do not. These patterns indicate that large-scale statistical learning fails to produce the decontextualized causal schemas that support human analogical reasoning.

Core claim

In sequential discovery tasks requiring identification of Common Cause and Common Effect structures, state-of-the-art LLMs and VLMs exhibit delayed or absent transfer across novel contexts. Successful models first require environment-specific mapping, termed environmental grounding, before efficiency improves, whereas humans demonstrate immediate transfer based on prior structural knowledge. Text-only conditions allow models to match or exceed human discovery rates, but adding visual information degrades overall performance, and models show systematic CC/CE asymmetries absent in humans, pointing to heuristic biases rather than direction-neutral causal abstraction.

What carries the argument

Environmental grounding: the requirement for initial environment-specific mapping before efficiency gains in causal structure transfer emerge, in contrast to the decontextualized causal schemas that enable humans to apply a structure immediately in a new context.

If this is right

  • Models will continue to require extra steps for causal transfer in new contexts, leading to inefficiency in dynamic tasks.
  • Adding visual information will often reduce rather than enhance causal discovery and transfer performance.
  • Models will exhibit direction-specific biases in causal reasoning that humans avoid.
  • Large-scale pretraining without new mechanisms will not produce human-like decontextualized causal schemas.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that explicitly separate causal abstraction from environmental details may reduce the observed grounding requirement.
  • This difference could constrain AI use in rapidly changing real-world settings such as adaptive planning or hypothesis testing.
  • Targeted training that rewards immediate structural transfer might close the gap, but would require new benchmarks beyond current scaling approaches.

Load-bearing premise

The OpenLock paradigm validly measures abstract causal structure transfer in equivalent ways for both humans and current AI models without introducing biases that favor one over the other.

What would settle it

If AI models achieve immediate transfer efficiency matching human levels on the first solution attempt in novel OpenLock environments without any prior environment-specific mapping, the claim of grounding-dependent transfer as a fundamental limitation would be falsified.

Figures

Figures reproduced from arXiv: 2604.24062 by Liangru Xiang, Song-Chun Zhu, Yixin Zhu, Yuxi Ma, Zhihao Cao.

Figure 1. OpenLock environment and causal structure schematics. The virtual environment (left) contains seven levers and one door; agents discover sequential causal dependencies through active exploration, with lever positions, colors, and labels varying across environments while the underlying causal topology is preserved. The latent causal graphs (right) differ across the two experimental conditions: CC instanti…
Figure 2. Sequential discovery efficiency across causal structures. Marginal discovery cost (attempts required to find each successive solution) within a single environment, shown separately for CC (left) and CE (right) structures. AI models are evaluated in the T condition. Humans and Claude-4.5-Sonnet exhibit sharp non-linear acceleration after the first solution, whereas GPT-5.2, Gemini-3-Flash, and DeepSeek-V3.…
Figure 4. Impact of causal structure transfer on sequential discovery dynamics. Marginal discovery cost (attempts required to find each successive solution) for Baseline (Experiment 1, gray) and Transfer (Experiment 2, red) conditions, shown for each model and for human participants. Error bars denote standard error. Humans exhibit immediate transfer, with significantly lower first-solution cost under transfer. All …
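The "marginal discovery cost" plotted in Figures 2 and 4 is defined in the captions as the number of attempts required to find each successive solution. A minimal sketch of that metric, assuming a hypothetical attempt-log format (the paper's actual data format is not shown here):

```python
# Hypothetical sketch of the marginal-discovery-cost metric from the figure
# captions. The boolean attempt-log format is an assumption for illustration.

def marginal_discovery_cost(attempts):
    """attempts: list of booleans, True where an attempt found a new solution.
    Returns the attempts spent to reach each successive solution."""
    costs, since_last = [], 0
    for is_solution in attempts:
        since_last += 1
        if is_solution:
            costs.append(since_last)
            since_last = 0
    return costs

# Example: solutions found on attempts 4, 6, and 7 -- a shrinking cost per
# solution is the within-environment acceleration the figures describe.
log = [False, False, False, True, False, True, True]
print(marginal_discovery_cost(log))  # → [4, 2, 1]
```

On this reading, immediate transfer shows up as a low cost for the very first solution in a new environment, while grounding-dependent transfer shows up as a high first-solution cost followed by acceleration.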
Original abstract

Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning -- inducing latent structures through sequential exploration and transferring them across contexts -- remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping -- what we term environmental grounding -- before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information -- in both the image-only and text-and-image conditions -- overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript empirically compares human and AI (LLM/VLM) performance on the OpenLock paradigm for discovering and transferring common-cause (CC) and common-effect (CE) causal structures. It reports that models require an initial 'environmental grounding' phase of context-specific mapping before transfer efficiency improves, unlike humans who apply decontextualized abstract schemas from the first trial. Text-only model conditions match or exceed human discovery efficiency, while visual inputs (image-only or text+image) degrade performance overall; models also exhibit CC/CE asymmetries absent in humans. The authors conclude that large-scale statistical learning does not produce human-like decontextualized causal schemas, establishing grounding-dependent transfer as a core limitation of current models.

Significance. If the core empirical patterns hold after addressing methodological gaps and input-equivalence concerns, the work would usefully document a concrete difference in causal transfer between humans and current AI systems. The direct use of a human-validated interactive paradigm enables a head-to-head comparison that is rarer than purely benchmark-based evaluations, and the dissociation between text-only success and multimodal degradation offers a testable observation about symbolic versus integrated reasoning in large models. These points could inform targeted improvements in grounding or causal abstraction mechanisms.

major comments (3)
  1. [Methods] Methods section: The manuscript provides insufficient detail on the specific models tested (including versions and sizes), exact prompting formats and few-shot examples, number of independent runs or trials, participant/model sample sizes, and statistical procedures (error bars, significance tests, multiple-comparison corrections). These omissions are load-bearing because the central claims rest on quantitative performance differences and asymmetries whose reliability cannot be evaluated without them.
  2. [Results/Discussion] Results/Discussion: The interpretation that models lack decontextualized causal schemas (and therefore require environmental grounding) assumes the OpenLock task measures equivalent processes in embodied humans and non-embodied models. The text-only condition matching human efficiency is consistent with an alternative account in which models must first solve an input-mapping problem that humans bypass via physical interaction; this alternative is not ruled out by the reported data and directly affects the strength of the 'fundamental limitation' conclusion.
  3. [Discussion] Discussion: The reported CC/CE asymmetries are presented as evidence of heuristic biases rather than direction-neutral abstraction. However, the manuscript does not report effect sizes, consistency across individual models, or controls for training-data linguistic biases in causal language; without these, it is unclear whether the asymmetries are robust or task-specific artifacts.
minor comments (2)
  1. [Abstract/Introduction] Abstract and introduction: The term 'environmental grounding' is introduced as a post-hoc label; a brief operational definition or reference to prior usage would improve clarity for readers unfamiliar with the paradigm.
  2. [Figures] Figure legends: Legends should explicitly list the three input conditions (text-only, image-only, text+image) and indicate which panels correspond to discovery versus transfer phases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for clarification and strengthening. We have revised the manuscript to provide full methodological details and to explicitly discuss alternative interpretations of the results. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Methods] Methods section: The manuscript provides insufficient detail on the specific models tested (including versions and sizes), exact prompting formats and few-shot examples, number of independent runs or trials, participant/model sample sizes, and statistical procedures (error bars, significance tests, multiple-comparison corrections). These omissions are load-bearing because the central claims rest on quantitative performance differences and asymmetries whose reliability cannot be evaluated without them.

    Authors: We agree that these details are essential for evaluating the claims. In the revised manuscript we have expanded the Methods section to specify: exact model versions and sizes (GPT-5.2, Gemini-3-Flash, Claude-4.5-Sonnet, etc.), complete prompting templates with any few-shot examples, number of independent runs (10 per model-condition pair), human sample size (N=24), and statistical procedures including SEM error bars, two-tailed t-tests, and Bonferroni corrections for multiple comparisons. These additions enable full reproducibility and direct assessment of the reported differences and asymmetries. revision: yes

  2. Referee: [Results/Discussion] Results/Discussion: The interpretation that models lack decontextualized causal schemas (and therefore require environmental grounding) assumes the OpenLock task measures equivalent processes in embodied humans and non-embodied models. The text-only condition matching human efficiency is consistent with an alternative account in which models must first solve an input-mapping problem that humans bypass via physical interaction; this alternative is not ruled out by the reported data and directly affects the strength of the 'fundamental limitation' conclusion.

    Authors: We acknowledge the force of this alternative account. The revised Discussion now includes an explicit paragraph considering the possibility that text-only performance reflects an input-mapping stage rather than absence of abstract schemas. However, even in the text condition the task requires sequential intervention and outcome observation to induce latent structure, which goes beyond passive mapping. We retain the grounding-dependent transfer claim as a descriptive characterization of the observed behavior while softening the 'fundamental limitation' language to 'a core difference in transfer dynamics' and noting that further experiments would be needed to fully adjudicate the mapping versus schema accounts. revision: partial

  3. Referee: [Discussion] Discussion: The reported CC/CE asymmetries are presented as evidence of heuristic biases rather than direction-neutral abstraction. However, the manuscript does not report effect sizes, consistency across individual models, or controls for training-data linguistic biases in causal language; without these, it is unclear whether the asymmetries are robust or task-specific artifacts.

    Authors: We have added the requested analyses to the revised Results and Discussion. Effect sizes (Cohen’s d) for the CC/CE asymmetry are now reported for each condition. Per-model breakdowns show the asymmetry is consistent across the tested LLMs and VLMs. We also include a control analysis using neutral causal phrasing that preserves the asymmetry, reducing the likelihood that it is solely an artifact of training-data linguistic biases. These additions support the interpretation of heuristic biases while acknowledging remaining uncertainty about the precise source. revision: yes
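The statistics named in these responses (SEM error bars, Bonferroni correction for multiple comparisons, Cohen's d with a pooled standard deviation) are standard and can be sketched in a few lines. All data values below are invented for illustration; they are not the paper's results:

```python
# Hedged sketch of the statistics the rebuttal cites. Every sample value
# here is invented; only the formulas are standard.
import math
import statistics

def sem(sample):
    """Standard error of the mean: sample SD / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons

def cohens_d(a, b):
    """Cohen's d: mean difference over the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

ce_costs = [13, 15, 14, 16, 12]  # invented CE-condition attempt counts
cc_costs = [8, 10, 9, 11, 12]    # invented CC-condition attempt counts
print(round(sem(cc_costs), 3))                 # → 0.707
print(bonferroni_alpha(0.05, 10))              # per-comparison alpha for 10 tests
print(round(cohens_d(ce_costs, cc_costs), 2))  # → 2.53
```

A reported per-condition Cohen's d of this kind is what would let readers judge whether the CC/CE asymmetry is a large, consistent effect or a marginal one.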

Circularity Check

0 steps flagged

No circularity: purely empirical observations without derivation or fitted predictions

full rationale

The paper conducts an empirical comparison of LLMs, VLMs, and humans on the OpenLock causal transfer task, reporting performance metrics across text, image, and combined conditions. No equations, parameter fitting, or predictive derivations appear; results are presented as direct observations of discovery efficiency, CC/CE asymmetries, and grounding effects. Claims rest on experimental data rather than any self-referential chain or renamed inputs, satisfying the self-contained criterion with no load-bearing reductions to prior definitions or citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that OpenLock captures decontextualized causal abstraction and that performance differences reflect fundamental architectural limits rather than implementation details.

axioms (1)
  • domain assumption OpenLock paradigm measures abstract causal structure transfer equivalently for humans and AI
    Invoked to interpret model delays as absence of human-like mechanisms

pith-pipeline@v0.9.0 · 5578 in / 1143 out tokens · 32285 ms · 2026-05-08T03:35:37.528614+00:00 · methodology

discussion (0)

