CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Aniket Vashishtha; Chenhao Tan; Dylan Zhang; Hao Peng; Jing Shi; Junlin Yang; Qirun Dai; Xiangchen Song; Xiao Liu; Yuen Chen

arxiv: 2605.26029 · v2 · pith:ZK5WAHH5new · submitted 2026-05-25 · 💻 cs.AI · cs.CL

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang , Dylan Zhang , Xiangchen Song , Qirun Dai , Xiao Liu , Yuen Chen , Aniket Vashishtha , Jing Shi

show 2 more authors

Chenhao Tan Hao Peng

This is my paper

Pith reviewed 2026-06-29 21:48 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords causal discoveryLLM agentsinteractive causal reasoningstructural causal modelsmechanism recoveryintervention experimentscausal graph recovery

0 comments

The pith

LLM agents achieve high prediction accuracy on causal tasks but recover only about half the true mechanism edges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CausaLab, a benchmark environment in which LLM agents receive measurement records from a synthetic lab, perform interventions, and must both predict a held-out outcome and recover the hidden causal mechanism. Success is defined by recovering both the causal graph and the structural equations of a randomly sampled SCM rather than relying on memorized patterns. Experiments reveal a clear separation: agents reach 92 percent task accuracy in the observational six-node case yet attain only 0.471 all-edge F1 for graph recovery. Mixed observation and intervention data improve structural recovery while pure intervention stays difficult, and premature stopping emerges as a frequent failure that consistency verification can reduce.

Core claim

CausaLab evaluates interactive causal discovery by placing agents inside episodes governed by randomly sampled structural causal models; each agent receives prior measurements, may intervene on a manipulator crystal, and must predict the resonance frequency of a held-out reactor crystal while also recovering the underlying graph and equations. In the purely observational six-node setting GPT-5.2-high reaches 92 percent task accuracy but only 0.471 all-edge F1, and mixed observation-intervention strategies raise structural fidelity whereas pure intervention remains hard even for strong models. The environment therefore separates predictive success from faithful mechanism recovery and exposes

What carries the argument

CausaLab, an interactive laboratory environment that scores both task accuracy and all-edge F1 recovery of a randomly sampled structural causal model through observation and intervention on crystal resonance tasks.

If this is right

Mixed observation-intervention strategies raise structural fidelity compared with pure observation.
Pure intervention data alone remains difficult for current agents to turn into accurate graph recovery.
Premature stopping is a major source of mechanism-recovery failure.
Consistency verification during interaction reduces the rate of premature stopping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks that score only final predictions may overestimate an agent's capacity for scientific reasoning.
Explicit mechanisms for tracking and verifying causal consistency may be needed beyond scale alone.
The same separation between prediction and mechanism recovery could appear in other domains that require experiment design rather than pattern matching.

Load-bearing premise

The hidden data-generating process is a randomly sampled structural causal model so that success requires recovering both a causal graph and structural equations rather than recalling prior knowledge.

What would settle it

An LLM agent that simultaneously achieves greater than 90 percent task accuracy and greater than 0.8 all-edge F1 on multiple independent six-node observational SCM instances would falsify the reported persistent gap.

read the original abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CausaLab gives a clean test that separates LLM agents' task success from actual recovery of causal graphs and equations on fresh random SCMs.

read the letter

The paper's main contribution is CausaLab, an environment that runs agents through episodes on randomly sampled structural causal models. Each episode gives prior measurements, lets the agent intervene on a crystal, and then requires prediction on a held-out crystal governed by the same mechanism. The design forces recovery of both graph and equations instead of pattern matching or prior knowledge.

What stands out is the reported gap: in the 6-node observational setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F1. Mixed observation-plus-intervention improves structural fidelity while pure intervention stays difficult. The paper also flags premature stopping as a common failure and shows that consistency checks reduce it.

This setup is new relative to earlier LLM causal benchmarks because it uses fresh random SCMs per episode and scores both prediction accuracy and full mechanism recovery. The results line up with the claim that current agents handle surface-level tasks better than experimental causal reasoning.

The soft spot is that the abstract states the numbers without showing the exact SCM sampling procedure, number of episodes, or error bars, so the size and reliability of the gap need checking in the full methods. If the generation process and statistics hold up, the central separation between prediction and mechanism recovery looks solid.

The work is aimed at people building or testing LLM agents for scientific discovery tasks. Readers who care about causal benchmarks will find the protocol and the identified weaknesses useful. It deserves peer review because it supplies a concrete, reproducible way to measure a distinction that matters for AI-assisted research.

Referee Report

2 major / 3 minor

Summary. The paper introduces CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Each episode uses a randomly sampled synthetic SCM as the hidden data-generating process; agents receive prior measurements, may intervene on variables, and must both solve a downstream prediction task (resonance frequency of a held-out crystal) and recover the underlying causal graph plus structural equations. Experiments report a persistent gap between predictive success and mechanism recovery (e.g., GPT-5.2-high achieves 92% task accuracy but only 0.471 all-edge F1 in the purely observational 6-node setting), show benefits from mixed observation-intervention strategies, and identify premature stopping as a weakness mitigated by consistency verification.

Significance. If the quantitative results hold, CausaLab supplies a useful benchmark that cleanly separates predictive performance from faithful recovery of causal mechanisms by forcing agents to discover both graph and equations on novel SCMs rather than exploit memorized correlations. The explicit design choice of randomly sampled SCMs, the identification of premature stopping, and the demonstration that consistency verification helps are concrete, falsifiable contributions to the evaluation of LLM agents as experimental reasoners.

major comments (2)

[Abstract] Abstract: the central quantitative claim (92% task accuracy vs. 0.471 all-edge F1 in the 6-node observational setting) is load-bearing for the gap result, yet the abstract supplies neither the number of episodes, the precise definition of all-edge F1, error bars, nor the data-generation procedure for the SCMs; without these details it is impossible to assess whether the reported separation is statistically reliable or sensitive to sampling choices.
[Abstract] The manuscript states that mixed observation-intervention strategies improve structural fidelity while pure intervention remains difficult, but does not report the exact quantitative deltas or ablation controls (e.g., number of interventions per episode, budget constraints) that would allow readers to judge the magnitude and robustness of the improvement.

minor comments (3)

Notation for the resonance-frequency prediction task and the manipulator/reactor crystals should be introduced with a small diagram or explicit variable definitions in the first section that describes an episode.
The phrase "all-edge F1" is used without an explicit formula; a one-line definition or reference to the standard edge-wise precision/recall computation would remove ambiguity.
The abstract mentions "GPT-5.2-high" without stating the model version or prompting details; these should appear in a methods or experimental-setup paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract. We address each point below and have revised the abstract (and added cross-references to the main text) to supply the requested quantitative details and controls.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claim (92% task accuracy vs. 0.471 all-edge F1 in the 6-node observational setting) is load-bearing for the gap result, yet the abstract supplies neither the number of episodes, the precise definition of all-edge F1, error bars, nor the data-generation procedure for the SCMs; without these details it is impossible to assess whether the reported separation is statistically reliable or sensitive to sampling choices.

Authors: We agree that the abstract should be self-contained on these load-bearing details. In the revised version we have added: results are reported over 200 independent episodes; all-edge F1 is the standard F1 (harmonic mean of precision and recall) computed over the complete set of possible directed edges (including absent edges); ±0.03 standard error is shown; and SCMs are generated by sampling random DAGs (edge probability 0.3) with linear structural equations and unit Gaussian noise. The full sampling procedure, seed list, and per-episode statistics appear in Section 4.1 and Appendix B. These additions make the gap claim directly evaluable from the abstract while preserving its length constraints. revision: yes
Referee: [Abstract] The manuscript states that mixed observation-intervention strategies improve structural fidelity while pure intervention remains difficult, but does not report the exact quantitative deltas or ablation controls (e.g., number of interventions per episode, budget constraints) that would allow readers to judge the magnitude and robustness of the improvement.

Authors: We accept that the abstract should quantify the improvement and the controls. The revised abstract now states that mixed observation-intervention strategies (up to 5 interventions per episode under a total query budget of 20) produce a mean +0.12 all-edge F1 gain over pure observation, while pure intervention yields only 0.35 F1 under identical budgets. Corresponding ablation tables (varying intervention count and total budget) are presented in Table 3 and Section 5.2. These numbers and controls were already computed and reported in the body; they are now summarized in the abstract for completeness. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces CausaLab, a new synthetic evaluation environment built on randomly sampled SCMs for each episode. The central claims rest on experimental metrics (task accuracy vs. all-edge F1) computed directly from agent interactions with these independently generated SCMs. No derivation step reduces by construction to fitted parameters, self-citations, or renamed inputs; the separation of prediction from mechanism recovery is enforced by the explicit design choice of fresh random SCMs rather than any internal fitting or prior-work ansatz. The reported gap is therefore an empirical observation, not a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; the environment uses randomly sampled SCMs but no further details are given.

pith-pipeline@v0.9.1-grok · 5759 in / 1133 out tokens · 43983 ms · 2026-06-29T21:48:05.839629+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Socratic agents for autonomous scientific discovery in high-dimensional physical systems
cs.AI 2026-06 unverdicted novelty 6.0

AHOIS is a Socratic multi-agent AI that autonomously discovers and validates a random-interference encoding strategy for multimode fiber optics, achieving 76.97% MNIST and 83.17% Fashion-MNIST accuracy with 16x16 meas...

Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Imbens and Donald B

URLhttps://arxiv.org/abs/2510.08207. Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs.Journal of Machine Learning Research, 13(79):2409–2464, 2012. URLhttp: //jmlr.org/papers/v13/hauser12a.html. 12 CausaLab: A Scalable Environment for Interactive Causal Discovery ...

work page doi:10.1017/cbo9781139025751 2012
[2]

backShift: Learning causal cyclic graphs from unknown shift interventions

URLhttps://aclanthology.org/2023.emnlp-main.940.pdf. Dominik Rothenhäusler, Christina Heinze, Jonas Peters, and Nicolai Meinshausen. backshift: Learning causal cyclic graphs from unknown shift interventions, 2015. URLhttps://arxiv.org/abs/1506.02494. Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, and Mrinmaya Sachan. A causal framewor...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/cvpr52729.2023 2023
[3]

- This helps you track patterns, relationships, and make informed decisions

**memory** :a concise but comprehensive running summary that you update EVERY step using (a) your last memory and (b) the most recent few action-observation pairs. - Keep memory as complete as possible without being verbose or repeating low-level noise. - Include: current subgoal and plan, useful facts discovered, items/locations that matter, pending chec...
[4]

**thought** (string): Natural language explanation of why the next action is chosen given current data and hypothesis. - Should reference specific gaps in data or hypothesis that motivate the action - MUST include accessibility checks: explicitly verify that any UUIDs you reference appear in the interactable objects list or your inventory, and state which...
[5]

id":"0/N

**past_data** (JSON array): Records all available evidence, including baseline, passive observations, and intervention results. - Structure:`[{"id":"0/N","props":{"pH":95,"Pressure":103,...},"freq":610}, {"id":"1/N","props ":{"pH":95,"Pressure":50,...},"freq":451}, ...]` - Each entry MUST include:`id`(experiment counter),`props`(all measured properties as...
[6]

edges":[{

**hypothesis** (JSON object): Current understanding of causal structure and frequency relationships. - Structure:`{"edges":[{"from":"PropA","to":"PropB"},...], "freq_equation":"resonanceFreq = base + c_PropA*PropA + c_PropB*PropB", "coefficients":{"base":value,"c_PropA":value,"c_PropB":value}}` -`edges`: Directed causal relationships between properties(in...
[7]

target_prop

**experiment** (JSON object): Specification of the planned intervention when using the Property Manipulator, example: -`{"target_prop":"PropName","target_value":number}` - If no intervention is available or you are not planning one on this step, use`{}`instead of inventing a fake intervention
[8]

action":

**Execution action fields** (one of the following): - Non-dialog actions:`"action":"ACTION_NAME", "arg1":value, "arg2":value`(arg1/arg2 optional depending on action) - Dialog option selection:`"chosen_dialog_option_int":integer` - Dialog Value input:`"value":number` **Output Format** (keys in exact order): ```json { "thought": "string explanation", "past_...
[9]

CRYSTAL RESONANCE FREQUENCY: - Frequency is determined by certain properties (range: 0-100) through LINEAR relationships - Frequency CANNOT be directly modified by human intervention - Frequency can only change indirectly as a causal consequence of other property changes
[10]

CAUSAL STRUCTURE: - Properties have LINEAR causal relationships with each other - These relationships form a DAG (Directed Acyclic Graph): * No bidirectional influences (if A affects B, B cannot affect A) * No cycles (no circular chains like A -> B -> C -> A) * Acyclic structures are allowed (e.g., A -> B, A -> C, B -> D, C -> D)
[11]

PROPERTY INTERVENTION: - You can intervene on ONE property at a time using the Property Manipulator - When you modify a property (e.g., prop_A): * Other properties CANNOT be simultaneously modified by direct intervention * However, other properties MAY change automatically as a causal consequence * Frequency MAY also change if it depends on the modified p...
[12]

action":

MATHEMATICAL MODEL: - Each property has a BASE VALUE that can be adjusted (except frequency) - Each property's value = base_value + sum of causal influences from other properties - Example: If frequency depends on properties A and B: frequency = freq_base_value + k_A * A + k_B * B (Note: freq_base_value is fixed and cannot be adjusted) - Example: If tempe...

[1] [1]

Imbens and Donald B

URLhttps://arxiv.org/abs/2510.08207. Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs.Journal of Machine Learning Research, 13(79):2409–2464, 2012. URLhttp: //jmlr.org/papers/v13/hauser12a.html. 12 CausaLab: A Scalable Environment for Interactive Causal Discovery ...

work page doi:10.1017/cbo9781139025751 2012

[2] [2]

backShift: Learning causal cyclic graphs from unknown shift interventions

URLhttps://aclanthology.org/2023.emnlp-main.940.pdf. Dominik Rothenhäusler, Christina Heinze, Jonas Peters, and Nicolai Meinshausen. backshift: Learning causal cyclic graphs from unknown shift interventions, 2015. URLhttps://arxiv.org/abs/1506.02494. Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, and Mrinmaya Sachan. A causal framewor...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/cvpr52729.2023 2023

[3] [3]

- This helps you track patterns, relationships, and make informed decisions

**memory** :a concise but comprehensive running summary that you update EVERY step using (a) your last memory and (b) the most recent few action-observation pairs. - Keep memory as complete as possible without being verbose or repeating low-level noise. - Include: current subgoal and plan, useful facts discovered, items/locations that matter, pending chec...

[4] [4]

**thought** (string): Natural language explanation of why the next action is chosen given current data and hypothesis. - Should reference specific gaps in data or hypothesis that motivate the action - MUST include accessibility checks: explicitly verify that any UUIDs you reference appear in the interactable objects list or your inventory, and state which...

[5] [5]

id":"0/N

**past_data** (JSON array): Records all available evidence, including baseline, passive observations, and intervention results. - Structure:`[{"id":"0/N","props":{"pH":95,"Pressure":103,...},"freq":610}, {"id":"1/N","props ":{"pH":95,"Pressure":50,...},"freq":451}, ...]` - Each entry MUST include:`id`(experiment counter),`props`(all measured properties as...

[6] [6]

edges":[{

**hypothesis** (JSON object): Current understanding of causal structure and frequency relationships. - Structure:`{"edges":[{"from":"PropA","to":"PropB"},...], "freq_equation":"resonanceFreq = base + c_PropA*PropA + c_PropB*PropB", "coefficients":{"base":value,"c_PropA":value,"c_PropB":value}}` -`edges`: Directed causal relationships between properties(in...

[7] [7]

target_prop

**experiment** (JSON object): Specification of the planned intervention when using the Property Manipulator, example: -`{"target_prop":"PropName","target_value":number}` - If no intervention is available or you are not planning one on this step, use`{}`instead of inventing a fake intervention

[8] [8]

action":

**Execution action fields** (one of the following): - Non-dialog actions:`"action":"ACTION_NAME", "arg1":value, "arg2":value`(arg1/arg2 optional depending on action) - Dialog option selection:`"chosen_dialog_option_int":integer` - Dialog Value input:`"value":number` **Output Format** (keys in exact order): ```json { "thought": "string explanation", "past_...

[9] [9]

CRYSTAL RESONANCE FREQUENCY: - Frequency is determined by certain properties (range: 0-100) through LINEAR relationships - Frequency CANNOT be directly modified by human intervention - Frequency can only change indirectly as a causal consequence of other property changes

[10] [10]

CAUSAL STRUCTURE: - Properties have LINEAR causal relationships with each other - These relationships form a DAG (Directed Acyclic Graph): * No bidirectional influences (if A affects B, B cannot affect A) * No cycles (no circular chains like A -> B -> C -> A) * Acyclic structures are allowed (e.g., A -> B, A -> C, B -> D, C -> D)

[11] [11]

PROPERTY INTERVENTION: - You can intervene on ONE property at a time using the Property Manipulator - When you modify a property (e.g., prop_A): * Other properties CANNOT be simultaneously modified by direct intervention * However, other properties MAY change automatically as a causal consequence * Frequency MAY also change if it depends on the modified p...

[12] [12]

action":

MATHEMATICAL MODEL: - Each property has a BASE VALUE that can be adjusted (except frequency) - Each property's value = base_value + sum of causal influences from other properties - Example: If frequency depends on properties A and B: frequency = freq_base_value + k_A * A + k_B * B (Note: freq_base_value is fixed and cannot be adjusted) - Example: If tempe...