Predicting Causal Effects from Natural Language Queries using Structured Representations

Abelardo Carlos Martinez Lorenzo; Arianna Legovini; Giuliano Martinelli; Jasmin Baier; Linxi Wang; Piriyakorn Piriyatamwong; Riccardo Orlando; Samuel Fraiberger; Satvik Garg; Sharif Kazemi

arxiv: 2605.29631 · v1 · pith:RCMNCOS5new · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Predicting Causal Effects from Natural Language Queries using Structured Representations

Giuliano Martinelli , Piriyakorn Piriyatamwong , Abelardo Carlos Martinez Lorenzo , Jasmin Baier , Riccardo Orlando , Satvik Garg , Sharif Kazemi , Linxi Wang

show 2 more authors

Arianna Legovini Samuel Fraiberger

This is my paper

Pith reviewed 2026-06-29 08:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords causal effect predictionnatural language queriesstructured representationslarge language modelsfinetuningbenchmarkQuery2Effectout-of-domain generalization

0 comments

The pith

Finetuning a two-step framework that first builds structured representations of queries reduces error in predicting causal effect sizes from natural language by 27 to 71 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Query2Effect, a benchmark of more than 72,000 natural language questions aligned with experiment descriptions that vary in implicitness, abstraction, and ambiguity. It proposes generating a synthetic structured representation of each query first, then feeding that representation into a supervised encoder model to estimate the causal effect size. Experiments demonstrate that finetuning this pipeline substantially lowers absolute prediction error relative to prompted large language models used directly. The same separation of steps also improves performance when the test queries come from domains absent during training. This setup targets the high cost of running new randomized trials by extracting more value from existing experimental records.

Core claim

The central claim is that separating semantic interpretation of a natural language query into a synthetic structured representation from the subsequent numerical estimation of effect size, followed by supervised finetuning of the encoder, yields lower prediction error and stronger out-of-domain generalization than prompting large language models end-to-end.

What carries the argument

The two-step framework that first generates a synthetic structured representation of the query before predicting effect size using a supervised encoder model.

If this is right

Finetuning reduces absolute error by 27% to 71% compared to prompted out-of-the-box LLMs.
The two-step framework improves out-of-domain generalization.
The benchmark enables systematic testing across different levels of query implicitness, abstraction, and ambiguity.
Separating semantic interpretation from numerical effect estimation is the mechanism that drives the reported gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Predicted effect sizes could be used to rank which new randomized trials are most worth conducting.
The same separation into structured representation plus supervised estimation may apply to other text-to-numeric forecasting tasks.
Larger-scale versions of the benchmark could test whether gains persist when queries span many scientific fields.

Load-bearing premise

The benchmark's construction of natural language questions aligned with experiment descriptions, varied along implicitness, abstraction, and ambiguity, accurately simulates realistic information-seeking scenarios for causal effect prediction.

What would settle it

A held-out collection of queries drawn from new domains in which the two-step finetuned model shows no reduction in absolute error and no gain in generalization relative to direct prompting of large language models.

Figures

Figures reproduced from arXiv: 2605.29631 by Abelardo Carlos Martinez Lorenzo, Arianna Legovini, Giuliano Martinelli, Jasmin Baier, Linxi Wang, Piriyakorn Piriyatamwong, Riccardo Orlando, Samuel Fraiberger, Satvik Garg, Sharif Kazemi.

**Figure 2.** Figure 2: Performance degradation at decreasing levels [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

read the original abstract

Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, with absolute error reducing by -27% up to -71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Query2Effect gives a sizable new benchmark and a two-step parse-then-predict pipeline that shows clear finetuning gains and OOD benefits, but the synthetic queries lack any check against real user behavior.

read the letter

The paper's main new piece is the Query2Effect benchmark of over 72,000 questions aligned to experiment descriptions, built by varying implicitness, abstraction, and ambiguity. They pair this with a two-step setup that first produces a structured representation of the query and then feeds it to a supervised encoder for effect-size prediction.

The experiments make a straightforward case that finetuning matters a lot, cutting absolute error 27 to 71 percent versus prompted LLMs, and that the two-step split helps when moving to new domains. The architectural choice to separate semantic interpretation from the numerical step is reasonable and lines up with the reported generalization numbers.

The soft spot is the benchmark's construction. The questions are generated synthetically from the experiment texts, but the paper gives no external check on whether those variations match the distribution or difficulty of actual queries that domain experts or literature searches would produce. If the artificial spread over- or under-emphasizes certain features, both the finetuning gains and the OOD advantage stay tied to this particular data rather than proving more general.

The rest of the work is standard empirical NLP: clear data splits, no circular definitions, and ordinary citation patterns. Nothing in the claims reduces to quantities defined only by the authors' own fitted parameters.

This is for people building or testing systems that turn existing trial records into queryable causal predictions, or for benchmark work in structured prediction from text. A reader who needs a large dataset for causal-effect tasks would get concrete material to work with.

It deserves peer review. The scale and the testable claims are enough to warrant referee time, even if the realism question needs direct attention in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces Query2Effect, a benchmark of >72k natural language questions aligned to experiment descriptions and varied along implicitness, abstraction, and ambiguity to simulate causal-effect information-seeking. It proposes a two-step framework that first produces a synthetic structured representation of the query and then applies a supervised encoder to predict effect size. Experiments claim that fine-tuning yields absolute-error reductions of 27–71% versus prompted out-of-the-box LLMs and that the two-step design improves out-of-domain generalization.

Significance. If the benchmark is shown to be a faithful proxy for realistic queries, the work would supply a useful large-scale testbed and a modular architecture that separates semantic interpretation from numerical estimation. The scale of the dataset and the reported fine-tuning gains constitute concrete empirical contributions; however, their broader significance hinges on external validation of the benchmark’s realism.

major comments (2)

[Benchmark construction section] Benchmark construction section: the paper states that varying implicitness, abstraction, and ambiguity produces queries that “simulate realistic information-seeking scenarios,” yet supplies no external validation (e.g., comparison to real user queries from domain experts or literature searches). Because the headline claims of fine-tuning gains and OOD generalization rest on this assumption, the absence of such validation is load-bearing.
[Experimental results section] Experimental results section: the reported absolute-error reductions (−27 % to −71 %) are presented without accompanying information on the precise baselines, data splits, error bars, or statistical tests used. This omission prevents assessment of whether the two-step framework’s advantage is robust or benchmark-specific.

minor comments (2)

[Method section] Clarify the exact definition and generation procedure for the “synthetic structured representation” (including any learned components) so that the separation between the two steps can be reproduced.
Add a limitations paragraph that explicitly discusses the synthetic nature of the query distribution and the conditions under which the reported gains may not transfer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Benchmark construction section] Benchmark construction section: the paper states that varying implicitness, abstraction, and ambiguity produces queries that “simulate realistic information-seeking scenarios,” yet supplies no external validation (e.g., comparison to real user queries from domain experts or literature searches). Because the headline claims of fine-tuning gains and OOD generalization rest on this assumption, the absence of such validation is load-bearing.

Authors: We appreciate the referee's observation. The benchmark was constructed by systematically varying the three dimensions based on linguistic principles from information-seeking and query formulation literature. We acknowledge that no direct external validation against real user queries from domain experts was performed. In the revised manuscript, we will expand the benchmark construction section with additional discussion of the design rationale, grounding in prior work, an explicit limitations paragraph, and suggestions for future external validation. This will better frame the scope of our simulation claims without overstating realism. revision: yes
Referee: [Experimental results section] Experimental results section: the reported absolute-error reductions (−27 % to −71 %) are presented without accompanying information on the precise baselines, data splits, error bars, or statistical tests used. This omission prevents assessment of whether the two-step framework’s advantage is robust or benchmark-specific.

Authors: We agree that greater transparency is needed. While some details appear in the appendix, the revised main experimental results section will explicitly specify the baselines (exact LLMs and prompt templates), data splits (including OOD construction criteria), error bars (standard deviations across random seeds), and statistical tests (e.g., paired significance tests). These additions will enable readers to assess robustness directly. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark and supervised learning study

full rationale

The paper presents Query2Effect as a constructed benchmark of >72k aligned NL questions and evaluates a two-step framework via finetuning experiments showing error reductions. No equations, derivations, or load-bearing self-citations reduce any prediction to quantities defined by the authors' own fitted parameters or prior ansatzes. The central claims rest on independent experimental comparisons against prompted LLMs and OOD splits, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond standard supervised learning assumptions; the central claim rests on the unstated premise that the generated structured representations preserve causal semantics.

pith-pipeline@v0.9.1-grok · 5762 in / 1096 out tokens · 30033 ms · 2026-06-29T08:04:49.918630+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 2 canonical work pages · 1 internal anchor

[1]

gpt-oss-120b & gpt-oss-20b Model Card

M3Retrieve: Benchmarking multimodal re- trieval for medicine. InProceedings of the 2025 Con- ference on Empirical Methods in Natural Language Processing, pages 15263–15276, Suzhou, China. As- sociation for Computational Linguistics. Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Alt- man, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haimi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

What is the impact of [intervention] on [outcome]?

Consort 2025 statement: updated guideline for reporting randomized trials.Nature Medicine, 31(6):1776–1783. Ali Hummos, Felipe del Río, Brabeeba Mien Wang, Julio Hurtado, Cristian B Calderon, and Guangyu Robert Yang. 2024. Gradient-based in- ference of abstract task representations for gen- eralization in neural networks.ArXiv preprint, abs/2407.17356. Ko...

work page arXiv 2025
[3]

Generate exactly FOUR queries
[4]

Each query must be ONE sentence
[5]

Do NOT include answers, explanations, metadata, or formatting beyond what is requested
[6]

Queries must differ meaningfully in implicitness, abstraction, and ambiguity
[7]

Do NOT hallucinate details that are not present in the RCT (e.g., dosage, sample size, metrics)
[8]

When information is not allowed by the difficulty level, it must be omitted, not guessed
[9]

Queries should sound like realistic questions asked by policymakers, practitioners, or researchers. -------------------------------------------------- RCT INFORMATION -------------------------------------------------- Intervention: - Description: {intervention_description} Outcome: - Description: {outcome_description} Sector: - Description: {sector} -----...
[10]

I0-A0-U0 (Fully explicit, concrete, unambiguous)
[11]

I1-A1-U1 (Implicit elements, paraphrased, mildly underspecified)
[12]

I2-A2-U2 (Conceptual abstraction with multiple plausible interpretations)
[13]

query":

I3-A3-U3 (Very high-level, ill-posed causal question) -------------------------------------------------- OUTPUT FORMAT (STRICT) -------------------------------------------------- Each object must have the following structure: { "query": "<one-sentence>", "difficulty": { "implicitness": "I0|I1|I2|I3", "abstraction": "A0|A1|A2|A3", "ambiguity": "U0|U1|U2|U3...
[14]

Generate a detailed description (rather than just a keyword or keyphrase) for each of the intervention and outcome
[15]

intervention

Prefer underspecification over extrapolating knowledge without basis. ------------------------------------------------------------ WHAT YOU MAY INFER ------------------------------------------------------------ You MAY extract or cautiously infer: - Intervention type (if named or clearly implied) - Outcome variable (possibly abstracted) - Target populatio...
[16]

Hedges g score: a float between -2 and 2 representing the estimated impact of the intervention on the outcome
[17]

Lower bound of the confidence interval for Hedges g: must be a float smaller than Hedges g
[18]

Hedges_g

Upper bound of the confidence interval for Hedges g: must be a float larger than Hedges g. Format your output exactly as a JSON object like: {"Hedges_g": <float between -2 and 2>, "Hedges_g_ci_lower": <float smaller than Hedges_g>, "Hedges_g_ci_upper": <float larger than Hedges_g>, } --- I will give you three examples of intervention–outcome pairs with th...

1956

[1] [1]

gpt-oss-120b & gpt-oss-20b Model Card

M3Retrieve: Benchmarking multimodal re- trieval for medicine. InProceedings of the 2025 Con- ference on Empirical Methods in Natural Language Processing, pages 15263–15276, Suzhou, China. As- sociation for Computational Linguistics. Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Alt- man, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haimi...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

What is the impact of [intervention] on [outcome]?

Consort 2025 statement: updated guideline for reporting randomized trials.Nature Medicine, 31(6):1776–1783. Ali Hummos, Felipe del Río, Brabeeba Mien Wang, Julio Hurtado, Cristian B Calderon, and Guangyu Robert Yang. 2024. Gradient-based in- ference of abstract task representations for gen- eralization in neural networks.ArXiv preprint, abs/2407.17356. Ko...

work page arXiv 2025

[3] [3]

Generate exactly FOUR queries

[4] [4]

Each query must be ONE sentence

[5] [5]

Do NOT include answers, explanations, metadata, or formatting beyond what is requested

[6] [6]

Queries must differ meaningfully in implicitness, abstraction, and ambiguity

[7] [7]

Do NOT hallucinate details that are not present in the RCT (e.g., dosage, sample size, metrics)

[8] [8]

When information is not allowed by the difficulty level, it must be omitted, not guessed

[9] [9]

Queries should sound like realistic questions asked by policymakers, practitioners, or researchers. -------------------------------------------------- RCT INFORMATION -------------------------------------------------- Intervention: - Description: {intervention_description} Outcome: - Description: {outcome_description} Sector: - Description: {sector} -----...

[10] [10]

I0-A0-U0 (Fully explicit, concrete, unambiguous)

[11] [11]

I1-A1-U1 (Implicit elements, paraphrased, mildly underspecified)

[12] [12]

I2-A2-U2 (Conceptual abstraction with multiple plausible interpretations)

[13] [13]

query":

I3-A3-U3 (Very high-level, ill-posed causal question) -------------------------------------------------- OUTPUT FORMAT (STRICT) -------------------------------------------------- Each object must have the following structure: { "query": "<one-sentence>", "difficulty": { "implicitness": "I0|I1|I2|I3", "abstraction": "A0|A1|A2|A3", "ambiguity": "U0|U1|U2|U3...

[14] [14]

Generate a detailed description (rather than just a keyword or keyphrase) for each of the intervention and outcome

[15] [15]

intervention

Prefer underspecification over extrapolating knowledge without basis. ------------------------------------------------------------ WHAT YOU MAY INFER ------------------------------------------------------------ You MAY extract or cautiously infer: - Intervention type (if named or clearly implied) - Outcome variable (possibly abstracted) - Target populatio...

[16] [16]

Hedges g score: a float between -2 and 2 representing the estimated impact of the intervention on the outcome

[17] [17]

Lower bound of the confidence interval for Hedges g: must be a float smaller than Hedges g

[18] [18]

Hedges_g

Upper bound of the confidence interval for Hedges g: must be a float larger than Hedges g. Format your output exactly as a JSON object like: {"Hedges_g": <float between -2 and 2>, "Hedges_g_ci_lower": <float smaller than Hedges_g>, "Hedges_g_ci_upper": <float larger than Hedges_g>, } --- I will give you three examples of intervention–outcome pairs with th...

1956