On the error control of invariant causal prediction

Jelle J Goeman; Jinzhou Li

arxiv: 2401.03834 · v4 · pith:CBSQDVPKnew · submitted 2024-01-08 · 📊 stat.ME

Pith reviewed 2026-05-24 04:22 UTC · model grok-4.3

keywords: invariant causal prediction, false discovery rate, multiple testing, e-Closure, closed testing, causal inference, heterogeneous data, true discovery bounds

The pith

Reformulating invariant causal prediction as a multiple testing problem enables false discovery rate control and simultaneous true discovery bounds while preserving causal guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that the original invariant causal prediction method can be equipped with less conservative error controls to extract more causal information from heterogeneous data across environments. It achieves this by treating the identification of causal predictors as a multiple testing problem and then applying the e-Closure principle with new calibrators plus closed testing procedures. A sympathetic reader would care because the strict no-false-discovery guarantee often returns few or zero discoveries in practice, limiting usefulness. Simulations and a real-data example on educational attainment of US teenagers demonstrate that the adjusted guarantees yield more discoveries without extra assumptions or loss of the original invariance properties.

Core claim

By reformulating invariant causal prediction as a multiple testing problem, the authors apply the e-Closure principle together with tailored p-to-e calibrators to obtain simultaneous false discovery rate control, and derive simultaneous true discovery bounds via closed testing; these guarantees are more liberal than the original no-false-discovery requirement, retain every discovery made by the original method, require no additional assumptions, and continue to respect the invariance properties across multiple environments.
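A p-to-e calibrator turns a p-value into an e-value: it is a decreasing function f on (0, 1] whose integral over the unit interval is at most 1, so that f(P) has expectation at most 1 whenever P is uniform under the null. The sketch below shows the standard power family f(p) = κ p^(κ-1) as an illustration only; it is not the tailored calibrators the paper derives, and the names `power_calibrator` and `mean_under_uniform` are hypothetical.

```python
def power_calibrator(p, kappa=0.5):
    """Standard power-family p-to-e calibrator: f(p) = kappa * p**(kappa - 1).

    Illustrative only; this is NOT the tailored calibrator from the paper.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def mean_under_uniform(kappa, n=100_000):
    """Midpoint Riemann sum approximating E[f(P)] for P ~ Uniform(0, 1)."""
    total = sum(power_calibrator((i + 0.5) / n, kappa) for i in range(n))
    return total / n


if __name__ == "__main__":
    # Small p-values map to large e-values; p = 1 maps to kappa < 1.
    for p in (0.001, 0.01, 0.05, 1.0):
        print(p, power_calibrator(p))
    # The numerical mean should be close to (and not above) 1.
    print(mean_under_uniform(0.5))
```

The choice of κ trades off sensitivity to very small p-values against sensitivity to moderate ones; any fixed κ in (0, 1) gives a valid calibrator.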

What carries the argument

The reformulation of invariant causal prediction as a multiple testing problem, which permits direct application of the e-Closure principle for false discovery rate control and closed testing for true discovery bounds.

If this is right

More causal predictors can be identified in data sets where the original invariant causal prediction method returns none.
False discovery rate control at a user-specified level is obtained while keeping all original discoveries.
Simultaneous true discovery bounds supply additional quantitative causal information on the number of true predictors.
The procedures apply directly to the same heterogeneous data collected from multiple environments.
No extra assumptions beyond those of the original method are required.
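To make the idea of a simultaneous true discovery bound concrete, here is a minimal brute-force sketch of generic closed testing. It assumes one p-value per elementary hypothesis and uses Bonferroni local tests purely for simplicity; the paper's local tests for the ICP setting are not reproduced here, and all function names are hypothetical. An intersection hypothesis is rejected by closed testing when every superset of it is rejected locally, and the lower bound for a chosen set is its size minus the size of its largest unrejected subset.

```python
from itertools import combinations


def all_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = list(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def local_reject(p, subset, alpha):
    """Bonferroni local test of the intersection hypothesis over `subset`."""
    return min(p[j] for j in subset) <= alpha / len(subset)


def closed_testing_rejections(p, alpha):
    """Map each subset to True if closed testing rejects its intersection null."""
    universe = list(all_subsets(range(len(p))))
    locally = {s: local_reject(p, s, alpha) for s in universe}
    # Rejected by closed testing iff all supersets are locally rejected.
    return {s: all(locally[t] for t in universe if s <= t) for s in universe}


def true_discovery_lower_bound(p, chosen, alpha=0.05):
    """Lower bound on the number of false nulls among `chosen`."""
    rejected = closed_testing_rejections(p, alpha)
    chosen = frozenset(chosen)
    largest_unrejected = max(
        (len(s) for s in all_subsets(chosen) if not rejected[s]), default=0
    )
    return len(chosen) - largest_unrejected


if __name__ == "__main__":
    pvals = [0.001, 0.004, 0.03, 0.6]  # made-up p-values for illustration
    print(true_discovery_lower_bound(pvals, {0, 1, 2}))
```

The bounds hold simultaneously over all choices of the set, so the set may be picked after looking at the data. The enumeration is exponential in the number of hypotheses and is only meant for small examples.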

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The multiple-testing view may allow other invariance-based causal methods to adopt similar error-control upgrades.
In high-dimensional settings the added power could support more stable downstream prediction models built on the discovered causal features.
The US educational attainment application indicates the approach can be used in observational social-science studies that feature environmental heterogeneity.

Load-bearing premise

The reformulation of invariant causal prediction as a multiple testing problem must preserve the original invariance properties and yield valid error control without weakening the causal guarantees.

What would settle it

A simulation study or new dataset in which the proportion of false causal discoveries among those reported under the e-Closure procedure exceeds the nominal false discovery rate level, or in which the simultaneous true discovery bounds are violated when predictors are tested in additional held-out environments.
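The check described above can be sketched as follows, assuming a simulation in which the true causal predictors are known. The false discovery proportion (FDP) of one replicate is the share of reported predictors that are not truly causal, and averaging FDPs over replicates estimates the false discovery rate. The function names and the toy replicates are hypothetical and not taken from the paper.

```python
def false_discovery_proportion(discoveries, true_causal):
    """Share of reported predictors that are not truly causal (0 if none reported)."""
    discoveries = set(discoveries)
    if not discoveries:
        return 0.0
    false_discoveries = discoveries - set(true_causal)
    return len(false_discoveries) / len(discoveries)


def empirical_fdr(replicates, true_causal):
    """Average FDP over simulation replicates, an estimate of the FDR."""
    fdps = [false_discovery_proportion(d, true_causal) for d in replicates]
    return sum(fdps) / len(fdps)


if __name__ == "__main__":
    true_causal = {0, 1}
    # Made-up discovery sets from four hypothetical replicates.
    replicates = [{0, 1}, {0, 2}, set(), {1}]
    print(empirical_fdr(replicates, true_causal))
```

An estimate that clearly exceeds the nominal level, beyond Monte Carlo error, would count as evidence against the claimed guarantee.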

Figures

Figures reproduced from arXiv: 2401.03834 by Jelle J Goeman, Jinzhou Li.

Figure 2: The false discovery upper bounds, true discovery lower bounds, and the size

Original abstract

Invariant causal prediction provides a useful framework for identifying causal predictors of a response using heterogeneous data from multiple environments. One valuable property of the original invariant causal prediction method is that it guarantees no false causal discoveries with high probability. Such a guarantee, however, can be overly conservative in some applications, resulting in few or no causal discoveries. This raises a natural question: can invariant causal prediction be equipped with less conservative error guarantees and thereby extract more causal information from the data? In this paper, we address this question by focusing on two widely used and more liberal guarantees: false discovery rate control and simultaneous true discovery bounds. A key step in our approach is to reformulate invariant causal prediction as a multiple testing problem. We then adopt the e-Closure principle to obtain (simultaneous) false discovery rate control, together with new p-to-e calibrators tailored to this setting. We also derive simultaneous true discovery bounds via closed testing, which provide additional causal information without requiring extra assumptions and retain all discoveries from the original invariant causal prediction method. Through simulations and a real data application on educational attainment of teenagers in the United States, we show that these more liberal error control guarantees can improve the practical usefulness of invariant causal prediction.
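For orientation, the original method's output (Peters, Bühlmann, and Meinshausen, 2016) can be sketched as the intersection of all predictor subsets whose invariance null is not rejected at level α. The sketch below assumes the per-subset invariance p-values have already been computed by some invariance test; the dictionary of p-values is made up, and `icp_estimate` is a hypothetical name. Returning the empty set when no subset is accepted is a convention of this sketch.

```python
def icp_estimate(pvalue_by_set, alpha=0.05):
    """Intersect all predictor subsets whose invariance null is not rejected.

    `pvalue_by_set` maps a tuple of predictor indices to the p-value of the
    invariance test for that subset (assumed to be computed elsewhere).
    """
    accepted = [frozenset(s) for s, p in pvalue_by_set.items() if p > alpha]
    if not accepted:
        # No subset looks invariant; report no causal predictors.
        return frozenset()
    return frozenset.intersection(*accepted)


if __name__ == "__main__":
    # Made-up p-values for all subsets of three predictors {0, 1, 2}.
    pvalues = {
        (): 0.001, (0,): 0.002, (1,): 0.30, (2,): 0.01,
        (0, 1): 0.45, (0, 2): 0.003, (1, 2): 0.50, (0, 1, 2): 0.60,
    }
    print(sorted(icp_estimate(pvalues)))
```

Because the output is an intersection, a single accepted subset that omits a true causal predictor removes it from the result, which is one way to see why the original guarantee can return few or no discoveries.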

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reformulates invariant causal prediction (ICP) as a multiple-testing problem to equip it with false discovery rate (FDR) control via the e-Closure principle and simultaneous true-discovery bounds via closed testing. New p-to-e calibrators are derived for the ICP setting. Simulations and a real-data example on educational attainment are used to argue that the resulting procedures are less conservative than the original ICP while retaining its causal guarantees and recovering all of its discoveries.

Significance. If the reformulation preserves the original per-predictor invariance nulls and the validity of the error-rate guarantees under heterogeneous environments, the work would meaningfully increase the practical utility of ICP by permitting more discoveries without additional assumptions. The provision of both simulation evidence and a real-data application is a positive feature.

major comments (3)

[§3] (reformulation as multiple testing): the claim that the individual null hypotheses in the multiple-testing formulation are exactly equivalent to the original ICP invariance nulls (and therefore inherit the same causal interpretation) is not shown explicitly; a direct statement equating the two families of nulls, including how environment-induced dependence is handled, is needed to confirm that the FDR and true-discovery-bound guarantees remain causally valid.
[§4.2] (p-to-e calibrators): the derivation of the new calibrators is presented without an explicit verification that they remain valid under the heterogeneous-environment measure used by ICP; if the calibrators rely on exchangeability or identical distribution across environments, this must be stated and justified, as any mismatch would invalidate the subsequent e-Closure application.
[§5] (simulations): the reported power gains are shown only for the new procedures; a direct comparison table that also reports the original ICP output (number of discoveries and empirical FDR) on the same replicates is required to substantiate the claim that the liberal guarantees improve usefulness without inflating false discoveries beyond the nominal level.

minor comments (2)

Notation for the environment index and the set of candidate predictors is introduced inconsistently between the abstract, §2, and §3; a single consistent notation should be used throughout.
[§6] The real-data application in §6 would benefit from a brief description of how the environments were defined and whether any preprocessing steps (e.g., missing-data handling) could affect the invariance assumption.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the presentation of our results.

Referee: [§3] (reformulation as multiple testing): the claim that the individual null hypotheses in the multiple-testing formulation are exactly equivalent to the original ICP invariance nulls (and therefore inherit the same causal interpretation) is not shown explicitly; a direct statement equating the two families of nulls, including how environment-induced dependence is handled, is needed to confirm that the FDR and true-discovery-bound guarantees remain causally valid.

Authors: We agree that explicitly equating the null hypotheses would improve the manuscript. In the revised version we will insert a dedicated paragraph in §3 that states the precise equivalence between the individual invariance nulls in the multiple-testing formulation and the original ICP nulls, together with an explanation of how the dependence induced by the heterogeneous environments is handled through the test statistics. This addition will confirm that the FDR and true-discovery-bound guarantees inherit the causal validity of the original ICP procedure. revision: yes

Referee: [§4.2] (p-to-e calibrators): the derivation of the new calibrators is presented without an explicit verification that they remain valid under the heterogeneous-environment measure used by ICP; if the calibrators rely on exchangeability or identical distribution across environments, this must be stated and justified, as any mismatch would invalidate the subsequent e-Closure application.

Authors: The calibrators are constructed to be valid under the heterogeneous-environment measure that underlies ICP and do not assume exchangeability or identical distributions across environments. In the revision we will add an explicit verification paragraph in §4.2 that derives the validity of the calibrators directly from the invariance property under the null, thereby justifying their use with the e-Closure principle. revision: yes

Referee: [§5] (simulations): the reported power gains are shown only for the new procedures; a direct comparison table that also reports the original ICP output (number of discoveries and empirical FDR) on the same replicates is required to substantiate the claim that the liberal guarantees improve usefulness without inflating false discoveries beyond the nominal level.

Authors: We agree that a side-by-side comparison on identical replicates would strengthen the simulation section. We will add a table in §5 that reports, for every replicate, the number of discoveries and the empirical FDR achieved by the original ICP procedure alongside the corresponding quantities for the new procedures, thereby allowing direct assessment of power gains while confirming that error rates remain controlled. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the reformulation applies standard multiple-testing tools to ICP

The paper's core contribution is reformulating ICP as a multiple-testing problem to enable e-Closure for FDR control and closed testing for simultaneous true-discovery bounds. This step is presented as a direct mapping that preserves original invariance properties, with no equations showing that the new error guarantees are defined in terms of themselves or fitted parameters. No self-citations are load-bearing for the central claims, no uniqueness theorems are imported from the authors' prior work, and no ansatzes or renamings reduce the results to tautologies. The derivation rests on independent standard multiple-testing machinery (e-Closure, closed testing) applied to the existing ICP nulls, making the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the approach rests on standard multiple-testing assumptions and the validity of environment-specific p-values or e-values from the original ICP framework.

axioms (1)

domain assumption: p-values or e-values obtained from the multiple environments are valid for the multiple-testing reformulation.
Invoked when the abstract states that the reformulation enables e-Closure and closed testing.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Uniform Improvement of the Benjamini-Hochberg Procedure via e-Closure
stat.ME · 2026-06 · unverdicted · novelty 7.0

Closed BH improves the Benjamini-Hochberg procedure via e-Closure, controlling FDR under PRDS or weaker assumptions while never rejecting fewer hypotheses.
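As background for the comparison above, the classical Benjamini-Hochberg step-up procedure can be sketched in a few lines. This is the standard baseline only, not the Closed BH procedure from the citing paper; the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    # Indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


if __name__ == "__main__":
    pvals = [0.001, 0.004, 0.03, 0.6]  # made-up p-values for illustration
    print(benjamini_hochberg(pvals))
```

A uniform improvement in the sense described above would reject at least these hypotheses on every input.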

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper

[1] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300.

[2] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, pages 1165-1188.

[3] Dudoit, S., Van der Laan, M. J., and Pollard, K. S. (2004). Multiple testing. Part I. Single-step procedures for control of general type I error rates. Statistical Applications in Genetics and Molecular Biology, 3(1).

[4] Finner, H., Dickhaus, T., and Roters, M. (2009). On the false discovery rate and an asymptotically optimal rejection curve. Ann. Statist., 37(1):596-618.

[5] Genovese, C. R. and Wasserman, L. (2006). Exceedance control of the false discovery proportion. Journal of the American Statistical Association, 101(476):1408-1417.

[6] Goeman, J. J., Hemerik, J., and Solari, A. (2021). Only closed testing procedures are admissible for controlling false discovery proportions. The Annals of Statistics, 49(2):1218-1238.

[7] Goeman, J. J. and Solari, A. (2011). Multiple testing for exploratory research. Statistical Science, 26(4):584-597.

[8] Heinze-Deml, C., Peters, J., and Meinshausen, N. (2018). Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2).

[9] Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65-70.

[10] Kook, L., Saengkyongam, S., Lundborg, A. R., Hothorn, T., and Peters, J. (2023). Model-based causal feature selection for general response types. arXiv preprint arXiv:2309.12833.

[11] Korn, E. L., Troendle, J. F., McShane, L. M., and Simon, R. (2004). Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference, 124(2):379-398.

[12] Lehmann, E. L. and Romano, J. P. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3):1138-1154.

[13] Marcus, R., Peritz, E., and Gabriel, K. R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655-660.

[14] Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947-1012.

[15] Pfister, N., Bühlmann, P., and Peters, J. (2019). Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264-1276.

[16] Romano, J. P. and Shaikh, A. M. (2006). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, pages 1850-1873.

[17] Rouse, C. E. (1995). Democratization or diversion? The effect of community colleges on educational attainment. Journal of Business & Economic Statistics, 13(2):217-224.

[18] Sarkar, S. K. (2007). Stepup procedures controlling generalized FWER and generalized FDR. The Annals of Statistics, pages 2405-2420.

[19] Stock, J. H., Watson, M. W., et al. (2003). Introduction to econometrics, volume 104. Addison Wesley, Boston.