arxiv: 2605.09741 · v1 · submitted 2026-05-10 · 📊 stat.ME

Recognition: 2 theorem links

· Lean Theorem

Adaptive discovery of effect modification in matched observational studies

Dylan S Small, Yu Gui, Zhimei Ren

Pith reviewed 2026-05-12 02:57 UTC · model grok-4.3

classification 📊 stat.ME

keywords effect modificationobservational studiesmatched samplingfalse discovery ratesensitivity analysissubgroup discoverytreatment effect heterogeneitymultiple controls

0 comments

The pith

A finite-sample valid procedure identifies covariate-interpretable subgroups showing different treatment effects in matched observational studies while exactly controlling the subgroup false discovery rate and bounding unmeasured bias via a

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In observational studies where randomization is impossible, researchers match treated units to multiple controls on observed covariates to estimate treatment effects. The paper develops a procedure that scans for subgroups defined by those covariates where the treatment effect appears to differ, then selects only those subgroups that pass a statistical test. The selection comes with an exact guarantee that the proportion of false discoveries among the selected subgroups stays below a preset level, even in small samples. The procedure further incorporates sensitivity models that limit how much unmeasured confounding could distort the results, and it exploits the extra controls to increase the chance of detecting real differences. A reader would care because many policy and medical decisions hinge on knowing which groups gain most from an intervention, yet hidden biases and multiple testing have historically made such claims unreliable.

Core claim

We develop a finite-sample valid procedure for identifying and selecting covariate-interpretable subgroups, with exact control of the subgroup-level false discovery rate (FDR). Our method explicitly accounts for unmeasured confounding via sensitivity models, and leverages multiple matched controls to improve statistical power. We demonstrate the favorable performance of our method relative to baseline methods through extensive simulation studies and a real-world application to the economic returns to college education.

What carries the argument

An adaptive selection rule that tests for effect modification on matched data using sensitivity-adjusted statistics, then applies a step-down threshold to achieve exact finite-sample FDR control at the subgroup level.

If this is right

Researchers obtain a list of subgroups with heterogeneous treatment effects whose false discovery rate is guaranteed not to exceed the target in finite samples.
Results remain valid under any unmeasured confounding whose strength stays inside the pre-specified sensitivity parameters.
Multiple matched controls per treated unit raise power to detect true effect modification compared with one-to-one matching.
The same framework can be applied to policy questions such as identifying demographic groups that receive larger economic returns to college.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the sensitivity parameters are set too loosely, the procedure may still select subgroups whose effects are driven by hidden bias.
The method could be paired with data-driven subgroup search algorithms to handle high-dimensional covariates.
In practice, analysts would need to report the sensitivity parameters alongside the selected subgroups so readers can judge the robustness claim.
The college-education application suggests the procedure can surface subgroups defined by observable demographics that show meaningfully different returns.

Load-bearing premise

The chosen sensitivity models correctly bound the possible impact of unmeasured confounding and the initial matching on observed covariates is adequate for those bounds to be meaningful.

What would settle it

In a simulation where true effect-modifying subgroups are planted and the magnitude of unmeasured confounding is set within the sensitivity bounds, the proportion of falsely selected subgroups exceeds the nominal FDR level.

Figures

Figures reproduced from arXiv: 2605.09741 by Dylan S Small, Yu Gui, Zhimei Ren.

**Figure 1.** Figure 1: (a) Power comparison of our method (Ours-NP) with the baselines BH-baseline (Karmakar et al., 2018) and P-screening (Duan et al., 2024) as subgroup size varies. (b) Power of our proposed method under different choices of masking methods as a function of the number of control units. The power curves Max, TopGap, MedSplit correspond to existing methods, while NP denotes our proposal. (c) Power comparison be… view at source ↗

**Figure 2.** Figure 2: Histograms of subgroup-level p-values and [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Histograms of unit-level ranks ri and subgroup-level statistics Lg · WNP g (5 controls). 4.2 Desiderata for the optimal magnitude statistic We now turn to the design of Wi using g(ri) together with F and Z. Given a masked rank g(ri), let J denote its preimage. We note that |J | ∈ {1, 2}, with |J | = 1 if and only if ni is odd and ri = ⌊ni/2⌋ + 1; in this case, Li = 0 and this matched set effectively does n… view at source ↗

**Figure 4.** Figure 4: reports the averaged FDP and power over 100 simulated datasets against varying group sizes. All methods control averaged FDP below the nominal level α = 0.1 in both panels. Regarding power, Ours-NP—our method with NP-based Wg— maintains consistently high discovery power across all group sizes, achieving power around 0.7 even at the smallest group size of 5. Notably, the performance of Ours-NP dominates oth… view at source ↗

**Figure 5.** Figure 5: FDR and Power comparison with random subgroup partition. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: FDR and Power comparison with tree-based subgroup partition. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Visualization of subgroups partition and selection: [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Boxplots of FDR and power for comparison with and without conditional calibration. [PITH_FULL_IMAGE:figures/full_fig_p039_8.png] view at source ↗

**Figure 9.** Figure 9: Covariate balance with propensity score matching: caliper [PITH_FULL_IMAGE:figures/full_fig_p041_9.png] view at source ↗

**Figure 10.** Figure 10: FDR and Power comparison versus varying group sizes with tree-based subgroup parti [PITH_FULL_IMAGE:figures/full_fig_p049_10.png] view at source ↗

**Figure 11.** Figure 11: FDR and Power comparison with random subgroup partition: two-sided effects and [PITH_FULL_IMAGE:figures/full_fig_p050_11.png] view at source ↗

**Figure 12.** Figure 12: FDR and power versus group size: multiple outcomes and varying subgroup sizes. [PITH_FULL_IMAGE:figures/full_fig_p050_12.png] view at source ↗

**Figure 13.** Figure 13: FDR and Power comparison versus varying group sizes with tree-based subgroup parti [PITH_FULL_IMAGE:figures/full_fig_p051_13.png] view at source ↗

**Figure 14.** Figure 14: Performance with different values of |Qg|. 51 [PITH_FULL_IMAGE:figures/full_fig_p051_14.png] view at source ↗

**Figure 15.** Figure 15: Performance of Ours-NP with different values of |Qg| and Γ [PITH_FULL_IMAGE:figures/full_fig_p052_15.png] view at source ↗

read the original abstract

Understanding effect modification -- how treatment effects vary across subpopulations -- is practically important in observational studies, as it helps identify which subgroups are likely to benefit from a given treatment. In this paper, we study the discovery of effect modification in matched observational studies, where each treated unit may be matched to multiple controls. We develop a finite-sample valid procedure for identifying and selecting covariate-interpretable subgroups, with exact control of the subgroup-level false discovery rate (FDR). Our method explicitly accounts for unmeasured confounding via sensitivity models, and leverages multiple matched controls to improve statistical power. We demonstrate the favorable performance of our method relative to baseline methods through extensive simulation studies and a real-world application to the economic returns to college education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a finite-sample procedure for finding covariate subgroups with effect modification in matched observational studies, with exact FDR control that incorporates sensitivity analysis for unmeasured confounding.

read the letter

The core contribution is a method that selects interpretable subgroups while maintaining exact finite-sample FDR control at the subgroup level, even after building in Rosenbaum-style sensitivity bounds for hidden bias. It also uses multiple matched controls per treated unit to gain power without losing the exactness property. That combination is what the abstract highlights as new for this matched-design setting, and the simulations plus the college-education application show it outperforming simpler baselines in power while keeping the error rate in check under the assumed sensitivity model. The real-data example is straightforward and helps ground the claims. The math appears to rest on standard permutation arguments extended to the adaptive, multiple-matching case, which is credible on its face if the proofs are complete. The main soft spot is the usual one for sensitivity analysis: the exact FDR guarantee holds only when the chosen Gamma bound actually dominates the worst-case confounding; if the bound is misspecified, the null distribution used for thresholding is off and control can fail. The adaptive subgroup search adds another layer that needs careful justification to avoid inflating the error rate in finite samples, though the paper claims to handle it. Minor practical issue is that users still have to pick the sensitivity parameter and the matching design up front. This is aimed at applied statisticians and economists who already use matching for causal questions and want a disciplined way to hunt for effect modifiers without post-hoc p-hacking. Readers working on FDR methods in observational data or on subgroup analysis in policy settings will get the most out of it. It is worth sending to a serious referee because the finite-sample exactness claim is substantive and the problem is practically relevant, even if the sensitivity-model dependence will draw the usual questions.

Referee Report

2 major / 2 minor

Summary. The paper develops a finite-sample valid procedure for adaptive discovery of effect modification in matched observational studies with multiple controls per treated unit. It claims to identify covariate-interpretable subgroups while achieving exact control of the subgroup-level false discovery rate (FDR), explicitly incorporating sensitivity models (e.g., Rosenbaum-style bounds) for unmeasured confounding, with supporting evidence from simulations and a real-data analysis of economic returns to college education.

Significance. If the finite-sample FDR guarantees hold under the stated sensitivity models, the work would offer a useful advance for causal inference in observational settings by enabling data-driven subgroup selection with rigorous error control and improved power from multiple matches. The emphasis on covariate-interpretable subgroups and the empirical demonstrations via simulations and the college-education application are practical strengths that could aid applied researchers.

major comments (2)

[§3 (Theoretical Results), main FDR theorem] The central claim of exact finite-sample FDR control (abstract and §3) relies on the sensitivity model correctly bounding unmeasured confounding for the matched design. However, because subgroup selection is adaptive and data-dependent, the null distribution for the FDR threshold may no longer be valid post-selection; the manuscript should explicitly show (e.g., via the proof of the main theorem) that the procedure preserves the required stochastic dominance or exchangeability properties despite this adaptivity.
[§4 (Simulations), Table 2] Table 2 and the simulation design in §4 report power gains from multiple controls, but do not include cases with sensitivity-parameter misspecification (e.g., true confounding exceeding the assumed Γ). This is load-bearing for the practical interpretation of the 'exact control' guarantee, as the skeptic note correctly flags that under-bounding voids the null distribution.

minor comments (2)

[§2 (Setup)] Notation for the sensitivity parameter Γ and the multiple-control matching ratio is introduced late; defining these in §2 would improve readability for readers new to Rosenbaum bounds.
[§5 (Application)] The real-data application section would benefit from a brief table summarizing the selected subgroups and their estimated effects under the chosen Γ.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [§3 (Theoretical Results), main FDR theorem] The central claim of exact finite-sample FDR control (abstract and §3) relies on the sensitivity model correctly bounding unmeasured confounding for the matched design. However, because subgroup selection is adaptive and data-dependent, the null distribution for the FDR threshold may no longer be valid post-selection; the manuscript should explicitly show (e.g., via the proof of the main theorem) that the procedure preserves the required stochastic dominance or exchangeability properties despite this adaptivity.

Authors: We agree that the impact of adaptive, data-dependent subgroup selection on the validity of the finite-sample FDR guarantee merits explicit treatment. The proof of the main theorem in §3 establishes control by showing that the sensitivity-bounded test statistics for candidate subgroups satisfy uniform stochastic dominance under the null, with the selection rule being a monotone function of these statistics within the fixed matched design. This structure preserves the necessary exchangeability properties for the step-down threshold. In the revised manuscript we will add a dedicated lemma immediately preceding the main theorem that isolates and proves this preservation step, making the argument fully explicit. revision: yes
Referee: [§4 (Simulations), Table 2] Table 2 and the simulation design in §4 report power gains from multiple controls, but do not include cases with sensitivity-parameter misspecification (e.g., true confounding exceeding the assumed Γ). This is load-bearing for the practical interpretation of the 'exact control' guarantee, as the skeptic note correctly flags that under-bounding voids the null distribution.

Authors: The referee is correct that the current simulation design assumes the sensitivity parameter Γ is at least as large as the true confounding strength. To address this, the revised §4 will include new simulation scenarios in which the true level of unmeasured confounding exceeds the assumed Γ. We will report the realized subgroup-level FDR in these misspecified cases, thereby illustrating the practical consequences of under-bounding and clarifying the scope of the exact-control guarantee. revision: yes

Circularity Check

0 steps flagged

No circularity: procedure is a constructed method with independent finite-sample guarantees

full rationale

The paper constructs a new adaptive procedure for subgroup selection with exact FDR control under sensitivity models for unmeasured confounding. The derivation chain begins from the matched design and sensitivity bounds (treated as given inputs) and produces a data-dependent selection rule whose validity is proved directly via finite-sample arguments rather than by fitting parameters to the target quantities or by self-referential definitions. No equations reduce the claimed FDR control to a fitted input or to a prior result whose only justification is self-citation. The use of multiple controls is a power-enhancing feature of the design, not a circular re-use of the same data. The method is therefore self-contained against external benchmarks once the sensitivity model is accepted as an assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides minimal detail; the primary domain assumption is the adequacy of sensitivity models for unmeasured confounding.

axioms (1)

domain assumption Sensitivity models can be used to bound unmeasured confounding in matched designs
Abstract states the method explicitly accounts for unmeasured confounding via sensitivity models.

pith-pipeline@v0.9.0 · 5410 in / 1217 out tokens · 51281 ms · 2026-05-12T02:57:40.455872+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We develop a finite-sample valid procedure for identifying and selecting covariate-interpretable subgroups, with exact control of the subgroup-level false discovery rate (FDR). Our method explicitly accounts for unmeasured confounding via sensitivity models...
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Under the Γ-sensitivity model... P(Lg=1|Vg,Wg,F,Z)≤κ·P(Lg=−1|Vg,Wg,F,Z) with κ bounded by Γ.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

[1]

J., Imberman, S

Andrews, R. J., Imberman, S. A., Lovenheim, M. F., and Stange, K. (2024). The returns to college major choice: Average and distributional effects, career trajectories, and earnings variability. Review of Economics and Statistics, pages 1–45. Armstrong, T. and Shen, S. (2015). Inference on optimal treatment assignments.The Japanese Economic Review, 74(4):4...

work page 2024
[2]

Baum, S. (2014). Higher education earnings premium: Value, variation, and trends.Urban Institute. Bekerman, W., Dalal, A., del Ninno, C., and Small, D. S. (2024). Planning for gold: Sample splitting for valid powerful design of observational studies.arXiv preprint arXiv:2406.00866. Benjamini, Y.andHochberg, Y.(1995). Controllingthefalsediscoveryrate: apra...

work page arXiv 2014
[3]

Candès, E., Fan, Y., Janson, L., and Lv, J. (2018). Panning for gold:‘model-x’knockoffs for high dimensional controlled variable selection.Journal of the Royal Statistical Society Series B: Sta- tistical Methodology, 80(3):551–577. Chao, P. and Fithian, W. (2021). Adapt-gmm: Powerful and robust covariate-assisted multiple testing.arXiv preprint arXiv:2106...

work page arXiv 2018
[4]

Duan, B., Ramdas, A., and Wasserman, L. (2020). Familywise error rate control by interactive unmasking. InInternational Conference on Machine Learning, pages 2720–2729. PMLR. Duan, B., Ramdas, A., and Wasserman, L. (2022). Interactive rank testing by betting. InConference on Causal Learning and Reasoning, pages 201–235. PMLR. Duan, B., Wasserman, L., and ...

work page arXiv 2020
[5]

and Ren, Z

Lee, J. and Ren, Z. (2024). Boosting e-bh via conditional calibration.arXiv preprint arXiv:2404.17562. Lee, K., Small, D. S., Hsu, J. Y., Silber, J. H., and Rosenbaum, P. R. (2018). Discovering effect modification in an observational study of surgical mortality at hospitals with superior nursing. Journal of the Royal Statistical Society Series A: Statisti...

work page arXiv 2024
[6]

Lei, L., Ramdas, A., and Fithian, W. (2021). A general interactive framework for false discovery rate control under structural constraints.Biometrika, 108(2):253–267. Li, M. L. and Imai, K. (2023). Statistical performance guarantee for subgroup identification with generic machine learning.arXiv preprint arXiv:2310.07973. Lipkovich, I., Svensson, D., Ratit...

work page arXiv 2021
[7]

Perna, L. W. (2005). The benefits of higher education: Sex, racial/ethnic, and socioeconomic group differences.The Review of Higher Education, 29(1):23–52. Reeve, H. W., Cannings, T. I., and Samworth, R. J. (2023). Optimal subgroup selection.The Annals of Statistics, 51(6):2342–2365. Ren, Z. and Candes, E. (2023). Knockoffs with side information.The Annal...

work page internal anchor Pith review arXiv 2005
[8]

P g∈H0,G 1{g/∈ Oτ(L,Ξ), L g = 1} 1∨ P g′∈G 1{g′ /∈ Oτ(L,Ξ), L g′ = 1} # =E

and heterogeneous treatment-effect discovery in program and policy evaluation(AtheyandImbens,2016). Identifyingsubgroupsfromdataisthefirstemergingquestion. There is a rich literature on learning subgroups from data, in particular on tree-based methods, including the CART algorithm (Su et al., 2009; Breiman et al., 2017), causal trees (Athey and Imbens, 20...

work page 2016
[9]

Lettingp i =P(L i = 1| F,Z)andp ∗ = Γ 1+Γ, we define ∆g = P i∈g ri(pi −p ∗) P i∈g r2 i p∗(1−p ∗) 1/2

Given an observed Wilcoxon signed-rankstatisticS obs g , thedeterministicandrandomizedone-sidedp-valuesundertheΓ-sensitivity model are pg =P X i∈g riL∗ i ≥S obs g ! , p rand g =P X i∈g riL∗ i > S obs g ! +U·P X i∈g riL∗ i =S obs g ! ,(24) where the probability is overL∗ i iid∼Bern( Γ 1+Γ)andU∼Unif(0,1)is an independent tie-breaking variable. Lettingp i =P...

work page 1971
[10]

Define the events E+ = β(J, ˜R, X)>0 andE − = β(J, ˜R, X)<0

+E 1{˜L= 1} ·(P(L=−1| J, ˜R, X)−P(L= 1| J, ˜R, X)) ,(29) where the last step follows from the fact that˜Lis a function ofJ, ˜R, Xand the tower property. Define the events E+ = β(J, ˜R, X)>0 andE − = β(J, ˜R, X)<0 . 33 We then have (29)=P(L= 1)−E h 1{˜L= 1,E +} · P(L= 1| J, ˜R, X)−P(L=−1| J, ˜R, X) i +E h 1{˜L= 1,E −} · P(L= 1| J, ˜R, X)−P(L=−1| J, ˜R, X) ...

work page 2024
[11]

for subgroup selection. C.2 Screening by incorporating additional information In our framework, the sign-magnitude pair is used in a disentangled way: the sign is used for FDP estimation, while the magnitude determines the screening ordering. For FDR control, the magnitude statistic is subject to essentially no restriction beyond a near-independence condi...

work page 2015
[12]

Figure 8: Boxplots of FDR and power for comparison with and without conditional calibration

Power NP CC−NP Max CC−Max 0.2 0.4 0.6 0.8 1.0 Method (b)Γ = 1.5. Figure 8: Boxplots of FDR and power for comparison with and without conditional calibration. From Figure 8, we can see that bothOurs-cc-NPandOurs-ccfull-NPimprove upon the power ofOurs-NPin these two settings. In the first setting, without sensitivity adjustment, conditional calibration is l...

work page 2021
[13]

num_sib sibsttNumber of siblings rural_res res57 Residential area of graduate. Degree of urbanization (Counties with no city or with a city of less than 50,000) prox_college avcl57 Geographic availability of college (High school in community≤15 miles from any college) class_rank hsrscorqHigh school grades percentile rank-normalized IQ gwiiq_j IQ score map...

work page 1993
[14]

We exclude individuals with extreme parental income values by removing observations withparents_income≥998, and recode intactas a binary indicator of family intactness

We preprocess the WLS data prior to matching and inference: observations with incomplete treatment coding (treatment value−2) are recoded as untreated. We exclude individuals with extreme parental income values by removing observations withparents_income≥998, and recode intactas a binary indicator of family intactness. Parents’ income is log-transformed t...

work page 2011
[15]

Under two-sided effects (Figure 13b), the baselines, theBH-baselineandP-screening, are substantially weaker, while our methods maintain high discovery power

Under one-sided effects (Figure 13a),Ours-NPachieves power above0.8 at small group sizes, and theBH-baselineimproves with group size but plateaus well below our methods. Under two-sided effects (Figure 13b), the baselines, theBH-baselineandP-screening, are substantially weaker, while our methods maintain high discovery power. F.5 Additional simulations wi...

work page 2095