A Statistical Framework for Understanding Causal Effects that Vary by Treatment Initiation Time in EHR-based Studies

Alexander W. Levis; Catherine Lee; David Arterburn; Heidi Fischer; Luke Benz; Rajarshi Mukherjee; Rui Wang; Sebastien Haneuse; Susan M. Shortreed

arxiv: 2512.19553 · v2 · submitted 2025-12-22 · 📊 stat.ME

Pith reviewed 2026-05-16 20:23 UTC · model grok-4.3

keywords causal inference · electronic health records · marginal structural models · time-varying treatment effects · covariate shift · doubly robust estimation · bariatric surgery · comparative effectiveness

The pith

A framework estimates time-specific treatment effects in EHR studies by projecting doubly robust estimates onto marginal structural models and quantifying covariate shift with standardization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a statistical framework to estimate average treatment effects that vary with the calendar time when treatment starts, using data from electronic health records. Standard practice reports a single constant effect, but real-world changes in techniques or patient groups can make effects evolve, which matters for understanding comparative effectiveness of procedures like bariatric surgery. The method first obtains doubly robust estimates of effects at each time point, then projects those estimates onto candidate marginal structural models and selects the best one to describe the pattern of variation. It adds a standardization-based metric that measures how much any observed changes stem from shifts in the patient population rather than changes in treatment efficacy itself. This separation helps explain both the direction and the source of time variation in EHR analyses.

Core claim

The paper claims that three steps together allow researchers to describe both how and why causal effects vary by treatment initiation time in EHR-based studies: projecting doubly robust, time-specific treatment effect estimates onto candidate marginal structural models; using a model selection procedure to describe the pattern of variation; and applying a standardization analysis that yields a summary metric for the role of covariate shift.

What carries the argument

Projection of doubly robust time-specific treatment effect estimates onto candidate marginal structural models with model selection, plus a standardization-based metric that quantifies the contribution of covariate shift to observed effect changes.
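The projection and selection steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three candidate forms, the uniform weights w(m), the toy effect trajectory, and the helper names `design`, `project`, `pseudo_risk`, and `select_msm` are all hypothetical. The risk follows the pseudorisk form quoted in the Figure 2 caption, with plain point estimates standing in for the influence-function terms, and it is evaluated on a second, independent set of estimates as a crude stand-in for cross-fitting.

```python
import numpy as np

def design(m, kind):
    """Design matrix for a hypothetical candidate MSM in calendar time m."""
    t = (m - m.mean()) / m.std()  # centred, scaled time for numerical stability
    if kind == "constant":
        cols = [np.ones_like(t)]
    elif kind == "linear":
        cols = [np.ones_like(t), t]
    elif kind == "quadratic":
        cols = [np.ones_like(t), t, t**2]
    else:
        raise ValueError(f"unknown candidate: {kind}")
    return np.column_stack(cols)

def project(chi_hat, w, X):
    """Weighted least-squares projection of time-specific estimates onto an MSM."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], chi_hat * sw, rcond=None)
    return beta

def pseudo_risk(chi_hat, w, psi):
    """sum_m w(m) * (psi(m)^2 - 2 * psi(m) * chi_hat_m), i.e. squared error minus a constant."""
    return float(np.sum(w * (psi**2 - 2.0 * psi * chi_hat)))

def select_msm(chi_train, chi_eval, w, m, kinds=("constant", "linear", "quadratic")):
    """Fit each candidate on one set of estimates, score it on an independent set."""
    risks = {}
    for kind in kinds:
        X = design(m, kind)
        beta = project(chi_train, w, X)
        risks[kind] = pseudo_risk(chi_eval, w, X @ beta)
    return min(risks, key=risks.get), risks

# Toy data (assumed, not from the paper): 84 months with a linear drift in the effect.
rng = np.random.default_rng(0)
m = np.arange(1, 85, dtype=float)
w = np.full(m.size, 1.0 / m.size)
chi_true = -16.0 + 0.05 * m
chi_train = chi_true + rng.normal(0, 0.3, m.size)
chi_eval = chi_true + rng.normal(0, 0.3, m.size)
best, risks = select_msm(chi_train, chi_eval, w, m)
print(best, risks)
```

Scoring on held-out estimates matters because in-sample risk can never increase when a nested candidate gains a parameter, so it would always favour the richest model.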

If this is right

Time-specific estimates can be summarized by a selected model that reveals whether effects improve, decline, or remain stable over calendar time.
The standardization metric distinguishes changes due to evolving treatment techniques from changes due to different patients receiving treatment.
Model selection identifies the simplest description of time variation that fits the data without overfitting.
In settings like bariatric surgery versus standard care, the approach shows whether efficacy has changed since the procedures began.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could extend to studies of other interventions such as medications or devices where clinical practice changes over time.
When covariate shift accounts for most variation, attention could shift toward improving patient selection criteria rather than modifying the intervention itself.
Adapting the method to continuous rather than discrete time periods might yield smoother descriptions of effect trajectories.

Load-bearing premise

The candidate marginal structural models are flexible enough to capture the true pattern of time variation, and the standardization procedure isolates covariate shift without residual confounding or model misspecification.

What would settle it

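Apply the framework to simulated EHR data in which covariate-specific treatment effects are truly constant across time but the distribution of patient covariates shifts. After standardizing to a fixed reference population, the selected model should indicate constant effects, and the metric should attribute all apparent change in the time-specific effects to covariate shift.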
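A check of this kind can be sketched with a small simulation. Everything below is an assumption made for illustration: a single binary effect modifier, randomised treatment (so that simple differences in means are unbiased, which would not hold in real EHR data), the specific effect sizes, and the `share` summary, which is a hypothetical ratio and not the paper's metric.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 24, 4000                        # months and patients per month (toy values)
months = np.arange(M)
p_x = np.linspace(0.2, 0.7, M)         # share of patients with X = 1 drifts upward

def tau(x):
    """Covariate-specific effect; identical in every month by construction."""
    return -10.0 - 6.0 * x

cate = np.empty((M, 2))                # estimated effect within each (month, X) cell
p = np.empty((M, 2))                   # estimated covariate distribution per month
for m in months:
    x = rng.binomial(1, p_x[m], n)
    a = rng.binomial(1, 0.5, n)        # randomised treatment, a simplifying assumption
    y = 2.0 * x + a * tau(x) + rng.normal(0, 3, n)
    for v in (0, 1):
        cell = x == v
        cate[m, v] = y[cell & (a == 1)].mean() - y[cell & (a == 0)].mean()
    p[m] = [1 - x.mean(), x.mean()]

raw = (p * cate).sum(axis=1)           # time-specific effects in each month's own population
std = cate @ p[0]                      # same effects standardized to the month-0 population

slope_raw = np.polyfit(months, raw, 1)[0]
slope_std = np.polyfit(months, std, 1)[0]
share = 1.0 - slope_std / slope_raw    # hypothetical "share of trend due to covariate shift"
print(slope_raw, slope_std, share)
```

In this setup the unstandardized effects trend downward only because more patients with X = 1 are treated in later months, while the standardized effects stay roughly flat.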

Figures

Figures reproduced from arXiv: 2512.19553 by Alexander W. Levis, Catherine Lee, David Arterburn, Heidi Fischer, Luke Benz, Rajarshi Mukherjee, Rui Wang, Sebastien Haneuse, Susan M. Shortreed.

**Figure 1.** Distribution of patient characteristics in the DURABLE electronic health record database over calendar time for eligible patients undergoing bariatric surgery between 2005 and 2011. Covariate mean values (continuous covariates) or frequencies (binary/categorical covariates) are plotted by month of surgery. m = 1 corresponds to January 2005 and m = 84 corresponds to December 2011.

**Figure 2.** Overview of cross-fitting procedure for evaluating $\widehat{L}(\widehat{\psi}_k)$. Given the appearance of $\chi_m(P)$ in the pseudorisk, we replace $\chi_m(P)$ with estimates based on its influence function, i.e., we proceed with the influence function-based estimator of the pseudorisk, $\widehat{L}(\widehat{\psi}_k) = \mathbb{P}_n\left[\sum_{m=1}^{M} w(m)\left\{\psi_k(m; \widehat{\beta}_k)^2 - 2\,\psi_k(m; \widehat{\beta}_k)\,\dot{\chi}_m(O; \widehat{P})\right\}\right]$.

**Figure 3.** (A) Estimates of calendar time-specific treatment effects $\widehat{\chi}_m$ (dots) and trends based on five candidate marginal structural models, $\psi(m; \widehat{\beta})$. (B) Estimates of cross-trial effects $\widehat{\chi}_{j,m}$ for select comparisons.

Original abstract

Standard practice in electronic health record (EHR)-based studies evaluating the comparative effectiveness of bariatric surgery relative to no surgery is to estimate and report a constant treatment effect across calendar time. However, real-world treatment strategies can evolve, particularly when comparators include standard of care or surgical procedures where techniques may improve, making it clinically important to ascertain whether efficacy of bariatric surgery has changed over time. Efforts to determine whether treatment efficacy itself is evolving are complicated by changing patient populations, with potential covariate shift in key effect modifiers. Through a comprehensive analysis of EHR data from Kaiser Permanente following two bariatric surgical procedures compared to standard of care, we develop a statistical framework to estimate calendar time-specific average treatment effects and describe both how and why effects vary across treatment initiation time in EHR-based studies. Our approach projects doubly robust, time-specific treatment effect estimates onto candidate marginal structural models and uses a model selection procedure to best describe how effects vary by treatment initiation time. We further introduce a novel summary metric, based on standardization analysis, to quantify the role of covariate shift in explaining observed effect changes and disentangle changes in treatment effects from changes in the patient population receiving treatment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a usable way to project time-specific doubly robust estimates onto marginal structural models and separate effect changes from covariate shifts in EHR data, but the finite candidate set for the projection risks misspecification bias.

The main thing here is a framework that takes calendar-time-specific doubly robust estimates of treatment effects, projects them onto a set of candidate marginal structural models, picks the best one via model selection, and then uses a standardization step to measure how much of the observed change comes from shifts in the patient population rather than true changes in efficacy. That combination is not standard in the time-varying causal literature they cite, and it directly targets a practical headache in bariatric surgery studies where techniques and patient mixes evolve together.

Referee Report

2 major / 2 minor

Summary. The paper develops a statistical framework for EHR-based studies of bariatric surgery versus standard care. It first obtains calendar-time-specific average treatment effects via doubly robust estimation, then projects these estimates onto a discrete collection of candidate marginal structural models (MSMs), applies a model-selection procedure to characterize how effects vary by treatment initiation time, and introduces a novel standardization-based summary metric to quantify the contribution of covariate shift to observed changes while attempting to separate it from changes in treatment efficacy.

Significance. If the central claims hold, the framework offers a practical way to move beyond constant-effect assumptions in observational EHR analyses of procedures whose techniques and patient populations evolve over time. By combining established doubly robust and MSM tools with an explicit decomposition for covariate shift, it could improve clinical interpretability of time-varying effects; the Kaiser Permanente application provides a concrete demonstration, though the absence of reported sensitivity analyses leaves the practical gain uncertain.

major comments (2)

[Abstract / framework description] The projection step onto a finite set of candidate MSMs is load-bearing for both the selected description of time variation and the subsequent standardization metric. If the true dependence of the effect on initiation time lies outside the span of the candidates (e.g., non-monotonic or threshold patterns common when surgical techniques change), the selected MSM will be misspecified and the covariate-shift decomposition will inherit systematic error. No argument or diagnostic is supplied showing that the candidate library is rich enough for the bariatric-surgery setting.
[Abstract / standardization analysis] The standardization-based summary metric is presented as isolating covariate shift, yet its validity rests on correct specification of both the outcome and treatment models used in the doubly robust step and on the MSM chosen in the projection step. Because all components are estimated from the same EHR sample, any residual confounding or model misspecification propagates directly into the metric; the manuscript provides no sensitivity checks or alternative specifications to bound this propagation.

minor comments (2)

[Abstract] The abstract refers to 'candidate marginal structural models' without enumerating the specific functional forms considered (constant, linear, piecewise, etc.). Adding an explicit list or reference to the supplementary material would clarify the scope of the projection.
[Abstract] Notation for the novel summary metric is introduced only descriptively; an explicit formula (e.g., in terms of the standardized contrast under the selected MSM) would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and robustness of the manuscript. We address each major comment point by point below and have revised the paper to incorporate additional diagnostics and sensitivity analyses.

Referee: Abstract and framework description: the projection step onto a finite set of candidate MSMs is load-bearing for both the selected description of time variation and the subsequent standardization metric. If the true dependence of the effect on initiation time lies outside the span of the candidates (e.g., non-monotonic or threshold patterns common when surgical techniques change), the selected MSM will be misspecified and the covariate-shift decomposition will inherit systematic error. No argument or diagnostic is supplied showing that the candidate library is rich enough for the bariatric-surgery setting.

Authors: We agree that the finite candidate library is a key modeling choice. Our original library included constant, linear, quadratic, and piecewise-constant specifications in calendar time. In the revision we have added a dedicated subsection on library sensitivity that reports projection residuals, cross-validated prediction error for the time-specific effects, and results under an expanded library that includes natural cubic splines with 3-5 knots. In the Kaiser Permanente application the linear specification was selected by the procedure and yielded residuals comparable to the spline-augmented library; the estimated covariate-shift contribution changed by less than 8% across these specifications. We now explicitly discuss that while highly non-monotonic patterns (e.g., abrupt technique shifts) could in principle lie outside the span, the gradual evolution of bariatric procedures makes low-order polynomials clinically plausible, and the added diagnostics allow readers to assess this assumption directly. revision: yes
Referee: The standardization-based summary metric is presented as isolating covariate shift, yet its validity rests on correct specification of both the outcome and treatment models used in the doubly robust step and on the MSM chosen in the projection step. Because all components are estimated from the same EHR sample, any residual confounding or model misspecification propagates directly into the metric; the manuscript provides no sensitivity checks or alternative specifications to bound this propagation.

Authors: We concur that the standardization metric inherits dependence on the nuisance models and the selected MSM. The doubly robust estimators used for the calendar-time-specific effects already confer protection against misspecification of either the outcome or treatment model (provided the other is consistent). In the revised manuscript we have added a sensitivity section that re-computes the metric under (i) alternative machine-learning estimators for the nuisance functions (random forests and neural nets in addition to the original super learner), (ii) two additional MSM specifications, and (iii) a simple bounding exercise that inflates the estimated effects by 10-20% to proxy residual confounding. Across these checks the reported contribution of covariate shift to the observed decline in treatment effect varied by at most 12 percentage points and remained statistically distinguishable from zero, supporting the original qualitative conclusion. revision: yes

Circularity Check

0 steps flagged

No circularity: standard DR projection and standardization remain independent of inputs

The derivation obtains calendar-time-specific doubly robust estimates, projects them onto a discrete collection of candidate marginal structural models, selects via a criterion, and computes a standardization-based summary metric for covariate shift. None of these steps reduce by construction to the input estimates or to self-citations; the MSM candidates and standardization decomposition are external modeling choices whose validity rests on separate assumptions (coverage of the true time pattern, no residual confounding) rather than tautological re-expression of the same quantities. The framework is therefore self-contained against external benchmarks and receives score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The framework rests on standard causal identification assumptions plus modeling choices for the marginal structural models and the standardization procedure; the new summary metric is introduced without external validation.

free parameters (1)

parameters of candidate marginal structural models
Coefficients describing how treatment effects vary with treatment initiation time are estimated from the projected time-specific estimates.

axioms (2)

domain assumption: No unmeasured confounding for the treatment effect at each calendar time
Required for the doubly robust estimators to identify causal effects in the EHR observational data.
domain assumption: Positivity (overlap) at each time point
Needed for stable doubly robust estimation across treatment initiation times.
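To make concrete what these two assumptions support, the sketch below shows one standard doubly robust estimator, augmented inverse probability weighting (AIPW), applied to a single calendar period. It is an illustration under stated assumptions rather than the paper's estimator: the page does not say which doubly robust form is used, the nuisance models here are plain logistic and linear regressions without cross-fitting, and the function names, clipping threshold, and simulated data are hypothetical.

```python
import numpy as np

def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (a - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def aipw_ate(y, a, X, clip=0.01):
    """AIPW estimate of E[Y(1) - Y(0)] and its standard error."""
    pi = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, a)))
    pi = np.clip(pi, clip, 1 - clip)   # guard against near-violations of positivity
    # Outcome regressions fitted separately in treated and untreated patients.
    mu1 = X @ np.linalg.lstsq(X[a == 1], y[a == 1], rcond=None)[0]
    mu0 = X @ np.linalg.lstsq(X[a == 0], y[a == 0], rcond=None)[0]
    # Outcome-model contrast plus inverse-probability-weighted residual corrections.
    phi = mu1 - mu0 + a * (y - mu1) / pi - (1 - a) * (y - mu0) / (1 - pi)
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(y))

# Simulated single-period data (assumed): one confounder, true effect of -15.
rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * x)))
y = 1.0 + 2.0 * x - 15.0 * a + rng.normal(size=n)
est, se = aipw_ate(y, a, X)
print(f"AIPW estimate {est:.2f} (SE {se:.3f})")
```

Running `aipw_ate` separately within each initiation period would give the kind of time-specific estimates that the projection step takes as input. The clipping of `pi` is where a positivity problem would show up in practice.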

invented entities (1)

standardization-based summary metric for covariate shift contribution · no independent evidence
purpose: Quantifies the portion of observed effect change attributable to shifts in patient population rather than changes in treatment efficacy
Newly defined in the paper; no independent evidence or external validation provided in the abstract.
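
One way to make the role of such a metric concrete is a two-term decomposition of the change in effect between two initiation times. The notation below is an assumption introduced for illustration and is not taken from the paper: $\tau_m(x)$ is the covariate-specific effect of treatment initiated at time $m$, $P_m$ is the covariate distribution of patients treated at time $m$, and $\chi_{j,m}$ is the effect of time-$m$ treatment standardized to the time-$j$ population (the Figure 3 caption mentions cross-trial effects with two indices, but their convention is not given on this page). The ratio $\rho_{j,m}$ is likewise only a hypothetical example of a summary.

```latex
\begin{align*}
\chi_m &= \int \tau_m(x)\, dP_m(x), \qquad
\chi_{j,m} = \int \tau_m(x)\, dP_j(x), \\
\chi_m - \chi_j
  &= \underbrace{\left(\chi_m - \chi_{j,m}\right)}_{\text{covariate shift: } \int \tau_m \, d(P_m - P_j)}
   + \underbrace{\left(\chi_{j,m} - \chi_j\right)}_{\text{effect change: } \int (\tau_m - \tau_j)\, dP_j}, \\
\rho_{j,m} &= \frac{\chi_m - \chi_{j,m}}{\chi_m - \chi_j}
  \quad \text{(defined when } \chi_m \neq \chi_j\text{)}.
\end{align*}
```

Under this reading, $\rho_{j,m}$ near 1 would mean the change is driven mostly by who receives treatment, and a value near 0 would mean it is driven mostly by changes in the covariate-specific effects. The split depends on the order of standardization: holding $\tau_j$ fixed instead of $\tau_m$ in the first term gives a different, equally valid decomposition.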

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "projects doubly robust, time-specific treatment effect estimates onto candidate marginal structural models and uses a model selection procedure"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "novel summary metric, based on standardization analysis, to quantify the role of covariate shift"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Sharp instruments for classifying compliers and generalizing causal effects

Edward H Kennedy, Sivaraman Balakrishnan, and Max G’Sell. Sharp instruments for classifying compliers and generalizing causal effects. The Annals of Statistics, 48(4):2008–2030,

work page 2008
[2]

H., Balakrishnan, S., and Wasserman, L

doi: 10.1093/biomet/asad017. Eric Polley, Erin LeDell, Chris Kennedy, and Mark van der Laan. Superlearner: Super learner prediction. https://CRAN.R-project.org/package=SuperLearner,

work page doi:10.1093/biomet/asad017
[3]

doi: 10.1214/20-AOAS1386. Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York, 1st edition,

work page doi:10.1214/20-aoas1386
[4]

Marvin N

doi: 10.1016/j.csda.2008.02.016. Marvin N. Wright and Andreas Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17,

work page doi:10.1016/j.csda.2008.02.016 2008
