pith. machine review for the scientific record.

arxiv: 2604.21658 · v1 · submitted 2026-04-23 · 📊 stat.ME

Recognition: unknown

Estimator-Aligned Prospective Sample Size Determination for Designs Using Inverse Probability of Treatment Weighting

Daeyoung Lim, Taekwon Hong, Woojung Bae, Yong Ma

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:09 UTC · model grok-4.3

classification 📊 stat.ME
keywords: sample size determination · inverse probability of treatment weighting · propensity score · marginal structural model · generalized estimating equations · observational studies · causal inference · variance estimation

The pith

Merging propensity score and outcome models into one estimating system targets the IPTW estimator's variance for accurate sample size planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for determining sample sizes in observational studies that use inverse probability of treatment weighting. Standard approaches often ignore the uncertainty from estimating the propensity scores, leading to miscalibrated designs. By combining the propensity score model and the marginal structural model into a unified system of estimating equations, the new framework directly accounts for this variability in the large-sample variance. This leads to prospective designs that better match the actual performance of the causal estimator at analysis time. The approach estimates the design variance from pilot data with a bootstrap stabilization step and applies uniformly to binary, count, and continuous outcomes.

Core claim

By merging the propensity score model and marginal structural model into a single system of estimating equations using generalized estimating equations and stacked M-estimation, the large-sample variance of the IPTW estimator is directly targeted for sample size determination, propagating the uncertainty from nuisance parameter estimation and improving power calibration compared to methods that treat weights as fixed.

What carries the argument

The stacked system of estimating equations that jointly solves for propensity scores and the marginal structural model parameters to derive the variance factor used in sample size formulas.
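To make the mechanism concrete, here is a minimal sketch of the stacking idea — our own toy construction, not the paper's code. It pairs a logistic propensity model with an identity-link IPTW marginal structural model for a continuous outcome, solves the joint estimating equations by Newton's method with a numerical Jacobian, and reads the treatment-effect variance off the joint sandwich. The data-generating values, the finite-difference Jacobian, and the fixed-weight comparison are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * X)))    # true propensity score
A = rng.binomial(1, e_true)                         # treatment
Y = 1.0 + 0.5 * A + 0.7 * X + rng.normal(size=n)    # outcome; marginal ATE = 0.5

def psi(theta):
    """Per-subject stacked estimating functions (n x 4 array)."""
    g0, g1, b0, b1 = theta
    e = 1.0 / (1.0 + np.exp(-(g0 + g1 * X)))        # fitted propensity
    w = A / e + (1 - A) / (1 - e)                   # IPTW weights
    r = Y - (b0 + b1 * A)                           # MSM residual (identity link)
    # rows: logistic score (2 eqs) stacked with weighted MSM score (2 eqs)
    return np.column_stack([A - e, (A - e) * X, w * r, w * r * A])

theta = np.zeros(4)                                 # (gamma0, gamma1, beta0, beta1)
for _ in range(50):                                 # Newton on the averaged equations
    G = psi(theta).mean(axis=0)
    J = np.empty((4, 4))
    for j in range(4):                              # forward-difference Jacobian
        tp = theta.copy()
        tp[j] += 1e-6
        J[:, j] = (psi(tp).mean(axis=0) - G) / 1e-6
    step = np.linalg.solve(J, -G)
    theta += step
    if np.abs(step).max() < 1e-10:
        break

U = psi(theta)
Ahat = -J                                           # A = E[-d psi / d theta]
Bhat = U.T @ U / n                                  # B = E[psi psi^T]
V = np.linalg.solve(Ahat, Bhat) @ np.linalg.inv(Ahat).T / n
se_joint = np.sqrt(V[3, 3])                         # ATE SE, weight estimation propagated

# Fixed-weight ("RCT-style") comparison: sandwich from the MSM block alone.
An, Bn = -J[2:, 2:], U[:, 2:].T @ U[:, 2:] / n
Vn = np.linalg.solve(An, Bn) @ np.linalg.inv(An).T / n
se_fixed = np.sqrt(Vn[1, 1])
print(f"ATE {theta[3]:.3f}, joint SE {se_joint:.4f}, fixed-weight SE {se_fixed:.4f}")
```

The off-diagonal blocks of `Ahat` carry the cross-term contributions from the propensity nuisance parameters — exactly the terms a fixed-weight formula drops.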

If this is right

  • Sample sizes chosen this way will produce studies whose actual power more closely matches the planned power than those based on RCT formulas.
  • The method works uniformly for binary, count, and continuous outcomes through appropriate GEE link functions.
  • Bootstrap stabilization from pilot data accounts for both within-sample and between-pilot variability in variance estimates.
  • Performance gains are largest when weights are unstable or outcomes are sparse or heavy-tailed.
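To make the power-calibration bullet concrete: once a large-sample variance factor is in hand, a standard Wald-type sample-size formula converts it into a design n. The function below is a generic sketch in our own notation (the paper's exact formula and symbols may differ); `var_factor` stands in for whatever the stacked-sandwich procedure returns.

```python
import math
from statistics import NormalDist

def n_required(var_factor, delta, alpha=0.05, power=0.8):
    """Smallest n giving Wald power >= target when
    sqrt(n) * (estimate - truth) ~ N(0, var_factor) and the effect is delta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_b = NormalDist().inv_cdf(power)           # power quantile
    return math.ceil(var_factor * (z_a + z_b) ** 2 / delta ** 2)

# e.g. variance factor 6.0 from pilot data, target effect 0.4:
print(n_required(var_factor=6.0, delta=0.4))    # → 295
```

The same formula with a fixed-weight variance factor would give a different (typically miscalibrated) n — the whole point of targeting the IPTW estimator's own variance.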

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This alignment could prevent both underpowered and unnecessarily large observational studies in causal research.
  • Similar stacking ideas might improve sample size planning for other estimators like augmented IPTW or matching-based methods.
  • Extensions to time-to-event outcomes or longitudinal data would follow the same merged-equation principle.

Load-bearing premise

Pilot data must be representative of the future study population, and the large-sample approximations must accurately reflect the variance for the intended sample sizes.

What would settle it

If simulations or real applications show that the actual variance or power of the IPTW estimator differs markedly from the value predicted by this method, the framework's accuracy would be disproven.

Figures

Figures reproduced from arXiv: 2604.21658 by Daeyoung Lim, Taekwon Hong, Woojung Bae, Yong Ma.

Figure 1. Pilot-based prospective sample size determination for IPTW marginal structural models.
read the original abstract

In observational studies, accurately characterizing variance is critical for sample size determination, yet unaccounted-for variability from propensity score estimation and the resulting weights limit the accuracy of standard variance approximations for design. Existing approaches often rely on heuristics or randomized controlled trial (RCT) formulas that treat weights as fixed, potentially misaligning prospective design with the causal estimator used at analysis. We propose an estimator-aligned framework for prospective sample size determination based on generalized estimating equations (GEE) and stacked M-estimation. By merging the propensity score model and marginal structural model (MSM) into a single system of estimating equations, the method propagates nuisance-model uncertainty and directly targets the large-sample variance of the IPTW estimator. For study planning, we estimate a pilot-based large-sample variance factor and introduce a bootstrap stabilization procedure that accounts for both within- and between-pilot variability. The framework applies uniformly across binary, count, and continuous outcomes through link-specific GEE representations under a common design principle. Simulation studies motivated by post-marketing safety and healthcare cost applications demonstrate that anchoring design to this variance improves power calibration relative to conventional RCT-style formulas, particularly in settings with weight instability, outcome sparsity, or heavy-tailed variability.
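The abstract describes the bootstrap stabilization step only at a high level. As an illustration of the general shape such a procedure could take — a toy variance functional and an 80th-percentile guard of our own choosing, not the authors' algorithm — one resamples the pilot, recomputes the variance factor per resample, and takes an upper quantile so that an optimistically small pilot does not drive the design:

```python
import numpy as np

rng = np.random.default_rng(1)
pilot = rng.standard_t(df=4, size=150)   # toy heavy-tailed pilot outcome
B = 2000                                 # number of pilot bootstraps

def var_factor(x):
    """Toy large-sample variance factor: here just the sample variance,
    standing in for the stacked-sandwich factor of the real procedure."""
    return x.var(ddof=1)

# bootstrap distribution of the factor (within-pilot variability)
boot = np.array([var_factor(rng.choice(pilot, size=pilot.size, replace=True))
                 for _ in range(B)])
point = var_factor(pilot)
stabilized = np.quantile(boot, 0.80)     # guard against an optimistic pilot
print(f"pilot factor {point:.2f}, stabilized (80th pct) {stabilized:.2f}")
```

The stabilized factor, not the raw pilot estimate, would then feed the sample-size formula; how the paper combines within- and between-pilot variability is exactly what the referee asks to see spelled out.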

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes an estimator-aligned framework for prospective sample size determination in observational studies using inverse probability of treatment weighting (IPTW). It merges the propensity score model and marginal structural model (MSM) into a single system of estimating equations via generalized estimating equations (GEE) and stacked M-estimation to propagate nuisance-parameter uncertainty and directly target the large-sample variance of the IPTW estimator. A pilot-based large-sample variance factor is estimated, with an added bootstrap stabilization procedure accounting for within- and between-pilot variability. The approach is presented uniformly for binary, count, and continuous outcomes through link-specific GEE representations, and simulation studies motivated by post-marketing safety and healthcare cost applications report improved power calibration relative to conventional RCT-style formulas, especially under weight instability, outcome sparsity, or heavy-tailed variability.

Significance. If the derivations and simulation results hold, the work addresses a practical gap in causal inference study planning by aligning design-stage variance calculations with the actual IPTW analysis estimator rather than relying on heuristics or fixed-weight approximations. Grounding the procedure in standard M-estimation theory for joint sandwich variance, combined with the pilot-based factor and bootstrap stabilization, provides a coherent extension that could improve power calibration in applied settings with unstable weights. This is a strength when the central claim is supported by explicit derivations and reproducible code.

major comments (1)
  1. The central claim that stacking the propensity-score and IPTW-weighted MSM equations produces the joint sandwich variance including weight-estimation uncertainty follows directly from M-estimation theory, but the manuscript must explicitly derive or display the relevant sandwich formula (including the cross-term contributions) to confirm it is not merely sketched; without this, the 'directly targets' assertion remains load-bearing but unverified in the provided description.
minor comments (3)
  1. The bootstrap stabilization procedure is described at a high level in the abstract; the manuscript should include pseudocode or explicit steps for how within- and between-pilot variability are combined, as this is key for reproducibility.
  2. Notation for the pilot-based large-sample variance factor should be introduced with a clear definition and symbol early in the methods section to avoid ambiguity when it is later used in the sample-size formula.
  3. The simulation section would benefit from a table summarizing achieved versus nominal power across scenarios (including weight instability cases) to make the reported improvement over RCT formulas more transparent and quantifiable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive report. The single major comment identifies a clear opportunity to strengthen the exposition of the M-estimation theory, which we address directly below.

read point-by-point responses
  1. Referee: The central claim that stacking the propensity-score and IPTW-weighted MSM equations produces the joint sandwich variance including weight-estimation uncertainty follows directly from M-estimation theory, but the manuscript must explicitly derive or display the relevant sandwich formula (including the cross-term contributions) to confirm it is not merely sketched; without this, the 'directly targets' assertion remains load-bearing but unverified in the provided description.

    Authors: We agree that an explicit derivation would improve verifiability. In the revised manuscript we will insert a new subsection (Methods, Section 3.2) that presents the stacked estimating function, derives the corresponding sandwich variance estimator, and explicitly displays the A and B matrices together with the cross-term contributions arising from the propensity-score nuisance parameters. This addition will confirm that the procedure directly targets the large-sample variance of the IPTW estimator without relying on fixed-weight approximations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard M-estimation to stacked equations

full rationale

The central step merges the propensity score estimating equations with the IPTW-weighted MSM equations into a joint system whose sandwich variance is the standard asymptotic result from M-estimation theory; this is not a self-definition or a fitted input renamed as a prediction. The pilot-based large-sample variance factor is estimated from separate data and then used prospectively for sample-size planning, with bootstrap stabilization applied to that external estimate. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the abstract or framework description. The procedure therefore remains self-contained against external benchmarks and does not reduce the claimed variance or design target to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard asymptotic theory for M-estimators and GEE plus the assumption that pilot data can be used to estimate the variance factor reliably; no new entities are introduced.

free parameters (1)
  • pilot-based large-sample variance factor
    Estimated from pilot data to scale the sample size calculation for the target study.
axioms (2)
  • standard math Standard regularity conditions hold for consistency and asymptotic normality of the stacked M-estimators.
    Required to justify targeting the large-sample variance of the IPTW estimator.
  • domain assumption Pilot data are representative of the target population for variance estimation.
    Necessary for the bootstrap stabilization to produce a usable variance factor.

pith-pipeline@v0.9.0 · 5518 in / 1510 out tokens · 50402 ms · 2026-05-09T21:09:07.259737+00:00 · methodology

discussion (0)

