pith. machine review for the scientific record.

arxiv: 2604.21658 · v1 · submitted 2026-04-23 · 📊 stat.ME

Recognition: unknown

Estimator-Aligned Prospective Sample Size Determination for Designs Using Inverse Probability of Treatment Weighting

Daeyoung Lim, Taekwon Hong, Woojung Bae, Yong Ma

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:09 UTC · model grok-4.3

classification 📊 stat.ME
keywords: sample size determination · inverse probability of treatment weighting · propensity score · marginal structural model · generalized estimating equations · observational studies · causal inference · variance estimation

The pith

Merging propensity score and outcome models into one estimating system targets the IPTW estimator's variance for accurate sample size planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for determining sample sizes in observational studies that use inverse probability of treatment weighting. Standard approaches often ignore the uncertainty from estimating the propensity scores, leading to miscalibrated designs. By combining the propensity score model and the marginal structural model into a unified system of estimating equations, the new framework directly accounts for this variability in the large-sample variance. This leads to prospective designs that better match the actual performance of the causal estimator at analysis time. The approach estimates the design variance from pilot data with a bootstrap stabilization step and applies uniformly to binary, count, and continuous outcomes.

Core claim

By merging the propensity score model and marginal structural model into a single system of estimating equations using generalized estimating equations and stacked M-estimation, the large-sample variance of the IPTW estimator is directly targeted for sample size determination, propagating the uncertainty from nuisance parameter estimation and improving power calibration compared to methods that treat weights as fixed.

What carries the argument

The stacked system of estimating equations that jointly solves for propensity scores and the marginal structural model parameters to derive the variance factor used in sample size formulas.
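To make the mechanism concrete, here is a minimal sketch of the stacking idea — our own toy construction, not the paper's code. It pairs a logistic propensity model with an identity-link IPTW marginal structural model for a continuous outcome, solves the joint estimating equations by Newton's method with a numerical Jacobian, and reads the treatment-effect variance off the joint sandwich. The data-generating values, the finite-difference Jacobian, and the fixed-weight comparison are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * X)))    # true propensity score
A = rng.binomial(1, e_true)                         # treatment
Y = 1.0 + 0.5 * A + 0.7 * X + rng.normal(size=n)    # outcome; marginal ATE = 0.5

def psi(theta):
    """Per-subject stacked estimating functions (n x 4 array)."""
    g0, g1, b0, b1 = theta
    e = 1.0 / (1.0 + np.exp(-(g0 + g1 * X)))        # fitted propensity
    w = A / e + (1 - A) / (1 - e)                   # IPTW weights
    r = Y - (b0 + b1 * A)                           # MSM residual (identity link)
    # rows: logistic score (2 eqs) stacked with weighted MSM score (2 eqs)
    return np.column_stack([A - e, (A - e) * X, w * r, w * r * A])

theta = np.zeros(4)                                 # (gamma0, gamma1, beta0, beta1)
for _ in range(50):                                 # Newton on the averaged equations
    G = psi(theta).mean(axis=0)
    J = np.empty((4, 4))
    for j in range(4):                              # forward-difference Jacobian
        tp = theta.copy()
        tp[j] += 1e-6
        J[:, j] = (psi(tp).mean(axis=0) - G) / 1e-6
    step = np.linalg.solve(J, -G)
    theta += step
    if np.abs(step).max() < 1e-10:
        break

U = psi(theta)
Ahat = -J                                           # A = E[-d psi / d theta]
Bhat = U.T @ U / n                                  # B = E[psi psi^T]
V = np.linalg.solve(Ahat, Bhat) @ np.linalg.inv(Ahat).T / n
se_joint = np.sqrt(V[3, 3])                         # ATE SE, weight estimation propagated

# Fixed-weight ("RCT-style") comparison: sandwich from the MSM block alone.
An, Bn = -J[2:, 2:], U[:, 2:].T @ U[:, 2:] / n
Vn = np.linalg.solve(An, Bn) @ np.linalg.inv(An).T / n
se_fixed = np.sqrt(Vn[1, 1])
print(f"ATE {theta[3]:.3f}, joint SE {se_joint:.4f}, fixed-weight SE {se_fixed:.4f}")
```

The off-diagonal blocks of `Ahat` carry the cross-term contributions from the propensity nuisance parameters — exactly the terms a fixed-weight formula drops.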

If this is right

  • Sample sizes chosen this way will produce studies whose actual power more closely matches the planned power than those based on RCT formulas.
  • The method works uniformly for binary, count, and continuous outcomes through appropriate GEE link functions.
  • Bootstrap stabilization from pilot data accounts for both within-sample and between-pilot variability in variance estimates.
  • Performance gains are largest when weights are unstable or outcomes are sparse or heavy-tailed.
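To make the power-calibration bullet concrete: once a large-sample variance factor is in hand, a standard Wald-type sample-size formula converts it into a design n. The function below is a generic sketch in our own notation (the paper's exact formula and symbols may differ); `var_factor` stands in for whatever the stacked-sandwich procedure returns.

```python
import math
from statistics import NormalDist

def n_required(var_factor, delta, alpha=0.05, power=0.8):
    """Smallest n giving Wald power >= target when
    sqrt(n) * (estimate - truth) ~ N(0, var_factor) and the effect is delta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_b = NormalDist().inv_cdf(power)           # power quantile
    return math.ceil(var_factor * (z_a + z_b) ** 2 / delta ** 2)

# e.g. variance factor 6.0 from pilot data, target effect 0.4:
print(n_required(var_factor=6.0, delta=0.4))    # → 295
```

The same formula with a fixed-weight variance factor would give a different (typically miscalibrated) n — the whole point of targeting the IPTW estimator's own variance.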

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This alignment could prevent both underpowered and unnecessarily large observational studies in causal research.
  • Similar stacking ideas might improve sample size planning for other estimators like augmented IPTW or matching-based methods.
  • Extensions to time-to-event outcomes or longitudinal data would follow the same merged-equation principle.

Load-bearing premise

Pilot data must be representative of the future study population, and the large-sample approximations must accurately reflect the variance for the intended sample sizes.

What would settle it

If simulations or real applications show that the actual variance or power of the IPTW estimator differs markedly from the value predicted by this method, the framework's accuracy would be disproven.

Figures

Figures reproduced from arXiv: 2604.21658 by Daeyoung Lim, Taekwon Hong, Woojung Bae, Yong Ma.

Figure 1. Pilot-based prospective sample size determination for IPTW marginal structural models.
read the original abstract

In observational studies, accurately characterizing variance is critical for sample size determination, yet unaccounted-for variability from propensity score estimation and the resulting weights limit the accuracy of standard variance approximations for design. Existing approaches often rely on heuristics or randomized controlled trial (RCT) formulas that treat weights as fixed, potentially misaligning prospective design with the causal estimator used at analysis. We propose an estimator-aligned framework for prospective sample size determination based on generalized estimating equations (GEE) and stacked M-estimation. By merging the propensity score model and marginal structural model (MSM) into a single system of estimating equations, the method propagates nuisance-model uncertainty and directly targets the large-sample variance of the IPTW estimator. For study planning, we estimate a pilot-based large-sample variance factor and introduce a bootstrap stabilization procedure that accounts for both within- and between-pilot variability. The framework applies uniformly across binary, count, and continuous outcomes through link-specific GEE representations under a common design principle. Simulation studies motivated by post-marketing safety and healthcare cost applications demonstrate that anchoring design to this variance improves power calibration relative to conventional RCT-style formulas, particularly in settings with weight instability, outcome sparsity, or heavy-tailed variability.
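The abstract describes the bootstrap stabilization step only at a high level. As an illustration of the general shape such a procedure could take — a toy variance functional and an 80th-percentile guard of our own choosing, not the authors' algorithm — one resamples the pilot, recomputes the variance factor per resample, and takes an upper quantile so that an optimistically small pilot does not drive the design:

```python
import numpy as np

rng = np.random.default_rng(1)
pilot = rng.standard_t(df=4, size=150)   # toy heavy-tailed pilot outcome
B = 2000                                 # number of pilot bootstraps

def var_factor(x):
    """Toy large-sample variance factor: here just the sample variance,
    standing in for the stacked-sandwich factor of the real procedure."""
    return x.var(ddof=1)

# bootstrap distribution of the factor (within-pilot variability)
boot = np.array([var_factor(rng.choice(pilot, size=pilot.size, replace=True))
                 for _ in range(B)])
point = var_factor(pilot)
stabilized = np.quantile(boot, 0.80)     # guard against an optimistic pilot
print(f"pilot factor {point:.2f}, stabilized (80th pct) {stabilized:.2f}")
```

The stabilized factor, not the raw pilot estimate, would then feed the sample-size formula; how the paper combines within- and between-pilot variability is exactly what the referee asks to see spelled out.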

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes an estimator-aligned framework for prospective sample size determination in observational studies using inverse probability of treatment weighting (IPTW). It merges the propensity score model and marginal structural model (MSM) into a single system of estimating equations via generalized estimating equations (GEE) and stacked M-estimation to propagate nuisance-parameter uncertainty and directly target the large-sample variance of the IPTW estimator. A pilot-based large-sample variance factor is estimated, with an added bootstrap stabilization procedure accounting for within- and between-pilot variability. The approach is presented uniformly for binary, count, and continuous outcomes through link-specific GEE representations, and simulation studies motivated by post-marketing safety and healthcare cost applications report improved power calibration relative to conventional RCT-style formulas, especially under weight instability, outcome sparsity, or heavy-tailed variability.

Significance. If the derivations and simulation results hold, the work addresses a practical gap in causal inference study planning by aligning design-stage variance calculations with the actual IPTW analysis estimator rather than relying on heuristics or fixed-weight approximations. Grounding the procedure in standard M-estimation theory for joint sandwich variance, combined with the pilot-based factor and bootstrap stabilization, provides a coherent extension that could improve power calibration in applied settings with unstable weights. This is a strength when the central claim is supported by explicit derivations and reproducible code.

major comments (1)
  1. The central claim that stacking the propensity-score and IPTW-weighted MSM equations produces the joint sandwich variance including weight-estimation uncertainty follows directly from M-estimation theory, but the manuscript must explicitly derive or display the relevant sandwich formula (including the cross-term contributions) to confirm it is not merely sketched; without this, the 'directly targets' assertion remains load-bearing but unverified in the provided description.
minor comments (3)
  1. The bootstrap stabilization procedure is described at a high level in the abstract; the manuscript should include pseudocode or explicit steps for how within- and between-pilot variability are combined, as this is key for reproducibility.
  2. Notation for the pilot-based large-sample variance factor should be introduced with a clear definition and symbol early in the methods section to avoid ambiguity when it is later used in the sample-size formula.
  3. The simulation section would benefit from a table summarizing achieved versus nominal power across scenarios (including weight instability cases) to make the reported improvement over RCT formulas more transparent and quantifiable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive report. The single major comment identifies a clear opportunity to strengthen the exposition of the M-estimation theory, which we address directly below.

read point-by-point responses
  1. Referee: The central claim that stacking the propensity-score and IPTW-weighted MSM equations produces the joint sandwich variance including weight-estimation uncertainty follows directly from M-estimation theory, but the manuscript must explicitly derive or display the relevant sandwich formula (including the cross-term contributions) to confirm it is not merely sketched; without this, the 'directly targets' assertion remains load-bearing but unverified in the provided description.

    Authors: We agree that an explicit derivation would improve verifiability. In the revised manuscript we will insert a new subsection (Methods, Section 3.2) that presents the stacked estimating function, derives the corresponding sandwich variance estimator, and explicitly displays the A and B matrices together with the cross-term contributions arising from the propensity-score nuisance parameters. This addition will confirm that the procedure directly targets the large-sample variance of the IPTW estimator without relying on fixed-weight approximations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard M-estimation to stacked equations

full rationale

The central step merges the propensity score estimating equations with the IPTW-weighted MSM equations into a joint system whose sandwich variance is the standard asymptotic result from M-estimation theory; this is not a self-definition or a fitted input renamed as a prediction. The pilot-based large-sample variance factor is estimated from separate data and then used prospectively for sample-size planning, with bootstrap stabilization applied to that external estimate. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the abstract or framework description. The procedure therefore remains self-contained against external benchmarks and does not reduce the claimed variance or design target to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard asymptotic theory for M-estimators and GEE plus the assumption that pilot data can be used to estimate the variance factor reliably; no new entities are introduced.

free parameters (1)
  • pilot-based large-sample variance factor
    Estimated from pilot data to scale the sample size calculation for the target study.
axioms (2)
  • standard math Standard regularity conditions hold for consistency and asymptotic normality of the stacked M-estimators.
    Required to justify targeting the large-sample variance of the IPTW estimator.
  • domain assumption Pilot data are representative of the target population for variance estimation.
    Necessary for the bootstrap stabilization to produce a usable variance factor.

pith-pipeline@v0.9.0 · 5518 in / 1510 out tokens · 50402 ms · 2026-05-09T21:09:07.259737+00:00 · methodology

discussion (0)

