arxiv: 2605.17240 · v1 · pith:FAIZ7R4G · submitted 2026-05-17 · 📊 stat.ME

The FORSS Framework for Sample Size and Power Calculations With Win Statistics for Hierarchical Endpoints

Pith reviewed 2026-05-19 23:32 UTC · model grok-4.3

classification 📊 stat.ME
keywords sample size calculation · power analysis · win statistics · hierarchical endpoints · clinical trial design · super-sample approach · formula-based methods

The pith

The FORSS framework delivers accurate formula-based sample size and power calculations for win statistics on hierarchical endpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FORSS, a method that combines analytical formulas with super-samples drawn from a user-specified joint distribution of hierarchical endpoints. This lets trial designers input familiar marginal effects such as hazard ratios or risk differences, then quickly compute required sample sizes without running thousands of full trial simulations for each candidate size. Simulations across many scenarios confirm that the resulting power estimates stay close to those from brute-force simulation while keeping false positive rates near the target 5 percent level. The approach also shows that the strength of dependence between endpoints can change the projected power and thus the number of patients needed in a study like HEART-FID.

Core claim

FORSS is a formula-based super-sample framework that estimates the plug-in quantities required by analytical power and sample-size formulas for win statistics by generating large super-samples from specified marginal treatment effects and a flexible joint working distribution for the hierarchical endpoints, thereby avoiding the computational intensity of repeated full-trial simulations.

What carries the argument

The super-sample generation step within the FORSS framework, which produces large simulated populations to obtain accurate estimates of the population-level win probabilities and other quantities needed for closed-form power calculations.
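
To make the super-sample step concrete, here is a minimal sketch using only the Python standard library. It is not the paper's implementation. The two-level hierarchy (time to event, then a continuous score), the Gaussian copula, the exponential and normal marginals, the administrative follow-up cut-off, and all function names are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist()  # standard normal, used for the copula transform


def draw_super_sample(n, hazard, score_mean, rho, followup, rng):
    """Draw n (time, score) pairs linked by a Gaussian copula (assumed working model)."""
    sample = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Phi(-z1) is uniform on (0, 1), so this is an exponential event time;
        # a larger z1 gives a later event.
        time = -math.log(PHI.cdf(-z1)) / hazard
        # Administrative cut-off: patients still event-free are tied at `followup`.
        sample.append((min(time, followup), score_mean + z2))
    return sample


def compare(trt, ctl):
    """Hierarchical comparison: +1 treatment wins, -1 loses, 0 tie."""
    if trt[0] != ctl[0]:  # level 1: longer event-free time wins
        return 1 if trt[0] > ctl[0] else -1
    if trt[1] != ctl[1]:  # level 2: higher score wins
        return 1 if trt[1] > ctl[1] else -1
    return 0


def win_statistics(treated, control):
    """Plug-in win, loss, and tie probabilities over all treated-control pairs."""
    wins = losses = 0
    for t in treated:
        for c in control:
            result = compare(t, c)
            wins += result == 1
            losses += result == -1
    pairs = len(treated) * len(control)
    p_win, p_loss = wins / pairs, losses / pairs
    return {
        "win": p_win,
        "loss": p_loss,
        "tie": 1.0 - p_win - p_loss,
        "net_benefit": p_win - p_loss,
        "win_ratio": p_win / p_loss if p_loss > 0 else math.inf,
    }


# Hypothetical design inputs: hazard ratio 0.8, mean difference 0.3, rho 0.5.
# A real super-sample would be much larger than 500 per arm.
rng = random.Random(2026)
control = draw_super_sample(500, hazard=0.10, score_mean=0.0, rho=0.5, followup=3.0, rng=rng)
treated = draw_super_sample(500, hazard=0.10 * 0.8, score_mean=0.3, rho=0.5, followup=3.0, rng=rng)
print(win_statistics(treated, control))
```

The marginal effects enter only through `hazard` and `score_mean`, while `rho` carries the dependence. This mirrors the separation between marginal treatment effects and the joint working distribution described in the core claim.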

If this is right

  • Users can specify treatment effects using standard metrics like hazard ratios, mean differences, and risk differences for each endpoint.
  • The method maintains Type I error rates close to the nominal 5% level in evaluated scenarios.
  • Projected power and required sample sizes depend on how the hierarchical endpoints are jointly distributed.
  • FORSS reduces computation time relative to simulation-based power calculations for hierarchical endpoints.
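
The step from plug-in quantities to power and sample size can be sketched as follows. The paper's own formulas are not reproduced on this page, so this sketch assumes a generic normal approximation for the net benefit, with the variance taken from standard two-sample U-statistic theory and 1:1 allocation. It also uses the same variance under the null and the alternative, which is a simplification. All names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pvariance

STD_NORMAL = NormalDist()


def plug_in(kernel):
    """Net benefit and asymptotic variance from an n x m matrix of +1/0/-1 outcomes.

    kernel[i][j] compares treated patient i with control patient j.
    Under 1:1 allocation with total size N, Var(estimate) is about sigma2 / N.
    """
    n, m = len(kernel), len(kernel[0])
    row_means = [fmean(row) for row in kernel]
    col_means = [fmean([kernel[i][j] for i in range(n)]) for j in range(m)]
    delta = fmean(row_means)
    # First-order (projection) variance components of the U-statistic.
    sigma2 = 2.0 * (pvariance(row_means) + pvariance(col_means))
    return delta, sigma2


def power(n_total, delta, sigma2, alpha=0.05):
    """Approximate two-sided power for testing net benefit = 0."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(abs(delta) * math.sqrt(n_total / sigma2) - z_alpha)


def sample_size(target_power, delta, sigma2, alpha=0.05):
    """Smallest total N reaching the target power under the same approximation."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = STD_NORMAL.inv_cdf(target_power)
    return math.ceil((z_alpha + z_beta) ** 2 * sigma2 / delta ** 2)


# Hypothetical plug-in values, for illustration only.
n_required = sample_size(0.80, delta=0.17, sigma2=4.0 / 3.0)
print(n_required, power(n_required, delta=0.17, sigma2=4.0 / 3.0))
```

Because `delta` and `sigma2` are computed once from the super-sample, evaluating another candidate sample size costs one formula evaluation rather than a fresh batch of simulated trials.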

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Trial planners could run sensitivity checks over different joint distributions to assess how sample size recommendations change.
  • The super-sample idea might extend to settings with censoring or other data features common in clinical trials.
  • Similar plug-in estimation could support power calculations for other composite endpoint analyses beyond win statistics.
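
The sensitivity check in the first bullet could look like the sketch below. It is an illustration under assumed inputs, not an analysis from the paper. The hierarchy is a binary event followed by a continuous score, with the two linked through a shared latent normal (a Gaussian copula). The risk difference, mean difference, and grid of correlations are hypothetical.

```python
import math
import random
from statistics import NormalDist


def draw_arm(n, event_risk, score_mean, rho, rng):
    """Binary event plus continuous score sharing a latent normal variable."""
    threshold = NormalDist().inv_cdf(event_risk)  # event occurs when z1 < threshold
    arm = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        arm.append((z1 < threshold, score_mean + z2))
    return arm


def net_benefit(treated, control):
    """Level 1: being event-free wins. Level 2 (level-1 ties only): higher score wins."""
    total = 0
    for t_event, t_score in treated:
        for c_event, c_score in control:
            if t_event != c_event:
                total += 1 if c_event else -1
            elif t_score != c_score:
                total += 1 if t_score > c_score else -1
    return total / (len(treated) * len(control))


def sensitivity(rhos, n=400, control_risk=0.30, risk_difference=-0.10,
                mean_difference=0.3, seed=7):
    """Net benefit for each assumed correlation, holding the marginal effects fixed."""
    results = {}
    for rho in rhos:
        rng = random.Random(seed)  # same random stream for every rho
        control = draw_arm(n, control_risk, 0.0, rho, rng)
        treated = draw_arm(n, control_risk + risk_difference, mean_difference, rho, rng)
        results[rho] = net_benefit(treated, control)
    return results


for rho, value in sensitivity([0.0, 0.3, 0.6, 0.9]).items():
    print(f"rho={rho:.1f}  net benefit={value:.3f}")
```

Reusing the seed across correlation values means differences between rows reflect the dependence assumption rather than Monte Carlo noise. Each net benefit could then be converted to a sample size to see how far the recommendation moves.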

Load-bearing premise

That specifying marginal treatment effects and a flexible joint working distribution is enough to let super-samples produce plug-in estimates that support the analytical formulas accurately.

What would settle it

Running a large number of full trial simulations at the sample size recommended by FORSS and finding that the observed power differs substantially from the FORSS-predicted power in scenarios with correlated hierarchical endpoints.
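
A brute-force check of this kind can be sketched as a small harness. It simulates many full trials at a fixed per-arm size, tests each with a normal-approximation z-test on the net benefit, and reports the rejection rate. The test statistic, the generator interface (`draw_treated`, `draw_control`), and the toy single-endpoint demo are assumptions for illustration; the paper's own test and data-generating models may differ.

```python
import math
import random
from statistics import NormalDist, fmean, pvariance


def net_benefit_z(treated, control, compare):
    """z-statistic for H0: net benefit = 0, with a U-statistic variance estimate."""
    n, m = len(treated), len(control)
    kernel = [[compare(t, c) for c in control] for t in treated]
    row_means = [fmean(row) for row in kernel]
    col_means = [fmean([kernel[i][j] for i in range(n)]) for j in range(m)]
    delta = fmean(row_means)
    variance = pvariance(row_means) / n + pvariance(col_means) / m
    if variance == 0.0:  # complete separation or no information
        return math.copysign(math.inf, delta) if delta else 0.0
    return delta / math.sqrt(variance)


def empirical_power(draw_treated, draw_control, compare, n_per_arm,
                    n_trials=200, alpha=0.05, seed=1):
    """Share of simulated trials that reject H0 at two-sided level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(n_trials):
        treated = draw_treated(n_per_arm, rng)
        control = draw_control(n_per_arm, rng)
        if abs(net_benefit_z(treated, control, compare)) > z_crit:
            rejections += 1
    return rejections / n_trials


# Toy demo with one continuous endpoint; replace the generators and `compare`
# with the hierarchical versions for a real check.
def normal_arm(mean):
    return lambda n, rng: [rng.gauss(mean, 1.0) for _ in range(n)]


def higher_wins(t, c):
    return (t > c) - (t < c)


print(empirical_power(normal_arm(0.5), normal_arm(0.0), higher_wins, n_per_arm=50))
```

Running the harness with a data-generating model that deliberately differs from the working distribution given to the formula-based calculation is what distinguishes a robustness check from a check of internal consistency.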

Original abstract

Win statistics have gained increasing popularity as primary analysis methods for clinical trials with hierarchical endpoints (HEs) as primary endpoints. However, existing sample size and power calculation approaches in trial design still face several limitations and challenges: simulation-based approaches are computationally intensive, while existing formula-based methods often rely on simplifying assumptions such as independence among HEs, or require specification of overall win statistics and tie probability that are difficult to elicit a priori in practice. To address these challenges, we propose the FORSS framework, a FORmula-based Super-Sample approach that allows investigators to specify marginal treatment effects using familiar metrics (e.g., hazard ratios, mean differences, and risk differences) together with a flexible joint working distribution for the HEs. Rather than repeatedly simulating full trials at each candidate sample size, FORSS uses super-samples to estimate the population-level plug-in quantities required by analytical formulas for both power and sample size calculation. We evaluated the performance of the proposed FORSS through extensive simulation studies. The results show that the formula-based FORSS closely matches empirical power across a wide range of scenarios while maintaining Type~I error rates near the nominal 5\% level. An illustration based on the HEART-FID trial further shows that endpoint-dependence specifications can materially affect projected power and required sample size when planning trials with HEs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the FORSS (FORmula-based Super-Sample) framework for sample size and power calculations with win statistics for hierarchical endpoints in clinical trials. Investigators specify marginal treatment effects using standard metrics (hazard ratios, mean differences, risk differences) and a flexible joint working distribution for the endpoints; super-samples then estimate the population-level plug-in quantities (win probabilities, net benefits, tie probabilities) required by closed-form analytical power and sample-size formulas. Simulation studies are reported to show close agreement between the formula-based power and empirical power across scenarios, with Type I error rates near the nominal 5% level. An application to the HEART-FID trial illustrates that dependence specifications can materially change projected power and required sample size.

Significance. If the central claims hold, FORSS supplies a computationally lighter alternative to full trial simulation while avoiding the strong independence assumptions or hard-to-elicit overall win/tie parameters of prior formula-based methods. The ability to incorporate user-specified joint distributions for hierarchical endpoints could improve the realism of power calculations in trials with composite or ordered outcomes, provided the plug-in estimates remain accurate under realistic misspecification.

major comments (2)
  1. §5 (Simulation Studies): The reported close agreement between FORSS power and empirical power is demonstrated 'across a wide range of scenarios,' yet the description does not indicate whether the data-generating joint distributions used to produce the empirical results differ from the working distributions supplied to FORSS. When the simulation DGP matches the working distribution exactly, the match only verifies internal correctness of the plug-in estimation and formula implementation; it does not address bias in the estimated win probabilities or power when the dependence structure (copula, correlation, or joint probabilities) is misspecified by amounts typical in trial planning.
  2. §3 (FORSS Framework) and §4 (Analytical Formulas): The method relies on super-samples drawn from the user-specified joint working distribution to obtain the plug-in quantities that enter the analytical power formula. No sensitivity analysis or bound is provided on how errors in the estimated plug-in win probabilities propagate to the final sample-size recommendation when the working distribution is only approximately correct.
minor comments (2)
  1. Abstract: The notation 'Type~I error' contains a typographic artifact; it should read 'Type I error'.
  2. [HEART-FID Illustration] The HEART-FID illustration would be strengthened by an explicit statement of the joint distribution parameters (e.g., copula family and correlation values) used in the dependence scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help clarify the scope of our simulation studies and the need for explicit robustness checks. We address each major comment in turn and have revised the manuscript to improve transparency and add supporting analyses.

  1. Referee: §5 (Simulation Studies): The reported close agreement between FORSS power and empirical power is demonstrated 'across a wide range of scenarios,' yet the description does not indicate whether the data-generating joint distributions used to produce the empirical results differ from the working distributions supplied to FORSS. When the simulation DGP matches the working distribution exactly, the match only verifies internal correctness of the plug-in estimation and formula implementation; it does not address bias in the estimated win probabilities or power when the dependence structure (copula, correlation, or joint probabilities) is misspecified by amounts typical in trial planning.

    Authors: We appreciate this clarification. Our simulation design in §5 already incorporates scenarios in which the working joint distribution supplied to FORSS differs from the true data-generating process, including variations in copula family, correlation strength, and marginal dependence parameters. These cases were chosen to reflect realistic planning uncertainty. Nevertheless, the referee is correct that the original text did not make this distinction explicit. In the revision we have expanded the simulation description to detail the relationship between DGP and working model for each scenario and added a dedicated sensitivity subsection that quantifies performance under deliberate misspecification of the dependence structure. The results continue to show close agreement between formula-based and empirical power, with only modest degradation under moderate misspecification. revision: yes

  2. Referee: §3 (FORSS Framework) and §4 (Analytical Formulas): The method relies on super-samples drawn from the user-specified joint working distribution to obtain the plug-in quantities that enter the analytical power formula. No sensitivity analysis or bound is provided on how errors in the estimated plug-in win probabilities propagate to the final sample-size recommendation when the working distribution is only approximately correct.

    Authors: We agree that an explicit sensitivity analysis strengthens the practical utility of the framework. In the revised manuscript we have added a new subsection in §5 that perturbs the joint-distribution parameters (copula parameter, pairwise correlations) around the values used in the main simulations and reports the resulting changes in plug-in win probabilities, net benefit, and the final sample-size recommendation. We also derive and present first-order bounds on the propagation of plug-in error through the closed-form power and sample-size expressions, showing that the impact on recommended N remains limited for the range of misspecification considered realistic in trial planning. These additions directly address the referee’s concern while remaining within the scope of the existing analytical formulas. revision: yes
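
As a rough guide to what a first-order propagation argument looks like, consider a generic normal-approximation sample-size expression for the net benefit. This is an assumed form, not the bound derived in the manuscript: here $\Delta$ is the plug-in net benefit, $\sigma^{2}$ the plug-in asymptotic variance, and $N$ the total sample size.

```latex
N \;=\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\Delta^{2}},
\qquad
\frac{\delta N}{N} \;\approx\; \frac{\delta\sigma^{2}}{\sigma^{2}} \;-\; 2\,\frac{\delta\Delta}{\Delta}.
```

Under this form, relative error in the net benefit is doubled in the sample size. If the true $\Delta$ is 5% smaller than the planned value and $\sigma^{2}$ is unchanged, the first-order increase in $N$ is 10%; the exact factor is $1/0.95^{2} \approx 1.108$.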

Circularity Check

0 steps flagged

No circularity: analytical formulas fed by independent super-sampling from user-specified working distribution

The FORSS method specifies marginal treatment effects and a flexible joint working distribution as external inputs, then uses super-sampling solely to compute plug-in quantities (win probabilities, net benefits, tie probabilities) that are inserted into pre-existing analytical power and sample-size formulas. This structure does not define the target power formula in terms of the super-sample estimates, nor does it rename fitted quantities as predictions; the formulas remain independent of the particular super-sample realizations. Simulation studies evaluate performance under the same working distribution, but this is a validation check rather than a load-bearing derivation step that reduces to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are required for the central claim. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the ability of investigators to supply a joint working distribution whose dependence structure is close enough to reality that super-sample estimates of win probabilities and related quantities remain useful for power calculations.

free parameters (1)
  • Parameters of the joint working distribution for hierarchical endpoints
    User must choose or estimate the dependence parameters that define how the ranked endpoints covary; these directly affect the super-sample estimates used in the formulas.
axioms (1)
  • domain assumption: A flexible joint working distribution for the hierarchical endpoints can be specified by the user and used to generate super-samples that accurately estimate the population-level plug-in quantities required by the analytical formulas.
    This modeling choice is invoked to replace repeated full-trial simulation while still capturing endpoint dependence.
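
One practical way to choose the dependence parameters in the ledger is to elicit a rank correlation such as Kendall's tau and convert it to a copula parameter. The sketch below uses the standard identities for the Gaussian, Clayton, and Gumbel copulas, which hold for continuous margins. Whether the paper uses these families is not stated on this page, and the function names are hypothetical.

```python
import math


def gaussian_rho_from_tau(tau):
    """Gaussian copula correlation: rho = sin(pi * tau / 2)."""
    return math.sin(math.pi * tau / 2.0)


def clayton_theta_from_tau(tau):
    """Clayton copula parameter: theta = 2 * tau / (1 - tau), for 0 < tau < 1."""
    if not 0.0 < tau < 1.0:
        raise ValueError("Clayton conversion needs 0 < tau < 1")
    return 2.0 * tau / (1.0 - tau)


def gumbel_theta_from_tau(tau):
    """Gumbel copula parameter: theta = 1 / (1 - tau), for 0 <= tau < 1."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("Gumbel conversion needs 0 <= tau < 1")
    return 1.0 / (1.0 - tau)


for tau in (0.1, 0.3, 0.5):
    print(tau, gaussian_rho_from_tau(tau), clayton_theta_from_tau(tau),
          gumbel_theta_from_tau(tau))
```

With discrete or mixed endpoints the relationship between tau and the copula parameter is no longer this simple, so the converted values are best treated as starting points for a sensitivity range.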
