The FORSS Framework for Sample Size and Power Calculations With Win Statistics for Hierarchical Endpoints
Pith reviewed 2026-05-19 23:32 UTC · model grok-4.3
The pith
The FORSS framework delivers accurate formula-based sample size and power calculations for win statistics on hierarchical endpoints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FORSS is a formula-based super-sample framework. It generates large super-samples from specified marginal treatment effects and a flexible joint working distribution for the hierarchical endpoints, then uses them to estimate the plug-in quantities required by analytical power and sample-size formulas for win statistics. This avoids the computational cost of repeatedly simulating full trials.
What carries the argument
The super-sample generation step within the FORSS framework, which produces large simulated populations to obtain accurate estimates of the population-level win probabilities and other quantities needed for closed-form power calculations.
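A minimal sketch of the super-sample idea, under assumptions that are not taken from the paper: a hypothetical two-level hierarchy (an exponential event time censored at a fixed follow-up, then a normal score), a Gaussian copula as the working joint distribution, and illustrative names such as `draw_patient` and `super_sample_win_probs`. It estimates only the win, loss, and tie probabilities; any further plug-in quantities that the paper's formulas need are not covered here.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_patient(rng, hazard, mean, rho, follow_up):
    """One patient from a hypothetical Gaussian-copula working model.

    Endpoint 1: exponential event time, censored at follow_up (None = no event).
    Endpoint 2: normal score with unit variance; larger is better.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    u = min(STD_NORMAL.cdf(z1), 1.0 - 1e-12)      # guard against log(0)
    event_time = -math.log(1.0 - u) / hazard
    return (event_time if event_time < follow_up else None), mean + z2

def compare(trt, ctl, margin):
    """Hierarchical comparison: +1 treatment wins, -1 loses, 0 tie."""
    t_time, t_score = trt
    c_time, c_score = ctl
    if t_time is None and c_time is not None:
        return 1                                  # only the control had an event
    if c_time is None and t_time is not None:
        return -1                                 # only the treated patient had an event
    if t_time is not None and c_time is not None and t_time != c_time:
        return 1 if t_time > c_time else -1       # later event wins
    diff = t_score - c_score                      # tie on endpoint 1: move down
    if diff > margin:
        return 1
    if diff < -margin:
        return -1
    return 0

def super_sample_win_probs(n_pairs, hazard_ctl, hazard_ratio, mean_diff, rho,
                           follow_up=2.0, margin=0.1, seed=1):
    """Estimate (P(win), P(loss), P(tie)) from independent treatment-control pairs."""
    rng = random.Random(seed)
    counts = {1: 0, -1: 0, 0: 0}
    for _ in range(n_pairs):
        trt = draw_patient(rng, hazard_ctl * hazard_ratio, mean_diff, rho, follow_up)
        ctl = draw_patient(rng, hazard_ctl, 0.0, rho, follow_up)
        counts[compare(trt, ctl, margin)] += 1
    return counts[1] / n_pairs, counts[-1] / n_pairs, counts[0] / n_pairs

p_win, p_loss, p_tie = super_sample_win_probs(100_000, 0.3, 0.7, 0.2, rho=0.4)
print(f"P(win)={p_win:.3f}  P(loss)={p_loss:.3f}  P(tie)={p_tie:.3f}")
print(f"win ratio   = {p_win / p_loss:.3f}")
print(f"net benefit = {p_win - p_loss:.3f}")
print(f"win odds    = {(p_win + 0.5 * p_tie) / (p_loss + 0.5 * p_tie):.3f}")
```

The marginal effects enter only through the hazard ratio and the mean difference, while `rho` carries the dependence, which mirrors the split between marginal inputs and the joint working distribution described above.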
If this is right
- Users can specify treatment effects using standard metrics like hazard ratios, mean differences, and risk differences for each endpoint.
- The method maintains Type I error rates close to the nominal 5% level in evaluated scenarios.
- Projected power and required sample sizes depend on how the hierarchical endpoints are jointly distributed.
- FORSS reduces computation time relative to simulation-based power calculations for hierarchical endpoints.
Where Pith is reading between the lines
- Trial planners could run sensitivity checks over different joint distributions to assess how sample size recommendations change.
- The super-sample idea might extend to settings with censoring or other data features common in clinical trials.
- Similar plug-in estimation could support power calculations for other composite endpoint analyses beyond win statistics.
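The sensitivity check over joint distributions mentioned in the list above could look like the following sketch. Everything specific here is an assumption made for illustration: a binary first endpoint specified by a risk difference, a normal second endpoint specified by a mean difference, a Gaussian copula with correlation `rho`, and the hypothetical function name `net_benefit`.

```python
import math
import random
from statistics import NormalDist

def net_benefit(rho, n_pairs=50_000, risk_ctl=0.40, risk_diff=-0.10,
                mean_diff=0.25, seed=7):
    """Super-sample estimate of P(win) - P(loss) for a two-level hierarchy."""
    rng = random.Random(seed)
    normal = NormalDist()
    # Latent-normal cut-offs that reproduce the marginal event risks.
    cut = {"trt": normal.inv_cdf(risk_ctl + risk_diff),
           "ctl": normal.inv_cdf(risk_ctl)}
    shift = {"trt": mean_diff, "ctl": 0.0}

    def draw(arm):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        return z1 < cut[arm], shift[arm] + z2     # (event occurred?, score)

    total = 0
    for _ in range(n_pairs):
        (e_t, s_t), (e_c, s_c) = draw("trt"), draw("ctl")
        if e_t != e_c:
            total += 1 if e_c else -1             # patient without the event wins
        elif s_t != s_c:
            total += 1 if s_t > s_c else -1       # tie on endpoint 1: compare scores
    return total / n_pairs

# Same seed for every rho, so differences reflect the dependence setting
# rather than Monte Carlo noise.
for rho in (0.0, 0.3, 0.6):
    print(f"rho={rho:.1f}  net benefit={net_benefit(rho):.4f}")
```

Holding the marginal effects fixed and varying only `rho` isolates how much of the projected effect size is driven by the dependence specification.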
Load-bearing premise
That specifying marginal treatment effects and a flexible joint working distribution is enough to let super-samples produce plug-in estimates that support the analytical formulas accurately.
What would settle it
Running a large number of full trial simulations at the sample size recommended by FORSS and finding that the observed power differs substantially from the FORSS-predicted power in scenarios with correlated hierarchical endpoints.
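The empirical side of such a check can be sketched as below. This is an illustration under stated assumptions, not the paper's simulation code: the data-generating model (binary event, then normal score, linked by a Gaussian copula) and the function names are hypothetical, and the test used is the Finkelstein-Schoenfeld statistic with its permutation variance rather than whichever win statistic the paper analyzes.

```python
import math
import random
from statistics import NormalDist

def draw_arm(rng, n, event_cut, mean, rho):
    """Draw n patients as (had_event, score) from a Gaussian-copula model."""
    patients = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        patients.append((z1 < event_cut, mean + z2))
    return patients

def compare(a, b):
    """+1 if patient a beats patient b on the hierarchy, -1 if worse, 0 if tied."""
    if a[0] != b[0]:
        return 1 if b[0] else -1                  # the patient without the event wins
    if a[1] != b[1]:
        return 1 if a[1] > b[1] else -1
    return 0

def fs_z(trt, ctl):
    """Finkelstein-Schoenfeld Z statistic using the permutation variance."""
    pooled = trt + ctl
    n_t, n = len(trt), len(pooled)
    scores = [sum(compare(a, b) for b in pooled) for a in pooled]
    variance = n_t * (n - n_t) / (n * (n - 1)) * sum(u * u for u in scores)
    return sum(scores[:n_t]) / math.sqrt(variance)

def empirical_power(n_per_arm, n_trials, risk_ctl, risk_diff, mean_diff, rho,
                    alpha=0.05, seed=11):
    """Share of simulated trials whose two-sided test rejects at level alpha."""
    rng = random.Random(seed)
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    cut_trt = normal.inv_cdf(risk_ctl + risk_diff)
    cut_ctl = normal.inv_cdf(risk_ctl)
    rejections = 0
    for _ in range(n_trials):
        trt = draw_arm(rng, n_per_arm, cut_trt, mean_diff, rho)
        ctl = draw_arm(rng, n_per_arm, cut_ctl, 0.0, rho)
        rejections += abs(fs_z(trt, ctl)) > z_crit
    return rejections / n_trials

n_trials = 200
observed = empirical_power(50, n_trials, risk_ctl=0.40, risk_diff=-0.10,
                           mean_diff=0.25, rho=0.3)
mc_se = math.sqrt(observed * (1.0 - observed) / n_trials)
print(f"empirical power = {observed:.3f} (Monte Carlo SE {mc_se:.3f})")
```

A formula-based prediction that falls well outside a few Monte Carlo standard errors of the empirical value, at the recommended sample size, would be the kind of discrepancy described above. Setting `risk_diff=0` and `mean_diff=0` turns the same function into a Type I error check.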
Original abstract
Win statistics have gained increasing popularity as primary analysis methods for clinical trials with hierarchical endpoints (HEs) as primary endpoints. However, existing sample size and power calculation approaches in trial design still face several limitations and challenges: simulation-based approaches are computationally intensive, while existing formula-based methods often rely on simplifying assumptions such as independence among HEs, or require specification of overall win statistics and tie probability that are difficult to elicit a priori in practice. To address these challenges, we propose the FORSS framework, a FORmula-based Super-Sample approach that allows investigators to specify marginal treatment effects using familiar metrics (e.g., hazard ratios, mean differences, and risk differences) together with a flexible joint working distribution for the HEs. Rather than repeatedly simulating full trials at each candidate sample size, FORSS uses super-samples to estimate the population-level plug-in quantities required by analytical formulas for both power and sample size calculation. We evaluated the performance of the proposed FORSS through extensive simulation studies. The results show that the formula-based FORSS closely matches empirical power across a wide range of scenarios while maintaining Type~I error rates near the nominal 5\% level. An illustration based on the HEART-FID trial further shows that endpoint-dependence specifications can materially affect projected power and required sample size when planning trials with HEs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the FORSS (FORmula-based Super-Sample) framework for sample size and power calculations with win statistics for hierarchical endpoints in clinical trials. Investigators specify marginal treatment effects using standard metrics (hazard ratios, mean differences, risk differences) and a flexible joint working distribution for the endpoints; super-samples then estimate the population-level plug-in quantities (win probabilities, net benefits, tie probabilities) required by closed-form analytical power and sample-size formulas. Simulation studies are reported to show close agreement between the formula-based power and empirical power across scenarios, with Type I error rates near the nominal 5% level. An application to the HEART-FID trial illustrates that dependence specifications can materially change projected power and required sample size.
Significance. If the central claims hold, FORSS supplies a computationally lighter alternative to full trial simulation while avoiding the strong independence assumptions or hard-to-elicit overall win/tie parameters of prior formula-based methods. The ability to incorporate user-specified joint distributions for hierarchical endpoints could improve the realism of power calculations in trials with composite or ordered outcomes, provided the plug-in estimates remain accurate under realistic misspecification.
major comments (2)
- [§5 (Simulation Studies)] The reported close agreement between FORSS power and empirical power is demonstrated 'across a wide range of scenarios,' yet the description does not indicate whether the data-generating joint distributions used to produce the empirical results differ from the working distributions supplied to FORSS. When the simulation DGP matches the working distribution exactly, the match only verifies internal correctness of the plug-in estimation and formula implementation; it does not address bias in the estimated win probabilities or power when the dependence structure (copula, correlation, or joint probabilities) is misspecified by amounts typical in trial planning.
- [§3 (FORSS Framework) and §4 (Analytical Formulas)] The method relies on super-samples drawn from the user-specified joint working distribution to obtain the plug-in quantities that enter the analytical power formula. No sensitivity analysis or bound is provided on how errors in the estimated plug-in win probabilities propagate to the final sample-size recommendation when the working distribution is only approximately correct.
minor comments (2)
- [Abstract] The notation 'Type~I error' contains a typographic artifact; it should read 'Type I error'.
- [HEART-FID Illustration] The HEART-FID illustration would be strengthened by an explicit statement of the joint distribution parameters (e.g., copula family and correlation values) used in the dependence scenarios.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help clarify the scope of our simulation studies and the need for explicit robustness checks. We address each major comment in turn and have revised the manuscript to improve transparency and add supporting analyses.
Point-by-point responses
-
Referee: §5 (Simulation Studies): The reported close agreement between FORSS power and empirical power is demonstrated 'across a wide range of scenarios,' yet the description does not indicate whether the data-generating joint distributions used to produce the empirical results differ from the working distributions supplied to FORSS. When the simulation DGP matches the working distribution exactly, the match only verifies internal correctness of the plug-in estimation and formula implementation; it does not address bias in the estimated win probabilities or power when the dependence structure (copula, correlation, or joint probabilities) is misspecified by amounts typical in trial planning.
Authors: We appreciate this clarification. Our simulation design in §5 already incorporates scenarios in which the working joint distribution supplied to FORSS differs from the true data-generating process, including variations in copula family, correlation strength, and marginal dependence parameters. These cases were chosen to reflect realistic planning uncertainty. Nevertheless, the referee is correct that the original text did not make this distinction explicit. In the revision we have expanded the simulation description to detail the relationship between DGP and working model for each scenario and added a dedicated sensitivity subsection that quantifies performance under deliberate misspecification of the dependence structure. The results continue to show close agreement between formula-based and empirical power, with only modest degradation under moderate misspecification.
Revision: yes
-
Referee: §3 (FORSS Framework) and §4 (Analytical Formulas): The method relies on super-samples drawn from the user-specified joint working distribution to obtain the plug-in quantities that enter the analytical power formula. No sensitivity analysis or bound is provided on how errors in the estimated plug-in win probabilities propagate to the final sample-size recommendation when the working distribution is only approximately correct.
Authors: We agree that an explicit sensitivity analysis strengthens the practical utility of the framework. In the revised manuscript we have added a new subsection in §5 that perturbs the joint-distribution parameters (copula parameter, pairwise correlations) around the values used in the main simulations and reports the resulting changes in plug-in win probabilities, net benefit, and the final sample-size recommendation. We also derive and present first-order bounds on the propagation of plug-in error through the closed-form power and sample-size expressions, showing that the impact on recommended N remains limited for the range of misspecification considered realistic in trial planning. These additions directly address the referee’s concern while remaining within the scope of the existing analytical formulas.
Revision: yes
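As a rough illustration of what first-order propagation means here, consider a generic normal-approximation sample-size expression. This form is assumed for illustration and is not taken from the paper: $\Delta$ stands for the plug-in effect (for example the net benefit) and $\sigma^{2}$ for a variance constant, both hypothetical symbols.

```latex
n \;\approx\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\Delta^{2}}
\quad\Longrightarrow\quad
\frac{\delta n}{n} \;\approx\; \frac{\delta(\sigma^{2})}{\sigma^{2}} \;-\; 2\,\frac{\delta \Delta}{\Delta}
```

Under this assumed form, a plug-in effect that is overstated by 5% would understate the required sample size by roughly 10%, so relative error in the effect estimate is about doubled in the sample-size recommendation.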
Circularity Check
No circularity: analytical formulas fed by independent super-sampling from user-specified working distribution
Full rationale
The FORSS method specifies marginal treatment effects and a flexible joint working distribution as external inputs, then uses super-sampling solely to compute plug-in quantities (win probabilities, net benefits, tie probabilities) that are inserted into pre-existing analytical power and sample-size formulas. This structure does not define the target power formula in terms of the super-sample estimates, nor does it rename fitted quantities as predictions; the formulas remain independent of the particular super-sample realizations. Simulation studies evaluate performance under the same working distribution, but this is a validation check rather than a load-bearing derivation step that reduces to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are required for the central claim. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Parameters of the joint working distribution for hierarchical endpoints
axioms (1)
- domain assumption: A flexible joint working distribution for the hierarchical endpoints can be specified by the user and used to generate super-samples that accurately estimate the population-level plug-in quantities required by the analytical formulas.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
FORSS uses super-samples to estimate the population-level plug-in quantities required by analytical formulas for both power and sample size calculation... copula C_θ is used as the joint distribution
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.
We adopt the U-statistics framework of Bebu and Lachin and Dong et al. for HEs with mixed data types
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Montori V, Permanyer-Miralda G, Ferreira-González I, et al. Validity of composite end points in clinical trials. BMJ 2005; 330(7491): 594–596. doi: 10.1136/bmj.330.7491.594
-
[2]
Walker HG, Brown AJ, Vaz IP, et al. Composite outcome measures in high-impact critical care randomised controlled trials: a systematic review. Critical Care 2024; 28(1): 184
-
[3]
Neaton J, Gray G, Zuckerman B, Konstam M. Key Issues in End Point Selection for Heart Failure Trials: Composite End Points. Journal of Cardiac Failure 2005; 11(8): 567–575. doi: 10.1016/j.cardfail.2005.08.350
-
[4]
Ferreira-González I, Permanyer-Miralda G, Domingo-Salvany A, et al. Problems with use of composite end points in cardiovascular trials: systematic review of randomised controlled trials. BMJ 2007; 334(7597): 786
-
[5]
Finkelstein D, Schoenfeld D. Combining mortality and longitudinal measures in clinical trials. Statistics in Medicine 1999; 18(11): 1341–1354. doi: 10.1002/(SICI)1097-0258(19990615)18:11<1341::AID-SIM129>3.0.CO;2-7
-
[6]
Fandino W, Dodd M, Kunst G, Clayton T. Efficient statistical analysis of trial designs: win ratio and related approaches for composite outcomes. Perioperative Medicine 2025; 14(1): 70
-
[7]
Pocock S, Ariti C, Collier T, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal 2012; 33(2): 176–182. doi: 10.1093/eurheartj/ehr352
-
[8]
Mao L. Defining estimand for the win ratio: Separate the true effect from censoring. Clinical Trials 2024; 17407745241259356. doi: 10.1177/17407745241259356
-
[9]
Barnhart H, Lokhnygina Y, Matsouaka R, et al. Sample Size and Power Calculations with Win Measures Based on Hierarchical Endpoints. Statistics in Medicine 2025; 44(10-12). doi: 10.1002/sim.70096
-
[10]
Redfors B, Gregson J, Crowley A, et al. The win ratio approach for composite endpoints: practical guidance based on previous experience. European Heart Journal 2020; 41(46): 4391–4399. doi: 10.1093/eurheartj/ehaa665
-
[11]
Dong G, Huang B, Verbeeck J, et al. Win statistics (win ratio, win odds, and net benefit) can complement one another to show the strength of the treatment effect on time-to-event outcomes. Pharmaceutical Statistics 2023; 22(1): 20–33. doi: 10.1002/pst.2251 Baoshan Zhang et al. 21
-
[12]
Gregson J, Taylor D, Owen R, Collier T, J. Cohen D, Pocock S. Hierarchical composite outcomes and win ratio methods in cardiovascular trials: a review and consequent guidance. Circulation 2025; 151(22): 1606–1619
-
[13]
Pocock SJ, Gregson J, Collier TJ, Ferreira JP, Stone GW. The win ratio in cardiology trials: lessons learnt, new developments, and wise future use. European Heart Journal 2024; 45(44): 4684–4699
-
[14]
Maurer M, Schwartz J, Gundapaneni B, et al. Tafamidis Treatment for Patients with Transthyretin Amyloid Cardiomyopathy. New England Journal of Medicine 2018; 379(11): 1007–1016. doi: 10.1056/NEJMoa1805689
-
[15]
Stone G, Lindenfeld J, Abraham W, et al. Transcatheter Mitral-Valve Repair in Patients with Heart Failure. New England Journal of Medicine 2018; 379(24): 2307–2318. doi: 10.1056/NEJMoa1806640
-
[16]
Mentz R, Ambrosy A, Ezekowitz J, et al. Randomized Placebo-Controlled Trial of Ferric Carboxymaltose in Heart Failure With Iron Deficiency: Rationale and Design. Circulation: Heart Failure 2021; 14(5): e008100. doi: 10.1161/CIRCHEARTFAILURE.120.008100
-
[17]
Yu R, Ganju J. Sample size formula for a win ratio endpoint. Statistics in Medicine 2022; 41(6): 950–963. doi: 10.1002/sim.9297
-
[18]
U.S. Food and Drug Administration. Multiple Endpoints in Clinical Trials: Guidance for Industry. U.S. Food and Drug Administration; 2022. Available at: https://www.fda.gov/media/162416/download
-
[19]
James S, Erlinge D, Storey R, et al. Dapagliflozin in Myocardial Infarction without Diabetes or Heart Failure. NEJM Evidence 2024; 3(2). doi: 10.1056/EVIDoa2300286
-
[20]
Kondo T, Jhund P, Gasparyan S, et al. A hierarchical kidney outcome using win statistics in patients with heart failure from the DAPA-HF and DELIVER trials. Nature Medicine 2024; 30(5): 1432–1439. doi: 10.1038/s41591-024-02941-8
-
[21]
Zhou T, LaValley M, Nelson K, Cabral H, Massaro J. Calculating power for the Finkelstein and Schoenfeld test statistic for a composite endpoint with two components. Statistics in Medicine 2022; 41(17): 3321–3335. doi: 10.1002/sim.9419
-
[22]
Gasparyan S, Kowalewski E, Folkvaljon F, et al. Power and sample size calculation for the win odds test: application to an ordinal endpoint in COVID-19 trials. Journal of Biopharmaceutical Statistics 2021; 31(6): 765–787. doi: 10.1080/10543406.2021.1968893
-
[23]
Mao L, Kim K, Miao X. Sample size formula for general win ratio analysis. Biometrics 2022; 78(3): 1257–1268. doi: 10.1111/biom.13501
-
[24]
Verbeeck J, Spitzer E, De Vries T, et al. Generalized pairwise comparison methods to analyze (non)prioritized composite endpoints. Statistics in Medicine 2019; 38(30): 5641–5656. doi: 10.1002/sim.8388
-
[25]
Bebu I, Lachin J. Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics 2016; 17(1): 178–187. doi: 10.1093/biostatistics/kxv032
-
[26]
Lehmann E. Elements of large-sample theory. New York, NY: Springer. Corrected 3rd printing ed. 2004
-
[27]
Lee A. U-Statistics: Theory and Practice. New York: Marcel Dekker. 1990
-
[28]
Zhang B, Wu Y. Sequential design for paired ordinal categorical outcome. Statistical Methods in Medical Research 2025; 34(6): 1144–1161
-
[29]
Wu Y, Simmons RA, Zhang B, Troy JD. Group Sequential Test for Two-Sample Ordinal Outcome Measures. Statistics in Medicine 2025; 44(6): e70053
-
[30]
Zhang B, Wu Y. Sequential Design with Derived Win Statistics. arXiv preprint arXiv:2410.06281; 2024
-
[31]
U.S. Food and Drug Administration. Patient-Focused Drug Development: Incorporating Clinical Outcome Assessments Into Endpoints for Regulatory Decision-Making. U.S. Food and Drug Administration; 2023. Available at: https://www.fda.gov/media/166830/download
-
[32]
Nelsen R. An introduction to copulas. Springer Series in Statistics. New York: Springer. 2nd ed. 2006
-
[33]
Luo X, Qiu J, Bai S, Tian H. Weighted win loss approach for analyzing prioritized outcomes. Statistics in Medicine 2017; 36(15): 2452–2465
-
[34]
Genest C, Nešlehová J. A primer on copulas for count data. ASTIN Bulletin: The Journal of the IAA 2007; 37(2): 475–515
-
[35]
de Leon AR, Wu B. Copula-based regression models for a bivariate mixed discrete and continuous outcome. Statistics in Medicine 2011; 30(2): 175–185
-
[36]
Kendall MG. A New Measure of Rank Correlation. Biometrika 1938; 30(1–2): 81–93. doi: 10.1093/biomet/30.1-2.81
-
[37]
Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the Yield of Medical Tests. JAMA 1982; 247(18): 2543–. doi: 10.1001/jama.1982.03320430047030
-
[39]
Mentz R, Garg J, Rockhold F, et al. Ferric Carboxymaltose in Heart Failure with Iron Deficiency. New England Journal of Medicine 2023; 389(11): 975–986. doi: 10.1056/NEJMoa2304968