pith. machine review for the scientific record

arxiv: 2605.12797 · v1 · submitted 2026-05-12 · 📊 stat.ME · stat.AP

Recognition: 1 theorem link · Lean Theorem

Evaluating the impact of outcome delay on the efficiency of sample size re-estimation

Aritra Mukherjee, James J M S Wason, Michael J Grayling

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:33 UTC · model grok-4.3

classification: 📊 stat.ME · stat.AP
keywords: sample size re-estimation · outcome delay · internal pilot · clinical trials · continuous outcomes · binary outcomes · delay impact

The pith

Outcome delays during recruitment inflate final sample sizes and power in sample size re-estimation trials

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models how long waits for primary outcomes affect internal pilot sample size re-estimation designs when recruitment continues. Pipeline participants recruited during the delay do not contribute to the interim analysis, so the final sample size exceeds what the re-estimation step would otherwise select. This produces higher average sample sizes, elevated power, and greater costs. The size of the inflation depends on the trial setting: it is largest when the re-estimated sample size falls below the original plan and smaller when the original plan is below the re-estimate.

Core claim

For both continuous and binary outcomes, the distribution of the final sample size after re-estimation widens and shifts upward with longer delays. The delay impact and cost metrics, together with root-mean-square error, quantify the resulting loss of precision in the sample-size estimate. The effect is strongest in settings where the re-estimated size is smaller than originally planned, often producing overpowered trials; the effect is weaker when the original plan remains smaller than the re-estimate.

What carries the argument

The internal-pilot SSR procedure with continuous recruitment during the outcome-delay window, tracked through the delay-impact and cost metrics that measure inflation of the final sample size relative to the re-estimation target.

If this is right

  • Longer delays raise average final sample size and achieved power for any fixed original plan.
  • The largest excess recruitment occurs when the re-estimation step would otherwise reduce the sample size.
  • Root-mean-square error of the final sample-size estimate grows with delay length.
  • The cost metric rises steadily as more pipeline participants are enrolled who do not inform the interim decision.
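The listed effects can be reproduced in miniature with a forward simulation of an internal-pilot SSR design. This is a sketch, not the paper's code: the per-group z-approximation sample-size formula and all numbers (effect size, variance, recruitment rate, delay) are illustrative assumptions, and the pipeline is modelled as the paper's load-bearing premise suggests, i.e. participants enrolled during the delay cannot be un-recruited at the interim.

```python
import numpy as np

rng = np.random.default_rng(2026)

def required_n(sigma2, delta=2.0, z_alpha=1.96, z_beta=1.2816):
    """Per-group z-approximation sample size for a two-arm comparison
    (alpha = 0.05 two-sided, 90% power); illustrative, not the paper's exact formula."""
    return int(np.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2))

def mean_final_n(true_sigma2, n1, rate, delay, reps=5000):
    """Average final sample size of an internal-pilot SSR design when
    rate * delay pipeline participants are already enrolled at the interim."""
    finals = np.empty(reps)
    for i in range(reps):
        pilot = rng.normal(0.0, np.sqrt(true_sigma2), size=n1)
        n_reest = required_n(pilot.var(ddof=1))   # re-estimated target
        enrolled = n1 + rate * delay              # interim cohort plus pipeline
        finals[i] = max(n_reest, enrolled)        # cannot un-recruit the pipeline
    return finals.mean()

# Hypothetical setting: true variance 10, interim after 20 observed outcomes,
# 4 participants recruited per month while outcomes are awaited.
no_delay = mean_final_n(10.0, n1=20, rate=4, delay=0)
long_delay = mean_final_n(10.0, n1=20, rate=4, delay=12)
print(no_delay, long_delay)  # the mean final N grows with the delay
```

With these numbers the re-estimated target sits near 53 per group, so a 12-month delay (68 already enrolled) forces the final size above what re-estimation alone would select, matching the first two bullets.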

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could cap recruitment rate during the delay window to limit pipeline participants and reduce over-powering.
  • Switching to shorter-term surrogate endpoints would shrink the delay window and thereby preserve the efficiency gains of SSR.
  • Variable recruitment rates or staggered site activation would likely amplify the inflation shown in the constant-rate model.

Load-bearing premise

Recruitment continues at a constant rate throughout the entire outcome-delay period and no participants drop out or alter the planned enrollment speed.
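Under this premise the pipeline is deterministic: its size is simply the accrual rate times the delay length. A small sketch, with hypothetical numbers and a hypothetical exponentially decaying accrual curve (not from the paper), shows why relaxing the constant-rate assumption shrinks the pipeline, as the referee report below also argues.

```python
import math

def pipeline_constant(rate, delay):
    # Constant accrual: pipeline participants = rate x delay, deterministically.
    return math.ceil(rate * delay)

def pipeline_decaying(initial_rate, delay, halving_time):
    # Hypothetical alternative: the accrual rate halves every `halving_time`
    # months; integrate initial_rate * 2^(-t / halving_time) over the delay.
    k = math.log(2) / halving_time
    return math.ceil(initial_rate * (1 - math.exp(-k * delay)) / k)

print(pipeline_constant(4, 12))     # 48 pipeline participants
print(pipeline_decaying(4, 12, 6))  # fewer, since accrual slows over the window
```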

What would settle it

A trial or simulation in which the observed distribution of final sample sizes under increasing delay lengths deviates from the predicted upward shift and widening, especially in the case where the re-estimated size is smaller than planned.

Figures

Figures reproduced from arXiv: 2605.12797 by Aritra Mukherjee, James J M S Wason, Michael J Grayling.

Figure 1: Distribution of the final sample size based on the decision to reject or accept the null under [PITH_FULL_IMAGE:figures/full_fig_p009_1.png]
Figure 2: The ‘delay impact’ for varying delay lengths ( [PITH_FULL_IMAGE:figures/full_fig_p010_2.png]
Figure 3: RMSE for varying delay lengths (m = 1, 2, …, 24) for σ² = 8, 10, 12, under uniform and linear recruitment patterns. The dotted line in each graph represents the RMSE for a single-stage design. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png]
Figure 4: The ‘Cost’ for varying delay lengths (m = 1, 2, …, 24), for σ² = 8, 10, 12, under uniform and linear recruitment patterns. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png]
Figure 5: Final sample sizes for varying delay lengths for different first stage sample sizes ( [PITH_FULL_IMAGE:figures/full_fig_p015_5.png]
Original abstract

Sample size reestimation can be a powerful tool to ensure that a clinical trial meets its prespecified power requirements when uncertainty regarding a design parameter exists at the planning stage. However, long term primary endpoints can be harmful to the efficiency of this trial design. If recruitment is continued while treatment outcomes are awaited, long delay can potentially lead to a large number of pipeline participants being recruited in the trial that do not contribute to the interim analysis. This may lead to a larger number of recruited participants than are actually deemed required, resulting in an overpowered trial with high cost. This paper studies the exact impact of such outcome delay on the efficiency of internal pilot type SSR designs. The distribution of the final sample size post SSR is obtained under various delay lengths for both continuous and binary outcome data, how delay impacts the precision of the final sample size estimate is then discussed. Precisely, the impact of delay on this precision is assessed through RMSE, as well as two more novel metrics, termed the delay impact and cost. The results indicate that with increase in delay length, the delay impact increases, inflating average sample size and power. However, the severity of the effect of delayed outcomes depends highly on the exact trial setting. Trials where the reestimated sample size is smaller than originally planned suffer the most from delayed outcomes, often leading to an overpowered trial. However, the impact of delay is substantially less if the original planned sample size remains smaller than the reestimated sample size.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript investigates the effects of outcome delays on the efficiency of internal pilot sample size re-estimation (SSR) designs in clinical trials. For both continuous and binary endpoints, it derives the distribution of the final sample size under varying delay lengths with ongoing recruitment, and quantifies the impact using root mean square error (RMSE), a proposed 'delay impact' metric, and a cost measure. The key finding is that longer delays increase the average sample size and power, with the effect being most pronounced when the re-estimated sample size is less than the originally planned size, often resulting in overpowered trials.

Significance. This work addresses a practical issue in adaptive trial design by quantifying how outcome delays can lead to inefficiencies and overpowered trials. The simulation-based approach for normal and Bernoulli data, combined with the introduction of delay-specific metrics, offers valuable guidance for trialists planning SSR. Strengths include the explicit derivation of final N distributions and the differentiation of impact based on whether re-estimated N exceeds or falls below the planned N.

major comments (1)
  1. [Methods / Simulation Setup] The modeling and simulations assume constant recruitment rate during outcome delays (as stated in the abstract and methods). This produces a deterministic pipeline of non-informative subjects; variable rates (e.g., slowing accrual) would shrink the pipeline and reduce the reported inflation in average sample size and power. No sensitivity analysis to time-varying recruitment is described, making the quantitative severity statements load-bearing on this assumption.
minor comments (2)
  1. [Abstract] The abstract introduces the 'delay impact' and 'cost' metrics without definitions or formulas; a one-sentence definition in the abstract would improve accessibility.
  2. [Simulation Study] The paper reports results for continuous and binary data but does not specify the exact parameter values (e.g., variance, event rates) or error-handling rules used in the simulations; adding a short table of simulation parameters would aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and recommendation for minor revision. We address the major comment below.

Point-by-point responses
  1. Referee: The modeling and simulations assume constant recruitment rate during outcome delays (as stated in the abstract and methods). This produces a deterministic pipeline of non-informative subjects; variable rates (e.g., slowing accrual) would shrink the pipeline and reduce the reported inflation in average sample size and power. No sensitivity analysis to time-varying recruitment is described, making the quantitative severity statements load-bearing on this assumption.

    Authors: We agree that the constant recruitment rate assumption is central to the derivations and simulations, as it produces a deterministic pipeline and isolates the delay effect for analytical tractability. Variable rates would indeed shrink the pipeline and reduce inflation, but would require specifying an additional recruitment function, complicating the exact distributions we derive. We have added a paragraph in the revised Discussion section acknowledging this as a limitation, noting that the reported inflation represents an upper bound under constant accrual and that slower accrual would mitigate the impact. No full sensitivity analysis is included, as it would expand the scope beyond the current focus on delay length. revision: partial

Circularity Check

0 steps flagged

No significant circularity; the results come from forward simulation of trial processes.

Full rationale

The paper obtains the distribution of final sample size after SSR under varying delay lengths via direct modeling and simulation of recruitment and outcome processes for normal and Bernoulli data. Delay impact, RMSE, and cost metrics are computed as explicit functions of these simulated distributions rather than being redefined or fitted from the target quantities themselves. No derivation step reduces by construction to its own inputs, no load-bearing self-citation chain is invoked to justify uniqueness or ansatz choices, and the quantitative claims (inflation of average N and power with delay, especially when re-estimated N is smaller than planned) are outputs of the forward model under stated assumptions. The analysis is therefore self-contained.
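One hedged reading of how such an RMSE could be computed from the forward model takes the oracle target to be the sample size required under the true variance and measures the root-mean-square deviation of the simulated final N from it. The sample-size formula and all parameter values below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(7)

def required_n(sigma2, delta=2.0, z_alpha=1.96, z_beta=1.2816):
    # Illustrative z-approximation, not the paper's exact formula.
    return int(np.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2))

def rmse_final_n(true_sigma2, n1, rate, delay, reps=5000):
    """RMSE of the final sample size around the oracle target, i.e. the
    size an omniscient designer would pick knowing the true variance."""
    oracle = required_n(true_sigma2)
    sq_err = np.empty(reps)
    for i in range(reps):
        pilot = rng.normal(0.0, np.sqrt(true_sigma2), size=n1)
        final_n = max(required_n(pilot.var(ddof=1)), n1 + rate * delay)
        sq_err[i] = (final_n - oracle) ** 2
    return float(np.sqrt(sq_err.mean()))

# Hypothetical numbers: once the pipeline exceeds the oracle N, every
# replicate overshoots and the RMSE is driven by the delay itself.
print(rmse_final_n(10.0, n1=20, rate=4, delay=0))
print(rmse_final_n(10.0, n1=20, rate=4, delay=24))
```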

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Evaluation rests on simulation of recruitment and outcome timing under standard statistical assumptions for clinical trials; new metrics are defined to capture delay effects.

free parameters (1)
  • delay length
    Varied across simulation scenarios to assess impact on final sample size distribution and power.
axioms (1)
  • domain assumption Recruitment continues during the outcome observation delay period
    Core modeling choice for internal pilot SSR that creates pipeline participants.

pith-pipeline@v0.9.0 · 5571 in / 1152 out tokens · 40375 ms · 2026-05-14T19:33:35.349488+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 19 canonical work pages

  1. [1] Burton, A., Altman, D. G., Royston, P., and Holder, R. L. (2006). The design of simulation studies in medical statistics. Statistics in Medicine, 25(24), 4279–4292. https://doi.org/10.1002/sim.2673

  2. [2] Chang, M. (2014). Adaptive Design Theory and Implementation Using SAS and R (2nd ed.). CRC Press, Taylor and Francis Group.

  3. [3] Edwards, J. M., Walters, S. J., Kunz, C., and Julious, S. A. (2020). A systematic review of the “promising zone” design. Trials, 21(1). https://doi.org/10.1186/s13063-020-04931-w

  4. [4] European Medicines Agency (2007). Reflection paper on methodological issues in confirmatory clinical trials planned with an adaptive design. http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003616.pdf

  5. [5] Friede, T., and Kieser, M. (2002). On the inappropriateness of an EM algorithm based procedure for blinded sample size re-estimation. Statistics in Medicine, 21(2), 165–176. https://onlinelibrary.wiley.com/doi/10.1002/sim.977

  6. [6] Friede, T., and Kieser, M. (2004). Sample size recalculation for binary data in internal pilot study designs. Pharmaceutical Statistics, 3(4), 269–279. https://doi.org/10.1002/pst.140

  7. [7] Friede, T., and Kieser, M. (2006). Sample size recalculation in internal pilot study designs: A review. Biometrical Journal, 48(4), 537–555. https://doi.org/10.1002/bimj.200510238

  8. [8] Friede, T., and Kieser, M. (2013). Blinded sample size re-estimation in superiority and noninferiority trials: Bias versus variance in variance estimation. Pharmaceutical Statistics, 12(3), 141–146. https://doi.org/10.1002/pst.1564

  9. [9] Gang, L. I., Shih, W. J., Xie, T., and Lu, J. (2002). A sample size adjustment procedure for clinical trials based on conditional power. Biostatistics, 3(2), 277–287.

  10. [10] Gao, P., Ware, J. H., and Mehta, C. (2008). Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics, 18(6), 1184–1196. https://doi.org/10.1080/10543400802369053

  11. [11] Gould, A. L., and Shih, W. J. (1992). Sample size re-estimation without unblinding for normally distributed outcomes with unknown variance. Communications in Statistics (A), 21(10), 2833–2853.

  12. [12] Jennison, C., and Turnbull, B. W. (2015). Adaptive sample size modification in clinical trials: Start small then ask for more? Statistics in Medicine, 34(29), 3793–3810. https://doi.org/10.1002/sim.6575

  13. [13] Kieser, M., and Friede, T. (2003). Simple procedures for blinded sample size adjustment that do not affect the type I error rate. Statistics in Medicine, 22(23), 3571–3581. https://doi.org/10.1002/sim.1585

  14. [14] Kunzmann, K., Grayling, M. J., Lee, K. M., Robertson, D. S., Rufibach, K., and Wason, J. M. S. (2022). Conditional power and friends: The why and how of (un)planned, unblinded sample size recalculations in confirmatory trials. Statistics in Medicine, 41(5), 877–890. https://doi.org/10.1002/SIM.9288

  15. [15] Mukherjee, A., Grayling, M. J., and Wason, J. M. S. (2022). Adaptive Designs: Benefits and Cautions for Neurosurgery Trials. World Neurosurgery, 161, 316–322. https://doi.org/10.1016/J.WNEU.2021.07.061

  16. [16] Mukherjee, A., Grayling, M. J., and Wason, J. M. S. (2025). Evaluating the impact of outcome delay on the efficiency of two-arm group-sequential trials. Statistics in Biopharmaceutical Research. https://doi.org/10.1080/19466315.2025.2565162

  17. [17] Mukherjee, A., and Wason, J. M. S. (2025). Impact of Endpoint Delay on the Efficiency of Multi Arm Multi Stage Trials. Statistics in Medicine, 44(20-22). https://onlinelibrary.wiley.com/doi/10.1002/sim.70245

  18. [18] Mukherjee, A., Wason, J. M. S., and Grayling, M. J. (2022). When is a two-stage single-arm trial efficient? An evaluation of the impact of outcome delay. European Journal of Cancer, 166, 270–278. https://doi.org/10.1016/j.ejca.2022.02.010

  19. [19] Proschan, M. A. (2009). Sample size re-estimation in clinical trials. Biometrical Journal, 51(2), 348–357. https://doi.org/10.1002/bimj.200800266

  20. [20] Roufosse, F., Kahn, J.-E., Rothenberg, M. E., Wardlaw, A. J., Klion, A. D., Kirby, S. Y., Gilson, M. J., Bentley, J. H., Bradford, E. S., Yancey, S. W., Steinfeld, J., and Gleich, G. J. (2020). Efficacy and safety of mepolizumab in hypereosinophilic syndrome: A phase III, randomized, placebo-controlled trial. Journal of Allergy and Clinical Immunology, 14...

  21. [21] Shih, W. J., Li, G., and Wang, Y. (2016). Methods for flexible sample-size design in clinical trials: Likelihood, weighted, dual test, and promising zone approaches. Contemporary Clinical Trials, 47, 40–48. https://doi.org/10.1016/j.cct.2015.12.007

  22. [22] Teare, M. D., Dimairo, M., Shephard, N., Hayman, A., Whitehead, A., and Walters, S. J. (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: A simulation study. Trials, 15(1). https://doi.org/10.1186/1745-6215-15-264

  23. [23] Wang, P., and Chow, S. C. (2021). Sample size re-estimation in clinical trials. Statistics in Medicine, 40(27), 6133–6149. https://doi.org/10.1002/sim.9175

  24. [24] Wason, J. M. S., Brocklehurst, P., and Yap, C. (2019). When to keep it simple - Adaptive designs are not always useful. BMC Medicine, 17(1). https://doi.org/10.1186/s12916-019-1391-9

  25. [25] Wüst, K., and Kieser, M. (2003). Blinded Sample Size Recalculation for Normally Distributed Outcomes Using Long- and Short-term Data. Biometrical Journal, 45. https://onlinelibrary.wiley.com/doi/10.1002/bimj.200390060

  26. [26] Wüst, K., and Kieser, M. (2005). Including long- and short-term data in blinded sample size recalculation for binary endpoints. Computational Statistics and Data Analysis, 4(48). https://doi.org/10.1016/J.CSDA.2004.04.006