pith. machine review for the scientific record.

arxiv: 2605.03406 · v1 · submitted 2026-05-05 · 📊 stat.ME

Recognition: unknown

Optimal MILP Approach to Group Sequential Hypothesis Test

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 14:09 UTC · model grok-4.3

classification 📊 stat.ME
keywords group sequential tests · mixed integer linear programming · sample average approximation · alpha spending · type I error control · optimal boundaries · sequential hypothesis testing · clinical trial design

The pith

A mixed-integer linear program can optimize group sequential test boundaries to reject the null faster than classical methods while controlling type-1 and type-2 errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the design of group sequential hypothesis test boundaries as an optimization problem that directly minimizes expected stopping time subject to fixed type-1 and type-2 error limits. It solves this problem with a sample-average approximation embedded in a mixed-integer linear program, then compares the resulting boundaries against standard alpha-spending rules such as Lan-DeMets, Pocock, and O'Brien-Fleming. The optimized boundaries achieve the same error control yet reach rejection sooner on average, and they tend to allocate more of the alpha budget in the earliest looks. When applied to an acute kidney injury trial, the new boundaries detect the same significant effect with fewer observations than the original analysis or the classical procedures.
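The evaluation loop that both the sample-average approximation and the comparisons rest on can be sketched in a few lines. Everything below (function name, boundary values, group sizes) is illustrative rather than taken from the paper; it estimates, for a fixed one-sided group sequential z-test boundary vector, the quantities the optimization trades off.

```python
import numpy as np

def simulate_gst(boundaries, n_per_group, n_groups, effect, n_paths=100_000, seed=0):
    """Monte-Carlo estimate of (rejection rate, mean stopping group) for a
    one-sided group sequential z-test with the given rejection boundaries.
    Sketch only; names and defaults are ours, not the paper's."""
    rng = np.random.default_rng(seed)
    obs = rng.normal(effect, 1.0, size=(n_paths, n_groups * n_per_group))
    looks = np.arange(1, n_groups + 1) * n_per_group
    z = np.cumsum(obs, axis=1)[:, looks - 1] / np.sqrt(looks)  # z-statistic at each look
    crossed = z >= np.asarray(boundaries)
    reject = crossed.any(axis=1)
    # First look that crosses; paths that never reject run all n_groups looks.
    stop = np.where(reject, crossed.argmax(axis=1) + 1, n_groups)
    return reject.mean(), stop.mean()

# Under the null (effect = 0) the rejection rate estimates the type-1 error;
# under an alternative, power and mean stopping group are what the
# optimization trades off against it.
alpha_hat, _ = simulate_gst([2.4, 2.4, 2.4, 2.4, 2.4], 50, 5, effect=0.0)
power_hat, stop_hat = simulate_gst([2.4, 2.4, 2.4, 2.4, 2.4], 50, 5, effect=0.3)
```

The S-MILP can be read as searching over `boundaries` to minimize the stopping-time estimate while constraining the two error estimates, all computed on one fixed set of simulated paths.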

Core claim

By casting the search for optimal group-sequential rejection thresholds as a mixed-integer linear program whose objective and constraints are estimated by Monte-Carlo sampling, one obtains boundaries that strictly dominate the classical Lan-DeMets, Pocock, and O'Brien-Fleming rules in expected sample size while preserving the nominal type-1 and type-2 error rates.

What carries the argument

The S-MILP formulation that optimizes the vector of group-wise rejection thresholds under sampled estimates of type-1 and type-2 error and expected stopping time.
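A schematic of the kind of SAA-MILP this describes, in our notation rather than the paper's exact linearization. Here $z_{nk}$ is the simulated test statistic of path $n$ at look $k$, $b_k$ are the boundary decision variables, binary $y_{nk}$ indicates a boundary crossing, $\mathcal{N}_0$ and $\mathcal{N}_1$ are the null and alternative path samples, and $M$ is a big-$M$ constant:

```latex
\begin{aligned}
\min_{b,\,y,\,\tau}\quad
  & \frac{1}{N_1}\sum_{n \in \mathcal{N}_1} \tau_n
  && \text{expected stopping time under the alternative}\\
\text{s.t.}\quad
  & z_{nk} - b_k \le M\, y_{nk},
  \qquad b_k - z_{nk} \le M\,(1 - y_{nk}),
  && y_{nk} = 1 \iff z_{nk} \ge b_k \ \text{(big-$M$ linking)}\\
  & \frac{1}{N_0} \sum_{n \in \mathcal{N}_0} r_n \le \alpha,
  \qquad \frac{1}{N_1} \sum_{n \in \mathcal{N}_1} (1 - r_n) \le \beta
  && \text{SAA type-1 / type-2 constraints}\\
  & r_n = \min\Bigl\{1, \textstyle\sum_{k=1}^{K} y_{nk}\Bigr\},
  \qquad \tau_n = \min\{k : y_{nk} = 1\} \ (\text{else } K)
  && \text{linearized with auxiliary binaries}
\end{aligned}
```

The last line's rejection indicator and first-crossing time are nonlinear as written; in a full MILP they are encoded with additional binary variables, in the spirit of the integer-programming treatment of probabilistic constraints cited in the reference list.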

If this is right

  • Optimal boundaries allocate alpha more aggressively in the first few groups than classical spending functions.
  • The same optimization framework can be re-run for any desired target error rates or any number of interim analyses.
  • In the kidney-injury example the optimized rule reaches the same conclusion with a smaller total sample than the study actually used.
  • Any existing GST procedure can be replaced by its S-MILP counterpart without changing the error guarantees.
  • The approach yields a concrete numerical benchmark against which future alpha-spending proposals can be measured.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The early-alpha-spending pattern may generalize to other sequential decision problems where the cost of continued sampling is linear.
  • One could replace the sample-average approximation with a more accurate stochastic program or with exact dynamic programming when the state space is small.
  • The method supplies a practical way to incorporate additional constraints such as minimum power at each look or cost functions that penalize late stopping.
  • If the underlying distribution is misspecified, the optimized boundaries could be re-computed on-line as data arrive, turning the static design into an adaptive procedure.

Load-bearing premise

The sample-average approximation of the error rates and expected stopping times remains accurate for new data drawn from the same distribution after the MILP solver returns its solution.

What would settle it

Generate fresh Monte-Carlo replications from the same data-generating process used in the optimization, apply the reported S-MILP boundaries, and check whether the realized type-1 error exceeds the target level or whether the average stopping time fails to be smaller than that of the Pocock or O'Brien-Fleming boundaries.
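A cheap version of that check, sketched here with made-up counts: take the realized null rejections on fresh paths and attach a one-sided Hoeffding bound, so that sampling noise cannot excuse an exceedance.

```python
import math

def type1_upper_bound(rejections, n_paths, delta=0.05):
    """One-sided Hoeffding upper confidence bound on the true type-1 error,
    given `rejections` null rejections among `n_paths` *fresh* Monte-Carlo
    paths (none reused from the optimization). Holds with prob. >= 1 - delta."""
    return rejections / n_paths + math.sqrt(math.log(1 / delta) / (2 * n_paths))

# Hypothetical validation run: 9,600 rejections on 200,000 fresh null paths.
ub = type1_upper_bound(9_600, 200_000)
# ub ≈ 0.0507: consistent with a 5% target only up to Monte-Carlo slack.
```

The analogous check on stopping times is a plain difference of means between the S-MILP and Pocock or O'Brien-Fleming boundaries on the same fresh paths.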

Figures

Figures reproduced from arXiv: 2605.03406 by Dae Woong Ham, Stefanus Jasin, Xuejun Zhao.

Figure 1
Figure 1: Plot of alpha-spending budgets for all methods. The simulation setting is the same as that…
Original abstract

Sequential hypothesis tests are widely adopted as a principled way to perform multiple tests on data that arrives over time. In particular, researchers frequently utilize group sequential hypothesis tests (GST) to test the same hypotheses at K times or "groups" while data arrives sequentially. In this setting, many methods have been proposed to allow researchers to uniformly control type-1 error across K checks (often known as various alpha-spending budgets). Although these methods are all successfully valid in controlling uniform type-1 error, it is not clear which of these methods are optimal when trying to reject the null as soon as possible. In this paper, we directly optimize the rejection criterion in the GST setting under the same constraints of controlling type-1 and type-2 errors. We use a sample average approximation combined with mixed integer linear programming (S-MILP) approach for this problem and show how our S-MILP approach dominates classical GST procedures such as Lan-DeMets, Pocock, and O'Brien-Fleming methods. We also find that the optimal solution typically aggressively spends the alpha-budget early, shedding insight to the long-standing debate of which alpha-spending budgets are more efficient. We finally apply our optimal S-MILP approach to a recent study on acute kidney injury interventions and find our optimal S-MILP approach can reach the same statistically significant conclusion faster than the original study and other GST methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a sample-average approximation mixed-integer linear programming (S-MILP) formulation to directly optimize the rejection boundaries of group sequential hypothesis tests (GST) so as to minimize expected stopping time while enforcing type-I and type-II error constraints. It claims that the resulting boundaries dominate the classical Lan-DeMets, Pocock, and O'Brien-Fleming spending-function procedures both in simulation and in a re-analysis of an acute-kidney-injury trial, and that the optimal policy tends to spend alpha aggressively early.

Significance. A computationally tractable, optimization-based method that produces GST boundaries with demonstrably better expected performance while preserving error control would be a useful addition to the sequential-analysis toolkit, particularly if it can be shown to generalize beyond the training sample. The MILP encoding itself is a novel technical contribution for this class of problems.

major comments (3)
  1. [§3] §3 (S-MILP formulation): the type-I and type-II error constraints are replaced by their empirical averages over a finite Monte-Carlo sample; the returned boundaries are therefore guaranteed to satisfy the nominal error rates only on the training paths. No concentration inequality, out-of-sample validation set, or a-priori error bound is supplied to control the deviation from the true expectations, in contrast to the exact control provided by classical spending functions for any sample size.
  2. [§5] Simulation study (likely §5 and associated tables): performance comparisons are reported on the same Monte-Carlo paths used to construct the S-MILP objective and constraints. This leaves open the possibility that the reported dominance is partly due to overfitting the training sample rather than superior performance under the true data-generating distribution.
  3. [Abstract, §4] Abstract and §4 (application): the claim that the optimized boundaries “exactly control” type-I and type-II errors is not supported by the SAA formulation; the manuscript should either qualify the control as approximate or provide a post-hoc verification on an independent test sample.
minor comments (2)
  1. The number of Monte-Carlo samples used in the SAA is listed as a free parameter but no sensitivity analysis or convergence diagnostics are shown.
  2. Notation for the boundary parameters and the MILP decision variables should be introduced more explicitly before the optimization model is stated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments correctly identify the approximate nature of error control under sample-average approximation and the risk of in-sample evaluation. We address each point below and will revise the manuscript to qualify claims, add out-of-sample validation, and improve the presentation of results.

Point-by-point responses
  1. Referee: [§3] §3 (S-MILP formulation): the type-I and type-II error constraints are replaced by their empirical averages over a finite Monte-Carlo sample; the returned boundaries are therefore guaranteed to satisfy the nominal error rates only on the training paths. No concentration inequality, out-of-sample validation set, or a-priori error bound is supplied to control the deviation from the true expectations, in contrast to the exact control provided by classical spending functions for any sample size.

    Authors: We agree that the S-MILP formulation provides only approximate control of the type-I and type-II errors because the constraints are replaced by their empirical averages over a finite Monte Carlo sample. While the SAA approach is standard in stochastic optimization and the approximation becomes arbitrarily accurate for large sample sizes, the manuscript does not supply concentration bounds or a priori guarantees. In the revision we will explicitly state that error control is approximate, discuss the dependence on Monte Carlo sample size, and add an independent validation set of paths (generated from the same data-generating process but not used in optimization) to empirically verify that the nominal error rates are respected out of sample. revision: yes

  2. Referee: [§5] Simulation study (likely §5 and associated tables): performance comparisons are reported on the same Monte-Carlo paths used to construct the S-MILP objective and constraints. This leaves open the possibility that the reported dominance is partly due to overfitting the training sample rather than superior performance under the true data-generating distribution.

    Authors: The referee is correct that the simulation results in §5 were obtained on the same Monte Carlo paths used to build the S-MILP objective and constraints. This creates a legitimate concern about possible overfitting. We will add a new set of experiments in the revised manuscript that evaluate the optimized boundaries on a large, independent test sample drawn from the identical data-generating distribution. These out-of-sample results will be reported alongside the original tables to demonstrate that the reported dominance persists beyond the training paths. revision: yes

  3. Referee: [Abstract, §4] Abstract and §4 (application): the claim that the optimized boundaries “exactly control” type-I and type-II errors is not supported by the SAA formulation; the manuscript should either qualify the control as approximate or provide a post-hoc verification on an independent test sample.

    Authors: We acknowledge that the wording in the abstract and §4 is imprecise. The SAA formulation does not deliver exact control, and the manuscript should not use the term “exactly.” In the revision we will replace such language with “approximately control” (with the approximation improving as the Monte Carlo sample grows) and will include post-hoc verification on an independent test sample, as already planned for the simulation section, to support the application results. revision: yes

Circularity Check

1 step flagged

S-MILP dominance reduces to superiority on the same SAA samples used to define the optimized objective

specific steps
  1. fitted input called prediction [Abstract]
    "We use a sample average approximation combined with mixed integer linear programming (S-MILP) approach for this problem and show how our S-MILP approach dominates classical GST procedures such as Lan-DeMets, Pocock, and O'Brien-Fleming methods."

    The S-MILP directly optimizes an objective (e.g., expected stopping time) whose value is defined by the identical sample-average approximation that is later used to demonstrate dominance. Consequently the reported superiority on those Monte-Carlo paths is guaranteed by the optimality of the MILP solution rather than by an independent comparison under the true measure.

full rationale

The paper formulates an optimization problem whose objective and type-I/II error constraints are replaced by finite-sample averages (SAA). It then claims dominance over classical spending functions by comparing the same approximated metrics. Because the MILP is solved to optimality on those exact averages, any reported improvement in the approximated objective holds by construction on the training paths; no separate out-of-sample evaluation or uniform convergence guarantee is invoked to separate the optimization from the reported performance. This matches the fitted-input-called-prediction pattern and produces partial circularity in the central empirical claim.
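The flagged pattern is easy to reproduce in miniature. The toy below (a one-look test, not the paper's MILP; all numbers are ours) tunes a threshold so the empirical type-1 error on training paths exactly meets the target, then evaluates it on fresh paths, where the guarantee need not hold.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_train, n_test = 0.05, 200, 100_000  # a small training set exaggerates the effect

# "Optimize" on training nulls: the smallest one-look threshold whose
# empirical type-1 error meets the target. The constraint binds by construction.
z_train = rng.standard_normal(n_train)
b = np.quantile(z_train, 1 - alpha)

train_err = (z_train >= b).mean()                      # ~= alpha, in sample
test_err = (rng.standard_normal(n_test) >= b).mean()   # out of sample: no such promise
```

Reporting `train_err` as the achieved type-1 error is the fitted-input-called-prediction step; only `test_err` speaks to the true measure.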

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard parametric assumptions for sequential testing and on the accuracy of the simulation-based approximation; no new physical entities are introduced.

free parameters (1)
  • Number of Monte Carlo samples in SAA
    The sample size used to approximate expectations is chosen by the authors to trade off accuracy against computation time.
axioms (1)
  • domain assumption Data at each group follow a distribution (typically normal or binomial) for which the test statistic is well-defined and the error probabilities can be estimated by simulation.
    Invoked when generating the simulated data sets used inside the MILP.
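For the one free parameter, a standard Hoeffding calculation gives a defensible default for a single fixed boundary vector; a guarantee uniform over the MILP's feasible set would need a further union bound. The function name and targets below are ours.

```python
import math

def saa_paths_needed(eps, delta):
    """Monte-Carlo paths so that one empirical error rate sits within eps of
    the truth with probability at least 1 - delta (Hoeffding, one boundary)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Estimating a 5% type-1 error to within +/- 0.5% with 99% confidence:
n_paths = saa_paths_needed(eps=0.005, delta=0.01)   # 105,967 paths
```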

pith-pipeline@v0.9.0 · 5549 in / 1486 out tokens · 102256 ms · 2026-05-07T14:09:52.218968+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 9 canonical work pages

  1. Billingsley, Patrick. 2013. Convergence of Probability Measures. John Wiley & Sons.
  2. Clautiaux, François, Ivana Ljubić. 2025. Last fifty years of integer linear programming: A focus on recent practical advances. European Journal of Operational Research 324(3) 707–731.
  3. Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
  4. Gurobi Optimization, LLC. 2026. Gurobi Optimizer Reference Manual. URL https://www.gurobi.com.
  5. Ham, Dae Woong, et al. Design-based confidence sequences: A general approach to risk mitigation in online experimentation. URL https://arxiv.org/abs/2210.08639.
  6. Ham, Dae Woong, Kosuke Imai, Lucas Janson. Using machine learning to test causal hypotheses in conjoint analysis. doi:10.48550/arXiv.2201.08343.
  7. Howard, Steven R., Aaditya Ramdas, Jon D. McAuliffe, Jasjeet S. Sekhon. 2020. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics.
  8. IBM. 2025. IBM ILOG CPLEX Optimization Studio. Version 22.1. https://www.ibm.com/products/ilog-cplex-optimization-studio.
  9. Jennison, Christopher, Bruce W. Turnbull. 1999. Group Sequential Tests with Applications to Clinical Trials. Chapman & Hall/CRC Interdisciplinary Statistics, Chapman & Hall, United Kingdom.
  10. Johari, Ramesh, Leonid Pekelis, David Walsh. 2015. Always valid inference: Bringing sequential analysis to A/B testing.
  11. Lachin, John M. 1981. Introduction to sample size determination and power analysis for clinical trials. Controlled Clinical Trials 2 93–113.
  12. Lan, K. K. Gordon, David L. DeMets. 1983. Discrete sequential boundaries for clinical trials. Biometrika 70(3) 659–663. URL http://www.jstor.org/stable/2336502.
  13. Laurent, Beatrice, Pascal Massart. 2000. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.
  14. Lindon, Michael, Alan Malek. 2022. Anytime-valid inference for multinomial count data.
  15. Lindon, Michael, Chris Sanden, Vaché Shirikian. 2022. Rapid regression detection in software deployments through sequential testing. KDD '22, Association for Computing Machinery, New York, NY, USA, 3336–3346. doi:10.1145/3534678.3539099.
  16. Luedtke, James, Shabbir Ahmed, George L. Nemhauser. 2010. An integer programming approach for linear programs with probabilistic constraints. Mathematical Programming 122(2) 247–272.
  17. Mityagin, Boris. 2015. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276.
  18. O'Brien, Peter C., Thomas R. Fleming. 1979. A multiple testing procedure for clinical trials. Biometrics 35(3) 549–556. URL http://www.jstor.org/stable/2530245.
  19. Pagnoncelli, Bernardo K., Shabbir Ahmed, Alexander Shapiro. 2009. Sample average approximation method for chance constrained programming: theory and applications. Journal of Optimization Theory and Applications 142(2) 399–416.
  20. Pocock, Stuart J. 1977. Group sequential methods in the design and analysis of clinical trials. Biometrika 64(2) 191–199. URL http://www.jstor.org/stable/2335684.
  21. Schultzberg, Marten, Sebastian Ankargren. 2023. Choosing a sequential testing framework — comparisons and discussions. URL https://engineering.atspotify.com/2023/03/choosing-sequential-testing-framework-comparisons-and-discussions.
  22. Shapiro, Alexander, Darinka Dentcheva, Andrzej Ruszczynski. 2021. Lectures on Stochastic Programming: Modeling and Theory. SIAM.
  23. Silva, Ivair R., Martin Kulldorff, W. Katherine Yih…
  24. Still, Georg. 2018. Lectures on parametric optimization: An introduction. Optimization Online 2.
  25. Walker, A. M. 1968. A note on the asymptotic distribution of sample quantiles. Journal of the Royal Statistical Society Series B: Statistical Methodology 30(3) 570–575.
  26. Wang, Wei, Shabbir Ahmed. 2008. Sample average approximation of expected value constrained stochastic programs. Operations Research Letters 36(5) 515–519.
  27. Wassmer, Gernot, Werner Brannath. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials.
  28. Waudby-Smith, Ian, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, Paul Mineiro. 2022. Anytime-valid off-policy inference for contextual bandits. doi:10.48550/ARXIV.2210.10768. URL https://arxiv.org/abs/2210.10768.
  29. Wilson, F. Perry, Yu Yamamoto, Melissa Martin, Claudia Coronel-Moreno, Fan Li, Chao Cheng, Abinet Aklilu, Lama Ghazi, Jason H. Greenberg, Stephen Latham, et al. 2023. A randomized clinical trial assessing the effect of automated medication-targeted alerts on acute kidney injury outcomes. Nature Communications 14(1) 2826.
    Wilson, F Perry, Yu Yamamoto, Melissa Martin, Claudia Coronel-Moreno, Fan Li, Chao Cheng, Abinet Aklilu, Lama Ghazi, Jason H Greenberg, Stephen Latham, et al. 2023. A randomized clinical trial assessing the effect of automated medication-targeted alerts on acute kidney injury outcomes.Nature communications14(1) 2826. 95