Prior-Free Sample Size Design for Test-and-Roll Experiments
Pith reviewed 2026-05-08 02:07 UTC · model grok-4.3
The pith
Under the worst-case marginal benefit rule, the optimal sample size for a test-and-roll experiment is roughly one third of the population.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the Worst-case Marginal Benefit (WMB) criterion for choosing the sample size m in a test-and-roll experiment with total population N yields m ≈ N/3. For Bernoulli outcomes the result holds approximately, through a Gaussian approximation and after excluding pathological cases; for Gaussian outcomes with a known common variance it holds exactly. By focusing on marginal changes in the worst case, the criterion avoids the over-penalization of exploration that occurs under absolute minimax regret.
What carries the argument
The Worst-case Marginal Benefit (WMB) rule that equates the worst-case gain from testing one more matched pair with the associated marginal welfare cost of exploration.
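To make the marginal comparison concrete, here is a minimal Python sketch under assumptions that are ours rather than the paper's: Gaussian outcomes with known common standard deviation `sigma`, an even split of the m test units across two arms, a true effect `delta`, and m treated as continuous. In that setting the expected regret is delta * [m/2 + (N - m) * Phi(-t)] with t = delta * sqrt(m) / (2 * sigma). Differentiating in m gives a marginal exploration cost proportional to Phi(t) - 1/2 and a marginal rollout benefit proportional to (N - m) * t * phi(t) / (2m). The function name `benefit_cost_ratio` is hypothetical, and this is not claimed to be the paper's definition of WMB.

```python
import math


def normal_pdf(t: float) -> float:
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def benefit_cost_ratio(m: float, N: float, delta: float, sigma: float = 1.0) -> float:
    """Marginal rollout benefit divided by marginal exploration cost.

    Illustrative assumption: Gaussian outcomes, known sigma, m/2 units per arm,
    so the estimated effect has standard deviation 2 * sigma / sqrt(m).
    Requires m > 0 and delta > 0.
    """
    t = delta * math.sqrt(m) / (2.0 * sigma)
    # Gain from a sharper rollout decision, per unit of extra test size.
    benefit = (N - m) * t * normal_pdf(t) / (2.0 * m)
    # Phi(t) - 1/2, written with erf to avoid cancellation for small t.
    cost = 0.5 * math.erf(t / math.sqrt(2.0))
    return benefit / cost


if __name__ == "__main__":
    N = 900
    for m in (200, 300, 400):
        ratios = [benefit_cost_ratio(m, N, d) for d in (0.001, 0.01, 0.1, 1.0)]
        print(m, [round(r, 4) for r in ratios])
```

Because t * phi(t) is never larger than Phi(t) - 1/2 for t >= 0, the ratio is at most (N - m) / (2m), which equals 1 at m = N/3. In this sketch, then, no effect size makes further testing worthwhile once the test exceeds one third of the population, which is at least consistent with the benchmark described above.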
If this is right
- The standard absolute minimax regret criterion can lead to implausibly small experiments.
- Under WMB, the optimal test size is about one third of the population (m ≈ N/3).
- The benchmark is prior-free and applies to common outcome types like Bernoulli and Gaussian.
- Welfare losses in the test phase are traded off against improved decisions for the rollout phase.
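The first bullet can be illustrated with a crude grid search. The sketch below is an assumption-laden stand-in, not the paper's calculation: it uses a Gaussian approximation to the rollout error, fixes `sigma = 0.5` (the largest standard deviation a Bernoulli outcome can have), restricts the effect size to (0, `delta_max`], and splits the test sample evenly. The names `expected_regret`, `worst_case_regret`, and `minimax_m` are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_regret(m: int, N: int, delta: float, sigma: float) -> float:
    """Expected welfare loss relative to giving all N units the better arm.

    m/2 test units receive the inferior arm; the remaining N - m units
    receive it only if the experiment picks the wrong arm.
    """
    p_wrong = normal_cdf(-delta * math.sqrt(m) / (2.0 * sigma))
    return delta * (m / 2.0 + (N - m) * p_wrong)


def worst_case_regret(m: int, N: int, sigma: float, delta_max: float, grid: int = 400) -> float:
    """Maximum expected regret over an evenly spaced grid of effect sizes."""
    deltas = (delta_max * (i + 1) / grid for i in range(grid))
    return max(expected_regret(m, N, d, sigma) for d in deltas)


def minimax_m(N: int, sigma: float = 0.5, delta_max: float = 1.0) -> int:
    """Even test size that minimizes worst-case regret on the grid."""
    candidates = range(2, N + 1, 2)
    return min(candidates, key=lambda m: worst_case_regret(m, N, sigma, delta_max))


if __name__ == "__main__":
    N = 1000
    print("minimax m:", minimax_m(N), "versus N/3:", N / 3)
```

In this setup the worst case at large effects is dominated by the test-phase cost of roughly m/2, which pushes the minimizer far below N/3. The exact number depends on the grid and on the assumed bounds.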
Where Pith is reading between the lines
- This marginal approach could generalize to other experiment designs where units are assigned sequentially.
- Experimenters might combine the one-third rule with adaptive stopping rules for greater efficiency.
- The result highlights how reframing the objective from absolute to marginal worst-case can change practical recommendations substantially.
Load-bearing premise
That framing the problem in terms of worst-case marginal benefits and costs correctly captures the welfare tradeoff between testing and rollout.
What would settle it
Calculating the exact optimal m for a specific Bernoulli distribution under the WMB objective and finding it differs substantially from N/3 would show the approximation or benchmark is not reliable.
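A building block for such a check is the exact probability of rolling out the inferior arm under Bernoulli outcomes, compared with its Gaussian approximation. The sketch below assumes independent draws with success probabilities `p_a > p_b`, `n = m/2` units per arm, and ties broken by a fair coin. It models sampling from a superpopulation rather than the paper's finite population, and it does not implement the WMB objective itself. All function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binom_pmf(n: int, p: float) -> list:
    """Binomial(n, p) probabilities for k = 0..n."""
    return [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def exact_p_wrong(n: int, p_a: float, p_b: float) -> float:
    """Exact P(inferior arm b is chosen), with ties split evenly."""
    pmf_a = binom_pmf(n, p_a)
    pmf_b = binom_pmf(n, p_b)
    p_wrong = 0.0
    below = 0.0  # running P(X_a < k)
    for k in range(n + 1):
        p_wrong += pmf_b[k] * (below + 0.5 * pmf_a[k])
        below += pmf_a[k]
    return p_wrong


def gaussian_p_wrong(n: int, p_a: float, p_b: float) -> float:
    """Normal approximation; needs n >= 1 and a strictly positive variance."""
    var = (p_a * (1.0 - p_a) + p_b * (1.0 - p_b)) / n
    return normal_cdf(-(p_a - p_b) / math.sqrt(var))


def exact_regret(N: int, m: int, p_a: float, p_b: float) -> float:
    """Expected regret with m/2 units per arm in the test phase."""
    delta = p_a - p_b
    return delta * (m / 2.0 + (N - m) * exact_p_wrong(m // 2, p_a, p_b))


if __name__ == "__main__":
    for n in (5, 20, 50):
        print(n, exact_p_wrong(n, 0.6, 0.4), gaussian_p_wrong(n, 0.6, 0.4))
```

Comparing the two columns across `n` and across parameter pairs close to 0 or 1 gives a first sense of where the approximation is weakest. Settling the question would additionally require evaluating the paper's marginal criterion with the exact probabilities.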
Original abstract
This paper studies sample-size design for finite-population test-and-roll experiments, where a decision-maker first conducts an experiment on $m$ units and then assigns the remaining $N-m$ units to the treatment that performs better in the experiment. We consider welfare-aware sample-size choice, which involves an exploration-exploitation tradeoff: larger experiments improve the rollout decision but impose welfare losses on experimental units assigned to the inferior treatment. We show that the standard absolute minimax regret criterion can lead to implausibly small experiments by over-penalizing exploration in its worst-case objective. To address this limitation, we propose the Worst-case Marginal Benefit (WMB) rule, which compares the worst-case marginal benefit of adding one more matched pair to the experiment with the corresponding marginal exploration cost. We establish a simple rule-of-thirds benchmark. For Bernoulli outcomes, after excluding pathological cases, the WMB criterion yields the optimal sample size of $m \approx N/3$ through a Gaussian approximation. For Gaussian outcomes with a known common variance, the same benchmark arises exactly. These results provide a prior-free and practically implementable guide for welfare-based sample-size design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies welfare-aware sample-size design for finite-population test-and-roll experiments. It argues that absolute minimax regret produces implausibly small experiments, proposes the Worst-case Marginal Benefit (WMB) rule that compares the worst-case marginal welfare benefit of an additional matched pair against its marginal exploration cost, and derives a simple benchmark: m ≈ N/3 for Bernoulli outcomes (via Gaussian approximation after excluding pathological cases) and exactly for Gaussian outcomes with known common variance.
Significance. If the WMB derivation holds, the paper supplies a prior-free, analytically tractable rule that directly addresses the exploration-exploitation tradeoff in test-and-roll settings and yields an easily communicated benchmark. The exact N/3 result for the Gaussian case and the introduction of the WMB criterion are clear strengths; the work could influence practical experimental design in economics and marketing.
major comments (1)
- [Abstract / Bernoulli WMB derivation] The headline claim that WMB yields m ≈ N/3 rests on a Gaussian approximation, with no stated error bound, to the finite-population sampling distribution inside the worst-case marginal-benefit objective. Bernoulli outcomes are discrete and bounded; without an analytic error bound on the marginal welfare comparison (especially near the excluded pathological boundaries), it is unclear whether the approximation error could overturn the N/3 benchmark in worst-case regimes, a concern that does not arise in the exact Gaussian-outcome case.
minor comments (1)
- [Abstract] The abstract should explicitly define or characterize the 'pathological cases' that are excluded for Bernoulli outcomes so readers can assess the practical scope of the N/3 rule.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the reliance on the Gaussian approximation in the Bernoulli derivation. We respond to the major comment below.
- Referee: [Abstract / Bernoulli WMB derivation] The headline claim that WMB yields m ≈ N/3 rests on a Gaussian approximation, with no stated error bound, to the finite-population sampling distribution inside the worst-case marginal-benefit objective. Bernoulli outcomes are discrete and bounded; without an analytic error bound on the marginal welfare comparison (especially near the excluded pathological boundaries), it is unclear whether the approximation error could overturn the N/3 benchmark in worst-case regimes, a concern that does not arise in the exact Gaussian-outcome case.
Authors: We agree that the Bernoulli WMB derivation employs a Gaussian approximation to the finite-population sampling distribution of the welfare metric without an explicit analytic error bound, in contrast to the exact result for Gaussian outcomes. The approximation is invoked only after excluding pathological cases (where the worst-case marginal benefit is zero or negative, rendering experimentation irrelevant). In the interior of the parameter space the finite-population central limit theorem supplies the justification, and the resulting m ≈ N/3 serves as a simple, prior-free benchmark. We nevertheless accept that, absent a quantitative bound on the approximation error near the excluded boundaries, it remains conceivable that the error could shift the location of the optimum in certain worst-case regimes. In the revision we will (i) state the approximation assumption more explicitly in the abstract and main text, (ii) add a brief discussion of the finite-population CLT and its limitations, and (iii) include Monte Carlo evidence confirming that the optimal sample size remains close to N/3 for a wide range of N and non-pathological parameters. This is a partial revision; we will strengthen the supporting analysis but do not supply a new closed-form error bound.
revision: partial
- Deriving a rigorous analytic error bound on the Gaussian approximation error for the worst-case marginal-benefit objective under Bernoulli outcomes.
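The kind of Monte Carlo evidence promised in the rebuttal could start from a simulation like the following. This is a minimal sketch, not the authors' code: it assumes independent Bernoulli draws with `p_a >= p_b` (a superpopulation stand-in for the finite-population design), an even split of the m test units, and a fair coin on ties. The function name `simulate_regret` and its parameters are hypothetical.

```python
import random


def simulate_regret(N: int, m: int, p_a: float, p_b: float,
                    reps: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of expected regret for one test-and-roll design.

    Regret is measured in expected outcomes: each unit given arm b loses
    delta = p_a - p_b relative to arm a.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    n = m // 2
    delta = p_a - p_b
    total = 0.0
    for _ in range(reps):
        succ_a = sum(rng.random() < p_a for _ in range(n))
        succ_b = sum(rng.random() < p_b for _ in range(n))
        if succ_a == succ_b:
            wrong = rng.random() < 0.5  # tie: pick an arm at random
        else:
            wrong = succ_b > succ_a
        # Test-phase loss from the n units on arm b, plus rollout loss if wrong.
        total += n * delta + (N - m) * delta * wrong
    return total / reps


if __name__ == "__main__":
    N = 300
    for m in (20, 60, 100, 140):
        print(m, round(simulate_regret(N, m, 0.6, 0.5), 3))
```

A sweep like this estimates regret at a single parameter point only. Checking the N/3 benchmark would also require applying the paper's worst-case marginal criterion across the non-pathological parameter range.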
Circularity Check
WMB benchmark derivation is self-contained; no reduction to inputs by construction
The paper defines the WMB criterion explicitly as a comparison of worst-case marginal benefit of an additional matched pair against marginal exploration cost. It then applies this rule to the finite-population test-and-roll objective. For Gaussian outcomes the m = N/3 benchmark follows exactly from the resulting optimization; for Bernoulli outcomes it follows from the stated Gaussian approximation after excluding pathological cases. Neither step renames a fitted quantity as a prediction, invokes a self-citation as the sole justification for a uniqueness claim, nor defines the target result in terms of itself. The approximation is presented as an explicit modeling choice whose accuracy is left as an assumption rather than asserted by construction. Consequently the central claim does not collapse to a tautology or to data-driven fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gaussian approximation is valid for Bernoulli outcomes after excluding pathological cases
invented entities (1)
- Worst-case Marginal Benefit (WMB) rule: no independent evidence