arxiv: 2605.02414 · v1 · submitted 2026-05-04 · 💰 econ.EM · stat.ME

Recognition: unknown

Prior-Free Sample Size Design for Test-and-Roll Experiments

Kentaro Kawato, Shosei Sakaguchi

Pith reviewed 2026-05-08 02:07 UTC · model grok-4.3


keywords: test-and-roll experiments, sample size, welfare-aware design, worst-case marginal benefit, minimax regret, Bernoulli outcomes, Gaussian outcomes


The pith

The worst-case marginal benefit rule for test-and-roll experiments sets optimal sample size at roughly one third of the population.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In finite populations, test-and-roll experiments test on m units and then assign the better-performing treatment to the remaining N-m units. Standard minimax regret criteria for choosing m can favor very small experiments because they focus on the absolute worst case. The paper instead proposes comparing the worst-case marginal benefit of one additional matched test pair with its marginal exploration cost. This approach yields m ≈ N/3 for Bernoulli outcomes under a Gaussian approximation (after excluding pathological cases), and the same benchmark arises exactly for Gaussian outcomes with a known common variance. The result supplies a practical, prior-free guide for balancing learning against welfare losses during the experiment.
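The tradeoff in this setup can be made concrete with a short calculation. The following is a minimal Python sketch, not code from the paper. It assumes Gaussian outcomes with a known common standard deviation `sigma`, an even split of the m test units across two arms, a true mean gap `delta`, and a coin-flip rollout when m = 0. The names `norm_cdf` and `expected_regret` are hypothetical.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_regret(N: int, m: int, delta: float, sigma: float = 1.0) -> float:
    """Expected welfare regret of a test-and-roll design (illustrative).

    Assumptions (not taken from the paper): m is even, m/2 units per arm,
    Gaussian outcomes with known common std `sigma`, true mean gap delta > 0.
    """
    if m == 0:
        # Assumption: with no data, the rollout arm is chosen by a fair coin.
        return N * delta * 0.5
    # Test phase: m/2 units receive the inferior arm, each losing delta.
    test_loss = (m / 2) * delta
    # Difference of sample means is Normal(delta, 4 * sigma^2 / m),
    # so the inferior arm wins the test with this probability.
    p_wrong = norm_cdf(-delta * math.sqrt(m) / (2.0 * sigma))
    # Rollout phase: N - m units lose delta each if the wrong arm is chosen.
    rollout_loss = (N - m) * delta * p_wrong
    return test_loss + rollout_loss


if __name__ == "__main__":
    for m in (0, 20, 200, 600):
        print(m, expected_regret(600, m, delta=0.2))
```

Under these assumptions, testing nobody (m = 0) and testing everybody (m = N) give the same expected regret, N·delta/2, while an interior m does better. That is the exploration-exploitation tradeoff in miniature.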

Core claim

The paper claims that the Worst-case Marginal Benefit (WMB) criterion for choosing the sample size m in a test-and-roll experiment with total population N yields m approximately N/3. This holds after excluding pathological cases for Bernoulli outcomes through a Gaussian approximation, and exactly for Gaussian outcomes when the common variance is known. The criterion avoids the over-penalization of exploration that occurs under absolute minimax regret by focusing on marginal changes in the worst case.

What carries the argument

The Worst-case Marginal Benefit (WMB) rule, which compares the worst-case gain from testing one more matched pair with the corresponding marginal welfare cost of exploration.
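A worked sketch shows how a one-third benchmark can fall out of a marginal comparison of this kind in the Gaussian case. It is an illustrative reconstruction under stated assumptions, not the paper's own derivation: the number of matched pairs k = m/2 is treated as continuous, the rollout picks the arm with the higher sample mean, and "marginal benefit" and "marginal cost" are taken to be the two terms of the derivative of expected regret. Whether the paper defines them exactly this way is an assumption.

```latex
% Illustrative sketch, not the paper's derivation.
% Assumptions: k = m/2 matched pairs (continuous), gap \delta > 0,
% known common standard deviation \sigma.
\begin{align*}
x &= \frac{\delta\sqrt{k}}{\sqrt{2}\,\sigma}, \qquad
R(k,\delta) = k\delta + (N-2k)\,\delta\,\Phi(-x), \\
\frac{\partial R}{\partial k}
  &= \underbrace{\delta\,\bigl[\,2\Phi(x)-1\,\bigr]}_{\text{marginal cost}}
   \;-\; \underbrace{(N-2k)\,\delta\,\frac{x\,\varphi(x)}{2k}}_{\text{marginal benefit}}, \\
\frac{\text{benefit}}{\text{cost}}
  &= \frac{N-2k}{2k}\cdot\frac{x\,\varphi(x)}{2\Phi(x)-1},
  \qquad
  \sup_{x>0}\frac{x\,\varphi(x)}{2\Phi(x)-1} = \frac{1}{2}
  \quad\text{(approached as } x \to 0\text{)}, \\
\frac{N-2k}{4k} &= 1
  \;\Longrightarrow\; k = \frac{N}{6}
  \;\Longrightarrow\; m = 2k = \frac{N}{3}.
\end{align*}
```

The bound of 1/2 holds because φ is decreasing on [0, x], so 2Φ(x) − 1 = 2∫₀ˣ φ(t) dt ≥ 2xφ(x). In this sketch, N/3 is the largest test size at which some state still makes one more pair worthwhile.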

If this is right

Standard absolute minimax regret can lead to implausibly small sample sizes.
Under the WMB rule, the benchmark test size is about one third of the population.
The benchmark is prior-free and applies to common outcome types like Bernoulli and Gaussian.
Welfare losses in the test phase are traded off against improved decisions for the rollout phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This marginal approach could generalize to other experiment designs where units are assigned sequentially.
Experimenters might combine the one-third rule with adaptive stopping rules for greater efficiency.
The result highlights how reframing the objective from absolute to marginal worst-case can change practical recommendations substantially.

Load-bearing premise

That framing the problem in terms of worst-case marginal benefits and costs correctly captures the welfare tradeoff between testing and rollout.

What would settle it

Calculating the exact optimal m for a specific Bernoulli distribution under the WMB objective and finding it differs substantially from N/3 would show the approximation or benchmark is not reliable.
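One way to set up such a check is exact enumeration over binomial outcomes. The Python sketch below is a building block rather than the paper's WMB objective. It evaluates a single state (p0, p1) instead of a worst case over states, breaks ties in the sample success counts with a fair coin (an assumption), and does not implement the paper's exclusion of pathological cases, which this summary does not specify. All function names are hypothetical.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def exact_bernoulli_regret(N: int, pairs: int, p0: float, p1: float) -> float:
    """Exact expected regret with `pairs` test units per arm.

    Assumption: the rollout picks the arm with more successes, and ties
    are broken by a fair coin flip.
    """
    delta = abs(p1 - p0)
    p_best, p_other = max(p0, p1), min(p0, p1)
    pmf_best = [binom_pmf(pairs, k, p_best) for k in range(pairs + 1)]
    pmf_other = [binom_pmf(pairs, k, p_other) for k in range(pairs + 1)]
    # Probability that the inferior arm is rolled out.
    p_wrong = 0.0
    for a, pa in enumerate(pmf_best):
        for b, pb in enumerate(pmf_other):
            if b > a:
                p_wrong += pa * pb
            elif b == a:
                p_wrong += 0.5 * pa * pb
    # Test-phase loss plus rollout-phase loss.
    return pairs * delta + (N - 2 * pairs) * delta * p_wrong


def marginal_change(N: int, pairs: int, p0: float, p1: float) -> float:
    """Change in exact regret from adding one more matched pair."""
    return (exact_bernoulli_regret(N, pairs + 1, p0, p1)
            - exact_bernoulli_regret(N, pairs, p0, p1))


if __name__ == "__main__":
    N = 120
    for pairs in (5, 10, 20, 30):
        print(pairs, marginal_change(N, pairs, p0=0.4, p1=0.6))
```

A negative marginal change means one more pair still lowers expected regret in that state. Repeating the scan over a grid of states and comparing the resulting stopping point with N/3 would be one way to probe the approximation, subject to the caveats above.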

Figures

Figures reproduced from arXiv: 2605.02414 by Kentaro Kawato, Shosei Sakaguchi.

**Figure 2.** Maximized marginal cost-benefit ratio at …

**Figure 3.** Relative regret as a function of m for least favorable states approaching (0, ϵ). Panel title: Relative Regret Across Epsilon. Axes: size of the experiment (m) against relative regret. Curves: ϵ = 10^−2 (long-dash), 10^−4 (dashed), 10^−6 (dotted).

Original abstract

This paper studies sample-size design for finite-population test-and-roll experiments, where a decision-maker first conducts an experiment on $m$ units and then assigns the remaining $N-m$ units to the treatment that performs better in the experiment. We consider welfare-aware sample-size choice, which involves an exploration-exploitation tradeoff: larger experiments improve the rollout decision but impose welfare losses on experimental units assigned to the inferior treatment. We show that the standard absolute minimax regret criterion can lead to implausibly small experiments by over-penalizing exploration in its worst-case objective. To address this limitation, we propose the Worst-case Marginal Benefit (WMB) rule, which compares the worst-case marginal benefit of adding one more matched pair to the experiment with the corresponding marginal exploration cost. We establish a simple rule-of-thirds benchmark. For Bernoulli outcomes, after excluding pathological cases, the WMB criterion yields the optimal sample size of $m \approx N/3$ through a Gaussian approximation. For Gaussian outcomes with a known common variance, the same benchmark arises exactly. These results provide a prior-free and practically implementable guide for welfare-based sample-size design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a Worst-case Marginal Benefit rule that produces a simple m ≈ N/3 benchmark for test-and-roll sample sizes, avoiding the tiny experiments that come from standard minimax regret.


The punchline is that this paper replaces absolute minimax regret with a marginal-benefit comparison and lands on a clean one-third rule for how much of a finite population to test before rolling out the winner. For Gaussian outcomes with known variance the result is exact; for Bernoulli it comes from a Gaussian approximation after dropping pathological cases. That is the actual new piece: a prior-free criterion that directly weighs the welfare gain from better information against the cost of bad assignments in the experiment itself.

The derivations look careful where they are exact, and the rule-of-thirds benchmark is easy to communicate to practitioners who face this exact setup in marketing or policy work. The paper does a good job framing why the usual regret criterion over-penalizes exploration and why a marginal version is more sensible here. It also keeps the focus on implementable guidance rather than heavy theory.

The soft spot is the Bernoulli case. The approximation error in the worst-case marginal benefit is not bounded analytically, and the exclusion of pathological cases leaves open how much the N/3 result shifts when p is near zero or one. If those regions matter in applications, the benchmark could be less reliable than stated. The Gaussian-outcome result stands on firmer ground.

This is for applied economists, marketers, and evaluators who run finite-population experiments and then assign the rest based on the test. Readers who want a simple, welfare-aware sample-size rule without priors will find it useful. The work is coherent enough on its own terms to deserve referee time; the derivations and any supporting checks on the approximation should be examined, but the core idea is practical and worth a proper review.

Referee Report

1 major / 1 minor

Summary. The paper studies welfare-aware sample-size design for finite-population test-and-roll experiments. It argues that absolute minimax regret produces implausibly small experiments, proposes the Worst-case Marginal Benefit (WMB) rule that compares the worst-case marginal welfare benefit of an additional matched pair against its marginal exploration cost, and derives a simple benchmark: m ≈ N/3 for Bernoulli outcomes (via Gaussian approximation after excluding pathological cases) and exactly for Gaussian outcomes with known common variance.

Significance. If the WMB derivation holds, the paper supplies a prior-free, analytically tractable rule that directly addresses the exploration-exploitation tradeoff in test-and-roll settings and yields an easily communicated benchmark. The exact N/3 result for the Gaussian case and the introduction of the WMB criterion are clear strengths; the work could influence practical experimental design in economics and marketing.

major comments (1)

[Abstract / Bernoulli WMB derivation] The headline claim that WMB yields m ≈ N/3 rests on a Gaussian approximation to the finite-population sampling distribution inside the worst-case marginal-benefit objective, and no bound on the approximation error is given. Bernoulli outcomes are discrete and bounded; without an analytic error bound on the marginal welfare comparison (especially near the excluded pathological boundaries), it is unclear whether the approximation error can overturn the N/3 benchmark in worst-case regimes. The exact Gaussian-outcome case does not face this issue.

minor comments (1)

[Abstract] The abstract should explicitly define or characterize the 'pathological cases' that are excluded for Bernoulli outcomes so readers can assess the practical scope of the N/3 rule.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the careful review and for identifying the reliance on the Gaussian approximation in the Bernoulli derivation. We respond to the major comment below.


Referee: [Abstract / Bernoulli WMB derivation] The headline claim that WMB yields m ≈ N/3 rests on a Gaussian approximation to the finite-population sampling distribution inside the worst-case marginal-benefit objective, and no bound on the approximation error is given. Bernoulli outcomes are discrete and bounded; without an analytic error bound on the marginal welfare comparison (especially near the excluded pathological boundaries), it is unclear whether the approximation error can overturn the N/3 benchmark in worst-case regimes. The exact Gaussian-outcome case does not face this issue.

Authors: We agree that the Bernoulli WMB derivation employs a Gaussian approximation to the finite-population sampling distribution of the welfare metric without an explicit analytic error bound, in contrast to the exact result for Gaussian outcomes. The approximation is invoked only after excluding pathological cases (where the worst-case marginal benefit is zero or negative, rendering experimentation irrelevant). In the interior of the parameter space the finite-population central limit theorem supplies the justification, and the resulting m ≈ N/3 serves as a simple, prior-free benchmark. We nevertheless accept that, absent a quantitative bound on the approximation error near the excluded boundaries, it remains conceivable that the error could shift the location of the optimum in certain worst-case regimes. In the revision we will (i) state the approximation assumption more explicitly in the abstract and main text, (ii) add a brief discussion of the finite-population CLT and its limitations, and (iii) include Monte Carlo evidence confirming that the optimal sample size remains close to N/3 for a wide range of N and non-pathological parameters. This is a partial revision; we will strengthen the supporting analysis but do not supply a new closed-form error bound.

Revision: partial

Standing simulated objections (not resolved)

Deriving a rigorous analytic error bound on the Gaussian approximation error for the worst-case marginal-benefit objective under Bernoulli outcomes.

Circularity Check

0 steps flagged

WMB benchmark derivation is self-contained; no reduction to inputs by construction


The paper defines the WMB criterion explicitly as a comparison of the worst-case marginal benefit of an additional matched pair against the marginal exploration cost. It then applies this rule to the finite-population test-and-roll objective. For Gaussian outcomes the m = N/3 benchmark follows exactly from the resulting optimization; for Bernoulli outcomes it follows from the stated Gaussian approximation after excluding pathological cases. Neither step renames a fitted quantity as a prediction, invokes a self-citation as the sole justification for a uniqueness claim, or defines the target result in terms of itself. The approximation is presented as an explicit modeling choice whose accuracy is left as an assumption rather than asserted by construction. Consequently the central claim does not collapse to a tautology or to data-driven fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; the paper claims to be prior-free so no fitted parameters are apparent from the given text. The Gaussian approximation and exclusion of pathological cases are the main modeling choices.

axioms (1)

Domain assumption: the Gaussian approximation is valid for Bernoulli outcomes after excluding pathological cases.
Invoked to obtain the closed-form m ≈ N/3 result

invented entities (1)

Worst-case Marginal Benefit (WMB) rule (no independent evidence)
purpose: Criterion for choosing sample size by comparing worst-case marginal benefit of one more matched pair against marginal exploration cost
New decision rule proposed to replace absolute minimax regret


