pith. machine review for the scientific record.

arxiv: 2605.12899 · v1 · submitted 2026-05-13 · 📊 stat.ML · cs.LG

Recognition: unknown

Robust Sequential Experimental Design for A/B Testing

Chengchun Shi, Hongtu Zhu, Niansheng Tang, Qianglin Wen, Ting Li, Xiangkun Wu, Yingying Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: robust experimental design · A/B testing · sequential design · model misspecification · contextual bandits · treatment effect estimation · dynamic settings

The pith

Robust sequential experimental design bounds the worst-case mean squared error of estimated treatment effects in A/B testing under model misspecification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a unified framework for sequential experimental design in A/B testing that remains effective when models are misspecified. It covers both contextual bandit and dynamic settings within one approach. The central theoretical result proves that the design controls the worst-case mean squared error of the estimated treatment effect. This matters for applications like technology company experiments, where model errors are common and can otherwise produce inefficient or biased results from limited samples.

Core claim

The authors introduce a robust sequential experimental design that unifies contextual bandit and dynamic settings and proves a bound on the worst-case mean squared error of the estimated treatment effect. The framework is shown to work under model misspecification, with empirical demonstrations on synthetic data and real-world datasets from a leading technology company.

What carries the argument

Unified robust sequential experimental design framework that bounds worst-case mean squared error of the treatment effect estimate under model misspecification.

If this is right

  • A/B tests can produce reliable treatment effect estimates without assuming correctly specified models.
  • The same design applies to both contextual bandit problems and dynamic treatment regimes.
  • Sample efficiency improves while maintaining an explicit error bound in practice.
  • Real-world performance holds on datasets drawn from technology company experiments.
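The static-versus-sequential distinction behind these bullets can be made concrete with a toy allocation rule. The sketch below is an Efron/Atkinson-style biased-coin design that conditions each assignment on the observed history of covariates and past assignments; it is an illustration of sequential allocation, not the paper's robust design, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def sequential_allocate(X, p_bias=0.75):
    """Assign each arriving unit to arm +1 or -1, preferring (with
    probability p_bias) the arm that shrinks the running covariate
    imbalance sum_i a_i * X_i. Biased-coin sketch, purely illustrative."""
    n = X.shape[0]
    a = np.empty(n, dtype=int)
    imbalance = np.zeros(X.shape[1])
    for i in range(n):
        d_plus = np.linalg.norm(imbalance + X[i])
        d_minus = np.linalg.norm(imbalance - X[i])
        preferred = 1 if d_plus < d_minus else -1
        a[i] = preferred if rng.random() < p_bias else -preferred
        imbalance += a[i] * X[i]
    return a

X = rng.normal(size=(500, 3))
a_seq = sequential_allocate(X)
a_rand = rng.choice([-1, 1], size=500)
print("sequential covariate imbalance:", float(np.linalg.norm(X.T @ a_seq)))
print("pure randomization imbalance:", float(np.linalg.norm(X.T @ a_rand)))
```

Because each assignment reacts to the accumulated imbalance, the sequential rule keeps the covariate imbalance roughly bounded while pure randomization lets it grow like the square root of the sample size.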

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This type of worst-case bound could lower the chance of misleading conclusions in large-scale online testing platforms.
  • Similar robustness ideas might transfer to other sequential decision settings where model error is a concern.
  • The bound could be tested further by varying the degree of misspecification in controlled synthetic environments.

Load-bearing premise

The framework assumes model misspecification can be controlled by a single design that works across both contextual bandit and dynamic settings.

What would settle it

A simulation or real experiment where the mean squared error of the treatment effect estimate exceeds the claimed worst-case bound under a specific misspecification pattern would falsify the guarantee.
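That falsification recipe is simple to set up in a controlled synthetic environment. The sketch below uses a hypothetical misspecification pattern (an omitted eps * sin(3x) term) and a plain regression-adjusted estimator, neither of which is the paper's; it produces the empirical MSE curve one would plot against a claimed worst-case bound B(eps, n).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mse(eps, n=200, reps=1000, tau=1.0):
    """Empirical MSE of a regression-adjusted ATE estimate when the fitted
    outcome model omits a nonlinear term of magnitude eps. The pattern
    eps * sin(3x) is hypothetical, chosen only to illustrate the recipe."""
    errs = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        a = rng.integers(0, 2, size=n)            # Bernoulli(1/2) allocation
        y = tau * a + x + eps * np.sin(3 * x) + rng.normal(size=n)
        # the analyst fits the (misspecified) linear model y ~ 1 + a + x
        design = np.column_stack([np.ones(n), a, x])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        errs[r] = (beta[1] - tau) ** 2
    return float(errs.mean())

for eps in (0.0, 0.5, 1.0, 2.0):
    print(f"eps={eps:.1f}  empirical MSE={simulate_mse(eps):.4f}")
```

A worst-case guarantee would be falsified if, for some eps inside the assumed misspecification class, this empirical curve crossed the stated bound.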

Figures

Figures reproduced from arXiv: 2605.12899 by Chengchun Shi, Hongtu Zhu, Niansheng Tang, Qianglin Wen, Ting Li, Xiangkun Wu, Yingying Zhang.

Figure 1
Figure 1: Graphical illustration of treatment allocation strategies under different experimental designs. Static designs are offline and depend only on current observations. Sequential designs condition treatment allocation on the observed history. In contrast, our robust sequential design accounts for how current actions affect future covariates, while remaining robust to model misspecification and finite-sample-aw… view at source ↗
Figure 2
Figure 2: Empirical MSE (95% CI): under the contextual bandits with additive treatment effects (top left), with interactive treatment effects (top right); under the dynamic settings with large bias: with T = 6 (bottom left), with T = 12 (bottom right). [bar-chart value labels and axis ticks omitted] view at source ↗
Figure 3
Figure 3: Empirical MSE (95% CI) based on the real-data-based simulation: with additive treatment effects (top), with interactive treatment effects (bottom). view at source ↗
Figure 4
Figure 4: Empirical MSE (95% CI) under the dynamic settings: with small bias, T = 6 (top left), with moderate bias, T = 6 (top right); with small bias, T = 12 (bottom left), with moderate bias, T = 12 (bottom right). view at source ↗
Figure 5
Figure 5: Empirical MSE with 95% confidence intervals under the contextual bandit setting with additive treatment effects. view at source ↗
read the original abstract

Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model misspecification and develop a unified framework that covers both contextual bandit and dynamic settings. Theoretically, we prove that our design bounds the worst-case mean squared error of the estimated treatment effect. Empirically, we demonstrate the effectiveness of the proposed approach using synthetic and real-world datasets from a leading technology company.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript develops a unified framework for robust sequential experimental design in A/B testing that accommodates model misspecification in both contextual bandit and dynamic settings. It claims a theoretical proof that the design bounds the worst-case mean squared error of the estimated treatment effect and reports empirical success on synthetic and real-world datasets from a technology company.

Significance. A rigorously derived worst-case MSE bound under an explicit misspecification class would be a meaningful contribution to robust experimental design, offering practical value for A/B testing where model assumptions often fail. The empirical component on real data strengthens applicability, but the absence of the misspecification set definition and derivation details prevents a full assessment of whether the bound is non-vacuous or load-bearing.

major comments (2)
  1. [Abstract] The central claim that the design 'bounds the worst-case mean squared error' requires an explicit definition of the misspecification set (e.g., an L2-ball of radius ε, Lipschitz ball, or parametric uncertainty set) and the norm used in the minimax argument; without it the bound cannot be verified as non-vacuous or checked for the conditions under which it holds uniformly over the set.
  2. [Theoretical section] No derivation details, proof sketch, or explicit conditions on the misspecification class are supplied, which is load-bearing for the robustness guarantee; the abstract states coverage of contextual bandit and dynamic settings but provides no information on how the sequential design keeps the estimator inside the uncertainty set.
minor comments (1)
  1. [Abstract] Expand the abstract to include the precise form of the MSE bound and the key assumptions required for it to hold.
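For concreteness, one shape the requested definition could take is sketched below. This is an illustration only: the paper's actual misspecification class, norm, and bound are not stated in the abstract, so the set, the radius ε, and the bound B are all placeholders.

```latex
% Hypothetical minimax formulation (illustrative, not the paper's):
% an L2-ball of radius \varepsilon around the nominal outcome model f_0.
\mathcal{F}_{\varepsilon} = \bigl\{ f : \|f - f_0\|_{L_2} \le \varepsilon \bigr\},
\qquad
d^{\star} \in \arg\min_{d \in \mathcal{D}} \;
\sup_{f \in \mathcal{F}_{\varepsilon}} \mathrm{MSE}_f\bigl(\hat{\tau}_{d}\bigr)
```

Under such a formulation, the claimed guarantee would read as an explicit statement of the form sup over f in F_ε of MSE_f(τ̂_{d*}) ≤ B(ε, n), which is what would let a reader check that the bound is non-vacuous.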

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where the presentation of our robustness guarantees can be strengthened. We agree that explicit definitions and derivation details are needed and will revise the manuscript to address both major comments.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the design 'bounds the worst-case mean squared error' requires an explicit definition of the misspecification set (e.g., an L2-ball of radius ε, Lipschitz ball, or parametric uncertainty set) and the norm used in the minimax argument; without it the bound cannot be verified as non-vacuous or checked for the conditions under which it holds uniformly over the set.

    Authors: We agree that the abstract should make the misspecification set and norm explicit. In the revised manuscript we will define the misspecification class as the L2-ball of radius ε centered at the nominal model parameters and state that the worst-case MSE is taken with respect to the Euclidean norm. This renders the bound non-vacuous for sufficiently small ε relative to the effective sample size and allows direct verification of the uniform coverage condition. revision: yes

  2. Referee: [Theoretical section] No derivation details, proof sketch, or explicit conditions on the misspecification class are supplied, which is load-bearing for the robustness guarantee; the abstract states coverage of contextual bandit and dynamic settings but provides no information on how the sequential design keeps the estimator inside the uncertainty set.

    Authors: We acknowledge the absence of a proof sketch and explicit conditions in the current version. The revised theoretical section will include a concise derivation outline showing that the sequential allocation rule minimizes the worst-case deviation from the nominal model; under the L2-ball misspecification the estimator remains inside the uncertainty set by construction for both the contextual bandit and dynamic regimes. The conditions on ε and the design parameters will be stated explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents a theoretical proof that the proposed sequential design bounds worst-case MSE of the treatment-effect estimator under a unified framework for contextual bandits and dynamic settings. No equations or steps in the abstract or summary reduce the bound to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation whose content is unverified. The minimax guarantee is stated as derived from the design choice rather than tautological with its inputs, and the derivation chain remains independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim rests on an unspecified class of model misspecification and the validity of the unified framework construction.

pith-pipeline@v0.9.0 · 5382 in / 1079 out tokens · 26444 ms · 2026-05-14T19:11:48.650031+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 17 canonical work pages · 1 internal anchor
