pith. machine review for the scientific record.

arxiv: 2605.15108 · v1 · submitted 2026-05-14 · 📊 stat.ML · cs.AI · cs.IR · cs.LG · stat.ME

Recognition: 2 theorem links · Lean Theorem

Logging Policy Design for Off-Policy Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:04 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.IR · cs.LG · stat.ME
keywords off-policy evaluation · logging policy · reward-coverage tradeoff · OPE error · recommendation systems · informational regimes · treatment selection · policy value estimation

The pith

A unifying framework derives optimal logging policies that minimize off-policy evaluation error by balancing reward concentration against action coverage across known, unknown, and partial information regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to design logging policies that collect data allowing low-error estimates of the value of a target policy (such as a recommender system) without deploying it live. It identifies a core reward-coverage tradeoff: putting more logging probability on high-reward actions lowers variance but risks leaving gaps for actions the target policy might choose. The work solves for the best logging policy in three standard cases: when the target policy and rewards are fully known at collection time, when they are unknown, and when only priors or noisy estimates are available. It also supplies practical rules for firms that must choose among candidate systems.
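
To ground the object being optimized, here is a minimal sketch (ours, not the paper's code) of the standard inverse-propensity-weighted (IPW) estimator that OPE error statements of this kind typically concern. The toy reward means, policies, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ipw_estimate(actions, rewards, pi_log, pi_tgt):
    """Standard IPW estimate of the target policy's value.

    actions : logged action indices, shape (n,)
    rewards : observed rewards, shape (n,)
    pi_log  : logging probabilities over actions, shape (K,)
    pi_tgt  : target probabilities over actions, shape (K,)
    """
    weights = pi_tgt[actions] / pi_log[actions]  # importance weights
    return np.mean(weights * rewards)

# Toy example: 3 actions, Bernoulli rewards with means mu.
mu = np.array([0.8, 0.5, 0.1])
pi_tgt = np.array([0.7, 0.2, 0.1])        # policy we want to evaluate
pi_log = np.full(3, 1 / 3)                # uniform logging policy

n = 10_000
actions = rng.choice(3, size=n, p=pi_log)
rewards = rng.binomial(1, mu[actions]).astype(float)

print(ipw_estimate(actions, rewards, pi_log, pi_tgt))  # ≈ pi_tgt @ mu = 0.67
```

The logging probabilities sit in the denominator of the weights, which is exactly why the choice of logging policy drives the estimator's error.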

Core claim

We characterize a fundamental reward-coverage tradeoff, propose a unifying framework for logging policy design, and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time.

What carries the argument

The reward-coverage tradeoff, which determines how logging probability mass should be allocated to minimize the combined variance and bias of standard OPE estimators.
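
To see where a tradeoff of this shape comes from, consider the textbook variance of the IPW estimator. This is a standard importance-sampling calculation, not necessarily the paper's exact objective, which also involves bias terms.

```latex
% Single-draw variance of the IPW estimate of V(\pi_{\mathrm{tgt}}) under
% logging policy \pi_{\mathrm{log}}, with \mu_2(a) = \mathbb{E}[r^2 \mid a]:
\[
\operatorname{Var}\big[\hat V_{\mathrm{IPW}}\big]
  = \sum_{a}\frac{\pi_{\mathrm{tgt}}(a)^{2}\,\mu_{2}(a)}{\pi_{\mathrm{log}}(a)}
    \;-\; V(\pi_{\mathrm{tgt}})^{2}.
\]
% Minimizing over \pi_{\mathrm{log}} subject to \sum_a \pi_{\mathrm{log}}(a)=1
% (Cauchy--Schwarz) gives the concentration side of the tradeoff:
\[
\pi_{\mathrm{log}}^{*}(a) \;\propto\; \pi_{\mathrm{tgt}}(a)\,\sqrt{\mu_{2}(a)}.
\]
% The coverage side appears in the same formula: any action with
% \pi_{\mathrm{tgt}}(a) > 0 and \pi_{\mathrm{log}}(a) \to 0 sends the
% variance to infinity.
```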

If this is right

  • When the target policy and rewards are known, the optimal logging policy concentrates mass on high-reward actions the target is likely to select.
  • When both are unknown, the optimal policy spreads probability to guarantee coverage of every action the target might take.
  • When only priors or noisy estimates exist, the optimal policy interpolates between the known and unknown cases using the available information (a toy version of all three regimes is sketched after this list).
  • Firms evaluating multiple candidate recommenders can use the derived policies to collect data that yields more accurate offline comparisons.
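
A minimal sketch of what those three regime-specific policies could look like. The known-regime formula is the standard variance-minimizing allocation derived above; the `trust`-weighted mixture in the partial regime is our illustrative assumption, not the paper's derived interpolation.

```python
import numpy as np

def optimal_logging_known(pi_tgt, mu2):
    """Known regime: pi_log ∝ pi_tgt * sqrt(E[r^2|a]), the textbook
    variance-minimizing allocation for IPW."""
    q = pi_tgt * np.sqrt(mu2)
    return q / q.sum()

def logging_unknown(n_actions):
    """Unknown regime: uniform logging guarantees coverage of every action."""
    return np.full(n_actions, 1.0 / n_actions)

def logging_partial(pi_tgt_hat, mu2_hat, trust):
    """Partial regime (illustrative): mix the plug-in 'known' solution with
    uniform, weighted by how much the noisy estimates are trusted.
    `trust` in [0, 1] is a hypothetical knob, not the paper's formula."""
    q_known = optimal_logging_known(pi_tgt_hat, mu2_hat)
    q_unif = logging_unknown(len(pi_tgt_hat))
    return trust * q_known + (1.0 - trust) * q_unif

pi_tgt = np.array([0.7, 0.2, 0.1])
mu2 = np.array([0.8, 0.5, 0.1])   # Bernoulli rewards: E[r^2|a] = E[r|a]
print(optimal_logging_known(pi_tgt, mu2))
print(logging_partial(pi_tgt, mu2, trust=0.5))
```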

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tradeoff logic could be used to design adaptive logging policies that update as rewards are observed during data collection.
  • The framework connects directly to problems of experimental design in causal inference when the goal is policy-value estimation rather than simple average treatment effects.
  • Simulation studies in recommendation environments could quantify how much OPE error drops when the derived policies replace standard uniform or epsilon-greedy logging.

Load-bearing premise

The three informational regimes accurately describe the knowledge available when the logging policy is chosen, and the variance and bias formulas assumed for the OPE estimators match how those estimators behave in practice.

What would settle it

A controlled simulation or field test, with the target policy and reward distribution known in advance, that compares the mean squared error of the target-policy value estimate under the theoretically optimal logging policy against a uniform random logging policy; consistently higher error under the derived policy would refute the optimality claim.
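
A rough version of that test, under assumed Bernoulli rewards: compare the Monte Carlo MSE of the IPW estimate under the derived policy against uniform logging. If the theory holds, the derived policy's MSE should be consistently lower; the reverse outcome would be the refutation described above.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.8, 0.5, 0.1])        # Bernoulli reward means (known here)
pi_tgt = np.array([0.7, 0.2, 0.1])
true_value = pi_tgt @ mu

def mse_of_logging(pi_log, n=500, trials=2000):
    """Monte Carlo MSE of the IPW estimate under a given logging policy."""
    errs = []
    for _ in range(trials):
        a = rng.choice(3, size=n, p=pi_log)
        r = rng.binomial(1, mu[a])
        est = np.mean(pi_tgt[a] / pi_log[a] * r)
        errs.append((est - true_value) ** 2)
    return np.mean(errs)

uniform = np.full(3, 1 / 3)
opt = pi_tgt * np.sqrt(mu)            # mu2 = mu for Bernoulli rewards
opt /= opt.sum()

print("uniform MSE:", mse_of_logging(uniform))
print("optimal MSE:", mse_of_logging(opt))  # lower, if the theory holds
```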

Figures

Figures reproduced from arXiv: 2605.15108 by Connor Douglas, Foster Provost, Joel Persson.

Figure 1: Dependence of IPW estimates on logging policy (histogram).
Figure 2: Informational settings for logging policy design; the two dimensions of information.
Figure 3: Illustration of error across logging policy choices.
Figure 4: MSE as a function of the level of noise in the reward estimates µ̂.
Figure 5: Effect of posterior shrinkage and reward prediction noise on MSE and policy value.
Figure 6: MSE of IPW estimator for soft-greedy logging policy classes.
Figure 7: MSE of IPW estimator for soft-greedy logging policy classes.
Original abstract

Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a unifying framework for designing logging policies that minimize off-policy evaluation (OPE) error for a given target policy. It characterizes a fundamental reward-coverage tradeoff and derives optimal logging policies under three informational regimes: (i) target policy and reward distribution known, (ii) both unknown, and (iii) partially known via priors or noisy estimates at logging time. The results are illustrated with practical guidance for firms selecting among candidate recommendation systems.

Significance. If the derivations hold under standard OPE estimators, the work provides theoretically grounded and actionable principles for data collection in OPE, addressing a practical gap in high-stakes settings such as recommender systems. The unification across knowledge regimes and explicit treatment of the reward-coverage tradeoff represent a clear contribution to the OPE literature.

major comments (2)
  1. [§4] §4 (unknown regime): The minimax optimality derivation plugs the standard IPS/DR variance formula directly into the objective and solves over reward distributions. This yields a closed-form logging policy only under the exact variance expression; the paper does not show that the same policy remains optimal when the deployed estimator uses a misspecified reward model or clipped importance weights, which is the typical case in practice (a minimal clipping sketch follows these comments).
  2. [§5] §5 (partial-knowledge regime): The optimality result relies on the prior or noisy estimate entering the objective exactly as modeled. No sensitivity analysis is provided for how errors in the prior propagate to the derived logging policy or to the resulting OPE error bound.
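
To make the clipping concern in comment 1 concrete, here is a minimal sketch (ours, not from the paper) of the clipped-IPW variant commonly deployed. Clipping caps the variance the closed-form derivation optimizes, at the cost of a bias term that derivation does not model.

```python
import numpy as np

def clipped_ipw(actions, rewards, pi_log, pi_tgt, m):
    """IPW with importance weights clipped at m (a common practical variant).

    Clipping bounds the variance but biases the estimate downward whenever
    pi_tgt/pi_log exceeds m, so the exact variance formula the closed-form
    logging policy optimizes no longer describes the deployed estimator."""
    w = np.minimum(pi_tgt[actions] / pi_log[actions], m)
    return np.mean(w * rewards)
```
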
minor comments (2)
  1. [Abstract] The abstract states that derivations exist, but the main text should include at least one explicit equation (e.g., the objective in Eq. (3) or the closed-form policy in the known regime) so that readers can verify the claimed optimality without reconstructing the algebra.
  2. [§2] Notation for the logging policy π_log and target policy π_tgt should be introduced once in §2 and used consistently; several later sections re-define the same symbols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important practical considerations for the derived logging policies. We address each major comment below and outline targeted revisions that clarify assumptions and strengthen applicability without altering the core theoretical contributions.

read point-by-point responses
  1. Referee: [§4] §4 (unknown regime): The minimax optimality derivation plugs the standard IPS/DR variance formula directly into the objective and solves over reward distributions. This yields a closed-form logging policy only under the exact variance expression; the paper does not show that the same policy remains optimal when the deployed estimator uses a misspecified reward model or clipped importance weights, which is the typical case in practice.

    Authors: We agree that the closed-form result in §4 is derived under the exact IPS/DR variance expression without misspecification or clipping. The manuscript positions this as the ideal theoretical benchmark for the reward-coverage tradeoff, consistent with standard OPE analysis. We will revise §4 to explicitly state these assumptions, add a paragraph discussing how the policy may serve as a robust initialization in practice, and include a brief remark that extensions to misspecified or clipped estimators are left for future work. This does not change the main result but improves clarity on scope. revision: yes

  2. Referee: [§5] §5 (partial-knowledge regime): The optimality result relies on the prior or noisy estimate entering the objective exactly as modeled. No sensitivity analysis is provided for how errors in the prior propagate to the derived logging policy or to the resulting OPE error bound.

    Authors: The partial-knowledge results treat the prior or noisy estimate as entering the objective in the modeled form, which enables the closed-form characterization. We acknowledge the value of sensitivity analysis for robustness. We will add a short subsection in §5 with a numerical sensitivity study (perturbing the prior mean/variance and reporting changes in the resulting logging policy and OPE bound) to quantify propagation of errors. This revision directly addresses the concern while remaining within the paper's scope. revision: yes
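
A minimal version of the numerical sensitivity study proposed above might look like the following; the Gaussian perturbation of the reward estimates, the plug-in policy formula, and all numbers are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

pi_tgt = np.array([0.7, 0.2, 0.1])
mu_true = np.array([0.8, 0.5, 0.1])   # Bernoulli reward means

def derived_policy(mu_hat):
    """Plug-in 'known-regime' logging policy from (possibly noisy) estimates."""
    q = pi_tgt * np.sqrt(np.clip(mu_hat, 1e-6, None))
    return q / q.sum()

def ipw_mse(pi_log, n=500, trials=2000):
    """Monte Carlo MSE of the IPW estimate under a given logging policy."""
    value = pi_tgt @ mu_true
    errs = []
    for _ in range(trials):
        a = rng.choice(3, size=n, p=pi_log)
        r = rng.binomial(1, mu_true[a])
        errs.append((np.mean(pi_tgt[a] / pi_log[a] * r) - value) ** 2)
    return np.mean(errs)

# Perturb the reward estimates at increasing noise levels and track how far
# the derived logging policy and its OPE error drift from the noiseless case.
for sigma in [0.0, 0.1, 0.3]:
    mu_hat = mu_true + rng.normal(0.0, sigma, size=3)
    q = derived_policy(mu_hat)
    print(f"sigma={sigma:.1f}  policy={np.round(q, 3)}  MSE={ipw_mse(q):.2e}")
```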

Circularity Check

0 steps flagged

No circularity: derivations optimize standard OPE variance expressions without self-referential reduction

full rationale

The paper's core derivations minimize an OPE error objective constructed from established IPS/DR variance and bias formulas applied to the logging policy probabilities, target policy, and reward distribution. These steps constitute a standard optimization problem over known mathematical expressions rather than redefining the target quantity in terms of itself or fitting parameters that are then relabeled as predictions. No self-citation chains, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing justifications; the informational regimes are treated as modeling assumptions under which the optimization is solved. The resulting policies are therefore independent outputs of the framework, not equivalent to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard domain assumptions about reward distributions and policy knowledge levels; no new free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Reward distributions and target policies exist and can be characterized as known, unknown, or partially known at logging time
    Invoked to define the three canonical regimes in which optimal policies are derived.

pith-pipeline@v0.9.0 · 5493 in / 1170 out tokens · 55165 ms · 2026-05-15T03:04:05.796307+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 2 internal anchors
