
arxiv: 2605.12840 · v1 · pith:5PXBZDKTnew · submitted 2026-05-13 · 📊 stat.AP · cs.LG

Decision Support for Marketplace Policies under Incomplete Evidence: From Replay to Launch Readiness

Pith reviewed 2026-06-30 21:59 UTC · model grok-4.3

classification 📊 stat.AP cs.LG
keywords decision support system · off-policy evaluation · marketplace policies · real-time bidding · partial identification · policy launch readiness · conservative bounds · interference effects

The pith

A decision-support system for RTB marketplace policies selects a margin-gated floor policy for online validation rather than direct launch when evidence on propensities and interference remains incomplete.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Marketplace platforms evaluate pricing policies on logged data but must determine whether offline gains justify immediate deployment or only further checks, given feedback loops such as bidder responses and auction interference. The paper builds a support-aware decision-support system that folds replay, conservative lower-bound ranking, multi-sided guardrails, sensitivity checks, and interference-aware validation into one pipeline that returns a launch-readiness label instead of a lone metric. On iPinYou-style RTB logs the system flags the margin-gated floor policy as strongest on replay and conservative bounds yet still routes it to online validation because propensities, bidder behavior, and interference effects stay unresolved. An ablation demonstrates that pipelines omitting these components reach the same policy but wrongly endorse deployment.

Core claim

The framework integrates support-aware off-policy evaluation, conservative lower-bound ranking, multi-sided guardrails, and interference-aware validation design to produce a launch-readiness classification. Applied to iPinYou-style RTB logs, it selects the margin-gated floor policy as the leading candidate, with a 47.7% replay yield lift and a 45.8% conservative lower-tail lift. It still does not recommend direct launch: online validation is required to address missing evidence on propensities, bidder response, and interference.

What carries the argument

The support-aware decision-support system (DSS) that combines replay, support-aware OPE, conservative lower-bound ranking, multi-sided guardrails, out-of-time validation, sensitivity analysis, and interference-aware validation design into a pipeline outputting a launch-readiness classification.
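The step from offline evidence to a launch-readiness label can be sketched as a small gate function. This is a minimal illustration, not the paper's implementation: the field names, the three labels, and the rule that all offline checks must pass before the launch gates are consulted are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # Hypothetical evidence flags; names are illustrative, not the paper's schema.
    replay_lift_positive: bool
    lower_tail_lift_positive: bool
    guardrails_passed: bool
    out_of_time_stable: bool
    real_propensities_logged: bool
    live_response_validated: bool


def launch_readiness(e: Evidence) -> str:
    """Map evidence flags to a label (label names are assumptions)."""
    offline_ok = all([
        e.replay_lift_positive,
        e.lower_tail_lift_positive,
        e.guardrails_passed,
        e.out_of_time_stable,
    ])
    if not offline_ok:
        return "reject"
    # Favorable offline evidence alone is "promising", not "actionable".
    if e.real_propensities_logged and e.live_response_validated:
        return "launch"
    return "validate_online"


# Offline checks pass, but propensities and live response are unresolved.
priority = Evidence(True, True, True, True, False, False)
label = launch_readiness(priority)
print(label)
```

A simplified rule that drops the last two flags would return a launch label for the same inputs, which is the kind of overclaim the ablation describes.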

If this is right

  • Simplified pipelines reach the same policy yet incorrectly recommend deployment, leaving key causal assumptions unresolved.
  • The DSS changes the action from deployment to online validation when evidence on propensities, bidder response, and interference is missing.
  • The protocol converts offline evaluation into an auditable recommendation that prevents overclaim under partial identification.
  • The selected policy shows stable out-of-time performance but still requires further validation steps before launch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure of conservative bounds plus interference-aware validation could be tested on non-RTB marketplaces that face multi-sided feedback.
  • Running the framework on synthetic auction logs with known interference magnitudes would show how often it correctly withholds launch.
  • The ablation implies that conservative ranking and guardrails are the components that shift the output from deploy to validate.

Load-bearing premise

The integrated pipeline of support-aware OPE, conservative lower-bound ranking, multi-sided guardrails, and interference-aware validation design can reliably separate promising evidence from actionable evidence under partial identification.

What would settle it

Deploy the margin-gated floor policy in a controlled online experiment and measure whether realized revenue, fill rate, and advertiser value stay within the conservative lower bounds or deviate because of unmodeled interference.

Figures

Figures reproduced from arXiv: 2605.12840 by Caroline Howard, Prashant Shekhar.

Figure 1: Marketplace interference and decision-support workflow for reserve/floor-policy evaluation. (a) Decision gap under interference: Nodes represent auction opportunities linked through shared advertiser state (e.g., budgets and pacing). Solid arrows denote the offline evaluation path used for policy selection, where effects are estimated from logged data via replay or OPE under a fixed marketplace state. Dash…
Figure 2: Sample bid, payment, and floor distributions from the data audit. The heavy-tailed price…
Figure 3: Outcome density by day in the full season-two panel. The main empirical results are based on the processed full-panel artifact rather than only on audit samples. bid-floor margin, while others sit close to the logged floor or realized payment. This makes the policy problem nontrivial. Higher floors can plausibly increase yield on high-margin opportunities, but the same intervention can destroy delivery or …
Figure 4: Nuisance-model diagnostics for the model-assisted decision layer. Panel (a) reports held-out probability calibration for the LightGBM classification models, estimated with Python’s lightgbm implementation through scikit-learn pipelines: the fill model estimates Pr(filled = 1 | x) and the click-through-rate model estimates Pr(clicked = 1 | filled = 1, x). Panels (b) and (c) report binned held-out regressio…
Figure 5: Reserve/floor policy replay frontier on the Season 2 panel. Each point is one candidate policy from the reserve-policy catalog. The horizontal axis reports the share of bid opportunities whose floor would change under the policy, and the vertical axis reports replayed yield lift relative to the logged-floor baseline. Marker size reflects the number of replay guardrails passed, and the priority policy, Q75 …
Figure 6: Daily stability of shortlisted reserve/floor policies on the Season 2 panel. Each line reports daily replayed yield lift relative to the logged-floor baseline for one shortlisted policy. The horizontal zero line marks no lift over the baseline. The purpose of the figure is to check whether aggregate replay gains are repeatable across days rather than driven by a single unusual traffic or auction condition.…
Figure 7: Replay guardrail matrix for reserve-policy candidates. Rows are candidate floor policies and columns are pre-specified replay guardrails. A policy passes the full screen only if it has at least 0.5% replay yield lift, at least 98% aggregate retained impressions, at least 98% minimum daily retained impressions, at least 97% retained clicks, at least 90% retained conversions, at least 97% retained value prox…
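The guardrail thresholds listed in the Figure 7 caption can be read as a simple all-must-pass screen. The sketch below encodes only the six thresholds visible in the caption; the caption is truncated, so further guardrails may exist. The metric names and the two candidate rows are hypothetical and are not values from the paper.

```python
# Thresholds quoted in the Figure 7 caption. The caption is cut off, so this
# dictionary may be incomplete. Key names are illustrative assumptions.
GUARDRAILS = {
    "replay_yield_lift": 0.005,
    "retained_impressions": 0.98,
    "min_daily_retained_impressions": 0.98,
    "retained_clicks": 0.97,
    "retained_conversions": 0.90,
    "retained_value_proxy": 0.97,
}


def guardrail_screen(metrics):
    """Return (passes_all, per-guardrail pass flags) for one candidate policy."""
    results = {name: metrics[name] >= floor for name, floor in GUARDRAILS.items()}
    return all(results.values()), results


# Hypothetical candidates, invented for illustration only.
candidate_ok = {
    "replay_yield_lift": 0.12,
    "retained_impressions": 0.991,
    "min_daily_retained_impressions": 0.985,
    "retained_clicks": 0.98,
    "retained_conversions": 0.95,
    "retained_value_proxy": 0.975,
}
candidate_bad = dict(candidate_ok, retained_conversions=0.85)

ok_pass, ok_detail = guardrail_screen(candidate_ok)
bad_pass, bad_detail = guardrail_screen(candidate_bad)
print(ok_pass, bad_pass, sum(bad_detail.values()))
```

Returning the per-guardrail flags alongside the overall verdict mirrors the matrix view in the figure, where a count of guardrails passed is also reported.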
Figure 8: Estimator comparison for shortlisted reserve/floor policies. The figure compares replay mean lift with conservative lower-tail cross-fitted doubly robust (DR) lift under the simulated known-propensity logger. Agreement between replay and lower-tail DR diagnostics supports validation priority, while disagreement would indicate estimator or support fragility.
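For readers unfamiliar with the doubly robust (DR) estimator named in the Figure 8 caption, the following is a minimal textbook-style sketch for a deterministic target policy with known logging propensities. It omits cross-fitting, and the toy log, action names, and outcome models are invented for illustration; it is not the paper's estimator code.

```python
def dr_value(logs, target_policy, q_hat):
    """Doubly robust estimate of a deterministic target policy's mean reward.

    logs: list of (x, a, r, p) with context x, logged action a,
          observed reward r, and logging propensity p = P(a | x).
    target_policy: function x -> action.
    q_hat: function (x, a) -> predicted reward (outcome model).
    """
    terms = []
    for x, a, r, p in logs:
        a_pi = target_policy(x)
        direct = q_hat(x, a_pi)                   # model-based prediction
        weight = 1.0 / p if a == a_pi else 0.0    # importance weight
        terms.append(direct + weight * (r - q_hat(x, a)))  # residual correction
    return sum(terms) / len(terms)


# Toy log: two contexts, two floor actions, uniform logging (p = 0.5).
toy_logs = [
    (0, "low", 1.0, 0.5), (0, "high", 2.0, 0.5),
    (1, "low", 1.0, 0.5), (1, "high", 3.0, 0.5),
]
always_high = lambda x: "high"

# With a zero outcome model, DR reduces to inverse propensity scoring.
v_zero_model = dr_value(toy_logs, always_high, lambda x, a: 0.0)

# With a correct outcome model, the residual correction vanishes.
true_q = lambda x, a: 1.0 if a == "low" else (2.0 if x == 0 else 3.0)
v_true_model = dr_value(toy_logs, always_high, true_q)
print(v_zero_model, v_true_model)
```

Both calls recover the same value on this toy log, which illustrates the "doubly robust" property: the estimate is correct if either the propensities or the outcome model are right.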
Figure 9: Importance-weight support diagnostics under the simulated known-propensity logger. The left panel reports effective sample size (ESS) as a share of the evaluation sample. The right panel reports the 99th percentile realized importance weight, with the vertical reference line marking weight 10. Weight support is treated as a decision gate: weak support reduces the credibility of OPE evidence and prevents di…
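The two support diagnostics in the Figure 9 caption can be computed directly from realized importance weights. The sketch below uses the standard ESS formula (sum of weights squared over sum of squared weights) and a nearest-rank percentile. The weight-10 ceiling follows the caption's reference line; treating it as a hard gate and the 10% minimum ESS share are assumptions made for the example.

```python
import math


def ess_share(weights):
    """Effective sample size as a share of the sample: (sum w)^2 / sum w^2 / n."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s1 * s1 / s2) / len(weights)


def p99_weight(weights):
    """99th percentile of the weights using the nearest-rank rule."""
    ordered = sorted(weights)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def support_gate(weights, min_ess_share=0.10, max_p99=10.0):
    # min_ess_share is a hypothetical threshold; max_p99 mirrors the
    # weight-10 reference line mentioned in the caption.
    return ess_share(weights) >= min_ess_share and p99_weight(weights) <= max_p99


# Illustrative weight vectors, not data from the paper.
well_supported = [1.0] * 100
poorly_supported = [1.0] * 98 + [50.0, 50.0]

print(ess_share(well_supported), support_gate(well_supported))
print(round(ess_share(poorly_supported), 4), support_gate(poorly_supported))
```

Two extreme weights out of one hundred are enough to collapse the ESS share below 8% here, which is why weak support is treated as a reason to discount OPE evidence.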
Figure 10: Conservative lower-bound ranking from cross-fitted DR diagnostics. Points show lower-tail and median bootstrap lift estimates for candidate policies. The ranking uses lower-tail evidence rather than mean lift alone, aligning the statistical diagnostic with the managerial question of which policy deserves scarce online validation capacity. validation. At the same time, the figure reinforces the paper’s cen…
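Ranking by a lower-tail bootstrap quantile rather than a central estimate, as the Figure 10 caption describes, can be sketched as follows. The 10th percentile is used because later captions refer to a p10 lift. The paired resampling scheme, the number of bootstrap draws, and the two toy policies are assumptions for illustration, not the paper's procedure or data.

```python
import random
from statistics import median


def bootstrap_lift_quantiles(policy_terms, baseline_terms, n_boot=500, seed=0):
    """Return (p10, median) of bootstrap lift = policy total / baseline total - 1."""
    rng = random.Random(seed)
    n = len(policy_terms)
    lifts = []
    for _ in range(n_boot):
        # Paired resample: the same opportunities are drawn for both arms.
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(baseline_terms[i] for i in idx)
        pol = sum(policy_terms[i] for i in idx)
        lifts.append(pol / base - 1.0)
    lifts.sort()
    return lifts[int(0.10 * (n_boot - 1))], median(lifts)


# Toy per-opportunity yields (hypothetical): "steady" lifts every opportunity
# by 45%; "volatile" has a higher average lift but a much wider spread.
baseline = [1.0] * 50 + [2.0] * 50
steady = [1.45 * b for b in baseline]
volatile = [3.0 * b if i % 2 == 0 else 0.0 for i, b in enumerate(baseline)]

scores = {
    name: bootstrap_lift_quantiles(terms, baseline)
    for name, terms in [("steady", steady), ("volatile", volatile)]
}
ranking = sorted(scores, key=lambda name: scores[name][0], reverse=True)
print(ranking)
```

In this toy setup the volatile policy has the higher sample lift, yet the lower-tail ranking places the steady policy first, which is the behavior a conservative ranking is meant to produce.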
Figure 11: Segment-level heterogeneity for the priority reserve/floor policy. Bars report cross-fitted doubly robust (DR) lift estimates for the largest observed segment cells, including exchange, region, support-cluster, and time-bucket slices. Positive bars indicate segments where the priority policy improves estimated yield relative to the logged-floor baseline; negative bars would indicate segment-level downside…
Figure 12: Marketplace-response sensitivity. The figure applies an adverse response-loss share ρ to each shortlisted policy’s replay yield per opportunity and recomputes lift relative to the logged-floor baseline using Eq. (15). The underlying artifact evaluates six shortlisted policies over 31 response-loss values from 0% to 60%; the plotted view shows the priority policy and two interpretable comparison policies f…
Figure 13: Effective-support sensitivity. The curve starts from each policy’s cross-fitted DR median and p10 lift, then widens the median-to-p10 gap by 1/√s as the effective support scale s declines, as shown in Eq. (17). The underlying artifact evaluates multiple shortlisted policies at six support scales. Q75 Margin-Gated Floor policy remains positive even under severe support contraction, but the declining lowe…
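One way to read the support stress in the Figure 13 caption is that the median stays fixed while the gap between median and p10 is divided by √s. Eq. (17) is not reproduced on this page, so the formula below is an interpretation of the caption rather than the paper's equation. The 45.8% p10 lift is the reported figure; the median lift and the support scales are hypothetical placeholders.

```python
import math


def stressed_p10(median_lift, p10_lift, s):
    """Lower-tail lift after widening the median-to-p10 gap by 1/sqrt(s).

    s is the effective support scale in (0, 1]; s = 1 leaves p10 unchanged.
    """
    if not 0.0 < s <= 1.0:
        raise ValueError("support scale s must be in (0, 1]")
    gap = median_lift - p10_lift
    return median_lift - gap / math.sqrt(s)


median_lift = 0.47   # hypothetical median, not reported on this page
p10_lift = 0.458     # conservative p10 DR lift quoted for the priority policy

# Hypothetical support scales; the caption says six scales were evaluated
# but does not list them.
for s in (1.0, 0.5, 0.25, 0.1):
    print(s, round(stressed_p10(median_lift, p10_lift, s), 4))
```

Under this reading, a policy with a narrow median-to-p10 gap keeps a positive lower tail even when support shrinks substantially, consistent with the caption's remark that the priority policy remains positive.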
Figure 14: Robustness summary for the priority policy. The summary combines replay lift, conservative p10 DR lift, break-even adverse response loss, and the number of robustness checks passed. Q75 Margin-Gated Floor policy has a 47.7% replay lift, a 45.8% conservative p10 DR lift, a 32.3% break-even adverse response-loss share, and passes 5 out of 5 offline robustness checks. The decision remains validation-first be…
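The break-even figure in the Figure 14 caption can be checked with a short calculation. Eq. (15) is not reproduced on this page, so the sketch assumes the simplest form consistent with the Figure 12 description: the policy's replay yield is scaled by (1 − ρ) and lift is recomputed against an unchanged baseline. The evenly spaced 31-point grid is also an assumption about how the 0% to 60% range was sampled.

```python
def lift_under_response_loss(replay_lift, rho):
    """Lift after an adverse response-loss share rho (assumed form of Eq. 15)."""
    return (1.0 - rho) * (1.0 + replay_lift) - 1.0


def break_even_loss(replay_lift):
    """Response-loss share at which the stressed lift equals zero."""
    return replay_lift / (1.0 + replay_lift)


replay_lift = 0.477  # replay yield lift reported for the priority policy
rho_star = break_even_loss(replay_lift)
print(round(rho_star, 3))

# 31 evenly spaced response-loss values from 0% to 60% (assumed spacing).
grid = [i * 0.6 / 30 for i in range(31)]
first_negative = next(r for r in grid if lift_under_response_loss(replay_lift, r) < 0)
print(round(first_negative, 2))
```

Under this assumed form, a 47.7% replay lift implies a break-even loss share of about 32.3%, which agrees with the value quoted in the caption.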
Figure 15: Season 2 and Season 3 panel comparison. The validation window is smaller than the Season 2 development panel, with 10.57 million bid opportunities over 9 (Season 3) days compared with 53.29 million opportunities over 7 (Season 2) days. Season 3 has a higher fill rate, 29.6% versus 22.9%, and different daily scale.
Figure 16: Out-of-time transfer of replay lift from Season 2 to Season 3. Each point is a reserve/floor policy. The horizontal axis shows Season 2 replay lift and the vertical axis shows Season 3 replay lift, both relative to the logged-floor baseline within the same season. The dashed diagonal marks equal lift across seasons. the leading candidates remain stable. In particular, the priority policy has a transfer ga…
Figure 17: Season 3 validation for the priority policy. Panel (a) reports daily Season 3 replay lift for Q75 Margin-Gated Floor policy relative to the logged-floor baseline. The lift is positive on all 9 holdout days, ranging from 40.0% to 61.2%. Panel (b) summarizes holdout guardrail quantities: Season 3 yield lift, impression retention, click retention, conversion retention, and value-proxy retention. every Season…
Figure 18: Validation design and launch-readiness gates. Panel (a) reports data-calibrated MDE curves for candidate online validation designs. The dashed horizontal lines show the priority policy’s Season 2 replay lift and conservative p10 DR lift. Lower curves indicate designs that can detect smaller effects over the same duration. Panel (b) summarizes launch-readiness gates. Replay upside, OPE lower-tail evidence,…
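To make the minimum detectable effect (MDE) idea in the Figure 18 caption concrete, the sketch below uses the generic normal-approximation formula for a two-arm comparison with equal-sized arms. It is not the paper's data-calibrated procedure: it ignores carryover, serial correlation, and interference, and the coefficient of variation, exchange count, and durations are hypothetical.

```python
import math
from statistics import NormalDist


def mde_relative(cv, units_per_arm, alpha=0.05, power=0.80):
    """Relative MDE for a two-sided test with equal arms.

    cv: coefficient of variation (sd / mean) of the unit-level metric.
    units_per_arm: number of independent units in each arm.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * cv * math.sqrt(2.0 / units_per_arm)


def exchange_hour_units_per_arm(n_exchanges, days):
    # Hypothetical switchback layout: each exchange-hour is one unit,
    # split evenly between treatment and control.
    return n_exchanges * 24 * days // 2


cv = 0.5  # hypothetical unit-level variability
mde_7 = mde_relative(cv, exchange_hour_units_per_arm(3, 7))
mde_14 = mde_relative(cv, exchange_hour_units_per_arm(3, 14))
print(round(mde_7, 4), round(mde_14, 4))
```

Comparing such a curve with the reported 47.7% replay lift and 45.8% p10 lift shows whether a design of a given duration could detect an effect of that size; longer designs push the MDE lower.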
Figure 19: Final DSS recommendation and validation sequence. Panel (a) shows the decision waterfall where Q75 Margin-Gated Floor policy is the priority candidate, the offline evidence is favorable, response-stress tests are passed, direct launch is blocked by missing real propensities and unvalidated live marketplace response, and the recommended action is shadow logging followed by an exchange-hour switchback. Pane…
Figure 20: Decision-rule evidence matrix. Each row is a candidate decision rule and each column is an evidence class used by the rule. Simplified rules use only a subset of the available evidence, whereas the full DSS combines replay, guardrails, OPE, support diagnostics, season-three validation, response sensitivity, and interference/propensity gates.
Figure 21: Unresolved direct-launch gates under each decision rule. Simplified rules recommend direct launch despite omitting important evidence classes. The full DSS recommends online validation rather than launch, so it does not overclaim launch readiness. live assignment and marketplace-response evidence are still missing.
Figure 22: Selection concentration across decision rules. Bars show the leading policies under each rule’s scoring criterion. The priority policy remains the leading candidate across simplified and full DSS rules, but the full DSS changes the operational recommendation from direct launch to online validation. final launch gate prevents these offline findings from being misread as live causal evidence. The ablation s…
Original abstract

Marketplace platforms routinely evaluate pricing and allocation policies using logged observational data, yet strong offline performance does not imply that a policy is safe to deploy. In real-time bidding (RTB) marketplaces, reserve-price and floor-policy changes affect not only revenue but also fill, advertiser value, budget pacing, and competition across auctions, creating feedback and interference. The central problem is therefore not to estimate whether a policy improves an offline metric, but to determine whether the available evidence justifies direct launch or only further validation. In this regard, we propose a support-aware decision-support system (DSS) that distinguishes promising from actionable evidence. The framework integrates replay, support-aware off-policy evaluation (OPE), conservative lower-bound ranking, multi-sided guardrails, out-of-time validation, sensitivity analysis, and interference-aware validation design into a claim-preserving pipeline that outputs a launch-readiness classification rather than a single performance estimate. Applying the framework to iPinYou-style RTB logs, we identify a margin-gated floor policy as the leading candidate, with a 47.7% replay yield lift, a 45.8% conservative lower-tail lift, and stable out-of-time performance. However, the framework does not recommend direct launch. A decision-rule ablation shows that simplified pipelines select the same policy but incorrectly recommend deployment, leaving key causal assumptions unresolved. In contrast, the proposed DSS selects the same policy but changes the action to online validation, reflecting missing evidence on propensities, bidder response, and interference. Overall, the contribution is a reproducible DSS protocol that prevents decision overclaim under partial identification and converts offline evaluation into an auditable, action-oriented recommendation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a support-aware decision-support system (DSS) for RTB marketplace policies that integrates replay, support-aware OPE, conservative lower-bound ranking, multi-sided guardrails, out-of-time validation, sensitivity analysis, and interference-aware design. The framework outputs a launch-readiness classification rather than a point estimate. On iPinYou-style logs, a margin-gated floor policy achieves a 47.7% replay yield lift and 45.8% conservative lower-tail lift with stable out-of-time performance, yet the DSS recommends online validation instead of direct launch due to unresolved evidence on propensities, bidder response, and interference. A decision-rule ablation shows that simplified pipelines select the same policy but incorrectly recommend deployment.

Significance. If the integrated pipeline reliably separates promising from actionable evidence under partial identification, the work could supply a reproducible protocol that reduces overconfident deployment decisions in marketplaces. The decision-rule ablation and emphasis on converting offline evaluation into an auditable recommendation are concrete strengths. The contribution is primarily methodological and would gain substantially from external grounding against observed post-deployment outcomes.

major comments (2)
  1. [Empirical evaluation] Empirical evaluation (application to iPinYou-style logs): The central claim that the DSS correctly withholds direct launch rests on the pipeline distinguishing actionable from promising evidence, yet the demonstration uses only historical logs without any ground-truth post-launch metrics (revenue, fill, or interference) for the evaluated policies. This leaves the classification untested against actual deployment outcomes.
  2. [Framework description] Framework integration (support-aware OPE and conservative lower-bound ranking): The abstract reports a 45.8% conservative lower-tail lift and states that the pipeline resolves key causal assumptions under partial identification, but no explicit equations or derivation steps are provided showing how the lower bounds are constructed or how they differ from standard OPE estimators; without these, it is impossible to verify whether the conservative ranking is load-bearing or reduces to a fitted threshold.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'iPinYou-style RTB logs' should include a citation to the original dataset and a brief description of any preprocessing differences.
  2. [Methods] Notation: The multi-sided guardrails and interference-aware validation design would benefit from a compact table summarizing the exact thresholds or sensitivity parameters used in the reported classification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, with honest acknowledgment of limitations inherent to offline evaluation.

Point-by-point responses
  1. Referee: [Empirical evaluation] Empirical evaluation (application to iPinYou-style logs): The central claim that the DSS correctly withholds direct launch rests on the pipeline distinguishing actionable from promising evidence, yet the demonstration uses only historical logs without any ground-truth post-launch metrics (revenue, fill, or interference) for the evaluated policies. This leaves the classification untested against actual deployment outcomes.

    Authors: We agree this is a genuine limitation: no post-deployment outcomes exist in the iPinYou-style logs, so the launch-readiness classification cannot be validated against real metrics. The paper's core contribution is precisely to demonstrate how a DSS should behave under incomplete evidence, namely by flagging unresolved assumptions on propensities, bidder response, and interference and by recommending online validation rather than launch. The decision-rule ablation supports this by showing simplified pipelines reach the opposite (incorrect) recommendation on identical data. We cannot fabricate ground-truth data; we will add explicit discussion of this offline-only constraint and its implications for future work.

    Revision: partial

  2. Referee: [Framework description] Framework integration (support-aware OPE and conservative lower-bound ranking): The abstract reports a 45.8% conservative lower-tail lift and states that the pipeline resolves key causal assumptions under partial identification, but no explicit equations or derivation steps are provided showing how the lower bounds are constructed or how they differ from standard OPE estimators; without these, it is impossible to verify whether the conservative ranking is load-bearing or reduces to a fitted threshold.

    Authors: The full manuscript derives the bounds in Section 3 via support-constrained partial identification, taking the infimum of the OPE functional over propensity models consistent with observed support and a sensitivity parameter for unmeasured confounding; this produces a conservative lower tail distinct from standard IPS or DR estimators. However, we accept that the presentation may lack sufficient step-by-step clarity. We will insert a dedicated subsection with explicit equations, derivation steps, and comparison to standard OPE in the revision.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework assembles standard components into a pipeline without self-referential reduction

Full rationale

The manuscript describes an integrated DSS that combines replay, support-aware OPE, conservative lower-bound ranking, multi-sided guardrails, out-of-time validation, sensitivity analysis, and interference-aware design to produce a launch-readiness classification. No equations, fitted parameters, or self-citations are exhibited that would make the output classification equivalent to its inputs by construction. The decision-rule ablation compares the proposed pipeline against simplified alternatives on historical logs, but the distinction rests on the explicit inclusion of conservative bounds and unresolved causal assumptions rather than any renaming or self-definition of results. The derivation chain therefore remains self-contained and does not reduce to fitted inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework is described at the level of integrated standard methods without detailing any fitted quantities or new postulates.


