pith. machine review for the scientific record.

arXiv:2605.07171 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.SY · eess.SY · stat.ML


Cost-Ordered Feasibility for Multi-Armed Bandits with Cost Subsidy

Carlee Joe-Wong, Ishank Juneja, Osman Yağan


Pith reviewed 2026-05-11 01:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY · stat.ML
keywords multi-armed bandits · cost subsidy · instance-dependent bounds · regret bounds · feasibility testing · quality constraint · suboptimal samples · cost minimization

The pith

Any policy for multi-armed bandits with cost subsidies must pull suboptimal arms an instance-dependent minimum number of times, and a new algorithm matches this floor with upper bounds on cost and quality regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper focuses on multi-armed bandits where the goal is to minimize total cost while keeping reward within a specified subsidy factor of the unknown best arm's reward, and where arm costs are known in advance. It first derives instance-dependent lower bounds on the expected number of suboptimal samples any algorithm must take, showing that these bounds generalize earlier results and reveal how cost gaps and reward gaps interact. It then introduces the Cost-Ordered Feasibility algorithm, which samples arms in cost order and pools observations across arms to test whether a cheaper arm meets the quality constraint. Analysis of the algorithm yields matching upper bounds on cumulative cost and quality regret measured against the cheapest feasible arm.
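The cost-ordered loop described above can be sketched in a few lines. This is a minimal illustration written from the summary, not the paper's pseudocode: `cof_sketch`, the round-robin fallback, and the Hoeffding-style confidence radii are all our assumptions; a cheap arm is committed to once its lower confidence bound clears the subsidized threshold implied by the largest upper confidence bound, which conservatively certifies the quality constraint.

```python
import numpy as np

def cof_sketch(pull, costs, alpha, horizon, delta=1e-3):
    """Illustrative cost-ordered feasibility loop (not the paper's exact
    COF). `pull(k)` returns one reward sample from arm k; costs are known."""
    K = len(costs)
    order = np.argsort(costs)                 # known costs fix the ordering
    n = np.zeros(K, dtype=int)                # pull counts
    s = np.zeros(K)                           # reward sums
    for t in range(horizon):
        rad = np.sqrt(np.log(2 * K * max(t, 1) / delta) / (2 * np.maximum(n, 1)))
        mu_hat = s / np.maximum(n, 1)
        ucb = np.where(n > 0, mu_hat + rad, np.inf)
        lcb = np.where(n > 0, mu_hat - rad, -np.inf)
        # conservative test: lcb_k >= (1 - alpha) * max UCB certifies
        # mu_k >= (1 - alpha) * mu_star with high probability
        k = next((j for j in order if lcb[j] >= (1 - alpha) * ucb.max()),
                 order[int(np.argmin(n[order]))])   # else explore cheapest-first
        r = pull(k)
        n[k] += 1
        s[k] += r
    return s / np.maximum(n, 1), n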

Core claim

The paper proves instance-dependent lower bounds on the expected number of suboptimal samples required by any policy in the MAB-CS setting; these bounds are a strict generalization of prior bounds. It then presents the Cost-Ordered Feasibility (COF) algorithm that orders arms by increasing cost and combines samples from all arms to determine the feasibility of cheaper arms relative to the unknown optimal reward, and establishes instance-dependent upper bounds on COF's expected cumulative cost and quality regret.

What carries the argument

The Cost-Ordered Feasibility (COF) algorithm, which orders arms by known cost and pools samples across arms to test whether each cheap arm satisfies the quality constraint defined relative to the unknown best reward.
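The pooling step is the distinctive move: no single expensive arm's evidence may suffice to rule out a cheap arm, but the contributions can be combined (compare Figure 2). The aggregation rule below is our hedged reading, not necessarily the paper's exact one: each arm exceeding the scaled threshold contributes a Hoeffding tail bound, and the product of the bounds is compared against δ.

```python
import math

def pooled_elimination(mu_hat, n, ucb_ell, alpha, delta):
    """Declare cheap arm ell infeasible when the combined Hoeffding tail
    bounds from arms whose empirical means exceed UCB_ell / (1 - alpha)
    drop below delta. Illustrative aggregation, assumed not exact."""
    thresh = ucb_ell / (1.0 - alpha)
    combined = 1.0
    for m, cnt in zip(mu_hat, n):
        if m > thresh:                 # only arms exceeding the threshold help
            combined *= math.exp(-2.0 * cnt * (m - thresh) ** 2)
    return combined < delta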

If this is right

  • Any policy must incur at least a minimum number of suboptimal pulls that depends on the specific cost gaps and reward gaps of the instance.
  • COF achieves instance-dependent upper bounds on both cumulative cost and quality regret that are qualitatively tighter than earlier results.
  • The algorithm can be run directly on recommendation datasets such as MovieLens and Goodreads and yields lower cost and regret than standard baselines.
  • The lower bounds provide structural insight into how the feasibility test interacts with the cost ordering.
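As a rough feel for the instance dependence in the first bullet, forced pulls of a suboptimal arm typically scale like log(1/δ)/∆² in its gap ∆. The helper below is illustrative only: the constant 8 matches common Hoeffding-style terms, but it is our placeholder, not the paper's bound.

```python
import math

def pull_floor(delta, gap):
    """Prototypical instance-dependent sample floor ~ 8 log(1/delta) / gap^2.
    Treat the constant as illustrative, not the paper's."""
    return math.ceil(8 * math.log(1 / delta) / gap ** 2)
```

Halving the gap roughly quadruples the floor, which is why cost gaps and reward gaps jointly drive the bounds.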

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ordering-plus-pooling idea could be adapted to cases where costs are also uncertain, provided cost estimates are obtained alongside reward samples.
  • The feasibility-gauging step may apply to other constrained bandit problems in which one must verify that a low-cost option meets a threshold defined by an unknown optimum.
  • Empirical gains on real datasets suggest the method scales to large discrete action spaces where exhaustive sampling is expensive.

Load-bearing premise

The cost of every arm is known beforehand and the quality constraint is defined relative to the unknown best reward.

What would settle it

Running COF on a concrete instance whose cost and reward gaps are known exactly and finding that its total cost or quality regret exceeds the derived upper bound by more than a constant factor would falsify the upper-bound claim.
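Scoring such a run requires fixing the regret definitions. The sketch below uses the positive-part definitions common in the MAB-CS literature, which is our reading rather than a verbatim transcription of the paper's: cost regret sums (c_{a_t} − c_{a*})⁺ and quality regret sums ((1 − α)µ* − µ_{a_t})⁺ against the cheapest feasible arm a*.

```python
def regrets(pulls, mu, costs, alpha):
    """Cumulative cost and quality regret of a pull sequence against the
    cheapest feasible arm, under assumed positive-part definitions."""
    thresh = (1 - alpha) * max(mu)            # subsidized quality threshold
    a_star = min((k for k in range(len(mu)) if mu[k] >= thresh),
                 key=lambda k: costs[k])      # cheapest feasible arm
    cost_reg = sum(max(costs[a] - costs[a_star], 0) for a in pulls)
    qual_reg = sum(max(thresh - mu[a], 0) for a in pulls)
    return cost_reg, qual_reg
```

Note that a cheap infeasible pull incurs no cost regret; its penalty shows up entirely in quality regret, which is why both quantities must be bounded.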

Figures

Figures reproduced from arXiv:2605.07171 by Carlee Joe-Wong, Ishank Juneja, Osman Yağan.

Figure 1
Figure 1. (a) Arm a* must be deemed feasible by each expensive arm a_k ∈ A+, introducing dependence on µ_{a*}/(1 − α) − µ_k. (b) Each arm in A† is diminished to make µ† feasible. φ(i) denotes the index of the i-th highest-reward arm; a_{φ(1)} and a_{φ(2)} are in A† while a_{φ(3)} is not. Accompanies the proof sketch for Theorem 2.3.
Figure 2
Figure 2. Left: the empirical mean of neither arm a_p nor arm a_k is sufficiently larger than UCB_ℓ/(1 − α) to eliminate a_ℓ, since µ̂ − UCB_ℓ/(1 − α) < β_ℓ(δ) for both a_p and a_k; however, if the aggregated ϵ_{p,ℓ} and ϵ_{k,ℓ} are less than δ, together they can eliminate a_ℓ. Right: ϵ_{k,ℓ} is derived by a back-calculation.
Figure 3
Figure 3. Panels (a) and (c) show the evolution of summed cost and quality regret over a horizon of 5 million samples for COF and other algorithms; data are averaged over 50 runs and error bars represent the 20–80 percentile band. Left panels use subsidy factor α = 0.3. Panels (b) and (d) contain more comprehensive results for α varying between 0.01 and 0.60, with 50 markers per algorithm in each α column.
Figure 4
Figure 4. Goodreads bandit instance. (Adjacent appendix text defines ablation instance ν₁ with arm reward array µ = [0.15, 0.24, 0.96, 0.95, 0.99, 0.98, 0.97] and costs 1, 2, …, 7; α = 0.8 makes µ_CS = 0.198 and a* = 2.)
Figure 5
Figure 5. MovieLens bandit instance. (Adjacent appendix text defines ablation instance ν₂ with µ = [0.44, 0.46, 0.48, 0.7, 0.71, 0.704, 0.714, 0.702, 0.716, 0.708, 0.712, 0.706] and costs 1, 2, …, 12; α = 0.3 makes µ_CS = 0.501 and a* = 4, and ablates COF's exclusive-sampling and sample-combining features to compare COF against its ablated variants.)
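The two ablation instances make the threshold arithmetic concrete. A quick check of the values stated in the captions (assuming 1-indexed arms whose cost equals their index, as the captions suggest):

```python
mu1 = [0.15, 0.24, 0.96, 0.95, 0.99, 0.98, 0.97]   # instance nu_1, costs 1..7
mu2 = [0.44, 0.46, 0.48, 0.7, 0.71, 0.704, 0.714,
       0.702, 0.716, 0.708, 0.712, 0.706]           # instance nu_2, costs 1..12

def mu_cs(mu, alpha):
    # subsidized feasibility threshold: (1 - alpha) times the best mean
    return (1 - alpha) * max(mu)

def cheapest_feasible_1indexed(mu, alpha):
    # cost equals the 1-based arm index, so the cheapest feasible arm is
    # the first index whose mean meets the threshold
    t = mu_cs(mu, alpha)
    return next(i for i, m in enumerate(mu, start=1) if m >= t)

print(round(mu_cs(mu1, 0.8), 3), cheapest_feasible_1indexed(mu1, 0.8))  # 0.198 2
print(round(mu_cs(mu2, 0.3), 3), cheapest_feasible_1indexed(mu2, 0.3))  # 0.501 4
```

Both checks reproduce the captions' µ_CS and a* values exactly.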
Figure 6
Figure 6. Top: panels (a) and (b) depict windows from the exclusive-sampling ablation with instance ν₁ (subsidy factor α = 0.8, horizon T = 200,000 samples); the event rectangles mark the observed times at which a₁ was eliminated and a₂ was deemed optimal. Bottom: panels (c), (d), and (e) depict windows from the sample-combining ablation with instance ν₂.
Figure 7
Figure 7. Variants of COF with uniform and non-uniform sampling.
Figure 8
Figure 8. Cheap arms a_k ∈ A− are infeasible, as indicated by the gap ∆_{Q,k}. Accompanies the reward-line proof outline for Theorem 2.2.
Original abstract

The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications, often the goal is to minimize cost subject to a constraint on the minimum permissible reward, an objective captured by multi-armed bandits with cost-subsidy (MAB-CS). Of interest to this paper is the setting where the quality (reward) constraint is specified relative to the unknown best reward and the cost of each arm is known. We characterize the expected sub-optimal samples required by any policy by proving instance-dependent lower bounds that offer new insight into the problem and are a strict generalization of prior bounds. Then, we propose an algorithm called Cost-Ordered Feasibility (COF) that leverages our insight and intelligently combines samples from all arms to gauge the feasibility of a cheap arm. Thereafter, we analyze COF to establish instance-dependent upper bounds on its expected cumulative cost and quality regret, i.e., relative to the cheapest feasible arm. Finally, we empirically validate the merits of COF, comparing it to baselines from the literature through extensive simulation experiments on the MovieLens and Goodreads datasets as well as representative synthetic instances. Not only does our paper develop qualitatively better theoretical regret upper bounds, but COF also convincingly demonstrates improved empirical performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper addresses multi-armed bandits with cost subsidy (MAB-CS) where arm costs are known and the quality constraint is defined relative to the unknown best reward. It derives instance-dependent lower bounds on the expected number of sub-optimal samples required by any policy, presented as a strict generalization of prior MAB bounds. The Cost-Ordered Feasibility (COF) algorithm is proposed to combine samples across arms for gauging the feasibility of low-cost arms. Instance-dependent upper bounds are established for COF on expected cumulative cost and quality regret relative to the cheapest feasible arm. The approach is validated through simulations on MovieLens, Goodreads, and synthetic instances, with claims of improved empirical performance over baselines.

Significance. If the lower and upper bounds hold as stated, the work provides meaningful new insight into the sample complexity of constrained MAB problems by extending existing lower-bound techniques to the cost-ordered setting and pairing them with an algorithm whose regret scales instance-dependently. The empirical results on recommendation datasets add practical value. The instance-dependent nature of both the bounds and the algorithm is a clear strength relative to worst-case analyses common in the literature.

minor comments (3)
  1. Abstract: the term 'quality regret' is introduced without a one-sentence definition; a brief parenthetical explanation would improve readability for readers unfamiliar with the exact regret definition used later.
  2. Experiments section: details on data exclusion rules for MovieLens and Goodreads, the precise definition of the quality threshold, and how error bars are computed (e.g., number of runs, confidence intervals) are not provided; these omissions hinder reproducibility.
  3. COF algorithm description: while the high-level idea of combining samples to gauge feasibility is clear, a short pseudocode listing or explicit step-by-step procedure in the main text would clarify the sampling order and stopping condition.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, accurate summary of our contributions, and recommendation for minor revision. The referee correctly identifies the value of our instance-dependent lower bounds as a generalization of prior MAB results and the practical utility of the COF algorithm paired with matching upper bounds. Since the report lists no specific major comments, we have no points requiring rebuttal or revision at this time.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper first states the MAB-CS setting with known costs and relative quality threshold, then derives instance-dependent lower bounds on sub-optimal samples as a strict generalization of prior MAB results. It next defines the COF algorithm that uses these insights to gauge feasibility and finally proves instance-dependent upper bounds on cumulative cost and quality regret directly from the algorithm's behavior. No equation or claim reduces by construction to a fitted parameter, self-citation chain, or renamed input; the lower and upper bounds are obtained via independent analysis steps that remain falsifiable outside the fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard stochastic bandit assumptions plus the cost-subsidy model; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Costs of each arm are known a priori
    Explicitly stated in the abstract as part of the problem setting.
  • standard math Rewards are drawn from unknown but fixed distributions
    Classic MAB modeling assumption required for regret analysis.

pith-pipeline@v0.9.0 · 5544 in / 1240 out tokens · 32091 ms · 2026-05-11T01:16:18.802785+00:00 · methodology

