Differentially Private Best-Arm Identification
Pith reviewed 2026-05-23 23:54 UTC · model grok-4.3
The pith
Differentially private Top-Two algorithms achieve the asymptotic lower bound on sample complexity for best-arm identification up to multiplicative constants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive lower bounds on the sample complexity of any δ-correct BAI algorithm that satisfies ε-global DP or ε-local DP; the bounds reveal two privacy regimes whose hardness is governed by a coupled privacy-Total-Variation quantity in the high-privacy case. We introduce CTB-TT, an ε-local-DP Top-Two algorithm that plugs in a randomised-response estimator of the means, and AdaP-TT*, an ε-global-DP Top-Two algorithm whose private mean estimator runs in arm-dependent adaptive episodes and adds Laplace noise. By adapting the transportation costs, the expected sample complexity of AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants; CTB-TT is asymptotically optimal for ε-lD
What carries the argument
Top-Two sampling with private mean estimators (Randomised Response for local DP, adaptive Laplace for global DP) and adapted transportation costs that control allocation between the leader and challenger arms.
If this is right
- In the high-privacy regime sample complexity depends on the coupled effect of privacy and novel total-variation information-theoretic quantities.
- In the low-privacy regime the lower bounds reduce exactly to the classical non-private lower bounds.
- CTB-TT is asymptotically optimal for ε-local DP by using a randomised-response mean estimator.
- AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants for ε-global DP by adapting transportation costs.
Where Pith is reading between the lines
- The same transportation-cost adaptation may extend to differentially private regret minimization or other sequential decision problems.
- The high/low-privacy regime split suggests concrete thresholds on ε beyond which privacy imposes no extra asymptotic cost.
- Empirical checks on non-Gaussian reward distributions would test whether the optimality claims survive when the private estimator variance assumption is only approximately met.
Load-bearing premise
The reward distributions admit a private mean estimator whose variance scales appropriately with the privacy budget ε, and the algorithm can execute arm-dependent adaptive episodes without violating the fixed-confidence guarantee.
What would settle it
Running AdaP-TT* or CTB-TT on standard Gaussian arms and observing that its expected sample complexity exceeds any fixed constant times the paper's lower bound would falsify the claimed near-optimality.
Figures
read the original abstract
Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies. Motivated by the data privacy concerns invoked by these applications, we study the problem of BAI with fixed confidence in both the local and central models, i.e. $\epsilon$-local and $\epsilon$-global Differential Privacy (DP). First, to quantify the cost of privacy, we derive lower bounds on the sample complexity of any $\delta$-correct BAI algorithm satisfying $\epsilon$-global DP or $\epsilon$-local DP. Our lower bounds suggest the existence of two privacy regimes. In the high-privacy regime, the hardness depends on a coupled effect of privacy and novel information-theoretic quantities involving the Total Variation. In the low-privacy regime, the lower bounds reduce to the non-private lower bounds. We propose $\epsilon$-local DP and $\epsilon$-global DP variants of a Top Two algorithm, namely CTB-TT and AdaP-TT*, respectively. For $\epsilon$-local DP, CTB-TT is asymptotically optimal by plugging in a private estimator of the means based on Randomised Response. For $\epsilon$-global DP, our private estimator of the mean runs in arm-dependent adaptive episodes and adds Laplace noise to ensure a good privacy-utility trade-off. By adapting the transportation costs, the expected sample complexity of AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies best-arm identification (BAI) in the fixed-confidence setting under both ε-local and ε-global differential privacy. It derives lower bounds on sample complexity that identify two regimes: a high-privacy regime whose hardness couples privacy with total-variation information quantities, and a low-privacy regime that recovers the classical non-private bounds. The authors introduce CTB-TT (local DP) that plugs in a Randomised-Response mean estimator and is claimed to be asymptotically optimal, and AdaP-TT* (global DP) that uses arm-dependent adaptive episodes with Laplace noise; by adapting transportation costs the expected sample complexity of AdaP-TT* is stated to match the lower bound up to multiplicative constants.
Significance. If the lower-bound derivations and the matching upper bounds hold, the work supplies the first asymptotic optimality results for differentially private BAI. The explicit separation into high- and low-privacy regimes and the adaptation of transportation costs to the private setting are technically useful extensions of the non-private literature. The reliance on standard mechanisms (Randomised Response, Laplace) keeps the algorithmic contribution clean and reproducible.
major comments (2)
- [Abstract / lower-bound derivation] Abstract and lower-bound section: the statement that the high-privacy regime hardness 'depends on a coupled effect of privacy and novel information-theoretic quantities involving the Total Variation' is load-bearing for the two-regime claim; the precise functional form of this coupling (e.g., the exact combination of KL, TV, and ε) must be exhibited and shown to be tight.
- [AdaP-TT* analysis] AdaP-TT* description: the claim that 'by adapting the transportation costs, the expected sample complexity of AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants' requires an explicit theorem that states both the adapted cost and the resulting multiplicative factor; without that theorem the optimality statement cannot be verified.
minor comments (2)
- [Abstract] Notation for the two privacy regimes should be introduced once and used consistently; the abstract switches between 'high-privacy regime' and 'low-privacy regime' without a formal definition.
- [AdaP-TT* algorithm] The description of the arm-dependent adaptive episodes for the global-DP estimator should clarify whether the privacy budget is allocated per episode or globally, and how the fixed-confidence guarantee is preserved.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the two major comments below. Both points identify places where the manuscript would benefit from greater explicitness; we will revise accordingly.
read point-by-point responses
-
Referee: [Abstract / lower-bound derivation] Abstract and lower-bound section: the statement that the high-privacy regime hardness 'depends on a coupled effect of privacy and novel information-theoretic quantities involving the Total Variation' is load-bearing for the two-regime claim; the precise functional form of this coupling (e.g., the exact combination of KL, TV, and ε) must be exhibited and shown to be tight.
Authors: We agree that the precise functional form should be stated explicitly for the two-regime claim to be verifiable from the abstract and early sections. The lower bounds appear in Theorems 3.1 (global DP) and 3.2 (local DP). In the high-privacy regime the sample-complexity lower bound takes the form Ω(max_i 1/(TV(μ_i,μ*)·g(ε)) + non-private term), where g(ε) is a privacy-dependent factor that replaces the usual KL divergence; the low-privacy regime recovers the classical KL-based bound once ε exceeds a threshold depending on the TV distances. We will revise the abstract to include this functional form and add a short paragraph immediately after the theorem statements that isolates the coupling term and notes that the bound is tight because the matching upper bounds (Theorems 4.1 and 5.3) attain the same expression up to constants. revision: yes
-
Referee: [AdaP-TT* analysis] AdaP-TT* description: the claim that 'by adapting the transportation costs, the expected sample complexity of AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants' requires an explicit theorem that states both the adapted cost and the resulting multiplicative factor; without that theorem the optimality statement cannot be verified.
Authors: We accept the request for an explicit theorem. Section 5 defines the privacy-adjusted transportation cost C_ε(μ) obtained by replacing the usual stopping threshold with a Laplace-noise-adjusted version that accounts for the arm-dependent episode lengths. Theorem 5.3 already shows that the expected sample complexity of AdaP-TT* is at most 4·C_ε(μ)·(lower-bound expression), but the constant 4 and the exact definition of C_ε are currently only implicit in the proof. We will extract these two objects into a self-contained theorem statement placed immediately before the complexity analysis, making both the adapted cost and the multiplicative factor (which remains 4) fully explicit. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper first derives independent information-theoretic lower bounds on sample complexity for ε-DP best-arm identification (both local and global models), identifying two privacy regimes via Total Variation quantities that recover non-private bounds at low privacy. It then constructs CTB-TT and AdaP-TT* by inserting standard mechanisms (Randomised Response, Laplace noise) into an existing Top-Two template and adapting transportation costs. Asymptotic optimality is claimed by direct matching to the separately derived lower bounds up to constants; no step reduces a prediction to a fitted quantity defined inside the paper, no self-definition of quantities, and no load-bearing self-citation chain is required for the central claims. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Y. Abbasi-Yadkori, P. Bartlett, V. Gabillon, A. Malek, and M. Valko. Best of both worlds: Stochastic & adversarial best-arm identification. In Conference on Learning Theory, pages 918--949. PMLR, 2018
work page 2018
-
[2]
J. Acharya, Z. Sun, and H. Zhang. Differentially private assouad, fano, and le cam. In Algorithmic Learning Theory, pages 48--78. PMLR, 2021
work page 2021
-
[3]
J.-Y. Audibert, S. Bubeck, and R. Munos. Best Arm Identification in Multi-armed Bandits . In Conference on Learning Theory , 2010
work page 2010
-
[4]
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 0 (2-3): 0 235--256, 2002
work page 2002
-
[5]
M. Aziz, E. Kaufmann, and M.-K. Riviere. On multi-armed bandit designs for dose-finding clinical trials. The Journal of Machine Learning Research, 22 0 (1): 0 686--723, 2021
work page 2021
-
[6]
A. Azize and D. Basu. When privacy meets partial information: A refined analysis of differentially private bandits. Advances in Neural Information Processing Systems, 35: 0 32199--32210, 2022
work page 2022
-
[7]
A. Azize and D. Basu. Concentrated differential privacy for bandits. In 2nd IEEE Conference on Secure and Trustworthy Machine Learning, 2024
work page 2024
- [8]
-
[9]
R. E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. The Annals of Mathematical Statistics, pages 16--39, 1954
work page 1954
-
[10]
R. E. Bechhofer. A sequential multiple-decision procedure for selecting the best one of several normal populations with a common unknown variance, and its use with various experimental designs. Biometrics, 14 0 (3): 0 408--429, 1958
work page 1958
-
[11]
T. T. Cai, Y. Wang, and L. Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics, 49 0 (5): 0 2825--2850, 2021
work page 2021
-
[12]
A. Carpentier and A. Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pages 590--604. PMLR, 2016
work page 2016
-
[13]
S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. Advances in neural information processing systems, 27, 2014
work page 2014
- [14]
-
[15]
R. Degenne, W. M. Koolen, and P. M \'e nard. Non-asymptotic pure exploration by solving games. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[16]
R. Degenne, H. Shao, and W. Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443--2452. PMLR, 2020
work page 2020
-
[17]
J. Dong, A. Roth, and W. J. Su. Gaussian differential privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84 0 (1): 0 3--37, 2022
work page 2022
-
[18]
J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429--438. IEEE, 2013
work page 2013
-
[19]
C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science , 9 0 (3--4): 0 211--407, 2014
work page 2014
- [20]
- [21]
-
[22]
E. Even-Dar, S. Mannor, Y. Mansour, and S. Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7 0 (6), 2006
work page 2006
-
[23]
V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems, 25, 2012
work page 2012
-
[24]
A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998--1027. PMLR, 2016
work page 2016
-
[25]
A. M. Girgis, D. Data, S. Diggavi, A. T. Suresh, and P. Kairouz. On the renyi differential privacy of the shuffle model. In ACM SIGSAC Conference on Computer and Communications Security, pages 2321--2341, 2021
work page 2021
-
[26]
K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Conference on Information Sciences and Systems (CISS), pages 1--6. IEEE, 2014
work page 2014
-
[27]
K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial intelligence and statistics, pages 240--248. PMLR, 2016
work page 2016
-
[28]
K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423--439. PMLR, 2014
work page 2014
-
[29]
T. Jin, J. Shi, X. Xiao, and E. Chen. Efficient pure exploration in adaptive round model. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
- [30]
-
[31]
M. Jourdan and R. Degenne. Non-asymptotic analysis of a ucb-based top two algorithm. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[32]
M. Jourdan, R. Degenne, D. Baudry, R. de Heide, and E. Kaufmann. Top two algorithms revisited. Advances in Neural Information Processing Systems, 35: 0 26791--26803, 2022
work page 2022
-
[33]
M. Jourdan, R. Degenne, and E. Kaufmann. Dealing with unknown variances in best-arm identification. International Conference on Algorithmic Learning Theory, 2023
work page 2023
-
[34]
M. Jourdan, R. Degenne, and E. Kaufmann. An -best-arm identification algorithm for fixed-confidence and beyond. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[35]
P. Kairouz, K. Bonawitz, and D. Ramage. Discrete distribution estimation under local privacy. In International Conference on Machine Learning, pages 2436--2444. PMLR, 2016
work page 2016
-
[36]
D. S. Kalogerias, K. E. Nikolakakis, A. D. Sarwate, and O. Sheffet. Quantile multi-armed bandits: Optimal best-arm identification and a differentially private scheme. IEEE Journal on Selected Areas in Information Theory, 2 0 (2): 0 534--548, 2021
work page 2021
-
[37]
S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, volume 12, pages 655--662, 2012
work page 2012
- [38]
-
[39]
V. Karwa and S. Vadhan. Finite Sample Differentially Private Confidence Intervals . In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94. Schloss Dagstuhl -- Leibniz-Zentrum f \"u r Informatik, 2018
work page 2018
-
[40]
E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In Conference on Learning Theory, pages 228--251. PMLR, 2013
work page 2013
-
[41]
E. Kaufmann and W. M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22 0 (246): 0 1--44, 2021
work page 2021
-
[42]
E. Kaufmann, O. Capp \'e , and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17: 0 1--42, 2016
work page 2016
-
[43]
T. Lattimore and C. Szepesv \'a ri. Bandit algorithms. Cambridge University Press, 2020
work page 2020
-
[44]
L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18 0 (1): 0 6765--6816, 2017
work page 2017
-
[45]
P. J. Libin, T. Verstraeten, D. M. Roijers, J. Grujic, K. Theys, P. Lemey, and A. Now \'e . Bayesian best-arm identification for selecting influenza mitigation strategies. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018, 2019
work page 2018
-
[46]
S. Lindst hl, A. Proutiere, and A. Johnsson. Measurement-based admission control in sliced networks: A best arm identification approach. In GLOBECOM 2022-2022 IEEE Global Communications Conference, pages 1484--1490. IEEE, 2022
work page 2022
-
[47]
D. E. Losada, D. Elsweiler, M. Harvey, and C. Trattner. A day at the races: using best arm identification algorithms to reduce the cost of information retrieval user studies. Applied Intelligence, 52 0 (5): 0 5617--5632, 2022
work page 2022
-
[48]
S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5 0 (Jun): 0 623--648, 2004
work page 2004
-
[49]
I. Mironov. R \'e nyi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263--275. IEEE, 2017
work page 2017
-
[50]
N. Mishra and A. Thakurta. ( N early) optimal differentially private stochastic multi-arm bandits. In Conference on Uncertainty in Artificial Intelligence, 2015
work page 2015
-
[51]
S. Neel and A. Roth. Mitigating bias in adaptive data gathering via differential privacy. In International Conference on Machine Learning, pages 3720--3729. PMLR, 2018
work page 2018
- [52]
-
[53]
C. Qin, D. Klabjan, and D. Russo. Improving the expected improvement algorithm. Advances in Neural Information Processing Systems, 30, 2017
work page 2017
- [54]
-
[55]
A. Rio, M. Barlier, I. Colin, and M. Soare. Multi-agent best arm identification with private communications. In International Conference on Machine Learning, 2023
work page 2023
-
[56]
D. Russo. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417--1418. PMLR, 2016
work page 2016
-
[57]
T. Sajed and O. Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pages 5579--5588. PMLR, 2019
work page 2019
- [58]
-
[59]
R. Shariff and O. Sheffet. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, pages 4296--4306, 2018
work page 2018
- [60]
-
[61]
C. Tao, Q. Zhang, and Y. Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126--146, 2019
work page 2019
-
[62]
A. C. Tossou and C. Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, 2016
work page 2016
- [63]
-
[64]
P.-A. Wang, R.-C. Tzeng, and A. Proutiere. Fast pure exploration via frank-wolfe. Advances in Neural Information Processing Systems, 34: 0 5810--5821, 2021
work page 2021
-
[65]
S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American statistical association, pages 63--69, 1965
work page 1965
-
[66]
W. You, C. Qin, Z. Wang, and S. Yang. Information-directed selection for top-two algorithms. In Conference on Learning Theory, pages 2850--2851. PMLR, 2023 a
work page 2023
-
[67]
W. You, C. Qin, Z. Wang, and S. Yang. Information-directed selection for top-two algorithms. In Conference on Learning Theory, pages 2850--2851. PMLR, 2023 b
work page 2023
-
[68]
Y. Zhou, X. Chen, and J. Li. Optimal pac multiple arm identification with applications to crowdsourcing. In International Conference on Machine Learning, pages 217--225. PMLR, 2014
work page 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.