The Adversarial Robustness of Sampling
Pith reviewed 2026-05-25 14:52 UTC · model grok-4.3
The pith
To resist adaptive adversaries, sampling algorithms must use sample size based on log of the range cardinality rather than VC-dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a set system (U, R), to ensure that Bernoulli or reservoir sampling produces an ε-approximation with high probability even against an adaptive adversary, it suffices to take a sample of size Ω(log |R| / ε²). This nearly matches the lower bound from the attack that works when |R| is exponential.
What carries the argument
The cardinality term log |R| replacing the VC-dimension d in the sample size formula for robustness against adaptive streaming adversaries.
If this is right
- A sample of size Ω(log |R| / ε²) suffices to make sampling robust to adaptive adversaries.
- The attack shows failure for sublinear sample sizes whenever |R| is exponential in the stream length.
- Both Bernoulli sampling and reservoir sampling require exactly this replacement of d by log |R|.
- The result holds whenever the cardinality |R| can be bounded in advance.
Where Pith is reading between the lines
- For geometric set systems where log |R| is only modestly larger than d, the extra sampling cost may be modest.
- Practical streaming systems would need a way to obtain or upper-bound |R| before choosing the sample size.
- The same cardinality-based adjustment might apply to other streaming primitives that currently rely on VC-dimension.
Load-bearing premise
The positive robustness result assumes a finite set system whose cardinality |R| is known or bounded.
What would settle it
An experiment where an adaptive adversary attacks Bernoulli or reservoir sampling of size o(log |R| / ε²) on a concrete finite set system and checks whether the sample fails to be an ε-approximation with high probability.
Figures
read the original abstract
Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet "representative" subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d / \varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U,R)$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is \emph{negative}: We demonstrate a set system where a constant sample size (corresponding to VC-dimension $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is "theoretical only", requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that to make Bernoulli or reservoir sampling robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |R|$. This nearly matches the bound imposed by the attack.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the robustness of Bernoulli and reservoir sampling against fully adaptive adversaries in a streaming model over a set system (U, R). It shows that the standard sample size Ω(d/ε²) based on VC-dimension d fails to guarantee an ε-approximation under adaptivity when |R| is exponential in the stream length, even for d=1, via a concrete attack. It then proves that replacing d by log|R| in the sample-size bound suffices for robustness of both samplers, yielding nearly tight bounds.
Significance. If the analysis holds, the result cleanly separates the roles of VC-dimension and cardinality for adversarial robustness, with a simple, parameter-free modification to existing sampling primitives. The matching upper and lower bounds, explicit finite-cardinality assumption, and concrete attack/fix constitute a substantive contribution to streaming algorithms and robust sampling.
minor comments (2)
- [Abstract] Abstract: the phrase 'nearly matches the bound imposed by the attack' would benefit from an explicit statement of the ε-dependence and any hidden logarithmic factors to allow immediate comparison with the positive result.
- The positive robustness theorem assumes |R| is known or bounded a priori; a brief remark on how this assumption can be relaxed (e.g., via doubling) would improve applicability.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of the results, and recommendation for minor revision. We appreciate the recognition that the work cleanly separates the roles of VC-dimension and set-system cardinality for adversarial robustness.
Circularity Check
No significant circularity; derivation is self-contained via direct analysis
full rationale
The paper derives the log |R| sample-size bound via standard union-bound analysis over the finite collection R (explicitly stated as the modification needed for robustness). This is a direct probabilistic argument, not a reduction to fitted parameters, self-definition, or self-citation chains. The negative attack result is an explicit construction requiring |R| exponential in stream length, presented as a limitation rather than a general case, and is consistent with the positive bound without circular dependence. All load-bearing steps use external standard techniques (union bounds, VC-dimension results in static case) and are falsifiable independently of the paper's fitted values or prior self-work. No patterns of self-definitional replacement, prediction-from-fit, or ansatz smuggling appear.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard VC-dimension approximation bounds hold in the non-adaptive static setting
Reference graph
Works this paper leans on
-
[1]
Noga Alon and Joel H. Spencer. The Probabilistic Method . Wiley Publishing, 4th edition, 2016
work page 2016
- [2]
-
[3]
Evasion attacks against machine learning at test time
Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD , pages 387--402, 2013
work page 2013
-
[4]
Weighted sampling without replacement from data streams
Vladimir Braverman, Rafail Ostrovsky, and Gregory Vorsanger. Weighted sampling without replacement from data streams. Information Processing Letters , 115(12):923--926, 2015
work page 2015
-
[5]
Wild patterns: Ten years after the rise of adversarial machine learning
Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition , 84:317--331, 2018
work page 2018
-
[6]
PAC -learning in the presence of evasion adversaries
Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC -learning in the presence of evasion adversaries. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems , NIPS, pages 228--239, 2018
work page 2018
-
[7]
Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup
Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM Journal on Computing , 40(5):1402--1431, 2011
work page 2011
-
[8]
Clarkson, David Eppstein, Gary L
Kenneth L. Clarkson, David Eppstein, Gary L. Miller, Carl Sturtivant, and Shang - Hua Teng. Approximating center points with iterative radon points. International Journal of Computational Geometry and Applications , 6(3):357--377, 1996
work page 1996
-
[9]
Graham Cormode and Minos N. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB , pages 13--24, 2005
work page 2005
-
[10]
Timothy Chu, Yu Gao, Richard Peng, Sushant Sachdeva, Saurabh Sawlani, and Junxing Wang. Graph sparsification, spectral sketches, and faster resistance computation, via short cycle decompositions. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS , pages 361--372, 2018
work page 2018
-
[11]
The discrepancy method - randomness and complexity
Bernard Chazelle. The discrepancy method - randomness and complexity . Cambridge University Press, 2001
work page 2001
-
[12]
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations
Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics , 23(4):493--507, 1952
work page 1952
-
[13]
Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk
Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk. The gigascope stream database. IEEE Data Engineering Bulletin , 26(1):27--32, 2003
work page 2003
-
[14]
Concentration inequalities and martingale inequalities: a survey
Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics , 3(1):79--127, 2006
work page 2006
-
[15]
Graham Cormode, S. Muthukrishnan, and Ke Yi. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms , 7(2):21:1--21:20, 2011
work page 2011
-
[16]
Muthukrishnan, Ke Yi, and Qin Zhang
Graham Cormode, S. Muthukrishnan, Ke Yi, and Qin Zhang. Continuous sampling from distributed streams. Journal of the ACM , 59:10:1--10:25, 2012
work page 2012
- [17]
-
[18]
The reusable holdout: Preserving validity in adaptive data analysis
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science , 349(6248):636--638, 2015
work page 2015
-
[19]
Strongly adaptive online learning
Amit Daniely, Alon Gonen, and Shai Shalev - Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML , pages 1405--1411, 2015
work page 2015
-
[20]
Duffield, Carsten Lund, and Mikkel Thorup
Nick G. Duffield, Carsten Lund, and Mikkel Thorup. Estimating flow distributions from sampled flow statistics. IEEE/ACM Transactions on Networking , 13(5):933--946, 2005
work page 2005
-
[21]
Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters , 97(5):181--185, 2006
work page 2006
- [22]
-
[23]
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Doll\' a r, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[24]
Gilbert, Brett Hemenway, Atri Rudra, Martin J
Anna C. Gilbert, Brett Hemenway, Atri Rudra, Martin J. Strauss, and Mary Wootters. Recovering simple signals. In Information Theory and Applications Workshop, ITA , pages 382--391, 2012
work page 2012
-
[25]
Gilbert, Brett Hemenway, Martin J
Anna C. Gilbert, Brett Hemenway, Martin J. Strauss, David P. Woodruff, and Mary Wootters. Reusable low-error compressive sampling schemes through privacy. In IEEE Statistical Signal Processing Workshop, SSP , pages 536--539, 2012
work page 2012
-
[26]
Space-efficient online computation of quantile summaries
Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. SIGMOD Record , 30(2):58--66, 2001
work page 2001
-
[27]
Michael B. Greenwald and Sanjeev Khanna. Quantiles and equi-depth histograms over streams. In Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams , pages 45--86. Springer Berlin Heidelberg, 2016
work page 2016
-
[28]
State-of-the-art on clustering data streams
Mohammed Ghesmoune, Mustapha Lebbah, and Hanene Azzag. State-of-the-art on clustering data streams. Big Data Analytics , 1(1):13, 2016
work page 2016
-
[29]
Ian J. Goodfellow, Patrick D. McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM , 61(7):56--66, 2018
work page 2018
-
[30]
Logarithmic regret algorithms for online convex optimization
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning , 69(2-3):169--192, 2007
work page 2007
-
[31]
Introduction to online convex optimization
Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization , 2(3-4):157--325, 2016
work page 2016
- [32]
-
[33]
Muthukrishnan, and Irina Rozenbaum
Theodore Johnson, S. Muthukrishnan, and Irina Rozenbaum. Sampling algorithms in a stream operator. In Proceedings of the ACM SIGMOD International Conference on Management of Data , pages 1--12, 2005
work page 2005
-
[34]
Online maintenance of very large random samples
Chris Jermaine, Abhijit Pol, and Subramanian Arumugam. Online maintenance of very large random samples. In Proceedings of the ACM SIGMOD International Conference on Management of Data , pages 299--310, 2004
work page 2004
- [35]
-
[36]
Optimal quantile approximation in streams
Zohar Karnin , Kevin Lang , and Edo Liberty . Optimal quantile approximation in streams. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS) , pages 71--78, 2016
work page 2016
-
[37]
Donald E. Knuth. The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms . Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997
work page 1997
-
[38]
Yi Li, Philip M. Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences , 62(3):516--527, 2001
work page 2001
-
[39]
Stochastic bandits robust to adversarial corruptions
Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing , STOC 2018, pages 114--122, 2018
work page 2018
-
[40]
Richard J. Lipton and Jeffrey F. Naughton. Clocked adversaries for hashing. Algorithmica , 9(3):239--252, 1993
work page 1993
-
[41]
Colin McDiarmid. Concentration. In Michel Habib, Colin McDiarmid, Jorge Ramirez-Alfonsin, and Bruce Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics , pages 195--248. Springer Berlin Heidelberg, 1998
work page 1998
-
[42]
Vc classes are adversarially robustly learnable, but only improperly
Omar Montasser, Steve Hanneke, and Nathan Srebro. Vc classes are adversarially robustly learnable, but only improperly. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, COLT , volume 99 of Proceedings of Machine Learning Research , pages 2512--2530, 2019
work page 2019
-
[43]
Saeed Mahloujifar and Mohammad Mahmoody. Can adversarially robust learning leverage computational hardness? In Algorithmic Learning Theory, ALT , pages 581--609, 2019
work page 2019
-
[44]
Sketching in adversarial environments
Ilya Mironov, Moni Naor, and Gil Segev. Sketching in adversarial environments. SIAM Journal on Computing , 40(6):1845--1870, 2011
work page 2011
-
[45]
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. SIGMOD Record , 28(2):251--262, 1999
work page 1999
-
[46]
Nabil H. Mustafa and Kasturi R. Varadarajan. Epsilon-approximations and epsilon-nets. In Csaba D. Toth, Joseph O'Rourke, and Jacob E. Goodman, editors, Handbook of Discrete and Computational Geometry, 3rd Edition , chapter 47, page 27. Chapman and Hall/CRC, New York, 2017
work page 2017
-
[47]
Michael McCoyd and David A. Wagner. Background class defense against adversarial examples. In 2018 IEEE Security and Privacy Workshops, SP Workshops , pages 96--102, 2018
work page 2018
-
[48]
Bloom filters in adversarial environments
Moni Naor and Eylon Yogev. Bloom filters in adversarial environments. In Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptology Conference , pages 565--584, 2015
work page 2015
-
[49]
Tamer \" O zsu and Patrick Valduriez
M. Tamer \" O zsu and Patrick Valduriez. Principles of Distributed Database Systems . Springer Publishing Company, Incorporated, 3rd edition, 2011
work page 2011
-
[50]
Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, AsiaCCS , pages 506--519, 2017
work page 2017
-
[51]
On the density of families of sets
Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A , 13(1):145 -- 147, 1972
work page 1972
-
[52]
DARTS: Deceiving Autonomous Cars with Toxic Signs
Chawin Sitawarin, Arjun Nitin Bhagoji, Arsalan Mosenia, Mung Chiang, and Prateek Mittal. DARTS: deceiving autonomous cars with toxic signs. CoRR , abs/1802.06430, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[53]
Online learning and online convex optimization
Shai Shalev - Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning , 4(2):107--194, 2012
work page 2012
-
[54]
A combinatorial problem; stability and order for models and theories in infinitary languages
Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics , 41(1):247--261, 1972
work page 1972
-
[55]
Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V
Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don't decay the learning rate, increase the batch size, 2017. Published as a conference paper at ICLR 2018
work page 2017
-
[56]
Online learning with local permutations and delayed feedback
Ohad Shamir and Liran Szlak. Online learning with local permutations and delayed feedback. In Proceedings of the 34th International Conference on Machine Learning, ICML , pages 3086--3094, 2017
work page 2017
-
[57]
Smoothness, low noise and fast rates
Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23: Annual Conference on Neural Information Processing Systems, NIPS , pages 2199--2207, 2010
work page 2010
-
[58]
Subhash Suri, Csaba D. T\' o th, and Yunhong Zhou. Range counting over multidimensional data streams. In Proceedings of the Twentieth Annual Symposium on Computational Geometry , SCG '04, pages 160--169, 2004
work page 2004
-
[59]
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR , 2014
work page 2014
-
[60]
Sharper bounds for gaussian and empirical processes
Michel Talagrand. Sharper bounds for gaussian and empirical processes. The Annals of Probability , 22(1):28--76, 1994
work page 1994
-
[61]
L. G. Valiant. A theory of the learnable. Communications of the ACM , 27(11):1134--1142, 1984
work page 1984
-
[62]
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications , 16(2):264--280, 1971
work page 1971
-
[63]
Random sampling with a reservoir
Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software , 11(1):37--57, 1985
work page 1985
-
[64]
Woodworth, Vitaly Feldman, Saharon Rosset, and Nati Srebro
Blake E. Woodworth, Vitaly Feldman, Saharon Rosset, and Nati Srebro. The everlasting database: Statistical validity at a fair price. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS , pages 6532--6541, 2018
work page 2018
-
[65]
Quantiles over data streams: An experimental study
Lu Wang, Ge Luo, Ke Yi, and Graham Cormode. Quantiles over data streams: An experimental study. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data , pages 737--748, 2013
work page 2013
-
[66]
Adaptive online learning in dynamic environments
Lijun Zhang, Shiyin Lu, and Zhi - Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS , pages 1330--1340, 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.