pith. machine review for the scientific record.

arxiv: 2605.07551 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 3 theorem links · Lean Theorem

Disagreement-Regularized Importance Sampling for Adversarial Label Corruption

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords importance sampling · label corruption · adversarial robustness · concentration bounds · ensemble disagreement · data selection · machine learning

The pith

Disagreement-regularized importance sampling uses loss rank disagreement to separate bulk corrupted examples from boundary-clean ones with explicit finite-sample concentration bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard importance sampling breaks under adversarial label corruption because it favors high-norm outliers. The paper formalizes the problem with an ε-contamination model and introduces disagreement-regularized importance sampling, which subsamples by measuring inconsistency in loss ranks across an independent proxy ensemble. It derives concentration inequalities showing that the empirical rank disagreement of bulk corrupted points rises at most O(√(log(N/δ)/K)) above its mean, while that of clean boundary points falls at most that far below its own. When a positive structural gap separates the two groups and enough clean examples exist, the bounds certify that the selected subset has limited contamination. Experiments show the method resists targeted high-norm attacks that defeat magnitude-based baselines like EL2N.

Core claim

DR-IS selects a subset by ranking examples according to their loss-rank disagreement across proxy models and retaining those with high disagreement. The central theorem establishes that, with probability 1−δ, the empirical disagreement of bulk corrupted examples lies at most O(√(log(N/δ)/K)) above its expectation while that of boundary-clean examples falls at most that far below its own; whenever the structural gap Δ' is positive and the clean set is large enough relative to the target subset size, this guarantees strict separation and an explicit upper bound on the contamination fraction inside the selected subset.
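Read as inequalities, the theorem is a two-sided bound. A minimal sketch of its shape, reconstructed from the abstract; the symbol σ̂²_i for the disagreement statistic follows the paper's figures, and the absolute constant C is our assumption, not the paper's:

\[
\Pr\Big[\,\hat{\sigma}^2_i \le \mathbb{E}[\hat{\sigma}^2_i] + C\sqrt{\tfrac{\log(N/\delta)}{K}}\ \forall i \in D_{\mathrm{corr}}
\ \text{ and }\
\hat{\sigma}^2_j \ge \mathbb{E}[\hat{\sigma}^2_j] - C\sqrt{\tfrac{\log(N/\delta)}{K}}\ \forall j \in D_{\mathrm{clean}}\,\Big] \ge 1-\delta,
\]
\[
\text{so } \Delta' := \mathbb{E}[\hat{\sigma}^2 \mid D_{\mathrm{clean}}] - \mathbb{E}[\hat{\sigma}^2 \mid D_{\mathrm{corr}}] \,>\, 2C\sqrt{\tfrac{\log(N/\delta)}{K}}
\ \Longrightarrow\ \text{strict separation of the two groups.}
\]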

What carries the argument

Loss rank-disagreement measured across an independent proxy ensemble, which quantifies how stably an example ranks high or low in loss relative to the rest of the data.
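A minimal sketch of one plausible instantiation, assuming the paper's σ̂² score is the variance of an example's normalized loss rank across the K proxies; the function and variable names below are ours, not the authors':

import numpy as np

def rank_disagreement(losses: np.ndarray) -> np.ndarray:
    """Per-example loss-rank disagreement across a proxy ensemble.

    losses: (K, N) array of per-example losses from K independently
    trained proxy models. Returns an (N,) score: the variance of each
    example's normalized loss rank across the proxies. This is one
    plausible reading of the sigma^2 statistic in the paper's figures,
    not necessarily the authors' exact definition.
    """
    K, N = losses.shape
    # Within each proxy, convert losses to normalized ranks in [0, 1].
    ranks = losses.argsort(axis=1).argsort(axis=1) / (N - 1)
    # Variance across proxies measures how unstably an example ranks
    # high or low in loss relative to the rest of the data.
    return ranks.var(axis=0)

def dr_is_select(losses: np.ndarray, m: int) -> np.ndarray:
    """Indices of the m examples with the highest disagreement."""
    return np.argsort(rank_disagreement(losses))[-m:]

Because the score depends only on relative orderings, it is bounded regardless of the loss scale; that boundedness is what lets Hoeffding-style concentration apply and what blunts targeted high-norm attacks on magnitude-based scores such as EL2N.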

If this is right

  • The selected subset carries an explicit, finite-sample upper bound on the fraction of corrupted labels it can contain.
  • DR-IS remains stable under targeted attacks that prioritize high-norm adversarial outliers, unlike pure magnitude-based selection.
  • The method supplies concentration certificates that complement heuristic training-dynamics filters such as AUM.
  • The same disagreement signal can be used to control noise leakage from the statistical tail of the corrupted distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The O(1/√K) rate suggests that increasing the proxy ensemble size can tighten separation without additional labeled data; a numeric sketch follows this list.
  • The approach may extend to other selection problems such as active learning or core-set construction under label noise.
  • If the gap Δ' can be estimated from data, the method could adaptively choose subset size to meet a target contamination level.
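A back-of-envelope illustration of the first point, using the generic Hoeffding radius √(log(N/δ)/(2K)) as a stand-in for the paper's bound; the constant and the N, δ values are illustrative choices, not the paper's:

import numpy as np

# Worst-case concentration radius as a function of ensemble size K alone,
# over the K sweep used in the paper's figures.
N, delta = 50_000, 0.05
for K in [1, 2, 4, 8, 16, 32]:
    radius = np.sqrt(np.log(N / delta) / (2 * K))
    print(f"K={K:2d}  radius={radius:.3f}")
# Quadrupling K halves the radius: separation tightens with more proxies,
# with no additional labeled data required.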

Load-bearing premise

A positive structural expectation gap exists between the disagreement distributions of corrupted and clean groups, and the boundary-clean set is at least as large as the subset being selected.

What would settle it

A dataset where the empirical rank disagreement of known corrupted points exceeds that of known clean points at a rate slower than O(√(log(N/δ)/K)), or where the selected-subset contamination rate exceeds the certified bound despite a verified positive Δ' and sufficient clean examples, would falsify the separation claim.

Figures

Figures reproduced from arXiv: 2605.07551 by Csongor Horváth, Ida-Maria Sintorn, Prashant Singh.

Figure 1. Left: σ̂² histograms on D_clean (blue) and D_corr (red) at K ∈ {1, 2, 4, 8, 16, 32}, CIFAR-10, 10% targeted noise. Right: empirical gap Δ̂_K = σ̂²|D_clean − σ̂²|D_corr (solid) vs. Θ(K^(−1/2)) McDiarmid radius (dashed). Empirical separation outpaces the worst-case bound.

Figure 2. Corruption-rate breakdown sweep. Left: test accuracy vs. …

Figure 3. Proxy-diversity ablation on CIFAR-10 / VGG-19 / …

Figure 4. DR-IS ensemble separation score comparison between multi-seed and multi-architecture …

Figure 5. Top: single-layer FC; bottom: SGD log loss. Within each: Left: σ̂² histograms on D_clean (blue) and D_corr (red) at K ∈ {1, 2, 4, 8, 16, 32}, CIFAR-10, 10% targeted noise. Right: empirical gap Δ̂_K = σ̂²|D_clean − σ̂²|D_corr (solid) vs. Θ(K^(−1/2)) McDiarmid radius (dashed). Empirical separation outpaces the worst-case bound.

Figure 6. AUM identification histograms on D_clean (blue) and D_corr (red) on MNIST, 20% uniform noise (left: single-layer FC; right: SGD log-loss classifier). ‖∇f_i‖ is unbounded and grows disproportionately large for corrupted outliers; consequently, even for small values of β, the massive magnitude of the corrupted gradients completely overshadows the bounded disagreement signal.
Original abstract

Standard Importance Sampling (IS) collapses under label corruption because high-norm examples, prioritized for variance reduction, are often adversarial outliers. We formalize this misalignment using an $\varepsilon$-contamination model and propose Disagreement-Regularized Importance Sampling (DR-IS), a sub-sampling method based on loss rank-disagreement across an independent proxy ensemble. We prove finite-sample concentration bounds showing that the empirical rank disagreement of bulk corrupted examples is bounded above, and that of boundary-clean examples bounded below, both at rate $O(\sqrt{\log(N/\delta)/K})$ with probability $1-\delta$; when the structural expectation gap $\Delta'$ between the two groups is positive and the boundary-clean set is at least as large as the selected subset, these bounds certify strict separation and control the contamination rate of the selected subset. Empirically, DR-IS remains robust under targeted high-norm attacks that break magnitude-based methods such as the Error $L_2$-norm (EL2N) on benchmark datasets. DR-IS complements training-dynamics approaches like Area Under the Margin ranking (AUM), offering improved robustness in the loss-aligned regime alongside explicit finite-sample concentration certificates and a contamination bound limiting noise leakage from the statistical tail of corrupted points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Disagreement-Regularized Importance Sampling (DR-IS), a subsampling method that uses loss rank-disagreement across an independent proxy ensemble to mitigate the collapse of standard importance sampling under adversarial label corruption formalized via an ε-contamination model. It derives finite-sample concentration bounds showing that the empirical rank disagreement of bulk corrupted examples is upper-bounded and that of boundary-clean examples is lower-bounded, each at rate O(√(log(N/δ)/K)) with probability 1-δ. Under the structural assumptions that the expectation gap Δ' > 0 and that the boundary-clean set is at least as large as the selected subset, these bounds certify strict separation and bound the contamination rate of the selected high-disagreement subset. Empirical evaluations demonstrate robustness to targeted high-norm attacks that defeat magnitude-based baselines such as EL2N, while complementing dynamics-based methods like AUM.

Significance. If the concentration bounds and separation hold under the stated assumptions, the work supplies explicit finite-sample certificates and a contamination bound that are absent from most training-dynamics or heuristic subsampling approaches. This provides a principled, theoretically grounded alternative for robust importance sampling in corrupted regimes and could inform the design of other disagreement-based filters.

major comments (2)
  1. [Abstract / main theorem] Abstract and the statement of the main separation result: the certification of strict separation and the contamination-rate bound are conditioned on the structural assumption that Δ' := E_clean - E_corrupt > 0 together with |boundary-clean| ≥ size of selected subset. Neither condition is derived from the ε-contamination model; the paper must either supply verifiable conditions guaranteeing Δ' > 0 or analyze the degradation when an adversary places corrupted points near the decision boundary to shrink Δ'.
  2. [Proof of concentration bounds] The finite-sample bounds are obtained by applying standard concentration inequalities to rank statistics. The derivation should be checked for any hidden dependence on the proxy ensemble size K or on the particular loss used to compute ranks; if the O(√(log(N/δ)/K)) rate is tight only under additional regularity on the loss distribution, this must be stated explicitly.
minor comments (2)
  1. [Notation] Notation for the proxy ensemble and the rank-disagreement statistic should be introduced once and used consistently; the current abstract uses both “proxy ensemble” and “independent proxy ensemble” without a single defining equation.
  2. [Experiments] The empirical section would benefit from an explicit table reporting the measured Δ' on each benchmark under the attack configurations, so readers can assess how close the operating regime is to the assumed positive-gap regime; a minimal diagnostic sketch follows this report.
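A minimal form of the requested diagnostic, assuming the rank_disagreement scores from the earlier sketch and the ground-truth corruption mask available in controlled benchmarks; all names here are ours, not the paper's:

import numpy as np

def empirical_gap(scores: np.ndarray, is_corrupt: np.ndarray) -> float:
    """Measured gap: mean disagreement of clean examples minus that of
    corrupted ones, given a known boolean corruption mask."""
    return float(scores[~is_corrupt].mean() - scores[is_corrupt].mean())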

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of the assumptions and proof details that we address point by point below. We will incorporate revisions to strengthen the presentation of the structural assumptions and the concentration analysis.

Point-by-point responses
  1. Referee: [Abstract / main theorem] Abstract and the statement of the main separation result: the certification of strict separation and the contamination-rate bound are conditioned on the structural assumption that Δ' := E_clean - E_corrupt > 0 together with |boundary-clean| ≥ size of selected subset. Neither condition is derived from the ε-contamination model; the paper must either supply verifiable conditions guaranteeing Δ' > 0 or analyze the degradation when an adversary places corrupted points near the decision boundary to shrink Δ'.

    Authors: We agree that Δ' > 0 is a structural assumption not implied by the ε-contamination model. The model allows arbitrary placement of corrupted points, so an adversary can indeed position them near the decision boundary to reduce Δ'. We cannot derive Δ' > 0 in general from the contamination model alone, as it depends on the data distribution and proxy model behavior. In the revision we will add a dedicated discussion section that (i) states verifiable conditions under which Δ' > 0 is expected (e.g., when clean examples produce systematically higher rank disagreement than bulk corrupted ones under the true labeling), and (ii) analyzes the degradation of the separation and contamination bounds as Δ' approaches or falls below zero, including a quantitative sensitivity result showing how the certified contamination rate worsens. revision: yes

  2. Referee: [Proof of concentration bounds] The finite-sample bounds are obtained by applying standard concentration inequalities to rank statistics. The derivation should be checked for any hidden dependence on the proxy ensemble size K or on the particular loss used to compute ranks; if the O(√(log(N/δ)/K)) rate is tight only under additional regularity on the loss distribution, this must be stated explicitly.

    Authors: The O(√(log(N/δ)/K)) rate arises directly from applying Hoeffding’s inequality to the empirical disagreement, which is the average of K independent bounded [0,1] rank-disagreement indicators. The 1/√K factor is therefore explicit and not hidden. The boundedness holds for any loss function used to produce ranks, because disagreement is defined via relative ordering rather than the absolute loss values. No further regularity (e.g., sub-Gaussian tails or moment bounds beyond [0,1]) is required for the stated rate. In the revision we will (i) restate the boundedness assumption at the beginning of the proof, (ii) note that the rate is general for any loss yielding bounded disagreement statistics, and (iii) add a remark that if an unbounded loss were used without truncation the bounds would require a different concentration tool. revision: yes
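For concreteness, the calculation the authors describe, sketched under their stated assumption that σ̂²_i averages K independent [0, 1]-bounded terms; a reconstruction of the described argument, not a transcription of the proof:

\[
\Pr\big(\,|\hat{\sigma}^2_i - \mathbb{E}\hat{\sigma}^2_i| \ge t\,\big) \le 2\exp(-2Kt^2) \quad \text{for each fixed } i \ \text{(Hoeffding)},
\]
\[
\text{and a union bound over all } N \text{ examples yields }
t = \sqrt{\frac{\log(2N/\delta)}{2K}} = O\Big(\sqrt{\tfrac{\log(N/\delta)}{K}}\Big)
\text{ at total failure probability } \delta.
\]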

Circularity Check

0 steps flagged

No circularity: bounds follow from standard concentration inequalities on rank statistics

full rationale

The paper's central derivation applies standard finite-sample concentration inequalities (e.g., Hoeffding-type bounds) to the empirical rank-disagreement random variables computed across an independent proxy ensemble. The resulting O(√(log(N/δ)/K)) rates are applied separately to the bulk-corrupted and boundary-clean groups; strict separation is then stated as holding only under the explicit structural assumptions Δ' > 0 and |boundary-clean| ≥ selected-subset size. Neither the concentration statement nor the DR-IS selection rule is defined in terms of its own output, fitted to a prediction target, or justified solely by self-citation. The derivation chain rests on external probabilistic tools and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the ε-contamination model and the existence of a positive structural gap Δ' in expected disagreement; K and δ are user-chosen parameters rather than fitted constants.

free parameters (2)
  • K
Number of independent proxy models; controls the concentration rate but is chosen by the practitioner rather than fitted.
  • δ
    Failure probability parameter in the concentration bounds; set by the user.
axioms (2)
  • domain assumption ε-contamination model for label corruption
    Assumes a fraction ε of labels can be replaced by an arbitrary adversarial distribution.
  • ad hoc to paper Positive structural expectation gap Δ' between corrupted and clean groups
    Required for the separation argument to hold; stated as a condition in the abstract.

pith-pipeline@v0.9.0 · 5521 in / 1434 out tokens · 39315 ms · 2026-05-11T02:04:04.472358+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. [1]

    Variance reduction in SGD by distributed importance sampling

    Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015. Also appeared at the ICLR 2016 Workshop Track.

  2. [2]

    Food-101 – mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014, pages 446–461. Springer, 2014.

  3. [3]

    Selection via proxy: Efficient data selection for deep learning

    Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations (ICLR), 2020.

  4. [4]

    Importance sampling for minibatches

    Dominik Csiba and Peter Richtárik. Importance sampling for minibatches. Journal of Machine Learning Research, 19(27):1–21, 2018.

  5. [5]

    Recent advances in algorithmic high-dimensional robust statistics

    Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.

  6. [6]

    Bridging the gap between constant step size stochastic gradient descent and Markov chains

    Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Annals of Statistics, 48(3), 2020.

  7. [7]

    Deep Bayesian active learning with image data

    Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1183–1192, 2017.

  8. [8]

    Co-teaching: Robust training of deep neural networks with extremely noisy labels

    Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

  9. [9]

    Array programming with NumPy

    Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin S...

  10. [10]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

  11. [11]

    Robust estimation of a location parameter

    Peter J Huber. Robust estimation of a location parameter. pages 492–518, 1992.

  12. [12]

    Matplotlib: A 2D graphics environment

    John D Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

  13. [13]

    Accelerating deep learning by focusing on the biggest losers

    Angela H. Jiang, Daniel L.-K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, and Padmanabhan Pillai. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.

  14. [14]

    Training deep models faster with robust, approximate importance sampling

    Tyler B. Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

  15. [15]

    Not all samples are created equal: Deep learning with importance sampling

    Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 2525–2534, 2018.

  16. [16]

    What uncertainties do we need in Bayesian deep learning for computer vision?

    Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

  17. [17]

    GRAD-MATCH: Gradient matching based data subset selection for efficient deep model training

    Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. GRAD-MATCH: Gradient matching based data subset selection for efficient deep model training. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 5464–5474, 2021.

  18. [18]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

  19. [19]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.

  20. [20]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

  21. [21]

    DivideMix: Learning with noisy labels as semi-supervised learning

    Junnan Li, Richard Socher, and Steven C. H. Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations (ICLR), 2020.

  22. [22]

    Early-learning regularization prevents memorization of noisy labels

    Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Carlos Fernandez-Granda. Early-learning regularization prevents memorization of noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.

  23. [23]

    Online batch selection for faster training of neural networks

    Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. In International Conference on Learning Representations (ICLR) Workshop Track, 2016. arXiv preprint arXiv:1511.06343, 2015.

  24. [24]

    SGDR: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.

  25. [25]

    On the method of bounded differences

    Colin McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, 1989, volume 141 of London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, 1989.

  26. [26]

    Data structures for statistical computing in Python

    Wes McKinney. Data structures for statistical computing in Python. In Proceedings of the Python in Science Conference, volume 445, pages 56–61. SciPy, 2010.

  27. [27]

    Prioritized training on points that are learnable, worth learning, and not yet learnt

    Sören Mindermann, Jan M. Brauner, Muhammed T. Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, and Yarin Gal. Prioritized training on points that are learnable, worth learning, and not yet learnt. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 156...

  28. [28]

    Coresets for data-efficient training of machine learning models

    Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 6950–6960, 2020.

  29. [29]

    Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm

    Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1–2):549–573, 2016.

  30. [30]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...

  31. [31]

    Deep learning on a data diet: Finding important examples early in training

    Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 20596–20607, 2021.

  32. [32]

    Concentration inequalities for Markov chains by Marton couplings and spectral methods

    Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab., 20:1–32, 2015.

  33. [33]

    Scikit-learn: Machine learning in Python

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.

  34. [34]

    Identifying mislabeled data using the area under the margin ranking

    Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, and Kilian Q. Weinberger. Identifying mislabeled data using the area under the margin ranking. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.

  35. [35]

    Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis

    Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.

  36. [36]

    Prioritized experience replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations (ICLR), 2016. URL https://arxiv.org/abs/1511.05952.

  38. [38]

    Query by committee

    H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), pages 287–294. Association for Computing Machinery, 1992.

  39. [39]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

  40. [40]

    Beyond neural scaling laws: Beating power law scaling via data pruning

    Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: Beating power law scaling via data pruning. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022.

  41. [41]

    A randomized Kaczmarz algorithm with exponential convergence

    Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.

  42. [42]

    Dataset cartography: Mapping and diagnosing datasets with training dynamics

    Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, 2020.

  43. [43]

    An empirical study of example forgetting during deep neural network learning

    Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations (ICLR), 2019.

  44. [44]

    SciPy 1.0: fundamental algorithms for scientific computing in Python

    Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.

  45. [45]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

  46. [46]

    Stochastic optimization with importance sampling for regularized loss minimization

    Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1–9, 2015.