
arxiv: 2606.04212 · v2 · pith:CAXIZ7G3new · submitted 2026-06-02 · 💻 cs.LG · stat.ML

Edge of Stability Selectively Shapes Learning Across the Data Distribution

Pith reviewed 2026-06-28 10:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords edge of stability · learning dynamics · data distribution · Hessian eigenvector · gradient alignment · cross-entropy loss · optimization trajectory · group-wise progress

The pith

The edge of stability redistributes learning progress across subsets of the training data by favoring some groups and suppressing others.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the edge of stability is not merely a global constraint on optimization but acts selectively on different parts of the data. It amplifies progress for groups whose aggregate gradients align with the top eigenvector of the Hessian and whose gradients remain non-vanishing over time. Under cross-entropy loss, gradients saturate on confidently classified examples, which shifts the advantage toward output-outliers whose gradients persist. A sympathetic reader would care because the result shows how a single stability boundary can create systematic differences in which examples get learned effectively during training.

Core claim

Entering the edge-of-stability regime from a fixed training state produces a measurable trade-off: groups whose gradients align with the dominant Hessian direction and sustain magnitude receive accelerated progress, while other groups see their progress slowed. The alignment condition is isolated by a controlled perturbation that preserves distance but randomizes direction, removing the benefit. Gradient saturation under cross-entropy decouples groups that become confidently classified, leaving the advantage with output-outliers whose gradients do not vanish.
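
The saturation argument can be illustrated with a minimal sketch. It relies on the standard identity that the gradient of cross-entropy with respect to the logits is softmax(logits) minus the one-hot label. The two-class logits below are hypothetical, and treating an output-outlier as an example the model is confidently wrong about is an assumption made for illustration, not the paper's definition.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_logit_grad_norm(logits, label):
    """Euclidean norm of d(cross-entropy)/d(logits) = softmax(logits) - onehot(label)."""
    probs = softmax(logits)
    grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical examples, both with true label 0.
confident_correct = ce_logit_grad_norm([8.0, -8.0], label=0)  # gradient saturates
output_outlier = ce_logit_grad_norm([-8.0, 8.0], label=0)     # gradient persists

print(confident_correct, output_outlier)
```

The confidently classified example contributes an almost vanishing gradient, while the confidently wrong one keeps a gradient norm close to its maximum, which is the asymmetry the core claim depends on.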

What carries the argument

The branching intervention that enters or exits the edge-of-stability regime from an identical training state, together with the requirement that a group's aggregate gradient aligns with the top Hessian eigenvector.
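
As a rough sketch of the alignment requirement, the snippet below estimates the top Hessian eigenvector by power iteration and computes two quantities named in the figure captions: the alignment cos² θk and the squared projection Ck = (∇ℓk · v1)² from Figure 22. The 2×2 Hessian and the two group gradients are toy values chosen for illustration, and the helper names are hypothetical. In a real network one would use Hessian-vector products rather than an explicit matrix.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, v):
    return [dot(row, v) for row in H]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def top_eigenvector(H, iters=200):
    """Power iteration; assumes H is symmetric with a dominant positive eigenvalue."""
    v = normalize([1.0] * len(H))
    for _ in range(iters):
        v = normalize(matvec(H, v))
    return v

def alignment(grad, v1):
    """cos^2 of the angle between a group gradient and the unit vector v1."""
    return dot(grad, v1) ** 2 / dot(grad, grad)

def curvature_influence(grad, v1):
    """Squared projection (grad . v1)^2, which also reflects gradient magnitude."""
    return dot(grad, v1) ** 2

H = [[4.0, 0.0], [0.0, 1.0]]   # toy Hessian, top eigenvector along the first axis
v1 = top_eigenvector(H)
g_aligned = [3.0, 0.0]         # hypothetical group gradient along v1
g_orthogonal = [0.0, 3.0]      # same norm, orthogonal to v1

print(alignment(g_aligned, v1), curvature_influence(g_aligned, v1))
print(alignment(g_orthogonal, v1), curvature_influence(g_orthogonal, v1))
```

Both gradients have the same norm, so the difference in the two measures comes from direction alone. That is the distinction the equal-distance, random-direction control is meant to isolate.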

If this is right

  • Groups whose gradients point along the top Hessian eigenvector will show faster loss reduction while the optimizer stays at the edge of stability.
  • Confidently classified groups lose the edge-of-stability advantage once their gradients approach zero, shifting the relative advantage to output-outliers.
  • The stability boundary therefore functions as an allocator that determines which subsets of the data receive more optimization effort.
  • Any intervention that preserves or breaks alignment with the top eigenvector will directly control whether a group benefits from the edge-of-stability regime.
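
A toy sketch can make the stability boundary concrete. For gradient descent on a quadratic, each coordinate is multiplied by (1 − η·λ) per step, so a direction with curvature λ is stable only when λ < 2/η. The example starts two runs from the same state with different learning rates, which loosely mirrors a branching setup. The curvatures and learning rates are hypothetical, and a pure quadratic has no progressive sharpening or self-stabilization, so this shows only the threshold and the directional oscillation, not the paper's experiments.

```python
def gd_on_quadratic(curvatures, x0, lr, steps):
    """Gradient descent on L(x) = 0.5 * sum(lam_i * x_i^2); returns the iterate history."""
    x = list(x0)
    history = [tuple(x)]
    for _ in range(steps):
        # The gradient along coordinate i is lam_i * x_i.
        x = [xi - lr * lam * xi for xi, lam in zip(x, curvatures)]
        history.append(tuple(x))
    return history

curvatures = [4.0, 1.0]   # top curvature (sharpness) is 4.0
start = [1.0, 1.0]        # identical starting state for both branches

# lr = 0.45 gives 2/lr of about 4.44, above the sharpness: both directions contract.
stable = gd_on_quadratic(curvatures, start, lr=0.45, steps=20)
# lr = 0.55 gives 2/lr of about 3.64, below the sharpness: the top direction
# oscillates with growing amplitude while the other direction still contracts.
unstable = gd_on_quadratic(curvatures, start, lr=0.55, steps=20)

print(stable[-1])
print(unstable[-1])
```

Only the coordinate along the sharpest direction flips sign and grows in the second run. The low-curvature coordinate decays in both runs, which is a simplified picture of why a group whose gradient is orthogonal to the top eigenvector would be decoupled from the oscillation.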

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of training curricula could deliberately engineer gradient alignments to steer which subgroups receive accelerated learning.
  • The same mechanism may help explain why certain subpopulations remain under-learned even after overall training loss has plateaued.
  • Extending the branching intervention to other loss functions could test whether the advantage for output-outliers is specific to cross-entropy saturation.

Load-bearing premise

The branching intervention changes only the stability regime and does not introduce separate alterations to gradient magnitudes or directions that would affect group progress on their own.

What would settle it

A controlled experiment in which groups are matched for gradient magnitude but one set is forced into alignment with the top Hessian eigenvector while the other is not; if the aligned set shows no extra progress once inside the edge-of-stability regime, the claimed selectivity mechanism is false.

Figures

Figures reproduced from arXiv: 2606.04212 by Anakha Ganesh, Pierfrancesco Beneventano, Shauna Kwag, Tomaso Poggio.

Figure 1. Conceptual taxonomy of prototypes. Data samples are categorized based on geometric proximity in input space relative to class-specific cluster centroids (µ0, µ1). [Panel labels: inlier, boundary, output-outlier, input-outlier; classes: Automobile, Truck.]
Figure 3. Directional coupling at EoS. The optimizer oscillates along v1 (red zigzag). When a group’s gradient ∇ℓk aligns with v1 (left), self-stabilization reduces loss for that group. When ∇ℓk is orthogonal to v1 (right), the group is decoupled from the oscillation and its loss does not benefit. Curvature influence. While cos² θk measures direction, it does not capture gradient magnitude. We report the squared pro…
Figure 4. Shows the effect of the branching intervention under MSE training. After the exit branch leaves the EoS regime at t*, prototype losses begin to diverge between the two runs. The divergence is group-specific: input-outlier and output-outlier loss decrease faster under the baseline (∆ℓk > 0), while inlier and boundary loss decrease faster under the exit branch (∆ℓk < 0). The stability constraint does not u…
Figure 5. Alignment dynamics under EoS. Input-outlier cos² θk rises during progressive sharpening and dominates v1 at EoS onset. Under the baseline (solid), self-stabilization resolves this alignment and v1 rotates toward boundary points. Under the exit branch (dashed), alignment remains elevated until the new threshold is reached at t**. Once EoS is reached at t*, input-outlier cos² θk declines under the baseli…
Figure 6. Directional alignment is necessary for the selective EoS advantage. Identical seeds and configurations; only the input-outlier displacement direction differs. Top: Coherent displacement along vdiff yields high alignment and curvature influence for input-outliers, which capture the EoS advantage. Bottom: Random orthogonal displacement at equal distance reduces alignment and curvature influence, largely elim…
Figure 7. Gradient persistence determines which group retains curvature influence. Identical seeds and configurations; only the loss differs. Top: Input-outlier have elevated alignment, strong curvature influence, and captures the EoS advantage. Bottom: Alignment for input-outlier is high, but gradient saturation weakens their curvature influence; the EoS advantage shifts to output-outliers, whose gradients remain a…
Figure 8. Centroid distance predicts per-example alignment with v1 at EoS. Each dot is one training example; the line indicates the monotonic trend captured by Spearman ρ. Left: during progressive sharpening, centroid distance and cos²(∇ℓi, v1) are uncorrelated (ρ = −0.11). Right: at EoS, correlation emerges (ρ = 0.39). Full trajectory of correlation is shown in Appendix E. 5 Generalizing the Alignment Principle S…
Figure 9. EoS improves adversarial robustness only when boundary points dominate v1. Left (α = 3, boundary dominates v1): the EoS branch (solid) outperforms the exit branch (dashed) after t**. Right (α = 10, input-outlier dominates v1): the pattern reverses, and the exit branch performs better. Robustness gains appear only when EoS prioritizes the evaluated subset. Single seed.
Figure 10. EoS steers generalization toward the dominant group. Test MSE on input-outliers across αtest (after t**). Left (α = 3, boundary dominates v1): no OOD advantage. Right (α = 10, input-outlier dominates v1): EoS improves OOD performance at large αtest. Single seed. 6 Discussion We find that the EoS stability constraint acts as an inductive bias, not merely an implicit regularizer. Rather than selecting amon…
Figure 11. GD CNN MSE. Curvature influence is comparable for output-outliers and input-outliers,
Figure 12. GD ResNet MSE. Curvature influence is highest for output-outliers, and it is the primary
Figure 13. SGD MLP MSE. Curvature influence is highest for input-outliers, and it is the primary
Figure 14. GD Momentum MLP MSE. Curvature influence is highest for input-outliers, and it is
Figure 15. Distribution of centroid distance by prototype subgroup on the (3,5) class pair. Boundary
Figure 16. Curvature influence under input-outlier construction (
Figure 17. Curvature influence under input-outlier construction (
Figure 18. Schematic for coherent vs. incoherent input-outlier construction.
Figure 19. Centroid distance is preserved under the random-direction control (median 69.4 vs. 70.3
Figure 20. Top: beginning of training, initially negative correlation that increases; Middle: EoS onset, monotonically increasing correlation; Bottom: correlation peak when large-amplitude oscillations along v1 develop, then monotonically decrease to negative after peaking. [Plot axes: training step; sharpness; mean cos²(∇ℓi, v1).]
Figure 21. Shows the underlying per-example trajectories. Atypical points (red) exhibit high alignment that peaks just before EoS onset and then declines, while typical points (blue) remain weakly aligned over the same interval. This relationship is monotonic with respect to distance from the centroid near EoS onset. Later in training, the ordering reverses, with typical points eventually exceeding atypical ones in …
Figure 22. Empirical validation of the single-mode proxy. Scatter of Qk² = ⟨∇ℓk, ∇S⟩² versus Ck = (∇ℓk · v1)² across (subgroup, checkpoint) pairs in training. The dashed line shows the median proportionality Qk² = α Ck, α = 7.7 × 10⁵. The relationship holds across the trajectory; for inliers (nearly orthogonal to v1), Ck is small and the proxy is loosest, in the regime where the selector predicts no EoS advanta…
read the original abstract

Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the edge of stability (EoS) is not merely a global optimization property but selectively redistributes learning progress across subsets of the training distribution. A branching intervention is used to enter or exit the EoS regime from an identical training state, causally demonstrating amplification of progress on some groups and suppression on others. Two necessary conditions for a group to benefit are identified: (1) alignment of its aggregate gradient with the top Hessian eigenvector (isolated via a controlled perturbation that preserves distance but randomizes direction), and (2) sustained non-vanishing gradient magnitude (attributed to gradient saturation under cross-entropy, which decouples confidently classified groups and favors output-outliers).

Significance. If the causal claims hold, the work provides a mechanistic account of how sharpness constraints allocate learning resources differentially across the data distribution. This could explain differential progress on subgroups, inform analyses of example difficulty, and connect EoS to broader questions of generalization and fairness. The use of controlled interventions (branching and perturbation) is a methodological strength that moves beyond purely observational analyses of sharpness.

major comments (2)
  1. [§3 (Branching Intervention)] The manuscript does not report explicit verification that the branching step preserves per-group gradient norms and directions. This verification is load-bearing for the central causal claim, as any systematic shift in gradient statistics induced by the intervention itself could produce the observed amplification/suppression trade-off independent of the stability constraint.
  2. [§4.1 (Alignment Mechanism)] The controlled perturbation is described as preserving distance while randomizing direction, but no quantitative check is provided on whether the post-perturbation gradient alignment with the top eigenvector is reduced to chance levels across the tested groups. Without this, the isolation of the alignment condition remains incomplete.
minor comments (2)
  1. Notation for group-wise quantities (e.g., aggregate gradient G_g) is introduced without a consolidated table of symbols, making cross-references between the alignment and magnitude conditions harder to track.
  2. Figure captions for the branching results should explicitly state the number of random seeds and whether error bars reflect standard deviation across seeds or across groups.
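
For reference on what "chance levels" means in the second major comment: a direction drawn uniformly at random in d dimensions has expected cos² of 1/d with any fixed unit vector. The sketch below checks this numerically. The dimension, trial count, fixed norm, and function names are illustrative assumptions, and the vectors are generic stand-ins rather than gradients from the paper's models.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform random direction: normalize a standard Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def random_vector_with_norm(dim, norm, rng):
    """Random direction rescaled to a fixed Euclidean norm (distance is preserved)."""
    return [norm * x for x in random_unit_vector(dim, rng)]

def mean_cos2_with_axis(dim, trials, norm=70.0, seed=0):
    """Average cos^2 between random fixed-norm vectors and the first coordinate axis."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = random_vector_with_norm(dim, norm, rng)
        # The reference direction is the unit vector e_1, so cos^2 = u[0]^2 / |u|^2.
        total += u[0] ** 2 / sum(x * x for x in u)
    return total / trials

dim = 50
estimate = mean_cos2_with_axis(dim, trials=4000)
print(estimate, 1.0 / dim)
```

A reported post-perturbation alignment could be compared against this 1/d baseline, with d taken as the dimension of the space in which the direction is randomized.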

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the causal evidence in our branching and perturbation experiments. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3 (Branching Intervention)] The manuscript does not report explicit verification that the branching step preserves per-group gradient norms and directions. This verification is load-bearing for the central causal claim, as any systematic shift in gradient statistics induced by the intervention itself could produce the observed amplification/suppression trade-off independent of the stability constraint.

    Authors: We agree that explicit verification is important for the causal claim. The branching intervention is constructed to act only on global optimizer state (e.g., step-size or sharpness threshold) while leaving the model parameters, data batches, and per-example loss computations unchanged at the branching instant. In the revision we will add a supplementary figure and table that directly compare per-group gradient norms and cosine similarities immediately before and after the branch, confirming that any differences fall within numerical precision and do not systematically favor or disfavor any group. revision: yes

  2. Referee: [§4.1 (Alignment Mechanism)] The controlled perturbation is described as preserving distance while randomizing direction, but no quantitative check is provided on whether the post-perturbation gradient alignment with the top eigenvector is reduced to chance levels across the tested groups. Without this, the isolation of the alignment condition remains incomplete.

    Authors: We accept that a quantitative confirmation would make the isolation of the alignment mechanism more rigorous. The perturbation is implemented by adding a small random vector of fixed Euclidean norm to the aggregate gradient before the update. In the revised version we will include a table reporting the pre- and post-perturbation cosine similarities between each group’s gradient and the top Hessian eigenvector, demonstrating that post-perturbation alignments are statistically indistinguishable from zero (i.e., chance level) across all groups examined. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical intervention claims rest on experimental controls, not self-referential definitions or fitted predictions

full rationale

The paper presents its central claims via a branching intervention and controlled perturbations described in the abstract, without any equations, fitted parameters renamed as predictions, or self-citations that reduce the result to its inputs by construction. No self-definitional steps, uniqueness theorems, or ansatzes appear in the provided text. The derivation chain is therefore self-contained as an empirical demonstration rather than a mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5709 in / 1024 out tokens · 26866 ms · 2026-06-28T10:25:31.016933+00:00 · methodology


A Experimental setup

A.1 Architecture

We use a fully connected MLP that flattens each input image to a vector, then applies two hidden linear layers of width 512 with ReLU activations, followed by a...