Flatland: The Adventures of Gradient Descent with Large Step Sizes

Curtis Fox; Holger Rauhut; Leonardo Galli; Mark Schmidt; Wiebke Bartolomaeus

arxiv: 2606.06722 · v1 · pith:D5YDHQYOnew · submitted 2026-06-04 · 💻 cs.LG

Flatland: The Adventures of Gradient Descent with Large Step Sizes

Leonardo Galli , Curtis Fox , Wiebke Bartolomaeus , Mark Schmidt , Holger Rauhut This is my paper

Pith reviewed 2026-06-28 02:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords gradient descentedge of stabilityadaptive methodslarge step sizessharpnessneural network optimizationnon-monotonic losslocal Lipschitz continuity

0 comments

The pith

Adaptive methods allow gradient descent to use large step sizes while remaining at the edge of stability from the start of training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to define large step sizes for gradient descent on functions that are only locally smooth using local Lipschitz continuity of the gradient. It introduces adaptive first-order methods that achieve these large steps and keep the product of step size and sharpness above 2, the edge of stability threshold, right away. This leads to non-monotonic loss decrease and the ability to reach the global minimum sharpness. The work also finds that reaching very flat regions too early can slow convergence and reduce generalization, but a self-stabilization trick can steer training into slightly sharper areas for better results.

Core claim

First-order adaptive methods can be designed to provably produce large step sizes under local gradient continuity and operate at the edge of stability throughout training, with the loss decreasing nonmonotonically and the step size-sharpness product staying above 2. These methods can minimize sharpness to its global minimum. Avoiding globally flat regions early in training prevents slower convergence and poorer generalization, while self-stabilization allows gradient descent to enter slightly sharper valleys successfully.

What carries the argument

The edge of stability condition, defined as the step size multiplied by the largest Hessian eigenvalue exceeding 2, which the adaptive methods maintain using local continuity to permit large steps.

Load-bearing premise

The objective function's gradient is locally Lipschitz continuous or Hölder continuous, rather than requiring global L-smoothness.

What would settle it

A counterexample where an adaptive method with large step sizes fails to keep the step size-sharpness product above 2 while still achieving faster convergence than standard gradient descent with small steps.

Figures

Figures reproduced from arXiv: 2606.06722 by Curtis Fox, Holger Rauhut, Leonardo Galli, Mark Schmidt, Wiebke Bartolomaeus.

**Figure 1.** Figure 1: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), and training loss (4th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for 3 different models and 3 different datasets. cost when δ = 0.5, while this is not the case when δ = 0.9.… view at source ↗

**Figure 2.** Figure 2: We plot the top 20 (30 for EMNIST) eigenvalues (Top Row), layer-wise gradient norm for the hidden layers (which we average across all hidden layers) and gradient norm of the bias parameters in the last layer (Middle Row), and the maximum per layer percentage of neural activations approximately equal to zero (Bottom Row) for the NLS method. This is repeated for 3 different models and 3 different datasets. 0… view at source ↗

**Figure 3.** Figure 3: We plot the sharpness (Left), training loss (Middle), and test accuracy (Right) for three different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This experiment is performed for the resnet34 model on the SVHN dataset. By this argument, if ηk = K, the sharpness is forced to hit its minimum of 2/K, while if we mainta… view at source ↗

**Figure 4.** Figure 4: We plot the segment smoothness (2.1) for different steps along the gradient direction at varying training iterations for the NLS line search method. In addition, the vertical dashed lines correspond to the selected step size at each iteration. Finally, the horizontal dashed lines correspond to the sharpness at each iteration. This is repeated for 3 different models and 3 different datasets. ues of ηw. In … view at source ↗

**Figure 5.** Figure 5: We plot the sharpness * step size (Top Left), step size (Top Right), sharpness (Middle Left), sharpness for first 200 iterations (Middle Right), training loss (Bottom Left), and test accuracy (Bottom Right) for two different runs of gradient descent using different variants of warmup. This experiment is performed for the resnet34 model on the CIFAR100 dataset. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗

**Figure 6.** Figure 6: We plot the training loss (1st Row), test accuracy (2nd Row), step size (3rd Row), minimum sharpness across all batches in each epoch (4th Row), and average sharpness across all batches in each epoch (5th row) for five different methods. We compare gradient descent with the LS, NLS, and NLS-ub line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated on the CIFAR… view at source ↗

**Figure 7.** Figure 7: We plot the training loss (1st Row), test accuracy (2nd Row), step size (3rd Row), minimum sharpness across all batches in each epoch (4th Row), and average sharpness across all batches in each epoch (5th row) for five different methods. We compare gradient descent with the LS, NLS, and NLS-ub line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated on the CIFAR… view at source ↗

**Figure 8.** Figure 8: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for a vision transformer (tinyVIT) model on the CIFAR10, CIFAR100, and SVHN dataset… view at source ↗

**Figure 9.** Figure 9: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and function evaluations (5th row) for five different methods. We compare the NLS method with δ = 0.5 and δ = 0.9. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR10 dataset. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_9.png] view at source ↗

**Figure 10.** Figure 10: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and function evaluations (5th row) for five different methods. We compare the NLS method with δ = 0.5 and δ = 0.9. This is repeated for the CNN, resnet34, and vgg11 models on the SVHN dataset. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗

**Figure 11.** Figure 11: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR10 da… view at source ↗

**Figure 12.** Figure 12: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on t… view at source ↗

**Figure 13.** Figure 13: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR100 d… view at source ↗

**Figure 14.** Figure 14: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on t… view at source ↗

**Figure 15.** Figure 15: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods.. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the SVHN data… view at source ↗

**Figure 16.** Figure 16: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods.. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on … view at source ↗

**Figure 17.** Figure 17: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the EMNIST dat… view at source ↗

**Figure 18.** Figure 18: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), test accuracy (5th Row), and gradient norm (6th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on t… view at source ↗

**Figure 19.** Figure 19: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR10 dataset. 47 [PITH_FULL_IMA… view at source ↗

**Figure 20.** Figure 20: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the CIFAR10 dataset. 48 [… view at source ↗

**Figure 21.** Figure 21: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR100 dataset. 49 [PITH_FULL_IM… view at source ↗

**Figure 22.** Figure 22: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the CIFAR100 dataset. 50 … view at source ↗

**Figure 23.** Figure 23: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the SVHN dataset. 51 [PITH_FULL_IMAGE:… view at source ↗

**Figure 24.** Figure 24: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the SVHN dataset. 52 [PIT… view at source ↗

**Figure 25.** Figure 25: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the CNN, resnet34, and vgg11 models on the EMNIST dataset. 53 [PITH_FULL_IMAG… view at source ↗

**Figure 26.** Figure 26: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the LS, NLS, and PoNLS line searches as well as gradient descent with the CDAT step size selection and SAM. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the EMNIST dataset. I.2. A… view at source ↗

**Figure 27.** Figure 27: We plot the top 20 eigenvalues (1st row), layer-wise gradient norm for the hidden layers (which we average across all hidden layers) and gradient norm of the bias parameters in the last layer (2nd Row), the maximum per layer percentage of neural activations approximately equal to zero (3rd Row), the percentage of the gradient entries equal to 0 (4th Row), and training accuracy (5th Row) for the NLS line s… view at source ↗

**Figure 28.** Figure 28: We plot the top 20 eigenvalues (1st row), layer-wise gradient norm for the hidden layers (which we average across all hidden layers) and gradient norm of the bias parameters in the last layer (2nd Row), the maximum per layer percentage of neural activations approximately equal to zero (3rd Row), the percentage of the gradient entries equal to 0 (4th Row), and training accuracy (5th Row) for the NLS line s… view at source ↗

**Figure 29.** Figure 29: We plot the top 20 eigenvalues (1st row), layer-wise gradient norm for the hidden layers (which we average across all hidden layers) and gradient norm of the bias parameters in the last layer (2nd Row), the maximum per layer percentage of neural activations approximately equal to zero (3rd Row), the percentage of the gradient entries equal to 0 (4th Row), and training accuracy (5th Row) for the NLS line s… view at source ↗

**Figure 30.** Figure 30: We plot the top 20 eigenvalues (1st row), layer-wise gradient norm for the hidden layers (which we average across all hidden layers) and gradient norm of the bias parameters in the last layer (2nd Row), the maximum per layer percentage of neural activations approximately equal to zero (3rd Row), the percentage of the gradient entries equal to 0 (4th Row), and training accuracy (5th Row) for the NLS line s… view at source ↗

**Figure 31.** Figure 31: We plot the top 20 eigenvalues (1st row), layer-wise gradient norm for the hidden layers (which we average across all hidden layers) and gradient norm of the bias parameters in the last layer (2nd Row), the maximum per layer percentage of neural activations approximately equal to zero (3rd Row), the percentage of the gradient entries equal to 0 (4th Row), and training accuracy (5th Row) for the NLS line s… view at source ↗

**Figure 32.** Figure 32: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR10 dataset. 61 [PITH_FULL_IMAGE:figures/f… view at source ↗

**Figure 33.** Figure 33: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the CIFAR10 dataset. 62 [PITH_FULL_IM… view at source ↗

**Figure 34.** Figure 34: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR100 dataset. 63 [PITH_FULL_IMAGE:figures/… view at source ↗

**Figure 35.** Figure 35: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the CIFAR100 dataset. 64 [PITH_FULL_I… view at source ↗

**Figure 36.** Figure 36: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the CNN, resnet34, and vgg11 models on the SVHN dataset. 65 [PITH_FULL_IMAGE:figures/full… view at source ↗

**Figure 37.** Figure 37: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the SVHN dataset. 66 [PITH_FULL_IMAGE… view at source ↗

**Figure 38.** Figure 38: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the CNN, resnet34, and vgg11 models on the EMNIST dataset. 67 [PITH_FULL_IMAGE:figures/fu… view at source ↗

**Figure 39.** Figure 39: We plot the sharpness * step size (1st Row), step size (2nd Row), sharpness (3rd Row), training loss (4th Row), and test accuracy (5th Row) for five different methods. We compare gradient descent with the NLS and NLS-ub line searches as well as gradient descent with the CDAT step size selection. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the EMNIST dataset. I.4. Additional Ex… view at source ↗

**Figure 40.** Figure 40: We plot the segment smoothness (2.1) for different steps along the gradient direction at varying training iterations for the NLS line search method. In addition, the vertical dashed lines correspond to the selected step size at each iteration. Finally, the horizontal dashed lines correspond to the sharpness at each iteration. This is repeated for the CNN, resnet34, and vgg11 models on the CIFAR10, CIFAR10… view at source ↗

**Figure 41.** Figure 41: We plot the segment smoothness (2.1) for different steps along the gradient direction at varying training iterations for the NLS line search method. In addition, the vertical dashed lines correspond to the selected step size at each iteration. Finally, the horizontal dashed lines correspond to the sharpness at each iteration. This is repeated for the MLP, densenet121, and wide resnet50 2 models on the CIF… view at source ↗

**Figure 42.** Figure 42: We plot the training loss (1st Row), sharpness (2nd Row), and gradient norm (3rd row) for the NLS line search method. This is repeated for the EMNIST×CNN, SVHN×resnet34, and CIFAR10×vgg11 experiments, where the biases have been removed from the last layer of the networks being trained. 72 [PITH_FULL_IMAGE:figures/full_fig_p072_42.png] view at source ↗

read the original abstract

The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to reply to the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of "large" step sizes that requires only local Lipschitz (or even H\"older) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the Hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally-flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines large step sizes via local gradient continuity and claims adaptive first-order methods that hit edge-of-stability from iteration one, but the local assumption may leave the stability guarantee under-supported.

read the letter

The punchline is that this work gives a local-Lipschitz or Hölder definition of what counts as a large step size for objectives without global smoothness, then builds adaptive methods said to deliver those steps and keep the stepsize-sharpness product above 2 right away, with non-monotonic loss. That framing is new relative to prior EoS work that leaned on global smoothness or post-hoc observations.

What the paper does cleanly is tackle the practical mismatch between theory and deep learning objectives by dropping the global L-smoothness requirement. The self-stabilization argument that lets the method enter slightly sharper regions and the note that hitting flat areas too early can hurt both speed and generalization are concrete angles that could be tested. The claim that the methods can drive sharpness to its global minimum is also stated directly.

The soft spot sits in the soundness of the guarantees. Local continuity lets the effective constant change across regions, so showing that the product stays above the threshold from step one requires some control on how that constant evolves relative to the chosen step. The stress-test concern is on point here: without an explicit uniformity condition or mechanism in the proof, the “provably operate at EoS from the start” part rests on more than the stated assumption alone. Full proofs would need to be checked for whether they close that gap or introduce an implicit fixed-ball argument.

This is for readers working on optimization theory for locally smooth or non-globally smooth problems in machine learning. Someone already thinking about step-size rules or EoS phenomena would get usable definitions and method sketches to examine or extend. It deserves peer review because it targets a real open question with a fresh modeling choice, even though the local-continuity proofs will likely need tightening.

Referee Report

2 major / 2 minor

Summary. The paper defines 'large' step sizes for gradient descent using only local Lipschitz or Hölder continuity of the gradient (instead of global L-smoothness). It introduces first-order adaptive methods that provably achieve such steps and operate at the edge of stability (EoS) from the first iteration onward, characterized by non-monotonic loss decrease and step-size × sharpness remaining above the threshold of 2. The work also shows how to drive sharpness to its global minimum and uses a self-stabilization argument to allow entry into slightly sharper regions, converting failed runs into successful ones.

Significance. If the local-continuity framework and the EoS guarantees hold, the results supply a principled way to select and adapt large steps in the non-globally-smooth regimes typical of deep learning, directly linking theory to the observed edge-of-stability phenomenon. The self-stabilization mechanism and the empirical demonstration that early global flatness can harm generalization are noteworthy contributions.

major comments (2)

[§3.2, Theorem 2] §3.2 and Theorem 2: the proof that the adaptive methods keep step-size × sharpness ≥ 2 from iteration 1 onward invokes only local Lipschitz/Hölder continuity; the argument appears to require an implicit uniformity condition on how the local constant evolves relative to the chosen step size, which is not guaranteed by the stated assumption alone and is load-bearing for the 'provably operate at EoS right from the start' claim.
[§5.1] §5.1, the self-stabilization construction: the claim that GD can be steered into slightly sharper valleys without instability rests on the same local-continuity premise; an explicit bound relating the local constant to the step-size adaptation is needed to make the 'turn unsuccessful training runs into successful ones' result rigorous.

minor comments (2)

[§2] Notation for the local Hölder exponent and the adaptive step-size rule should be introduced once in §2 and used consistently thereafter.
[Figure 4] Figure 4 caption should explicitly state the number of independent runs and the precise definition of 'sharpness' used in the plot.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, indicating planned revisions where they strengthen the rigor of the claims.

read point-by-point responses

Referee: [§3.2, Theorem 2] §3.2 and Theorem 2: the proof that the adaptive methods keep step-size × sharpness ≥ 2 from iteration 1 onward invokes only local Lipschitz/Hölder continuity; the argument appears to require an implicit uniformity condition on how the local constant evolves relative to the chosen step size, which is not guaranteed by the stated assumption alone and is load-bearing for the 'provably operate at EoS right from the start' claim.

Authors: We appreciate the referee identifying this point. The adaptive methods select the step size at each iterate using only local first-order information (gradient and a local sharpness estimate) at the current point, so the local Lipschitz/Hölder assumption is applied pointwise. The one-step analysis in Theorem 2 then shows that the chosen step size satisfies step-size × sharpness ≥ 2 by construction. Nevertheless, to eliminate any ambiguity about the evolution of the local constant, we will add a short clarifying lemma in §3.2 that explicitly bounds the change in the local constant over a single step under the adaptive rule. revision: yes
Referee: [§5.1] §5.1, the self-stabilization construction: the claim that GD can be steered into slightly sharper valleys without instability rests on the same local-continuity premise; an explicit bound relating the local constant to the step-size adaptation is needed to make the 'turn unsuccessful training runs into successful ones' result rigorous.

Authors: We agree that an explicit bound would make the self-stabilization argument fully rigorous. In the revised version we will insert, in §5.1, a bound that relates the local continuity constant to the adaptive step-size choice, thereby justifying that the method can enter slightly sharper regions without instability and convert unsuccessful runs into successful ones. revision: yes

Circularity Check

0 steps flagged

No circularity: definition and claims are independent of fitted inputs or self-referential reductions

full rationale

The paper defines 'large' step sizes via local Lipschitz/Hölder continuity of the gradient—an external modeling choice stated upfront—and then constructs adaptive first-order methods whose properties (non-monotonic loss decrease, stepsize × sharpness ≥ 2 from iteration 1) are derived as consequences rather than presupposed. No equation or claim reduces a prediction to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and the self-stabilization argument is presented as an exploitation of the local-continuity setting rather than an input. The derivation chain therefore remains self-contained against external benchmarks and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the modeling assumption that the gradient is locally Lipschitz or Hölder continuous; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption The gradient of the objective is locally Lipschitz continuous or Hölder continuous.
Invoked to replace global L-smoothness and to define large step sizes that still guarantee convergence properties.

pith-pipeline@v0.9.1-grok · 5758 in / 1219 out tokens · 23761 ms · 2026-06-28T02:10:35.109810+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references

[1]

Notice that the compactness of the L0 level set is implied by the coercivity of the function f, i.e., lim ||w||→∞ f(w) =∞

As∇fis continuous, it is uniformly continuous on the compact set, hence lim i→∞ ∥∇f(z ki)− ∇f(w ki)∥= 0, which by (27) gives the convergence to zero of the subsequence of the gradient. Notice that the compactness of the L0 level set is implied by the coercivity of the function f, i.e., lim ||w||→∞ f(w) =∞ . This property holds for MSE loss functions appli...

2025
[2]

23 Flatland: The Adventures of Gradient Descent with Large Step Sizes

The mean squared error (MSE)-loss:l:R n →R,l(w) = 1 Km Pm i=1 ∥n(w, xi)−y i∥2 2. 23 Flatland: The Adventures of Gradient Descent with Large Step Sizes
[3]

Lemma B.23.Let (x,ˆy)be a data point, f the MSE-loss for the output of the network nx :R n →R K with identity afterwards

The cross-entropy loss:l:R n →R,l(w) = 1 m Pm i=1 PC k=1 −yi,k log(n(w, xi)k). Lemma B.23.Let (x,ˆy)be a data point, f the MSE-loss for the output of the network nx :R n →R K with identity afterwards. Assume that the last layer is linear, i.e. nx(w) =W x ′(w′) +b , where x′(w′) is the output of the previous layer and denote byw ′ :=w\{W, b}all parameters ...
[4]

the weights of the last layer has the form ∇2 W f(w) = 2 K IK ⊗x ′x′T ,∇ 2 bf(w) = 2 K IK,∇ 2 W,bf(w) = 2 K IK ⊗x ′

The subhessian of w.r.t. the weights of the last layer has the form ∇2 W f(w) = 2 K IK ⊗x ′x′T ,∇ 2 bf(w) = 2 K IK,∇ 2 W,bf(w) = 2 K IK ⊗x ′
[5]

The trace of the subhessian is given by trace(∇ 2 (W,b)f(w)) = 2(∥x ′∥2 + 1)
[6]

The biggest eigenvalue of the hessian is at least 2 K , i.e.λ 1(∇2 wf(w))≥ 2 K . Proof.We start with some general derivations for our loss functional: f(w) = 1 K ∥nx(w)−ˆy∥2 2,∇ wf(w) = 2 K KX i=1 (nx,i(w)−ˆyi)∇nx,i(w) ∇2f(w) = 2 K KX i=1 ∇wnx,i(w)∇wnx,i(w)T + (nx,i(w)−ˆyi)∇2 wnx,i(w)
[7]

Letw i denote the rows ofW, then ∇wj nx,i(w) =δ i,jx′,∇ bj nx,i(w) =δ i,j,∇ 2 W nx,i(w) =∇ 2 bnx,i(w) =∇ 2 W,bnx,i = 0, ⇒ ∇ 2 W f(w) = 2 K   x′x′T 0

As we are interested in the last layer ofn x, which is given byW x ′ +b. Letw i denote the rows ofW, then ∇wj nx,i(w) =δ i,jx′,∇ bj nx,i(w) =δ i,j,∇ 2 W nx,i(w) =∇ 2 bnx,i(w) =∇ 2 W,bnx,i = 0, ⇒ ∇ 2 W f(w) = 2 K   x′x′T 0. . .0 0x ′x′T . . .0 . . . . . . . . . . . . 0 0. . . x ′x′T   ,∇ 2 bf(w) = 2 K KX i=1 eieT i ,∇ W,bf(w) = 2 K   x′ 0. . .0...
[8]

With this at hand, it is straightforward to calculate the trace trace(∇2 (W,b)f(w)) = 2 K KX i=1 ∥∇(W,b)nx,i(w)∥2 = 2 K KX i=1 ∥x∥2 + 1 = 2(∥x∥2 + 1)
[9]

Which we use with the sub-block∇ 2 bf(w)which has eigenvalues 2 K , as calculated previously

This can be seen by Cauchy’s interlacing theorem (Horn & Johnson, 2012), which implies for a symmetric matrix A= B C C T D, , λ 1(A)≥λ 1(D). Which we use with the sub-block∇ 2 bf(w)which has eigenvalues 2 K , as calculated previously. As the previous Lemma was just for a single data point by the linearity of the loss function we can easily derive a result...

2012
[10]

Therefore, again trace ∇2 (W,b)f(w) = 2 1 m Pm i=1 ∥x′ i∥2 2 + 1 and the biggest eigenvalue of the Hessian w.r.t

The subhessian is given by ∇2 W f(w) = 2 K IK ⊗ 1 m mX i=1 x′ ix′T i ,∇ 2 bf(w) = 2 K IK,∇ W,bf(w) = 2 K IK ⊗ 1 m mX i=1 x′. Therefore, again trace ∇2 (W,b)f(w) = 2 1 m Pm i=1 ∥x′ i∥2 2 + 1 and the biggest eigenvalue of the Hessian w.r.t. the whole data set is lower bounded by 2 K . 24 Flatland: The Adventures of Gradient Descent with Large Step Sizes Pro...
[11]

the weights of the last layer has the form ∇2 W f(w) = diag(p)−pp T ⊗x ′x′T ,∇ 2 bf(w) = diag(p)−pp T ,∇ 2 W,bf(w) = diag(p)−pp T ⊗x ′

The subhessian of w.r.t. the weights of the last layer has the form ∇2 W f(w) = diag(p)−pp T ⊗x ′x′T ,∇ 2 bf(w) = diag(p)−pp T ,∇ 2 W,bf(w) = diag(p)−pp T ⊗x ′
[12]

instantaneous

The trace of the subhessian is given by trace(∇ 2 (W,b)f(w)) = 1− ∥p∥ 2 2 (1 +∥x ′∥2 2). Proof. As ˆyis a one-hot encoding ˆy=e k for one k∈ {1, . . . , K} , the loss function of the network and its gradient can thus be written as f(w) =−n x(w)k + log   KX j=1 en(w)j   , ∇wf(w) =−∇ wnx,k(w) + KX j=1 pj∇wnx,j(w), ∇2 wf(w) =−∇ 2 wnx,k(w) + KX j=1 pl ∇wn...

2009

[1] [1]

Notice that the compactness of the L0 level set is implied by the coercivity of the function f, i.e., lim ||w||→∞ f(w) =∞

As∇fis continuous, it is uniformly continuous on the compact set, hence lim i→∞ ∥∇f(z ki)− ∇f(w ki)∥= 0, which by (27) gives the convergence to zero of the subsequence of the gradient. Notice that the compactness of the L0 level set is implied by the coercivity of the function f, i.e., lim ||w||→∞ f(w) =∞ . This property holds for MSE loss functions appli...

2025

[2] [2]

23 Flatland: The Adventures of Gradient Descent with Large Step Sizes

The mean squared error (MSE)-loss:l:R n →R,l(w) = 1 Km Pm i=1 ∥n(w, xi)−y i∥2 2. 23 Flatland: The Adventures of Gradient Descent with Large Step Sizes

[3] [3]

Lemma B.23.Let (x,ˆy)be a data point, f the MSE-loss for the output of the network nx :R n →R K with identity afterwards

The cross-entropy loss:l:R n →R,l(w) = 1 m Pm i=1 PC k=1 −yi,k log(n(w, xi)k). Lemma B.23.Let (x,ˆy)be a data point, f the MSE-loss for the output of the network nx :R n →R K with identity afterwards. Assume that the last layer is linear, i.e. nx(w) =W x ′(w′) +b , where x′(w′) is the output of the previous layer and denote byw ′ :=w\{W, b}all parameters ...

[4] [4]

the weights of the last layer has the form ∇2 W f(w) = 2 K IK ⊗x ′x′T ,∇ 2 bf(w) = 2 K IK,∇ 2 W,bf(w) = 2 K IK ⊗x ′

The subhessian of w.r.t. the weights of the last layer has the form ∇2 W f(w) = 2 K IK ⊗x ′x′T ,∇ 2 bf(w) = 2 K IK,∇ 2 W,bf(w) = 2 K IK ⊗x ′

[5] [5]

The trace of the subhessian is given by trace(∇ 2 (W,b)f(w)) = 2(∥x ′∥2 + 1)

[6] [6]

The biggest eigenvalue of the hessian is at least 2 K , i.e.λ 1(∇2 wf(w))≥ 2 K . Proof.We start with some general derivations for our loss functional: f(w) = 1 K ∥nx(w)−ˆy∥2 2,∇ wf(w) = 2 K KX i=1 (nx,i(w)−ˆyi)∇nx,i(w) ∇2f(w) = 2 K KX i=1 ∇wnx,i(w)∇wnx,i(w)T + (nx,i(w)−ˆyi)∇2 wnx,i(w)

[7] [7]

Letw i denote the rows ofW, then ∇wj nx,i(w) =δ i,jx′,∇ bj nx,i(w) =δ i,j,∇ 2 W nx,i(w) =∇ 2 bnx,i(w) =∇ 2 W,bnx,i = 0, ⇒ ∇ 2 W f(w) = 2 K   x′x′T 0

As we are interested in the last layer ofn x, which is given byW x ′ +b. Letw i denote the rows ofW, then ∇wj nx,i(w) =δ i,jx′,∇ bj nx,i(w) =δ i,j,∇ 2 W nx,i(w) =∇ 2 bnx,i(w) =∇ 2 W,bnx,i = 0, ⇒ ∇ 2 W f(w) = 2 K   x′x′T 0. . .0 0x ′x′T . . .0 . . . . . . . . . . . . 0 0. . . x ′x′T   ,∇ 2 bf(w) = 2 K KX i=1 eieT i ,∇ W,bf(w) = 2 K   x′ 0. . .0...

[8] [8]

With this at hand, it is straightforward to calculate the trace trace(∇2 (W,b)f(w)) = 2 K KX i=1 ∥∇(W,b)nx,i(w)∥2 = 2 K KX i=1 ∥x∥2 + 1 = 2(∥x∥2 + 1)

[9] [9]

Which we use with the sub-block∇ 2 bf(w)which has eigenvalues 2 K , as calculated previously

This can be seen by Cauchy’s interlacing theorem (Horn & Johnson, 2012), which implies for a symmetric matrix A= B C C T D, , λ 1(A)≥λ 1(D). Which we use with the sub-block∇ 2 bf(w)which has eigenvalues 2 K , as calculated previously. As the previous Lemma was just for a single data point by the linearity of the loss function we can easily derive a result...

2012

[10] [10]

Therefore, again trace ∇2 (W,b)f(w) = 2 1 m Pm i=1 ∥x′ i∥2 2 + 1 and the biggest eigenvalue of the Hessian w.r.t

The subhessian is given by ∇2 W f(w) = 2 K IK ⊗ 1 m mX i=1 x′ ix′T i ,∇ 2 bf(w) = 2 K IK,∇ W,bf(w) = 2 K IK ⊗ 1 m mX i=1 x′. Therefore, again trace ∇2 (W,b)f(w) = 2 1 m Pm i=1 ∥x′ i∥2 2 + 1 and the biggest eigenvalue of the Hessian w.r.t. the whole data set is lower bounded by 2 K . 24 Flatland: The Adventures of Gradient Descent with Large Step Sizes Pro...

[11] [11]

the weights of the last layer has the form ∇2 W f(w) = diag(p)−pp T ⊗x ′x′T ,∇ 2 bf(w) = diag(p)−pp T ,∇ 2 W,bf(w) = diag(p)−pp T ⊗x ′

The subhessian of w.r.t. the weights of the last layer has the form ∇2 W f(w) = diag(p)−pp T ⊗x ′x′T ,∇ 2 bf(w) = diag(p)−pp T ,∇ 2 W,bf(w) = diag(p)−pp T ⊗x ′

[12] [12]

instantaneous

The trace of the subhessian is given by trace(∇ 2 (W,b)f(w)) = 1− ∥p∥ 2 2 (1 +∥x ′∥2 2). Proof. As ˆyis a one-hot encoding ˆy=e k for one k∈ {1, . . . , K} , the loss function of the network and its gradient can thus be written as f(w) =−n x(w)k + log   KX j=1 en(w)j   , ∇wf(w) =−∇ wnx,k(w) + KX j=1 pj∇wnx,j(w), ∇2 wf(w) =−∇ 2 wnx,k(w) + KX j=1 pl ∇wn...

2009