Edge Flow: A Tractable and Predictive Continuous-Time Model for Gradient Descent at the Edge of Stability

Pierre Marion

arxiv: 2606.18080 · v1 · pith:YATUSZKTnew · submitted 2026-06-16 · 💻 cs.LG

Edge Flow: A Tractable and Predictive Continuous-Time Model for Gradient Descent at the Edge of Stability

Pierre Marion This is my paper

Pith reviewed 2026-06-27 01:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords edge of stabilitygradient descentcontinuous-time modelHessiansharpnessordinary differential equationsoptimization dynamicsdeep learning

0 comments

The pith

Gradient descent at the edge of stability follows three coupled ODEs whose feedback loop automatically stabilizes sharpness near 2/η.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Gradient descent in deep networks frequently operates at the edge of stability, where the largest Hessian eigenvalue hovers near twice the inverse learning rate. Standard tools such as gradient flow cease to apply in this regime. Edge Flow supplies a system of three coupled ordinary differential equations that decomposes the discrete trajectory into a center trajectory, an oscillation direction, and an oscillation magnitude. The center obeys a modified gradient flow on a symmetrized loss, the direction follows Rayleigh-quotient dynamics, and the magnitude grows or decays exponentially according to whether sharpness lies above or below the threshold. The three components interact through a self-stabilization loop that keeps sharpness near the stability boundary without external tuning.

Core claim

Edge Flow is a system of three coupled ordinary differential equations that models gradient descent at the edge of stability. The center follows a modified gradient flow on a symmetrized loss. The direction evolves by Rayleigh quotient dynamics that track a top eigenvector of the Hessian. The magnitude grows or decays exponentially depending on whether sharpness exceeds or falls below 2/η. Sharpness stabilization emerges automatically from the coupled dynamics via a self-stabilization feedback loop. The resulting continuous-time model admits a discretization that requires only two gradient evaluations and one Hessian-vector product per iteration.

What carries the argument

Edge Flow, the system of three coupled ODEs that decomposes dynamics into a center (modified gradient flow on symmetrized loss), a direction (Rayleigh quotient dynamics), and a magnitude (exponential growth or decay set by sharpness relative to 2/η).

If this is right

Discretizing Edge Flow requires only two gradient evaluations and one Hessian-vector product at each iteration.
The model tracks gradient descent trajectories at least as faithfully as earlier continuous-time EoS models while also resolving the oscillation of sharpness at the onset of the regime.
The coupled dynamics supply a principled framework for understanding when instabilities arise and for designing interventions that mitigate them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be tested on optimization algorithms other than plain gradient descent to check whether self-stabilization is a broader discrete phenomenon.
The explicit ODE system offers a low-cost simulation testbed for exploring learning-rate schedules or regularizers before full-scale training runs.
If the self-stabilization loop proves robust, it may suggest new ways to choose step sizes that deliberately ride the edge rather than avoid it.

Load-bearing premise

The dynamics of gradient descent at EoS can be faithfully decomposed into a center obeying modified gradient flow on a symmetrized loss, a direction evolving by Rayleigh quotient dynamics, and an oscillation magnitude that grows or decays exponentially according to whether sharpness exceeds 2/η, with the three components coupled so that self-stabilization arises automatically.

What would settle it

Integrate Edge Flow alongside actual gradient descent on a network known to enter the edge-of-stability regime and observe that the predicted parameter trajectories or sharpness time series diverge substantially from the recorded run.

Figures

Figures reproduced from arXiv: 2606.18080 by Pierre Marion.

**Figure 1.** Figure 1: Edge Flow models gradient descent at the edge of stability (EoS). Zoom on the onset of EoS for a Vision Transformer trained with mean-squared error loss on a subset of CIFAR-10 (η = 0.02). Left: training loss of gradient descent (blue) and of the Edge Flow (red), with the Edge Flow prediction of the GD loss (dashed red). Right: top three eigenvalues of the effective Hessian (i.e. eigenvalues multiplied by … view at source ↗

**Figure 2.** Figure 2: Two-timescale structure of gradient descent at EoS, and the Edge Flow notation. The GD iterates (black) oscillate rapidly back and forth, almost orthogonally to the slow drift of their center w¯t (grey): two consecutive iterates straddle w¯t and are displaced along a common direction. The decomposition wt ≈ w¯t + (−1)txtut captures this with three slowly varying quantities: the center w¯t; its unit directi… view at source ↗

**Figure 3.** Figure 3: Gradient descent, Edge Flow, and Central Flow for a CNN with MSE loss on a 1000-example, 4-class subset of CIFAR-10 (η = 0.02). Throughout, gradient descent is shown in blue, Edge Flow in red, and Central Flow (Cohen et al., 2025) in black; the vertical dashed line marks the onset of the edge of stability. Top left: training loss, with each flow’s prediction of the GD loss (dashed, see main text for detail… view at source ↗

**Figure 4.** Figure 4: Onset of the edge of stability for the CNN with MSE loss (zoom of [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Effect of the base oscillation level ε and of the discretization K on the EoS onset, for a 3-layer MLP trained with MSE loss on a 400-example, 4-class subset of CIFAR-10 (η = 0.02). Rows: a coarse (K = 1) and a fine (K = 8) discretization of Edge Gradient Descent (ρ = η/K). Columns: training loss, normalized sharpness ηλmax, and—for K = 8—the oscillation magnitude x. As ε increases, the sharpness overshoot… view at source ↗

**Figure 6.** Figure 6: Sharpness stabilization by the three flows (3-layer MLP, MSE loss on a 400- example, 4-class subset of CIFAR-10, η = 0.017). Normalized sharpness ηλmax at the center of each process: gradient descent (true dynamics), Central Flow, Edge Flow, and Rod Flow; the dotted line is the edge ηλ = 2. Central Flow pins the sharpness at 2 and Edge Flow tracks the gradient-descent oscillation around it, whereas Rod Flo… view at source ↗

**Figure 7.** Figure 7: CNN with MSE loss (η = 0.02): full set of metrics (top) and zoom on the onset of the edge of stability (bottom). The inset on the train losses panels shows the small gap between the Edge Flow and Central Flow centers. Some of these figures are also in the main text. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: CNN with cross-entropy loss (η = 0.01). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 19 [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: ResNet with MSE loss (η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 20 [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: ResNet with cross-entropy loss (η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 21 [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: ViT with MSE loss (η = 0.02): full set of metrics and zoom on the onset of the edge of stability (bottom). Some of these figures are also in the main text. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: ViT with cross-entropy loss (η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 23 [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

**Figure 13.** Figure 13: Effect of the learning rate on stability, for the MLP/MSE setup of [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗

read the original abstract

Gradient descent in deep learning may operate at the edge of stability (EoS), a regime in which the largest eigenvalue of the loss Hessian hovers near the stability threshold $2/\eta$, where $\eta$ is the learning rate. Classical analysis tools such as gradient flow and the descent lemma do not apply here, motivating the search for a continuous-time model valid at EoS. We propose Edge Flow, a system of three coupled ordinary differential equations that provides a tractable, faithful, and predictive model of gradient descent dynamics at EoS. Edge Flow decomposes the dynamics into a center, an oscillation direction, and an oscillation magnitude. The center follows a modified gradient flow on a symmetrized loss; the direction tracks a top eigenvector of the Hessian via Rayleigh quotient dynamics; and the magnitude grows or decays exponentially depending on whether the sharpness exceeds or falls below the threshold $2/\eta$. Crucially, sharpness stabilization emerges from the coupled dynamics via a self-stabilization feedback loop. Discretizing Edge Flow only requires two gradient evaluations and one Hessian--vector product at each iteration. We demonstrate empirically that Edge Flow tracks the dynamics of gradient descent at least as faithfully as previously proposed continuous-time EoS models, while in addition resolving the oscillation of the sharpness at the onset of EoS, and that it provides a principled framework for understanding and mitigating instabilities in this regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Edge Flow gives a concrete three-ODE decomposition for EoS dynamics with cheap discretization, but the abstract leaves the derivation and quantitative match to real runs unspecified.

read the letter

The paper's main contribution is a new continuous-time model called Edge Flow built from three coupled ODEs: a center that follows modified gradient flow on a symmetrized loss, a direction that evolves by Rayleigh-quotient dynamics, and a scalar magnitude that grows or decays exponentially according to whether sharpness sits above or below 2/η. The claim is that the coupling produces automatic sharpness stabilization without extra terms. Discretization costs two gradient evaluations and one Hessian-vector product per step.

That decomposition is new relative to earlier EoS models I have seen. The explicit, low-cost discretization is a practical plus and could be checked directly on existing training runs. The abstract also says the model resolves the initial oscillation in sharpness that prior continuous approximations miss.

The main gap is that the abstract supplies no derivation steps, no error bounds, and no numerical tables comparing trajectories or sharpness traces to actual gradient descent. Without those, it is impossible to tell whether the self-stabilization is an emergent consequence of the coupling or whether the functional forms were chosen to produce it. The reader's circularity concern therefore stands until the full equations and experiments are examined.

The work is aimed at researchers who already follow the EoS literature and want a tractable ODE they can analyze or discretize. It is not yet ready for broad citation because the empirical faithfulness claim is not quantified here. A serious editor should send it to referees so the derivations and the reported matches can be checked; the proposal is specific enough to be falsifiable and the discretization cost is low enough that independent tests are feasible.

Referee Report

2 major / 1 minor

Summary. The paper proposes Edge Flow, a system of three coupled ODEs modeling gradient descent at the edge of stability (EoS). It decomposes the dynamics into a center following modified gradient flow on a symmetrized loss, a direction evolving via Rayleigh quotient dynamics to track the top Hessian eigenvector, and an oscillation magnitude that grows or decays exponentially according to whether sharpness exceeds or falls below 2/η. The central claim is that sharpness stabilization emerges automatically from the coupling via a self-stabilization feedback loop. Discretization requires two gradient evaluations and one Hessian-vector product per iteration. Empirical results are said to show that Edge Flow tracks GD at least as faithfully as prior continuous-time EoS models while additionally resolving sharpness oscillations at EoS onset and providing a framework for instabilities.

Significance. If the derivations establish that stabilization is a derived consequence of the coupling (rather than presupposed by the functional forms) and if the empirical comparisons include rigorous quantitative metrics with error analysis, the model could supply a tractable continuous-time tool for analyzing a regime where classical gradient flow and descent lemma do not apply. The explicit, low-cost discretization (two gradients + one HVP) is a concrete strength that could enable practical use. The work addresses a practically relevant phenomenon in deep learning training.

major comments (2)

[Abstract] Abstract: the claim that 'sharpness stabilization emerges from the coupled dynamics via a self-stabilization feedback loop' is load-bearing for the contribution, yet the abstract supplies no derivation steps, explicit equations for the three ODEs, or analysis showing that the stabilization is a consequence of the coupling rather than built into the chosen functional forms for center, direction, and magnitude. This must be shown explicitly (e.g., via the ODE definitions and any subsequent analysis section) to substantiate the central claim.
[Abstract] Abstract: the assertions that the model is 'faithful and predictive' and 'tracks the dynamics of gradient descent at least as faithfully as previously proposed continuous-time EoS models' rest on unspecified experiments with no reported quantitative metrics, error bounds, or comparison tables. Without these, the faithfulness claim cannot be evaluated and is central to the paper's empirical contribution.

minor comments (1)

[Abstract] The discretization cost is stated clearly, but the abstract would benefit from a brief parenthetical note on how the two gradients and one HVP arise from the ODE discretization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to strengthen the presentation of our central claims. We address each major comment below with references to the full manuscript and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'sharpness stabilization emerges from the coupled dynamics via a self-stabilization feedback loop' is load-bearing for the contribution, yet the abstract supplies no derivation steps, explicit equations for the three ODEs, or analysis showing that the stabilization is a consequence of the coupling rather than built into the chosen functional forms for center, direction, and magnitude. This must be shown explicitly (e.g., via the ODE definitions and any subsequent analysis section) to substantiate the central claim.

Authors: The full manuscript defines the three ODEs explicitly in Section 3 (center as modified gradient flow on the symmetrized loss, direction via Rayleigh quotient dynamics, and magnitude via exponential growth/decay conditioned on sharpness relative to 2/η). Section 4 then analyzes the coupled system and derives that the self-stabilization feedback loop arises from the interaction: the magnitude modulates the effective sharpness experienced by the center, which in turn influences the direction and feeds back to bound the magnitude, rather than the threshold behavior being independently imposed. We agree the abstract does not sufficiently signal this structure or the derivation and will revise it to include a concise reference to the ODE components and the emergence of stabilization from their coupling. revision: yes
Referee: [Abstract] Abstract: the assertions that the model is 'faithful and predictive' and 'tracks the dynamics of gradient descent at least as faithfully as previously proposed continuous-time EoS models' rest on unspecified experiments with no reported quantitative metrics, error bounds, or comparison tables. Without these, the faithfulness claim cannot be evaluated and is central to the paper's empirical contribution.

Authors: Section 5 presents the empirical evaluation, including direct trajectory comparisons against GD and prior EoS models (e.g., via integrated squared error on parameter paths and sharpness time series), with reported error norms, oscillation amplitude statistics, and qualitative resolution of sharpness oscillations at EoS onset. While these metrics appear in the main text and appendix, the abstract does not reference them. We will revise the abstract to briefly note the use of quantitative trajectory and sharpness metrics in the comparisons, and we will ensure a summary comparison table is clearly highlighted in the main body. revision: partial

Circularity Check

0 steps flagged

No significant circularity; model construction is self-contained

full rationale

The provided abstract and reader summary describe a proposed three-component ODE system (center on symmetrized loss, Rayleigh-quotient direction, exponential magnitude) whose coupling is asserted to produce self-stabilization of sharpness. No equations, self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are visible in the text. The discretization cost is stated explicitly and parameter-free. Because no load-bearing step reduces by construction to its own inputs or to a self-citation chain, the derivation chain does not exhibit circularity. The central claim remains an independent modeling proposal whose faithfulness is to be checked against external GD trajectories rather than by internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The model rests on an unstated derivation that the three-component decomposition is sufficient and on the domain assumption that EoS dynamics admit this separation; no free parameters are explicitly named in the abstract, but the functional forms of the magnitude equation and the symmetrized loss are introduced by the paper.

axioms (2)

ad hoc to paper Gradient descent at EoS can be decomposed into center, oscillation direction, and oscillation magnitude components whose coupling produces self-stabilization.
Core modeling premise stated in the abstract as the basis for the three ODEs.
ad hoc to paper The center follows a modified gradient flow on a symmetrized loss.
Modeling choice for the center component.

invented entities (1)

Edge Flow (three coupled ODEs with center, direction, and magnitude) no independent evidence
purpose: To provide a tractable continuous-time description of GD at EoS
New postulated dynamical system whose validity is asserted but not derived in the abstract.

pith-pipeline@v0.9.1-grok · 5775 in / 1672 out tokens · 52863 ms · 2026-06-27T01:14:36.738087+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 3 linked inside Pith

[1]

Lewkowycz, Y

A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism.arXiv preprint arXiv:2003.02218,

arXiv 2003
[2]

Marion and Y.-H

P. Marion and Y.-H. Wu. Understanding diffusion models requires rethinking (again) generaliza- tion.arXiv preprint arXiv:2605.06077,

Pith/arXiv arXiv
[3]

Accessed: 2026-06-15. R. Mulayoff and T. Michaeli. Exact mean square linear stability analysis for SGD. InConference on Learning Theory (COLT), volume 247, pages 3915–3969,

2026
[4]

Mulayoff and S

R. Mulayoff and S. U. Stich. On the stability of nonlinear dynamics in GD and SGD: Beyond quadratic potentials.arXiv preprint arXiv:2602.14789,

Pith/arXiv arXiv
[5]

Regis and S

12 E. Regis and S. Chewi. Rod flow: A continuous-time model for gradient descent at the edge of stability.arXiv preprint arXiv:2602.01480,

arXiv
[6]

Y.-H. Wu, P. Marion, G. Biau, and C. Boyer. Taking a big step: Large learning rates in denoising score matching prevent memorization. InConference on Learning Theory, volume 291, pages 5718–5756, 2025b. C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio. A walk with SGD.arXiv preprint arXiv:1802.08770,

Pith/arXiv arXiv
[7]

and Rod Flow (Regis and Chewi, 2026). These two models as well as Edge Flow share the same starting point: at EoS, the GD trajectory splits into a slowly moving center and a fast oscillation, and one seeks autonomous dynamics for the slow variables. They differ in two essential choices. The first is how the oscillation isrepresented: a constraint on the s...

2026
[8]

Running Gradient Flow yields the same dynamics as Rod Flow

Central Flow pins the sharpness at2and Edge Flow tracks the gradient-descent oscillation around it, whereas Rod Flow does not stabilize the sharpness and drifts upward. Running Gradient Flow yields the same dynamics as Rod Flow. doing a Taylor expansion of the gradients, we obtain that the magnitude of the oscillations evolves according to d(x2 t ) dt ≈u ...

2026
[9]

The two settings share most of their setup, which we describe first

and for the rod-flow comparison (Figure 6). The two settings share most of their setup, which we describe first. Common setup.All experiments are implemented inPyTorch, building on the publicly available Central Flows codebase of Cohen et al. (2025) (https://github.com/locuslab/ central_flows), which we extend to implement Edge Gradient Descent (see pseud...

2025
[10]

Test metrics are evaluated on the CIFAR-10 test images of the same four classes, and the network-output panels display the model’s four output logits on a single fixed test example

combine three architectures with two losses—mean-squared error (MSE) regressed onto one-hot labels and softmax cross-entropy (CE)—and use n = 1000training examples (250per class), with4output units. Test metrics are evaluated on the CIFAR-10 test images of the same four classes, and the network-output panels display the model’s four output logits on a sin...

2025
[11]

Refining the discretization of the center thus seems to extend the range of stable learning rates, as discussed in Section 4.3

Pushing further, K = 4also eventually diverges, around η = 0.5. Refining the discretization of the center thus seems to extend the range of stable learning rates, as discussed in Section 4.3. 17 0 250 500 750 1000 1250 1500 1750 2000 step 0.1 0.2 0.3 0.4 0.5 train loss gradient descent edge flow central flow edge flow prediction central flow prediction 0 ...

2000
[12]

central flow edge flow gradient descent 0 250 500 750 1000 1250 1500 1750 2000 step 10 4 10 3 10 2 distance between processes edge GD central GD central edge 0 250 500 750 1000 1250 1500 1750 2000 step 0 1 2 3 4 5 squared gradient norm gradient descent edge flow central flow 600 800 1000 1200 1400 1600 1800 step 0.0000 0.0001 0.0002 0.0003 0.0004 0.0005 0...

2000
[13]

The inset on the train losses panels shows the small gap between the Edge Flow and Central Flow centers

central flow edge flow gradient descent 600 800 1000 5 0 5 1e 4 edge central CNN · MSE loss ( = 0.02) onset of the edge of stability Figure 7:CNN with MSE loss(η = 0.02): full set of metrics (top) and zoom on the onset of the edge of stability (bottom). The inset on the train losses panels shows the small gap between the Edge Flow and Central Flow centers...

2000
[14]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 step 10 5 10 4 10 3 10 2 10 1 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 step 0 10 20 30 40 50 60 70 squared gradient norm gradient descent edge flow 1000 1250 1500 1750 2000 2250 2500 2750 step 0.0000 0.0002 0.0004 0.0006 0.0008 0.0010 0.0012 0.0014 oscillated variance gradi...

2000
[15]

Full set of metrics (top) and zoom on the onset of the edge of stability (bottom)

edge flow gradient descent CNN · cross-entropy loss ( = 0.01) onset of the edge of stability Figure 8:CNN with cross-entropy loss(η = 0.01). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 19 0 500 1000 1500 2000 2500 3000 3500 step 0.1 0.2 0.3 0.4 0.5 train loss gradient descent edge flow edge flow prediction 0 500 1000...

2000
[16]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 3500 step 10 2 10 1 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 3500 step 0.0 0.5 1.0 1.5 2.0 2.5 3.0 squared gradient norm gradient descent edge flow 0 500 1000 1500 2000 2500 3000 3500 step 0.00000 0.00005 0.00010 0.00015 0.00020 0.00025 0.00030 0.00035 oscillated variance g...

2000
[17]

Full set of metrics (top) and zoom on the onset of the edge of stability (bottom)

edge flow gradient descent ResNet · MSE loss ( = 0.02) onset of the edge of stability Figure 9:ResNet with MSE loss(η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 20 0 500 1000 1500 2000 2500 3000 3500 step 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 train loss gradient descent edge flow edge flow prediction 0 500 1000 1...

2000
[18]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 3500 step 10 3 10 2 10 1 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 3500 step 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 squared gradient norm gradient descent edge flow 0 500 1000 1500 2000 2500 3000 step 0.0000 0.0002 0.0004 0.0006 0.0008 0.0010 0.0012 oscillated variance gradient...

2000
[19]

Full set of metrics (top) and zoom on the onset of the edge of stability (bottom)

edge flow gradient descent ResNet · cross-entropy loss ( = 0.02) onset of the edge of stability Figure 10:ResNet with cross-entropy loss(η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 21 0 500 1000 1500 2000 2500 3000 step 0.20 0.25 0.30 0.35 0.40 0.45 0.50 train loss gradient descent edge flow edge flow pred...

2000
[20]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 step 10 2 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 step 0 1 2 3 4 5 6 squared gradient norm gradient descent edge flow 1000 1500 2000 2500 3000 step 0.0000 0.0001 0.0002 0.0003 0.0004 0.0005 oscillated variance gradient descent edge flow 0 500 1000 1500 2000 2500 3000 step ...

2000
[21]

Some of these figures are also in the main text

edge flow gradient descent ViT · MSE loss ( = 0.02) onset of the edge of stability Figure 11:ViT with MSE loss(η = 0.02): full set of metrics and zoom on the onset of the edge of stability (bottom). Some of these figures are also in the main text. 22 0 1000 2000 3000 4000 step 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 train loss gradient descent edge flow edge flow...

2000
[22]

edge flow gradient descent 0 1000 2000 3000 4000 step 10 3 10 2 10 1 100 distance between processes edge GD 0 1000 2000 3000 4000 step 0 10 20 30 40 50 60 squared gradient norm gradient descent edge flow 500 1000 1500 2000 2500 3000 3500 4000 4500 step 0.000 0.001 0.002 0.003 0.004 0.005 oscillated variance gradient descent edge flow 0 1000 2000 3000 4000...

2000

[1] [1]

Lewkowycz, Y

A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism.arXiv preprint arXiv:2003.02218,

arXiv 2003

[2] [2]

Marion and Y.-H

P. Marion and Y.-H. Wu. Understanding diffusion models requires rethinking (again) generaliza- tion.arXiv preprint arXiv:2605.06077,

Pith/arXiv arXiv

[3] [3]

Accessed: 2026-06-15. R. Mulayoff and T. Michaeli. Exact mean square linear stability analysis for SGD. InConference on Learning Theory (COLT), volume 247, pages 3915–3969,

2026

[4] [4]

Mulayoff and S

R. Mulayoff and S. U. Stich. On the stability of nonlinear dynamics in GD and SGD: Beyond quadratic potentials.arXiv preprint arXiv:2602.14789,

Pith/arXiv arXiv

[5] [5]

Regis and S

12 E. Regis and S. Chewi. Rod flow: A continuous-time model for gradient descent at the edge of stability.arXiv preprint arXiv:2602.01480,

arXiv

[6] [6]

Y.-H. Wu, P. Marion, G. Biau, and C. Boyer. Taking a big step: Large learning rates in denoising score matching prevent memorization. InConference on Learning Theory, volume 291, pages 5718–5756, 2025b. C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio. A walk with SGD.arXiv preprint arXiv:1802.08770,

Pith/arXiv arXiv

[7] [7]

and Rod Flow (Regis and Chewi, 2026). These two models as well as Edge Flow share the same starting point: at EoS, the GD trajectory splits into a slowly moving center and a fast oscillation, and one seeks autonomous dynamics for the slow variables. They differ in two essential choices. The first is how the oscillation isrepresented: a constraint on the s...

2026

[8] [8]

Running Gradient Flow yields the same dynamics as Rod Flow

Central Flow pins the sharpness at2and Edge Flow tracks the gradient-descent oscillation around it, whereas Rod Flow does not stabilize the sharpness and drifts upward. Running Gradient Flow yields the same dynamics as Rod Flow. doing a Taylor expansion of the gradients, we obtain that the magnitude of the oscillations evolves according to d(x2 t ) dt ≈u ...

2026

[9] [9]

The two settings share most of their setup, which we describe first

and for the rod-flow comparison (Figure 6). The two settings share most of their setup, which we describe first. Common setup.All experiments are implemented inPyTorch, building on the publicly available Central Flows codebase of Cohen et al. (2025) (https://github.com/locuslab/ central_flows), which we extend to implement Edge Gradient Descent (see pseud...

2025

[10] [10]

Test metrics are evaluated on the CIFAR-10 test images of the same four classes, and the network-output panels display the model’s four output logits on a single fixed test example

combine three architectures with two losses—mean-squared error (MSE) regressed onto one-hot labels and softmax cross-entropy (CE)—and use n = 1000training examples (250per class), with4output units. Test metrics are evaluated on the CIFAR-10 test images of the same four classes, and the network-output panels display the model’s four output logits on a sin...

2025

[11] [11]

Refining the discretization of the center thus seems to extend the range of stable learning rates, as discussed in Section 4.3

Pushing further, K = 4also eventually diverges, around η = 0.5. Refining the discretization of the center thus seems to extend the range of stable learning rates, as discussed in Section 4.3. 17 0 250 500 750 1000 1250 1500 1750 2000 step 0.1 0.2 0.3 0.4 0.5 train loss gradient descent edge flow central flow edge flow prediction central flow prediction 0 ...

2000

[12] [12]

central flow edge flow gradient descent 0 250 500 750 1000 1250 1500 1750 2000 step 10 4 10 3 10 2 distance between processes edge GD central GD central edge 0 250 500 750 1000 1250 1500 1750 2000 step 0 1 2 3 4 5 squared gradient norm gradient descent edge flow central flow 600 800 1000 1200 1400 1600 1800 step 0.0000 0.0001 0.0002 0.0003 0.0004 0.0005 0...

2000

[13] [13]

The inset on the train losses panels shows the small gap between the Edge Flow and Central Flow centers

central flow edge flow gradient descent 600 800 1000 5 0 5 1e 4 edge central CNN · MSE loss ( = 0.02) onset of the edge of stability Figure 7:CNN with MSE loss(η = 0.02): full set of metrics (top) and zoom on the onset of the edge of stability (bottom). The inset on the train losses panels shows the small gap between the Edge Flow and Central Flow centers...

2000

[14] [14]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 step 10 5 10 4 10 3 10 2 10 1 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 step 0 10 20 30 40 50 60 70 squared gradient norm gradient descent edge flow 1000 1250 1500 1750 2000 2250 2500 2750 step 0.0000 0.0002 0.0004 0.0006 0.0008 0.0010 0.0012 0.0014 oscillated variance gradi...

2000

[15] [15]

Full set of metrics (top) and zoom on the onset of the edge of stability (bottom)

edge flow gradient descent CNN · cross-entropy loss ( = 0.01) onset of the edge of stability Figure 8:CNN with cross-entropy loss(η = 0.01). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 19 0 500 1000 1500 2000 2500 3000 3500 step 0.1 0.2 0.3 0.4 0.5 train loss gradient descent edge flow edge flow prediction 0 500 1000...

2000

[16] [16]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 3500 step 10 2 10 1 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 3500 step 0.0 0.5 1.0 1.5 2.0 2.5 3.0 squared gradient norm gradient descent edge flow 0 500 1000 1500 2000 2500 3000 3500 step 0.00000 0.00005 0.00010 0.00015 0.00020 0.00025 0.00030 0.00035 oscillated variance g...

2000

[17] [17]

Full set of metrics (top) and zoom on the onset of the edge of stability (bottom)

edge flow gradient descent ResNet · MSE loss ( = 0.02) onset of the edge of stability Figure 9:ResNet with MSE loss(η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 20 0 500 1000 1500 2000 2500 3000 3500 step 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 train loss gradient descent edge flow edge flow prediction 0 500 1000 1...

2000

[18] [18]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 3500 step 10 3 10 2 10 1 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 3500 step 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 squared gradient norm gradient descent edge flow 0 500 1000 1500 2000 2500 3000 step 0.0000 0.0002 0.0004 0.0006 0.0008 0.0010 0.0012 oscillated variance gradient...

2000

[19] [19]

Full set of metrics (top) and zoom on the onset of the edge of stability (bottom)

edge flow gradient descent ResNet · cross-entropy loss ( = 0.02) onset of the edge of stability Figure 10:ResNet with cross-entropy loss(η = 0.02). Full set of metrics (top) and zoom on the onset of the edge of stability (bottom). 21 0 500 1000 1500 2000 2500 3000 step 0.20 0.25 0.30 0.35 0.40 0.45 0.50 train loss gradient descent edge flow edge flow pred...

2000

[20] [20]

edge flow gradient descent 0 500 1000 1500 2000 2500 3000 step 10 2 distance between processes edge GD 0 500 1000 1500 2000 2500 3000 step 0 1 2 3 4 5 6 squared gradient norm gradient descent edge flow 1000 1500 2000 2500 3000 step 0.0000 0.0001 0.0002 0.0003 0.0004 0.0005 oscillated variance gradient descent edge flow 0 500 1000 1500 2000 2500 3000 step ...

2000

[21] [21]

Some of these figures are also in the main text

edge flow gradient descent ViT · MSE loss ( = 0.02) onset of the edge of stability Figure 11:ViT with MSE loss(η = 0.02): full set of metrics and zoom on the onset of the edge of stability (bottom). Some of these figures are also in the main text. 22 0 1000 2000 3000 4000 step 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 train loss gradient descent edge flow edge flow...

2000

[22] [22]

edge flow gradient descent 0 1000 2000 3000 4000 step 10 3 10 2 10 1 100 distance between processes edge GD 0 1000 2000 3000 4000 step 0 10 20 30 40 50 60 squared gradient norm gradient descent edge flow 500 1000 1500 2000 2500 3000 3500 4000 4500 step 0.000 0.001 0.002 0.003 0.004 0.005 oscillated variance gradient descent edge flow 0 1000 2000 3000 4000...

2000