arxiv: 2606.05219 · v1 · pith:6YP52TAJnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords deep linear networks · gradient descent · edge of stability · symmetry breaking · multi-pathway · sharpness · signal redistribution · re-balancing

The pith

Large-step gradient descent in multi-pathway deep linear networks redistributes signals across paths after initial symmetry breaking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Analyses based on gradient flow predict that deep linear networks with multiple pathways undergo winner-takes-all specialization, with each feature concentrating in a single pathway. This paper demonstrates that discrete gradient descent with large step sizes produces a different outcome. Single-path solutions form sharp minima, while spreading signals across pathways lowers sharpness by an amount that grows with both the number of pathways and network depth. Early training follows the expected symmetry breaking, but oscillations at the edge of stability then trigger a re-balancing phase in which signals redistribute. The result accounts for why large-step GD produces shared representations rather than persistent single-path dominance.

Core claim

In multi-pathway deep linear networks, single-path solutions are sharp minima while distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Early training reproduces the depth-driven symmetry breaking predicted by gradient flow, but oscillations at the edge of stability subsequently override this tendency and drive the network into a re-balancing phase where signals redistribute across pathways.

What carries the argument

The sharpness reduction obtained by spreading signals across multiple pathways, which scales with pathway count and depth, combined with the effect of edge-of-stability oscillations on the discrete GD trajectory.
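
To make the scaling concrete, here is a minimal sketch under a strong simplifying assumption that is ours and not taken from the paper: each pathway is a chain of `depth` scalar weights, all equal within a pathway (depth-balanced), and the loss is 0.5 * (sum of pathway products - target)^2. At a zero-loss point the Hessian is the outer product of the output gradient with itself, so its top eigenvalue is `depth * sum_h sigma_h ** (2 * (depth - 1) / depth)`, where `sigma_h` is the signal carried by pathway h. The function names are hypothetical, and the paper's theorem may carry different constants.

```python
def sharpness(sigmas, depth):
    """Top Hessian eigenvalue at a zero-loss point of the scalar toy model.

    sigmas[h] is the signal of pathway h; each of its `depth` layers is
    assumed to hold the same scalar weight sigmas[h] ** (1 / depth).
    """
    exponent = 2 * (depth - 1) / depth
    return depth * sum(s ** exponent for s in sigmas)


def sharpness_ratio(num_paths, depth, target=1.0):
    """Sharpness of the evenly shared minimum divided by the single-path one."""
    single = sharpness([target] + [0.0] * (num_paths - 1), depth)
    shared = sharpness([target / num_paths] * num_paths, depth)
    return shared / single


# Print the ratio for a few (depth, pathway count) pairs.
for depth in (3, 5, 20):
    for num_paths in (2, 4):
        print(depth, num_paths, round(sharpness_ratio(num_paths, depth), 4))
```

In this toy model the ratio equals `num_paths ** (2 / depth - 1)`: it is 1 at depth 2 and falls toward `1 / num_paths` as depth grows, which agrees in direction with the claim that the factor decreases with both pathway count and depth.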

If this is right

  • Early training exhibits the depth-driven symmetry breaking seen under gradient flow.
  • Edge-of-stability oscillations drive a subsequent re-balancing phase that redistributes signals.
  • Large-step GD favors shared representations over single-pathway dominance.
  • Depth amplifies the sharpness reduction from multi-path distributions and therefore the strength of re-balancing.
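
The stability threshold behind the edge-of-stability bullets can be illustrated on a one-dimensional quadratic. This is a textbook fact rather than a result of the paper: on f(x) = 0.5 * curvature * x^2, each GD step multiplies x by (1 - eta * curvature), so the iterate contracts only when eta < 2 / curvature. The names and the curvature value 3.0 below are arbitrary illustrative choices.

```python
def gd_on_quadratic(eta, curvature=3.0, x0=1.0, steps=100):
    """Run GD on f(x) = 0.5 * curvature * x**2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= eta * curvature * x  # the gradient of f at x is curvature * x
    return x


threshold = 2 / 3.0  # 2 / curvature for the default curvature
print(abs(gd_on_quadratic(0.9 * threshold)))  # step below threshold: shrinks
print(abs(gd_on_quadratic(1.1 * threshold)))  # step above threshold: grows
```

In a nonlinear loss the same condition applies locally, with the curvature replaced by the sharpness at a minimum, which is why a minimum with sharpness above 2/η cannot attract large-step GD.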

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same re-balancing may occur in nonlinear networks that exhibit edge-of-stability behavior under large steps.
  • Depth-dependent sharpness reduction offers one concrete reason deeper architectures can sustain distributed representations under practical training.
  • The mechanism links sharpness-based accounts of training dynamics to the emergence of shared features rather than specialized pathways.

Load-bearing premise

The analysis assumes that edge-of-stability oscillations and the associated sharpness reduction apply directly to the discrete gradient descent trajectory and that the invoked loss landscape properties hold under standard random initialization.

What would settle it

Training a multi-pathway deep linear network with large-step GD and measuring whether pathway signal norms equalize after the loss begins oscillating at the edge of stability.
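
A minimal sketch of such a check follows. It is not the paper's experiment: it assumes two scalar pathways of depth 3 with depth-balanced weights, a target signal of 1, and arbitrary initial values, and the function name and defaults are hypothetical. It runs GD with a small and a large step size and reports the final per-pathway signals.

```python
def train(eta, depth=3, init=(0.5, 0.3), target=1.0, steps=20000):
    """GD on 0.5 * (sum_h w_h**depth - target)**2 with one scalar per pathway.

    Every layer in a pathway starts at the same value, so all layers in that
    pathway receive the same gradient and stay equal; tracking one scalar
    per pathway is therefore enough.
    """
    w = list(init)
    for _ in range(steps):
        residual = sum(x ** depth for x in w) - target
        # Per-layer gradient for pathway h: w_h**(depth - 1) * residual.
        w = [x - eta * x ** (depth - 1) * residual for x in w]
    signals = [x ** depth for x in w]
    return signals, 0.5 * (sum(signals) - target) ** 2


small_signals, small_loss = train(eta=0.05)
large_signals, large_loss = train(eta=0.8)
print("small step:", small_signals, small_loss)
print("large step:", large_signals, large_loss)
```

In this toy setting the single-path minimum has sharpness 3 and the evenly shared minimum about 2.38, so eta = 0.8 lies above 2/3 but below roughly 0.84. Comparing the smaller of the two signals across the runs is one simple way to read off whether the large step redistributed signal between pathways.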

Figures

Figures reproduced from arXiv: 2606.05219 by Hee-Sung Kim, Sungyoon Lee.

Figure 1
Training dynamics of GD with multi-path DLNs (L = 20, H = 2, σ⋆1 = 1) across different learning rates η. The evolution of pathway singular values is shown in blue (σ11, h = 1), light blue (σ21, h = 2), and gray (σ1 = σ11 + σ21). While small η leads to a single-path solution, GD with a larger learning rate (η > 2/S1) drives the system toward a more balanced configuration where singular values are distribute…
Figure 2
Trajectories of (σ11, σ21) for different learning rates η. In the stable regime (left), the dominant pathway suppresses others. Beyond the stability threshold (middle, right), the pathways bifurcate into the more balanced minimum (λmax < S1). Appendix E identifies the mechanism driving this phase. On the SVS depth-balanced manifold, the dynamics reduce to a scalar recursion per mode and pathway. Under grad…
Figure 3
Symmetry Breaking vs. Re-balancing in Nonlinear Networks. Training dynamics of a two-pathway MLP (L = 3, H = 2, Tanh activation). Training with small η (top row) shows symmetry breaking, and training with larger η = 2/λ_1^min (bottom row) shows a re-balancing phase. 5.2. Contrast with Depth-Wise Balancing. A fundamental distinction exists between the pathway re-balancing observed in this work and the depth-wise…
Figure 4
Trajectories of (σ11, σ21) for heterogeneous-depth models with learning rate η* = 2/λ_1^min. See …
Figure 5
Loss (1/2)(w^L − σ⋆1)^2 of the deep linear chain (L = 5, σ⋆1 = 1; w is the per-layer scale, minimizer w⋆ = σ⋆1^(1/L) = 1). GD follows the map (19). From wt, the maximizer of fη producing the largest overshoot wt+1, the second step wt+2 either returns to (0, w⋆) (blue, η ≈ 0.97 ηWCR) or crosses the origin into w < 0 (red, η ≈ 1.03 ηWCR). ηWCR is the knife-edge between the two. On the relevant overshoot branch…
Figure 6
Deep linear chain update map (19) for different values of η/η1 and L. The worst-case overshoot (c, fη(c)) and the return point (fη(c), fη(fη(c))) illustrate how larger depth permits larger normalized steps before a sign flip occurs. F. Worst-Case Return Threshold for Deep Chains. This appendix provides the proofs and computational details for Section 6. Our goal is to characterize the largest learning rate …
Figure 7
Training dynamics of GD with a multi-path MLP (L = 3, H = 2, σ⋆1 = 1) for different learning rates η. GD with a larger learning rate η mitigates sharp minima with λmax > 2/η.
Figure 8
Trajectory of (σ11, σ21) for the heterogeneous-depth model with learning rate η*.
Figure 9
Trajectory of σh1 for learning rate η = 0.99 ηWCR. As the learning rate gets larger, σh1 returns close to zero. [Plot panels: η < 2/S1, η > 2/S1, η = 2/λ_1^min; each titled "the 2-Simplex Projected Trajectory", with axes u and v and vertices labeled 11, 21, 31.]…
Figure 10
Trajectory of σh1 trained with η1 = 2/λ_1^min, shown in a 2D unfolded view of the minima Σ_{h=1}^{3} σh1 = σ⋆1.
Original abstract

Recent analyses of multi-pathway Deep Linear Networks use Gradient Flow to predict a "winner-takes-all" specialization in which path symmetry breaks and each feature concentrates in a single pathway. In this work, we show that discrete Gradient Descent (GD) with a large step size tells a different story. We prove that single-path solutions are sharp minima, whereas distributing signals across pathways reduces sharpness by a factor that decreases with both the number of pathways and depth. Consequently, while early training reproduces the depth-driven symmetry breaking predicted by GF, oscillations at the Edge of Stability subsequently override this tendency and drive the network into a re-balancing phase, where signals redistribute across pathways. Together, these results clarify how depth shapes pathway competition and explain why large-step GD favors shared representations rather than persistent single-pathway dominance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript examines multi-pathway deep linear networks and contrasts gradient flow (GF) predictions of depth-driven symmetry breaking and winner-takes-all specialization with the behavior of discrete gradient descent (GD) using large step sizes. It proves that single-path solutions are sharp minima while distributing signals across pathways reduces sharpness by a factor that decreases with pathway count and depth. Consequently, early GD training reproduces GF-style symmetry breaking, but Edge of Stability (EoS) oscillations later override this and induce a re-balancing phase in which signals redistribute across pathways.

Significance. If the central claims hold, the work clarifies how discretization and large step sizes interact with depth to shape pathway competition, offering an explanation for why practical large-step GD often favors distributed rather than single-path representations. The explicit proofs of the sharpness properties constitute a verifiable landscape analysis and are a clear strength.

major comments (1)
  1. [Abstract] Abstract (final sentence) and the transition from landscape analysis to dynamics: the claim that EoS oscillations 'override this tendency and drive the network into a re-balancing phase' treats the dynamical consequence as following directly from the sharpness reduction. However, sharpness is a local curvature property at critical points; the manuscript does not appear to supply a derivation or theorem establishing that the discrete large-step GD trajectory necessarily produces net signal redistribution (rather than sustained oscillation without redistribution or trapping). This link is load-bearing for the 'consequently' clause and the overall narrative that large-step GD restores symmetry.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive report on arXiv:2606.05219. The major comment concerns the transition from our landscape results to the claimed dynamical re-balancing under large-step GD. We respond point-by-point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (final sentence) and the transition from landscape analysis to dynamics: the claim that EoS oscillations 'override this tendency and drive the network into a re-balancing phase' treats the dynamical consequence as following directly from the sharpness reduction. However, sharpness is a local curvature property at critical points; the manuscript does not appear to supply a derivation or theorem establishing that the discrete large-step GD trajectory necessarily produces net signal redistribution (rather than sustained oscillation without redistribution or trapping). This link is load-bearing for the 'consequently' clause and the overall narrative that large-step GD restores symmetry.

    Authors: We agree that sharpness is a local property and that a fully rigorous global trajectory theorem would strengthen the narrative. Our contribution establishes that single-path solutions are strictly sharper than multi-path solutions (by a factor that grows with both depth and pathway count, consistent with the abstract). Standard analyses of the Edge of Stability show that large-step GD cannot converge to minima whose sharpness exceeds 2/η and instead produces persistent oscillations whose time-averaged effect moves parameters toward flatter regions. Because the flatter regions are precisely the multi-pathway configurations (per our sharpness theorems), the observed re-balancing follows. We support the mechanism with both the landscape theorems and extensive simulations that document the transition from early symmetry breaking to later redistribution. We will add a short clarifying paragraph in the main text and revise the abstract wording to emphasize that the link relies on the combination of our sharpness results with established EoS behavior rather than a new global convergence theorem. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent loss-landscape proof

full rationale

The paper's central claims consist of a mathematical proof that single-path solutions are sharp minima while multi-path distributions reduce sharpness (with the reduction factor depending on pathway count and depth), followed by the observation that early GD follows GF symmetry breaking but EoS oscillations later induce re-balancing. No quoted step reduces by construction to a fitted parameter, self-citation, or redefinition of its own inputs; the sharpness analysis is presented as an external property of the loss landscape rather than an ansatz or renamed empirical pattern. The dynamical link to discrete GD is asserted as a consequence but does not exhibit self-definitional or load-bearing self-citation patterns within the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted.


