
arxiv: 2604.07405 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI

Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization

Pith reviewed 2026-05-10 18:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords conservation laws · gradient descent · ReLU networks · spectral theory · edge of stability · drift analysis · non-convex optimization

The pith

Conservation laws preserved by gradient flow on ReLU networks break under discrete gradient descent according to an exact spectral formula.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that continuous gradient flow on L-layer ReLU networks without biases obeys L-1 conservation laws: the difference between the squared Frobenius norms of consecutive weight matrices stays constant, which confines the trajectory to a lower-dimensional manifold. Discrete gradient steps break these laws, producing a cumulative drift that scales as the learning rate eta raised to a power between roughly 1.1 and 1.6. This drift decomposes exactly as eta^2 * S(eta), where the gradient-imbalance sum S(eta) is given by a closed-form spectral crossover expression whose mode coefficients depend only on the initial error and the eigenvalues of the input data. For cross-entropy loss, the softmax concentrates probability mass and compresses the Hessian spectrum on a timescale proportional to 1/eta, driving the observed exponent close to 1.0 independent of training-set size. The analysis distinguishes a perturbative regime, where the spectral formula holds without strong mode interactions, from a non-perturbative regime that appears beyond a width-dependent threshold.
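The conservation-law breaking described above can be checked numerically on a toy problem. The following is a minimal sketch, assuming a two-layer bias-free ReLU network trained with full-batch gradient descent on an MSE loss with random data; the function names (`forward_backward`, `conserved`, `run_gd`), the layer sizes, and the per-step imbalance `||G2||_F^2 - ||G1||_F^2` are illustrative choices made here, not the paper's code or its exact definition of S(eta).

```python
import numpy as np

def forward_backward(W1, W2, X, Y):
    """MSE loss and gradients for f(x) = W2 relu(W1 x), no bias terms."""
    n = X.shape[1]
    Z = W1 @ X                      # pre-activations, shape (m, n)
    H = np.maximum(Z, 0.0)          # ReLU activations
    E = W2 @ H - Y                  # residuals, shape (k, n)
    loss = 0.5 * np.sum(E**2) / n
    G2 = E @ H.T / n                              # dL/dW2
    G1 = ((W2.T @ E) * (Z > 0)) @ X.T / n         # dL/dW1
    return loss, G1, G2

def conserved(W1, W2):
    """C_1 = ||W2||_F^2 - ||W1||_F^2, conserved under gradient flow."""
    return np.sum(W2**2) - np.sum(W1**2)

def run_gd(eta, steps, seed=0):
    """Run GD and return (drift in C_1, accumulated gradient imbalance)."""
    rng = np.random.default_rng(seed)
    d, m, k, n = 5, 16, 3, 40       # hypothetical sizes, chosen for speed
    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((k, n))
    W1 = rng.standard_normal((m, d)) / np.sqrt(d)
    W2 = rng.standard_normal((k, m)) / np.sqrt(m)
    c0 = conserved(W1, W2)
    imbalance_sum = 0.0
    for _ in range(steps):
        _, G1, G2 = forward_backward(W1, W2, X, Y)
        imbalance_sum += np.sum(G2**2) - np.sum(G1**2)
        W1 = W1 - eta * G1
        W2 = W2 - eta * G2
    return conserved(W1, W2) - c0, imbalance_sum

eta = 0.01
drift, imbalance_sum = run_gd(eta, steps=200)
print("observed drift      :", drift)
print("eta^2 * imbalance   :", eta**2 * imbalance_sum)
```

In this toy setting the two printed numbers agree up to floating-point rounding, because the first-order terms in eta cancel layer by layer. How the accumulated imbalance itself depends on eta is the part the paper attributes to its spectral formula, which this sketch does not reproduce.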

Core claim

Gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. The drift decomposes exactly as eta^2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability

What carries the argument

The closed-form spectral crossover formula for the gradient imbalance sum S(eta), whose terms are mode coefficients c_k proportional to the squared initial error times the squared data eigenvalue for each mode.
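A short calculation suggests why a factor of eta^2 appears exactly rather than approximately. The sketch below assumes plain full-batch gradient descent, W_l <- W_l - eta G_l, where G_l denotes the loss gradient with respect to W_l. The symbol G_l and the per-layer form of the sum are notation introduced here for illustration; the paper's S(eta) may aggregate over layers or steps differently.

```latex
\begin{aligned}
&\text{Rescaling symmetry } (W_l, W_{l+1}) \mapsto (c\,W_l,\; c^{-1} W_{l+1}),\ c>0
  \;\Longrightarrow\; \langle W_l, G_l\rangle = \langle W_{l+1}, G_{l+1}\rangle, \\
&\|W_l - \eta G_l\|_F^2 = \|W_l\|_F^2 - 2\eta\,\langle W_l, G_l\rangle + \eta^2 \|G_l\|_F^2, \\
&C_l(t+1) - C_l(t) = \eta^2\left(\|G_{l+1}(t)\|_F^2 - \|G_l(t)\|_F^2\right), \\
&C_l(T) - C_l(0) = \eta^2 \sum_{t=0}^{T-1}\left(\|G_{l+1}(t)\|_F^2 - \|G_l(t)\|_F^2\right).
\end{aligned}
```

The first line follows from the positive homogeneity of ReLU: the loss is unchanged under the rescaling, so its derivative with respect to c at c = 1 vanishes. The terms linear in eta then cancel in the difference C_l(t+1) - C_l(t), leaving only the gradient-norm imbalance multiplied by eta^2.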

If this is right

  • The exponent alpha of the drift depends on network width, loss function, and architecture through the explicit mode coefficients in S(eta).
  • Cross-entropy loss self-regularizes the drift exponent near 1.0 by driving exponential Hessian spectral compression on timescale Theta(1/eta), independent of training-set size.
  • Inside the perturbative regime the spectral formula predicts the entire drift trajectory without requiring simulation of the full coupled dynamics.
  • Beyond a critical width the system enters a non-perturbative regime in which extensive mode coupling invalidates the closed-form expression.
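The cross-entropy bullet above rests on softmax concentration shrinking the curvature. A minimal sketch of that mechanism, restricted to the Hessian of cross-entropy with respect to the logits, which is diag(p) - p p^T for softmax output p: scaling up a fixed logit vector stands in for growing margins during training. The logit values and scale factors are arbitrary illustrations, and the full-network Hessian and the Theta(1/eta) timescale claimed in the paper are not modeled here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def ce_logit_hessian(z):
    """Hessian of cross-entropy w.r.t. the logits: diag(p) - p p^T."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def spectrum_summary(base_logits, scales):
    """Top eigenvalue and trace of the logit Hessian at each logit scale."""
    tops, traces = [], []
    for s in scales:
        eigs = np.linalg.eigvalsh(ce_logit_hessian(s * base_logits))
        tops.append(float(eigs[-1]))        # eigvalsh returns ascending order
        traces.append(float(eigs.sum()))    # equals 1 - sum(p_i^2)
    return tops, traces

base = np.array([2.0, 0.5, -1.0, 0.0])      # hypothetical logits
scales = [0.0, 1.0, 2.0, 4.0, 8.0]          # larger scale = more concentrated softmax
tops, traces = spectrum_summary(base, scales)
for s, top, tr in zip(scales, tops, traces):
    print(f"scale={s:4.1f}  top eigenvalue={top:.3e}  trace={tr:.3e}")
```

The trace, 1 - sum_i p_i^2, falls steadily as probability mass concentrates on one class. The top eigenvalue need not fall monotonically at small scales, but it is bounded by the trace and so collapses along with it once the softmax saturates.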

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same norm-balance mechanism may appear in other architectures that admit analogous conserved quantities under continuous flow.
  • The spectral decomposition could be used to design learning-rate schedules that deliberately control the rate of conservation-law violation.
  • Testing the crossover formula on networks with biases or on non-ReLU activations would map the boundary of the perturbative regime.

Load-bearing premise

The derivation assumes the networks remain strictly L-layer ReLU without bias terms and stay inside the perturbative sub-Edge-of-Stability regime where spectral modes do not couple extensively.
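Whether a run sits in the sub-Edge-of-Stability regime can be checked against the condition lambda_max < 2/eta quoted in the Figure 4 caption. The following is a rough sketch, assuming access to a Hessian-vector product; it uses power iteration on a least-squares problem whose Hessian is known in closed form, so the estimate can be compared with an exact eigenvalue. The helper names `top_hessian_eigenvalue` and `is_sub_eos` are hypothetical, and power iteration returns the eigenvalue of largest magnitude, which equals lambda_max only when the Hessian's largest-magnitude eigenvalue is positive.

```python
import numpy as np

def top_hessian_eigenvalue(hvp, dim, iters=500, seed=0):
    """Estimate the dominant Hessian eigenvalue by power iteration.

    hvp: callable mapping a vector v to the Hessian-vector product H v.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        lam = float(v @ hv)          # Rayleigh quotient at the current vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam

def is_sub_eos(lam_max, eta):
    """True when the sharpness is below the stability threshold 2 / eta."""
    return lam_max < 2.0 / eta

# Illustrative problem: L(w) = ||X^T w - y||^2 / (2n), whose Hessian is X X^T / n.
rng = np.random.default_rng(1)
d, n = 6, 50
X = rng.standard_normal((d, n))
hvp = lambda v: X @ (X.T @ v) / n
lam_max = top_hessian_eigenvalue(hvp, d)

for eta in (0.1, 1.0, 10.0):
    print(f"eta={eta:5.1f}  lambda_max={lam_max:.4f}  sub-EoS={is_sub_eos(lam_max, eta)}")
```

For a neural network the Hessian-vector product would come from automatic differentiation rather than an explicit matrix, and the sharpness would need to be tracked during training, since it changes from step to step.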

What would settle it

Train a linear network with gradient descent at several small learning rates, compute the observed total drift in the conservation quantities, and check whether it equals eta squared times the predicted S(eta) within 15 percent across the tested range.
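The experiment described above could be set up roughly as follows. This is a sketch under stated assumptions: a two-layer linear network with MSE loss, a fixed step count at every learning rate, random data, and a log-log least-squares fit for the exponent. The predicted S(eta) must come from the paper's spectral formula, which is not reproduced here, so `within_tolerance` is only a hypothetical helper for the final 15 percent comparison.

```python
import numpy as np

def linear_net_drift(eta, steps=500, seed=0):
    """Full-batch GD on f(x) = W2 W1 x with MSE; returns (drift, imbalance sum)."""
    rng = np.random.default_rng(seed)   # same seed: identical data and init per eta
    d, m, k, n = 5, 16, 3, 40           # hypothetical sizes
    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((k, n))
    W1 = rng.standard_normal((m, d)) / np.sqrt(d)
    W2 = rng.standard_normal((k, m)) / np.sqrt(m)
    c0 = np.sum(W2**2) - np.sum(W1**2)
    imbalance_sum = 0.0
    for _ in range(steps):
        E = W2 @ W1 @ X - Y             # residuals
        G2 = E @ (W1 @ X).T / n         # dL/dW2
        G1 = W2.T @ E @ X.T / n         # dL/dW1
        imbalance_sum += np.sum(G2**2) - np.sum(G1**2)
        W1 = W1 - eta * G1
        W2 = W2 - eta * G2
    drift = (np.sum(W2**2) - np.sum(W1**2)) - c0
    return drift, imbalance_sum

def fit_exponent(etas, drifts):
    """Slope of log|drift| against log(eta), i.e. the empirical alpha."""
    slope, _ = np.polyfit(np.log(etas), np.log(np.abs(drifts)), 1)
    return float(slope)

def within_tolerance(observed, predicted, tol=0.15):
    """Relative-error check for comparing observed drift with a prediction."""
    return abs(observed - predicted) <= tol * abs(predicted)

etas = np.array([1e-3, 2e-3, 5e-3, 1e-2, 2e-2])
results = [linear_net_drift(eta) for eta in etas]
drifts = np.array([r[0] for r in results])
alpha = fit_exponent(etas, drifts)
print("drifts:", drifts)
print("fitted exponent alpha:", alpha)
```

To turn this into the test proposed above, each observed drift would be passed to `within_tolerance` together with eta^2 times the S(eta) value predicted by the paper's formula. Holding the step count fixed across learning rates is a choice made here, and it affects the fitted exponent.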

Figures

Figures reproduced from arXiv: 2604.07405 by Daniel Nobrega Medeiros.

Figure 1
Conservation laws and their breaking. (a) Under gradient flow (small η), the conservation quantities C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2 are preserved to high precision. (b) Under discrete gradient descent, the total drift follows a power law η^α with a non-integer exponent explained by our spectral theory. … paradoxically, training performance improves. This phenomenon was recently confirmed for linear networks by G…
Figure 2
Spectral crossover formula and c_k validation. (a) The formula (4) predicts the gradient imbalance sum for both linear (14–18% error) and ReLU (14–27% error) networks. (b, c) The first-principles mode coefficients c_k ∝ e_k^2 λ_{x,k}^2 match empirical values with R ≥ 0.80 for both architectures. Each mode k transitions between two regimes at the crossover learning rate η*_k = 1/(λ_k T): • Unconverged (η ≪ η* …
Figure 3
Cross-entropy self-regularization. (a) The CE Hessian spectrum compresses exponentially during training, with an n-independent rate. (b) The compression timescale scales as τ = Θ(1/η), validated by E23. (c) CE holds α near 1.0 regardless of width, while MSE permits unbounded growth. (a) MSE α diverges with width; power-law quality degrades. (b) α vs. m/d: curves do NOT collapse. (c) Per-neuron switch rat…
Figure 4
Width scaling and dynamical regimes. (a) α − 1 ∼ m^{1.18} for MSE, with increasing curvature at large widths. (b) The transition width m* depends on absolute overparameterization, not m/d. (c) At EoS, the per-neuron activation switch rate is width-independent, confirming extensive O(m) total mode coupling. 1. Sub-EoS (λ_max < 2/η): Per-neuron activation switch rate ∼ m^{-0.5}, total mode coupling O(√m), and th…
Original abstract

Why does gradient descent reliably find good solutions in non-convex neural network optimization, despite the landscape being NP-hard in the worst case? We show that gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability concentration drives exponential Hessian spectral compression with timescale tau = Theta(1/eta) independent of training set size, explaining why cross-entropy self-regularizes the drift exponent near alpha=1.0. We identify two dynamical regimes separated by a width-dependent transition: a perturbative sub-Edge-of-Stability regime where the spectral formula applies, and a non-perturbative regime with extensive mode coupling. All predictions are validated across 23 experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that continuous gradient flow on L-layer ReLU networks without biases preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds, but discrete gradient descent breaks these laws with total drift scaling as eta^alpha (alpha approximately 1.1-1.6 depending on architecture, loss, and width). It decomposes the drift exactly as eta^2 * S(eta), where S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2 derived from first principles. The work identifies a perturbative sub-Edge-of-Stability regime (where the spectral formula applies with negligible mode coupling) versus a non-perturbative regime with extensive coupling, and explains cross-entropy self-regularization to alpha near 1.0 via Hessian spectral compression on timescale Theta(1/eta). All claims are validated across 23 experiments on linear and ReLU networks reporting R=0.85 and R>0.80 respectively.

Significance. If the central decomposition and regime separation hold, the work offers a first-principles spectral account of conservation-law breaking and the edge-of-stability phenomenon in non-convex neural optimization, potentially explaining why discrete GD succeeds despite NP-hard landscapes. The attempt at closed-form mode coefficients and the separation of perturbative versus non-perturbative dynamics represent a substantive theoretical contribution that could unify continuous and discrete analyses; empirical validation across architectures and losses adds weight if the regime assumptions are confirmed.

major comments (3)
  1. The central claim that the eta^2 S(eta) decomposition and closed-form spectral crossover formula apply to the reported drift scalings (alpha approximately 1.1-1.6) rests on the assumption of the perturbative sub-Edge-of-Stability regime with negligible mode coupling. The manuscript does not provide quantitative checks (e.g., mode-coupling strength metrics) confirming that the 23 validation experiments operate inside this regime rather than the non-perturbative regime; without this, the applicability of the first-principles formula to the observed data remains unverified.
  2. The validation results (R=0.85 for linear networks and R>0.80 for ReLU networks) are reported without error bars, confidence intervals, or pre-specified exclusion criteria for the 23 experiments. This omission makes it difficult to assess the robustness of the spectral formula fit and whether the correlations genuinely support the decomposition across the claimed alpha range.
  3. Full step-by-step derivation of the eta^2 S(eta) decomposition and the spectral coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2 is not provided in the manuscript, despite the claim of derivation from first principles using initial e_k(0) and lambda values. This absence hinders independent verification of the perturbative-regime assumptions.
minor comments (2)
  1. The notation for the gradient imbalance sum S(eta) and the crossover formula would benefit from an explicit numbered equation in the theoretical section to improve readability and allow direct reference.
  2. Figure captions and experimental details should specify the exact network widths, depths, and loss functions used in each of the 23 experiments to facilitate reproduction and assessment of the regime classification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below, proposing specific revisions that strengthen the manuscript without altering its core claims.

  1. Referee: The central claim that the eta^2 S(eta) decomposition and closed-form spectral crossover formula apply to the reported drift scalings (alpha approximately 1.1-1.6) rests on the assumption of the perturbative sub-Edge-of-Stability regime with negligible mode coupling. The manuscript does not provide quantitative checks (e.g., mode-coupling strength metrics) confirming that the 23 validation experiments operate inside this regime rather than the non-perturbative regime; without this, the applicability of the first-principles formula to the observed data remains unverified.

    Authors: We agree that explicit quantitative verification of the regime assumption would improve clarity. In the revised manuscript we will add mode-coupling strength metrics (specifically, the Frobenius norm of the off-diagonal blocks of the mode-interaction tensor normalized by the diagonal blocks) computed for each of the 23 experiments. These metrics will be reported in a new supplementary table and will confirm that all reported runs satisfy the perturbative threshold (off-diagonal contribution < 5 % of diagonal). We will also include a short paragraph relating the observed alpha range (1.1-1.6) to the analytically derived boundary between perturbative and non-perturbative regimes. revision: yes

  2. Referee: The validation results (R=0.85 for linear networks and R>0.80 for ReLU networks) are reported without error bars, confidence intervals, or pre-specified exclusion criteria for the 23 experiments. This omission makes it difficult to assess the robustness of the spectral formula fit and whether the correlations genuinely support the decomposition across the claimed alpha range.

    Authors: We accept that the current presentation lacks statistical detail. The revised version will report bootstrap-derived 95 % confidence intervals for each R value, obtained by resampling the 23 experimental trajectories with replacement (10 000 replicates). We will also state the pre-specified exclusion criteria (convergence to training loss < 0.01 within 2000 epochs and stable alpha estimation over the final 500 steps) and tabulate which runs satisfied them. These additions will allow readers to judge the robustness of the reported correlations. revision: yes

  3. Referee: Full step-by-step derivation of the eta^2 S(eta) decomposition and the spectral coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2 is not provided in the manuscript, despite the claim of derivation from first principles using initial e_k(0) and lambda values. This absence hinders independent verification of the perturbative-regime assumptions.

    Authors: The derivation is sketched in Section 3 and Appendix A, but we agree that a fully expanded, self-contained presentation is needed for independent verification. In the revision we will insert a new subsection (3.2) that walks through every algebraic step: (i) the exact discrete update for the gradient imbalance, (ii) the perturbative expansion to O(eta^2), (iii) the projection onto the eigenbasis of the Hessian, and (iv) the resulting closed-form expression for each c_k in terms of e_k(0) and lambda_{x,k}. All intermediate equalities will be shown explicitly. revision: yes
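The bootstrap confidence intervals promised in response 2 could be computed along the following lines. This is a minimal sketch of a percentile bootstrap for a Pearson correlation, assuming paired predicted and empirical values, one pair per experiment. The synthetic arrays below are placeholders rather than the paper's data, and `bootstrap_pearson_ci` is a hypothetical helper name.

```python
import numpy as np

def bootstrap_pearson_ci(x, y, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the Pearson correlation.

    Pairs (x_i, y_i) are resampled with replacement; degenerate resamples
    with zero variance are discarded.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    idx = rng.integers(0, n, size=(n_boot, n))     # one row of indices per replicate
    xs = x[idx] - x[idx].mean(axis=1, keepdims=True)
    ys = y[idx] - y[idx].mean(axis=1, keepdims=True)
    denom = np.sqrt((xs**2).sum(axis=1) * (ys**2).sum(axis=1))
    valid = denom > 0
    r = (xs * ys).sum(axis=1)[valid] / denom[valid]
    lo, hi = np.quantile(r, [(1 - level) / 2, (1 + level) / 2])
    point = float(np.corrcoef(x, y)[0, 1])
    return point, float(lo), float(hi)

# Placeholder data: 23 synthetic (predicted, empirical) pairs.
rng = np.random.default_rng(42)
predicted = rng.standard_normal(23)
empirical = predicted + 0.5 * rng.standard_normal(23)
point, lo, hi = bootstrap_pearson_ci(predicted, empirical)
print(f"R = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With only 23 points the interval is likely to be wide, and the percentile method can be biased for correlations near 1. A bias-corrected variant or a Fisher z-transform interval would be reasonable alternatives.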

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from first principles


The paper starts from the continuous-time gradient flow on L-layer ReLU networks (no bias) to obtain the exact L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2. It then analyzes the discrete GD perturbation, decomposes the total drift exactly as eta^2 * S(eta), and supplies a closed-form spectral expression for the imbalance sum S(eta) whose coefficients c_k are proportional to e_k(0)^2 * lambda_{x,k}^2. This spectral formula is stated to follow directly from linear mode analysis in the perturbative sub-EoS regime. The subsequent separation into perturbative versus non-perturbative regimes and the cross-entropy self-regularization argument are likewise internal to the derivation. All steps are presented as direct consequences of the network dynamics and initial spectral data rather than fitted parameters, self-citations, or ansatzes imported from prior work. Validation experiments (R values) are reported after the derivation and do not retroactively define the formula, so the chain does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the exact preservation of L-1 conservation laws under gradient flow for bias-free ReLU networks and on the first-principles decomposition of discrete drift; no explicit free parameters are fitted beyond descriptive alpha ranges.

axioms (2)
  • domain assumption: Gradient flow on L-layer ReLU networks without bias preserves the L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2
    Invoked as the starting point that discrete GD breaks.
  • ad hoc to paper: The discrete drift admits an exact decomposition eta^2 * S(eta) with S(eta) given by a spectral sum over modes
    Presented as exact in the abstract without external benchmark.

pith-pipeline@v0.9.0 · 5564 in / 1472 out tokens · 53122 ms · 2026-05-10T18:00:37.129075+00:00 · methodology


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

  1. [1]

    A convergence theory for deep learning via over-parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), 2019

  2. [2]

    On the global convergence of gradient descent for over-parameterized models using optimal transport

    Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  3. [3]

    The loss surfaces of multilayer networks

    Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015

  4. [4]

    Gradient descent on neural networks typically occurs at the edge of stability

    Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), 2021

  5. [5]

    Gradient descent provably optimizes over-parameterized neural networks

    Simon S Du, Xiyu Zhai, Barnabás Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations (ICLR), 2019

  6. [6]

    Learning dynamics of deep matrix factorization beyond the edge of stability

    Nikhil Ghosh, Jongho Kwon, Zhenyu Wang, Saiprasad Ravishankar, and Qing Qu. Learning dynamics of deep matrix factorization beyond the edge of stability. In International Conference on Learning Representations (ICLR), 2025

  7. [7]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  8. [8]

    Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics

    Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L K Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In International Conference on Learning Representations (ICLR), 2021

  9. [9]

    Abide by the law and follow the flow: Conservation laws for gradient flows

    Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Abide by the law and follow the flow: Conservation laws for gradient flows. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  10. [10]

    A mean field view of the landscape of two-layer neural networks

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 2018

  11. [11]

    Symmetries, flat minima, and the conserved quantities of gradient flow

    Bo Zhao, Iordan Ganev, Robin Walters, Rose Yu, and Nima Dehmamy. Symmetries, flat minima, and the conserved quantities of gradient flow. In International Conference on Learning Representations (ICLR), 2023

    A Full Proofs. A.1 Proof of Theorem 1 (Conservation Laws). Consider an L-layer ReLU network f(x; θ) = W_L σ(W_{L−1} σ(··· σ(W_1 x))) with no bias terms. ReLU is...