Recognition: no theorem link
Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability
Pith reviewed 2026-05-13 20:50 UTC · model grok-4.3
The pith
For losses with product-stable minima, gradient descent converges to the local minimum on l(xy) objectives even in the edge of stability regime.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For loss functions possessing product-stable minima, gradient descent on the objective (x, y) ↦ l(xy) provably converges to the local minimum even when operating in the edge of stability regime, where the loss sharpness exceeds 2/η.
What carries the argument
Product-stability, a structural property of the loss at its minima that ensures convergence under product-form objectives despite high sharpness.
If this is right
- Convergence holds for binary cross-entropy loss.
- The training dynamics exhibit stable oscillations characterized by bifurcations.
- Sharpness at convergence can be precisely quantified using the framework.
- Prior restrictive assumptions on loss types are substantially relaxed.
Where Pith is reading between the lines
- Checking whether common deep learning losses satisfy product-stability could explain EoS behavior in practice.
- The approach might extend to other bilinear or low-rank factorized models.
- Experimental verification could involve measuring sharpness and observing oscillations in training runs on synthetic l(xy) objectives.
Load-bearing premise
The loss functions must possess product-stable minima, a property defined for the minima of l.
What would settle it
A counterexample loss without product-stable minima where gradient descent on l(xy) fails to converge in the EoS regime, or a case where sharpness exceeds the predicted value without oscillations.
Figures
read the original abstract
Empirically, modern deep learning training often occurs at the Edge of Stability (EoS), where the sharpness of the loss exceeds the threshold below which classical convergence analysis applies. Despite recent progress, existing theoretical explanations of EoS either rely on restrictive assumptions or focus on specific squared-loss-type objectives. In this work, we introduce and study a structural property of loss functions that we term product-stability. We show that for losses with product-stable minima, gradient descent applied to objectives of the form $(x,y) \mapsto l(xy)$ can provably converge to the local minimum even when training in the EoS regime. This framework substantially generalizes prior results and applies to a broad class of losses, including binary cross entropy. Using bifurcation diagrams, we characterize the resulting training dynamics, explain the emergence of stable oscillations, and precisely quantify the sharpness at convergence. Together, our results offer a principled explanation for stable EoS training for a wider class of loss functions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a structural property called product-stability for loss functions and proves that gradient descent on objectives of the form (x,y) ↦ l(xy) converges to a local minimum even in the Edge of Stability regime, provided the loss has product-stable minima. The framework is claimed to apply to binary cross-entropy; bifurcation diagrams are used to characterize the resulting dynamics, explain stable oscillations, and quantify sharpness at convergence.
Significance. If the central claims hold, the work generalizes existing EoS analyses beyond squared-loss objectives to a broader class including BCE, offering a principled explanation for stable training at the edge of stability. The product-stability property and bifurcation analysis provide a reusable structural lens and concrete dynamical characterization that could extend to other losses.
major comments (2)
- [Product-stability definition and BCE verification] The load-bearing step is verification that binary cross-entropy satisfies the precise definition of product-stability (whatever combination of sign conditions on l', l'' and the product variable xy it imposes) on the relevant domain. The abstract asserts membership in the class, but without an explicit check the convergence guarantee does not automatically extend to BCE and the claimed generalization collapses.
- [Main convergence theorem] The abstract asserts a provable convergence result for GD in the EoS regime, yet supplies no proof steps, error bounds, or verification details on how product-stability controls effective sharpness and rules out divergence in the bifurcation analysis. The central theorem therefore cannot be assessed for gaps or hidden assumptions.
minor comments (1)
- [Bifurcation analysis] Bifurcation diagrams should include explicit parameter values, axis labels, and a clear statement of the loss and step-size regime used so that the quantified sharpness at convergence can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and accessibility.
read point-by-point responses
-
Referee: [Product-stability definition and BCE verification] The load-bearing step is verification that binary cross-entropy satisfies the precise definition of product-stability (whatever combination of sign conditions on l', l'' and the product variable xy it imposes) on the relevant domain. The abstract asserts membership in the class, but without an explicit check the convergence guarantee does not automatically extend to BCE and the claimed generalization collapses.
Authors: We agree that an explicit verification is essential. Definition 3.1 states the product-stability conditions, and Appendix C contains the full verification for binary cross-entropy by checking the required sign conditions on l' and l'' over the relevant domain of xy. In the revision we will promote this verification to a dedicated subsection in Section 3 so that the extension to BCE is self-contained in the main text. revision: yes
-
Referee: [Main convergence theorem] The abstract asserts a provable convergence result for GD in the EoS regime, yet supplies no proof steps, error bounds, or verification details on how product-stability controls effective sharpness and rules out divergence in the bifurcation analysis. The central theorem therefore cannot be assessed for gaps or hidden assumptions.
Authors: Theorem 4.1 and its proof appear in Section 4; the argument uses the product-stability sign conditions to bound the effective sharpness below the divergence threshold and to rule out escape from the local minimum. Section 5 then supplies the bifurcation diagrams that quantify the resulting stable oscillations and the precise sharpness value at convergence. We will add a concise proof sketch immediately preceding Theorem 4.1 in the revised main text to make the logical structure and error-bound arguments more immediately visible. revision: partial
Circularity Check
No significant circularity; convergence result is conditional on a newly introduced structural property
full rationale
The paper introduces the definition of product-stability for loss functions and proves convergence of gradient descent on l(xy) objectives under the assumption that minima satisfy this property. This constitutes a standard conditional theorem rather than any self-referential reduction. The assertion that binary cross-entropy belongs to the class is presented as a verification step against the explicit definition (sign conditions on derivatives and the product variable), which does not reduce to a fitted parameter, self-citation chain, or tautological renaming. No equations or claims in the provided text exhibit the patterns of self-definitional derivation, fitted inputs called predictions, or load-bearing self-citations. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Loss functions possess product-stable minima
invented entities (1)
-
product-stability
no independent evidence
Reference graph
Works this paper leans on
-
[1]
URL https://proceedings.mlr.press/ v202/andriushchenko23b.html. Arora, S., Li, Z., and Panigrahi, A. Understanding gradient descent on edge of stability in deep learning, 2022. URL https://arxiv.org/abs/2205.09745. Cai, Y ., Wu, J., Mei, S., Lindsey, M., and Bartlett, P. L. Large stepsize gradient descent for non-homogeneous two-layer networks: Margin imp...
-
[2]
Then |zt −z ∗| ≥Z ±( 2 l′′(z0 + ϵ
-
[3]
It follows that Equation (19) holds at timeT=O(log 1 η )
so ˆZ ′(ζt)γtl′′(θt)gt is lower bounded by some constantc. It follows that Equation (19) holds at timeT=O(log 1 η ). Now thatγ t+2 −γ t =O(η 2), so γT ≥γ 0 − X t |γt+2 −γ t| = 2 l′′(z0) +δ−O(η 2 log 1 η ) ≥ 2 l′′(z0) + δ 2 Case 3:|g t| ≤M η 2. Since we are in the EoS regime, the update Equation (9) has a linearly unstable critical point atz∗. Hence the it...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.