pith. machine review for the scientific record.

arXiv: 2604.14669 · v1 · submitted 2026-04-16 · 💻 cs.LG · math.DS · math.OC · stat.ML

Recognition: unknown

Zeroth-Order Optimization at the Edge of Stability

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:15 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC · stat.ML
keywords zeroth-order optimization · edge of stability · linear stability analysis · Hessian spectrum · two-point estimator · implicit regularization · deep learning optimization · gradient estimation

The pith

Zeroth-order methods remain mean-square stable only when their step size satisfies a bound that depends on the full Hessian spectrum rather than its largest eigenvalue alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an exact step-size condition for the mean-square linear stability of zeroth-order optimization methods that rely on the standard two-point gradient estimator. This condition differs sharply from first-order methods because it incorporates every eigenvalue of the Hessian, not merely the largest one. Tractable upper and lower bounds on the allowable step size are obtained using only the largest eigenvalue together with the Hessian trace. Experiments show that common full-batch zeroth-order algorithms, including variants of gradient descent, momentum, and Adam, consistently operate near the predicted stability boundary across several deep-learning tasks. The analysis implies that large step sizes in the zeroth-order setting produce an implicit regularization effect focused on the trace of the Hessian.

Core claim

We provide an explicit step size condition that exactly captures the mean-square linear stability of a family of zeroth-order methods based on the standard two-point estimator. Mean-square stability of these methods depends on the entire Hessian spectrum. Tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace are derived. Full-batch zeroth-order methods operate at the edge of stability, and large step sizes primarily regularize the Hessian trace rather than the top eigenvalue.

What carries the argument

The mean-square linear stability condition for the two-point zeroth-order gradient estimator, whose allowable step-size range is set by the full set of Hessian eigenvalues.
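The derived condition itself is not reproduced in this digest, but its qualitative content can be probed numerically: on a fixed quadratic with known spectrum, ZO-GD should lose mean-square stability well below the first-order threshold 2/λmax. A minimal probe, assuming the Gaussian two-point estimator above; the markers 2/Tr(H) and 2/(Tr(H) + 2λmax) are read off the paper's figure legends, not the exact condition:

```python
import numpy as np

def zo_gd_second_moment(eigs, eta, mu=1e-6, steps=2000, runs=64, seed=0):
    """Empirical mean-square stability probe: run ZO-GD with the Gaussian
    two-point estimator on f(x) = 0.5 * sum_i eigs[i] * x_i**2 and return
    the average final squared norm. The estimator's noise is multiplicative
    in x, so stable settings decay toward 0 and unstable ones blow up
    (possibly printing inf).
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigs, dtype=float)
    f = lambda x: 0.5 * float(lam @ (x * x))
    out = []
    for _ in range(runs):
        x = rng.standard_normal(lam.size)
        for _ in range(steps):
            u = rng.standard_normal(lam.size)
            g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
            x -= eta * g
            if not np.isfinite(x @ x):
                break  # already diverged; stop early
        out.append(float(x @ x))
    return float(np.mean(out))

# Spectrum with Tr(H) = 10 and lambda_max = 4: first-order GD is stable for
# any eta < 2/4 = 0.5, but the ZO run destabilizes far earlier, roughly
# between the band endpoints 2/(Tr + 2*lam_max) ~ 0.11 and 2/Tr = 0.2
# suggested by the figures.
for eta in (0.05, 0.10, 0.25, 0.40):
    print(f"eta={eta}: mean final ||x||^2 = {zo_gd_second_moment([4, 3, 2, 1], eta):.3g}")
```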

Load-bearing premise

The loss can be locally approximated by a quadratic form whose Hessian remains constant along the relevant trajectory.

What would settle it

Observe a full-batch zeroth-order training run in which the largest stable step size deviates measurably from the value predicted by the derived condition once the Hessian spectrum is computed exactly.
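Running that test presupposes access to the curvature statistics. The extracted experimental details report logging the largest eigenvalue and trace of the Hessian with matrix-free procedures; the sketch below shows the standard versions (power iteration and a Hutchinson estimator, both built on Hessian-vector products). The function name, defaults, and iteration counts are illustrative assumptions, not the authors' code.

```python
import torch

def curvature_stats(loss_fn, params, power_iters=30, hutchinson_probes=10):
    """Matrix-free estimates of lambda_max(H) and Tr(H).

    loss_fn: closure returning the (full-batch) scalar loss;
    params: list of parameter tensors with requires_grad=True.
    """
    dim = sum(p.numel() for p in params)

    def hvp(v):
        # Double-backprop Hessian-vector product: d/dtheta (grad . v).
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])
        hv = torch.autograd.grad(flat @ v, params)
        return torch.cat([h.reshape(-1) for h in hv])

    # Power iteration for the top eigenvalue.
    v = torch.randn(dim)
    v /= v.norm()
    lam_max = 0.0
    for _ in range(power_iters):
        hv = hvp(v)
        lam_max = float(v @ hv)
        v = hv / (hv.norm() + 1e-12)

    # Hutchinson trace estimator with Rademacher probes: E[z^T H z] = Tr(H).
    trace = 0.0
    for _ in range(hutchinson_probes):
        z = torch.randint(0, 2, (dim,), dtype=torch.float32) * 2 - 1
        trace += float(z @ hvp(z))
    return lam_max, trace / hutchinson_probes
```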

Figures

Figures reproduced from arXiv: 2604.14669 by Bingcong Li, Liang Zhang, Michael Muehlebach, Minhak Song, Niao He, Sewoong Oh.

Figure 1
Figure 1. EoS behaviors of FO and ZO methods are captured by different spectral quantities of the Hessian. We train full-batch GD (left) and ZO-GD (right) with varying step sizes η on a CNN for CIFAR-10. For GD, the largest eigenvalue of the Hessian λmax(Ht) stabilizes near 2/η. For ZO-GD, the trace of the Hessian Tr(Ht) instead stabilizes slightly below 2/η. view at source ↗
Figure 2
Figure 2. Zeroth-order methods operate at the mean-square edge of stability. We train full-batch ZO methods on a CNN for CIFAR-10 and track the curvature terms defining the mean-square stability interval from Section 4.2. Across all panels, each color denotes one run; the solid curve is the lower-band term, the dash-dotted curve is the upper-band term, and the dashed line is the predicted stability threshold. Left (… view at source ↗
Figure 3
Figure 3. Catapult dynamics in ZO-GD. We train ZO-GD on a CNN and increase the step size midway through training (from η1 to η2 and then to η3). Top: the training loss exhibits a pronounced spike after each step size increase, consistent with catapult dynamics. Bottom: the Hessian trace Tr(Ht) drops sharply during the catapult phase and then rises again, re-equilibrating near the new stability threshold 2/η. view at source ↗
Figure 4
Figure 4. Effect of the smoothing parameter µ. We train ZO-GD on a CNN with a fixed step size and vary the smoothing parameter µ in the two-point estimator. For moderate and small smoothing (µ ≤ 10⁻³), ZO-GD operates at the mean-square EoS. For larger smoothing (µ ≥ 3 × 10⁻³), ZO-GD no longer reaches the EoS threshold and instead trains in a lower-curvature regime with a smaller Hessian trace. view at source ↗
Figure 5
Figure 5. Effect of batch size in mini-batch ZO-SGD. We train mini-batch ZO-SGD on a CNN with a fixed step size and vary the batch size. Compared to full-batch ZO-GD, mini-batch ZO-SGD trains in a lower-curvature regime with a smaller Hessian trace. view at source ↗
Figure 6
Figure 6. Mean-square EoS for full-batch ZO methods on ResNet. We train full-batch ZO-GD, ZO-GDM, and ZO-Adam on ResNet20 for CIFAR-10 and track the corresponding mean-square stability bounds and threshold in Section 5.1. view at source ↗
Figure 7
Figure 7. Mean-square EoS for full-batch ZO methods on Vision Transformer. We train full-batch ZO-GD, ZO-GDM, and ZO-Adam on a Vision Transformer for CIFAR-10 and track the corresponding mean-square stability bounds and threshold in Section 5.1. view at source ↗
Figure 8
Figure 8. Mean-square EoS on a synthetic sorting task with an LSTM. On the synthetic sorting task described in Karpathy (2020), using the setup adopted by Cohen et al. (2025, Appendix B.3), we train full-batch ZO-GD, ZO-GDM, and ZO-Adam and track the corresponding mean-square stability quantities. The trace-based curvature terms stabilize near the predicted thresholds, mirroring the behavior observed in the vision e… view at source ↗
Figure 9
Figure 9. Mean-square EoS on a synthetic sorting task with Mamba. On the synthetic sorting task described in Karpathy (2020), using the setup adopted by Cohen et al. (2025, Appendix B.3), we train full-batch ZO-GD, ZO-GDM, and ZO-Adam and track the corresponding mean-square stability quantities. The trace-based curvature terms stabilize near the predicted thresholds, mirroring the behavior observed in the vision exp… view at source ↗
Figure 10
Figure 10. ZO-Adam sweep over β1 and η. Full-batch ZO-Adam on a CNN trained on CIFAR-10 with β1 ∈ {0.1, 0.5, 0.9} (left to right) and multiple step sizes η per setting. Top: training loss. Middle: preconditioned curvature statistics Tr(Pt⁻¹Ht) and Tr(Pt⁻¹Ht) + (2/(1+β1))·λmax(Pt⁻¹Ht). Bottom: relative commutator ratio ‖[Pt, Ht]‖F / ‖Pt Ht‖F (cf. Appendix D.3). view at source ↗
Figure 11
Figure 11. CNN: same experiments as Figure 2, with training loss. Top: training loss. Bottom: the stability-band plots from Figure 2 for ZO-GD (left), ZO-GDM (middle), and ZO-Adam (right). view at source ↗
Figure 12
Figure 12. ResNet: same experiments as Figure 6, with training loss. Top: training loss. Bottom: the stability-band plots from Figure 6 for ZO-GD (left), ZO-GDM (middle), and ZO-Adam (right). view at source ↗
Figure 13
Figure 13. Vision Transformer: same experiments as … view at source ↗
read the original abstract

Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper derives an explicit step-size condition for the mean-square linear stability of a family of zeroth-order (ZO) methods based on the standard two-point estimator. It shows that, unlike first-order methods whose stability depends only on the largest Hessian eigenvalue, ZO mean-square stability depends on the full Hessian spectrum. Tractable upper and lower bounds are provided that depend only on λ_max and the trace of the Hessian. Empirically, full-batch ZO-GD, ZO-GDM, and ZO-Adam are shown to stabilize near the predicted boundary across several deep-learning tasks, with the claim that large step sizes implicitly regularize the Hessian trace rather than λ_max.

Significance. If the derivation is correct under its assumptions and the empirical edge-of-stability observation generalizes, the work supplies a concrete theoretical tool for analyzing ZO dynamics that is absent from the current literature. The explicit contrast with first-order stability, the spectrum-dependent characterization, and the practical bounds using only λ_max and trace(H) are useful for both theory and practice in black-box and memory-efficient training. The reported implicit-regularization effect specific to ZO methods is a potentially important distinction from the first-order edge-of-stability literature.

major comments (2)
  1. [§3] §3 (linear stability analysis): The exact mean-square stability condition is derived under the assumption of a quadratic loss with fixed Hessian H. This assumption is load-bearing for the central claim that ZO methods operate at the edge of stability in deep networks, yet the paper neither provides a controlled experiment that isolates the effect of a time-varying Hessian (e.g., by comparing a quadratic surrogate to a non-quadratic loss while keeping all other factors fixed) nor quantifies how rapidly H may change before the predicted threshold loses predictive power.
  2. [§4] §4 (tractable bounds): The reduction from the full-spectrum condition to bounds involving only λ_max and trace(H) is presented as a practical surrogate, yet the manuscript does not report the tightness of these bounds on the actual Hessians encountered during the reported training runs, nor does it show that crossing the bound (rather than the exact condition) reliably predicts divergence when curvature evolves.
minor comments (3)
  1. The two-point estimator is introduced without an explicit equation reference in the opening paragraphs; adding the standard definition (e.g., Eq. (2) or (3)) would improve readability for readers outside the ZO community.
  2. [Figures 2-4] Figure captions for the stability-boundary plots should state the number of independent runs and whether shaded regions represent standard deviation or min/max.
  3. [§5] A brief discussion of how the trace(H) regularization claim was verified (e.g., via direct Hessian estimation or proxy) would strengthen the implicit-regularization paragraph in §5.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important limitations in the scope of our theoretical analysis and its empirical validation. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (linear stability analysis): The exact mean-square stability condition is derived under the assumption of a quadratic loss with fixed Hessian H. This assumption is load-bearing for the central claim that ZO methods operate at the edge of stability in deep networks, yet the paper neither provides a controlled experiment that isolates the effect of a time-varying Hessian (e.g., by comparing a quadratic surrogate to a non-quadratic loss while keeping all other factors fixed) nor quantifies how rapidly H may change before the predicted threshold loses predictive power.

    Authors: We agree that the derivation relies on the quadratic fixed-Hessian setting, which is standard for local linear stability analysis but does not directly capture Hessian evolution. Our empirical results across multiple deep-learning tasks nevertheless show that full-batch ZO methods stabilize near the predicted boundary, indicating that the condition remains informative under the curvature changes encountered in practice. In the revision we will add an expanded discussion of this modeling assumption, its relation to prior edge-of-stability work, and the conditions under which the threshold is expected to retain predictive value. revision: partial

  2. Referee: [§4] §4 (tractable bounds): The reduction from the full-spectrum condition to bounds involving only λ_max and trace(H) is presented as a practical surrogate, yet the manuscript does not report the tightness of these bounds on the actual Hessians encountered during the reported training runs, nor does it show that crossing the bound (rather than the exact condition) reliably predicts divergence when curvature evolves.

    Authors: We concur that quantifying the gap between the exact spectrum-dependent condition and the λ_max/trace bounds on the Hessians arising in our experiments would strengthen the practical utility claim. In the revised manuscript we will include additional analysis (new plots or tables) that evaluate bound tightness using Hessian estimates from the reported training runs. We will also note the current lack of direct evidence that bound violation predicts divergence under evolving curvature and flag this as an avenue for future work. revision: yes
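For readers who want to see what such a tightness analysis would look like, a minimal sketch follows. The band endpoints are taken from the figure legends (Tr(H) and Tr(H) + 2λmax(H) compared against 2/η for ZO-GD); the exact full-spectrum condition from Section 3 is not reproduced in this digest, so the audit below only measures the gap between the two tractable endpoints, and the direction of each inequality should be checked against the paper's Section 4.

```python
import numpy as np

def band_gap(eigs):
    """Step-size interval implied by the tractable band terms for ZO-GD:
    eta below 2/(Tr(H) + 2*lam_max) sits under the upper-band curve, and
    eta above 2/Tr(H) exceeds the lower-band curve; the exact spectrum-
    dependent threshold lies in between. The ratio of the endpoints is a
    simple tightness score (1.0 = the bounds pin the threshold exactly).
    """
    lam = np.asarray(eigs, dtype=float)
    tr, top = lam.sum(), lam.max()
    eta_lo, eta_hi = 2 / (tr + 2 * top), 2 / tr
    return eta_lo, eta_hi, eta_lo / eta_hi

# Flat spectra give tight bounds; a single dominant eigenvalue loosens them.
print(band_gap([1.0] * 100))           # ratio ~ 0.98
print(band_gap([50.0] + [0.5] * 100))  # ratio = 0.5
```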

Circularity Check

0 steps flagged

Stability condition analytically derived under quadratic assumption; empirical edge observation independent

full rationale

The paper derives an explicit step-size condition for mean-square linear stability of ZO methods from the two-point estimator under the assumption of a locally quadratic loss with fixed Hessian. This derivation depends on the full Hessian spectrum and is not obtained by fitting parameters to data or by self-referential definition. Tractable bounds using only λ_max and trace(H) are obtained by mathematical bounding of the spectrum-dependent expression, not by renaming a fit. The claim that full-batch ZO methods operate near the predicted boundary is presented as an empirical observation on DL tasks, separate from the derivation and without the threshold being adjusted to the observed data. No self-citation chains, uniqueness theorems, or ansatzes smuggled via prior work appear in the load-bearing steps. The derivation is therefore self-contained against the quadratic model.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions for local linear stability analysis of iterative methods; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption The objective is twice continuously differentiable and the Hessian is locally constant for the purpose of linear stability analysis.
    Required for the mean-square stability derivation of the two-point estimator.
  • domain assumption Full-batch gradient estimates are used in the empirical validation.
    Stated explicitly for the reported ZO-GD, ZO-GDM, and ZO-Adam runs.

pith-pipeline@v0.9.0 · 5533 in / 1369 out tokens · 46430 ms · 2026-05-10T11:15:27.370755+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 5 canonical work pages

  1. [1] Andreyev, A. and Beneventano, P. Edge of stochastic stability: Revisiting the edge of stability for SGD. arXiv preprint arXiv:2412.20553, 2024.

  2. [2] Cohen, J., Ghorbani, B., Krishnan, S., Agarwal, N., Medapati, S., Badura, M., Suo, D., Cardoze, D., Nado, Z., Dahl, G. E., et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.

  3. [3] Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., and Gur-Ari, G. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.

  4. [4] Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

  5. [5] Ziyin, L., Li, B., Galanti, T., and Ueda, M. Type-II saddles and probabilistic stability of stochastic gradient descent. arXiv preprint arXiv:2303.13093, 2023.
