arXiv: 2605.29547 · v1 · pith:3LY4ID4Pnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · math.OC

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

Pith reviewed 2026-06-29 08:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords non-smooth optimization · Clarke stationary points · adaptive optimization · S-Adam · gradient chattering · quantization-aware training · directional derivatives

The pith

S-Adam uses variance of randomized directional derivatives to estimate local geometric instability and damp step sizes in non-smooth optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S-Adam, an optimizer designed for the non-smooth loss landscapes that arise in deep learning from components such as ReLU activations and quantization operators. It defines a Local Geometric Instability (LGI) metric from the variance of randomized directional derivatives to estimate the diameter of the Clarke subdifferential. This metric drives an adaptive damping mechanism that slows updates in unstable regions. The method comes with a proof of convergence to Clarke stationary points at an O(1/√T) rate and shows empirical improvements on quantization-aware training tasks.

Core claim

S-Adam stabilizes training by modulating step sizes with an adaptive damping factor exp(-λρ) based on the LGI metric, and converges almost surely to (δ,ε)-Clarke stationary points at the O(1/√T) rate while improving accuracy on CIFAR-100 and TinyImageNet.

What carries the argument

The Local Geometric Instability (LGI) metric, computed as the variance of randomized directional derivatives to estimate Clarke subdifferential diameter, which modulates step sizes to avoid chattering.
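
A minimal sketch of how such a metric could be computed, under assumptions the source does not confirm: each probe is the gradient at a randomly perturbed point x + δu projected onto a fixed unit direction d, and the variance is normalized as σ²/(μ² + σ² + ε), the form of g(a, b) quoted in the appendix excerpt further down this page. The names `lgi_estimate`, `delta`, `k`, and `eps` are hypothetical; the test function is the one from the Figure 1 caption.

```python
import numpy as np

def grad_f(p):
    """Gradient (where defined) of f(x, y) = |x-1| + |y-1| + 0.5(x^2 + y^2)."""
    return np.sign(p - 1.0) + p

def lgi_estimate(grad, x, d, delta=1e-2, k=64, eps=1e-8, rng=None):
    """Normalized variance of k randomized directional-derivative probes.

    Assumed probing scheme (hypothetical, not taken from the paper):
    Y_i = <grad(x + delta * u_i), d>, with u_i uniform on the unit sphere
    and d a fixed unit direction.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(size=(k, x.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # random unit directions
    probes = np.array([grad(x + delta * ui) @ d for ui in u])
    mu, var = probes.mean(), probes.var()
    return var / (mu**2 + var + eps)

rng = np.random.default_rng(0)
d = np.array([1.0, 1.0]) / np.sqrt(2.0)
rho_kink = lgi_estimate(grad_f, np.array([1.0, 1.0]), d, rng=rng)    # on the kink
rho_smooth = lgi_estimate(grad_f, np.array([3.0, 3.0]), d, rng=rng)  # smooth region
print(f"rho at kink: {rho_kink:.3f}, rho in smooth region: {rho_smooth:.2e}")
```

Under this assumed scheme the probes disagree only where the perturbation ball straddles a kink, so the estimate is far larger at (1, 1) than at (3, 3).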

If this is right

  • S-Adam converges almost surely to (δ,ε)-Clarke stationary points at an O(1/√T) rate.
  • It achieves accuracy gains of up to 6% on CIFAR-100 and 3% on TinyImageNet over AdamW and Prox-SGD.
  • The damping mechanism mitigates gradient oscillations in high-noise small-batch settings.
  • It applies effectively to Quantization-Aware Training and other non-smooth regimes.
  • The analysis uses differential inclusions for the convergence guarantee.
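
The chattering that the damping mechanism targets can be illustrated on f(x) = |x|. The sketch below is a toy under stated assumptions, not S-Adam itself: it uses a plain subgradient step instead of Adam's moment estimates, a hypothetical 1-D probe (signs of the gradient at random points within `delta` of x), and illustrative values for `lam`, `delta`, and `k`.

```python
import numpy as np

def lgi_1d(x, delta, k, rng, eps=1e-8):
    """Assumed probe: gradient sign of |x| at k random points within delta of x."""
    probes = np.sign(x + delta * rng.uniform(-1.0, 1.0, size=k))
    mu, var = probes.mean(), probes.var()
    return var / (mu**2 + var + eps)

def run(x0, lr, steps, lam=0.0, delta=0.1, k=32, seed=0):
    """Subgradient descent on f(x) = |x| with step size lr * exp(-lam * rho)."""
    rng = np.random.default_rng(seed)
    x, path = x0, []
    for _ in range(steps):
        rho = lgi_1d(x, delta, k, rng)
        x = x - lr * np.exp(-lam * rho) * np.sign(x)
        path.append(x)
    return np.array(path)

plain = run(0.95, lr=0.1, steps=200, lam=0.0)   # no damping: exp(0) = 1
damped = run(0.95, lr=0.1, steps=200, lam=5.0)  # damping active near the kink
print("tail amplitude, plain :", np.abs(plain[-20:]).max())
print("tail amplitude, damped:", np.abs(damped[-20:]).max())
```

With `lam=0` the iterate keeps hopping across the kink with an amplitude set by the step size; with `lam>0` the step shrinks once the probes straddle the kink, so the late iterates stay much closer to 0.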

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The randomized probing technique could extend to other first-order methods struggling with subdifferentials.
  • Accuracy gains suggest better generalization in quantized models, potentially reducing the need for post-training adjustments.
  • Similar instability metrics might apply to other non-differentiable components like max-pooling or certain activation functions.
  • Testing on larger models or different architectures could reveal scalability limits not addressed in the current experiments.

Load-bearing premise

The variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter that can be used to modulate step sizes without introducing new instability.

What would settle it

A counterexample or experiment on a simple non-smooth function where the LGI metric fails to correlate with actual subdifferential diameter, leading to divergence or worse performance than Adam.
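
One way such a check could be set up, as a hedged sketch: take f(x) = c|x|, whose Clarke subdifferential at 0 is [-c, c] with diameter 2c, and compare the probe statistics against that known diameter as c varies. The probing scheme (gradient at random points within `delta` of 0) and the normalization σ²/(μ² + σ² + ε) are assumptions carried over from the appendix excerpt on this page; `probe_stats` is a hypothetical helper.

```python
import numpy as np

def probe_stats(c, delta=0.1, k=4000, eps=1e-8, seed=0):
    """Probe the gradient of f(x) = c*|x| at k random points within delta of x = 0.

    Returns (standard deviation of the probes, normalized variance ratio).
    """
    rng = np.random.default_rng(seed)
    probes = c * np.sign(delta * rng.uniform(-1.0, 1.0, size=k))
    mu, var = probes.mean(), probes.var()
    return np.sqrt(var), var / (mu**2 + var + eps)

for c in (0.5, 1.0, 4.0):
    diameter = 2.0 * c  # Clarke subdifferential of c*|x| at 0 is [-c, c]
    std, rho = probe_stats(c)
    print(f"c={c}: diameter={diameter}, probe std={std:.3f}, normalized rho={rho:.3f}")
```

In this toy setup the raw standard deviation scales with c, while the normalized ratio is scale-invariant by construction. Whether the paper's estimator behaves the same way depends on its actual definition, which this page does not reproduce in full.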

Figures

Figures reproduced from arXiv: 2605.29547 by Borong She, Qiufeng Wang, Ruoran Xu, Xiaobo Jin.

Figure 1: Geometric instability visualization on a synthetic non-smooth landscape, f(x, y) = |x − 1| + |y − 1| + 0.5(x² + y²).
    Adjacent text (1. Introduction): The theoretical underpinnings of deep learning optimization are predominantly built upon the assumption of Lipschitz-continuous gradients, which guarantees stable descent and convergence. However, this assumption is fundamentally incompatible with the architectural realitie…
Figure 2
Figure 3: Smooth function (left) and non-smooth function (right).
    Adjacent text: Modern neural architectures violate the Lipschitz-smooth assumption at numerous points: ReLU activations induce kinks, quantization operators create step discontinuities, and sparsity regularizers introduce ℓ1-type non-differentiabilities. As visualized in …
Figure 4: Loss curve.
Figure 5: Training progress showing epoch vs. accuracy for the CIFAR-100 and TinyImageNet datasets.
    Adjacent text (6.5. High-Noise Learning on Small Batch Size; 6.5.1. Resilience to Extreme Stochastic Noise (N = 2)): The efficacy of S-Adam is most pronounced in the Batch Size = 2 regime (…
Figure 6: Loss curves for ResNet18 across the CIFAR10, CIFAR100, and Imagewoof2-160 datasets with varying batch sizes (4, 16, and 64).
Original abstract

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\epsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Singularity-aware Adam (S-Adam) for non-smooth optimization in deep learning. It introduces the Local Geometric Instability (LGI) metric ρ, defined as the variance of randomized directional derivatives, as an estimator of Clarke subdifferential diameter. An adaptive damping term exp(-λρ) modulates step sizes to reduce chattering. The paper claims a rigorous convergence proof via differential inclusions establishing almost-sure convergence to (δ,ε)-Clarke stationary points at the optimal O(1/√T) rate, together with empirical accuracy gains of up to 6% on CIFAR-100 and 3% on TinyImageNet versus AdamW and Prox-SGD in quantization-aware and high-noise settings.

Significance. If the central technical relation between the LGI variance estimator and Clarke subdifferential diameter can be established and the differential-inclusion argument completed, the work would supply a theoretically grounded adaptive mechanism for stabilizing training under non-smoothness induced by ReLUs and quantization, addressing a practically relevant gap between smooth-assumption optimizers and modern architectures.

major comments (2)
  1. [theoretical analysis / convergence proof] The differential-inclusion convergence argument (theoretical analysis section) relies on the damping exp(-λρ) being triggered exactly when the subdifferential is large. No lemma is supplied showing that the variance ρ of randomized directional derivatives is monotone in, or bounded by, the diameter of the Clarke subdifferential; without this relation the activation condition for the damping term is unverified and the almost-sure O(1/√T) guarantee does not follow.
  2. [LGI metric definition] The definition of the LGI metric (abstract and method section) asserts that variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter, yet no supporting result (e.g., concentration inequality or monotonicity lemma) is provided; this estimator is load-bearing for both the adaptive mechanism and the convergence claim.
minor comments (2)
  1. [method / damping mechanism] The hyperparameter λ in exp(-λρ) is introduced without derivation or sensitivity analysis; its status as a free parameter should be clarified.
  2. [experiments] Experimental section lacks reported error bars, statistical significance tests, precise dataset splits, batch-size schedules, and ablation on the number of directional probes used to compute ρ.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the connection between the LGI metric and the Clarke subdifferential requires explicit supporting results to fully justify the adaptive mechanism and convergence claim. We address each point below and will incorporate the necessary additions in the revised manuscript.

  1. Referee: [theoretical analysis / convergence proof] The differential-inclusion convergence argument (theoretical analysis section) relies on the damping exp(-λρ) being triggered exactly when the subdifferential is large. No lemma is supplied showing that the variance ρ of randomized directional derivatives is monotone in, or bounded by, the diameter of the Clarke subdifferential; without this relation the activation condition for the damping term is unverified and the almost-sure O(1/√T) guarantee does not follow.

    Authors: We agree that an explicit lemma establishing the relationship between ρ and the subdifferential diameter is required to verify the damping activation and complete the convergence argument. In the revised version we will add Lemma 3.2 in the theoretical analysis section, proving that ρ is bounded above by the diameter of the Clarke subdifferential under the randomized directional derivative probing scheme. The proof will rely on the definition of the Clarke subdifferential and the variance of directional derivatives. We will also revise the differential-inclusion argument to cite this lemma directly, thereby confirming the almost-sure O(1/√T) rate to (δ,ε)-Clarke stationary points. revision: yes

  2. Referee: [LGI metric definition] The definition of the LGI metric (abstract and method section) asserts that variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter, yet no supporting result (e.g., concentration inequality or monotonicity lemma) is provided; this estimator is load-bearing for both the adaptive mechanism and the convergence claim.

    Authors: We acknowledge that the manuscript would benefit from an explicit supporting result for the LGI estimator. In the revision we will insert a new proposition in the method section that provides both a monotonicity relation and a concentration inequality showing that the empirical variance of randomized directional derivatives approximates the subdifferential diameter with high probability for a sufficient number of probes. These results will be referenced in the abstract and will substantiate the estimator's role in the adaptive damping term. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper introduces the LGI metric as derived from variance of randomized directional derivatives to estimate Clarke subdifferential diameter, incorporates the damping exp(-λρ) into S-Adam, and states a convergence result to (δ,ε)-Clarke points at O(1/√T) via differential inclusions. No quoted step reduces the claimed result to its inputs by construction, renames a known pattern, or relies on a self-citation chain for a uniqueness theorem; the theoretical argument invokes standard tools for non-smooth optimization while empirical accuracy numbers are reported as separate evaluations. The scalar λ is a conventional hyperparameter and does not force the rate or almost-sure guarantee by definition.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger records components explicitly named in it; several implementation details remain unspecified.

free parameters (1)
  • λ
    Scalar controlling the strength of the exp(-λρ) damping term; value not derived from the analysis.
axioms (1)
  • domain assumption: Convergence analysis via differential inclusions applies to the discrete S-Adam iterates
    Invoked to obtain the almost-sure convergence statement to Clarke stationary points.
invented entities (1)
  • Local Geometric Instability (LGI) metric (no independent evidence)
    purpose: Estimator of Clarke subdifferential diameter from variance of randomized directional derivatives
    Newly defined quantity used to drive the adaptive damping; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5792 in / 1274 out tokens · 29146 ms · 2026-06-29T08:40:19.116723+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

    Curran Associates Inc. ISBN 9781510860964. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018. doi: 10.1109/CVPR.2018.00286. K...

  2. [2]

    org/CorpusID:6628106

    URL https://api.semanticscholar.org/CorpusID:6628106. Kwon, J., Kim, J., Park, H., and Choi, I. K. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pp. 5905–5914. PMLR, 2021. Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss l...

  3. [3]

    On the difficulty of training Recurrent Neural Networks

    URL https://api.semanticscholar.org/CorpusID:2391217. Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014. Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks, 2013. URL https://arxiv.org/abs/1211.5063. ...

  4. [4]

    ISBN 9798331314385

    Curran Associates Inc. ISBN 9798331314385. Shor, N. Z. Minimization Methods for Non-Differentiable Functions, volume 3 of Springer Series in Computational Mathematics. Springer Berlin, Heidelberg, 1985. ISBN 978-3-642-82118-9. doi: 10.1007/978-3-642-82118-9. Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Stat...

  5. [5]

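    Error decomposition. We have
    $$\left|\hat\sigma^2 - \sigma^2\right| = \left|\left[\frac{1}{k}\sum_i Y_i^2 - E[Y^2]\right] - \left[\hat\mu^2 - \mu^2\right]\right| \le \left|\frac{1}{k}\sum_i Y_i^2 - E[Y^2]\right| + \left|\hat\mu^2 - \mu^2\right|. \tag{48}$$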

  6. [6]

    Estimation error of $\frac{1}{k}\sum_i Y_i^2$. Let $Z_i = Y_i^2$; then $0 \le Z_i \le L^2$ and $E[Z_i] = E[Y^2]$. Therefore, applying Hoeffding's inequality, we have
    $$P\left(\left|\frac{1}{k}\sum_i Z_i - E[Z]\right| \ge \frac{\tau}{2}\right) \le 2\exp\left(-\frac{2k(\tau/2)^2}{L^4}\right) = 2\exp\left(-\frac{k\tau^2}{2L^4}\right). \tag{49}$$

  7. [7]

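    Estimation error of $\hat\mu^2$. Since $\|Y_i\| \le L$, we have $|\hat\mu| \le L$ and $|\mu| \le L$. Then
    $$|\hat\mu^2 - \mu^2| \le |\hat\mu - \mu| \cdot (|\hat\mu| + |\mu|) \le 2L|\hat\mu - \mu|. \tag{50}$$
    Applying Hoeffding's inequality to the estimate $\hat\mu$ gives
    $$P(|\hat\mu - \mu| \ge s) \le 2\exp\left(-\frac{2ks^2}{(2L)^2}\right) = 2\exp\left(-\frac{ks^2}{2L^2}\right). \tag{51}$$
    If we let $|\hat\mu^2 - \mu^2| \le 2L|\hat\mu - \mu| < \tau/2$, then we have
    $$|\hat\mu - \mu| < \frac{\tau}{4L} \;\Rightarrow\; |\hat\mu^2 - \mu^2| < \frac{\tau}{2}. \tag{52}$$
    So $P\left(|\hat\mu - \mu| < \frac{\tau}{4L}\right) \le$ P(|ˆ...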

  8. [8]

    Estimation error of the variance $\hat\sigma^2$. If $\left|\frac{1}{k}\sum_i Y_i^2 - E[Y^2]\right| < \tau/2$ and $|\hat\mu^2 - \mu^2| < \tau/2$, then $|\hat\sigma^2 - \sigma^2| < \tau$. Therefore, we obtain the error of the variance estimate from Eqns. (49) and (54) as follows:
    $$P\left(|\hat\sigma^2 - \sigma^2| \ge \tau\right) \le P\left(\left|\frac{1}{k}\sum_i Y_i^2 - E[Y^2]\right| \ge \frac{\tau}{2}\right) + P\left(|\hat\mu^2 - \mu^2| \ge \frac{\tau}{2}\right) \le 4\exp\left(-\frac{k\tau^2}{32L^4}\right). \tag{55}$$

  9. [9]

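    Error Analysis of LGI Estimation. Define a function $g(a, b) = b/(a + b + \epsilon)$, where $a = \mu^2$ and $b = \sigma^2$. Given that $a, b \ge 0$ and $a + b \le L^2$, we calculate the two partial derivatives respectively:
    $$\left|\frac{\partial g}{\partial a}\right| = \left|-\frac{b}{(a+b+\epsilon)^2}\right| \le \frac{L^2}{\epsilon^2}, \qquad \left|\frac{\partial g}{\partial b}\right| = \frac{a+\epsilon}{(a+b+\epsilon)^2} \le \frac{a+b+\epsilon}{(a+b+\epsilon)^2} \le \frac{\epsilon}{\epsilon^2}. \tag{56}$$
    Then according to the Mean Value Theorem and the Cauchy-Schwarz inequality, we have $|\hat\rho_k - \rho| =$ |...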

  10. [10]

    Upper bound of joint probability. If $|\hat\mu - \mu| < \tau$ and $|\hat\sigma^2 - \sigma^2| < \tau$, then we have
    $$|\hat\rho_k - \rho| \le M\sqrt{(\hat\mu^2 - \mu^2)^2 + (\hat\sigma^2 - \sigma^2)^2} < M\sqrt{(2L\tau)^2 + \tau^2} = M\tau\sqrt{4L^2 + 1}. \tag{58}$$
    Let $\Delta = M\tau\sqrt{4L^2 + 1}$, so that $\tau = \frac{\Delta}{M\sqrt{4L^2 + 1}}$. With Eqns. (51) and (55),
    $$P(|\hat\rho_k - \rho| \ge \Delta) \le P(|\hat\mu - \mu| \ge \tau) + P(|\hat\sigma^2 - \sigma^2| \ge \tau) \le 2\exp\left(-\frac{k\tau^2}{2L^2}\right) + 4\exp\left(-\frac{k\tau^2}{32L^4}\right) \le 6\exp\left(-\frac{k\tau^2}{32L^4}\right) = 6\exp\left(-\frac{k\Delta^2}{32M^2L^4(4L^2 + 1)}\right). \tag{59}$$

  11. [11]

    Sample complexity. We set the upper bound of the probability to be less than $\delta$, that is,
    $$6\exp\left(-\frac{k\Delta^2}{32M^2L^4(4L^2 + 1)}\right) \le \delta \;\Rightarrow\; k \ge \frac{32L^4M^2(4L^2 + 1)}{\Delta^2}\log\frac{6}{\delta}. \tag{60}$$
    Finally, noticing that $M = O(1/\epsilon^2)$, we obtain
    $$k = O\left(\frac{L^6}{\epsilon^4\Delta^2}\log(1/\delta)\right). \tag{61}$$
    D. Equivalence conditions between S-Adam and the proximal method (Prox-SGD). Let us define the proximal operator with a time-varying reg...
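
To make the sample-complexity bound concrete, the sketch below evaluates Eq. (60) and the failure bound of Eq. (59) numerically. It is only an illustration: `probes_needed` and `failure_bound` are hypothetical helper names, M is treated as a given constant (the excerpt only states M = O(1/ε²)), and the numeric inputs are made up.

```python
import math

def probes_needed(L, M, Delta, delta):
    """Smallest integer k satisfying Eq. (60) for accuracy Delta and failure level delta."""
    bound = 32.0 * L**4 * M**2 * (4.0 * L**2 + 1.0) / Delta**2 * math.log(6.0 / delta)
    return math.ceil(bound)

def failure_bound(k, L, M, Delta):
    """Right-hand side of Eq. (59): bound on P(|rho_hat_k - rho| >= Delta)."""
    return 6.0 * math.exp(-k * Delta**2 / (32.0 * M**2 * L**4 * (4.0 * L**2 + 1.0)))

# Illustrative inputs only; L and M are not values taken from the paper.
k = probes_needed(L=1.0, M=1.0, Delta=0.5, delta=0.05)
print("probes:", k, "failure bound:", failure_bound(k, L=1.0, M=1.0, Delta=0.5))
```

Because the bound grows as 1/Δ² and as L⁶ once the (4L² + 1) factor is included, the probe count it implies can be far larger than what is practical per step, so it reads as a worst-case guarantee rather than a tuning rule.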