Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Pith reviewed 2026-06-29 08:40 UTC · model grok-4.3
The pith
S-Adam uses variance of randomized directional derivatives to estimate local geometric instability and damp step sizes in non-smooth optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S-Adam stabilizes training by modulating step sizes with an adaptive damping factor exp(-λρ) based on the LGI metric, and converges almost surely to (δ,ε)-Clarke stationary points at the O(1/√T) rate while improving accuracy on CIFAR-100 and TinyImageNet.
What carries the argument
The Local Geometric Instability (LGI) metric, computed as the variance of randomized directional derivatives to estimate Clarke subdifferential diameter, which modulates step sizes to avoid chattering.
If this is right
- S-Adam converges almost surely to (δ,ε)-Clarke stationary points at O(1/√T) rate.
- It achieves up to 6% accuracy gains on CIFAR-100 and 3% on TinyImageNet compared to AdamW.
- The damping mechanism mitigates gradient oscillations in high-noise small-batch settings.
- It applies effectively to Quantization-Aware Training and other non-smooth regimes.
- The analysis uses differential inclusions for the convergence guarantee.
Where Pith is reading between the lines
- The randomized probing technique could extend to other first-order methods struggling with subdifferentials.
- Accuracy gains suggest better generalization in quantized models, potentially reducing the need for post-training adjustments.
- Similar instability metrics might apply to other non-differentiable components like max-pooling or certain activation functions.
- Testing on larger models or different architectures could reveal scalability limits not addressed in the current experiments.
Load-bearing premise
The variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter that can be used to modulate step sizes without introducing new instability.
What would settle it
A counterexample or experiment on a simple non-smooth function where the LGI metric fails to correlate with actual subdifferential diameter, leading to divergence or worse performance than Adam.
Figures
read the original abstract
Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism exp(-$\lambda$$\rho$) that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to ($\delta$,$\epsilon$)-Clarke stationary points at the optimal O(1/$\sqrt(T)$) rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Singularity-aware Adam (S-Adam) for non-smooth optimization in deep learning. It introduces the Local Geometric Instability (LGI) metric ρ, defined as the variance of randomized directional derivatives, as an estimator of Clarke subdifferential diameter. An adaptive damping term exp(-λρ) modulates step sizes to reduce chattering. The paper claims a rigorous convergence proof via differential inclusions establishing almost-sure convergence to (δ,ε)-Clarke stationary points at the optimal O(1/√T) rate, together with empirical accuracy gains of up to 6% on CIFAR-100 and 3% on TinyImageNet versus AdamW and Prox-SGD in quantization-aware and high-noise settings.
Significance. If the central technical relation between the LGI variance estimator and Clarke subdifferential diameter can be established and the differential-inclusion argument completed, the work would supply a theoretically grounded adaptive mechanism for stabilizing training under non-smoothness induced by ReLUs and quantization, addressing a practically relevant gap between smooth-assumption optimizers and modern architectures.
major comments (2)
- [theoretical analysis / convergence proof] The differential-inclusion convergence argument (theoretical analysis section) relies on the damping exp(-λρ) being triggered exactly when the subdifferential is large. No lemma is supplied showing that the variance ρ of randomized directional derivatives is monotone in, or bounded by, the diameter of the Clarke subdifferential; without this relation the activation condition for the damping term is unverified and the almost-sure O(1/√T) guarantee does not follow.
- [LGI metric definition] The definition of the LGI metric (abstract and method section) asserts that variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter, yet no supporting result (e.g., concentration inequality or monotonicity lemma) is provided; this estimator is load-bearing for both the adaptive mechanism and the convergence claim.
minor comments (2)
- [method / damping mechanism] The hyperparameter λ in exp(-λρ) is introduced without derivation or sensitivity analysis; its status as a free parameter should be clarified.
- [experiments] Experimental section lacks reported error bars, statistical significance tests, precise dataset splits, batch-size schedules, and ablation on the number of directional probes used to compute ρ.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments correctly identify that the connection between the LGI metric and the Clarke subdifferential requires explicit supporting results to fully justify the adaptive mechanism and convergence claim. We address each point below and will incorporate the necessary additions in the revised manuscript.
read point-by-point responses
-
Referee: [theoretical analysis / convergence proof] The differential-inclusion convergence argument (theoretical analysis section) relies on the damping exp(-λρ) being triggered exactly when the subdifferential is large. No lemma is supplied showing that the variance ρ of randomized directional derivatives is monotone in, or bounded by, the diameter of the Clarke subdifferential; without this relation the activation condition for the damping term is unverified and the almost-sure O(1/√T) guarantee does not follow.
Authors: We agree that an explicit lemma establishing the relationship between ρ and the subdifferential diameter is required to verify the damping activation and complete the convergence argument. In the revised version we will add Lemma 3.2 in the theoretical analysis section, proving that ρ is bounded above by the diameter of the Clarke subdifferential under the randomized directional derivative probing scheme. The proof will rely on the definition of the Clarke subdifferential and the variance of directional derivatives. We will also revise the differential-inclusion argument to cite this lemma directly, thereby confirming the almost-sure O(1/√T) rate to (δ,ε)-Clarke stationary points. revision: yes
-
Referee: [LGI metric definition] The definition of the LGI metric (abstract and method section) asserts that variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter, yet no supporting result (e.g., concentration inequality or monotonicity lemma) is provided; this estimator is load-bearing for both the adaptive mechanism and the convergence claim.
Authors: We acknowledge that the manuscript would benefit from an explicit supporting result for the LGI estimator. In the revision we will insert a new proposition in the method section that provides both a monotonicity relation and a concentration inequality showing that the empirical variance of randomized directional derivatives approximates the subdifferential diameter with high probability for a sufficient number of probes. These results will be referenced in the abstract and will substantiate the estimator's role in the adaptive damping term. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper introduces the LGI metric as derived from variance of randomized directional derivatives to estimate Clarke subdifferential diameter, incorporates the damping exp(-λρ) into S-Adam, and states a convergence result to (δ,ε)-Clarke points at O(1/√T) via differential inclusions. No quoted step reduces the claimed result to its inputs by construction, renames a known pattern, or relies on a self-citation chain for a uniqueness theorem; the theoretical argument invokes standard tools for non-smooth optimization while empirical accuracy numbers are reported as separate evaluations. The scalar λ is a conventional hyperparameter and does not force the rate or almost-sure guarantee by definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- λ
axioms (1)
- domain assumption Convergence analysis via differential inclusions applies to the discrete S-Adam iterates
invented entities (1)
-
Local Geometric Instability (LGI) metric
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Curran Associates Inc. ISBN 9781510860964. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer- arithmetic-only inference. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2704– 2713, 2018. doi: 10.1109/CVPR.2018.00286. K...
-
[2]
org/CorpusID:6628106
URL https://api.semanticscholar. org/CorpusID:6628106. Kwon, J., Kim, J., Park, H., and Choi, I. K. Asam: Adaptive sharpness-aware minimization for scale-invariant learn- ing of deep neural networks. InInternational Conference on Machine Learning, pp. 5905–5914. PMLR, 2021. Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visu- alizing the loss l...
2021
-
[3]
On the difficulty of training Recurrent Neural Networks
URL https://api.semanticscholar. org/CorpusID:2391217. Parikh, N. and Boyd, S. Proximal algorithms.Foundations and Trends in Optimization, 1(3):127–239, 2014. Pascanu, R., Mikolov, T., and Bengio, Y . On the difficulty of training recurrent neural networks, 2013. URL https: //arxiv.org/abs/1211.5063. 10 Singularity-aware Optimization via Randomized Geomet...
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[4]
Curran Associates Inc. ISBN 9798331314385. Shor, N. Z.Minimization Methods for Non- Differentiable Functions, volume 3 ofSpringer Series in Computational Mathematics. Springer Berlin, Heidelberg, 1985. ISBN 978-3-642-82118-9. doi: 10.1007/978-3-642-82118-9. Wainwright, M. J.High-Dimensional Statistics: A Non- Asymptotic Viewpoint. Cambridge Series in Stat...
-
[5]
Error decomposition.We have ˆσ2 −σ 2 = " 1 k X i Y 2 i −E[Y 2] # −[ˆµ2 −µ 2] ≤ 1 k X i Y 2 i −E[Y 2] +|ˆµ2 −µ 2|.(48)
-
[6]
Estimation error of 1 k P i Y 2 i .Let Zi =Y 2 i , then 0≤Z i ≤L 2, and E[Zi] =E[Y 2]. Therefore, applying Hoeffding’s inequality, we have P 1 k X i Zi −E[Z] ≥ τ 2 ! ≤2 exp 2k(τ /2)2 L4 = 2 exp − kτ 2 2L4 .(49) 13 Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
-
[7]
Estimation error ofˆµ2.Since∥Y i∥ ≤L, we have|ˆµ| ≤Land|µ| ≤L. Then |ˆµ2 −µ 2| ≤ |ˆµ−µ| ·(|ˆµ|+|µ|)≤2L|ˆµ−µ|.(50) We have the following using Hoeffding’s inequality on the estimated valueˆµ: P(|ˆµ−µ| ≥s)≤2 exp − 2ks2 (2L)2 = 2 exp − ks2 2L2 .(51) If we let|ˆµ2 −µ 2| ≤2L|ˆµ−µ|< τ /2, then we have |ˆµ−µ|< τ 4L ⇒ |ˆµ2 −µ 2|< τ 2 .(52) So P |ˆµ−µ|< τ 4L ≤P |ˆ...
-
[8]
Therefore, we obtain the error of the variance estimate from Eqn
Estimation error of the variance ˆσ2.If 1 k P i Y 2 i −E[Y 2] < τ /2 and |ˆµ2 −µ2|< τ /2 , then |ˆσ2 −σ 2|< τ . Therefore, we obtain the error of the variance estimate from Eqn. (49) and (54) as follows P |ˆσ2 −σ 2| ≥τ ≤P 1 k X i Y 2 i −E[Y 2] ≥ τ 2 ! +P |ˆµ2 −µ 2| ≥ τ 2 ≤4 exp − kτ 2 32L4 .(55)
-
[9]
Error Analysis of LGI Estimation.Define a function g(a, b) =b/(a+b+ϵ) , where a=µ 2 and b=σ 2. Given that a, b≥0anda+b≤L 2, we calculate the two partial derivatives respectively ∂g ∂a = − b (a+b+ϵ) 2 ≤ L2 ϵ2 , ∂g ∂b = a+ϵ (a+b+ϵ) 2 ≤ a+b+ϵ (a+b+ϵ) 2 ≤ ϵ ϵ2 .(56) Then according to the Mean Value Theorem and the Cauchy-Schwarz inequality, we have |ˆρk −ρ|=|...
-
[10]
(51) and (55) P(|ˆρk −ρ| ≥∆)≤P(|ˆµ−µ| ≥τ) +P(|ˆσ 2 −σ 2| ≥τ) ≤2 exp − kτ 2 2L2 + 4 exp − kτ 2 32L4 ≤6 exp − kτ 2 32L4 = 6 exp − k∆2 32M2L4(4L2 + 1) (59)
Upper bound of joint probability.If|ˆµ−µ|< τand|ˆσ 2 −σ 2|< τ, then we have |ˆρk −ρ| ≤M p (ˆµ2 −µ 2)2 + (ˆσ2 −σ 2)2 < M p (2Lτ) 2 +τ 2 =M τ p 4L2 + 1(58) 14 Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization Let∆ =M τ √ 4L2 + 1, thenτ= ∆ M √ 4L2+1, with eqn. (51) and (55) P(|ˆρk −ρ| ≥∆)≤P(|ˆµ−µ| ≥τ) +P(...
-
[11]
Sample complexity.We set the upper bound of the probability to be less thanδ, that is, 6 exp − k∆2 32M2L4(4L2 + 1) ≤δ⇒k≥ 32L4M2(4L2 + 1) ∆2 log 6 δ .(60) Finally, notice thatM=O(1/ϵ 2), we obtain k=O L6 ϵ4∆2 log(1/δ) (61) D. Equivalence conditions between S-Adam and the proximal method (Prox-SGD) Let us define the proximal operator with a time-varying reg...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.