Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

Borong She; Qiufeng Wang; Ruoran Xu; Xiaobo Jin

arxiv: 2605.29547 · v1 · pith:3LY4ID4Pnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · math.OC

Pith reviewed 2026-06-29 08:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC

keywords non-smooth optimization · Clarke stationary points · adaptive optimization · S-Adam · gradient chattering · quantization-aware training · directional derivatives

The pith

S-Adam uses variance of randomized directional derivatives to estimate local geometric instability and damp step sizes in non-smooth optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S-Adam, an optimizer designed for non-smooth loss landscapes common in deep learning due to activations like ReLU. It defines a Local Geometric Instability metric from the variance of randomized directional derivatives to estimate the diameter of the Clarke subdifferential. This metric drives an adaptive damping mechanism that slows updates in unstable regions. The method comes with a convergence proof to Clarke stationary points at the standard rate and shows empirical improvements on quantization-aware training tasks.

Core claim

S-Adam stabilizes training by modulating step sizes with an adaptive damping factor exp(-λρ) based on the LGI metric, and converges almost surely to (δ,ε)-Clarke stationary points at the O(1/√T) rate while improving accuracy on CIFAR-100 and TinyImageNet.

What carries the argument

The Local Geometric Instability (LGI) metric, computed as the variance of randomized directional derivatives to estimate Clarke subdifferential diameter, which modulates step sizes to avoid chattering.
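
As a rough illustration of the mechanism, the sketch below estimates an LGI-style score from randomized probes and turns it into a damping factor. The page does not spell out the probing scheme, so several choices here are assumptions: gradients are sampled at `k` random points on a sphere of radius `delta` around `x`, projected onto a fixed unit direction `d`, and the score is taken as σ̂² / (μ̂² + σ̂² + ε), following the normalized form g(a, b) = b/(a + b + ε) with a = μ², b = σ² that appears in the appendix fragment quoted further down this page. The names `lgi_estimate` and `damping` and the test function ‖x‖₁ are hypothetical.

```python
import numpy as np

def lgi_estimate(grad_fn, x, d, k=256, delta=1e-3, eps=1e-8, rng=None):
    """Estimate an LGI-style instability score at x (illustrative sketch).

    Assumption: probes are gradients at x + delta * u_i, with u_i uniform on
    the unit sphere, projected onto the fixed unit direction d.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((k, x.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)      # random unit directions
    y = np.array([grad_fn(x + delta * ui) @ d for ui in u])
    mu, var = y.mean(), y.var()                        # population variance (1/k)
    return var / (mu**2 + var + eps)

def damping(rho, lam=1.0):
    """Step-size multiplier exp(-lambda * rho) described in the paper."""
    return float(np.exp(-lam * rho))

# Toy non-smooth function f(x) = ||x||_1; its gradient is sign(x) off the kinks.
grad_l1 = np.sign

rng = np.random.default_rng(0)
dim = 4
d = np.ones(dim) / np.sqrt(dim)
rho_kink = lgi_estimate(grad_l1, np.zeros(dim), d, rng=rng)        # at the kink
rho_smooth = lgi_estimate(grad_l1, np.full(dim, 5.0), d, rng=rng)  # far from kinks
print(rho_kink, damping(rho_kink), rho_smooth, damping(rho_smooth))
```

Under these assumptions the score is close to 1 at the origin, where sampled gradients disagree in sign, and close to 0 where the gradient is locally constant, so the multiplier shrinks the step only near the kink.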

If this is right

S-Adam converges almost surely to (δ,ε)-Clarke stationary points at O(1/√T) rate.
It achieves up to 6% accuracy gains on CIFAR-100 and 3% on TinyImageNet compared to AdamW.
The damping mechanism mitigates gradient oscillations in high-noise small-batch settings.
It applies effectively to Quantization-Aware Training and other non-smooth regimes.
The analysis uses differential inclusions for the convergence guarantee.
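
For readers unfamiliar with the target notion, the standard definitions from the non-smooth optimization literature are written out below. This is the usual Goldstein-style formulation and is assumed, not confirmed, to match the paper's own definition; the paper may order or scale δ and ε differently.

```latex
% Clarke subdifferential of a locally Lipschitz f (standard definition)
\partial f(x) = \operatorname{conv}\Bigl\{ \lim_{j\to\infty} \nabla f(x_j) :
    x_j \to x,\ f \text{ differentiable at } x_j \Bigr\}

% Goldstein delta-subdifferential: convex hull over a delta-ball around x
\partial_\delta f(x) = \operatorname{conv}\Bigl( \bigcup_{y \in B_\delta(x)} \partial f(y) \Bigr)

% x is (delta, epsilon)-stationary when 0 is within epsilon of that set
\operatorname{dist}\bigl(0,\ \partial_\delta f(x)\bigr) \le \varepsilon
```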

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The randomized probing technique could extend to other first-order methods struggling with subdifferentials.
Accuracy gains suggest better generalization in quantized models, potentially reducing the need for post-training adjustments.
Similar instability metrics might apply to other non-differentiable components like max-pooling or certain activation functions.
Testing on larger models or different architectures could reveal scalability limits not addressed in the current experiments.

Load-bearing premise

The variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter that can be used to modulate step sizes without introducing new instability.

What would settle it

A counterexample or experiment on a simple non-smooth function where the LGI metric fails to correlate with actual subdifferential diameter, leading to divergence or worse performance than Adam.
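
One way such a check could be set up is sketched below on the synthetic function from Figure 1, whose Clarke subdifferential diameter is known in closed form at each point. The probing scheme and the normalized score σ̂² / (μ̂² + σ̂² + ε) are assumptions made for illustration, not the paper's confirmed construction, and the helper names are hypothetical.

```python
import numpy as np

def grad_f(p):
    """A gradient of f(x, y) = |x-1| + |y-1| + 0.5*(x^2 + y^2) off the kinks."""
    return np.sign(p - 1.0) + p

def lgi_estimate(x, d, k=2000, delta=1e-3, eps=1e-8, rng=None):
    # Assumed scheme: gradients at random points on a delta-sphere around x,
    # projected onto the unit direction d, then sigma^2 / (mu^2 + sigma^2 + eps).
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((k, x.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    y = grad_f(x + delta * u) @ d
    return y.var() / (y.mean() ** 2 + y.var() + eps)

rng = np.random.default_rng(0)
d = np.ones(2) / np.sqrt(2.0)
# (point, known Clarke subdifferential diameter at that point)
cases = [
    (np.array([3.0, 3.0]), 0.0),                # smooth: a single gradient
    (np.array([1.0, 3.0]), 2.0),                # kink in x only: segment of length 2
    (np.array([1.0, 1.0]), 2.0 * np.sqrt(2)),   # kink in both: the square [0, 2]^2
]
scores = [lgi_estimate(p, d, rng=rng) for p, _ in cases]
for (p, diam), rho in zip(cases, scores):
    print(f"point={p}, diameter={diam:.3f}, rho_hat={rho:.4f}")
```

Because the assumed score is normalized by μ², it also depends on gradient magnitude, so agreement at three points does not establish the relation in general; a counterexample search would vary the point, the direction `d`, and the probe radius.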

Figures

Figures reproduced from arXiv: 2605.29547 by Borong She, Qiufeng Wang, Ruoran Xu, Xiaobo Jin.

**Figure 1.** Geometric instability visualization on the synthetic non-smooth landscape f(x, y) = |x − 1| + |y − 1| + 0.5(x² + y²). Accompanying text (Section 1, Introduction): The theoretical underpinnings of deep learning optimization are predominantly built upon the assumption of Lipschitz-continuous gradients, which guarantees stable descent and convergence. However, this assumption is fundamentally incompatible with the architectural realitie…

**Figure 2.** [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** Smooth function (left) and non-smooth function (right). Accompanying text: Modern neural architectures violate the Lipschitz-smooth assumption at numerous points: ReLU activations induce kinks, quantization operators create step discontinuities, and sparsity regularizers introduce ℓ1-type nondifferentiabilities. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]

**Figure 4.** Loss curve. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** Training progress showing epoch vs. accuracy for both the CIFAR-100 and TinyImageNet datasets. Accompanying text (Section 6.5.1, Resilience to Extreme Stochastic Noise, N = 2): The efficacy of S-Adam is most pronounced in the batch size = 2 regime. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Loss curves for ResNet18 across the CIFAR10, CIFAR100, and Imagewoof2-160 datasets with varying batch sizes (4, 16, and 64). [PITH_FULL_IMAGE:figures/full_fig_p021_6.png]

Original abstract

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\epsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S-Adam adds a variance-based damping term to Adam for non-smooth losses, but the link from that variance to Clarke subdifferential diameter is not shown.

The main thing to know is that this paper defines a Local Geometric Instability metric from the variance of randomized directional derivatives, then uses it to multiply Adam steps by an exp(-λρ) factor when that variance is high. They claim this yields almost-sure convergence to (δ,ε)-Clarke points at the standard O(1/√T) rate via differential inclusions, plus 3-6% accuracy lifts on CIFAR-100 and TinyImageNet under quantization-aware training.
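
To make the "multiply Adam steps by exp(-λρ)" description concrete, here is a minimal NumPy sketch of one such update. It assumes the damping scales the learning rate after the usual bias-corrected Adam direction is formed; the paper may place the factor elsewhere or combine it with weight decay, and the function name `s_adam_step` is hypothetical.

```python
import numpy as np

def s_adam_step(param, grad, m, v, t, rho, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, lam=1.0):
    """One Adam update with the step scaled by exp(-lam * rho) (sketch).

    Assumption: rho is a precomputed instability score in [0, 1) and the
    damping multiplies the learning rate; t is the 1-based step counter.
    """
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad**2        # second-moment estimate
    m_hat = m / (1.0 - beta1**t)                   # bias corrections
    v_hat = v / (1.0 - beta2**t)
    step = lr * np.exp(-lam * rho) * m_hat / (np.sqrt(v_hat) + eps)
    return param - step, m, v

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])
zeros = np.zeros_like(w)
w_plain, _, _ = s_adam_step(w, g, zeros, zeros, t=1, rho=0.0)   # reduces to Adam
w_damped, _, _ = s_adam_step(w, g, zeros, zeros, t=1, rho=0.9)  # flagged as unstable
print(w - w_plain, w - w_damped)
```

With `rho=0` the update is the ordinary Adam step; with `rho=0.9` and `lam=1.0` the same direction is taken with the step shrunk by exp(-0.9).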

What is actually new is the concrete LGI construction and its placement inside an Adam-style update with the specific damping rule. Standard non-smooth optimization work does not combine randomized probing, variance estimation, and this exact adaptive term in the same way.

The paper correctly flags gradient chattering as a practical issue when ReLUs and quantization make the loss non-smooth, and the idea of modulating step size by a local instability signal is reasonable on its face.

The central weakness is exactly the one in the stress-test note. The convergence argument requires that the directional-derivative variance be monotone in or bounded by the Clarke subdifferential diameter so the damping activates at the right times. No such relation is supplied in the abstract, and without it the differential-inclusion analysis does not go through. λ is also a free parameter whose selection is not derived from first principles. The reported gains lack any mention of controls, error bars, or dataset splits, so they cannot be assessed.

This work is aimed at people already tuning Adam variants for quantized or small-batch regimes. A reader who wants a concrete heuristic to test on their own non-smooth models could extract the LGI formula and try it, but the theory needs the missing lemma before it can be trusted.

It is worth sending to peer review because the problem is genuine and the proposal is specific enough to critique in detail, even though the current argument has a load-bearing gap.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Singularity-aware Adam (S-Adam) for non-smooth optimization in deep learning. It introduces the Local Geometric Instability (LGI) metric ρ, defined as the variance of randomized directional derivatives, as an estimator of Clarke subdifferential diameter. An adaptive damping term exp(-λρ) modulates step sizes to reduce chattering. The paper claims a rigorous convergence proof via differential inclusions establishing almost-sure convergence to (δ,ε)-Clarke stationary points at the optimal O(1/√T) rate, together with empirical accuracy gains of up to 6% on CIFAR-100 and 3% on TinyImageNet versus AdamW and Prox-SGD in quantization-aware and high-noise settings.

Significance. If the central technical relation between the LGI variance estimator and Clarke subdifferential diameter can be established and the differential-inclusion argument completed, the work would supply a theoretically grounded adaptive mechanism for stabilizing training under non-smoothness induced by ReLUs and quantization, addressing a practically relevant gap between smooth-assumption optimizers and modern architectures.

major comments (2)

[theoretical analysis / convergence proof] The differential-inclusion convergence argument (theoretical analysis section) relies on the damping exp(-λρ) being triggered exactly when the subdifferential is large. No lemma is supplied showing that the variance ρ of randomized directional derivatives is monotone in, or bounded by, the diameter of the Clarke subdifferential; without this relation the activation condition for the damping term is unverified and the almost-sure O(1/√T) guarantee does not follow.
[LGI metric definition] The definition of the LGI metric (abstract and method section) asserts that variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter, yet no supporting result (e.g., concentration inequality or monotonicity lemma) is provided; this estimator is load-bearing for both the adaptive mechanism and the convergence claim.

minor comments (2)

[method / damping mechanism] The hyperparameter λ in exp(-λρ) is introduced without derivation or sensitivity analysis; its status as a free parameter should be clarified.
[experiments] Experimental section lacks reported error bars, statistical significance tests, precise dataset splits, batch-size schedules, and ablation on the number of directional probes used to compute ρ.
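
A λ sensitivity study of the kind requested could start from a toy problem before moving to full training runs. The sketch below runs fixed-step subgradient descent on f(x) = |x| with the exp(-λρ) damping and reports how far the iterates still swing around the kink. Everything about the probing (two-point probes at x ± `delta`, the normalized score, the constants) is an assumption chosen for illustration, and `tail_amplitude` is a hypothetical helper.

```python
import numpy as np

def tail_amplitude(lam, lr=0.3, x0=1.0, steps=200, k=32, delta=0.5,
                   eps=1e-8, seed=0):
    """Run damped subgradient descent on f(x) = |x| and return max |x| over
    the last 50 iterates. Probing scheme and constants are illustrative."""
    rng = np.random.default_rng(seed)
    x, tail = x0, []
    for t in range(steps):
        # Probe gradients at x +/- delta; the signs disagree only near the kink.
        u = rng.choice([-1.0, 1.0], size=k)
        y = np.sign(x + delta * u)
        rho = y.var() / (y.mean() ** 2 + y.var() + eps)
        x -= lr * np.exp(-lam * rho) * np.sign(x)
        if t >= steps - 50:
            tail.append(abs(x))
    return max(tail)

for lam in (0.0, 1.0, 3.0):
    print(f"lambda={lam}: tail amplitude={tail_amplitude(lam):.4f}")
```

With `lam=0.0` the iterates bounce between roughly 0.1 and -0.2 indefinitely, which is the chattering the paper describes; larger `lam` shrinks the residual oscillation at the cost of slower progress once the probes straddle the kink.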

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the connection between the LGI metric and the Clarke subdifferential requires explicit supporting results to fully justify the adaptive mechanism and convergence claim. We address each point below and will incorporate the necessary additions in the revised manuscript.

Referee: [theoretical analysis / convergence proof] The differential-inclusion convergence argument (theoretical analysis section) relies on the damping exp(-λρ) being triggered exactly when the subdifferential is large. No lemma is supplied showing that the variance ρ of randomized directional derivatives is monotone in, or bounded by, the diameter of the Clarke subdifferential; without this relation the activation condition for the damping term is unverified and the almost-sure O(1/√T) guarantee does not follow.

Authors: We agree that an explicit lemma establishing the relationship between ρ and the subdifferential diameter is required to verify the damping activation and complete the convergence argument. In the revised version we will add Lemma 3.2 in the theoretical analysis section, proving that ρ is bounded above by the diameter of the Clarke subdifferential under the randomized directional derivative probing scheme. The proof will rely on the definition of the Clarke subdifferential and the variance of directional derivatives. We will also revise the differential-inclusion argument to cite this lemma directly, thereby confirming the almost-sure O(1/√T) rate to (δ,ε)-Clarke stationary points. revision: yes
Referee: [LGI metric definition] The definition of the LGI metric (abstract and method section) asserts that variance of randomized directional derivatives yields a reliable estimator of Clarke subdifferential diameter, yet no supporting result (e.g., concentration inequality or monotonicity lemma) is provided; this estimator is load-bearing for both the adaptive mechanism and the convergence claim.

Authors: We acknowledge that the manuscript would benefit from an explicit supporting result for the LGI estimator. In the revision we will insert a new proposition in the method section that provides both a monotonicity relation and a concentration inequality showing that the empirical variance of randomized directional derivatives approximates the subdifferential diameter with high probability for a sufficient number of probes. These results will be referenced in the abstract and will substantiate the estimator's role in the adaptive damping term. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper introduces the LGI metric as derived from variance of randomized directional derivatives to estimate Clarke subdifferential diameter, incorporates the damping exp(-λρ) into S-Adam, and states a convergence result to (δ,ε)-Clarke points at O(1/√T) via differential inclusions. No quoted step reduces the claimed result to its inputs by construction, renames a known pattern, or relies on a self-citation chain for a uniqueness theorem; the theoretical argument invokes standard tools for non-smooth optimization while empirical accuracy numbers are reported as separate evaluations. The scalar λ is a conventional hyperparameter and does not force the rate or almost-sure guarantee by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger records components explicitly named in it; several implementation details remain unspecified.

free parameters (1)

λ
Scalar controlling the strength of the exp(-λρ) damping term; value not derived from the analysis.

axioms (1)

domain assumption: Convergence analysis via differential inclusions applies to the discrete S-Adam iterates
Invoked to obtain the almost-sure convergence statement to Clarke stationary points.

invented entities (1)

Local Geometric Instability (LGI) metric (no independent evidence)
purpose: Estimator of Clarke subdifferential diameter from variance of randomized directional derivatives
Newly defined quantity used to drive the adaptive damping; no independent evidence supplied in abstract.

Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Curran Associates Inc. ISBN 9781510860964. Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018. doi: 10.1109/CVPR.2018.00286. K...

work page doi:10.1109/cvpr.2018.00286 2018
[2]

org/CorpusID:6628106

URL https://api.semanticscholar.org/CorpusID:6628106. Kwon, J., Kim, J., Park, H., and Choi, I. K. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pp. 5905–5914. PMLR, 2021. Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss l...

2021
[3]

On the difficulty of training Recurrent Neural Networks

URL https://api.semanticscholar.org/CorpusID:2391217. Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014. Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks, 2013. URL https://arxiv.org/abs/1211.5063. ...

work page internal anchor Pith review Pith/arXiv arXiv 2014
[4]

ISBN 9798331314385

Curran Associates Inc. ISBN 9798331314385. Shor, N. Z. Minimization Methods for Non-Differentiable Functions, volume 3 of Springer Series in Computational Mathematics. Springer Berlin, Heidelberg, 1985. ISBN 978-3-642-82118-9. doi: 10.1007/978-3-642-82118-9. Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Stat...

work page doi:10.1007/978-3-642-82118-9 1985
[5]

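Error decomposition. We have
$$|\hat\sigma^2 - \sigma^2| = \left| \left[ \frac{1}{k}\sum_i Y_i^2 - E[Y^2] \right] - \left[ \hat\mu^2 - \mu^2 \right] \right| \le \left| \frac{1}{k}\sum_i Y_i^2 - E[Y^2] \right| + |\hat\mu^2 - \mu^2|. \tag{48}$$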
[6]

Estimation error of $\frac{1}{k}\sum_i Y_i^2$. Let $Z_i = Y_i^2$; then $0 \le Z_i \le L^2$ and $E[Z_i] = E[Y^2]$. Therefore, applying Hoeffding's inequality, we have
$$P\left( \left| \frac{1}{k}\sum_i Z_i - E[Z] \right| \ge \frac{\tau}{2} \right) \le 2\exp\left( -\frac{2k(\tau/2)^2}{L^4} \right) = 2\exp\left( -\frac{k\tau^2}{2L^4} \right). \tag{49}$$
[7]

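Estimation error of $\hat\mu^2$. Since $\|Y_i\| \le L$, we have $|\hat\mu| \le L$ and $|\mu| \le L$. Then
$$|\hat\mu^2 - \mu^2| \le |\hat\mu - \mu| \cdot (|\hat\mu| + |\mu|) \le 2L|\hat\mu - \mu|. \tag{50}$$
Applying Hoeffding's inequality to the estimate $\hat\mu$ gives
$$P(|\hat\mu - \mu| \ge s) \le 2\exp\left( -\frac{2ks^2}{(2L)^2} \right) = 2\exp\left( -\frac{ks^2}{2L^2} \right). \tag{51}$$
If we require $|\hat\mu^2 - \mu^2| \le 2L|\hat\mu - \mu| < \tau/2$, then
$$|\hat\mu - \mu| < \frac{\tau}{4L} \;\Rightarrow\; |\hat\mu^2 - \mu^2| < \frac{\tau}{2}. \tag{52}$$
So $P\left( |\hat\mu - \mu| < \frac{\tau}{4L} \right) \le$ ...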
[8]

Estimation error of the variance $\hat\sigma^2$. If $\left| \frac{1}{k}\sum_i Y_i^2 - E[Y^2] \right| < \tau/2$ and $|\hat\mu^2 - \mu^2| < \tau/2$, then $|\hat\sigma^2 - \sigma^2| < \tau$. Therefore, we obtain the error of the variance estimate from Eqns. (49) and (54) as follows:
$$P\left( |\hat\sigma^2 - \sigma^2| \ge \tau \right) \le P\left( \left| \frac{1}{k}\sum_i Y_i^2 - E[Y^2] \right| \ge \frac{\tau}{2} \right) + P\left( |\hat\mu^2 - \mu^2| \ge \frac{\tau}{2} \right) \le 4\exp\left( -\frac{k\tau^2}{32L^4} \right). \tag{55}$$
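
A quick numerical sanity check of the bound in Eq. (55) can be run by simulation. The sketch below draws bounded samples, forms the variance estimate as (1/k) Σ Y² minus μ̂², and compares the observed failure frequency against 4·exp(−kτ²/(32L⁴)). The uniform distribution on [−L, L] and all constants are assumptions picked for illustration; the bound is stated for any Y bounded by L.

```python
import numpy as np

def variance_tail_check(k=2000, tau=0.2, L=1.0, trials=1000, seed=0):
    """Compare the empirical frequency of |sigma_hat^2 - sigma^2| >= tau with
    the bound 4*exp(-k*tau^2 / (32*L^4)) for Y uniform on [-L, L]."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(-L, L, size=(trials, k))
    # Plug-in variance estimate: (1/k) sum Y_i^2 - mu_hat^2, one per trial.
    sigma2_hat = (y**2).mean(axis=1) - y.mean(axis=1) ** 2
    sigma2 = L**2 / 3.0                      # exact variance of U[-L, L]
    empirical = float(np.mean(np.abs(sigma2_hat - sigma2) >= tau))
    bound = float(4.0 * np.exp(-k * tau**2 / (32.0 * L**4)))
    return empirical, bound

empirical, bound = variance_tail_check()
print(f"empirical={empirical:.4f}, bound={bound:.4f}")
```

The Hoeffding-based bound is loose, so the observed frequency is expected to sit well below it; the check is only useful for catching a misplaced constant or sign.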
[9]

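Error analysis of LGI estimation. Define a function $g(a, b) = b/(a + b + \epsilon)$, where $a = \mu^2$ and $b = \sigma^2$. Given that $a, b \ge 0$ and $a + b \le L^2$, the two partial derivatives satisfy
$$\left| \frac{\partial g}{\partial a} \right| = \left| -\frac{b}{(a+b+\epsilon)^2} \right| \le \frac{L^2}{\epsilon^2}, \qquad \frac{\partial g}{\partial b} = \frac{a+\epsilon}{(a+b+\epsilon)^2} \le \frac{a+b+\epsilon}{(a+b+\epsilon)^2} \le \frac{\epsilon}{\epsilon^2}. \tag{56}$$
Then according to the Mean Value Theorem and the Cauchy-Schwarz inequality, we have $|\hat\rho_k - \rho| =$ ...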
[10]

Upper bound of joint probability. If $|\hat\mu - \mu| < \tau$ and $|\hat\sigma^2 - \sigma^2| < \tau$, then we have
$$|\hat\rho_k - \rho| \le M\sqrt{(\hat\mu^2 - \mu^2)^2 + (\hat\sigma^2 - \sigma^2)^2} < M\sqrt{(2L\tau)^2 + \tau^2} = M\tau\sqrt{4L^2 + 1}. \tag{58}$$
Let $\Delta = M\tau\sqrt{4L^2 + 1}$, so that $\tau = \frac{\Delta}{M\sqrt{4L^2 + 1}}$. With Eqns. (51) and (55),
$$P(|\hat\rho_k - \rho| \ge \Delta) \le P(|\hat\mu - \mu| \ge \tau) + P(|\hat\sigma^2 - \sigma^2| \ge \tau) \le 2\exp\left( -\frac{k\tau^2}{2L^2} \right) + 4\exp\left( -\frac{k\tau^2}{32L^4} \right) \le 6\exp\left( -\frac{k\tau^2}{32L^4} \right) = 6\exp\left( -\frac{k\Delta^2}{32M^2L^4(4L^2 + 1)} \right). \tag{59}$$
[11]

Sample complexity. We set the upper bound of the probability to be less than $\delta$, that is,
$$6\exp\left( -\frac{k\Delta^2}{32M^2L^4(4L^2 + 1)} \right) \le \delta \;\Rightarrow\; k \ge \frac{32L^4M^2(4L^2 + 1)}{\Delta^2}\log\frac{6}{\delta}. \tag{60}$$
Finally, noting that $M = O(1/\epsilon^2)$, we obtain
$$k = O\left( \frac{L^6}{\epsilon^4\Delta^2}\log(1/\delta) \right). \tag{61}$$
D. Equivalence conditions between S-Adam and the proximal method (Prox-SGD). Let us define the proximal operator with a time-varying reg...
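
To get a feel for what Eq. (60) implies in practice, the arithmetic can be wrapped in a small helper. The sketch below computes the smallest integer number of probes k that satisfies Eq. (60) and evaluates the right-hand side of Eq. (59) at that k. The function names and the constants plugged in (L, M, Δ, δ) are hypothetical and chosen only to exercise the formula.

```python
import math

def probes_needed(L, M, Delta, delta):
    """Smallest integer k satisfying Eq. (60):
    k >= 32 * L^4 * M^2 * (4 L^2 + 1) / Delta^2 * log(6 / delta)."""
    bound = (32.0 * L**4 * M**2 * (4.0 * L**2 + 1.0) / Delta**2
             * math.log(6.0 / delta))
    return math.ceil(bound)

def failure_bound(k, L, M, Delta):
    """Right-hand side of Eq. (59):
    6 * exp(-k * Delta^2 / (32 * M^2 * L^4 * (4 L^2 + 1)))."""
    return 6.0 * math.exp(-k * Delta**2
                          / (32.0 * M**2 * L**4 * (4.0 * L**2 + 1.0)))

# Hypothetical constants, not values from the paper.
L, M, Delta, delta = 1.0, 1.0, 0.5, 0.05
k = probes_needed(L, M, Delta, delta)
print(k, failure_bound(k, L, M, Delta))
```

Since M scales like 1/ε², the required k grows quickly as ε shrinks, which is worth keeping in mind when reading the claim that the estimator is computationally efficient.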

2019
