Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise

Luo Luo; Zhuanghua Liu

arxiv: 2605.24513 · v1 · pith:SMMNENRSnew · submitted 2026-05-23 · 💻 cs.LG

Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise

Zhuanghua Liu , Luo Luo This is my paper

Pith reviewed 2026-06-30 15:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords zeroth-order optimizationnonconvex nonsmooth optimizationheavy-tailed noiseGoldstein stationary pointstochastic optimizationclipped estimatoronline-to-nonconvex conversion

0 comments

The pith

A clipped two-point estimator in the online-to-nonconvex framework finds (δ, ε)-Goldstein stationary points for nonconvex Lipschitz objectives under p-moment bounded heavy-tailed noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a stochastic zeroth-order algorithm for nonconvex nonsmooth optimization when only noisy function values are available and the noise has heavy tails. It refines the online-to-nonconvex conversion approach by clipping the two-point gradient estimator to control the effect of unbounded moments. The resulting method reaches a (δ, ε)-Goldstein stationary point with oracle complexity O(d^{p/2(p-1)} δ^{-1} ε^{-(2p-1)/p-1}) for p in (1,2]. This bound recovers the best-known dimension dependence from convex zeroth-order work and the accuracy dependence from first-order nonconvex nonsmooth work. The contribution matters because many machine-learning tasks involve derivative-free access and realistic noise that violates lighter-tailed assumptions.

Core claim

By clipping the two-point gradient estimator inside the online-to-nonconvex conversion framework, the algorithm achieves zeroth-order oracle complexity O(d^{p/2(p-1)} δ^{-1} ε^{-(2p-1)/(p-1)}) to locate a (δ, ε)-Goldstein stationary point of a Lipschitz continuous objective whose stochastic evaluations have bounded p-moments, p ∈ (1,2].

What carries the argument

Clipped two-point gradient estimator inside the online-to-nonconvex conversion framework

If this is right

Dimension dependence on d matches the best-known stochastic zeroth-order results for convex nonsmooth problems.
Dependence on δ and ε matches the best-known stochastic first-order results for nonconvex nonsmooth problems.
The method applies directly to any Lipschitz objective whose stochastic evaluations satisfy the p-moment bound.
Numerical experiments confirm practical effectiveness on standard test problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clipping may serve as a modular robustness device for other derivative-free methods facing heavy-tailed noise.
The same conversion-plus-clipping pattern could be tested on problems where only noisy gradient oracles exist rather than function values.
Empirical checks could compare convergence when clipping is removed versus retained on data sets known to exhibit heavy tails.

Load-bearing premise

The objective is Lipschitz continuous and the stochastic function evaluations possess bounded p-moments for some p in (1,2].

What would settle it

Run the algorithm on a simple Lipschitz function whose noise has unbounded second moments and observe whether the iterates fail to approach any (δ, ε)-Goldstein point at the claimed rate.

Figures

Figures reproduced from arXiv: 2605.24513 by Luo Luo, Zhuanghua Liu.

read the original abstract

This paper considers the nonconvex nonsmooth problem in which the objective function is Lipschitz continuous. We focus on the stochastic setting where the algorithm can access stochastic function value evaluations with heavy-tailed noise, which is prevalent in many popular machine learning applications. We propose a stochastic zeroth-order algorithm that refines the framework of online-to-nonconvex conversion by clipping the two-point gradient estimator. The theoretical analysis shows that our algorithm can find a $(\delta, \epsilon)$-Goldstein stationary point with zeroth-order oracle complexity of ${\mathcal O}(d^{\frac{p}{2(p-1)}}\delta^{-1}\epsilon^{-\frac{2p-1}{p-1}})$, where $d$ is the problem dimension and $p\in(1,2]$ is the order of bounded moments. Note that our dependence on dimension $d$ matches the best-known results of stochastic zeroth-order optimization for finding the sub-optimal solution of a stochastic convex nonsmooth problem. In addition, our dependence on accuracy parameters $\delta$ and $\epsilon$ is consistent with that of the best-known stochastic first-order algorithms for stochastic nonconvex nonsmooth problems. Finally, we conduct numerical experiments to demonstrate the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper clips the two-point estimator inside an online-to-nonconvex framework to get matching ZO rates for nonconvex nonsmooth problems under p-moment noise.

read the letter

The core move is to clip the two-point zeroth-order estimator so the online-to-nonconvex conversion still works when function evaluations have only p-moments for p in (1,2]. The claimed complexity for a (δ,ε)-Goldstein point is O(d^{p/2(p-1)} δ^{-1} ε^{-(2p-1)/(p-1)}), which the abstract says matches the best convex ZO dimension dependence and the best nonconvex FO accuracy dependence.

That alignment is the main thing the work delivers. The assumptions are stated plainly (Lipschitz objective, bounded p-moments on the stochastic values), and the abstract positions the result as consistent with independent prior bounds rather than self-referential. Numerical experiments are included to check practical behavior.

The rates do not beat the cited references in any regime, so the novelty sits in the extension to this exact combination of nonconvex, nonsmooth, zeroth-order, and heavy-tailed noise rather than in faster asymptotics. Without the full proofs it is hard to inspect the clipping analysis for hidden constants or edge cases, but the stress-test description shows no internal inconsistency in the argument structure.

The paper is aimed at researchers who need zeroth-order methods for ML problems where gradients are unavailable and noise is heavier than sub-Gaussian. It is a direct, incremental refinement of an existing framework, so it is worth sending to referees for a proper check of the derivations and experiments.

Referee Report

0 major / 3 minor

Summary. The paper proposes a stochastic zeroth-order algorithm for nonconvex nonsmooth Lipschitz optimization under heavy-tailed stochastic function evaluations with bounded p-moments (p ∈ (1,2]). It refines the online-to-nonconvex conversion framework by clipping the two-point gradient estimator and proves that the method finds a (δ, ε)-Goldstein stationary point with zeroth-order oracle complexity O(d^{p/2(p-1)} δ^{-1} ε^{-(2p-1)/p-1}). The rates are positioned as matching the best-known dimension dependence for stochastic convex nonsmooth ZO and the best-known accuracy dependence for stochastic nonconvex nonsmooth FO methods; numerical experiments are included to illustrate practical behavior.

Significance. If the stated complexity holds under the given assumptions, the work is significant for extending zeroth-order methods to heavy-tailed noise settings common in machine learning while preserving optimal dependence on dimension and accuracy parameters from related convex and first-order results. Credit is due for explicitly aligning the new bound with prior best-known rates rather than claiming improvement, and for including numerical validation.

minor comments (3)

The abstract states the complexity but does not indicate whether the proof of the clipping step (which is central to handling p-moments) appears in a dedicated theorem or is embedded in the online-to-nonconvex reduction; a pointer to the relevant theorem number would improve readability.
Notation for the (δ, ε)-Goldstein stationarity definition is used without an explicit reminder of its relation to standard Goldstein subdifferential; a one-sentence recall in §2 would aid readers unfamiliar with the nonsmooth nonconvex literature.
The numerical experiments section would benefit from reporting the specific p values tested and whether the observed iteration counts scale consistently with the derived exponent -(2p-1)/(p-1).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the paper, recognition of its significance in extending zeroth-order methods to heavy-tailed noise while aligning with best-known dimension and accuracy dependencies, and recommendation for minor revision. We appreciate the credit given for the explicit rate comparisons and numerical validation.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper derives a complexity bound for a clipped two-point zeroth-order algorithm in an online-to-nonconvex framework under explicit assumptions of Lipschitz continuity and bounded p-moments. The stated rate O(d^{p/2(p-1)} δ^{-1} ε^{-(2p-1)/p-1}) follows from standard analysis of the Goldstein stationarity measure and does not reduce by construction to any fitted parameter, self-citation chain, or renamed input. Comparisons to prior convex ZO and nonconvex FO rates are external benchmarks, not load-bearing justifications internal to the derivation. No self-definitional, fitted-prediction, or uniqueness-imported steps are present.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger extracted from abstract only; full paper may introduce additional parameters or assumptions in the proof.

axioms (2)

domain assumption The objective function is Lipschitz continuous.
Required for the nonconvex nonsmooth setting and stated in the abstract.
domain assumption Stochastic function evaluations have bounded p-moments for p ∈ (1,2].
Central to handling heavy-tailed noise via clipping and stated in the abstract.

pith-pipeline@v0.9.1-grok · 5742 in / 1395 out tokens · 53338 ms · 2026-06-30T15:01:43.652672+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Optimal algorithms for online convex optimization with multi-point bandit feedback

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. InProc. COLT, pages 28–40, 2010

2010
[2]

Linear attention is (maybe) all you need (to understand transformer optimization).arXiv preprint arXiv:2310.01082, 2023

Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, and Suvrit Sra. Linear attention is (maybe) all you need (to understand transformer optimization).arXiv preprint arXiv:2310.01082, 2023

work page arXiv 2023
[3]

Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points.Foundations of Computational Mathematics, 22(1):35–76, 2022

Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points.Foundations of Computational Mathematics, 22(1):35–76, 2022

2022
[4]

Libsvm: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2(3, article 27), 2007

Chih Chung Chang and Chih Jen Lin. Libsvm: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2(3, article 27), 2007

2007
[5]

Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization

Lesi Chen, Jing Xu, and Luo Luo. Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization. InInternational Conference on Machine Learning, pages 5219–5233. PMLR, 2023

2023
[6]

Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. InProceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017

2017
[7]

Structured evolution with compact architectures for scalable policy optimization

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. InInternational Conference on Machine Learning, pages 970–978. PMLR, 2018

2018
[8]

SIAM, 1990

Frank H Clarke.Optimization and nonsmooth analysis. SIAM, 1990

1990
[9]

Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion

Ashok Cutkosky, Harsh Mehta, and Francesco Orabona. Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion. InInternational Conference on Machine Learning, pages 6643–6670. PMLR, 2023

2023
[10]

Stochastic model-based minimization of weakly convex functions

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

2019
[11]

Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems.SIAM Journal on Optimization, 29(3):1908–1930, 2019

Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems.SIAM Journal on Optimization, 29(3):1908–1930, 2019

1908
[12]

A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions

Damek Davis, Dmitriy Drusvyatskiy, Yin Tat Lee, Swati Padmanabhan, and Guanghao Ye. A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions. In Proc. NeurIPS, 2022. 10

2022
[13]

Randomized smoothing for stochastic optimization.SIAM Journal on Optimization, 22(2):674–701, 2012

John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimization.SIAM Journal on Optimization, 22(2):674–701, 2012

2012
[14]

Variable selection via nonconcave penalized likelihood and its oracle properties.Journal of the American statistical Association, 96(456):1348–1360, 2001

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties.Journal of the American statistical Association, 96(456):1348–1360, 2001

2001
[15]

On proximal policy optimization’s heavy- tailed gradients

Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization’s heavy- tailed gradients. InInternational Conference on Machine Learning, pages 3610–3619. PMLR, 2021

2021
[16]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013

2013
[17]

Deep sparse rectifier neural networks

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323. JMLR Workshop and Conference Proceedings, 2011

2011
[18]

Goldstein

A.A. Goldstein. Optimization of Lipschitz continuous functions.Mathematical Programming, 13:14–22, 1977

1977
[19]

Discrete optimization via simulation using compass.Operations research, 54(1):115–129, 2006

L Jeff Hong and Barry L Nelson. Discrete optimization via simulation using compass.Operations research, 54(1):115–129, 2006

2006
[20]

From gradient clipping to normalization for heavy tailed sgd.arXiv preprint arXiv:2410.13849,

Florian H¨ ubler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd.arXiv preprint arXiv:2410.13849, 2024

work page arXiv 2024
[21]

Deterministic nonsmooth nonconvex optimization

Michael Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, and Manolis Zampetakis. Deterministic nonsmooth nonconvex optimization. InThe Thirty Sixth Annual Conference on Learning Theory, pages 4570–4597. PMLR, 2023

2023
[22]

Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact.Computational Management Science, 20(1):37, 2023

Nikita Kornilov, Alexander Gasnikov, Pavel Dvurechensky, and Darina Dvinskikh. Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact.Computational Management Science, 20(1):37, 2023

2023
[23]

Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance.Advances in Neural Information Processing Systems, 36:64083–64102, 2023

Nikita Kornilov, Ohad Shamir, Aleksandr Lobanov, Darina Dvinskikh, Alexander Gasnikov, Innokentiy Shibaev, Eduard Gorbunov, and Samuel Horv´ ath. Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance.Advances in Neural Information Processing Systems, 36:64083–64102, 2023

2023
[24]

Median clipping for zeroth-order non-smooth convex optimization and multi-armed bandit problem with heavy-tailed symmetric noise.arXiv preprint arXiv:2402.02461, 2024

Nikita Kornilov, Yuriy Dorn, Aleksandr Lobanov, Nikolay Kutuzov, Innokentiy Shibaev, Eduard Gorbunov, Alexander Gasnikov, and Alexander Nazin. Median clipping for zeroth-order non-smooth convex optimization and multi-armed bandit problem with heavy-tailed symmetric noise.arXiv preprint arXiv:2402.02461, 2024

work page arXiv 2024
[25]

An algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization.Journal of Machine Learning Research, 25(122):1–14, 2024

Guy Kornowski and Ohad Shamir. An algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization.Journal of Machine Learning Research, 25(122):1–14, 2024

2024
[26]

Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization.Advances in Neural Information Processing Systems, 35:26160–26175, 2022

Tianyi Lin, Zeyu Zheng, and Michael Jordan. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization.Advances in Neural Information Processing Systems, 35:26160–26175, 2022

2022
[27]

High-probability bound for non-smooth non-convex stochastic optimization with heavy tails

Langqi Liu, Yibo Wang, and Lijun Zhang. High-probability bound for non-smooth non-convex stochastic optimization with heavy tails. InForty-first International Conference on Machine Learning, 2024

2024
[28]

Zeroth-order methods for constrained nonconvex nonsmooth stochastic optimization

Zhuanghua Liu, Cheng Chen, Luo Luo, and Bryan Kian Hsiang Low. Zeroth-order methods for constrained nonconvex nonsmooth stochastic optimization. InForty-first International Conference on Machine Learning, 2024. 11

2024
[29]

Online convex optimization with heavy tails: Old algorithms, new regrets, and applications

Zijian Liu. Online convex optimization with heavy tails: Old algorithms, new regrets, and applications. arXiv preprint arXiv:2508.07473, 2025

work page arXiv 2025
[30]

Nonconvex stochastic optimization under heavy-tailed noises: Opti- mal convergence without gradient clipping

Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Opti- mal convergence without gradient clipping. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[31]

Simple random search provides a competitive approach to reinforcement learning

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning.arXiv preprint arXiv:1803.07055, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Sparsenet: Coordinate descent with nonconvex penalties.Journal of the American Statistical Association, 106(495):1125–1138, 2011

Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties.Journal of the American Statistical Association, 106(495):1125–1138, 2011

2011
[33]

Rectified linear units improve restricted boltzmann machines

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010

2010
[34]

Optimization via simulation over discrete decision variables

Barry L Nelson. Optimization via simulation over discrete decision variables. InRisk and optimization in an uncertain world, pages 193–207. Informs, 2010

2010
[35]

Random gradient-free minimization of convex functions.Founda- tions of Computational Mathematics, 17(2):527–566, 2017

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions.Founda- tions of Computational Mathematics, 17(2):527–566, 2017

2017
[36]

Improved convergence in high probability of clipped gradient methods with heavy-tailed noise

Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Le Nguyen. High probability convergence of clipped-sgd under heavy-tailed noise.arXiv preprint arXiv:2302.05437, 2023

work page arXiv 2023
[37]

Online Learning: A Modern Introduction Using Convex Optimization

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[38]

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horv´ ath, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richt´ arik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. InInternational Conference on Machine Learning, pages 29563–29648. PMLR, 2023

2023
[39]

A tail-index analysis of stochastic gradient noise in deep neural networks

Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pages 5827–5837. PMLR, 2019

2019
[40]

Do differentiable simulators give better policy gradients? InInternational Conference on Machine Learning, pages 20668–20696

Hyung Ju Suh, Max Simchowitz, Kaiqing Zhang, and Russ Tedrake. Do differentiable simulators give better policy gradients? InInternational Conference on Machine Learning, pages 20668–20696. PMLR, 2022

2022
[41]

Gradient normalization provably benefits nonconvex sgd under heavy-tailed noise.arXiv preprint arXiv:2410.16561, 2024

Tao Sun, Xinwang Liu, and Kun Yuan. Gradient normalization provably benefits nonconvex sgd under heavy-tailed noise.arXiv preprint arXiv:2410.16561, 2024

work page arXiv 2024
[42]

No dimension-free deterministic algorithm computes approximate stationarities of lipschitzians.arXiv preprint arXiv:2210.06907, 2022

Lai Tian and Anthony Man-Cho So. No dimension-free deterministic algorithm computes approximate stationarities of lipschitzians.arXiv preprint arXiv:2210.06907, 2022

work page arXiv 2022
[43]

On the finite-time complexity and practical computation of approximate stationarity concepts of Lipschitz functions

Lai Tian, Kaiwen Zhou, and Anthony Man-Cho So. On the finite-time complexity and practical computation of approximate stationarity concepts of Lipschitz functions. InProc. ICML, pages 21360– 21379, 2022

2022
[44]

On stochastic gradient and subgradient methods with adaptive steplength sequences.Automatica, 48(1):56–67, 2012

Farzad Yousefian, Angelia Nedi´ c, and Uday V Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences.Automatica, 48(1):56–67, 2012

2012
[45]

Nearly unbiased variable selection under minimax concave penalty.The Annals of Statistics, 2010

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty.The Annals of Statistics, 2010

2010
[46]

Why are adaptive methods good for attention models?Advances in Neural Information Processing Systems, 33:15383–15393, 2020

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models?Advances in Neural Information Processing Systems, 33:15383–15393, 2020. 12

2020
[47]

Complexity of finding stationary points of nonconvex nonsmooth functions

Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, and Ali Jadbabaie. Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR, 2020

2020
[48]

Analysis of multi-stage convex relaxation for sparse regularization.Journal of Machine Learning Research, 11(3), 2010

Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization.Journal of Machine Learning Research, 11(3), 2010. 13 The appendix is organized as follows. Section A introduces several key lemmas that are essential for the convergence analysis of the ZOCOON method. Section B presents the proof of the in-expectation and high-probability con...

2010

[1] [1]

Optimal algorithms for online convex optimization with multi-point bandit feedback

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. InProc. COLT, pages 28–40, 2010

2010

[2] [2]

Linear attention is (maybe) all you need (to understand transformer optimization).arXiv preprint arXiv:2310.01082, 2023

Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, and Suvrit Sra. Linear attention is (maybe) all you need (to understand transformer optimization).arXiv preprint arXiv:2310.01082, 2023

work page arXiv 2023

[3] [3]

Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points.Foundations of Computational Mathematics, 22(1):35–76, 2022

Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points.Foundations of Computational Mathematics, 22(1):35–76, 2022

2022

[4] [4]

Libsvm: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2(3, article 27), 2007

Chih Chung Chang and Chih Jen Lin. Libsvm: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2(3, article 27), 2007

2007

[5] [5]

Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization

Lesi Chen, Jing Xu, and Luo Luo. Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization. InInternational Conference on Machine Learning, pages 5219–5233. PMLR, 2023

2023

[6] [6]

Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. InProceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017

2017

[7] [7]

Structured evolution with compact architectures for scalable policy optimization

Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. InInternational Conference on Machine Learning, pages 970–978. PMLR, 2018

2018

[8] [8]

SIAM, 1990

Frank H Clarke.Optimization and nonsmooth analysis. SIAM, 1990

1990

[9] [9]

Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion

Ashok Cutkosky, Harsh Mehta, and Francesco Orabona. Optimal stochastic non-smooth non-convex optimization through online-to-non-convex conversion. InInternational Conference on Machine Learning, pages 6643–6670. PMLR, 2023

2023

[10] [10]

Stochastic model-based minimization of weakly convex functions

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019

2019

[11] [11]

Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems.SIAM Journal on Optimization, 29(3):1908–1930, 2019

Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems.SIAM Journal on Optimization, 29(3):1908–1930, 2019

1908

[12] [12]

A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions

Damek Davis, Dmitriy Drusvyatskiy, Yin Tat Lee, Swati Padmanabhan, and Guanghao Ye. A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions. In Proc. NeurIPS, 2022. 10

2022

[13] [13]

Randomized smoothing for stochastic optimization.SIAM Journal on Optimization, 22(2):674–701, 2012

John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic optimization.SIAM Journal on Optimization, 22(2):674–701, 2012

2012

[14] [14]

Variable selection via nonconcave penalized likelihood and its oracle properties.Journal of the American statistical Association, 96(456):1348–1360, 2001

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties.Journal of the American statistical Association, 96(456):1348–1360, 2001

2001

[15] [15]

On proximal policy optimization’s heavy- tailed gradients

Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization’s heavy- tailed gradients. InInternational Conference on Machine Learning, pages 3610–3619. PMLR, 2021

2021

[16] [16]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013

Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM journal on optimization, 23(4):2341–2368, 2013

2013

[17] [17]

Deep sparse rectifier neural networks

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323. JMLR Workshop and Conference Proceedings, 2011

2011

[18] [18]

Goldstein

A.A. Goldstein. Optimization of Lipschitz continuous functions.Mathematical Programming, 13:14–22, 1977

1977

[19] [19]

Discrete optimization via simulation using compass.Operations research, 54(1):115–129, 2006

L Jeff Hong and Barry L Nelson. Discrete optimization via simulation using compass.Operations research, 54(1):115–129, 2006

2006

[20] [20]

From gradient clipping to normalization for heavy tailed sgd.arXiv preprint arXiv:2410.13849,

Florian H¨ ubler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd.arXiv preprint arXiv:2410.13849, 2024

work page arXiv 2024

[21] [21]

Deterministic nonsmooth nonconvex optimization

Michael Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, and Manolis Zampetakis. Deterministic nonsmooth nonconvex optimization. InThe Thirty Sixth Annual Conference on Learning Theory, pages 4570–4597. PMLR, 2023

2023

[22] [22]

Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact.Computational Management Science, 20(1):37, 2023

Nikita Kornilov, Alexander Gasnikov, Pavel Dvurechensky, and Darina Dvinskikh. Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact.Computational Management Science, 20(1):37, 2023

2023

[23] [23]

Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance.Advances in Neural Information Processing Systems, 36:64083–64102, 2023

Nikita Kornilov, Ohad Shamir, Aleksandr Lobanov, Darina Dvinskikh, Alexander Gasnikov, Innokentiy Shibaev, Eduard Gorbunov, and Samuel Horv´ ath. Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance.Advances in Neural Information Processing Systems, 36:64083–64102, 2023

2023

[24] [24]

Median clipping for zeroth-order non-smooth convex optimization and multi-armed bandit problem with heavy-tailed symmetric noise.arXiv preprint arXiv:2402.02461, 2024

Nikita Kornilov, Yuriy Dorn, Aleksandr Lobanov, Nikolay Kutuzov, Innokentiy Shibaev, Eduard Gorbunov, Alexander Gasnikov, and Alexander Nazin. Median clipping for zeroth-order non-smooth convex optimization and multi-armed bandit problem with heavy-tailed symmetric noise.arXiv preprint arXiv:2402.02461, 2024

work page arXiv 2024

[25] [25]

An algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization.Journal of Machine Learning Research, 25(122):1–14, 2024

Guy Kornowski and Ohad Shamir. An algorithm with optimal dimension-dependence for zero-order nonsmooth nonconvex stochastic optimization.Journal of Machine Learning Research, 25(122):1–14, 2024

2024

[26] [26]

Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization.Advances in Neural Information Processing Systems, 35:26160–26175, 2022

Tianyi Lin, Zeyu Zheng, and Michael Jordan. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization.Advances in Neural Information Processing Systems, 35:26160–26175, 2022

2022

[27] [27]

High-probability bound for non-smooth non-convex stochastic optimization with heavy tails

Langqi Liu, Yibo Wang, and Lijun Zhang. High-probability bound for non-smooth non-convex stochastic optimization with heavy tails. InForty-first International Conference on Machine Learning, 2024

2024

[28] [28]

Zeroth-order methods for constrained nonconvex nonsmooth stochastic optimization

Zhuanghua Liu, Cheng Chen, Luo Luo, and Bryan Kian Hsiang Low. Zeroth-order methods for constrained nonconvex nonsmooth stochastic optimization. InForty-first International Conference on Machine Learning, 2024. 11

2024

[29] [29]

Online convex optimization with heavy tails: Old algorithms, new regrets, and applications

Zijian Liu. Online convex optimization with heavy tails: Old algorithms, new regrets, and applications. arXiv preprint arXiv:2508.07473, 2025

work page arXiv 2025

[30] [30]

Nonconvex stochastic optimization under heavy-tailed noises: Opti- mal convergence without gradient clipping

Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Opti- mal convergence without gradient clipping. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[31] [31]

Simple random search provides a competitive approach to reinforcement learning

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning.arXiv preprint arXiv:1803.07055, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

Sparsenet: Coordinate descent with nonconvex penalties.Journal of the American Statistical Association, 106(495):1125–1138, 2011

Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with nonconvex penalties.Journal of the American Statistical Association, 106(495):1125–1138, 2011

2011

[33] [33]

Rectified linear units improve restricted boltzmann machines

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010

2010

[34] [34]

Optimization via simulation over discrete decision variables

Barry L Nelson. Optimization via simulation over discrete decision variables. InRisk and optimization in an uncertain world, pages 193–207. Informs, 2010

2010

[35] [35]

Random gradient-free minimization of convex functions.Founda- tions of Computational Mathematics, 17(2):527–566, 2017

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions.Founda- tions of Computational Mathematics, 17(2):527–566, 2017

2017

[36] [36]

Improved convergence in high probability of clipped gradient methods with heavy-tailed noise

Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Le Nguyen. High probability convergence of clipped-sgd under heavy-tailed noise.arXiv preprint arXiv:2302.05437, 2023

work page arXiv 2023

[37] [37]

Online Learning: A Modern Introduction Using Convex Optimization

Francesco Orabona. A modern introduction to online learning.arXiv preprint arXiv:1912.13213, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[38] [38]

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horv´ ath, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richt´ arik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. InInternational Conference on Machine Learning, pages 29563–29648. PMLR, 2023

2023

[39] [39]

A tail-index analysis of stochastic gradient noise in deep neural networks

Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InInternational Conference on Machine Learning, pages 5827–5837. PMLR, 2019

2019

[40] [40]

Do differentiable simulators give better policy gradients? InInternational Conference on Machine Learning, pages 20668–20696

Hyung Ju Suh, Max Simchowitz, Kaiqing Zhang, and Russ Tedrake. Do differentiable simulators give better policy gradients? InInternational Conference on Machine Learning, pages 20668–20696. PMLR, 2022

2022

[41] [41]

Gradient normalization provably benefits nonconvex sgd under heavy-tailed noise.arXiv preprint arXiv:2410.16561, 2024

Tao Sun, Xinwang Liu, and Kun Yuan. Gradient normalization provably benefits nonconvex sgd under heavy-tailed noise.arXiv preprint arXiv:2410.16561, 2024

work page arXiv 2024

[42] [42]

No dimension-free deterministic algorithm computes approximate stationarities of lipschitzians.arXiv preprint arXiv:2210.06907, 2022

Lai Tian and Anthony Man-Cho So. No dimension-free deterministic algorithm computes approximate stationarities of lipschitzians.arXiv preprint arXiv:2210.06907, 2022

work page arXiv 2022

[43] [43]

On the finite-time complexity and practical computation of approximate stationarity concepts of Lipschitz functions

Lai Tian, Kaiwen Zhou, and Anthony Man-Cho So. On the finite-time complexity and practical computation of approximate stationarity concepts of Lipschitz functions. InProc. ICML, pages 21360– 21379, 2022

2022

[44] [44]

On stochastic gradient and subgradient methods with adaptive steplength sequences.Automatica, 48(1):56–67, 2012

Farzad Yousefian, Angelia Nedi´ c, and Uday V Shanbhag. On stochastic gradient and subgradient methods with adaptive steplength sequences.Automatica, 48(1):56–67, 2012

2012

[45] [45]

Nearly unbiased variable selection under minimax concave penalty.The Annals of Statistics, 2010

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty.The Annals of Statistics, 2010

2010

[46] [46]

Why are adaptive methods good for attention models?Advances in Neural Information Processing Systems, 33:15383–15393, 2020

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models?Advances in Neural Information Processing Systems, 33:15383–15393, 2020. 12

2020

[47] [47]

Complexity of finding stationary points of nonconvex nonsmooth functions

Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, and Ali Jadbabaie. Complexity of finding stationary points of nonconvex nonsmooth functions. InInternational Conference on Machine Learning, pages 11173–11182. PMLR, 2020

2020

[48] [48]

Analysis of multi-stage convex relaxation for sparse regularization.Journal of Machine Learning Research, 11(3), 2010

Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization.Journal of Machine Learning Research, 11(3), 2010. 13 The appendix is organized as follows. Section A introduces several key lemmas that are essential for the convergence analysis of the ZOCOON method. Section B presents the proof of the in-expectation and high-probability con...

2010